Trading with Sentiment Machine Learning

This post introduces an algorithm that combines sentiment analysis and machine learning. The algorithm trades 5 companies (Apple, Boeing, Intel, Merck and Google). The sentiment data comes from the FinSentS Portal provided by InfoTrie Financial Solutions; I downloaded the data from Quandl.

Theory Base
This post is primarily based on the assumption that sentiment data can be used to predict stock prices. Support for this assumption comes from the paper: Can Online Emotions Predict the Stock Market in China? (Zhou Z, Zhao J and Xu K, 2016).

Here is the link:

Paper Abstract:
In this paper, the authors explored the relationship between sentiment on Weibo (the Chinese Twitter) and stock price fluctuations. First, the authors removed the data from non-trading days, because sentiment volume drops significantly on those days. Second, 5 sentiment labels were employed in the analysis: anger, sadness, joy, disgust and fear.

The mechanism behind these five sentiment labels is described in the paper: MoodLens: An Emoticon-Based Sentiment Analysis System for Chinese Tweets (Zhao J et al., 2012).

The correlation analysis showed that: anger has no predictive ability for stock price; volume neither; fear distinctly affects the open price; sadness and joy affect the highest and lowest prices respectively; disgust mainly affects the close price. Finally, the authors tested the prediction performance of 3 classifiers using sentiment data. The results showed that SVM performed better than the linear classifier.

FinSentS sentiment data Introduction

The data has 5 attribute values for each stock, each day:

  • Sentiment Score: a numeric measure of the bullishness / bearishness of news coverage of the stock.
  • Sentiment High / Low: highest and lowest intra-day sentiment scores.
  • News Volume: the absolute number of news articles covering the stock.
  • News Buzz: a numeric measure of the change in coverage volume for the stock.

Strategy Explanation

  • Data preparation:
    I used four types of sentiment data in this analysis: sentiment score, sentiment range, news volume and news buzz. I combined sentiment high and sentiment low into one attribute, sentiment range (high minus low). This reduces noise, since the sentiment score is already included; the range replaces the high and low values and should still carry the necessary information.

    Quantopian did not support fetching the history of user-uploaded data. Hence, I used a custom method to access past data. This method was mainly inspired by this post:

    In short, I passed a post_func to the fetch_csv function. When the csv file is read, post_func stores each row's past data in a new column of that row. When I need the past data, I simply extract that column.

  • Cash allocation:
    On each trading day, 5% of cash is kept in reserve; this figure comes from empirical experience. In addition, each stock is allocated at most one fifth of the remaining portfolio cash. This is to avoid high leverage.

  • Training data set and testing data set:
    For a given day, the features are the sentiment data of the 3 preceding days, and the label indicates whether the stock rose or fell on that day. For each stock on each trading day, 96 such feature/label pairs built from past data were used to train the models.

  • Model fitting and prediction:
    In this algorithm, 4 common classifiers were employed: NuSVC, LinearSVC, Random Forest and Logistic Regression. To further eliminate noise and unnecessary trades, I introduced a voting system: a trading action is only taken when at least 3 of the 4 classifiers agree to long or short a stock.

  • Adding moving average:
    To also take advantage of technical analysis, a simple moving average strategy was added. The basis of this strategy is that when the short-term moving average is above the long-term moving average, the asset price is on an upward path. Hence, we want to follow the trend and long the asset. Conversely, we want to short the asset when the short-term average is below the long-term one.
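
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the algorithm's actual code: the helper names (day_features, build_dataset, majority_vote, sma_signal) are hypothetical, and the inputs are assumed to be aligned per trading day, oldest first.

```python
from collections import Counter


def day_features(score, high, low, volume, buzz):
    """Collapse one day's five raw values into the four attributes used here."""
    return [score, high - low, volume, buzz]


def build_dataset(daily_features, prices, window=3):
    """Pair each day's rise/fall label with the previous `window` days of features.

    Assumes daily_features[i] and prices[i] describe the same trading day,
    with index 0 being the oldest day.
    """
    X, y = [], []
    for day in range(window, len(prices)):
        row = []
        for past in daily_features[day - window:day]:
            row.extend(past)  # 3 days x 4 attributes = 12 features
        X.append(row)
        y.append(1 if prices[day] > prices[day - 1] else -1)
    return X, y


def majority_vote(predictions, needed=3):
    """Return the winning label if at least `needed` classifiers agree, else 0."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= needed else 0


def sma_signal(prices, short_window=20, long_window=50):
    """Return 1 if the short SMA is above the long SMA, -1 if below, 0 if equal."""
    short_sma = sum(prices[-short_window:]) / float(len(prices[-short_window:]))
    long_sma = sum(prices[-long_window:]) / float(len(prices[-long_window:]))
    if short_sma > long_sma:
        return 1
    if short_sma < long_sma:
        return -1
    return 0
```

With 99 completed days and a 3-day window, build_dataset yields 96 feature/label pairs, which matches the count mentioned above. A vote of 0 means the classifiers did not reach the 3-out-of-4 agreement.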

Results Evaluation
Overall, the algorithm performed quite well in the backtest. Its beta was not very high and it did not show high volatility. One drawback of the backtest is that the S&P 500 was rising over the whole period, so it says little about how this algorithm reacts to bad markets.

Future Improvement

  • Larger universe and lower position concentration.
    Due to Quantopian's limit on the size of csv files that can be fetched, I could only upload sentiment data for 5 stocks. If there is a way to connect to the FinSentS API, or another way to feed the sentiment data into the algorithm, a larger universe should give better and more stable performance.

  • Equal long/short exposure.
    It is also possible to calculate the probability of a rise or fall for each stock. After ranking by this probability, I can either trade only the top few stocks or set a threshold that a stock must pass before it is traded. This can be used to balance the long and short exposure. Because this algorithm only has 5 stocks, I did not implement this strategy, but it is worth trying for a larger universe.

  • Using the non-trading day sentiment data
    In this algorithm, non-trading day sentiment data was not used because of its low volume. Still, it may carry information about the next trading day, so it is worth trying.

  • Deep learning for time series data classification
    To simplify the model, I used the past 3 days' sentiment attributes for the prediction and fitted the models on them as a flat set of attributes. However, a more advanced algorithm is encouraged, since the attribute data is itself a time series. Possibly, deep learning with a latency factor can be employed. I will study this topic further.
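
The equal long/short idea above could look roughly like the following sketch. It is only an illustration: the function balanced_book and its parameters are hypothetical, and the rise probabilities are assumed to come from somewhere else, for example a classifier that can output class probabilities.

```python
def balanced_book(prob_up, k=2, gross=0.95):
    """Go long the k most bullish names and short the k most bearish ones.

    prob_up: hypothetical mapping of ticker -> estimated probability of a rise.
    gross:   total fraction of the portfolio to deploy (the rest stays in cash).
    Returns a mapping of ticker -> target weight (positive long, negative short).
    """
    ranked = sorted(prob_up, key=prob_up.get, reverse=True)
    # Never let the long and short lists overlap.
    k = min(k, len(ranked) // 2)
    if k == 0:
        return {}
    per_name = gross / (2.0 * k)
    weights = {}
    for ticker in ranked[:k]:
        weights[ticker] = per_name
    for ticker in ranked[-k:]:
        weights[ticker] = -per_name
    return weights
```

Because the same number of names is held on each side with equal weights, the long and short exposures cancel out, and names in the middle of the ranking are simply not traded.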

Zhao J, Dong L, Wu J, Xu K. MoodLens: An Emoticon-Based Sentiment Analysis System for Chinese Tweets. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012. pp. 1528-1531.

Zhou Z, Zhao J, Xu K. Can Online Emotions Predict the Stock Market in China? International Conference on Web Information Systems Engineering. Springer International Publishing, 2016. pp. 328-342.

from datetime import timedelta
from pytz import timezone
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from collections import Counter
import talib
import statsmodels.api as sm
import numpy 
import pandas as pd

def preview(df):
    return df

def custom_split(string_list):
    """
        Parses a string and returns it in list format, without the '[' ']'

        :params string_list: a list that's been made into a string e.g. "[1.0, 2.0]"
        :returns: the list of floats recovered from the string e.g. "[1.0, 2.0]" => [1.0, 2.0]
    """
    # Remove the '[' and ']'
    string_list = string_list[1:-1].split(',')
    # Convert to float
    string_list = [float(s) for s in string_list]
    return string_list

def get_day_delta(current_date):
    """
        Takes in the current date, checks its day of week, and returns an appropriate date_delta
        E.g. if it's a Monday, the previous date should be Friday, not Sunday

        :params current_date: Pandas TimeStamp
        :returns: an int
    """
    if current_date.isoweekday() == 1:
        return 3
    else:
        return 1

def fill_func(df, row, num_dates):
    """
        Should be applied to every row of a dataframe. Reaches back over the past num_dates days,
        appends the sentiment data to a list, and returns the list as a string which should be
        unpacked later on.

        :params df: The dataframe in its totality
        :params row: The row of each dataframe, passed through the lambda function of Dataframe.apply(lambda row: row)
        :params num_dates: How many dates to go back (e.g. 30 = 30 days of past data)

        :returns: A list in the form of a string (containing past data) which should be unpacked later on
    """
    # Instantiate variables
    past_data = []
    # The current date is the name of the Series (row) being passed in
    current_date = row.name
    # Iterate through the number of dates from 0->num_dates
    for i in range(num_dates):
        # How many days to get back, calls get_day_delta for accurate delta assessment
        day_delta = get_day_delta(current_date)
        # Move current_date back by day_delta to get the appropriate past date
        current_date = current_date - timedelta(days=day_delta)
        try:
            #: Get the four sentiment attributes at the date found by get_day_delta
            data = df.loc[current_date]['sentiment']
            past_data.append(data)
            data = df.loc[current_date]['sentiment high'] - df.loc[current_date]['sentiment low']
            past_data.append(data)
            data = df.loc[current_date]['news volume']
            past_data.append(data)
            data = df.loc[current_date]['news buzz']
            past_data.append(data)
        except KeyError:
            #: No data for this date, pass
            pass
    # Return the list made into a string (most recent day first)
    return str(past_data)

def post_func(df): 
    df = pd.DataFrame(df)
    df['past_data'] = df.apply(lambda row: fill_func(df, row, 99), axis=1)
    return df

def initialize(context):
    ## Initialize list of securities we want to trade
    context.security_list = symbols('AAPL', 'BA', 'MRK', 'INTC', 'GOOG')
    ## Trailing stop loss
    context.stop_loss_pct = .995
    # We will weight each asset equally and leave a 5% cash
    # reserve. - actually this is sort of good idea
    context.weight = 0.95 / len(context.security_list)
    context.investment_size = (context.portfolio.cash * context.weight)
    fetch_csv("", date_column = 'date', date_format = '%y-%m-%d', pre_func = preview, post_func = post_func)
    fetch_csv("", date_column = 'date', date_format = '%y-%m-%d', pre_func = preview, post_func = post_func)
    fetch_csv("", date_column = 'date', date_format = '%y-%m-%d', pre_func = preview, post_func = post_func)
    fetch_csv("", date_column = 'date', date_format = '%y-%m-%d', pre_func = preview, post_func = post_func)
    fetch_csv("", date_column = 'date', date_format = '%y-%m-%d', pre_func = preview, post_func = post_func)

    context.historical_bars = 100
    context.feature_window = 3
    schedule_function(myfunc, date_rules.every_day(), 
        time_rules.market_open(hours=0, minutes=1))

def myfunc(context, data):
    price_history = data.history(context.security_list, fields="price", bar_count=100, frequency="1d")
    try:
        # Loop over each stock traded every day:
        for s in context.security_list:
            start_bar = context.feature_window
            price_list = price_history[s].tolist()
            # past_data is the stringified list built by fill_func
            past = data.current(s, 'past_data')
            pastlist = custom_split(past)
            print(len(pastlist))
            print(len(price_list))
            X = []
            y = []
            bar = start_bar
            # Loop for each machine learning data set
            while bar < len(price_list) - 1:
                try:
                    end_price = price_list[bar]
                    start_price = price_list[bar - 1]
                    # Features are the attribute values used for machine learning:
                    # 3 days x 4 sentiment attributes.
                    # Note: fill_func appends the most recent day first, whereas price_list
                    # runs oldest first, so this slice assumes the two line up; verify the
                    # ordering before relying on it.
                    features = pastlist[(bar - 3) * 4: bar * 4]
                    # Label is the indicator of whether this stock rose or fell
                    if end_price > start_price:
                        label = 1
                    else:
                        label = -1
                    bar += 1
                    X.append(features)
                    y.append(label)
                except Exception as e:
                    bar += 1
                    print(('feature creation', str(e)))
            print(('len(X1)', len(X)))
            # Instantiate the machine learning models
            clf1 = RandomForestClassifier(n_estimators=100)
            clf2 = LinearSVC()
            clf3 = NuSVC()
            clf4 = LogisticRegression()
            # Prepare the attribute information for prediction.
            # Assumption: the prediction row continues the indexing used in the loop above.
            X.append(pastlist[(bar - 3) * 4: bar * 4])
            print(('len(X2)', len(X)))
            # Rescale all the data
            X = preprocessing.scale(X)
            current_features = X[-1:]
            X = X[:-1]
            print(('len(X)', len(X)))
            print(('len(y)', len(y)))
            # Build the models
            clf1.fit(X, y)
            clf2.fit(X, y)
            clf3.fit(X, y)
            clf4.fit(X, y)
            # Predict the results
            p1 = clf1.predict(current_features)[0]
            p2 = clf2.predict(current_features)[0]
            p3 = clf3.predict(current_features)[0]
            p4 = clf4.predict(current_features)[0]
            # If at least 3 out of 4 predictions vote for the same result, that result is used.
            winner, votes = Counter([p1, p2, p3, p4]).most_common(1)[0]
            if votes >= 3:
                p = winner
            else:
                p = 0
            current_price = data.current(s, 'price')
            current_position = context.portfolio.positions[s].amount
            cash = context.portfolio.cash
            # Add one more feature: moving average
            sma_50 = numpy.mean(price_list[-50:])
            sma_20 = numpy.mean(price_list[-20:])
            print(('sma_20', sma_20))
            print(('sma_50', sma_50))
            open_orders = get_open_orders()
            # Every day's trading activities:
            if (p == 1) or (sma_20 > sma_50):
                if s not in open_orders:
                    order_target_percent(s, context.weight, style=StopOrder(context.stop_loss_pct*current_price))
            elif (p == -1) or (sma_50 > sma_20):
                if s not in open_orders:
                    # Assumption: a short position mirroring the long branch
                    order_target_percent(s, -context.weight)
    except Exception as e:
        print(str(e))

def handle_data(context, data):
    #Plot variables at the end of each day.
    long_count = 0
    short_count = 0

    for position in context.portfolio.positions.values():
        if position.amount > 0:
            long_count += 1
        if position.amount < 0:
            short_count += 1
    record(num_long=long_count, num_short=short_count, leverage=context.account.leverage)