Algo with Support Vector Machine in Pipeline

I wrote a base algo that incorporates machine learning in the pipeline, i.e. as the pipeline runs, it trains an ML model PER STOCK and comes up with a prediction of the stock's movement. The algo can then use the output of the pipeline to long the predicted-up stocks and short the predicted-down stocks.

As it stands, this algo does not perform well, but it can serve as a basis for someone else.

/Luc Prieur

from quantopian.pipeline import Pipeline, CustomFilter, CustomFactor
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline.factors import Latest
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data.psychsignal import aggregated_twitter_withretweets_stocktwits as st
from sklearn.preprocessing import StandardScaler
from quantopian.pipeline.factors import SimpleMovingAverage
from quantopian.pipeline.filters import Q500US
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import pandas as pd


def compute_slope(a):
    # Least-squares slope of a series against its integer index.
    x = np.arange(0, len(a))
    y = np.array(a)
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y)[0]
    return m


class SidInList(CustomFilter):
    """
    Filter returns True for any SID included in parameter tuple passed at creation.
    Usage: my_filter = SidInList(sid_list=(23911, 46631))
    """
    inputs = []
    window_length = 1
    params = ('sid_list',)

    def compute(self, today, assets, out, sid_list):
        out[:] = np.in1d(assets, sid_list)


class Prediction(CustomFactor):
    def compute(self, today, asset_ids, out, bull_msgs, bear_msgs, close_prices, open_prices):
        predictions = []
        for i in range(close_prices.shape[1]):
            bull_msg = bull_msgs[:, i]
            bear_msg = bear_msgs[:, i]
            # +1 for an up day (close above open), -1 for a down day.
            result = (close_prices[:, i] > open_prices[:, i]) * 2 - 1
            df = pd.DataFrame(data={'bull': bull_msg.flatten(),
                                    'bear': bear_msg.flatten(),
                                    'result': result.flatten()})

            # Before shifting, we must record the last values as they will be used
            # to run the model on for the prediction.
            df['bull_ma'] = df['bull'].rolling(window=10).mean()
            df['bear_ma'] = df['bear'].rolling(window=10).mean()
            df['bull_rs'] = df['bull'].rolling(window=5).apply(compute_slope)
            df['bear_rs'] = df['bear'].rolling(window=5).apply(compute_slope)

            scaler = StandardScaler()
            try:
                features = ['bull_rs', 'bear_rs']
                X_live = df[features][-1:]
                df[features] = df[features].shift(1)
                df.dropna(inplace=True)
                X_train = scaler.fit_transform(df[features])
                y_train = df['result']

                model = GaussianNB()
                prediction = model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]

                predictions.append(prediction)
            except Exception:
                # Default to a long signal when training fails (e.g. too few rows).
                predictions.append(1)

        out[:] = predictions


def custom_pipeline(context):
    sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
    sma_50 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=50)

    # For testing only. Note the trailing comma: sid_list must be a tuple.
    small_universe = SidInList(sid_list=(24,))

    # Changed to be easier to read.
    my_screen = (Q500US() &
                 (sma_10 > sma_50) &
                 (st.bull_scored_messages.latest > 10))

    prediction = Prediction(inputs=[st.bull_scored_messages, st.bear_scored_messages,
                                    USEquityPricing.close, USEquityPricing.open],
                            window_length=30)  # window_length was lost in transcription; 30 is an assumed value

    return Pipeline(
        columns={
            'sma10': sma_10,
            'close': USEquityPricing.close.latest,
            'prediction': prediction
        },
        screen=my_screen)


def initialize(context):
    # Added to monitor leverage minutely.
    context.minLeverage = [0]
    context.maxLeverage = [0]

    attach_pipeline(custom_pipeline(context), 'custom_pipeline')

    schedule_function(evaluate, date_rules.every_day(), time_rules.market_open(minutes=1))
    schedule_function(sell, date_rules.every_day(), time_rules.market_open())
    schedule_function(buy, date_rules.every_day(), time_rules.market_open(minutes=5))

    context.longs = []
    context.shorts = []


def before_trading_start(context, data):
    # pipeline_output must be called here, not in initialize().
    context.results = pipeline_output('custom_pipeline')


def evaluate(context, data):
    context.longs = []

    for sec in context.results.index:
        if context.results.loc[sec, 'prediction'] == 1:
            if sec not in context.portfolio.positions:
                context.longs.append(sec)


def sell(context, data):
    for sec in context.portfolio.positions:
        if sec not in context.longs:
            order_target_percent(sec, 0.0)


def buy(context, data):
    # This loop was scheduled as buy() above; in the flattened original it had
    # been merged into sell().
    for sec in context.longs:
        order_target_percent(sec, 1.0 / (len(context.longs) + len(context.portfolio.positions)))
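The `compute_slope` helper used by the factor is just an ordinary least-squares fit of a series against its index, and it can be sanity-checked outside the pipeline with plain NumPy (the `rcond=None` argument is added here for modern NumPy versions; the thread's Quantopian-era code omits it):

```python
import numpy as np

def compute_slope(a):
    # Least-squares slope of a series against its integer index.
    x = np.arange(0, len(a))
    y = np.array(a)
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m

print(compute_slope([1.0, 3.0, 5.0, 7.0]))  # 2.0: the series rises by 2 per step
print(compute_slope([5.0, 5.0, 5.0]))       # 0.0: a flat series has no trend
```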

Thanks!

Thank you for this

Hi Luc,
As I am still basically "non-pythonic" (especially with pandas) and need to get there first before I can do much, I had not planned on diving into either ML or alternative data from social media for quite a while yet, but this thread and your algo look very interesting.

Please excuse me if some of my questions seem naive or simply wrong, due to my limited knowledge of Python.

From what I can infer from the little that I understand of the code in your algo, I think you are taking two different sets of inputs from social media (aggregated_twitter_withretweets_stocktwits), one being bullish tweets and the other bearish tweets, then taking the average number of each type over a rolling window, calculating the slope of those two averages, and using those two items as input to your SVM. Am I more or less correct so far?

It also looks like you are taking SMA10 & SMA50 of price, but I can't quite figure out whether these are also inputs to the SVM or simply used in the filter. [I'm sorry if this sounds stupid on my part but, as I said, I'm new to Python.] Anyway, at least as I understand it, you are using the SLOPES of the average numbers of bullish & bearish tweets, and possibly also the VALUES of the two SMAs, as ML inputs. Please set me straight if I have this wrong so far.

Do you have any descriptive (i.e. other than python code) documentation of what you are doing that you could share? Perhaps then we could talk some more.

Despite my weakness in Python, I have been trading for more than 30 years, I have at least some experience with SVM, and quite a lot of experience with the problems of using ML in trading systems, at least in the context of old-fashioned neural networks. Although there are obvious differences, there are also similarities in the practical problems one faces with regard to good choices of indicators to use as input for any type of ML, and also with the issue of pre-processing.

I look forward to understanding more about what you are doing and sharing ideas.
Best regards, Tony

Tony,

The SMA10 and SMA50 are useless, as is filtering the stocks on SMA10 being larger than SMA50. I should have removed that bit of code; it is not within the scope of using ML in the pipeline. I had picked up that small bit of code from another algo posted in the community.

As far as your understanding of the ML bit, you understand it correctly. I am using the slope of tweets (bearish and bullish) as input to the ML, and the target of the ML is the next day's price movement for said equity.

I copied some of the ML code below.

Here I select data for the two features:
features = ['bull_rs', 'bear_rs']

This is the last row of feature data. It must be used to predict tomorrow's movement. Hence I call it live.
X_live = df[features][-1:]

This is the remainder of the data, which I shift one day forward so that the previous day's data is aligned with the current day's price movement, i.e. "result". result is +1 if there was a gain, -1 for a loss.
df[features] = df[features].shift(1)

I remove the NaN rows (should be only the first row of data).
df.dropna(inplace=True)

Scale data
X_train = scaler.fit_transform(df[features])
Set target.
y_train = df['result']

Specify the model. Funny, I mentioned SVM, but use Naive Bayes here; changing the model is just a drop-in replacement.
model = GaussianNB()

Here I train the model and run the prediction for tomorrow in one line.
prediction = model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]
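Put together, the steps Luc walks through above form a small, self-contained pattern that runs outside Quantopian. A minimal sketch on synthetic data (the features here are random noise, so the prediction itself is meaningless; only the shift/train/predict mechanics are being demonstrated):

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'bull_rs': rng.randn(60),
    'bear_rs': rng.randn(60),
    'result': rng.choice([-1, 1], size=60),  # +1 gain, -1 loss
})

features = ['bull_rs', 'bear_rs']

# Last row of features: today's data, used to predict tomorrow ("live").
X_live = df[features][-1:]

# Shift features forward so yesterday's features align with today's result.
df[features] = df[features].shift(1)
df.dropna(inplace=True)

scaler = StandardScaler()
X_train = scaler.fit_transform(df[features])
y_train = df['result']

model = GaussianNB()
prediction = model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]
print(prediction)  # either 1 or -1
```

The crucial detail is the order of operations: the last row is saved as `X_live` before the shift, so the model never trains on the row it is asked to predict.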

The whole algo is just presented as a template for others to build on. I did not try to make it perform or anything.

BR/

Luc

You can contact me directly on LinkedIn if you wish.

Hi Luc,

Many thanks. I will contact you directly over the next few days.

One of the things I found with ML in general (of whatever type) is that the quality and success of the output is usually very strongly dependent on exactly what you do as pre-processing before the inputs actually go into the ML / AI. The more you can "help" the ML get started in the right direction, the better, so that it can focus its efforts on the "important stuff" and doesn't have to waste its time trying to figure out (perhaps unsuccessfully) something we could have just told it beforehand.

Specifically, in this case, you have 2 "raw" inputs, namely the number of bullish tweets and the number of bearish tweets. Although at first glance these might seem like logical choices for input to ML, the problem with using these 2 items as they are is that both of them contain a mixture of 2 different types of info, namely 1) bearishness vs. bullishness and 2) changing levels of enthusiasm for tweeting. My suggestion is to do a little bit of pre-processing to separate these two different aspects BEFORE inputting the data to the ML, as follows:

a) Count Total tweets = Bullish tweets + Bearish tweets.
b) Count NET Bullish tweets = Bullish tweets - Bearish tweets (will be +ve if predominantly Bullish, -ve if predominantly Bearish)
c) Proportion NET Bullish Tweets = b) / a). This is now normalized relative to the total number of tweets and becomes more purely a measure of + or - sentiment itself.
d) Take the Average number of NET Bullish tweets, and then from this take the ratio of Current NET Bullish Tweets to Average NET Bullish tweets. This gives a Short-term measure of day-to-day variability in bullishness or bearishness.
e) Slope = trend of NET bullish tweets, gives a Longer-term measure of changes in bullishness or bearishness.

Items c) d) and e) are now all normalized with respect to the total number of tweets and would potentially be useful as inputs to ML. The confusing factor of how many people are currently tweeting (either way) has been removed. We now have 3 inputs rather than the original 2, and they now contain info in a slightly different form that should be easier for any ML to work with. Please could you try this and see if it helps?
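Tony's steps a)–e) translate into a few lines of pandas. A sketch on synthetic message counts (the column names are mine, and the 5-day windows are illustrative, not Tony's prescription):

```python
import numpy as np
import pandas as pd

def compute_slope(a):
    # Least-squares trend of a short window (item e).
    x = np.arange(len(a))
    return np.polyfit(x, np.asarray(a, dtype=float), 1)[0]

rng = np.random.RandomState(1)
df = pd.DataFrame({
    'bull': rng.randint(1, 50, size=30).astype(float),
    'bear': rng.randint(1, 50, size=30).astype(float),
})

# a) total tweet count; b) net bullishness
df['total'] = df['bull'] + df['bear']
df['net_bull'] = df['bull'] - df['bear']

# c) proportion net bullish: net / total, normalized for tweet volume
df['prop_net_bull'] = df['net_bull'] / df['total']

# d) current net bullishness relative to its recent average
df['net_bull_ratio'] = df['net_bull'] / df['net_bull'].rolling(window=5).mean()

# e) slope (trend) of net bullishness over a rolling window
df['net_bull_slope'] = df['net_bull'].rolling(window=5).apply(compute_slope, raw=True)
```

The last three columns are the volume-normalized candidates Tony proposes as ML inputs.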

The other thought I have is that irrespective of whether you are using Naive Bayes or SVM, and Gaussian or Linear or RBF models, these are all classifiers which, at least as I understand it, are designed to give a binary output. Now is this really what you want? If you only want to decide long or short, then OK, but in fact we can probably do much better than that. In the context of an equity long-short strategy with a large universe of possible equities to choose from, what we would actually like is a ranking of all the equities on a continuous (rather than a binary) scale. Then we go long on the N top-ranking (most bullish) equities, however many we want, and go short on the N bottom-ranking (most bearish) equities. So, to do that, what we need is something that gives a continuous-valued output rather than a binary one.

Cheers, best wishes, Tony
(also on LinkedIn, see Tony Morland)

Tony,

As for feature engineering, yes, your proposal sounds good. My post was not meant to propose the best features to use, but rather a template algo for others to build on.

As for using classifiers instead of regressors: of course one could use a regressor and use the output amplitude to run a long-short strategy. In my experience, however, ML has fewer problems trying to guess a direction than an amplitude (i.e. tomorrow's stock price). One could use the classifier's confidence level in a long-short strategy.
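One way to get the continuous score Tony asked for while keeping a classifier, along the lines of the "confidence level" Luc mentions, is the SVM's decision_function: the signed distance from the separating hyperplane, which can serve as a cross-sectional ranking. A sketch on synthetic data (the feature construction here is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(200, 4)
# Synthetic labels loosely tied to the first feature.
y_train = np.where(X_train[:, 0] + 0.5 * rng.randn(200) > 0, 1, -1)

model = SVC().fit(X_train, y_train)

# Score 10 "stocks": signed distance to the decision boundary.
X_today = rng.randn(10, 4)
scores = model.decision_function(X_today)

# Rank cross-sectionally: long the top names, short the bottom ones.
order = np.argsort(scores)
shorts, longs = order[:3], order[-3:]
```

Unlike `predict`, which collapses everything to ±1, these scores let the optimizer distinguish a marginal call from a confident one.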

I encourage anyone with an improved algo based on mine to publish it in this thread.

/Luc

I have played a little more with this base algo and used the PsychSignal twitter sentiment as a feature for the SVM classifier. I am also using past returns as features.

So, as it stands now, each day, for each stock in the pipeline, the algo trains an SVM classifier on past performance and spits out a prediction. That prediction is used as the weight for the optimizer. The whole thing works fine unless you turn on slippage and commission; then, not so good. That is normal, as it trades every day.

If anyone has any idea as to how to improve this, please post your modified algo.

Thanks.

import quantopian.algorithm as algo
import quantopian.optimize as opt
from quantopian.pipeline import Pipeline, CustomFilter, CustomFactor
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline.factors import Latest, DailyReturns, Returns
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data.psychsignal import aggregated_twitter_withretweets_stocktwits as st
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from quantopian.pipeline.factors import SimpleMovingAverage
from quantopian.pipeline.filters import Q500US, QTradableStocksUS
import numpy as np
from sklearn.svm import SVC, OneClassSVM, LinearSVC
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# from quantopian.pipeline.data.alpha_vertex import precog_top_500
# from sklearn.preprocessing import PolynomialFeatures
# from quantopian.pipeline.data.sentdex import sentiment_free as sentdex
# from sklearn.neural_network import MLPClassifier


def compute_slope(a):
    # Least-squares slope of a series against its integer index.
    x = np.arange(0, len(a))
    y = np.array(a)
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y)[0]
    return m


class Prediction(CustomFactor):
    def compute(self, today, asset_ids, out, bull_msgs, bear_msgs, total_msgs, returns):
        predictions = []
        for i in range(returns.shape[1]):
            try:
                bull_msg = bull_msgs[:, i]
                bear_msg = bear_msgs[:, i]
                # +1 for a positive daily return, -1 otherwise.
                result = (returns[:, i] > 0) * 2 - 1
                df = pd.DataFrame(data={'bull': bull_msg.flatten(),
                                        'bear': bear_msg.flatten(),
                                        'total': total_msgs[:, i],
                                        'result': result.flatten(),
                                        'returns': returns[:, i].flatten()})

                df.fillna(0, inplace=True)

                # Alternative features (bull/bear/total message slopes and
                # pct_change, precog) were tried and commented out; the current
                # features are the four lagged daily returns below.
                df['returns-1'] = df['returns'].shift(1)
                df['returns-2'] = df['returns'].shift(2)
                df['returns-3'] = df['returns'].shift(3)
                df['returns-4'] = df['returns'].shift(4)

                scaler = MinMaxScaler()
                features = ['returns-1', 'returns-2', 'returns-3', 'returns-4']

                # Record the last row before shifting; it is the live input
                # used for the prediction.
                X_live = df[features][-1:]
                df[features] = df[features].shift(1)
                df.dropna(inplace=True)
                X_train = scaler.fit_transform(df[features])
                y_train = df['result']

                prediction = (SVC()
                              .fit(X_train, y_train)
                              .predict(scaler.transform(X_live))[0])

                predictions.append(prediction)
            except ValueError:
                predictions.append(0)

        out[:] = predictions


def custom_pipeline(context):
    # my_screen = QTradableStocksUS() & SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=60).top(120)

    my_screen = SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=60).top(60)

    prediction = Prediction(inputs=[st.bull_scored_messages,
                                    st.bear_scored_messages,
                                    st.total_scanned_messages,
                                    DailyReturns()],
                            window_length=30)  # window_length was lost in transcription; 30 is an assumed value

    return Pipeline(
        columns={
            'close': USEquityPricing.close.latest,
            'prediction': prediction
        },
        screen=my_screen)


def initialize(context):
    # set_benchmark(symbol('AAPL'))

    attach_pipeline(custom_pipeline(context), 'custom_pipeline')

    # schedule_function(rebalance, date_rules.week_start(days_offset=1), time_rules.market_open(minutes=3))
    # schedule_function(rebalance, date_rules.week_start(days_offset=3), time_rules.market_open(minutes=3))

    schedule_function(rebalance, date_rules.every_day(), time_rules.market_open(minutes=3))
    # schedule_function(record_positions, date_rules.every_day(), time_rules.market_open(minutes=100))

    context.longs = []
    context.shorts = []


def before_trading_start(context, data):
    # pipeline_output must be called here, not in initialize().
    context.output = pipeline_output('custom_pipeline')


def rebalance(context, data):
    weights = context.output['prediction']

    print context.output.shape

    objective = opt.TargetWeights(weights)

    constraints = [
        opt.MaxGrossExposure(1.0),
        opt.DollarNeutral(),
        # opt.PositionConcentration.with_equal_bounds(min=-0.05, max=0.05),
    ]
    algo.order_optimal_portfolio(objective, constraints)


def record_positions(context, data):
    print len(context.portfolio.positions)

Tried to see if this is viable for a long/short algo with 250 positions on each side, so I applied your 'scanned twitter messages' filter to the QTradableStocksUS universe, limited to 500 stocks total:

    my_screen = QTradableStocksUS() & SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=60).top(500)


This seems to yield about 300 stocks per day (probably not more in the twitter data feed?), but it times out, so this is currently a dead end for real long/short algos.

(The attached algo is identical to the previous code except for the `my_screen` line quoted above.)

I've been toying around with your code and noticed that it held up well until approximately Oct 31 and Nov 1, 2018, when something funky went on. Maybe another set of eyes can help me navigate that downturn without introducing too much overfitting.

import quantopian.algorithm as algo
import quantopian.optimize as opt
from quantopian.pipeline import Pipeline, CustomFilter, CustomFactor
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline.factors import Latest, DailyReturns, Returns
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data.psychsignal import aggregated_twitter_withretweets_stocktwits as st
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from quantopian.pipeline.factors import SimpleMovingAverage, AnnualizedVolatility
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.pipeline.classifiers.morningstar import Sector

import numpy as np
from sklearn.svm import SVC, OneClassSVM, LinearSVC
from sklearn.naive_bayes import GaussianNB
import pandas as pd

MIN_TURN = 0.20  # 0.1


def compute_slope(a):
    # Least-squares slope of a series against its integer index.
    x = np.arange(0, len(a))
    y = np.array(a)
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y)[0]
    return m


class Prediction(CustomFactor):
    def compute(self, today, asset_ids, out, bull_msgs, bear_msgs, total_msgs,
                msg_intensity, returns):
        predictions = []
        for i in range(returns.shape[1]):
            try:
                bull_msg = bull_msgs[:, i]
                bear_msg = bear_msgs[:, i]
                msg_int = msg_intensity[:, i]

                # +1 for a positive daily return, -1 otherwise.
                result = (returns[:, i] > 0) * 2 - 1
                df = pd.DataFrame(data={'bull': bull_msg.flatten(),
                                        'bear': bear_msg.flatten(),
                                        'total': total_msgs[:, i],
                                        'intensity': msg_int.flatten(),
                                        'result': result.flatten(),
                                        'returns': returns[:, i].flatten()})

                df.fillna(0, inplace=True)

                # Features: rolling slope of message intensity plus lagged
                # returns. Bull/bear/total message variants were tried and
                # commented out.
                df['intense'] = df['intensity'].rolling(window=5).apply(compute_slope)
                df['intensity_feature'] = df['intense'].pct_change().shift(1)

                df['returns-1'] = df['returns'].shift(1)
                df['returns-2'] = df['returns'].shift(2)
                df['returns-3'] = df['returns'].shift(3)

                scaler = MinMaxScaler()
                features = ['intensity_feature', 'returns-1', 'returns-2', 'returns-3']

                # Record the last row before shifting; it is the live input
                # used for the prediction.
                X_live = df[features][-1:]
                df[features] = df[features].shift(1)
                df.dropna(inplace=True)
                X_train = scaler.fit_transform(df[features])
                y_train = df['result']

                prediction = (SVC()
                              .fit(X_train, y_train)
                              .predict(scaler.transform(X_live))[0])

                predictions.append(prediction)
            except ValueError:
                predictions.append(0)

        out[:] = predictions


def custom_pipeline(context):
    # my_screen = QTradableStocksUS() & SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=60).top(120)

    my_screen = SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=20).top(60)

    prediction = Prediction(inputs=[st.bull_scored_messages,
                                    st.bear_scored_messages,
                                    st.total_scanned_messages,
                                    st.bull_bear_msg_ratio,
                                    DailyReturns()],
                            window_length=30)  # window_length was lost in transcription; 30 is an assumed value

    return Pipeline(
        columns={
            'close': USEquityPricing.close.latest,
            'prediction': prediction,
            'sector': Sector(),
        },
        screen=my_screen)


def initialize(context):
    attach_pipeline(custom_pipeline(context), 'custom_pipeline')

    schedule_function(rebalance, date_rules.every_day(), time_rules.market_open(minutes=60))
    schedule_function(record_positions, date_rules.every_day(), time_rules.market_close())

    context.longs = []
    context.shorts = []
    context.init = True


def before_trading_start(context, data):
    # pipeline_output must be called here, not in initialize().
    context.output = pipeline_output('custom_pipeline')


def rebalance(context, data):
    pipeline_data = context.output

    weights = context.output['prediction']

    print context.output.shape

    objective = opt.TargetWeights(weights)

    constraints = []
    constraints.append(opt.MaxGrossExposure(1.0))
    constraints.append(opt.DollarNeutral())
    constraints.append(
        opt.PositionConcentration.with_equal_bounds(
            min=-0.10,
            max=0.10
        ))

    risk_model_exposure = opt.experimental.RiskModelExposure(
        min_momentum=-0.10,
        max_momentum=0.10,
        min_size=-0.10,
        max_size=0.10,
        min_value=-0.10,
        max_value=0.10,
        min_short_term_reversal=-0.10,
        max_short_term_reversal=0.10,
        min_volatility=-0.10,
        max_volatility=0.10,
    )
    constraints.append(risk_model_exposure)

    sector_neutral = opt.NetGroupExposure.with_equal_bounds(
        labels=pipeline_data.sector,
        min=-0.10,
        max=0.10,
    )
    constraints.append(sector_neutral)

    # First rebalance: no turnover constraint.
    if context.init:
        order_optimal_portfolio(
            objective=objective,
            constraints=constraints,
        )
        context.init = False
        return

    # Afterwards, ramp the turnover limit up from MIN_TURN until the
    # optimizer finds a feasible solution.
    turnover = np.linspace(MIN_TURN, 0.65, num=100)

    for max_turnover in turnover:
        constraints.append(opt.MaxTurnover(max_turnover))
        try:
            order_optimal_portfolio(
                objective=objective,
                constraints=constraints,
            )
            # record(max_turnover=max_turnover)
            return
        except:
            # Infeasible at this turnover level; drop the constraint and retry.
            constraints = constraints[:-1]


def record_positions(context, data):
    longs = shorts = 0
    for position in context.portfolio.positions.itervalues():
        if position.amount > 0:
            longs += 1
        elif position.amount < 0:
            shorts += 1
    record(Positions=longs + shorts)
    record(Lev=context.account.leverage)
    record(Cash=context.account.settled_cash)

Daniel,

I am surprised to see such a big jump in the P&L curve. Something weird is happening; I'll check. I like the changes you have made to the original code.

/Luc

Hi Luc, Daniel,

I'm glad to see this thread bumped, because it reminded me to revisit a variant of Luc's base algo I did back in June 2018 that passed all contest constraints. I made some changes to the inputs, training window, and constraints, but kept the standard SVC routine and limited it to trade approximately 95-100 stocks to avoid timeout issues. The live portion of the tearsheet below is just the period since I last ran the backtest, so roughly out-of-sample. What do you guys think: overfitted?


James,

It is difficult to say whether it is overfitting. There are not many tuning parameters aside from the training window length, the SVC parameter C (in my original code it was left at the default, 1.0), and the choice of features. The universe selection is just there to lower the computation time. It is possible that the current market regime is still being learned by the SVC. As such, I don't see how it could be "manually tuned" to overfit.

Make sure that the features you use (or engineer) are stationary in time and that the training window is not too short. Non-stationary features are not handled well by classical ML (e.g. if you were to use stock prices instead of returns), and the training window must be long enough that the SVC trains on a sufficient amount of data.
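Luc's stationarity point can be made concrete: a price series trends, so its level statistics drift across any training window, whereas simple returns stay roughly centered. A sketch of the price-to-returns transformation he alludes to (`pct_change`), on a synthetic random-walk price series:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
daily_returns = rng.normal(loc=0.0005, scale=0.01, size=500)
prices = pd.Series(100.0 * np.cumprod(1.0 + daily_returns))

# The stationarity-friendly feature: returns instead of price levels.
returns = prices.pct_change().dropna()

# Prices drift: the two halves of the sample have very different means.
print(abs(prices.iloc[:250].mean() - prices.iloc[250:].mean()))
# Returns do not: both halves are centered near zero.
print(abs(returns.iloc[:250].mean() - returns.iloc[250:].mean()))
```

A model trained on price levels from the first half would see inputs far outside its training range in the second half; a model trained on returns would not.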

Thanks for posting your live results. If you could post an older version of your code, that would be nice as well.

Hey Luc,

Thanks for your feedback. I pretty much agree with you that it might be too early to judge overfitting based on the OOS period, which was a very tough stretch including the major correction of Dec. 24 and the subsequent slow recovery. If you look at the tail end of the OOS performance, it looks like it's bouncing back!

My major problem here is limited computation resources, which results in timeout issues. The training window is 252 days and I limit the number of positions to 90-100 so it does not time out. Given this configuration, I get portfolio volatility of ~10%, something I know I could lower if, say, I could extend the training window to 756 days and trade 500 positions without timeout issues.

The key here is the inputs, the secret sauce. As for the code revision, it's pretty much the same as what Daniel did with TargetWeights and its added constraints, that's it!

@James, I found myself circling back to this thread and wondering if you have any guidance you can share related to the below...
For example, say I have the code below for two factors: 1) Prediction (SVC) and 2) Alpha (regression). How would I use the output of the Alpha custom factor as an input to the Prediction custom factor, in addition to the total scanned messages data already being used?

def compute_slope(a):
    x = np.arange(0, len(a))
    y = np.array(a)
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y)[0]
    return m
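A quick sanity check of the slope helper (repeated here so the snippet is self-contained; `rcond=None` is added for newer NumPy versions): on a perfectly linear series the least-squares slope equals the per-step increment.

```python
import numpy as np

def compute_slope(a):
    # Least-squares slope of the values in `a` regressed on their index 0..n-1.
    x = np.arange(0, len(a))
    y = np.array(a)
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m

# A series that rises by 2.0 per step should have slope 2.0.
slope = compute_slope([2.0, 4.0, 6.0, 8.0])
```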

class Prediction(CustomFactor):
    def compute(self, today, asset_ids, out, total_msgs, alpha_f, returns):
        predictions = []
        for i in range(returns.shape[1]):
            try:
                result = (returns[:, i] > 0) * 2 - 1
                df = pd.DataFrame(data={'total': total_msgs[:, i].flatten(),
                                        'alphas_FF': alpha_f[:, i].flatten(),
                                        'result': result.flatten(),
                                        'returns': returns[:, i].flatten()})
                df.fillna(0, inplace=True)

                # before shifting, we must record the last values as they will be used
                # to run the model on for the prediction.

                df['total_'] = df['total'].rolling(window=5).apply(compute_slope)

                df['total_feature'] = df['total_'].pct_change().shift(1)
                df['alpha_features'] = df['alphas_FF'].shift(1)

                df['returns-1'] = df['returns'].shift(1)
                df['returns-2'] = df['returns'].shift(2)
                df['returns-3'] = df['returns'].shift(3)

                # note: requires `from sklearn.preprocessing import MinMaxScaler`
                scaler = MinMaxScaler()
                features = ['total_feature',
                            'returns-1', 'returns-2', 'returns-3']
                X_live = df[features][-1:]
                df[features] = df[features].shift(1)
                df.dropna(inplace=True)
                X_train = scaler.fit_transform(df[features])
                y_train = df['result']

                prediction = SVC(). \
                    fit(X_train, y_train). \
                    predict(scaler.transform(X_live))[0]

                predictions.append(prediction)
            except ValueError:
                predictions.append(0)
        out[:] = predictions
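Outside of the pipeline machinery, the fit-on-shifted-features / predict-on-last-row pattern inside `compute` boils down to the following sketch on synthetic data (the feature names mirror the snippet above; the random series is illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 120
df = pd.DataFrame({'returns': rng.randn(n) * 0.01})
df['result'] = (df['returns'] > 0) * 2 - 1          # +1 up day, -1 down day

# Lagged returns as features.
for lag in (1, 2, 3):
    df['returns-%d' % lag] = df['returns'].shift(lag)
features = ['returns-1', 'returns-2', 'returns-3']

# Keep the most recent feature row aside: it is what we predict on "live".
X_live = df[features][-1:]

# Shift features once more so each training row's features strictly precede
# its label, then drop rows made incomplete by the shifts.
df[features] = df[features].shift(1)
df.dropna(inplace=True)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(df[features])
y_train = df['result']

prediction = SVC().fit(X_train, y_train).predict(scaler.transform(X_live))[0]
```

The extra `shift(1)` after capturing `X_live` is what keeps the training set free of look-ahead while still letting the fitted model score today's fresh features.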

class Alpha(CustomFactor):
    inputs = [USEquityPricing.close]
    window_safe = True
    def compute(self, today, assets, out, close):
        returns = pd.DataFrame(close, columns=assets).pct_change()[1:]
        spy_returns = returns[symbol('SPY')]
        # get beta and alpha by running linear regression
        A = np.vstack([spy_returns, np.ones(len(spy_returns))]).T
        m, p = np.linalg.lstsq(A, returns)[0]
        out[:] = p
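The regression in the `Alpha` factor can be checked in isolation, without the Quantopian API: regress every stock's returns on the benchmark's returns plus a constant in one `lstsq` call, and the second solution row is the vector of alphas. A self-contained sketch with synthetic returns (true betas/alphas chosen arbitrarily):

```python
import numpy as np

rng = np.random.RandomState(1)
n_days, n_assets = 60, 3
bench = rng.randn(n_days) * 0.01                      # benchmark daily returns

# Each asset's return = beta * benchmark + alpha + small noise.
betas_true = np.array([0.5, 1.0, 1.5])
alphas_true = np.array([0.001, 0.0, -0.001])
assets = (bench[:, None] * betas_true + alphas_true
          + rng.randn(n_days, n_assets) * 0.001)

# Solve [bench, 1] @ [beta; alpha] = assets for all columns at once,
# exactly as the Alpha factor does with `m, p = np.linalg.lstsq(A, returns)[0]`.
A = np.vstack([bench, np.ones(len(bench))]).T
betas, alphas = np.linalg.lstsq(A, assets, rcond=None)[0]
```

Solving all assets in a single least-squares call is what lets the factor fill `out[:]` with one alpha per column of the pipeline's close-price window.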

def custom_pipeline(context):

    # m is the universe mask; its initial definition was omitted from this snippet.
    alphas_f = Alpha(window_length=60)
    m &= SimpleMovingAverage(inputs=[st.total_scanned_messages],
                             window_length=21).top(100, mask=m)

    # Filter for stocks that are not within 2 days of an earnings announcement.
    #m &= ~((BusinessDaysUntilNextEarnings() <= 2) | (BusinessDaysSincePreviousEarnings() <= 2))

    # Filter for stocks that are announced acquisition targets.
    #m &= ~IsAnnouncedAcqTarget()

    prediction = Prediction(inputs=[st.total_scanned_messages,
                                    alphas_f,
                                    DailyReturns()],
                            # ... (remainder of the snippet was cut off)


I think I answered my own question with the attached revised code snippet. I just hope it does not time out while running.

Hi Daniel,

Haha, yes, you just answered your own question! Given that you mask to trade 100 stocks with 4-5 features and a training period of 100 days, you shouldn't time out. My max without a timeout error was 4-5 features, 252 days of training, trading 250 stocks. Good luck! Kindly post interesting results.

I second that as well. That would work. Thanks for posting.

/L

Here's a first cut at what I've come up with so far. I need to have a closer look into the leverage dropping below the minimum limits quite often. However, the only constraints added were sector exposure (+/- 10%) and a position concentration limit. As next steps, I may look at: 1) adding one more feature, 2) increasing the training period, 3) increasing the number of positions. I've held out testing past 1/31/2017 as I would like to use the rest for OOS testing.

Note, the results in the attached notebook are not derived from the exact code snippet I posted above, but they are similar in that it revolves around sentiment.

If anything sticks out to anyone I encourage the feedback (good or bad) :)


Hi Daniel,

Nice first cut! Regarding leverage dropping below the min limits, this is a sign that your factor-based stock selection is biased to one side. You can try adding a maximum leverage = 1 and/or a dollar_neutral constraint and see if that fixes the problem. I would first try increasing your training period and number of positions before adding more features. See if it lowers volatility and increases Sharpe. Did you check whether the results on the holdout data (after 1/31/2017) are consistent with the training results?

Thanks James. I had the max leverage set to 1.0 but left dollar_neutral unconstrained. Using the code below brought my leverage within the contest criteria. I agree adding features should be last on the list. I haven't tested after 1/31/2017, but I will run a tearsheet now to see.


constraints.append(opt.MaxGrossExposure(1.0))

constraints.append(opt.DollarNeutral(tolerance = 0.005))


Attached is a tear sheet from OOS data between 1/1/2017 and 4/15/2019. The results appear to hold up reasonably well between the two periods. Note the starting capital was set at $10mm for each. Some takeaways for me are to smooth out the annual volatility and increase the number of positions.


Great, Daniel, the OOS holdout results are pretty consistent with the training results. In my experience, once you've established that the model has some predictive power, it can be calibrated to its optimal performance within the limits of its constraints. As you scale the number of positions, volatility should be tempered, and this is key because you are targeting risk control. The objective should be higher risk-adjusted returns.

Hi All,

I have been revisiting this algo. Keep in mind that it was meant to be a code base for someone else to build on. In no way was it meant to trade. That said, one of the big problems now is that the selection of the universe was made pretty much randomly. I just used pretty much any filter code available to reduce the universe to 50-60 stocks so that the algo would not time out. So using the number of tweets as a filter has no underlying logic to it. It's random and it happens to work ok. I have tried changing the universe to something more basic, like "top 50 market cap", and the whole thing falls apart.

My guess is that someone would need to find a universe that makes sense for the SVM pipeline. Otherwise, this is just overfitting or luck. Or maybe, at least, find some plausible reason why using message counts works.

/L

James, do you have any tricks or tips I could leverage that you used to scale the training period to 252 days and 250 positions? So far I've only been able to train on up to 126 days and couldn't increase the number of positions without a timeout.

Hey Daniel,

To work around the timeout issues, first I limit the stock selection to 250 based on some generic factor (i.e. most liquid, top market cap, etc.), and second I compute the input factors within the pipeline, masked by the specific stock selection. It is more about code and resource efficiency. If you're a good Python coder, you can try to move the SVM routine inside BTS (before_trading_start), which has a 5-minute limit; I tried but failed.

As Luc just said above, "...someone would need to find a universe that makes sense for the SVM pipeline". Hope this helps.

Cool, thanks for the info James and Luc. I'll go back to the drawing board to see what I can work up. I agree that it comes down more to code/structure and efficiency.

Wondering if either of you have experienced the same error message I've been receiving when I use functions like the below.

"KeyError: 8554 There was a runtime error"

Specifically, when using --> spy_returns = returns[symbol('SPY')] or spy_returns = returns[sid(8554)] within a custom factor

class Alpha(CustomFactor):
    inputs = [USEquityPricing.close]
    window_safe = True
    def compute(self, today, assets, out, close):
        returns = pd.DataFrame(close, columns=assets).pct_change()[1:]
        spy_returns = returns[symbol('SPY')]
        # get beta and alpha by running linear regression
        A = np.vstack([spy_returns, np.ones(len(spy_returns))]).T
        m, p = np.linalg.lstsq(A, returns)[0]
        out[:] = p


Daniel, SPY is not in your pipeline universe. You will probably need to add it explicitly using StaticAssets. Something like this:

from quantopian.pipeline.filters import StaticAssets
# StaticAssets expects an iterable of assets
spy_universe = StaticAssets([symbol('SPY')])
universe = universe | spy_universe