Simple Machine Learning Example Mk II

My original machine learning example was a popular post, and I figure it's about time for an update.

Although machine learning usually seems complicated at first, it's actually easy to work with.

Here, a model is created based on past events and their outcomes. There are 3 input variables, or previous events, considered in this algorithm: the previous 3 days' changes in price. The outcome is whether the price increased or decreased in the following bar. Many of these events and their outcomes are used to generate a model using regression in scikit-learn. The model is then used to try to predict future changes in price.
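The event/outcome setup above can be sketched outside the backtester. This is a minimal, illustrative example on synthetic prices (only numpy and scikit-learn are assumed; none of it is Quantopian-specific):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical daily closing prices (synthetic, for illustration only).
prices = np.array([100.0, 101.0, 100.5, 102.0, 101.5, 103.0, 104.0, 103.5, 105.0])
changes = np.diff(prices)  # day-over-day price changes

lookback = 3  # use the previous 3 days' changes as the input variables
X = [changes[i:i + lookback] for i in range(len(changes) - lookback)]
y = [changes[i + lookback] for i in range(len(changes) - lookback)]

model = RandomForestRegressor(random_state=0)
model.fit(X, y)

# Predict the next change from the 3 most recent changes.
next_change = model.predict([changes[-lookback:]])[0]
```

Each row of X is a window of 3 consecutive changes, and the matching y entry is the change that followed it; that is the whole "events and outcomes" training set.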

Note that this is just an example, and should be improved before real use. Clone the algorithm, and let me know if you have any questions!

# Use the previous 3 bars' movements to predict the next movement.

# Use a random forest regressor. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def initialize(context):
    context.security = sid(8554)  # Trade SPY
    context.model = RandomForestRegressor()

    context.lookback = 3  # Look back 3 days
    context.history_range = 400  # Only consider the past 400 days' history

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=10))

    # Trade at the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))

def create_model(context, data):
    # Get the relevant daily prices
    recent_prices = history(context.history_range, '1d', 'price')[context.security].values

    # Get the price changes
    price_changes = np.diff(recent_prices).tolist()

    X = []  # Independent, or input, variables
    Y = []  # Dependent, or output, variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        X.append(price_changes[i:i + context.lookback])  # Store prior price changes
        Y.append(price_changes[i + context.lookback])    # Store the following day's price change

    context.model.fit(X, Y)  # Generate our model

def trade(context, data):
    if context.model:  # Check if our model is generated

        # Get recent prices
        recent_prices = history(context.lookback + 1, '1d', 'price')[context.security].values

        # Get the price changes
        price_changes = np.diff(recent_prices).tolist()

        # Predict using our model and the recent price changes
        prediction = context.model.predict(price_changes)
        record(prediction=prediction)

        # Go long if we predict the price will rise, short otherwise
        if prediction > 0:
            order_target_percent(context.security, 1.0)
        else:
            order_target_percent(context.security, -1.0)

def handle_data(context, data):
    pass
We have migrated this algorithm to work with a new version of the Quantopian API. The code is different from the original version, but the investment rationale of the algorithm has not changed. We've put everything you need to know here on one page.
Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

44 responses

Cheers for posting!

The code is so clean and simple compared to the models I have seen written entirely in Python.

Very good concept, and the framework given is also neat.

I tried my hand at it and found that every time I backtest over the same date range, without changing the code, the results are different; i.e., every time, the model built with the same input and output data set behaves differently.

Dears,

Can anyone recommend a good trading algo for me? I need to buy a good one with a monthly ROI of 10%.

Thank you

How come you are generating the model at the end of each week instead of each day? And have you had much experience with the other ensembles, like classifiers, etc.?

Yagnesh, that's just because of the model I am using — random forest. A different non-random model could be chosen, as well.

d36, that's a mostly arbitrary choice I made as an example, and in the hope of slightly speeding up the backtest. The model should hardly change day to day when I am considering 400 days of history, but it will usually change a nontrivial amount week to week.

I thought so. I have had a similar experience with neural nets: they learn different things from the same set of data every time.

I am not an expert on machine learning, but I went through sklearn and think 'polynomial interpolation' and decision tree regression with AdaBoost have some promise.

Yeah, there is definitely a ton of background on selecting different models for different situations. It is very easy to use machine learning to solve a problem decently well, but much more difficult to be sure how it applies to the situation and to select exactly the right model. Many models can be applied generally, though, and work fine in most situations.

In the trade method, should the prediction be based on recent_prices or the first difference of recent_prices, like in create_model?

You're right! My mistake. I was mucking around and I forgot to add that line back in. I updated the post, thanks for pointing that out.

Hi, author. Could you supply a minute-level version of the code? I am curious about the performance in minutes. Many thanks.

Actually, the code is in minutes! This is indicated above the backtest. Again, this is just a code example and you should not expect to make money with this.

Hello Gus!
I was wondering what this 'fit' method really does.
In simple words, can we say that it looks at the last 3 days, searches back in history for something similar, and then considers the day after those 3-day patterns, to see if history repeats itself, in order to make a prediction?
In this case I have a question (sorry for my bad English): I was doing something that looks similar, but with a 'manual' approach. Do you think this machine learning method is more efficient than a simple correlation analysis when it comes to finding similar patterns in history?


I have updated my notebook with some comments (I cannot edit the previous notebook attachment).


Thanks for the explanation of your work, cool notebook!

That's what it does to an extent; however, it depends on the model used by scikit-learn. For example, the RandomForestRegressor I'm using makes use of decision trees. It's not an easy subject, and I'm not an expert myself, but you can read about it here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble

It looks to me like you made your own sort of model that you are using to look at the price time series. You can simply compare models to see what works best, but beware that you may be overfitting. Also, it is generally difficult to make signals based purely on price time series work! It may be wise to incorporate some other streams of information, like volume data. Good luck.

I haven't been able to replicate your backtest results so far, Gus. I take it the algorithm is stochastic, since I get different results each time I run it. I have yet to see it produce positive returns, though. This is what I get running the code as-is on the same time period. Any idea why the stochastic effects might be giving such drastically different outcomes? Is there a good way to make this deterministic?

[Backtest attachment: the original algorithm run as-is]

The reason it's stochastic is because this uses a random forest model. I just selected this as an example because it's a common model. You can read about alternative models here. As I noted, this is just an example, and you should not expect to get consistent positive returns out of something like this. In order to improve the likelihood of success, you should broaden the scope of the input data (for example, include trade volume, earnings, etc.). It's generally very difficult to get an algorithm that yields consistently positive returns. Good luck!
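For anyone who wants determinism, scikit-learn's `random_state` parameter seeds the forest. A minimal sketch on synthetic data (the variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
X = rng.randn(200, 3)   # 200 samples of 3 lagged price changes (synthetic)
y = rng.randn(200)      # the following day's change (synthetic)

# Two forests with the same seed produce identical predictions...
a = RandomForestRegressor(random_state=0).fit(X, y).predict(X[:5])
b = RandomForestRegressor(random_state=0).fit(X, y).predict(X[:5])

# ...while an unseeded forest can differ from run to run.
c = RandomForestRegressor().fit(X, y).predict(X[:5])
```

With a fixed seed, the backtest becomes repeatable; the underlying model quality is, of course, unchanged.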

Just a small FYI for anyone interested in ensemble models: the default number of individual estimators in sklearn is very (too) small. A random forest should have at least hundreds of trees (thousands is better); the default in sklearn is 10 (!). As these are [almost] random decision trees, it's no wonder that you will get wildly different results each time.
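To see this effect concretely, here is a sketch (synthetic data; the `spread` helper is illustrative) comparing how much the prediction at one point varies across differently-seeded forests with 10 trees versus 500:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = X @ np.array([0.5, -0.2, 0.1]) + 0.1 * rng.randn(300)
x_new = np.zeros((1, 3))

def spread(n_trees, seeds=range(10)):
    # Std-dev of the prediction at x_new across forests differing only in seed.
    preds = [RandomForestRegressor(n_estimators=n_trees, random_state=s)
             .fit(X, y).predict(x_new)[0] for s in seeds]
    return np.std(preds)

small, large = spread(10), spread(500)
# More trees average out the per-tree randomness, so the spread shrinks.
```

The 500-tree spread should come out noticeably smaller than the 10-tree spread, which is exactly why the tiny default makes backtests so unstable.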

So, I saw this and was interested in the algo. When I ran it, I noticed something rather curious (not the RandomForestRegressor(), that I understand). Rather, when running the algo, it would place a buy order and then, the subsequent day, place a sell order for twice the amount held. Then it gets more bizarre. The algo would buy at 140.57, sell at 139.00; then magic occurs, and your cash value goes up a few grand.

I ran this backtest during 1/1/2008-1/1/2009 to stress test, but I think I broke something.

Ideas??

[Backtest attachment: the same algorithm with context.lookback changed to 10]

Corey, that is to be expected because we are always either fully long or fully short: to go from long to short, we have to sell twice the amount held. For your second point, if we short at $140.57 and then reduce our position to zero at $139.00, then we have made $1.57 per share. Are you sure you're looking at a long and not a short? Other unexpected cash fluctuations may occur for a variety of reasons: slippage taking effect, commission, only being able to buy integer numbers of shares, etc. These are all designed to best simulate market conditions. Cash fluctuations should not be too surprising, although a few thousand dollars is a lot in this case, so that sounds like profit showing up in cash later due to us trading in daily mode. If you want to dig deeper, you can log parameters of interest or run in minute mode. Hope that helps.
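The position arithmetic can be checked directly (figures from the post; the variable names are just for illustration):

```python
# Flipping from 100% long to 100% short means selling twice the shares held:
shares_held = 100
target = -shares_held                   # the new (short) target position
order_size = target - shares_held       # -200: sell twice the amount held

# Shorting at $140.57 and covering at $139.00 is a profit, not a loss:
short_price, cover_price = 140.57, 139.00
pnl_per_share = short_price - cover_price   # 1.57 per share
```

So the "sell twice the amount held" orders and the cash going up after a price drop are both consistent with a short position, not a bug.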

I ran this with a couple of changes:
1. changed the stock symbol to BDCL
2. changed the lookback to 10 days

It lost about 100% of the money, so then I changed the logic to sell short when the prediction was up, and it still lost about 100%.

Is it just frequent trading and commissions that are killing it?

[Backtest attachment: the algorithm modified to trade symbol('BDCL'), with context.lookback = 5 and the long/short logic inverted (long when prediction < 0)]

Looks like it is mostly due to slippage in this case. This can be fixed by limiting the trading amount or the trading frequency. Here is your algorithm with slippage and commission off:

[Backtest attachment: the BDCL variant re-run with the original long/short logic (long when prediction > 0)]

Gus,

I've been playing around with this model, and I've been trying to add in another independent variable, but keep getting errors. Any chance you could post an updated backtest for us that has something like volume as an additional independent variable?

Many thanks,

Joseph

The best way to do this, I think, is to just concatenate the inputs into the same list for each sample. Here's an example with volume as an input. I also changed how often the model is generated and when we trade. Remember, this is just an example of how to use some sklearn models. There are also better ways to implement this now, like with the Pipeline API.
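Concatenating the per-sample feature lists looks like this in isolation (synthetic changes; the shape matches the algorithm's X):

```python
lookback = 3
price_changes = [0.5, -0.2, 0.3, 0.1, -0.4]      # synthetic daily price changes
volume_changes = [1000, -500, 200, 300, -100]    # synthetic daily volume changes

X = []
for i in range(len(price_changes) - lookback):
    # Each sample is 3 price changes followed by 3 volume changes: 6 features.
    X.append(price_changes[i:i + lookback] + volume_changes[i:i + lookback])

# X[0] is [0.5, -0.2, 0.3, 1000, -500, 200]
```

Because Python's `+` on lists concatenates rather than adds, each row simply grows from 3 features to 6; the model never knows (or cares) which entries came from prices and which from volumes.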

# Use the previous 3 bars' price and volume movements to predict the next price movement.

# Use a random forest regressor. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def initialize(context):
    context.security = sid(8554)  # Trade SPY
    context.model = RandomForestRegressor()

    context.lookback = 3  # Look back 3 days
    context.history_range = 400  # Only consider the past 400 days' history

    # Generate a new model every month
    schedule_function(create_model, date_rules.month_end(), time_rules.market_close(minutes=10))

    # Trade at the start of every week
    schedule_function(trade, date_rules.week_start(), time_rules.market_open(minutes=1))

def create_model(context, data):
    # Get the relevant daily data
    recent_prices = data.history(context.security, 'price', context.history_range, '1d').values
    recent_volumes = data.history(context.security, 'volume', context.history_range, '1d').values

    # Find the changes
    price_changes = np.diff(recent_prices).tolist()
    volume_changes = np.diff(recent_volumes).tolist()

    X = []  # Independent, or input, variables
    Y = []  # Dependent, or output, variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        # Store prior price changes concatenated with prior volume changes
        X.append(price_changes[i:i + context.lookback] + volume_changes[i:i + context.lookback])
        Y.append(price_changes[i + context.lookback])  # Store the following day's price change

    context.model.fit(X, Y)  # Generate our model

def trade(context, data):
    if context.model:  # Check if our model is generated

        # Get recent data
        recent_prices = data.history(context.security, 'price', context.lookback + 1, '1d').values
        recent_volumes = data.history(context.security, 'volume', context.lookback + 1, '1d').values

        # Get the changes
        price_changes = np.diff(recent_prices).tolist()
        volume_changes = np.diff(recent_volumes).tolist()

        # Predict using our model and the recent changes
        prediction = context.model.predict(price_changes + volume_changes)
        record(prediction=prediction)

        # Go long if we predict the price will rise, short otherwise
        if prediction > 0:
            order_target_percent(context.security, 1.0)
        else:
            order_target_percent(context.security, -1.0)

Gotcha. Thanks for taking the time to post this, Gus!

@Gus Gordon
Hello Gus, I am not able to train any other model using your method of implementing ML. I used your code as a template and applied QDA(), LogisticRegression(), and SVC() on features like daily_returns, multiple_day_returns, rolling_mean, and time_lagged, but each time, except for RandomForest, it returns an error that the model is not fit yet.

Is there something wrong in the implementation? On the other hand, in my local IPython/Jupyter notebook I am able to train the models.
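As a standalone check of the encoding-plus-classifier step, here is a sketch on synthetic data (note newer scikit-learn versions expose QDA as sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis rather than sklearn.qda; the data here is random, for illustration only):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
returns = rng.randn(100)
labels = np.where(returns >= 0, 'up', 'down')   # string labels, as in the post

le = preprocessing.LabelEncoder()
y = le.fit_transform(labels)        # 'down' -> 0, 'up' -> 1 (alphabetical order)

X = rng.randn(100, 3)               # synthetic feature matrix
model = QuadraticDiscriminantAnalysis().fit(X, y)   # must be fit before predict
pred = model.predict(X[:1])
decoded = le.inverse_transform(pred)  # back to 'up'/'down'
```

If this pattern works locally but not in the backtester, the "not fit yet" error usually means predict ran before fit ever succeeded, so it is worth checking whether the scheduled create_model call raised before reaching model.fit.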

from sklearn.qda import QDA
from sklearn import preprocessing
import numpy as np
import pandas as pd

def initialize(context):
    context.assets = sid(8554)  # Trade SPY
    context.model = QDA()
    context.lookback = 5  # Look back
    context.history_range = 200

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=10))

    # Trade at the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))

def create_model(context, data):
    # Get the relevant daily prices
    recent_prices = data.history(context.assets, 'price', context.history_range, '1d')

    context.ma_50 = recent_prices.values[-50:].mean()
    context.ma_200 = recent_prices.values[-200:].mean()

    time_lags = pd.DataFrame(index=recent_prices.index)
    time_lags['price'] = recent_prices.values
    time_lags['daily_returns'] = time_lags['price'].pct_change()
    time_lags['multiple_day_returns'] = time_lags['price'].pct_change(3)
    time_lags['rolling_mean'] = time_lags['daily_returns'].rolling(window=4, center=False).mean()
    time_lags['time_lagged'] = time_lags['price'] - time_lags['price'].shift(-2)
    X = time_lags[['price', 'daily_returns', 'multiple_day_returns', 'rolling_mean']].dropna()

    time_lags['updown'] = time_lags['daily_returns']
    time_lags.updown[time_lags['daily_returns'] >= 0] = 'up'
    time_lags.updown[time_lags['daily_returns'] < 0] = 'down'
    le = preprocessing.LabelEncoder()
    time_lags['encoding'] = le.fit(time_lags['updown']).transform(time_lags['updown'])
    context.model.fit(X, time_lags['encoding'][4:])  # Generate our model

def trade(context, data):
    if context.model:  # Check if our model is generated

        # Get recent prices
        recent_prices = data.history(context.assets, 'price', context.lookback, '1d')

        time_lags = pd.DataFrame(index=recent_prices.index)
        time_lags['price'] = recent_prices.values
        time_lags['daily_returns'] = time_lags['price'].pct_change()
        time_lags['multiple_day_returns'] = time_lags['price'].pct_change(3)
        time_lags['rolling_mean'] = time_lags['daily_returns'].rolling(window=4, center=False).mean()
        time_lags['time_lagged'] = time_lags['price'] - time_lags['price'].shift(-2)
        X = time_lags[['price', 'daily_returns', 'multiple_day_returns', 'rolling_mean']].dropna()

        prediction = context.model.predict(X)
        if prediction == 1 and context.ma_50 > context.ma_200:
            order_target_percent(context.assets, 1.0)
        elif prediction == 2 and context.ma_50 < context.ma_200:
            order_target_percent(context.assets, -1.0)

def handle_data(context, data):
    pass

Gus, I wonder, can the framework be easily modified to take multiple stocks as input?

Giuseppe, your notebook is very cool, do you have an algorithm framework similar to this one for it?

I modified the code to take multiple stocks as input to the Random Forest Regressor. A new model is trained for each stock. I parameterized the regressor with 64 trees and a max depth of 6; I suppose the parameters could be tuned using cross-validation. A good alternative to Random Forest is XGBoost (https://github.com/dmlc/xgboost); however, it's not yet available in the Python IDE.
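Tuning those two parameters by cross-validation might look like this (synthetic data; the grid values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(200, 3)                        # 3 lagged price changes (synthetic)
y = X @ np.array([0.4, -0.3, 0.2]) + 0.1 * rng.randn(200)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [32, 64], 'max_depth': [3, 6]},
    cv=3,                                    # 3-fold cross-validation
)
grid.fit(X, y)
best = grid.best_params_                     # the winning combination
```

For time-series features like these, a time-ordered splitter would arguably be a better choice than plain k-fold, since shuffled folds let the model peek at the future.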

from sklearn.ensemble import RandomForestRegressor
import numpy as np

def initialize(context):
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.1))

    context.security_list = [sid(19662),  # XLY Consumer Discretionary SPDR Fund
                             sid(19656),  # XLF Financial SPDR Fund
                             sid(19658),  # XLK Technology SPDR Fund
                             sid(19655),  # XLE Energy SPDR Fund
                             sid(19661),  # XLV Health Care SPDR Fund
                             sid(19657),  # XLI Industrial SPDR Fund
                             sid(19659),  # XLP Consumer Staples SPDR Fund
                             sid(19654),  # XLB Materials SPDR Fund
                             sid(19660)]  # XLU Utilities SPDR Fund
    context.models = {}
    context.prediction = np.zeros_like(context.security_list)

    context.lookback = 3  # Look back 3 days
    context.history_range = 400  # Only consider the past 400 days' history

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=60))

    # Trade at the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))

def create_model(context, data):
    for idx, security in enumerate(context.security_list):
        train_model(context, data, idx)

def train_model(context, data, idx):
    # Get the relevant daily prices
    recent_prices = data.history(context.security_list[idx], 'price',
                                 context.history_range, '1d').values

    # Get the price changes
    price_changes = np.diff(recent_prices).tolist()

    X = []  # Independent, or input, variables
    Y = []  # Dependent, or output, variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        X.append(price_changes[i:i + context.lookback])  # Store prior price changes
        Y.append(price_changes[i + context.lookback])    # Store the following day's price change

    model = RandomForestRegressor(n_estimators=64, max_depth=6)
    model.fit(X, Y)  # Generate our model
    context.models[idx] = model

def trade(context, data):
    if context.models:
        for idx, security in enumerate(context.security_list):

            # Get recent prices
            recent_prices = data.history(security, 'price',
                                         context.lookback + 1, '1d').values
            # Get the price changes
            price_changes = np.diff(recent_prices).tolist()

            # Predict using this stock's model and the recent price changes
            prediction = context.models[idx].predict(price_changes)
            weight = 1.0 / len(context.security_list)

            context.prediction[idx] = prediction
            # Go long if we predict the price will rise, short otherwise
            if prediction > 0:
                order_target_percent(security, +weight)
            else:
                order_target_percent(security, -weight)

        record(prediction=context.prediction[0])

def handle_data(context, data):
    pass

How does something like scikit-learn get added as a supported package on the Quantopian platform?

I tested it with volume alone and a different lookback period and history range. Seems like it can still generate alpha.

# Use the previous 5 bars' volume movements to predict the next movement.

# Use a random forest regressor. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def initialize(context):
    context.security = sid(8554)  # Trade SPY
    context.model = RandomForestRegressor()

    context.lookback = 5         # Look back 5 days, was 3 days
    context.history_range = 200  # Only consider the past 200 days' history, was 400

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=10))

    # Trade 1 minute after the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))

def create_model(context, data):
    # Get the relevant daily volumes (changed from price to volume)
    recent_volumes = data.history(context.security, 'volume', context.history_range, '1d').values

    # Get the volume changes
    volume_changes = np.diff(recent_volumes).tolist()

    X = []  # Independent, or input variables
    Y = []  # Dependent, or output variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        X.append(volume_changes[i:i+context.lookback])  # Store prior volume changes
        Y.append(volume_changes[i+context.lookback])    # Store the day's volume change

    context.model.fit(X, Y)  # Generate our model

def trade(context, data):
    if context.model:  # Check if our model is generated
        # Get recent volumes
        recent_volumes = data.history(context.security, 'volume', context.lookback+1, '1d').values

        # Get the volume changes
        volume_changes = np.diff(recent_volumes).tolist()

        # Predict using our model and the recent volume changes
        prediction = context.model.predict(volume_changes)
        record(prediction = prediction)

        # Go long if we predict a positive change, short otherwise
        if prediction > 0:
            order_target_percent(context.security, 1.0)
        else:
            order_target_percent(context.security, -1.0)


Thanks for sharing. It's an amazing concept, and the framework looks as clean as possible. We need more people like you in the world!

When people first explained machine learning to me it sounded like rocket science, but as you can see from this explanation, it's really not that hard to work with. Thanks Gus! Really great post.

I ran a backtest using volume to predict SPY moves but then had the algo trade XIV instead. Huge drawdowns so this strategy needs more work, but recent results seem promising.

# Use the previous 5 bars' volume movements to predict the next movement.

# Use a random forest regressor. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def initialize(context):
    context.security = sid(8554)  # ETF predictor
    context.target = sid(40516)   # ETF target
    context.model = RandomForestRegressor()

    context.lookback = 5         # Look back 5 days, was 3 days
    context.history_range = 200  # Only consider the past 200 days' history, was 400

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=10))

    # Trade 1 minute after the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))

def create_model(context, data):
    # Get the relevant daily volumes (changed from price to volume)
    recent_volumes = data.history(context.security, 'volume', context.history_range, '1d').values

    # Get the volume changes
    volume_changes = np.diff(recent_volumes).tolist()

    X = []  # Independent, or input variables
    Y = []  # Dependent, or output variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        X.append(volume_changes[i:i+context.lookback])  # Store prior volume changes
        Y.append(volume_changes[i+context.lookback])    # Store the day's volume change

    context.model.fit(X, Y)  # Generate our model

def trade(context, data):
    if context.model:  # Check if our model is generated
        # Get recent volumes
        recent_volumes = data.history(context.security, 'volume', context.lookback+1, '1d').values

        # Get the volume changes
        volume_changes = np.diff(recent_volumes).tolist()

        # Predict using our model and the recent volume changes
        prediction = context.model.predict(volume_changes)
        record(prediction = prediction)

        # Go long in the target if we predict a positive change, short otherwise
        if prediction > 0:
            order_target_percent(context.target, 1.0)
        else:
            order_target_percent(context.target, -1.0)


Gus, any idea how to add a second variable to the random forest prediction? For example, if I wanted to add the MACD from Ta-lib as a variable, how would I do that?
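One way to do it (not from the original post) is to widen each training row: keep the lookback window of price changes and append the indicator's value at the window's end. Below is a minimal numpy-only sketch; it computes the MACD line from two EMAs directly rather than calling Ta-lib, and `ema` and `build_features` are illustrative helpers, not Quantopian API:

```python
import numpy as np

def ema(series, span):
    # Exponential moving average with smoothing factor 2/(span+1)
    alpha = 2.0 / (span + 1)
    out = np.empty(len(series), dtype=float)
    out[0] = series[0]
    for i in range(1, len(series)):
        out[i] = alpha * series[i] + (1 - alpha) * out[i - 1]
    return out

def build_features(prices, lookback=5):
    changes = np.diff(prices)
    macd = ema(prices, 12) - ema(prices, 26)  # classic MACD line
    X, Y = [], []
    for i in range(len(changes) - lookback):
        row = list(changes[i:i + lookback])
        row.append(macd[i + lookback])      # extra feature: MACD at the window's end
        X.append(row)
        Y.append(changes[i + lookback])     # target: the next change, as before
    return np.array(X), np.array(Y)

prices = 100 + np.cumsum(np.random.RandomState(0).randn(100))
X, Y = build_features(prices)
# each row now has lookback + 1 columns: 5 price changes plus one MACD value
```

The same pattern extends to any additional indicator: compute it over the full history and append its value (or a short window of it) to each row.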

I need help using this algo with order_optimal_portfolio. Can someone recreate it?

@Vadim Smolyakov I have tried using this algo with gradient boosting regression from sklearn. It's similar to XGBoost in that both use gradient-boosted trees, a form of ensemble learning. Does anyone know if Quantopian has LightGBM support? It is much faster and more efficient due to its low-level architecture.

Recreated using two regression models: prediction = 0.5 random forest + 0.5 gradient boosting. The idea is that we don't base our decision purely on one regression technique. Still managed to create alpha using MSFT instead of SPY. I'll attach a visual pipeline image that explains the process soon. Slippage and commissions are still assumed.
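For reference, the 50/50 blend in that backtest is just an average of two fitted regressors' predictions. Outside Quantopian, the same idea might look like this (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                         # 200 rows of 5 lagged changes
y = X @ rng.randn(5) + 0.1 * rng.randn(200)   # synthetic target

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
gbr = GradientBoostingRegressor(learning_rate=0.1, n_estimators=150,
                                random_state=0).fit(X, y)

# Equal-weight ensemble: average the two models' predictions
blend = 0.5 * rf.predict(X[:10]) + 0.5 * gbr.predict(X[:10])
```

Unequal weights (e.g. weighted by each model's out-of-sample error) are a natural next step.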

# Use the previous 5 bars' volume movements to predict the next movement.

# Blend a random forest and a gradient boosting regressor. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import numpy as np

def initialize(context):
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.1))

    context.security = sid(5061)  # MSFT (was SPY)
    context.model1 = GradientBoostingRegressor(learning_rate=0.1, n_estimators=150)
    context.model2 = RandomForestRegressor()

    context.lookback = 5         # Look back 5 days, was 3 days
    context.history_range = 150  # Only consider the past 150 days' history, was 400

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=10))

    # Trade 1 minute after the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))

def create_model(context, data):
    # Get the relevant daily volumes (changed from price to volume)
    recent_volumes = data.history(context.security, 'volume', context.history_range, '1d').values

    # Get the volume changes
    volume_changes = np.diff(recent_volumes).tolist()

    X = []  # Independent, or input variables
    Y = []  # Dependent, or output variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        X.append(volume_changes[i:i+context.lookback])  # Store prior volume changes
        Y.append(volume_changes[i+context.lookback])    # Store the day's volume change

    context.model1.fit(X, Y)  # Generate our models
    context.model2.fit(X, Y)

def trade(context, data):
    if context.model1 and context.model2:  # Check if our models are generated
        # Get recent volumes
        recent_volumes = data.history(context.security, 'volume', context.lookback+1, '1d').values

        # Get the volume changes
        volume_changes = np.diff(recent_volumes).tolist()

        # Predict using an equal-weight blend of both models
        prediction = (0.5 * context.model1.predict(volume_changes)) + (0.5 * context.model2.predict(volume_changes))
        record(prediction = prediction)

        # Go long if we predict a positive change, short otherwise
        if prediction > 0:
            order_target_percent(context.security, 1.0)
        else:
            order_target_percent(context.security, -1.0)

Adding some more complexity to the "simple" program, just for fun:

1. Applying a scaler to the training data using sklearn's MinMaxScaler, so that the data is now scaled from -1 to 1.
2. Giving the ability to use multiple regression models to create a prediction.
3. Healthcare + tech are a good combo since they naturally diversify each other.
4. Decreasing the frequency of shorting to once a week (previously, there was a chance it would sell every day).

P.S. I wouldn't use gradient boosting or any ensemble methods in the real world. Bayesian classifiers, in my opinion, should have better performance.
@Vadim Smolyakov, would there not be a flaw in using cross-validation with time-series data, since we are dealing with sequential movements?
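If someone wants to try the Bayesian-classifier suggestion, one plausible drop-in change is to turn the target into an up/down label and fit scikit-learn's GaussianNB instead of a regressor. A rough sketch on synthetic data (all names illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(1)
changes = rng.randn(300)    # stand-in for daily price changes
lookback = 5

X, y = [], []
for i in range(len(changes) - lookback):
    X.append(changes[i:i + lookback])                  # prior changes as features
    y.append(1 if changes[i + lookback] > 0 else -1)   # up/down label

clf = GaussianNB().fit(X, y)
signal = clf.predict([changes[-lookback:]])[0]  # +1 -> go long, -1 -> go short
```

Because the target is now a class label, metrics like accuracy and the confusion matrix apply directly, which sidesteps the question of how to evaluate a regression's sign.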

# Use the previous 5 bars' price and volume movements to predict the next movement.

# Blend a random forest and a gradient boosting regressor. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import talib

def initialize(context):
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.1))

    context.security_list = [sid(5061),   # MSFT - Tech
                             sid(23103)]  # ANTM - Healthcare

    context.volume_scalers = {}
    context.price_scalers  = {}

    context.scaler = MinMaxScaler(feature_range=(-1, 1))

    context.rfr_models = {}
    context.gbr_models = {}

    context.prediction = np.zeros_like(context.security_list)

    context.lookback = 5         # Look back 5 days, was 3 days
    context.history_range = 150  # Only consider the past 150 days' history, was 400

    # Generate a new model every week
    schedule_function(create_model, date_rules.week_end(), time_rules.market_close(minutes=10))

    # Trade 1 minute after the start of every day
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes=1))
    # Consider shorting at the start of every month
    schedule_function(short, date_rules.month_start(), time_rules.market_open(minutes=2))

def create_model(context, data):
    for idx, security in enumerate(context.security_list):
        train_model(context, data, idx)

def train_model(context, data, idx):
    recent_volumes = data.history(context.security_list[idx], 'volume', context.history_range, '1d').values
    recent_prices = data.history(context.security_list[idx], 'price', context.history_range, '1d').values

    # Get the price and volume changes
    volume_changes = np.diff(recent_volumes).tolist()
    price_changes = np.diff(recent_prices).tolist()

    # Fit one scaler per security and scale the training features to [-1, 1]
    context.volume_scalers[idx] = context.scaler.fit(volume_changes)
    X_vol_changes_train = context.volume_scalers[idx].transform(volume_changes).tolist()

    context.price_scalers[idx] = context.scaler.fit(price_changes)
    X_price_changes_train = context.price_scalers[idx].transform(price_changes).tolist()

    X = []  # Independent, or input variables
    Y = []  # Dependent, or output variable

    # For each day in our history
    for i in range(context.history_range - context.lookback - 1):
        X.append(X_price_changes_train[i:i+context.lookback] + X_vol_changes_train[i:i+context.lookback])  # Store prior price and volume changes
        Y.append(price_changes[i+context.lookback] + volume_changes[i+context.lookback])  # Store the day's price plus volume change

    rfr = GradientBoostingRegressor(learning_rate = 0.1, n_estimators = 150)
    gbr = RandomForestRegressor()

    rfr.fit(X, Y)  # Generate our models
    gbr.fit(X, Y)
    context.rfr_models[idx] = rfr
    context.gbr_models[idx] = gbr

def trade(context, data):
    if context.rfr_models and context.gbr_models:  # Check if our models are generated
        for idx, security in enumerate(context.security_list):
            # Get recent prices and volumes
            recent_volumes = data.history(security, 'volume', context.lookback+1, '1d').values
            recent_prices = data.history(security, 'price', context.lookback+1, '1d').values

            # Get the price and volume changes
            volume_changes = np.diff(recent_volumes).tolist()
            price_changes = np.diff(recent_prices).tolist()

            # Use our MinMaxScaler on the testing data
            volume_changes = context.volume_scalers[idx].transform(volume_changes).tolist()
            price_changes = context.price_scalers[idx].transform(price_changes).tolist()

            weight = 1.0 / len(context.security_list)
            # Predict using our models and the recent changes
            prediction = 0.5*(context.gbr_models[idx].predict(price_changes + volume_changes)) \
                       + 0.5*(context.rfr_models[idx].predict(price_changes + volume_changes))

            record(prediction = prediction)

            # Go long if we predict the price will rise (shorting is handled separately)
            if prediction > 0:
                order_target_percent(security, +weight)
            else:
                order_target_percent(security, weight)

def short(context, data):
    if context.rfr_models and context.gbr_models:  # Check if our models are generated
        for idx, security in enumerate(context.security_list):
            # Get recent prices and volumes
            recent_volumes = data.history(security, 'volume', context.lookback+1, '1d').values
            recent_prices = data.history(security, 'price', context.lookback+1, '1d').values

            # Get the price and volume changes
            volume_changes = np.diff(recent_volumes).tolist()
            price_changes = np.diff(recent_prices).tolist()

            # Use our MinMaxScaler on the testing data
            volume_changes = context.volume_scalers[idx].transform(volume_changes).tolist()
            price_changes = context.price_scalers[idx].transform(price_changes).tolist()

            weight = 1.0 / len(context.security_list)
            # Predict using our models and the recent changes
            prediction = 0.5*(context.gbr_models[idx].predict(price_changes + volume_changes)) \
                       + 0.5*(context.rfr_models[idx].predict(price_changes + volume_changes))

            record(prediction = prediction)

            # Go short if we predict the price will fall
            if prediction < 0:
                order_target_percent(security, -weight)

So it's safe to say this is a sliding-window model. I challenge someone to create an expanding-window model and use AUC to compare the two. I have turned this into a logistic regression problem (message me if you want the script); it should be fairly easy to convert.
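For anyone taking up the challenge: the only structural difference between the two schemes is whether the training slice's left edge moves. An index-only sketch, assuming one-step-ahead test points (both function names are illustrative):

```python
def sliding_windows(n, train_size, step=1):
    # Fixed-width training slice that slides forward through the data
    for start in range(0, n - train_size, step):
        yield range(start, start + train_size), start + train_size

def expanding_windows(n, min_train, step=1):
    # Training slice always starts at 0 and grows as more data arrives
    for end in range(min_train, n, step):
        yield range(0, end), end

# e.g. with 10 observations and a training size of 4:
slide = [(list(tr), test) for tr, test in sliding_windows(10, 4)]
expand = [(list(tr), test) for tr, test in expanding_windows(10, 4)]
```

Either generator can drive a walk-forward backtest: refit on the train indices, predict the single test index, and collect predictions for an AUC comparison.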

Simple equal weight rebalanced daily

# ------------------------------------------------
security_list, lev = symbols('MSFT', 'ANTM'), 1.0
# ------------------------------------------------
def initialize(context):
    for sec in security_list:
        order_target_percent(sec, lev / len(security_list))

has the following performance for the same period (06/03/2016 to 09/30/2018, $100,000 starting capital):

Total Returns: 127.1%
Benchmark Returns: 44.62%
Alpha: 0.18
Beta: 1.10
Sharpe: 2.31
Sortino: 3.58
Volatility: 0.16
Max Drawdown: -11.61%

What do we get from machine learning models?

What you found was a period (06/03/2016 to 09/30/2018) in which the volumes and returns of two stocks were so strongly correlated that ML could easily extract alpha. But when you change the period, things stop working! Just a reminder that one needs to cross-validate before getting excited.

(Backtest code identical to the algorithm in the earlier post, except that the set_slippage line is commented out.)

@James Villa

K-fold cross-validation will not work with time-series models; see this post. One could perhaps use TimeSeriesSplit to cross-validate the data in the train_model function. The traditional supervised-learning assumption of i.i.d. observations doesn't hold here, since financial data is sequential. Though you are correct, there is a lot of room for improvement in this model, e.g.:

1. Would an expanding-window model or a rolling-window model be more accurate?
2. How does the model perform with X-step-ahead forecasts?
3. Can prediction intervals help make the model perform better?
4. Can we add any other endogenous variables?
5. Does it only work for a subset of stocks (e.g. high-cap stocks)?
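TimeSeriesSplit, mentioned above, generates exactly these ordered, non-leaking folds; a quick look at the indices it produces:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # 20 ordered observations
tscv = TimeSeriesSplit(n_splits=4)

folds = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
# every test fold lies strictly after its training fold, so there is no look-ahead
```

Each successive fold's training set is an expanding window, so this also answers part of question 1 out of the box.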

Nice ideas and a good base for further experimenting! I haven't been active for a while, so I am not sure if my observations are correct, but maybe they are helpful for you:

(1) I think the prediction of the current day's price change (y, i+context.lookback) is based on price changes which include the current day (x, i+context.lookback).

# For each day in our history
for i in range(context.history_range - context.lookback - 1):
    X.append(price_changes[i:i+context.lookback])  # Store prior price changes, including the day's price change
    Y.append(price_changes[i+context.lookback])    # Store the day's price change


Would Y.append(price_changes[i+context.lookback+1]) work better? Or even more days and not only the next day?

(2) In the last example, a prediction is made for the sum of price and volume changes, however it would be sufficient to know if the price is going up or down, so Y could be shortened.

# Old
Y.append(price_changes[i+context.lookback] + volume_changes[i+context.lookback])  # Store the day's price plus volume change
# New
Y.append(price_changes[i+context.lookback+1])  # Store the next day's price change only


Also, a little swap happened in the variable names:

rfr = GradientBoostingRegressor(learning_rate = 0.1, n_estimators = 150)
gbr = RandomForestRegressor()


(3) GBR can give better results if parameters (e.g. the learning rate) are slightly adapted, e.g.

    gbr = GradientBoostingRegressor(learning_rate = 0.01, n_estimators = 150, max_depth = 4, min_samples_split = 2)


(4) Mean forecast error can be calculated without negative impact by fitting twice:

    import math
    from sklearn.metrics import mean_squared_error

    # Generate our models
    rfr = RandomForestRegressor()
    gbr = GradientBoostingRegressor(learning_rate = 0.01, n_estimators = 150, max_depth = 4, min_samples_split = 2)

    # Test our models on independent test data
    offset = int(len(X) * 0.8)
    X_train, Y_train = X[:offset], Y[:offset]
    X_test, Y_test = X[offset:], Y[offset:]

    rfr.fit(X_train, Y_train)
    rfr_me = math.sqrt(mean_squared_error(Y_test, rfr.predict(X_test)))
    context.rfr_me[idx] = rfr_me

    gbr.fit(X_train, Y_train)
    gbr_me = math.sqrt(mean_squared_error(Y_test, gbr.predict(X_test)))
    context.gbr_me[idx] = gbr_me

    # Fit our models with all data
    rfr.fit(X, Y)
    gbr.fit(X, Y)


and recording the ratio can show that GBR is slightly better:

    # record(mean_error_rfr = context.rfr_me[idx])
    # record(mean_error_gbr = context.gbr_me[idx])
    record(mean_error_gbr_rfr_ratio = context.gbr_me[idx] / context.rfr_me[idx])


(5) Finally, it seems training is done with closing prices, but trading is done including the opening price, since schedule_function starts the trading every morning and:

    recent_prices = data.history(security, 'price', context.lookback+1, '1d').values

Maybe training should also include the open price in every last data point.

Have fun,
Frank

One addition on scaling: according to this study, the impact of scaling on bagging algorithms such as RandomForestRegressor and boosting algorithms such as GradientBoostingRegressor is negligible, since they are decision-tree based. Furthermore, this scaler

    context.scaler = MinMaxScaler(feature_range=(-1, 1))

scales each feature individually and does not preserve symmetry, so it is probably better not to use it here, to make sure all price increases remain > 0 and all price decreases remain < 0 as input for the ML algo.
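The symmetry point is easy to demonstrate: MinMaxScaler maps [min, max] linearly onto [-1, 1], so zero only maps to zero when the data is symmetric around it. A small positive change can come out negative (note that sklearn scalers expect 2-D column input):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Mostly-positive price changes: min = -1.0, max = 3.0
changes = np.array([-1.0, 0.5, 3.0]).reshape(-1, 1)
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(changes)

scaled = scaler.transform(changes).ravel()
# The positive change 0.5 is mapped below zero, flipping its sign
```

A sign-preserving alternative would be to divide by the maximum absolute value instead, so that 0 stays at 0.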

@Frank, thanks for peer-reviewing this.

1. This talk highlights how hyperparameter tuning does not have much effect on alpha, so I am not sure how tuning the tree methods could improve returns.
2. I have read that study you mention, where scaling has no impact on performance for trees. I did not post it, but the reason I implemented scalers was to fit on volume changes and then, using that fit, transform both price and volume changes. It may sound dumb, but I think it could find the correlations between price and volume changes better than putting in raw volume/price changes. Basically, my hypothesis is fit(x1) -> transform(x1) -> transform(x2). Would love to hear your opinion on this.

Also, the paper you mentioned is for i.i.d. events. Perhaps we could use TimeSeriesSplit in scikit-learn? Could we implement a Hidden Markov Model to infer latent states and act on those?

Good input on the RMSE. I ended up just summing the errors to see which model had the smaller sum. However, another metric we should be interested in is accuracy, e.g. a confusion matrix along with sum(RMSE).
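For the accuracy idea, one simple approach (not from the thread's code) is to threshold both the realized and predicted changes at zero and feed the signs to sklearn.metrics; the values below are made-up stand-ins:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Stand-ins for realized and model-predicted price changes
actual    = np.array([ 0.4, -0.2,  1.1, -0.7,  0.3, -0.1])
predicted = np.array([ 0.2, -0.5,  0.9,  0.1,  0.4, -0.3])

y_true = np.sign(actual)     # +1 for up days, -1 for down days
y_pred = np.sign(predicted)

# Rows: true class (-1 then +1); columns: predicted class (-1 then +1)
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
acc = accuracy_score(y_true, y_pred)
```

For a trading signal, directional accuracy like this is arguably more relevant than RMSE, since position sizing here only uses the sign of the prediction.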