Simple Machine Learning Example

I've seen a number of posts here involving machine learning. Although machine learning probably seems complicated at first, it is actually easy to work with. I wanted to create a simple algorithm and post it to introduce the concept to people who aren't familiar with it.

The goal of machine learning is to create an accurate model based on past data, then use that model to predict future events. There are two main types of machine learning used in quantitative finance:

• Regression is used to predict a continuous value, e.g. that a price will rise $0.46.
• Classification is used to predict a category, e.g. simply that a price will rise.

This example uses classification. A model is created from past independent and dependent variables, and that model can then be used to try to predict future changes in the price. There are 10 independent variables, or input variables, in this algorithm: whether the price increased or decreased on each of the 10 bars before a selected bar. The dependent variable, or output variable, is whether the price increased or decreased on that selected bar. Once there are enough data points, a model can be created to try to predict future prices.

You can find more information about machine learning and the module used, sklearn, here. I'd also like to thank Alex for inspiring this and Thomas for helping me. Feel free to copy and use the code, and let me know if you have any questions or ideas!

# Use the previous 10 bars' movements to predict the next movement.
# Use a random forest classifier. More here: http://scikit-learn.org/stable/user_guide.html

from sklearn.ensemble import RandomForestClassifier
from collections import deque
import numpy as np

def initialize(context):
    context.security = sid(698) # Boeing
    context.window_length = 10 # Amount of prior bars to study

    context.classifier = RandomForestClassifier() # Use a random forest classifier

    # deques are lists with a maximum length where old entries are shifted out
    context.recent_prices = deque(maxlen=context.window_length+2) # Stores recent prices
    context.X = deque(maxlen=500) # Independent, or input variables
    context.Y = deque(maxlen=500) # Dependent, or output variable

    context.prediction = 0 # Stores most recent prediction

def handle_data(context, data):
    context.recent_prices.append(data[context.security].price) # Update the recent prices

    if len(context.recent_prices) == context.window_length+2: # If there's enough recent price data
        # Make a list of 1's and 0's, 1 when the price increased from the prior bar
        changes = np.diff(context.recent_prices) > 0

        context.X.append(changes[:-1]) # Add independent variables, the prior changes
        context.Y.append(changes[-1]) # Add dependent variable, the final change

        if len(context.Y) >= 100: # There needs to be enough data points to make a good model
            context.classifier.fit(context.X, context.Y) # Generate the model
            context.prediction = context.classifier.predict(changes[1:]) # Predict

            # If prediction = 1, buy all shares affordable, if 0 sell all shares
            order_target_percent(context.security, context.prediction)

            record(prediction=int(context.prediction))

We have migrated this algorithm to work with a new version of the Quantopian API. The code is different than the original version, but the investment rationale of the algorithm has not changed.
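Stripped of the Quantopian-specific calls (sid, history, order_target_percent), the core pattern above can be sketched with plain scikit-learn on a synthetic price series. The random-walk prices below are made up purely for illustration:

```python
from collections import deque
from sklearn.ensemble import RandomForestClassifier
import numpy as np

np.random.seed(0)
prices = 100 + np.cumsum(np.random.randn(600))  # synthetic random-walk prices

window_length = 10
X, Y = deque(maxlen=500), deque(maxlen=500)

# Slide over the series: the 10 prior up/down moves predict the next move
for t in range(window_length + 1, len(prices)):
    # window_length + 2 prices -> window_length + 1 boolean changes
    changes = np.diff(prices[t - window_length - 1:t + 1]) > 0
    X.append(changes[:-1])  # independent variables: the 10 prior changes
    Y.append(changes[-1])   # dependent variable: the final change

clf = RandomForestClassifier()
clf.fit(np.array(X), np.array(Y))

latest = np.diff(prices[-window_length - 1:]) > 0     # most recent 10 changes
prediction = clf.predict(latest.reshape(1, -1))[0]    # True -> expect an up move
```

Each row of X holds the ten prior up/down moves as booleans; Y holds the move that followed each row.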
44 responses

Hi Gus,

Thanks for the example. I re-ran it with Boeing as the benchmark. Seems like the algo needs tweaking... or am I missing something?

Grant
# Use the previous 10 bars' movements to predict the next movement.
# Use a random forest classifier. More here: http://scikit-learn.org/stable/user_guide.html

from sklearn.ensemble import RandomForestClassifier
from collections import deque
import numpy as np

def initialize(context):
    set_benchmark(sid(698)) # Boeing
    context.security = sid(698) # Boeing
    context.window_length = 10 # Amount of prior bars to study

    context.classifier = RandomForestClassifier() # Use a random forest classifier

    # deques are lists with a maximum length where old entries are shifted out
    context.recent_prices = deque(maxlen=context.window_length+2) # Stores recent prices
    context.X = deque(maxlen=500) # Independent, or input variables
    context.Y = deque(maxlen=500) # Dependent, or output variable

    context.prediction = 0 # Stores most recent prediction

def handle_data(context, data):
    context.recent_prices.append(data[context.security].price) # Update the recent prices

    if len(context.recent_prices) == context.window_length+2: # If there's enough recent price data
        # Make a list of 1's and 0's, 1 when the price increased from the prior bar
        changes = np.diff(context.recent_prices) > 0

        context.X.append(changes[:-1]) # Add independent variables, the prior changes
        context.Y.append(changes[-1]) # Add dependent variable, the final change

        if len(context.Y) >= 100: # There needs to be enough data points to make a good model
            context.classifier.fit(context.X, context.Y) # Generate the model
            context.prediction = context.classifier.predict(changes[1:]) # Predict

            # If prediction = 1, buy all shares affordable, if 0 sell all shares
            order_target_percent(context.security, context.prediction)

            record(prediction=int(context.prediction))

Yes, this is just meant to be an example of how to use machine learning; you can rarely expect real returns from a simple algorithm like this that uses just one stock's price. There are, though, likely comparatively simple methods that can be built with good performance. One idea is to import a number of data streams from http://www.quandl.com/ into a Quantopian algo, then use machine learning to model prices based on them.

Any idea how to get it to generate superior returns? Could multiple securities (i.e. a portfolio) be used? --Grant

Multiple securities could be used; I believe all that has to be done is to add another dimension to the lists. And the idea is to try to think of a stock price that will be correlated to some other value. Here's a good article: http://jspauld.com/post/35126549635/how-i-made-500k-with-machine-learning-and-hft

Here's my attempt at a multi-security version of the algo (a complete hack on my part, since I have no idea what this thingy is doing... just followed Gus' example as best I could). Did I do it correctly? --Grant
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pytz import timezone

trading_freq = 20 # trading frequency, days

def initialize(context):
    context.stocks = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
                       sid(19656),  # XLF Financial SPDR Fund
                       sid(19658),  # XLK Technology SPDR Fund
                       sid(19655),  # XLE Energy SPDR Fund
                       sid(19661),  # XLV Health Care SPDR Fund
                       sid(19657),  # XLI Industrial SPDR Fund
                       sid(19659),  # XLP Consumer Staples SPDR Fund
                       sid(19654),  # XLB Materials SPDR Fund
                       sid(19660) ] # XLU Utilities SPDR Fund

    context.classifier = RandomForestClassifier() # Use a random forest classifier
    context.prediction = np.ones_like(context.stocks)
    set_commission(commission.PerShare(cost=0.013, min_trade_cost=1.3))
    context.day_count = -1

def handle_data(context, data):
    # Trade only once per day
    loc_dt = get_datetime().astimezone(timezone('US/Eastern'))
    if loc_dt.hour == 16 and loc_dt.minute == 0:
        context.day_count += 1
    else:
        return

    # Limit trading frequency
    if context.day_count % trading_freq != 0.0:
        return

    prices = history(401,'1d','price').as_matrix(context.stocks)
    changes = np.diff(prices,axis=0) > 0

    for k in range(len(context.stocks)):
        X = np.split(changes[:,k],20)
        Y = np.split(changes[:,k],20)[-1]
        context.classifier.fit(X, Y) # Generate the model
        context.prediction[k] = context.classifier.predict(Y)

    allocation = context.prediction.astype(float)
    denom = np.sum(allocation)
    if denom != 0.0:
        allocation = allocation/np.sum(allocation)

    for stock,percent in zip(context.stocks,allocation):
        order_target_percent(stock,percent)

Cool :). Looks correct to me. Also a nice implementation in minute mode with the history API.

Hello Gus,

Here's a variant that uses SPY as a reference. Seems to provide a slight advantage over the benchmark.

Grant

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pytz import timezone
from scipy import stats

trading_freq = 5 # trading frequency, days

def initialize(context):
    context.stocks = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
                       sid(19656),  # XLF Financial SPDR Fund
                       sid(19658),  # XLK Technology SPDR Fund
                       sid(19655),  # XLE Energy SPDR Fund
                       sid(19661),  # XLV Health Care SPDR Fund
                       sid(19657),  # XLI Industrial SPDR Fund
                       sid(19659),  # XLP Consumer Staples SPDR Fund
                       sid(19654),  # XLB Materials SPDR Fund
                       sid(19660),  # XLU Utilities SPDR Fund
                       sid(8554) ]  # SPY S&P 500 ETF Trust

    context.classifier = RandomForestClassifier() # Use a random forest classifier
    context.prediction = np.ones_like(context.stocks[0:-1])
    set_commission(commission.PerShare(cost=0.013, min_trade_cost=1.3))
    context.day_count = -1

def handle_data(context, data):
    # Trade only once per day
    loc_dt = get_datetime().astimezone(timezone('US/Eastern'))
    if loc_dt.hour == 16 and loc_dt.minute == 0:
        context.day_count += 1
    else:
        return

    # Limit trading frequency
    if context.day_count % trading_freq != 0.0:
        return

    prices = history(400,'1d','price').as_matrix(context.stocks)
    changes_all = stats.zscore(prices, axis=0, ddof=1)
    changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1],(9,1)).T
    changes = changes > 0

    for k in range(len(context.stocks)-1):
        X = np.split(changes[:,k],20)
        Y = X[-1]
        context.classifier.fit(X, Y) # Generate the model
        context.prediction[k] = context.classifier.predict(Y)

    allocation = context.prediction.astype(float)
    denom = np.sum(allocation)
    if denom != 0.0:
        allocation = allocation/np.sum(allocation)

    for stock,percent in zip(context.stocks[0:-1],allocation):
        order_target_percent(stock,percent)

A longer run of the same algo as immediately above (the attached code is identical). --Grant

Here's a tweaked version. It trades every 20 days, and also does not adjust the portfolio if the algo does not call for a change in the mix of securities:

    # return if allocation unchanged
    if np.array_equal(context.allocation,allocation):
        return
    context.allocation = allocation

Comments & improvements welcome.

Grant
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pytz import timezone
from scipy import stats

trading_freq = 20 # trading frequency, days

def initialize(context):
    context.stocks = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
                       sid(19656),  # XLF Financial SPDR Fund
                       sid(19658),  # XLK Technology SPDR Fund
                       sid(19655),  # XLE Energy SPDR Fund
                       sid(19661),  # XLV Health Care SPDR Fund
                       sid(19657),  # XLI Industrial SPDR Fund
                       sid(19659),  # XLP Consumer Staples SPDR Fund
                       sid(19654),  # XLB Materials SPDR Fund
                       sid(19660),  # XLU Utilities SPDR Fund
                       sid(8554) ]  # SPY SPDR S&P 500 ETF Trust
    # context.stocks = [sid(8554),sid(32268)] # SPY & SH

    context.classifier = RandomForestClassifier() # Use a random forest classifier
    context.prediction = np.ones_like(context.stocks[0:-1])
    set_commission(commission.PerShare(cost=0.013, min_trade_cost=1.3))
    context.day_count = -1
    context.allocation = -1.0*np.ones_like(context.stocks[0:-1])

def handle_data(context, data):
    # Trade only once per day
    loc_dt = get_datetime().astimezone(timezone('US/Eastern'))
    if loc_dt.hour == 16 and loc_dt.minute == 0:
        context.day_count += 1
    else:
        return

    # Limit trading frequency
    if context.day_count % trading_freq != 0.0:
        return

    prices = history(400,'1d','price').as_matrix(context.stocks)
    changes_all = stats.zscore(prices, axis=0, ddof=1)
    changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1],(len(context.stocks)-1,1)).T
    record(changes_med = np.median(changes))
    changes = changes > np.median(changes)

    for k in range(len(context.stocks)-1):
        X = np.split(changes[:,k],20)
        Y = X[-1]
        context.classifier.fit(X, Y) # Generate the model
        context.prediction[k] = context.classifier.predict(Y)

    allocation = context.prediction.astype(float)
    denom = np.sum(allocation)
    if denom != 0.0:
        allocation = allocation/np.sum(allocation)

    # return if allocation unchanged
    if np.array_equal(context.allocation,allocation):
        return

    context.allocation = allocation

    for stock,percent in zip(context.stocks[0:-1],allocation):
        order_target_percent(stock,percent)

Improved return by using:

    changes = changes > 0

versus:

    changes = changes > np.median(changes)

(The attached algo is otherwise identical to the one above.)

So you are using z-score instead of just the flat changes in the price, or something along those lines? Can you give some insight into this, and any idea why it seems to work better?

Hello Gus,

Yes, this line of code converts the prices into z-scores:

    changes_all = stats.zscore(prices, axis=0, ddof=1)

The second detail is that the changes in the sector funds are relative to the SPY benchmark (if I've coded it properly):

    changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1],(len(context.stocks)-1,1)).T

So the variable 'changes' is the z-score difference between each sector fund and the underlying benchmark, i.e. changes = z_sector - z_SPY. If changes > 0, it indicates that the sector fund was statistically higher than the benchmark on a given day. Z-scoring thus allows for direct normalization of the sector funds against their collective benchmark.

My sense is that this normalization approach works because the sector funds are basically SPY chopped up into various categories (and each is still highly correlated to the benchmark), so the normalization resolves which sectors are favorable relative to the benchmark. But the Random Forest voodoo is a mystery to me, so this interpretation could be off-base.

From a practical standpoint, I'm not sure that I've captured all of the costs with:

    set_commission(commission.PerShare(cost=0.013, min_trade_cost=1.3))

Does this accurately account for all of the Interactive Brokers (IB) costs? And last I heard, Quantopian might charge ~$100 per month per algorithm, so that would need to be rolled in. And short-term capital gains taxes (assuming a non-IRA account)?

Note, also, that orders are submitted at the daily close. My understanding is that in live trading, the orders would be cancelled, correct? This is probably not fundamental, but I thought I'd point it out in case someone tries the algo live with IB.

Grant
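Grant's z-score normalization step can be checked in isolation. Here is a small sketch with made-up prices: two "sector" columns plus a "benchmark" column in the last position. The shapes mirror the algo; the numbers are random placeholders:

```python
import numpy as np
from scipy import stats

np.random.seed(1)
# 400 daily prices for two sector funds and the benchmark (last column)
prices = np.cumsum(np.random.randn(400, 3), axis=0) + 100

# Normalize each column to zero mean and unit variance
changes_all = stats.zscore(prices, axis=0, ddof=1)

# Sector z-score minus benchmark z-score, per day
changes = changes_all[:, 0:-1] - np.tile(changes_all[:, -1], (2, 1)).T

# True where the sector was statistically above the benchmark that day
relative_strength = changes > 0
```

The np.tile(...).T step just broadcasts the benchmark column alongside each sector column so the subtraction lines up day by day.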

Ah, that's cool, makes sense. Random forest is largely a mystery to me too, though I know ML is pretty robust. As for the IB commissions, I'm not too familiar with them; commission varies quite a bit and could be more or less depending on a number of factors. The default Quantopian model is meant to be a good approximation, but it can be adjusted. I'm sure I can get more details on that if you want them.

Yes, in live trading those orders would be cancelled. However, you could trade a few minutes before the close instead, for example by running a function 15 minutes before the daily close (this is for minute mode only):

from zipline.utils.tradingcalendar import get_early_closes
import pandas as pd

def handle_data(context, data):
    exchange_time = pd.Timestamp(get_datetime()).tz_convert('US/Eastern')

    if exchange_time.date() in context.early_closes and exchange_time.hour == 12 and exchange_time.minute == 45:
        close_day(context)
    elif exchange_time.date() not in context.early_closes and exchange_time.hour == 15 and exchange_time.minute == 45:
        close_day(context)


Grant,
The commission model you are using there should be a conservative estimate of actual commissions from IB. I just pulled up my trading reports from IB and I'm being charged $1.00 for trades that do not meet the minimum shares requirement. It's a couple cents extra for short sales, but I think that's actually a tax.

David

Thanks Gus, David,

Any idea how to improve the algo? One thought is that the model just predicts which securities to hold, equally weighted in the portfolio. Could it be modified to predict the optimum unequal weighting? One risk is that this would generate excessive trading, due to more frequent portfolio adjustment.

Also, perhaps someone could have a closer look at the implementation and advise if the machine learning approach could be improved (e.g. settings, different data set pre-processing, etc.).

This might be a nice example to run on zipline, since daily data can be used.

From a general standpoint, I'm curious if this is the kind of trading style that Quantopian is aiming to support under their "quantitative investing" offering (e.g. monthly re-balancing via a handful of trades)? Or would the returns get wiped out by the Quantopian algo fee and other costs?

Grant
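One hypothetical way to sketch the unequal-weighting idea (not something attempted in the thread) is to weight positions by the classifier's class probabilities via scikit-learn's predict_proba, rather than by the hard 0/1 predictions. The features and labels below are random placeholders standing in for the up/down change data:

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

np.random.seed(2)
n_stocks = 9
X = np.random.randint(0, 2, size=(100, 20))   # placeholder binary feature rows
y = np.random.randint(0, 2, size=100)         # placeholder up/down labels

clf = RandomForestClassifier(random_state=0).fit(X, y)

# One fresh feature row per stock; column 1 of predict_proba is the
# estimated probability of the "up" (1) class
rows = np.random.randint(0, 2, size=(n_stocks, 20))
up_prob = clf.predict_proba(rows)[:, 1]

# Normalize probabilities into portfolio weights instead of equal weighting
weights = up_prob / up_prob.sum() if up_prob.sum() > 0 else np.zeros(n_stocks)
```

This would tilt the portfolio toward the stocks the forest is most confident about, though, as Grant notes, more frequent weight changes could generate excessive trading.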

I expect the algo could add the VIX index as an independent variable, or try an SVM model (combining different machine learning methods) to improve the model. I tried but failed, because my limited Python skills make it hard to get familiar with the algo for now.

I'm not exactly sure how better returns could be made, maybe give that a try. I haven't looked too in-depth at machine learning methods. We aren't quite ready to say anything certain about that stuff yet, but I can tell you that our goal is definitely not to wipe out your returns with an algo fee!

Hi Gus,

One angle would be to find an optimum set of ETFs. For example, I see that there are lots to pick from the list on http://www.forbes.com/sites/baldwin/2014/06/04/best-etfs-qqq-and-the-sector-funds/. The question is how to do the picking? Any ideas?

Grant

Great link. Following is the parent article which has a few more leads http://www.forbes.com/sites/baldwin/2014/06/04/best-etfs-for-investors/

Here's a rough update of my efforts to explore this algo. It trades every day, with zero commission cost (for development). Also, I switched to accumulating a window of minute bars. The securities are:

# SPY Top 10 Holdings, as of Apr 29, 2014 (17.67% of Total Assets)
# http://finance.yahoo.com/q/hl?s=SPY+Holdings
context.stocks = [ sid(24),     # AAPL
                   sid(8347),   # XOM
                   sid(5061),   # MSFT
                   sid(4151),   # JNJ
                   sid(3149),   # GE
                   sid(23112),  # CVX
                   sid(8151),   # WFC
                   sid(11100),  # BRK_B
                   sid(5938),   # PG
                   sid(25006),  # JPM
                   sid(8554) ]  # SPY


SPY serves as a normalizing benchmark only; no position in SPY is taken.

An "attaboy" to the first person to explain clearly (without web links, references, etc.) what the Random Forest is doing.

Grant

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pytz import timezone
from scipy import stats
import pandas as pd
import math

trading_freq = 1 # trading frequency, days (per the note above, this version trades every day)
pct_cash = 0.0
window = 30
window = window**2

def initialize(context):
    # context.stocks = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
    #                    sid(19656),  # XLF Financial SPDR Fund
    #                    sid(19658),  # XLK Technology SPDR Fund
    #                    sid(19655),  # XLE Energy SPDR Fund
    #                    sid(19661),  # XLV Health Care SPDR Fund
    #                    sid(19657),  # XLI Industrial SPDR Fund
    #                    sid(19659),  # XLP Consumer Staples SPDR Fund
    #                    sid(19654),  # XLB Materials SPDR Fund
    #                    sid(19660),  # XLU Utilities SPDR Fund
    #                    sid(8554) ]  # SPY SPDR S&P 500 ETF Trust

    # SPY Top 10 Holdings, as of Apr 29, 2014 (17.67% of Total Assets)
    # http://finance.yahoo.com/q/hl?s=SPY+Holdings
    context.stocks = [ sid(24),     # AAPL
                       sid(8347),   # XOM
                       sid(5061),   # MSFT
                       sid(4151),   # JNJ
                       sid(3149),   # GE
                       sid(23112),  # CVX
                       sid(8151),   # WFC
                       sid(11100),  # BRK_B
                       sid(5938),   # PG
                       sid(25006),  # JPM
                       sid(8554) ]  # SPY

    context.classifier = RandomForestClassifier() # Use a random forest classifier

    context.prediction = np.ones_like(context.stocks[0:-1])

    set_commission(commission.PerShare(cost=0.0))

    context.day_count = -1

    context.allocation = -1.0*np.ones_like(context.stocks[0:-1])

    # set_long_only()

    context.prices = pd.DataFrame()

def handle_data(context, data):

    price = history(1,'1d','price')

    context.prices = context.prices.append(price)
    context.prices = context.prices.tail(window)

    if len(context.prices.index) < window:
        return

    # for stock in context.stocks[:-1]:
    #     shares = context.portfolio.positions[stock].amount
    #     if shares < 0:
    #         print stock.sid
    #         print stock

    # record(cash = context.portfolio.cash)

    # Trade only once per day
    loc_dt = get_datetime().astimezone(timezone('US/Eastern'))
    if loc_dt.hour == 16 and loc_dt.minute == 0:
        context.day_count += 1
    else:
        return

    if context.day_count % trading_freq != 0.0:
        return

    # prices = history(400,'1d','price')
    prices = context.prices.as_matrix(context.stocks)

    changes_all = stats.zscore(prices, axis=0, ddof=1)
    changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1],(len(context.stocks)-1,1)).T
    changes = changes > 0

    for k in range(len(context.stocks)-1):

        X = np.split(changes[:,k],math.sqrt(window))
        Y = X[-1]

        context.classifier.fit(X, Y) # Generate the model

        context.prediction[k] = context.classifier.predict(Y)

    allocation = context.prediction.astype(float)
    denom = np.sum(allocation)
    if denom != 0.0:
        allocation = (1.0-pct_cash)*allocation/np.sum(allocation)

    # return if allocation unchanged
    if np.array_equal(context.allocation,allocation):
        return

    context.allocation = allocation

    record(num_secs = np.count_nonzero(allocation))

    for stock,percent in zip(context.stocks[0:-1],allocation):
        order_target_percent(stock,percent)

Grant, I found this explanation of random forests pretty informative.

http://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/

Can someone briefly explain to me what's happening in lines 94-99?

X is a set of 30 periods (each of which contains 30 values). Let's assume X covers the days in November and the elements in X are 30 evenly sampled "changes", for example at 20-minute intervals, so that X[0][0] is 9 AM, X[0][1] is 9:20, etc.
Y is the last of those periods (i.e. November 30th).
You then fit the 30 periods in X to the last period, Y. This is the same as fitting November 1st to 9 AM, November 2nd to 9:20 AM, November 3rd to 9:40 AM, etc.
You then use Y to predict the next window.

This seems a bit odd to me, so I think there must be something fundamental that I either don't understand about history or classifier..?

Hi Chris,

I'll have a look tomorrow or this weekend. Frankly, I never quite figured this all out in detail...just hacked my way through it.

Grant

Hello Chris,

I took a look at your concern, and I just don't have the time to dig into it. Perhaps someone else can shed some light on this?

Grant

No problem, thanks for getting back to me. I figured it might be a simple misunderstanding.

Chris F,

I also had a small issue understanding why the splits were used the way they are. To understand it, I quickly broke the algorithm down; you can see a notebook for this here:

http://nbviewer.ipython.org/gist/anonymous/fee5be4c6b59a62b87b2

The notebook shows what I see as the training data and target labels/classes. Please see the notebook for a full description, here's a snippet:

Changes in Feb-March 2012 are being used to predict the change on 26
August 2013. Then March-April 2012 are being used to predict 27th
August 2013. April-May 2012 used to predict 28th August 2013.

This gap slowly closes with changes for July-August 2013 to predict
20th September 2013. Finally, August-September 2013 to predict 23rd
September 2013. A crucial note: the last training set actually
includes the target value.

The labels are then used to predict the next day.

From a machine learning perspective I'm also unsure how this training data makes sense.

New to this. Does anyone have an example of machine learning where the length of an Talib MA is changed for optimum results based on past data?

Nathan, so basically figure out what length of MA would yield the best results for historical data, then use that MA length for the current time frame? I'm not sure that's a job for machine learning so much as just selecting the strategy with the highest returns, if I'm understanding you correctly.

I have an example where random strategies are tested and the best one is used that you may find interesting: https://www.quantopian.com/posts/evolutionary-strategy

Gus

Ah yes, thanks Gus!

Hi,

Thank you for sharing this awesome machine learning strategy. I have several questions here.

1) Are there any particular advantages to using the change in stock price (1, 0, 0, 1) rather than the original price itself? Please see the attached backtest that I did, using your code with a minor modification: the original price series is used as the input parameters.

2) Shouldn't the prediction be either 1 or 0? Why do we see other values such as 0.4 and 0.6 in the graph?

3) Currently, the code retrains each time we get new price data, which is cool. But I want to understand the assumptions behind this approach. For example, do we believe the pattern in stock prices changes each time we get new data?

# Use the previous 10 bars' movements to predict the next movement.

# Use a random forest classifier. More here: http://scikit-learn.org/stable/user_guide.html
from sklearn.ensemble import RandomForestClassifier
from collections import deque
import numpy as np

def initialize(context):
    context.security = sid(698)  # Boeing
    context.window_length = 10   # Amount of prior bars to study

    context.classifier = RandomForestClassifier()  # Use a random forest classifier

    # deques are lists with a maximum length where old entries are shifted out
    context.recent_prices = deque(maxlen=context.window_length+2)  # Stores recent prices
    context.X = deque(maxlen=500)  # Independent, or input, variables
    context.Y = deque(maxlen=500)  # Dependent, or output, variable

    context.prediction = 0  # Stores most recent prediction

def handle_data(context, data):
    context.recent_prices.append(data[context.security].price)  # Update the recent prices
    if len(context.recent_prices) == context.window_length+2:  # If there's enough recent price data
        price_list = list(context.recent_prices)  # Convert the deque to a list
        # Make a list of 1's and 0's, 1 when the price increased from the prior bar
        changes = np.diff(context.recent_prices) > 0

        # context.X.append(changes[:-1])  # Add independent variables, the prior changes
        context.X.append(price_list[1:-1])  # Add independent variables, the prior prices
        context.Y.append(changes[-1])  # Add dependent variable, the final change

        if len(context.Y) >= 100:  # There need to be enough data points to make a good model
            context.classifier.fit(context.X, context.Y)  # Generate the model

            # Predict from the latest window; since the model is now trained on raw
            # prices, the prediction inputs should be prices too (the original line
            # passed changes[1:], which doesn't match the training features)
            context.prediction = context.classifier.predict(price_list[2:])

            # If prediction = 1, buy all shares affordable, if 0 sell all shares
            order_target_percent(context.security, context.prediction)

    record(prediction = int(context.prediction))

Thanks Nyan!

1) There is no real advantage; I just thought it would be easier to understand, because then it would be a binary long-or-short signal whose input was also binary.

2) The graph shows other values because it's smoothed: since there are so many data points, a few predictions are averaged into a single data point shown on the graph.

3) I think two key assumptions to make are that a) we will never have a perfect model and b) the model is constantly changing as the world changes. However, the rate of change of the model is negligible when compared to the inaccuracies from our imperfect model. So the goal is basically to improve the model as new data comes in. That's the way I've been thinking about it, anyway.

Hope that helps! Let me know if you have any more questions.

Gus

Hi Gus,

Thank you for your answers. They are helpful. I have another question. My understanding is that it is always possible to overfit when using these machine learning algorithms, and you have to use cross-validation and/or regularization to prevent overfitting.

Since our model here is trained using the last 12 bars, are we doing anything to address the overfitting issue? Obviously, I am not an expert on machine learning, and I don't know much about this random forest algorithm other than what I learned from a quick Google search.

Nyan

That's a good point. I'm not doing anything here to account for overfitting; in fact, I'm not doing much of anything besides showing the basic features. In order to have a realistic algorithm, some alternative signals would probably need to be used; I'd say that's the first step. But overfitting doesn't necessarily need explicit correction, just smart selection of independent and dependent variables.

Gus

@Grant,

I'm getting what looks like a square root error in your version that trades once per day on a basket of stocks from SPY.

"AttributeError: sqrt There was a runtime error on line 88."

Quantopian suggested that the square root came from calculating the z-score, which I understand and seems reasonable, but I'm too ignorant of the packages to know where the z-score function is coming from. Perhaps we need to use another z-score routine or implement our own.

Btw, a random forest is an ensemble of decision trees. Each tree makes decisions like "If yesterday was 1 and the day before was 0, then today is a 1" (obviously an overly simple toy decision). Each individual tree is not generally great, perhaps it predicts well in only one part of the high dimensional space. But together, the tree errors cancel so that the forest aggregate prediction is good and does not overfit. I'm guessing a properly regularized nonlinear SVM would perform quite similarly.
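For anyone wanting to poke at this outside a backtest, here's a hedged sketch (toy data, default sklearn settings) showing that the forest is literally a bag of trees whose votes get aggregated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Toy "prior 10 bars up/down" features with a planted rule: the label
# simply copies the most recent bar's move
X = rng.randint(0, 2, size=(500, 10))
y = X[:, -1]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each element of estimators_ is an individual DecisionTreeClassifier
print(len(forest.estimators_))  # 100

# predict() aggregates the per-tree votes into a single class per sample
X_new = rng.randint(0, 2, size=(20, 10))
print(forest.predict(X_new))
```

Since each tree only sees a bootstrap sample and a random subset of features at each split, no single tree is great, but the aggregate vote is robust, which is exactly the error-cancelling behavior described above.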

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pytz import timezone
from scipy import stats
import pandas as pd
import math

pct_cash = 0.0
trading_freq = 1  # Days between trades (undefined in the original post; 1 assumed here)
window = 30
window = window**2

def initialize(context):
    # context.stocks = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
    #                    sid(19656),  # XLF Financial SPDR Fund
    #                    sid(19658),  # XLK Technology SPDR Fund
    #                    sid(19655),  # XLE Energy SPDR Fund
    #                    sid(19661),  # XLV Health Care SPDR Fund
    #                    sid(19657),  # XLI Industrial SPDR Fund
    #                    sid(19659),  # XLP Consumer Staples SPDR Fund
    #                    sid(19654),  # XLB Materials SPDR Fund
    #                    sid(19660),  # XLU Utilities SPDR Fund
    #                    sid(8554) ]  # SPY SPDR S&P 500 ETF Trust

    # SPY Top 10 Holdings, as of Apr 29, 2014 (17.67% of Total Assets)
    # http://finance.yahoo.com/q/hl?s=SPY+Holdings

    context.stocks = [ sid(24),     # AAPL
                       sid(8347),   # XOM
                       sid(5061),   # MSFT
                       sid(4151),   # JNJ
                       sid(3149),   # GE
                       sid(23112),  # CVX
                       sid(8151),   # WFC
                       sid(11100),  # BRK_B
                       sid(5938),   # PG
                       sid(25006),  # JPM
                       sid(8554) ]  # SPY

    context.classifier = RandomForestClassifier()  # Use a random forest classifier

    context.prediction = np.ones_like(context.stocks[0:-1])

    set_commission(commission.PerShare(cost=0.0))

    context.day_count = -1

    context.allocation = -1.0*np.ones_like(context.stocks[0:-1])

    # set_long_only()

    context.prices = pd.DataFrame()

def handle_data(context, data):

    price = history(1,'1d','price')

    context.prices = context.prices.append(price)
    context.prices = context.prices.tail(window)

    if len(context.prices.index) < window:
        return

    # for stock in context.stocks[:-1]:
    #     shares = context.portfolio.positions[stock].amount
    #     if shares < 0:
    #         print stock.sid
    #         print stock

    # record(cash = context.portfolio.cash)

    # Trade only once per day
    loc_dt = get_datetime().astimezone(timezone('US/Eastern'))
    if loc_dt.hour == 16 and loc_dt.minute == 0:
        context.day_count += 1
    else:
        return

    if context.day_count % trading_freq != 0:
        return

    # prices = history(400,'1d','price')
    prices = context.prices.as_matrix(context.stocks)

    # z-score each column, then measure each stock relative to SPY (last column)
    changes_all = stats.zscore(prices, axis=0, ddof=1)
    changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1],(len(context.stocks)-1,1)).T
    changes = changes > 0

    for k in range(len(context.stocks)-1):
        # Split the window into sqrt(window) equal chunks; np.split needs an int
        X = np.split(changes[:,k], int(math.sqrt(window)))
        Y = X[-1]

        context.classifier.fit(X, Y)  # Generate the model

        context.prediction[k] = context.classifier.predict(Y)

    allocation = context.prediction.astype(float)
    denom = np.sum(allocation)
    if denom != 0.0:
        allocation = (1.0-pct_cash)*allocation/denom

    # return if allocation unchanged
    if np.array_equal(context.allocation, allocation):
        return

    context.allocation = allocation

    record(num_secs = np.count_nonzero(allocation))

    for stock, percent in zip(context.stocks[0:-1], allocation):
        order_target_percent(stock, percent)

Sorry to randomly butt in. I was also working on a machine learning algorithm, more specifically using several layers of neural networks with a conditional probability and return matrix as a decision maker. I was going to import data from quandl.com, and since my algorithm is neural-network based, I naturally want as much raw data as possible. With this in mind, and the fact that I want to run this every minute, I have to ask whether you have a limit on the amount of processing power or code in one algorithm.
Thanks!

Hey @Grant, I'm also interested in using the z-score to find arbitrage opportunities. My only programming experience is in VBA, and it shows in my Python script. @Will Chen -- I don't know how to leverage the stats libraries, so I'm calculating the z-score manually [(price - average price) / stdev]. Would you guys mind taking a look at my script? I'm trying to do something similar without the random forest function (which I wish I knew how to use). @Grant, like you, I'm looking at the difference between the z-score of SPY and the z-score of each of the 9 sector ETFs. If sector_z > spy_z, I go long that particular sector as a momentum trade. The algo seems to work decently well: I can avoid large drawdowns and still make excess returns above the index.
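For what it's worth, the manual formula and the library routine agree; a quick sketch with made-up prices:

```python
import numpy as np
from scipy import stats

prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1])

# Manual z-score, VBA-style: (price - average price) / stdev
manual = (prices - prices.mean()) / prices.std(ddof=1)

# Same thing via scipy; ddof=1 selects the sample standard deviation
library = stats.zscore(prices, ddof=1)

print(np.allclose(manual, library))  # True
```

So calculating it by hand is perfectly fine; stats.zscore just saves a couple of lines (and axis handling when you have a matrix of prices).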

This community has been a great resource. Somehow a guy who programs in Excel can write an algo in Python -- that's amazing!

'''
This algorithm defines a long-only equal weight portfolio and rebalances it at a user-specified frequency.
NOTE: This algo is intended to run in minute-mode simulation and is compatible with LIVE TRADING.
'''
# Import the libraries we will use here
from pytz import timezone

trading_freq = 1  # Days between rebalances (undefined in the original post; 1 assumed here)

def initialize(context):
    # This initialize function sets any data or variables that you'll use in your
    # algorithm. You'll also want to define any parameters or values you're going to use.

    # In our example, we're looking at 9 sector ETFs.
    context.secs = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
                     sid(19656),  # XLF Financial SPDR Fund
                     sid(19658),  # XLK Technology SPDR Fund
                     sid(19655),  # XLE Energy SPDR Fund
                     sid(19661),  # XLV Health Care SPDR Fund
                     sid(19657),  # XLI Industrial SPDR Fund
                     sid(19659),  # XLP Consumer Staples SPDR Fund
                     sid(19654),  # XLB Materials SPDR Fund
                     sid(19660) ] # XLU Utilities SPDR Fund

    # storing the security objects
    context.mavg = {}
    context.stddev = {}
    context.price = {}

    # default commissions and slippage
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.25, price_impact=0.1))

    context.day_count = -1

def handle_data(context, data):

    # Trade only once per day
    loc_dt = get_datetime().astimezone(timezone('US/Eastern'))
    if loc_dt.hour == 16 and loc_dt.minute == 0:
        context.day_count += 1
    else:
        return

    if context.day_count % trading_freq != 0:
        return

    # info on SPY
    spy_mean = data[sid(8554)].mavg(120)
    spy_sigma = data[sid(8554)].stddev(120)
    spy_price = data[sid(8554)].price
    spy_z = (spy_price - spy_mean) / spy_sigma

    # run through the sector ETFs and perform the same calculation
    for stock in context.secs:
        mean = data[stock].mavg(120)
        sigma = data[stock].stddev(120)
        current_price = data[stock].price
        sect_z = (current_price - mean) / sigma

        # underweight sectors trailing SPY, overweight sectors leading it
        if sect_z < spy_z:
            order_target_percent(stock, .11)
        elif sect_z > spy_z:
            order_target_percent(stock, .13)

    record(lev = context.account.leverage)

Hello Jamie,

You might start by having a look at the algo I posted on https://www.quantopian.com/posts/working-with-history-dataframes (June 21, 2014). Just clone it and see if you can understand everything. If you have questions, I recommend posting them to https://www.quantopian.com/posts/working-with-history-dataframes (or post an improved example!).

Grant

I've been playing around with random forests a bit as well. I'm wondering if anyone knows whether there is a way to construct the classifier with multiple inputs, so instead of using fit(x, y) you would do fit(x1, x2, y). Is there a way to do this? Originally I was thinking of doing a separate fit for each input and averaging the resulting predictions, but I think it would be ideal to include them all in one function to capture the interplay between the inputs.
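There's no fit(x1, x2, y) signature, but sklearn already handles this: fit takes a 2-D feature matrix with one row per sample and one column per input, so stacking the inputs as columns gives you the single-model fit that captures interplay between inputs. A sketch with made-up features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
x1 = rng.randn(200)  # e.g. a recent-return feature
x2 = rng.randn(200)  # e.g. a volume-change feature
y = (x1 + x2 > 0).astype(int)

# One column per input: trees can now split on x1 and x2 in the same model
X = np.column_stack([x1, x2])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(X.shape)  # (200, 2)
print(clf.predict([[0.5, 0.5]]))  # one new sample supplying both inputs
```

Averaging separate single-input fits would lose exactly the interactions the trees could otherwise exploit, so the stacked matrix is the way to go.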

The machine learning algorithm seems to have under-performed recently.


New to this, so forgive me if I am wrong.
It seems to me that all the data are used for training, and the same data are used for testing too.
Shouldn't we separate training and testing, at least?

I've got a very stupid noob question :-). How can I add my own features to this algorithm?

Start by cloning the algorithm. Then you can review the code, maybe run a few backtests over different time periods and get comfortable with what it's doing. Then modify or add to the code as you see fit.

To elaborate on my response above: build your algorithm with a training data set, but test it using out-of-sample data.

Good luck.
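With a modern sklearn, a minimal sketch of that out-of-sample discipline might look like this (toy random data, so the model has nothing real to learn; shuffle=False keeps a time-series split chronological):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(300, 10))
y = rng.randint(0, 2, size=300)

# Train on the first 70% of the data, test on the final 30%; shuffle=False
# prevents future bars from leaking into the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # out-of-sample accuracy (near 0.5 here, since the labels are random)
```

If the out-of-sample score is no better than chance, as it should be on this random data, the in-sample fit was just memorization.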

I'll try to simplify random forests as much as possible. Anyone should feel free to correct me:

Usually you have a lot of variables, or at least a fair amount, to classify on, and, similar to the overfitting lecture on Quantopian, if you build one tree with all the variables, you'll probably overfit. There's also some thought that goes into tree splits: usually a variable gets a score based on how well it can split the different distributions of classes (in our case 1's and 0's, or positive and negative days), and so on.

So RF says: let's back up and build a whole bunch of mini trees, where each tree takes a few variables or so (you decide) and makes its own prediction. Instead of one disgustingly long tree that has overfit the data, you have hundreds of mini trees making predictions. This acts as a voting mechanism: the outputs from all of the mini trees are aggregated and averaged to form a response.

For those of you aware of the bias-variance tradeoff: a tree with all variables has high variance, while a tree with fewer variables has more bias. Introduce many trees with randomly sampled variables, and RF does a decent job of finding a middle ground in a lot of scenarios.

Based on this explanation:
In its essence, the algo you guys posted more or less just trades on autocorrelation (is a day in the past predictive of today's returns?).

The only problem is that this is worse than trading solely on autocorrelated variables because, if I understand this correctly, even if you did have some statistically significant variable (say t-5) with a high autocorrelation to your current t, you'd noise up the signal with all the other t's you introduced.
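To make that concrete, one way to check whether a given lag carries any signal before handing it to the forest is to measure the lagged autocorrelation directly (lag_autocorr is a made-up helper; the returns here are pure noise, so every lag should come out near zero):

```python
import numpy as np

def lag_autocorr(returns, lag):
    """Correlation between the series and itself shifted back by `lag` bars."""
    return np.corrcoef(returns[:-lag], returns[lag:])[0, 1]

rng = np.random.RandomState(0)
returns = rng.randn(1000)  # i.i.d. noise: no lag is predictive

for lag in (1, 5, 10):
    print(lag, round(lag_autocorr(returns, lag), 3))
```

On real return series you'd look for lags whose correlation is reliably distinguishable from this noise floor before including them as features.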

Hopefully this helps. @Gus and @Grant