Simple Machine Learning Example

EDIT: More recent version here.

I've seen a number of posts here involving machine learning. Although machine learning probably seems complicated at first, it is actually easy to work with. I wanted to create a simple algorithm and post it to introduce the concept to people who aren't familiar with it.

The goal of machine learning is to create an accurate model based on past data, then use that model to predict future events. Two types of machine learning are mainly used in quantitative finance:

  • Regression is used to predict a continuous value, such as predicting that a price will rise $0.46.
  • Classification is used to predict a category, such as simply predicting that a price will rise.

This example uses classification.

A model is created from past independent and dependent variables, and that model can then be used to try to predict future changes in the price. This algorithm has 10 independent variables, or input variables: whether the price increased or decreased on each of the 10 bars before a selected bar. The dependent variable, or output variable, is whether the price increased or decreased on the selected bar itself. Once there are enough data points, a model can be created to try to predict future prices.
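
To make this concrete, here is a minimal sketch of the idea in sklearn, using made-up prices. It is a simplification for illustration, not the exact code in the attached backtest (which also handles the trading logic):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up price series; in the algorithm this comes from historical bars
prices = np.cumsum(np.random.randn(200)) + 100

# 1 if the price rose on a bar, 0 if it fell
changes = (np.diff(prices) > 0).astype(int)

# Independent variables: the 10 changes before each selected bar.
# Dependent variable: the change on the selected bar itself.
window = 10
X = [changes[i:i + window] for i in range(len(changes) - window)]
y = changes[window:]

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

# Predict the next bar from the latest 10 changes
print(bool(model.predict([changes[-window:]])[0]))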

You can find more information about machine learning and the module used, sklearn, here.

I'd also like to thank Alex for inspiring this and Thomas for helping me. Feel free to copy and use the code, and let me know if you have any questions or ideas!

[Attached backtest (ID: 537cff15798348070dd3a3db, cloned 4138 times). Note from Quantopian: this algorithm was migrated to work with a new version of the Quantopian API; the code differs from the original version, but the investment rationale is unchanged.]

44 responses

Hi Gus,

Thanks for the example. I re-ran it with Boeing as the benchmark. Seems like the algo needs tweaking... or am I missing something?

Grant

[Attached backtest (ID: 537d479734fcfc07130baf11, cloned 1673 times)]

Yes, this is just meant to be an example of how to use machine learning. You can rarely expect real returns from a simple algorithm like this that uses just one stock's price, although there are likely comparatively simple methods that can be built with good performance. One idea is to import a number of data streams from http://www.quandl.com/ into a Quantopian algo, then use machine learning to model prices based on them.

Any idea how to get it to generate superior returns? Could multiple securities (i.e. a portfolio) be used? --Grant

Multiple securities could be used; I believe all that has to be done is to add another dimension to the lists. The idea is to find a stock price that is correlated with some other value. Here's a good article: http://jspauld.com/post/35126549635/how-i-made-500k-with-machine-learning-and-hft

Here's my attempt at a multi-security version of the algo (a complete hack on my part, since I have no idea what this thingy is doing...just followed Gus' example as best I could). Did I do it correctly?

--Grant

[Attached backtest (ID: 5380a7afdeab3d0720e00b7f, cloned 1673 times)]

Cool :). Looks correct to me. Also nice implementation in minute mode with the history API.

Hello Gus,

Here's a variant that uses SPY as a reference. Seems to provide a slight advantage over the benchmark.

Grant

[Attached backtest (ID: 53852d43d4851d071f6f9b82, cloned 1673 times)]

A longer run of the same algo as immediately above. --Grant

[Attached backtest (ID: 5385af3697539507124127c7, cloned 1673 times)]

Here's a tweaked version. It trades every 20 days, and also does not adjust the portfolio if the algo does not call for a change in the mix of securities:

# return if allocation unchanged
if np.array_equal(context.allocation, allocation):
    return
context.allocation = allocation

Comments & improvements welcome.

Grant

[Attached backtest (ID: 5389bc3e814a25072810d0ca, cloned 1673 times)]

Improved return by using:

changes = changes > 0  

versus:

changes = changes > np.median(changes)  
[Attached backtest (ID: 5389d348e4c7bf06f867c0b7, cloned 1673 times)]

So you are using the z-score instead of just the flat changes in the price, or something along those lines? Can you give some insight into this, and any idea why it seems to work better?

Hello Gus,

Yes, this line of code converts the prices into z-scores:

changes_all = stats.zscore(prices, axis=0, ddof=1)  

The second detail is that the changes in the sector funds are relative to the SPY benchmark (if I've coded it properly):

changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1],(len(context.stocks)-1,1)).T  

So, the variable 'changes' is the z-score difference between each sector fund and the underlying benchmark, i.e. changes = z_sector - z_SPY. If changes > 0, it indicates that the sector fund was statistically higher than the benchmark on a given day. Thus z-scoring allows for the direct normalization of the sector funds to their collective benchmark.
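
Here is a small self-contained illustration of what those two lines compute, with made-up prices. Note that broadcasting against the last column gives the same result as the np.tile construction:

import numpy as np
from scipy import stats

# Made-up prices: 5 days (rows) x 4 securities (columns);
# the last column plays the role of the SPY benchmark
prices = np.array([[10.0, 20.0, 30.0, 100.0],
                   [10.5, 19.8, 30.3, 101.0],
                   [10.4, 20.1, 30.9, 100.5],
                   [10.8, 20.4, 31.0, 102.0],
                   [11.0, 20.2, 31.5, 102.5]])

# Z-score each security's series independently (down each column)
changes_all = stats.zscore(prices, axis=0, ddof=1)

# changes = z_sector - z_SPY for every day
changes = changes_all[:, :-1] - changes_all[:, -1:]

# True where a sector was statistically above the benchmark that day
print(changes > 0)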

My sense is that this normalization approach works because the sector funds are basically SPY chopped up into various categories (and each is still highly correlated to the benchmark). So, the normalization resolves which sectors are favorable over the benchmark. But the Random Forest voodoo is a mystery to me, so this interpretation could be off-base.

From a practical standpoint, I'm not sure that I've captured all of the costs with:

set_commission(commission.PerShare(cost=0.013, min_trade_cost=1.3))  

Does this accurately account for all of the Interactive Brokers (IB) costs? And last I heard, Quantopian might charge ~$100 per month per algorithm, so that would need to be rolled in. And what about short-term capital gains taxes (assuming a non-IRA account)?

Note, also, that orders are submitted at the daily close. My understanding is that in live trading, the orders would be cancelled, correct? This is probably not fundamental, but I thought I'd point it out in case someone tries the algo live with IB.

Grant

Ah, that's cool, makes sense. Random forest is a large mystery to me too, but I know ML is pretty robust. As for the IB commissions, I'm not too familiar with them; commission is quite variable and could be more or less depending on a number of factors. The default Quantopian model is meant to be a good approximation, but it can be adjusted. I'm sure I can get more details on that if you want them.

Yes, in live trading those orders would be cancelled. However, you could use some code to trade a few minutes before the close; for example, run a function 15 minutes before the daily close (this is for minute mode only):

from zipline.utils.tradingcalendar import get_early_closes  
import pandas as pd

# context.early_closes would be populated in initialize(), e.g. using
# get_early_closes(), and close_day() is whatever function submits the
# end-of-day orders

def handle_data(context, data):  
    exchange_time = pd.Timestamp(get_datetime()).tz_convert('US/Eastern')

    # 12:45 pm on early-close days (1:00 pm close), 3:45 pm otherwise (4:00 pm close)
    if exchange_time.date() in context.early_closes and exchange_time.hour == 12 and exchange_time.minute == 45:  
        close_day(context)  
    elif exchange_time.date() not in context.early_closes and exchange_time.hour == 15 and exchange_time.minute == 45:  
        close_day(context)  

Grant,
The commission model you are using there should be a conservative estimate of actual commissions from IB. I just pulled up my trading reports from IB and I'm being charged $1.00 for trades that do not meet the minimum shares requirement. It's a couple cents extra for short sales, but I think that's actually a tax.

David

Thanks Gus, David,

Any idea how to improve the algo? One thought is that the model just predicts which securities to hold, equally weighted in the portfolio. Could it be modified to predict the optimum unequal weighting? One risk is that this would generate excessive trading, due to more frequent portfolio adjustment.

Also, perhaps someone could have a closer look at the implementation and advise if the machine learning approach could be improved (e.g. settings, different data set pre-processing, etc.).

This might be a nice example to run on zipline, since daily data can be used.

From a general standpoint, I'm curious if this is the kind of trading style that Quantopian is aiming to support under their "quantitative investing" offering (e.g. monthly re-balancing via a handful of trades)? Or would the returns get wiped out by the Quantopian algo fee and other costs?

Grant

I expect the algo could add the VIX index as an independent variable, or try an SVM model (combining different machine learning methods) to improve the model. I tried, but failed, because my limited Python skills make it hard to get familiar with the algo for now.

I'm not exactly sure how better returns could be made; maybe give that a try. I haven't looked too in-depth at machine learning methods. We aren't quite ready to say anything certain about that yet, but I can tell you that our goal is definitely not to wipe out your returns with an algo fee!

Hi Gus,

One angle would be to find an optimal set of ETFs. For example, there are lots to pick from in the list at http://www.forbes.com/sites/baldwin/2014/06/04/best-etfs-qqq-and-the-sector-funds/. The question is how to do the picking. Any ideas?

Grant

Great link. Following is the parent article, which has a few more leads: http://www.forbes.com/sites/baldwin/2014/06/04/best-etfs-for-investors/

Here's a rough update of my efforts to explore this algo. It trades every day, with zero commission cost (for development). Also, I switched to accumulating a window of minute bars. The securities are:

# SPY Top 10 Holdings, as of Apr 29, 2014 (17.67% of Total Assets)  
# http://finance.yahoo.com/q/hl?s=SPY+Holdings  
context.stocks = [ sid(24),     # AAPL  
                   sid(8347),   # XOM  
                   sid(5061),   # MSFT  
                   sid(4151),   # JNJ  
                   sid(3149),   # GE  
                   sid(23112),  # CVX  
                   sid(8151),   # WFC  
                   sid(11100),  # BRK_B  
                   sid(5938),   # PG  
                   sid(25006),  # JPM  
                   sid(8554) ]  # SPY (benchmark only; no position taken)  

SPY serves as a normalizing benchmark only; no position in SPY is taken.

Questions/comments/improvements welcome.

An "attaboy" to the first person to explain clearly (without web links, references, etc.) what the Random Forest is doing.

Grant

[Attached backtest (ID: 53935f34292f83070b0397fb, cloned 1673 times)]

Grant, I found this explanation of random forests pretty informative.

http://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/

Can someone briefly explain to me what's happening in lines 94-99?

X is a set of 30 periods (each of which contains 30 values). Let's assume X are the days in November and the elements in X are 30 evenly sampled "changes", for example at 20-minute intervals, so that X[0][0] is 9 AM, X[0][1] is 9:20, etc.
Y is the last of those periods (i.e. November 30th).
You then fit the 30 periods in X to the last period in Y. This is the same as fitting November 1st to 9 AM, then November 2nd to 9:20 AM, November 3rd to 9:40 AM, etc.
You then use Y to predict the next window.

This seems a bit odd to me, so I think there must be something fundamental that I either don't understand about history or classifier..?

Hi Chris,

I'll have a look tomorrow or this weekend. Frankly, I never quite figured this all out in detail...just hacked my way through it.

Grant

Hello Chris,

I took a look at your concern, and I just don't have the time to dig into it. Perhaps someone else can shed some light on this?

Grant

No problem, thanks for getting back to me. I figured it might be a simple misunderstanding.

Chris F,

I also had a small issue understanding why the splits were used the way they are. To understand, I quickly broke the algorithm down; you can see a notebook for this here:

http://nbviewer.ipython.org/gist/anonymous/fee5be4c6b59a62b87b2

The notebook shows what I see as the training data and target labels/classes. Please see the notebook for a full description, here's a snippet:

Changes in Feb-March 2012 are being used to predict the change on 26
August 2013. Then March-April 2012 are being used to predict 27th
August 2013. April-May 2012 used to predict 28th August 2013.

This gap slowly closes with changes for July-August 2013 to predict
20th September 2013. Finally, August-September 2013 to predict 23rd
September 2013. A crucial note: the last training set actually
includes the target value.

The labels are then used to predict the next day.

From a machine learning perspective I'm also unsure how this training data makes sense.

New to this. Does anyone have an example of machine learning where the length of a TA-Lib MA is changed for optimum results based on past data?

Nathan, so basically figure out what length of MA would yield the best results on historical data, then use that length for the current time frame? I'm not sure that's a job for machine learning; it's more just selection of the strategy with the highest returns, if I'm understanding you correctly.

I have an example where random strategies are tested and the best one is used that you may find interesting: https://www.quantopian.com/posts/evolutionary-strategy

Gus

Ah yes, thanks Gus!

Hi,

Thank you for sharing this awesome machine learning strategy. I have several questions.

1) Are there any particular advantages to using the change in stock price (1, 0, 0, 1) rather than the original price itself? Please see the attached backtest, which uses your code with a minor modification: the original price series as the input parameters.

2) Shouldn't the prediction be either 1 or 0? Why do we see other values such as 0.4 and 0.6 in the graph?

3) Currently, the code retrains the model each time we get new price data, which is cool. But I want to understand the assumptions behind this approach. For example, do we believe the pattern in stock prices changes each time we get new data?

[Attached backtest (ID: 5428b40054398308e0cdba2a, cloned 27 times)]

Thanks Nyan!

1) There is no real advantage; I just thought it would be easier to understand, because then it would be a binary long or short signal whose input was also binary.

2) The graph shows other values because it's smoothed; since there are so many data points, a few predictions are averaged into a single point shown on the graph.

3) I think two key assumptions to make are that a) we will never have a perfect model and b) the model is constantly changing as the world changes. However, the rate of change of the model is negligible when compared to the inaccuracies from our imperfect model. So the goal is basically to improve the model as new data comes in. That's the way I've been thinking about it, anyway.

Hope that helps! Let me know if you have any more questions.

Gus

Hi Gus,

Thank you for your answers. They are helpful. I have another question. My understanding is that it is always possible to overfit when using these machine learning algorithms, and you have to use cross-validation and/or regularization to prevent overfitting.

Since our model here trains the algorithm using the last 12 bars, are we doing anything to address the overfitting issue? Obviously, I am not an expert on machine learning, and I don't know much about this Random Forest algorithm beyond what I learned from a quick Google search.

Nyan

That's a good point. I'm not doing anything here to account for overfitting; in fact, I'm not doing much of anything besides showing the basic features. In order to have a realistic algorithm, some alternative signals would probably need to be used; I'd say that's the first step. But overfitting doesn't necessarily need special handling, just smart selection of independent and dependent variables.

Gus

@Grant,

I'm getting what looks like a square root error in your version that trades once per day on a basket of stocks from SPY.

"AttributeError: sqrt There was a runtime error on line 88."

Quantopian suggested that the square root came from calculating the z-score, which I understand and seems reasonable, but I'm too unfamiliar with the packages to know where the z-score function comes from. Perhaps we need to use another z-score routine or implement our own.

Btw, a random forest is an ensemble of decision trees. Each tree makes decisions like "If yesterday was 1 and the day before was 0, then today is a 1" (obviously an overly simple toy decision). Each individual tree is not generally great; perhaps it predicts well in only one part of the high-dimensional space. But together, the tree errors cancel, so that the forest's aggregate prediction is good and does not overfit. I'm guessing a properly regularized nonlinear SVM would perform quite similarly.
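
To illustrate the voting idea, here is a toy sketch with made-up data (not tied to the algo):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 10))   # 10 binary up/down features per sample
y = X[:, 0] & X[:, 3]                   # a toy pattern for the trees to find

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

sample = X[:1]
# predict_proba averages the per-tree class probabilities (a soft vote);
# the forest's prediction is the class with the highest average
print("vote fractions:", forest.predict_proba(sample)[0])
print("forest prediction:", forest.predict(sample)[0])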

[Attached backtest (ID: 54c0014fccb1e131470d4ef1, cloned 8 times)]

Sorry to randomly butt in. I was also working on a machine learning algorithm, more specifically using several layers of neural networks, with a conditional probability and return matrix as a decision maker. I was going to import data from quandl.com, and since my algorithm is neural based, I naturally want as much raw data as possible. With this in mind, and the fact that I want to run on this data every minute, I have to ask if you guys have a limit on the amount of processing power or code in one algorithm.
Thanks!

Hey @Grant, I'm also interested in using the z-score to find arbitrage opportunities. My only programming experience is in VBA, and it shows in my Python script. @Will Chen -- I don't know how to leverage the stats libraries, so I'm calculating the z-score manually [(price - average price) / stdev]. Would you guys mind taking a look at my script? I'm trying to do something similar without the random forest function (which I wish I knew how to use). @Grant, like you, I'm looking for a difference between the z-score of SPY and the z-score of each of the 9 sector ETFs. If sector_z > spy_z, I go long that particular sector as a momentum trade. The algo seems to work decently well: I can avoid large drawdowns and still make excess returns above the index.

This community has been a great resource. Somehow a guy who programs in Excel can write an algo in Python -- that's amazing!
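
For reference, the manual z-score described above matches scipy's zscore when the sample standard deviation (ddof=1) is used; a quick check:

import numpy as np
from scipy import stats

prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1])

# Manual z-score: (price - average price) / stdev
manual_z = (prices - prices.mean()) / prices.std(ddof=1)

# Equivalent library call
library_z = stats.zscore(prices, ddof=1)

print(np.allclose(manual_z, library_z))  # True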

[Attached backtest (ID: 54c4f3500a90970a2c7ea02a, cloned 78 times)]

Hello Jamie,

You might start by having a look at the algo I posted on https://www.quantopian.com/posts/working-with-history-dataframes (June 21, 2014). Just clone it and see if you can understand everything. If you have questions, I recommend posting them to https://www.quantopian.com/posts/working-with-history-dataframes (or post an improved example!).

Grant

I've been playing around with random forest a bit as well. I'm wondering if anyone knows whether there is a way to construct the classifier to take multiple inputs. So instead of using fit(x, y) you would do fit(x1, x2, y). Is there a way to do this? Originally I was thinking of doing a separate fit for each input and averaging the resulting predictions, but I think it would be ideal to include them all in one function to capture the interplay between the inputs.
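
For what it's worth, scikit-learn already supports this: fit(X, y) takes a 2-D feature matrix, so multiple inputs just become columns of X, and the forest can model the interplay between them. A minimal sketch with made-up data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
x1 = rng.randint(0, 2, size=100)   # first input series (e.g. price up/down)
x2 = rng.randint(0, 2, size=100)   # second input series (e.g. volume up/down)
y = rng.randint(0, 2, size=100)    # target: next bar up/down

# Stack the inputs as columns of a single feature matrix
X = np.column_stack([x1, x2])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 0]]))   # one new observation with both features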

The machine learning algorithm seems to have under-performed recently.

[Attached backtest (ID: 553b22492e5f5b0d22c03801, cloned 92 times)]

New to this, so forgive me if I am wrong.
It seems to me that all the data are used for training, and the same data are used for testing too.
Shouldn't we at least separate training and testing?

Yes

I've got a very stupid noob question :-). How can I add my own features to this algorithm?

Start by cloning the algorithm. Then you can review the code, maybe run a few backtests over different time periods and get comfortable with what it's doing. Then modify or add to the code as you see fit.

To elaborate on my response above: build your algorithm with a training data set, but test it using out-of-sample data.

Good luck.
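
As a sketch of what that separation might look like for bar data (made-up numbers; the split is chronological rather than random, since bars are a time series):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
X = rng.randint(0, 2, size=(500, 10))   # made-up feature matrix from past bars
y = rng.randint(0, 2, size=500)         # made-up up/down labels

# Train on the first 80% of bars, test on the most recent 20%,
# so the test set is genuinely out of sample
split = int(len(X) * 0.8)
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X[:split], y[:split])
print("out-of-sample accuracy:", clf.score(X[split:], y[split:]))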

I'll try to simplify Random Forest as much as possible. Feel free to correct me, anyone:

Usually you have a lot of variables, or at least a fair amount, to classify on, and, similar to the overfitting lecture on Quantopian, if you make a single tree with all the variables, you'll probably overfit. There's also some thought that goes into tree splits: usually a variable gets a score on how well it separates the different distributions of classes (in our case 1's and 0's, or positive and negative days), and so on.

So RF says: let's back up and have a whole bunch of mini trees, where each tree takes a few variables or so (you decide) and makes small predictions. So instead of a disgusting long tree that has overfit the data, you have hundreds (or some large number) of mini trees making predictions. This acts as a voting mechanism: the outputs from all of the mini trees are aggregated and averaged to form a response.

For those of you aware of the bias-variance tradeoff: a tree with all variables has high variance; a tree with fewer variables has more bias. Introduce many trees with randomly sampled variables, and RF does a decent job of finding a middle ground in a lot of scenarios.

Based on this explanation: in its essence, the algo you guys posted more or less just trades on autocorrelation (is a day in the past predictive of today's returns?).

The only problem is that this is worse than solely trading on autocorrelated variables because, if I understand this correctly, even if you did have some statistically significant variable (say t-5) with a high autocorrelation to your current t, you'd noise up the signal with all the other t's you introduced.

Hopefully this helps. @Gus and @Grant