Machine Learning on Quantopian Part 3: Building an Algorithm

This is the third part of our series on Machine Learning on Quantopian. Most of the code is borrowed from Part 1, which showed how to train a model on static data, and Part 2, which showed how to train a model in an online fashion. Both of those ran in the research environment, so they weren't functional algorithms. I highly recommend reading them first, as it will make the code here much clearer.

It was pleasantly easy to copy over the code from research to make a functional algorithm. The new Optimization API made the portfolio construction and trade execution part very simple. Thus, with a few lines of code we have an algorithm with the following desirable properties:

  • Uses Machine Learning on a Factor-based workflow.
  • Retrains the model periodically.
  • Trades a large universe of stocks drawn from the Q1500US universe (a subset of 1,000 stocks here).
  • Beta-neutral by going long-short.
  • Sector-neutral due to new optimization API.
  • Sets strict limits on maximum weight of any individual stock.

I also tried to make this algorithm template-like. If you clone it, you should be able to very easily drop in your own alpha factors, and they will be automatically picked up and incorporated (see the sketch below the settings). You can also configure this algorithm with a few prominent high-level settings:

N_STOCKS_TO_TRADE = 1000 # Will be split 50% long and 50% short  
ML_TRAINING_WINDOW = 21 # Number of days to train the classifier on, easy to run out of memory here  
PRED_N_FWD_DAYS = 1 # train on returns over N days into the future  
TRADE_FREQ = date_rules.week_start() # How often to trade, for daily, set to date_rules.every_day()  
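For example, a new alpha factor might look like the following. This is only a sketch: the factor itself is illustrative, and the exact hook by which the template collects factors may differ slightly.

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing
import numpy as np

class PriceReversion(CustomFactor):
    # Illustrative factor: 20-day mean price relative to the latest close.
    # High values flag stocks trading below their recent average.
    inputs = [USEquityPricing.close]
    window_length = 20

    def compute(self, today, assets, out, close):
        out[:] = np.mean(close, axis=0) / close[-1]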

However, this is definitely not the be-all and end-all algorithm. There are still many missing pieces and some rough edges:

  • Ideally, we could train on a longer window, like 6 months. But it's very easy to run out of memory here. We are acutely aware of this problem and are working hard to resolve it. Until then, we have to make do with being limited in this regard.
  • As you can see, performance is not great. I actually think this is quite honest. The alpha factors I included are all widely known and can probably not be expected to carry a significant amount of alpha. No ML fanciness will convert a non-predictive signal into a predictive one. I also noticed that it is very hard to get good performance out of this. That, however, is a good thing from my perspective. It means that because we are making so many bets, it is very difficult to overfit this strategy by making a few lucrative bets that pay off big-time but can't be expected to recur out-of-sample.
  • I deactivated all trading costs to focus on the pure performance. Obviously these would need to be taken into account to create a viable strategy.

As I said, there is still some work required and we will keep pushing the boundaries. Until then, please check this out, improve it, and poke holes in it.

[Backtest attached. Backtest ID: 594a7ce5230e5169ff2fe0cb]

75 responses

Here is the tear-sheet.

[Notebook attached; preview unavailable.]

This is great news! I'm excited to take a look at your code and give it a try.

Bummer about the memory restrictions though. That's a very limited amount of training data and I'm skeptical that we will be able to train a good model with so little data.

Out of curiosity, what were the changes you required of the infrastructure code for algorithms in order to be able to finally publish this?

@Rudiger: Yes, reducing memory constraints is definitely a top concern for us right now. There's also lots of optimizations we can do for fundamentals.

Scott Sanderson ultimately unlocked the workflow in its current form. Mainly there were some pipeline improvements. I'll let him chime in, but I think the main one was that pipeline was caching data for all pipeline nodes, even those that were no longer needed: https://github.com/quantopian/zipline/pull/1484

Thanks Thomas -

You describe the predictions as "probabilities ranging from 0 to 1." Is that before or after you perform this operation:

predictions -= 0.5  

Presumably, you are shifting the mean to be centered around 0, so that predictions ranges from -0.5 to 0.5 as an input to objective = opt.MaximizeAlpha(predictions) (which requires a signed input, even though it has a market_neutral constraint)?

Are the predictions the relative forecast returns, by stock in the universe? The stock with the highest prediction value is the surest bet for a long allocation, and the stock with the lowest prediction value is the surest bet for a short allocation?

You are generating the alpha factors and applying the ML (alpha combination) using Q1500US(), which, as I understand it, has sector diversification (see the help page, "...choosing no more than 30% of the assets from any single market sector"). But then, within the optimizer, a constraint is applied:

sector_neutral = opt.NetPartitionExposure.with_equal_bounds(  
            labels=context.risk_factors.Sector.dropna(),  
            min=-0.0001,  
            max=0.0001,  
    )  

I'm wondering if this is the right way to approach things? Wouldn't you want to start with a sector-diverse universe and let things play out? It seems like with the sector_neutral constraint, you are a priori taking it as beneficial to have sector neutrality over a potentially higher SR without the constraint. I guess the idea is that you can reduce "sector risk" but it seems like you'll end up undoing any advantageous sector tilt that might result from the prior steps. Or maybe in the world of long-short investing, it is well-established that sector tilt should be avoided?

As a structural consideration, I'd look at moving the optimization step into before_trading_start. All of your alpha factors are based on trailing daily bars, so intraday, you are only using the current portfolio state information in the optimizer. I guess the idea is that you'll do a better job of optimizing if the portfolio state is fresh (overnight gaps/drift could be significant). Or maybe run the alpha combination step intraday on daily bars plus the current minutely bars (or a daily VWAP up to the current minutely time)?

Given the time scale of the algo, it might be just as well and tidier to do everything in before_trading_start and then, before the market opens, hand off the new portfolio weights to the O/EMS system that you are presumably developing. It just seems kinda muddled structurally to run the alpha combination before the market open on trailing daily bars, and then to do the optimization intraday.

Also, how are you approaching combining individual algos into the fund? I'd think each one would be a kind of alpha factor, right? So if you'll be combining them on a daily basis, you'll need all of them to report in before the market opens, right? You have schedule_function(my_rebalance, TRADE_FREQ, time_rules.market_open(minutes=10)), but there is nothing to prevent users from fiddling with the execution time, and then you'll have all of the algos spitting out portfolio updates willy-nilly, which would not seem to be the best scenario for combining the algos and your O/EMS system.

As a side note, you posted an example algo that requires a paid subscription to a data set over the period of the backtest (see https://www.quantopian.com/data/zacks/earnings_surprises ). It mucks up the paradigm of anyone being able to clone your algo, fiddle with it, and then posting a 1:1 comparison.

Hi Grant,

Presumably, you are shifting the mean to be centered around 0, so that predictions ranges from -0.5 to 0.5 as an input to objective = opt.MaximizeAlpha(predictions) (which requires a signed input, even though it has a market_neutral constraint)?

Exactly. Probabilities of < 0.5 indicate the stock dropping and will be converted to a short signal after subtracting 0.5.
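In code, the conversion amounts to something like this (a sketch; opt refers to the experimental optimize module discussed later in the thread):

import quantopian.experimental.optimize as opt

# predictions: pd.Series of P(outperform) per asset, values in [0, 1]
predictions -= 0.5                          # center on 0: negative values become short signals
objective = opt.MaximizeAlpha(predictions)  # MaximizeAlpha expects a signed alpha vector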

Re Sectors: Yes, this is the same logic as being market-neutral. The optimization will go long/short equally within each sector to achieve sector-neutrality. The logic is the same: we do not try to predict market or sector movements here, so it's a good idea to remove that source of risk. If you had a model that accurately predicted how sectors move, you might of course want to introduce some sector tilt.

Re moving portfolio construction into before_trading_start: Not a bad idea, we could definitely do that, although I'm not sure there would be a lot of upside. The function right now does portfolio construction and execution simultaneously, which is convenient. But yeah, as we make portfolio construction more complex, that could make sense.

Extracting alpha factors is still ongoing research. Having the portfolio construction happen in before_trading_start might help with that. But one could also imagine us wanting to do our construction based on pure alphas in that approach.

Re earnings surprises: Thanks for alerting me to that, I wasn't aware. I'll update the algo to comment it out.

Thanks Thomas -

Personally, my biggest gap is with respect to the basics of ML. If there is a concise primer on the exact flavor you are using, it would be helpful (without having to read Python code line-by-line, which in the end is good, but doesn't give the big picture). Maybe I can get an article onto my Kindle and eventually educate myself on your modern computational voodoo.

By the way, as I mentioned to Scott, it'd be nice to run the optimizer on an expanding trailing window, and then combine the results, for smoothing. I think this would amount to running the alpha combination step N times per day, storing the predictions, running the optimizer on each prediction, and then combining the results in some fashion (e.g. a weighted average, weighted by expected return). Effectively, I think this means that in before_trading_start, you'd have to be able to call the "alpha combiner" with a parameter specifying the trailing window size. Is this feasible?

Your comment makes intuitive sense:

As you can see, performance is not great. I actually think this is quite honest. The alpha factors I included are all widely known and can probably not be expected to carry a significant amount of alpha. No ML fanciness will convert a non-predictive signal into a predictive one. I also noticed that it is very hard to get good performance out of this. That, however, is a good thing from my perspective. It means that because we are making so many bets, it is very difficult to overfit this strategy by making a few lucrative bets that pay off big-time but can't be expected to recur out-of-sample.

The issue I see is that with the set of factors and real-world data you used above, you can kinda debug the prototype tool, answering questions like:

  • How many factors can it handle?
  • Will it produce sensible, stable results, or go wildly off the rails?
  • Can a useful template be provided to the end-user, with a set of known real-world factors?

The problem, though, is that I think it is gonna be hard to tell if the tool is working properly, since there is a convolution of the noisy, uncertain inputs and the unproven tool. You are basically saying "garbage in, garbage out" is to be expected. But the tool itself may be contributing to the "garbage out," and it seems like it'll be hard to sort out its contribution, if any.

Say I wanted to show that my new, fancy method for fitting a straight line works. I guess I'd synthesize a data set with known characteristics (e.g. y = mx + b, with noise), and then apply my new-fangled algorithm to it. Could something analogous be done here? It would be nice to see that if a high SR output is expected (e.g. (1+r)^n type returns), one actually gets it.
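Something analogous is easy to mock up outside the full workflow. A minimal sketch with plain scikit-learn (all data synthetic, names hypothetical): plant one informative feature among pure-noise features and check that the classifier recovers it.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 5000
signal = rng.randn(n)                              # the one informative "factor"
noise = rng.randn(n, 9)                            # nine pure-noise "factors"
X = np.column_stack([signal, noise])
y = (signal + 0.5 * rng.randn(n) > 0).astype(int)  # noisy up/down labels

clf = LogisticRegression().fit(X[:4000], y[:4000])
print(clf.score(X[4000:], y[4000:]))  # held-out accuracy should be well above 0.5
print(clf.coef_.round(2))             # the weight on column 0 should dominate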

Extracting alpha factors is still ongoing research. Having the portfolio construction happen in before_trading_start might help with that. But one could also imagine us wanting to do our construction based on pure alphas in that approach.

Well, it seems like a no-brainer. I think the idea is for users to research and develop long-short alpha factors, and implement them in pipeline, using daily bars. My sense is that most of the value-add will be at this step (although presently, the only way to get paid for the effort is to write a full-up algo). Y'all could grab the alpha factors from each user, and take it from there, doing a global combination and optimization (on a high-performance computing platform, so you don't have to fuss with memory limitations, etc.). Of course, the over-arching question is, with this sort of a relatively long timescale system and equities, what sort of SR is achievable?

Thanks for this incredible work, Thomas! I have eagerly begun playing around with it.

Do you know if it's possible to plot the weights assigned to each factor during each period as the algo runs?

Grant: Good point regarding garbage in, garbage out. Ideally we'd have some simulated data to demonstrate it works. I think this should be pretty straightforward to do in research.

Lex: Yes, the commented-out line 286 (log.debug(self.clf.feature_importances_)) prints them out, albeit without the names of the factors associated with the importances. Perhaps you can store them in a global var, combine them with the factor names, and then plot them. That would be a great contribution.
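A minimal sketch of that bookkeeping (names hypothetical; it assumes the factor columns keep a stable order between retrainings):

import collections

FACTOR_IMPORTANCES = collections.defaultdict(list)  # global store, one series per factor

def record_importances(clf, factor_names):
    # clf.feature_importances_ lines up with the columns of the training
    # matrix, which follow the order of factor_names.
    for name, imp in zip(factor_names, clf.feature_importances_):
        FACTOR_IMPORTANCES[name].append(imp)

The accumulated series could then be pulled into research and plotted together.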

Aside from synthetic data & factors that would be expected to yield consistent, high SR, you'd also want to feed it noise to show that it doesn't extract a spurious signal from the noise (i.e. "make a silk purse out of a sow's ear," or "overfit" per the standard lingo).

One point of confusion here is with respect to The 101 Alphas Project which seems to suggest that if approached correctly, lots of sketchy, mini-alpha factors can be combined into a mega-alpha that will be viable. Is there any overlap between what you've done, and what is claimed in the paper (that many "faint and ephemeral" alphas can be combined advantageously)? Maybe with the right set of 101 alphas, your algo would be a winner?

Yes, the 101 Alphas Project would be a fantastic place to start for adding new alphas. This post is really about Alpha Combination, not Alpha Discovery.

@ Lex -

Regarding "Do you know if its possible to plot the weights assigned to each factor during each period as the algo runs?" you can only plot 5 variables. However, a tedious hack would be to run the backtest N times, recording a different set of variables each time. Within the research platform, the recorded variables are accessible. For example, try (with your own backtest_id):

bt = get_backtest('584bf25a0ca16a64879c92f1')  
dir(bt)  
print bt.recorded_vars  

Once you get all of the factor weights versus time into the research platform, you could then plot them together.

One tweak to the Quantopian backtester would be to allow recording of a larger number of variables, but perhaps still only allow plotting 5 per backtest. Then, in one backtest, all variables of interest could be loaded into the backtester in one shot.

I can't help but feel something's not quite right here. I ran a single factor (fcfy) with only 2 constraints (market neutral & gross leverage), set it to monthly rebalancing, a 20-period prediction, and 20-percentile ranges. I was expecting rather similar results to my monthly long/short factor rebalancing model for the same factor, but they are very different. The leverage still hovers around 0, so why would the results deviate so much if the model is only being passed one factor?

Also, unless I significantly remove constraints, I find it difficult to achieve anything but flat-line results. Am I missing something here?

N_STOCKS_TO_TRADE = 200 # Will be split 50% long and 50% short  
ML_TRAINING_WINDOW = 60 # Number of days to train the classifier on, easy to run out of memory here  
PRED_N_FWD_DAYS = 20 # train on returns over N days into the future  
TRADE_FREQ = date_rules.month_start() # How often to trade, for daily, set to date_rules.every_day()  

Lex, there's not a lot to go on here. In general, the ML combines factors in a non-linear way, so it could end up completely different from the weighting you do. Perhaps remove that part or replace it with something linear like LinearSVC?

The memory limitation on the Q platform is quite serious when it comes to really delving into machine learning techniques. The data cube one needs, spanning quite a few factors, to extract generalizations and keep forward-prediction variance down does not fit into the current memory limits. Additionally, the lack of the ability to serialize variables and models using the pickle library makes offline training impossible too.

I hope the Q team can see that in order to support the ML field beyond conceptual token examples, the platform needs to significantly improve in functionality and performance!

Hi Thomas -

I'm wondering how to handle "magic numbers" in factors, particularly with respect to time scales. For example, say I defined a simple price mean reversion factor as:

np.mean(prices[-n:,:],axis=0)/prices[-1,:]  

At any point in time, there may be a sweet spot for n (and presumably, using alphalens, I could sort that out, and kinda ballpark an optimum). However, I could also code a whole set of factors, each with a different trailing window length for the mean, and then let the ML combine them. For example, I could write 8 separate factors, with:

np.mean(prices[-3:,:],axis=0)/prices[-1,:]  
np.mean(prices[-4:,:],axis=0)/prices[-1,:]  
np.mean(prices[-5:,:],axis=0)/prices[-1,:]  
np.mean(prices[-6:,:],axis=0)/prices[-1,:]  
np.mean(prices[-7:,:],axis=0)/prices[-1,:]  
np.mean(prices[-8:,:],axis=0)/prices[-1,:]  
np.mean(prices[-9:,:],axis=0)/prices[-1,:]  
np.mean(prices[-10:,:],axis=0)/prices[-1,:]  

This would seem to have the advantage that I've eliminated (or at least smoothed) a magic number factor setting. Also, perhaps the ML algorithm would then dynamically combine them versus time in an optimal way. Or would it just create a mess? If it would make sense, is there a more elegant way to code things, so that one does not have to write N identical factors, only differing by their settings (e.g. be able to call a single factor N times with different parameters)?
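One way to avoid writing N identical factors: pipeline CustomFactors accept window_length at instantiation, so a single class can cover all eight windows. A sketch, reusing the formula above:

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing
import numpy as np

class MeanReversion(CustomFactor):
    inputs = [USEquityPricing.close]

    def compute(self, today, assets, out, close):
        # close has shape (window_length, n_assets)
        out[:] = np.mean(close, axis=0) / close[-1]

# One class, eight instances; window_length is set per instance.
factors = {'mean_rev_%d' % n: MeanReversion(window_length=n)
           for n in range(3, 11)}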

On a separate note, does pipeline support CVXPY? Is it possible to write factors that include the application of CVXPY?

Regarding platform limitations, my two cents is that there is an odd unidirectional relationship between the research platform and the backtester. It seems that they could be better integrated/unified, so that data could pass freely between them, and backtests could be called from the research platform. It seems also that one should be able to output a set of data from the research platform, and it should be available to algos (both for backtesting and live trading). Trying to do everything online in the algo is limiting, but I guess in the end, you need stand-alone contributions to the Q fund; you can't have "managers" fiddling with them offline.

@Thomas - Thanks for continuing the machine learning thread here with working code...that is really useful! What is on the horizon next for this thread?

@Kamran - yes...without something like a powerful compute/memory nvidia-docker image or cluster to work with, you have no real way to compute massive regression-based or deep machine learning algorithms. Perhaps one could deploy already-learned models with Q's infrastructure though?

Just finished reading a blog article that I found interesting:
https://medium.com/@TalPerry/deep-learning-the-stock-market-df853d139e02

Besides the detail on the different ways to learn sequential information (Recurrent Neural Networks, LSTMs, etc.), I found the overall targets the author delineates useful, so I quote from that section of the blog article by @TalPerry:

In our case of "predicting the market", we need to ask ourselves what exactly we want to predict. Some of the options that I thought about were:

1. Predict the next price for each of the 1000 stocks
2. Predict the value of some index (S&P, VIX etc) in the next n minutes.
3. Predict which of the stocks will move up by more than x% in the next n minutes
4. (My personal favorite) Predict which stocks will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.
5. (The one we'll follow for the remainder of this article) Predict when the VIX will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.

1 and 2 are regression problems, where we have to predict an actual number instead of the likelihood of a specific event (like the letter n appearing or the market going up). Those are fine but not what I want to do.

3 and 4 are fairly similar; they both ask to predict an event (in technical jargon, a class label). An event could be the letter n appearing next or it could be "moved up 5% while not going down more than 3% in the last 10 minutes". The trade-off between 3 and 4 is that 3 is much more common and thus easier to learn about, while 4 is more valuable as it is not only an indicator of profit but also has some constraint on risk.

5 is the one we'll continue with for this article because it's similar to 3 and 4 but has mechanics that are easier to follow. The VIX is sometimes called the Fear Index and it represents how volatile the stocks in the S&P500 are. It is derived by observing the implied volatility for specific options on each of the stocks in the index.

I like target number 4 also, yet wonder...
What do others think of these targets?
alan

@Thomas

I have been following with interest all your posts regarding factor combination using ML. As I have been researching this topic for quite a while on my own, if ML turns out to be the best way to combine factors, then I would be very happy.

Now, regarding the algorithm you posted, there are 2 critical bugs that prevent the ML factor from working:

  • Line 266: today.weekday is a method, so it should be today.weekday(). This bug results in the classifier being trained only once, at initialization time.
  • Line 279: that should be shift_mask_data(X, Y, fwd_days=PRED_N_FWD_DAYS); otherwise PRED_N_FWD_DAYS is bound to the upper_percentile argument.

The same bugs are present in the NB posted in ML Part 2.
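For reference, the corrected calls would look roughly like this (a sketch; surrounding names follow the posted algorithm, and the retraining call is hypothetical):

# Bug 1 (line 266): without parentheses, today.weekday is a bound method
# object rather than a day-of-week number, so the retraining condition
# never evaluated as intended and the classifier trained only once.
if today.weekday() == 0:      # e.g. retrain at the start of each week
    retrain_model(context)    # hypothetical retraining hook

# Bug 2 (line 279): passed positionally, PRED_N_FWD_DAYS bound to
# upper_percentile; pass it by keyword instead.
X, Y = shift_mask_data(X, Y, fwd_days=PRED_N_FWD_DAYS)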

Here is a fixed version of the algorithm. Unfortunately the results are even worse but I will get back to this in the next post.

[Backtest attached. Backtest ID: 58512ab66f9bae633455a3fb]

@Luca: Thanks so much -- those are two critical bugs. I'll update the original code too.

@Thomas

As I would like to see evidence of the ML factor producing better results than other techniques, I created this NB where I test a very simple scenario: 3 alpha factors. First I run Alphalens on the 3 factors individually to see how they perform. Then I try the most basic factor combination I can imagine, a linear combination with equal weighting: this is like averaging the effect of the 3 factors. Finally I run Alphalens on the ML factor, trying different window_length/n_fwd_days combinations to check if ML can beat the previous basic approach.

I would be very interested in hearing your thoughts on the results.

Please note that you can use this NB to run Alphalens over a large time span (I tried 14 years with a universe of 1000 securities) and the pipeline won't run out of memory, as I split the call to pipeline into many chunks.

Attached a 6 years analysis.

[Notebook attached; preview unavailable.]

Not sure if you're doing this, because I haven't had time to look at your code carefully, but beware of lookahead bias. You should run Alphalens on your factors over a different time period from the one where you test the efficacy of the "simple" model. By the same token, the ML model should be trained using data from the first time period and tested using the second time period.

@Rudiger Lippert, thanks for your comment but that's not an issue. I applied the same approach Thomas used in his ML part 2.

@Luca: That's a very insightful analysis. It's clear that the ML does not come off very well here. Since the factors are already pretty good and linear, perhaps this isn't too surprising, as the ML would just muddle with that. So two experiments would be to try a linear classifier, and to add a nonsense factor which hopefully the ML would identify and ignore when making predictions. I have already tried experiment 1 by replacing the ML algo with logistic regression; however, it didn't do much to improve metrics like annualized IR. As another suggestion, I would condense the outputs and only store metrics like spread, IC and IR, then display a big comparison table at the end comparing all the individual experiments. The full tear-sheets are too difficult to compare directly.

I also took another stab at improving the code. Specifically, I'm now tracking past predictions to evaluate the alpha model directly on held out data. Just looking at returns is too far detached from checking if the model actually works, as Luca and Grant highlighted. Thus, we can now track all kinds of metrics like accuracy, log-loss and spread. Surprisingly, these look pretty good on the held-out data (accuracy ~55%). This raises the question of why the algorithm isn't doing better than it is. Possibly because the spread is still pretty small, or because there are still bugs. In any case, I also simplified the portfolio allocation to try and get the trading of the algorithm closer to what is suggested by the alpha model.

Ultimately, we really need to develop tools and methods to evaluate each stage independently to track where things go wrong.
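A minimal sketch of that held-out scoring step, assuming predicted probabilities are stored until their forward returns are realized (the metric functions are standard scikit-learn):

from sklearn.metrics import accuracy_score, log_loss

def score_predictions(realized_up_down, predicted_probs):
    # Score predictions made PRED_N_FWD_DAYS ago, now that their target
    # returns are known. realized_up_down: 0/1 labels; predicted_probs:
    # the classifier's P(up) for the same assets.
    acc = accuracy_score(realized_up_down, predicted_probs > 0.5)
    ll = log_loss(realized_up_down, predicted_probs)
    return acc, ll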

[Backtest attached. Backtest ID: 585291f24ab362625a438fc6]

I'm not sure we're thinking about this problem correctly. It's not about whether "ML" works as means for alpha combination or not. Any technique that learns how to predict values from training data is technically ML, even if it is an extremely simple technique like linear regression.

I think it's more about the bias-variance tradeoff, which tells us that simpler models will do a better job of not fitting to random noise, but can be systematically biased if the output function isn't itself simple (like trying to fit a polynomial with a line). The problem of fitting to noise is particularly exacerbated by having very little data. A more complex model (like AdaBoost) tries to infer properties from limited examples and comes up with incorrect inferences (for example, if you had never seen a dog and were only given 5 pictures of dogs, where each of them contained a tree, you might infer that the tree is the dog; only by adding many more examples, where trees aren't present, do you have a chance to learn a better rule for identifying a dog).

I think the biggest problem with this new ML factor is that it's training on so little data. The model needs to see what happens when the market is in different conditions, or it will assume that the last month is completely representative of how markets behave. Ernie Chan commented in one of his books that financial data is so noisy that even if you have a lot of data it makes sense to use very simple models or you're going to overfit. Steps like ranking features, ranking targets, changing the target into a classification problem, and using a linear combination of ranks are all steps to try to tame the noise beast.

I wrote a bit about these kinds of issues in my blog post for my former company, if you're interested: https://www.factual.com/blog/the-wisdom-of-crowds

Rudiger: All great points. It's definitely worthwhile to start with linear classifiers and some robust factors. Note that we're already ranking features and turning the target into a classification problem. I think Luca's experiments are actually compatible with your thoughts. The simplest linear classifier is an equal-weighted one; if a learned combination does poorer, it means there's so much noise that we can't even infer stable combinations.

Ideally I'd like to test with "fake" factors that have great predictive power (like today's return, introducing look-ahead bias essentially). This would really show the benefit (can add non-linearities and noise on top), and be good for debugging and making sure everything works.
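That kind of debugging check is easy to mock up in research: append the (future) target itself as an extra column of the training matrix and verify that the classifier latches onto it with near-perfect held-out accuracy. A sketch with purely synthetic data (all names hypothetical):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(42)
n = 2000
fwd_returns = rng.randn(n)                  # the quantity being "predicted"
y = (fwd_returns > 0).astype(int)

noise_factors = rng.randn(n, 5)             # ordinary, useless factors
cheat = fwd_returns[:, None]                # look-ahead column: the answer itself
X = np.hstack([noise_factors, cheat])

clf = AdaBoostClassifier(random_state=0).fit(X[:1500], y[:1500])
print(clf.score(X[1500:], y[1500:]))        # ~1.0 if the plumbing works
print(clf.feature_importances_.round(2))    # importance should pile onto the last column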

Here's Thomas' recent revision (first post in thread, Backtest ID: 58517784ee8d8363d0d9790d), with:

predictions -= 0.5 # predictions are probabilities ranging from 0 to 1  
predictions *= -1 # flip sign of predictions, to see if return improves  

Not sure what it tells us, but I recall that a SR of as low as 0.7 could qualify for the Q fund. If it gets funded, I stake my claim to a share of the profits!

One interesting thing is that beta is on the high side. Would there be a way to mix in a hedging instrument (SPY?) to knock it down? I suspect that we may just be seeing the effect of a finite beta.

[Backtest attached. Backtest ID: 5853beaf97d940625309c809]

Grant: That's funny ;). For some reason the original algorithm has negative beta, not sure why yet. I suspect the returns here are also due to the higher beta.

Hi Thomas & all,

Yesterday, I'd posted some comments that were judged to be off the main topic here and deleted by a Quantopian moderator. I moved them to:

https://www.quantopian.com/posts/optimize-api-now-available-in-algorithms
https://www.quantopian.com/posts/side-comment-to-machine-learning-on-quantopian-part-3-building-an-algorithm

Grant

A potential problem with this algorithm is the ML technique you are applying. AdaBoost is a meta-algorithm which requires a base learner as an argument (see the sklearn documentation on this). If you do not pass in a base learner it defaults to a decision stump, which may not be optimal (or even useful) for predicting returns.

How can we fix the memory problem, please? Can we pay a premium to expand our memory restriction? Or is it possible to train the learner offline? What's the setup for parallel computing on this?

@Nicholas: Note that we're not predicting returns directly, but rather whether returns go up or down (relative to the others in the universe). Certainly this is a choice, but I think it's a good default to start with (although I now start with linear classifiers to remove complexity). Having said that, it's certainly something to experiment with. What base learner do you think would be better suited?
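To make that classification setup concrete, here is a sketch of turning forward returns into cross-sectional up/down labels (the percentile cutoffs are illustrative, and the posted code's binarization may differ in detail):

import numpy as np
from scipy import stats

def binarize_returns(fwd_returns, lower_pct=30, upper_pct=70):
    # Rank forward returns across the universe; bottom percentile -> 0,
    # top percentile -> 1, ambiguous middle dropped from training.
    ranks = stats.rankdata(fwd_returns)
    lo = np.percentile(ranks, lower_pct)
    hi = np.percentile(ranks, upper_pct)
    keep = (ranks <= lo) | (ranks >= hi)
    labels = (ranks >= hi).astype(int)
    return labels[keep], keep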

@Jun: We hope to be able to just increase memory for everyone.

@Thomas,

I was thinking about how to make sure you account for 1) securities that go broke and are removed from the entire stock universe and/or 2) securities that exit your particular screen. I'm not 100% sure your shifting approach would account for them. I've attached a Notebook with how I would approach getting accurate returns for a specific screen (Q1500US for my example).

Anyway, not sure it would be possible in the algorithm code but something to consider.

[Notebook attached; preview unavailable.]

It turns out the ML factor works after all. There was another annoying bug in shift_mask_data() where the factor results were not correctly aligned to the right returns (it's always the off-by-one errors). That way, the ML was trying to learn today's returns instead of tomorrow's returns.

So here is the backtest, but when I have some time I really want to try my NB above to see if ML really works.

[Backtest attached. Backtest ID: 587ba634022d335e054d2a24]

I'll get this posted in the next few days; work has been hectic. I have been running an SVM on the S&P 500 that is showing promise as well. I am thinking that an approach that equal-weights holdings from SVM, NB and an ensemble might do well. Stay tuned.

ML combines the factors in a non-linear way. Does that help get rid of the problem of correlated factors (used in linear regression based factor models) to some extent?

Luca: Great catch, did you debug the function in a NB?

Bharath: In the case of correlated features, Random Forests will pick one feature to make predictions and then give the other a very low weighting. This should be taken into account when looking at feature importances. See e.g. http://blog.datadive.net/selecting-good-features-part-iii-random-forests/.

I still don't get any good results in the NB, even after this bug fix. So I believe it was just luck that the backtest improved.

Thanks Thomas

We need Python pickle serialize/deserialize.

Hi Thomas/Q team -

Above, it is stated regarding memory limitations "We are acutely aware of this problem and are working hard to resolve it."

What is the status? What approaches are being considered? Timeline for implementation?

I've been playing with ML locally using zipline for some time now, and I'm not sure 6 months of daily data is going to do much in the way of ML. I've found considerably better results with 15-20k data points. That being said, we definitely need a way to take a model trained locally and then import it into Quantopian. If this could be done on a daily basis (train at home, upload, and have the Q script fetch it), the idea of using ML with Quantopian would make a lot more sense. Just my thoughts; maybe people have had better luck with smaller datasets?

@Visser Agree with you on the importance of model import for the Q platform. Might be hard for them to do for a large user base...don't know...
We also haven't gotten good ML results with a small number of days over a small number of factors...which actually feels right.
It should take a large number of days over a large number of factors to eke out an alpha that is not easily detected by human inspection.

When you say "15-20k data points", do you mean 1.5-2k time-points over 10 factors, or 15-20k time-points over one factor?
Thanks!
alan

I'll second Andy's point - I know it's been asked for before. A method for taking a model trained locally and uploading it and using it with Quantopian data. The computation-heavy part of machine learning is the training of the model. If memory limitations are the problem, let us train the model outside the platform and import it in.

@Alan - the models I'm playing with locally are trained using 750k lines across 25-30 factors, so between 18M and 22.5M data points, covering two years of data. More factors means more training data is needed to get a model that generalizes well, although I'm probably using more than I need (I have another 500k lines or so I'm using as a testing set). I agree that using a small number of factors over a short time is likely to yield poor results, or results that only apply to a specific situation.

Training those models can take hours (most take a few minutes though), but predictions are much quicker (usually seconds). Offloading the computation heavy part to a local machine and keeping the predictions, which are relatively easy, would be the least painful solution for everyone.

On the other hand, I'm not a computer scientist, so I have no idea whats going on around the back end. :)

I think part of the problem may be how this would fit with the Q fund concept of being able to time-stamp algos once an initial backtest has been run, and then re-running the backtest in 6 months out-of-sample for evaluation. Would the ML model be stale by then? How often would uploads be required to keep it fresh? And how would Q manage the uploads in evaluating the algos for the fund? Q is all about the Q fund, so one has to consider everything in that context.

It seems like one solution might be to super-charge the research platform, and then work out the mechanics of launching an algo with the ability for the user to push data from the research platform to the algo (or perhaps call an automated script tied to the algo).

@Alan 15-20k lines with 75 or so features each. Your idea of 750k to 1mil is probably where I should be though ;)

EDIT
Sorry, I meant @John Cook

@Grant, that's an idea. I'd love to be able to do it all with Quantopian.

@John - Thanks for the info...sounds great!
Have you used multi-time-scale learning yet, with fusion...or is your time scale all in seconds?

Hi,

Rather than using ML to predict tomorrow's return, I want to predict today's return and see whether the prediction is greater than or less than the actual return: go long if prediction > today's return and short if prediction < today's return. Could someone help me do this if it is not too much to ask?

Best regards,
Pravin

@Pravin Bezwada it seems easy. You need a simple comparison function: create an array of today's returns (which you can calculate with a percent-change function) and an array of tomorrow's predictions, shifted by -1 so they line up with today. After that, all you need to do is compare the two arrays. :) You can also do this operation using pandas. Hope it helps.
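A small pandas sketch of that alignment (data synthetic; the "predictions" here are just stand-ins for a real model's output):

import numpy as np
import pandas as pd

np.random.seed(0)
prices = pd.Series(100 * (1 + 0.01 * np.random.randn(30)).cumprod())
actual = prices.pct_change()          # realized return for each day

# Suppose pred_next[t] is the model's forecast, made on day t, of day
# t+1's return (here the truth plus noise, purely for illustration).
pred_next = actual.shift(-1) + 0.005 * np.random.randn(30)

pred_for_today = pred_next.shift(1)   # re-index forecasts onto the day they target
go_long = pred_for_today > actual     # Pravin's rule: long if prediction > actual
go_short = pred_for_today < actual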

Thanks Arshpreet Singh. That could work; although I wanted to regress today's factor on today's returns instead of yesterday's factors on today's returns.

@Pravin, there is this variable

PRED_N_FWD_DAYS = 1 # train on returns over N days into the future  

But keep in mind the following: pipeline is run every day after market close (more precisely, before market open of the next day, but the data available to pipeline is the same), so after market close you can compute your latest factor values. However, since the returns are calculated as the % difference in open prices (because the algorithm enters/exits positions at market open), pipeline will compute today's returns (that is, tomorrow's open / today's open) only tomorrow after market close. This is why the algorithm discards today's pipeline factor and uses yesterday's pipeline factor together with the pipeline returns calculated today (today's open / yesterday's open), which actually means running the ML on yesterday's factor and yesterday's actual returns. This is what you get with PRED_N_FWD_DAYS = 1.

This is also the bug I reported above, as the algorithm was intended to run the ML on future returns, not current returns. That is, suppose you want to use 1-day future returns: the algorithm should use returns calculated today and factor values calculated 2 days ago (not yesterday).

Pravin Bezwada: Yes, that was a random guess. :)

Thanks Arshpreet.

@Luca, thanks for your valuable feedback. I now see how it works. Here is my attempt with ML. My take on this problem is that what happens tomorrow is anyone's guess. But you can always tell if today has been undervalued or overvalued. The attached algorithm has a few bugs and crashes but it is so far profitable with transaction costs and slippage. If someone could take a look and fix the bug, I can improve it further.

[Backtest attached. Backtest ID: 58c9055040128517fc5d690e]

@ Thomas/Quantopian support -

What is the status of this effort, in terms of releasing an API? Is that the goal? Is it still a work-in-progress on your end? Or is it complete, from your perspective?

The approach taken with the optimization API is nice, and presumably it will eventually be "released" and moved out of quantopian.experimental with the source code on github. Is that the game plan here, too? If so, what remains to be done?

Some issues I see:

  1. Memory constraints.
  2. Testing of ML on known (synthesized) datasets (to confirm its functionality independent of noisy, real-world factors).
  3. Encapsulation into an API with documentation and revision control on github.
  4. Other?

After further reviewing this algorithm, I realized that we pool the factors and returns across all securities for the prediction. I want to run multiple predictions, one per security, by regressing the time series of factors per security against that security's returns. My problem is I don't know how to fetch a time series (say 60 days) of factor values per security. Sigh, pipeline is so hard.

@Grant: I see the same issues you do (especially memory). A lot of engineering work recently went into improving the infrastructure but hopefully we can work on removing these issues soon.

@Pravin: Correct. You don't need to change anything to pipeline, however, since it is already providing the ticker information. The classifier just ignores it. I recommend starting with the NB and static data: https://www.quantopian.com/posts/machine-learning-on-quantopian as it's much easier to intuit about what needs to be done.

@Luca: That logic seems correct to me. Returns will be open of day T-2 to open of day T-1, so really returns of day T-2. I'll fix the code in the various posts.

@Thomas: I'm glad to see this discussion is still active.

I have a question regarding the X values used as input for AdaBoostClassifier. When I was previously learning about AdaBoost, X would be the predicted results of the weak classifiers, and they'd take the same possible values as Y, i.e., 1 or -1. So for example, if there were three classifiers, in the case of the first training data item, you may have [-1, 1, 1], with the expected Y as [1].

In the case of your provided sample code, you're initially fitting the distributed ranking values (so they're between 0 and 1) as your X data.

My Question: Is there a point in the AdaBoostClassifier in which these X input values get classified into a 1 or -1? If so, where is that happening? Is that the job of the DecisionTreeClassifier?

Btw, this site has opened my eyes to quantitative trading. I really appreciate the vast amount of material available here.

Actually, never mind, I see what's going on. By default AdaBoostClassifier is using DecisionTreeClassifier as the base estimator.

@ Thomas -

I would add to the to-do list a re-consideration of your architecture that limits the ML to pipeline. My view is that a general ML API should accept alpha factors both from pipeline and from the minutely algo. This doesn't mean that the ML would need to run during the trading day, but just that it would need to accept alpha factors derived from minute bars, for example. As it stands, I don't think there is a way to do that now, since one can't feed data into pipeline.

@Grant: The problem is still that pipeline only returns the most recent row, rather than the full history of factors required to train the classifier. I suppose we could somehow hack around that and store the history in a global variable and then train the classifier in before_trading_start(), or handle_data() if you want to collect minute data and feed that in as well.
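A sketch of that workaround (build_training_set and the pipeline name are hypothetical; ML_TRAINING_WINDOW comes from the template settings):

from collections import deque
from quantopian.algorithm import pipeline_output

history = deque(maxlen=ML_TRAINING_WINDOW)  # global rolling store of daily factor rows

def before_trading_start(context, data):
    todays_factors = pipeline_output('ml_pipeline')  # latest row per asset
    history.append(todays_factors)
    if len(history) == history.maxlen:
        X, Y = build_training_set(history)  # hypothetical: stack, shift, binarize
        context.clf.fit(X, Y)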

Ok, I understand where I was going wrong earlier.. Being new to both quantitative trading and machine learning, maybe it's just me, but I'll state what I was mixing up in case anyone else ran into the same thing.

I was mixing up the Pipeline factors with the AdaBoost classifiers. That is, I was thinking that the factor functions were the classifiers that AdaBoost was expected to use and improve upon. From a naive perspective, the factor functions look similar to what I would define as "weak learners."

But in reality, the weak learners (classifiers) are decision tree stump classifiers that are used by the AdaBoost classifier. The Pipeline factor ranks are the data fed to those weak learners (classifiers).

@ Thomas - If I understand correctly, within pipeline, we have access to the full history of factors to train the classifier, but one can't get pipeline to spit out trailing windows of factors. But you think globals will work? Maybe I'll try that at some point, just for yucks. Is there a time frame for upgrading pipeline to output multiple rows? Then you could elegantly park the ML outside pipeline to allow for combining pipeline and non-pipeline factors.

Good news! Since we doubled RAM in research and the backtester, we can finally start running the ML workflow with longer lookback windows. In this backtest I run with a half-year window, training weekly over a 4-year period.

As you can see, the algorithm still does not perform magically better and my hunch is that the factors themselves do not have enough alpha in them, so I think exploring other factors in this framework would be a good next step.

Also, the next bottleneck that popped up is timeouts in before_trading_start. These happen if we use fundamentals (Morningstar), which unfortunately many factors rely on; I removed all of these from this backtest to make things work. There are plans to speed up fundamentals, alleviating this problem. Until then, there are many things that can be done in this framework already, including using factors based on other data sources.

[Backtest attached. Backtest ID: 593128407ebbf351f09797f7]

Sounds good thanks for the update!

Awesome!

Can someone explain why results can change between two backtests with the same algorithm and parameters? I am using AdaBoostClassifier with default parameters and there are two possible results. Is this an effect of the random state? How can I avoid this effect?
Thanks

Mat: I added a random seed now to the classifier instantiation. Can you see if the problem persists?
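For reference, the change amounts to something like this (a sketch; the seed value is arbitrary):

from sklearn.ensemble import AdaBoostClassifier

# Fix the seed so repeated backtests draw the same random numbers.
clf = AdaBoostClassifier(random_state=1337)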

[Backtest attached. Backtest ID: 594a7ce5230e5169ff2fe0cb]

@Thomas, even with the random_state=1337 param set, there are still 2 possible results for my backtest with AdaBoostClassifier. I have no problem with GradientBoostingClassifier with default params, but the backtest result is less interesting.

@Mat: We have an open issue about this; we'll try to fix it and post an update here.

You can check that self.clf.score(X, Y) between two backtest runs is not always the same. Maybe an effect of the default AdaBoostClassifier param algorithm='SAMME.R'. For weekly training it does not seem to have any effect, but on yearly training you can see a huge effect on backtest returns.

@Mat - Are you sure that the results of ensemble.AdaBoostClassifier should be fully deterministic, even with random_state set?
I looked, yet couldn't resolve that question.
One could make a case for a slight bit of non-determinism based on the base classifier being a two-leaf tree (e.g. one if-statement), and the choice flipping one way or the other for "equal" values...not sure about that argument though.

@Mat - From what I can tell googling, there was a problem with random_state in late 2016, which was fixed in the sklearn 0.18 release.
See: http://scikit-learn.org/stable/whats_new.html

Fix bug where ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor would perform poorly if the random_state was fixed (#7411). By Joel Nothman.

Fix bug in ensembles with randomization where the ensemble would not set random_state on base estimators in a pipeline or similar nesting. (#7411). Note, results for ensemble.BaggingClassifier, ensemble.BaggingRegressor, ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor will now differ from previous versions. By Joel Nothman.

@Thomas - Can't tell what version of sklearn is being used anymore, so I don't know if the bug is fixed in the Quantopian Research Platform:

import sklearn  
print('The scikit-learn version is {}.'.format(sklearn.__version__))

produces

SecurityViolation: 0002 Security Violation(s): Accessing sklearn.__version__ raised an AttributeError. No attributes with a similar name were found.

alan

@alan: we run 0.16.1 (quite old), so this definitely seems like it could be the cause; we really need to upgrade. I think with the same seed it should behave reliably.

So it sounds like the bug is specific to AdaBoostClassifier, in which case we can instantly fix it by switching to e.g. RandomForestClassifier.

@Mat - Based on my understanding, AdaBoostClassifier uses the DecisionTreeClassifier by default, which produces "best" splits (using gini computation). I believe that in some cases, there are several splits that result in the same gini score. It's possible that, in such a case, the DecisionTreeClassifier randomly picks which split to use.

I found this: https://github.com/scikit-learn/scikit-learn/issues/2386, which seems to support the theory.

Sam

@Sam: Even if it's random, it should be deterministic with a set seed.

Thank you all for your help. I switched to RandomForestClassifier(n_estimators=300, random_state=10), which is deterministic but very sensitive to the training period. With a 5-factor model I have to fit the model on only a few months at the beginning of each year to find good results; maybe the classifier finds factor relations that do not exist during the rest of the year. Difficult to explain.