Machine Learning on Quantopian Part 3: Building an Algorithm

This is the third part of our series on Machine Learning on Quantopian. Most of the code is borrowed from Part 1, which showed how to train a model on static data, and Part 2, which showed how to train a model in an online fashion. Both of these were in research so they weren't functional algorithms. I highly recommend reading those before as it will make the code here much clearer.

It was pleasantly easy to copy over the code from research to make a functional algorithm. The new Optimization API made the portfolio construction and trade execution part very simple. Thus, with a few lines of code we have an algorithm with the following desirable properties:

  • Uses Machine Learning on a Factor-based workflow.
  • Retrains the model periodically.
  • Trades a large universe of stocks, using the Q1500US universe (a subset of 1,000 stocks is traded here).
  • Beta-neutral by going long-short.
  • Sector-neutral due to new optimization API.
  • Sets strict limits on maximum weight of any individual stock.

I also tried to make this algorithm template-like. If you clone it, you should be able to very easily put in your own alpha factors, and they will be automatically picked up and incorporated. You can also configure this algorithm with a few prominent high-level settings:

N_STOCKS_TO_TRADE = 1000 # Will be split 50% long and 50% short  
ML_TRAINING_WINDOW = 21 # Number of days to train the classifier on, easy to run out of memory here  
PRED_N_FWD_DAYS = 1 # train on returns over N days into the future  
TRADE_FREQ = date_rules.week_start() # How often to trade, for daily, set to date_rules.every_day()  
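The "automatically picked up" behavior can be sketched outside of pipeline as a plain factor registry: any function added to a dict becomes one column of the feature matrix handed to the classifier. All names and shapes below are hypothetical and only illustrate the pattern, not the actual algorithm code.

```python
import numpy as np

# Hypothetical stand-ins for pipeline factors: each maps a
# (days x stocks) price array to one value per stock.
def mean_reversion(prices):
    return np.mean(prices[-5:], axis=0) / prices[-1] - 1.0

def momentum(prices):
    return prices[-1] / prices[0] - 1.0

# The template-style trick: any factor added to this dict is
# picked up automatically when the feature matrix is built.
FACTORS = {
    'mean_reversion': mean_reversion,
    'momentum': momentum,
}

def make_features(prices):
    names = sorted(FACTORS)  # fixed column order for reproducibility
    X = np.column_stack([FACTORS[n](prices) for n in names])
    return names, X

# 30 days of synthetic prices for 8 stocks.
prices = np.cumprod(1 + 0.01 * np.random.randn(30, 8), axis=0)
names, X = make_features(prices)
print(names, X.shape)  # one row per stock, one column per factor
```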

However, this is definitely not the be-all-end-all algorithm. There are still many missing pieces and some rough edges:

  • Ideally, we could train on a longer window, like 6 months. But it's very easy to run out of memory here. We are acutely aware of this problem and are working hard to resolve it. Until then, we have to make do with being limited in this regard.
  • As you can see, performance is not great. I actually think this is quite honest. The alpha factors I included are all widely known and can probably not be expected to carry a significant amount of alpha. No ML fanciness will convert a non-predictable signal into a predictable one. I also noticed that it is very hard to get good performance out of this. That, however, is a good thing in my perspective. It means that because we are making so many bets, it is very difficult to overfit this strategy by making a few lucrative bets that pay off big-time but can't be expected to re-occur out-of-sample.
  • I deactivated all trading costs to focus on the pure performance. Obviously this would need to be taken into account to create a viable strategy.

As I said, there is still some work required and we will keep pushing the boundaries. Until then, please check this out, improve it, and poke holes in it.

Clone Algorithm
# Backtest ID: 594a7ce5230e5169ff2fe0cb
Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

92 responses

Here is the tear-sheet.


This is great news! I'm excited to take a look at your code and give it a try.

Bummer about the memory restrictions though. That's a very limited amount of training data and I'm skeptical that we will be able to train a good model with so little data.

Out of curiosity, what were the changes you required of the infrastructure code for algorithms in order to be able to finally publish this?

@Rudiger: Yes, reducing memory constraints is definitely a top concern for us right now. There's also lots of optimizations we can do for fundamentals.

Scott Sanderson ultimately unlocked the workflow in its current form. Mainly there were some pipeline improvements. I'll let him chime in, but I think the main one was that pipeline was caching data for all pipeline nodes, even those that were no longer needed: https://github.com/quantopian/zipline/pull/1484

Thanks Thomas -

You describe the predictions as "probabilities ranging from 0 to 1." Is that before or after you perform this operation:

predictions -= 0.5  

Presumably, you are shifting the mean to be centered around 0, so that predictions ranges from -0.5 to 0.5 as an input to objective = opt.MaximizeAlpha(predictions) (which requires a signed input, even though it has a market_neutral constraint)?

Are the predictions the relative forecast returns, by stock in the universe? The stock with the highest prediction value is the surest bet for a long allocation, and the stock with the lowest prediction value is the surest bet for a short allocation?

You are generating the alpha factors and applying the ML (alpha combination) using Q1500US(), which, as I understand it, has sector diversification (see the help page: "...choosing no more than 30% of the assets from any single market sector"). But then, within the optimizer, a constraint is applied:

sector_neutral = opt.NetPartitionExposure.with_equal_bounds(
    labels=context.risk_factors.Sector.dropna(),
    min=-0.0001,
    max=0.0001,
)

I'm wondering if this is the right way to approach things? Wouldn't you want to start with a sector-diverse universe and let things play out? It seems like with the sector_neutral constraint, you are a priori taking it as beneficial to have sector neutrality over a potentially higher SR without the constraint. I guess the idea is that you can reduce "sector risk" but it seems like you'll end up undoing any advantageous sector tilt that might result from the prior steps. Or maybe in the world of long-short investing, it is well-established that sector tilt should be avoided?

As a structural consideration, I'd look at moving the optimization step into before_trading_start. All of your alpha factors are based on trailing daily bars, so intraday, you are only using the current portfolio state information in the optimizer. I guess the idea is that you'll do a better job of optimizing if the portfolio state is fresh (overnight gaps/drift could be significant). Or maybe run the alpha combination step intraday on daily bars plus the current minutely bars (or a daily VWAP up to the current minutely time)? Given the time scale of the algo, it might be just as well, and tidier, to do everything in before_trading_start and then, before the market opens, hand off the new portfolio weights to the O/EMS system that you are presumably developing. It just seems kinda muddled structurally to run the alpha combination before the market open on trailing daily bars, and then to do the optimization intraday.

Also, how are you approaching combining individual algos into the fund? I'd think each one would be a kind of alpha factor, right? So if you'll be combining them on a daily basis, you'll need all of them to report in before the market opens, right? You have schedule_function(my_rebalance, TRADE_FREQ, time_rules.market_open(minutes=10)), but there is nothing to prevent users from fiddling with the execution time, and then you'll have all of the algos spitting out portfolio updates willy-nilly, which would not seem to be the best scenario for combining the algos with your O/EMS system.

As a side note, you posted an example algo that requires a paid subscription to a data set over the period of the backtest (see https://www.quantopian.com/data/zacks/earnings_surprises ). It mucks up the paradigm of anyone being able to clone your algo, fiddle with it, and then posting a 1:1 comparison.

Hi Grant,

Presumably, you are shifting the mean to be centered around 0, so that predictions ranges from -0.5 to 0.5 as an input to objective = opt.MaximizeAlpha(predictions) (which requires a signed input, even though it has a market_neutral constraint)?

Exactly. Probabilities of < 0.5 indicate the stock dropping and will be converted to a short-signal after subtracting 0.5.
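A minimal numeric sketch of that conversion (the probabilities below are made up):

```python
import numpy as np

# Hypothetical classifier output: P(stock outperforms its peers).
proba = np.array([0.8, 0.65, 0.55, 0.45, 0.3, 0.2])

# Centering at zero turns probabilities into a signed alpha:
# positive -> long candidate, negative -> short candidate.
predictions = proba - 0.5

print(predictions)  # values now lie in [-0.5, 0.5]
print((predictions > 0).sum(), (predictions < 0).sum())  # 3 longs, 3 shorts
```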

Re Sectors: Yes, this is the same logic as being market-neutral. The optimization will go long/short equally within each sector to achieve sector-neutrality. We do not try to predict market or sector movements here, so it's a good idea to remove that source of risk. If you had a model that accurately predicted how sectors move, you might of course want to introduce some sector tilt.
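The Optimize API enforces this as a hard constraint, but the effect can be approximated by demeaning the signed alpha within each sector so that every sector nets out to zero. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical signed alpha per stock, with a sector label.
df = pd.DataFrame({
    'alpha':  [0.3, 0.1, -0.2, 0.25, -0.15, -0.05],
    'sector': ['tech', 'tech', 'tech', 'energy', 'energy', 'energy'],
})

# Demean within each sector so longs and shorts cancel per sector,
# then normalize to unit gross exposure.
w = df.groupby('sector')['alpha'].transform(lambda a: a - a.mean())
df['weight'] = w / w.abs().sum()

# Net exposure per sector is (numerically) zero.
print(df.groupby('sector')['weight'].sum())
```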

Re moving portfolio construction into before_trading_start: Not a bad idea, we could definitely do that. Although not sure there would be a lot of upside, the function right now does portfolio construction and execution simultaneously which is convenient. But yeah, as we make portfolio construction more complex that could make sense.

Extracting alpha factors is still ongoing research. Having the portfolio construction happen in before_trading_start might help with that. But one could also imagine us wanting to do our construction based on pure alphas in that approach.

Re earnings surprises: Thanks for alerting me to that, I wasn't aware. I'll update the algo to comment it out.

Thanks Thomas -

Personally, my biggest gap is with respect to the basics of ML. If there is a concise primer on the exact flavor you are using, it would be helpful (without having to read Python code line-by-line, which in the end is good, but doesn't give the big picture). Maybe I can get an article onto my Kindle and eventually educate myself on your modern computational voodoo.

By the way, as I mentioned to Scott, it'd be nice to run the optimizer on an expanding trailing window, and then combine the results, for smoothing. I think this would amount to running the alpha combination step N times per day, storing the predictions, running the optimizer on each prediction, and then combining the results in some fashion (e.g. a weighted average, weighted by expected return). Effectively, I think this means that in before_trading_start, you'd have to be able to call the "alpha combiner" with a parameter specifying the trailing window size. Is this feasible?

Your comment makes intuitive sense:

As you can see, performance is not great. I actually think this is quite honest. The alpha factors I included are all widely known and can probably not be expected to carry a significant amount of alpha. No ML fanciness will convert a non-predictable signal into a predictable one. I also noticed that it is very hard to get good performance out of this. That, however, is a good thing in my perspective. It means that because we are making so many bets, it is very difficult to overfit this strategy by making a few lucrative bets that pay off big-time but can't be expected to re-occur out-of-sample.

The issue I see is that with the set of factors and real-world data you used above, you can kinda debug the prototype tool, answering questions like:

  • How many factors can it handle?
  • Will it produce sensible, stable results, or go wildly off the rails?
  • Can a useful template be provided to the end-user, with a set of known real-world factors?

The problem, though, is that I think it is gonna be hard to tell if the tool is working properly, since there is a convolution of the noisy, uncertain inputs, and the unproven tool. You are basically saying "garbage in, garbage out" is to be expected. But the tool may be contributing to the "garbage out" and it seems like it'll be hard to sort its contribution, if any, to the "garbage out."

Say I wanted to show that my new, fancy method for fitting a straight line works. I guess I'd synthesize a data set with known characteristics (e.g. y = mx + b, with noise), and then apply my new-fangled algorithm to it. Could something analogous be done here? It would be nice to see that if a high SR output is expected (e.g. (1+r)^n type returns), one actually gets it.

Extracting alpha factors is still ongoing research. Having the portfolio construction happen in before_trading_start might help with that. But one could also imagine us wanting to do our construction based on pure alphas in that approach.

Well, it seems like a no-brainer. I think the idea is for users to research and develop long-short alpha factors, and implement them in pipeline, using daily bars. My sense is that most of the value-add will be at this step (although presently, the only way to get paid for the effort is to write a full-up algo). Y'all could grab the alpha factors from each user, and take it from there, doing a global combination and optimization (on a high-performance computing platform, so you don't have to fuss with memory limitations, etc.). Of course, the over-arching question is, with this sort of a relatively long timescale system and equities, what sort of SR is achievable?

Thanks for this incredible work Thomas ! I have eagerly begun playing around with it.

Do you know if it's possible to plot the weights assigned to each factor during each period as the algo runs?

Grant: Good point regarding garbage in garbage out. Ideally we'd have some simulated data to demonstrate it works. I think this should be pretty straight-forward to do in research.

Lex: Yes, the commented-out line 286 (log.debug(self.clf.feature_importances_)) prints it out, albeit without the names of the factors associated with the importances. Perhaps you can store them in a global var, combine them with the factor names, and then plot them. That would be a great contribution.
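A sketch of that pairing on synthetic data (the factor names here are hypothetical), attaching `feature_importances_` to names instead of bare indices:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Toy data: 3 named factors, only the first of which predicts the label.
rng = np.random.RandomState(0)
X = rng.randn(500, 3)
y = (X[:, 0] + 0.1 * rng.randn(500) > 0).astype(int)

factor_names = ['mean_reversion', 'momentum', 'sentiment']  # hypothetical
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair importances with the factor names before recording/plotting.
importances = dict(zip(factor_names, clf.feature_importances_))
print(importances)  # the predictive factor should dominate
```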

Aside from synthetic data & factors that would be expected to yield consistent, high SR, you'd also want to feed it noise to show that it doesn't extract a spurious signal from the noise (i.e. "make a silk purse out of a sow's ear," or "overfit" per the standard lingo).

One point of confusion here is with respect to The 101 Alphas Project which seems to suggest that if approached correctly, lots of sketchy, mini-alpha factors can be combined into a mega-alpha that will be viable. Is there any overlap between what you've done, and what is claimed in the paper (that many "faint and ephemeral" alphas can be combined advantageously)? Maybe with the right set of 101 alphas, your algo would be a winner?

Yes, the 101 Alphas Project would be a fantastic place to start for adding new alphas. This post is really about Alpha Combination, not Alpha Discovery.

@ Lex -

Regarding "Do you know if it's possible to plot the weights assigned to each factor during each period as the algo runs?": you can only plot 5 variables. However, a tedious hack would be to run the backtest N times, recording a different set of variables each time. Within the research platform, the recorded variables are accessible. For example, try (with your own backtest_id):

bt = get_backtest('584bf25a0ca16a64879c92f1')  
dir(bt)  
print bt.recorded_vars  

Once you get all of the factor weights versus time into the research platform, you could then plot them together.

One tweak to the Quantopian backtester would be to allow recording of a larger number of variables, but perhaps still only allow plotting 5 per backtest. Then, in one backtest, all variables of interest could be loaded into the backtester in one shot.

I can't help but feel something's not quite right here. I ran a single factor (fcfy) with only 2 constraints (market neutral & gross leverage), set to monthly rebalancing, a 20-day prediction horizon, and 20-percentile ranges. I was expecting results rather similar to my monthly long/short factor-rebalancing model for the same factor, but they are very different, even though the leverage still hovers around 0. Why would the results deviate so much if it is only being passed one factor?

Also, without significantly removing constraints, I find it difficult to achieve anything but flat-line results. Am I missing something here?

N_STOCKS_TO_TRADE = 200 # Will be split 50% long and 50% short  
ML_TRAINING_WINDOW = 60 # Number of days to train the classifier on, easy to run out of memory here  
PRED_N_FWD_DAYS = 20 # train on returns over N days into the future  
TRADE_FREQ = date_rules.month_start() # How often to trade, for daily, set to date_rules.every_day()  

Lex, there's not a lot to go on here. In general, the ML combines factors in a non-linear way, so it could end up completely different from the weighting you do. Perhaps remove that part or replace it with something linear like LinearSVC?

The memory limitation on the Q platform is quite serious when it comes to really delving into machine learning techniques. A data cube with the number of factors one needs to extract generalizations and keep forward-prediction variance down does not fit into the current memory limits. Additionally, the inability to serialize variables and models using the pickle library makes offline training impossible too.

I hope the Q team can see that, in order to support the ML field beyond conceptual token examples, the platform needs to significantly improve in functionality and performance!

Hi Thomas -

I'm wondering how to handle "magic numbers" in factors, particularly with respect to time scales. For example, say I defined a simple price mean reversion factor as:

np.mean(prices[-n:,:],axis=0)/prices[-1,:]  

At any point in time, there may be a sweet spot for n (and presumably, using alphalens, I could sort that out, and kinda ballpark an optimum). However, I could also code a whole set of factors, each with a different trailing window length for the mean, and then let the ML combine them. For example, I could write 8 separate factors, with:

np.mean(prices[-3:,:],axis=0)/prices[-1,:]  
np.mean(prices[-4:,:],axis=0)/prices[-1,:]  
np.mean(prices[-5:,:],axis=0)/prices[-1,:]  
np.mean(prices[-6:,:],axis=0)/prices[-1,:]  
np.mean(prices[-7:,:],axis=0)/prices[-1,:]  
np.mean(prices[-8:,:],axis=0)/prices[-1,:]  
np.mean(prices[-9:,:],axis=0)/prices[-1,:]  
np.mean(prices[-10:,:],axis=0)/prices[-1,:]  

This would seem to have the advantage that I've eliminated (or at least smoothed) a magic number factor setting. Also, perhaps the ML algorithm would then dynamically combine them versus time in an optimal way. Or would it just create a mess? If it would make sense, is there a more elegant way to code things, so that one does not have to write N identical factors, only differing by their settings (e.g. be able to call a single factor N times with different parameters)?
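On the last question, one way to avoid writing N identical factors, sketched with plain numpy rather than a pipeline CustomFactor: a factory function that closes over the window length. (All data below is synthetic.)

```python
import numpy as np

def make_mean_reversion(n):
    """Build a factor computing mean(prices[-n:]) / prices[-1]."""
    def factor(prices):
        return np.mean(prices[-n:], axis=0) / prices[-1]
    factor.__name__ = f'mean_reversion_{n}'
    return factor

# One comprehension instead of eight copy-pasted factor bodies.
factors = [make_mean_reversion(n) for n in range(3, 11)]

# Synthetic rising prices: 20 days x 4 stocks.
prices = np.cumprod(1 + 0.01 * np.ones((20, 4)), axis=0)
values = np.column_stack([f(prices) for f in factors])
print([f.__name__ for f in factors])
print(values.shape)  # one row per stock, one column per window length
```

The ML step can then be handed all eight columns at once and left to decide how to weight the different windows over time.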

On a separate note, does pipeline support CVXPY? Is it possible to write factors that include the application of CVXPY?

Regarding platform limitations, my two cents is that there is an odd unidirectional relationship between the research platform and the backtester. It seems that they could be better integrated/unified, so that data could pass freely between them, and backtests could be called from the research platform. It seems also that one should be able to output a set of data from the research platform, and it should be available to algos (both for backtesting and live trading). Trying to do everything online in the algo is limiting, but I guess in the end, you need stand-alone contributions to the Q fund; you can't have "managers" fiddling with them offline.

@Thomas - Thanks for continuing the machine learning thread here with working code...that is really useful! What is on the horizon next for this thread?

@Kamran - yes... without something like a powerful compute/memory nvidia-docker image or cluster to work with, you have no real way to compute massive regression-based or deep machine learning algorithms. Perhaps one could deploy already-learned models with Q's infrastructure, though?

Just finished reading a blog article that I found interesting:
https:[email protected]/deep-learning-the-stock-market-df853d139e02

Besides the detail involving the different ways to learn sequential information (Recurrent Neural Network, LSTM, etc.), I found the overall targets the author delineates useful, so I quote from that section of the blog article by @TalPerry:

In our case of “predicting the market”, we need to ask ourselves what exactly we want to market to predict? Some of the options that I thought about were:
1. Predict the next price for each of the 1000 stocks
2. Predict the value of some index (S&P, VIX etc) in the next n minutes.
3. Predict which of the stocks will move up by more than x% in the next n minutes
4. (My personal favorite) Predict which stocks will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.
5. (The one we’ll follow for the remainder of this article). Predict when the VIX will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.
1 and 2 are regression problems, where we have to predict an actual number instead of the likelihood of a specific event (like the letter n appearing or the market going up). Those are fine but not what I want to do.
3 and 4 are fairly similar, they both ask to predict an event (In technical jargon — a class label). An event could be the letter n appearing next or it could be Moved up 5% while not going down more than 3% in the last 10 minutes. The trade-off between 3 and 4 is that 3 is much more common and thus easier to learn about while 4 is more valuable as not only is it an indicator of profit but also has some constraint on risk.
5 is the one we’ll continue with for this article because it’s similar to 3 and 4 but has mechanics that are easier to follow. The VIX is sometimes called the Fear Index and it represents how volatile the stocks in the S&P500 are. It is derived by observing the implied volatility for specific options on each of the stocks in the index.

I like target number 4 also, yet wonder...
What do others think of these targets?
alan

@Thomas

I have been following with interest all your posts regarding factors combinations using ML. As I have been researching this topic for quite a while on my own, if ML turns out to be the best way to combine factors, then I would be very happy.

Now, regarding the algorithm you posted there are 2 critical bugs that prevent the ML factor to work:

  • Line 266: today.weekday is a method, so it should be today.weekday(). This bug results in the classifier being trained only once, at initialization time.
  • Line 279: that should be shift_mask_data(X, Y, fwd_days=PRED_N_FWD_DAYS); otherwise PRED_N_FWD_DAYS is bound to the upper_percentile argument.

The same bugs are present in the NB posted in ML part 2
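The second bug is the classic positional-argument trap. A stripped-down illustration (the signature below is hypothetical, simplified from the algorithm):

```python
# Hypothetical, simplified signature of the helper in the algorithm.
def shift_mask_data(X, Y, upper_percentile=70, lower_percentile=30,
                    fwd_days=1):
    return upper_percentile, fwd_days

PRED_N_FWD_DAYS = 20
X, Y = None, None

# Buggy call: 20 silently binds to upper_percentile; fwd_days stays 1.
bad = shift_mask_data(X, Y, PRED_N_FWD_DAYS)
# Fixed call: the keyword reaches the intended parameter.
good = shift_mask_data(X, Y, fwd_days=PRED_N_FWD_DAYS)
print(bad, good)  # (20, 1) (70, 20)
```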

Here is a fixed version of the algorithm. Unfortunately the results are even worse but I will get back to this in the next post.

Clone Algorithm
# Backtest ID: 58512ab66f9bae633455a3fb

@Luca: Thanks so much -- those are two critical bugs. I'll update the original code too.

@Thomas

As I would like to see evidence of the ML factor producing better results than other techniques, I created this NB where I test a very simple scenario: 3 alpha factors. I first run Alphalens on the 3 factors individually to see how they perform. Then I try the most basic factor combination I can imagine, that is, a linear combination with equal weighting: this is like averaging the effect of the 3 factors. Finally, I run Alphalens on the ML factor, trying different window_lengths/n_fwd_days combinations, to check if ML can beat the previous basic approach.

I would be very interested in hearing your thoughts on the results.

Please note that you can use this NB to run Alphalens on a large time span (I tried 14 years with a universe of 1000 securities) and the pipeline won't run out of memory, as I split the call to pipeline into many chunks.

Attached is a 6-year analysis.


Not sure if you're doing this, because I haven't had time to look at your code carefully, but beware of lookahead bias. You should use a different timeperiod to run alphalens on your factors to the timeperiod where you test the efficacy of the "simple" model. By the same token, the ML model should be trained using data from the first timeperiod and tested using the second timeperiod.

@Rudiger Lippert, thanks for your comment but that's not an issue. I applied the same approach Thomas used in his ML part 2.

@Luca: That's a very insightful analysis. It's clear that the ML does not come off very well here. Since the factors are already pretty good and linear, perhaps this isn't too surprising, as the ML would just muddle with that. So two experiments would be (1) to try a linear classifier, and (2) to add a nonsense factor which hopefully the ML would identify and ignore when making predictions. I have already tried experiment 1 by replacing the ML algo with logistic regression. However, it didn't do much to improve metrics like annualized IR. As another suggestion, I would condense the outputs and only store metrics like spread, IC and IR, then display a big comparison table at the end comparing all the individual experiments. The full tear-sheets are too difficult to compare directly.

I also took another stab at improving the code. Specifically, I'm now tracking past predictions to evaluate the alpha model directly on held out data. Just looking at returns is too far detached from checking if the model actually works, as Luca and Grant highlighted. Thus, we can now track all kinds of metrics like accuracy, log-loss and spread. Surprisingly, these look pretty good on the held-out data (accuracy ~55%). This raises the question of why the algorithm isn't doing better than it is. Possibly because the spread is still pretty small, or because there are still bugs. In any case, I also simplified the portfolio allocation to try and get the trading of the algorithm closer to what is suggested by the alpha model.

Ultimately, we really need to develop tools and methods to evaluate each stage independently to track where things go wrong.
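The held-out metrics mentioned above (accuracy, log-loss, spread) can be computed with standard tools; a sketch on made-up predictions and outcomes:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Hypothetical held-out record: earlier predicted probabilities versus
# the realized up/down labels once the forward returns are known.
rng = np.random.RandomState(1)
proba = rng.uniform(0.2, 0.8, size=200)
# Outcomes weakly consistent with the predictions (synthetic).
realized = (rng.uniform(size=200) < 0.1 + 0.8 * proba).astype(int)

acc = accuracy_score(realized, (proba > 0.5).astype(int))
ll = log_loss(realized, proba)
# "Spread": hit rate in the top decile minus the bottom decile.
order = np.argsort(proba)
spread = realized[order[-20:]].mean() - realized[order[:20]].mean()
print(acc, ll, spread)
```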

Clone Algorithm
# Backtest ID: 585291f24ab362625a438fc6

I'm not sure we're thinking about this problem correctly. It's not about whether "ML" works as means for alpha combination or not. Any technique that learns how to predict values from training data is technically ML, even if it is an extremely simple technique like linear regression.

I think it's more about the bias-variance tradeoff, which tells us that simpler models will do a better job of not fitting to random noise, but can be systematically biased if the output function isn't itself simple (like trying to fit a polynomial with a line). The problem of fitting to noise is particularly exacerbated by having very little data. The more complex model (like AdaBoost) tries to infer properties from limited examples and comes up with incorrect inferences (for example, if you had never seen a dog and you were given only 5 pictures of dogs, each containing a tree, you might infer that the tree is the dog; only by adding many more examples, where trees aren't present, do you have a chance to learn a better rule for identifying a dog).

I think the biggest problem with this new ML factor is that it's training on so little data. The model needs to see what happens when the market is in different conditions, or it will assume that the last month is completely representative of how markets behave. Ernie Chan commented in one of his books that financial data is so noisy that even if you have a lot of data, it makes sense to use very simple models or you're going to overfit. Steps like ranking features, ranking targets, changing the target into a classification problem, and using a linear combination of ranks are all attempts to tame the noise beast.

I wrote a bit about these kinds of issues in my blog post for my former company, if you're interested: https://www.factual.com/blog/the-wisdom-of-crowds
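The tradeoff described above is easy to demonstrate on synthetic data (everything below is made up): a noisy, nearly linear problem with a training set roughly the size of the algorithm's 21-day window, scored on a large held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(42)

def make_data(n):
    # One weak linear signal buried in heavy noise, plus 4 noise features.
    X = rng.randn(n, 5)
    y = (X[:, 0] + 2.0 * rng.randn(n) > 0).astype(int)
    return X, y

X_train, y_train = make_data(40)    # tiny training set
X_test, y_test = make_data(2000)    # large held-out set

lin_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
boost_acc = AdaBoostClassifier(
    n_estimators=100, random_state=0
).fit(X_train, y_train).score(X_test, y_test)

print('linear :', lin_acc)
print('boosted:', boost_acc)
```

With so few samples and so much noise, the more flexible boosted model has plenty of room to memorize the training set; comparing the two held-out scores across seeds is a quick way to feel out the tradeoff.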

Rudiger: All great points. It's definitely worthwhile to start with linear classifiers and some robust factors. Note that we're already ranking features and changing the target into a classification problem. I think Luca's experiments are actually compatible with your thoughts. The simplest linear classifier is an equal-weighted one. If something that is learned does worse, it means there's so much noise we can't even infer stable combinations.

Ideally I'd like to test with "fake" factors that have great predictive power (like today's return, introducing look-ahead bias essentially). This would really show the benefit (can add non-linearities and noise on top), and be good for debugging and making sure everything works.
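That check can be sketched directly (synthetic data throughout): a "cheating" factor that already contains the forward return should yield near-perfect held-out accuracy; if it doesn't, the plumbing is broken somewhere.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(0)
fwd_returns = rng.randn(1000)
X = np.column_stack([
    fwd_returns + 0.05 * rng.randn(1000),  # look-ahead "cheating" factor
    rng.randn(1000),                       # pure-noise factor
])
y = (fwd_returns > 0).astype(int)  # the label the factor already "knows"

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X[:800], y[:800])
print('held-out accuracy:', clf.score(X[800:], y[800:]))  # near 1.0
```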

Here's Thomas' recent revision (first post in thread, Backtest ID: 58517784ee8d8363d0d9790d), with:

predictions -= 0.5 # predictions are probabilities ranging from 0 to 1  
predictions *= -1 # flip sign of predictions, to see if return improves  

Not sure what it tells us, but I recall that a SR of as low as 0.7 could qualify for the Q fund. If it gets funded, I stake my claim to a share of the profits!

One interesting thing is that beta is on the high side. Would there be a way to mix in a hedging instrument (SPY?) to knock it down? I suspect that we may just be seeing the effect of a finite beta.

Clone Algorithm
# Backtest ID: 5853beaf97d940625309c809

Grant: That's funny ;). For some reason the original algorithm has negative beta; I'm not sure why yet. I suspect the returns here are also due to the higher beta.

Hi Thomas & all,

Yesterday, I'd posted some comments that were judged to be off the main topic here and deleted by a Quantopian moderator. I moved them to:

https://www.quantopian.com/posts/optimize-api-now-available-in-algorithms
https://www.quantopian.com/posts/side-comment-to-machine-learning-on-quantopian-part-3-building-an-algorithm

Grant

A potential problem with this algorithm is the ML technique you are applying. AdaBoost is a meta-algorithm which requires a base learner as an argument (see the sklearn documentation on this). If you do not pass in a base learner it defaults to a decision stump, which may not be optimal (or even useful) for predicting returns.
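Passing an explicit base learner is a one-liner. A minimal sketch of the interface (the choice of `max_depth=3` and the toy data are purely illustrative, not a recommendation):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Replace the default decision stump with a deeper base learner
base = DecisionTreeClassifier(max_depth=3)
clf = AdaBoostClassifier(base, n_estimators=50, random_state=0)

# Toy, linearly separable data just to exercise the interface
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]
clf.fit(X, y)
```

Any sklearn classifier supporting sample weights can be substituted for the tree, so it is cheap to experiment with different base learners.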

How can we fix the memory problem, please? Can we pay a premium to expand our memory restriction? Or is it possible to train the learner offline? What's the setup for parallel computing on this?

@Nicholas: Note that we're not predicting returns directly, but rather whether returns go up or down (relative to the others in the universe). Certainly this is a choice, but I think it's a good default to start with (although I would now start with linear classifiers to remove complexity). Having said that, it's certainly something to experiment with. What base learner do you think would be better suited?

@Jun: We hope to be able to just increase memory for everyone.

@Thomas,

I was thinking about how to make sure you account for 1) securities that go broke and are removed from the entire stock universe and/or 2) securities that exit your particular screen. I'm not 100% sure your shifting approach would account for them. I've attached a Notebook with how I would approach getting accurate returns for a specific screen (Q1500US for my example).

Anyway, not sure it would be possible in the algorithm code but something to consider.


It turns out the ML factor works after all. There was another annoying bug in shift_mask_data() where the factor results were not correctly aligned to the right returns (it's always the off-by-one errors). Because of that, the ML was trying to learn today's returns instead of tomorrow's returns.
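The alignment fix can be sketched in plain numpy. This is a simplified stand-in for shift_mask_data(), assuming 2-D (days × stocks) arrays and no masking:

```python
import numpy as np

PRED_N_FWD_DAYS = 1  # mirrors the algorithm's setting

def align_factors_and_returns(factors, returns, n_fwd_days=PRED_N_FWD_DAYS):
    """Pair each day's factor values with the return realized
    n_fwd_days later, so the model learns future (not current) returns."""
    X = factors[:-n_fwd_days]  # last n days have no known future return yet
    y = returns[n_fwd_days:]   # shift back so y[t] follows X[t] in time
    return X, y
```

Getting this shift wrong by one day silently turns "predict tomorrow" into "fit today", which is exactly the bug described above.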

So here is the backtest, but when I have some time I really want to try my NB above to see if ML really works.

Clone Algorithm (42)
# Backtest ID: 587ba634022d335e054d2a24

I'll get this posted in the next few days; work has been hectic. I have been running an SVM on the S&P 500 that is showing promise as well. I am thinking that an approach that equal-weights holdings from SVM, NB, and Ensemble might do well. Stay tuned.

ML combines the factors in a non-linear way. Does that help get rid of the problem of correlated factors (used in linear regression based factor models) to some extent?

Luca: Great catch, did you debug the function in a NB?

Bharath: In the case of correlated features, Random Forests will pick one feature to make predictions and then give the other a very low weighting. This should be taken into account when looking at feature importances. See e.g. http://blog.datadive.net/selecting-good-features-part-iii-random-forests/.
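A small sketch of that effect with synthetic data (the exact importance split will vary from run to run and between forest settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
x1 = rng.randn(500)
x2 = x1 + 0.01 * rng.randn(500)   # nearly a duplicate of x1
X = np.column_stack([x1, x2])
y = (x1 > 0).astype(int)          # the signal lives entirely in x1 (and thus x2)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# The forest divides the credit between the two redundant features more or
# less arbitrarily, so neither importance reflects the feature's standalone value.
print(clf.feature_importances_)
```

This is why a low importance score for a factor does not necessarily mean the factor is useless; it may simply be shadowed by a correlated sibling.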

I still don't get any good results in the NB, even after this bug fix, so I believe it was just luck that the backtest improved.

Thanks Thomas

We need Python pickle serialize/deserialize.

Hi Thomas/Q team -

Above, it is stated regarding memory limitations "We are acutely aware of this problem and are working hard to resolve it."

What is the status? What approaches are being considered? Timeline for implementation?

I've been playing with ML locally using zipline for some time now and I'm not sure 6 months of Daily data is going to do much in the way of ML. I've found considerably better results with 15-20k data points. This being said, we definitely need a way to take a model trained locally and then import it into quantopian. If this could be done on a daily basis (Train at home, upload, and have the Q script fetch it), the idea of using ML with Quantopian would make a lot more sense. Just my thoughts, maybe people have had better luck with smaller datasets?

@Visser Agree with you on the model import importance for Q platform. Might be hard for them to do for large user-base...don't know...
We, also, haven't gotten good ML results with a small number of days over a small number of factors...which actually feels right.
It should take a large number of days over a large number of factors to eke out an alpha that is not easily detected by human inspection.

When you say "15-20k data points", do you mean 1.5-2k time-points over 10 factors, or 15-20k time-points over one factor?
Thanks!
alan

I'll second Andy's point - I know it's been asked for before. A method for taking a model trained locally and uploading it and using it with Quantopian data. The computation-heavy part of machine learning is the training of the model. If memory limitations are the problem, let us train the model outside the platform and import it in.

@Alan - the models I'm playing with locally are trained using 750k lines across 25-30 factors, so between 18M and 22.5M data points, covering two years of data. More factors means more training data is needed to get a model that generalizes well, although I'm probably using more than I need (I have another 500k lines or so I'm using as a testing set). I agree that using a small number of factors over a short time is likely to yield poor results, or results that only apply to a specific situation.

Training those models can take hours (most take a few minutes though), but predictions are much quicker (usually seconds). Offloading the computation heavy part to a local machine and keeping the predictions, which are relatively easy, would be the least painful solution for everyone.

On the other hand, I'm not a computer scientist, so I have no idea whats going on around the back end. :)

I think part of the problem may be how this would fit with the Q fund concept of being able to time-stamp algos once an initial backtest has been run, and then re-running the backtest in 6 months out-of-sample for evaluation. Would the ML model be stale by then? How often would uploads be required to keep it fresh? And how would Q manage the uploads in evaluating the algos for the fund? Q is all about the Q fund, so one has to consider everything in that context.

It seems like one solution might be to super-charge the research platform, and then work out the mechanics of launching an algo with the ability for the user to push data from the research platform to the algo (or perhaps call an automated script tied to the algo).

@Alan 15-20k lines with 75 or so features each. Your idea of 750k to 1mil is probably where I should be though ;)

EDIT
Sorry, I meant @John Cook

@Grant, that's an idea. I'd love to be able to do it all with Quantopian.

@John - Thanks for the info...sounds great!
Have you used multi-time-scale learning yet, with fusion...or is your time scale all in seconds ?

Hi,

Rather than using ML to predict tomorrow's return, I want to predict today's return and see if prediction is greater than or less than actual return. Go long if prediction > today's return and short if prediction < today's return. Could someone help me do this if it is not too much to ask?

Best regards,
Pravin

@Pravin Bezwada: It seems easy. You need a simple comparison: create an array of today's returns (which you can calculate with a percent-change function) and an array of tomorrow's predictions, shifted by -1 so they line up with today's returns. After that, all you need to do is compare the two arrays. :) You can also do this with Pandas. Hope it helps.
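In pandas that comparison is only a few lines. A sketch, under the assumption that the predictions and realized returns are already aligned by day (all values are illustrative):

```python
import numpy as np
import pandas as pd

actual = pd.Series([0.010, -0.004, 0.007])     # today's realized returns
predicted = pd.Series([0.015, -0.010, 0.002])  # model's predictions for today
# Long where the prediction exceeds the realized return, short otherwise
position = pd.Series(np.where(predicted > actual, 1, -1))
```

The hard part is the alignment, not the comparison, which is what the pipeline-timing discussion below is about.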

Thanks Arshpreet Singh. That could work; although I wanted to regress today's factor on today's returns instead of yesterday's factors on today's returns.

@Pravin, there is this variable

PRED_N_FWD_DAYS = 1 # train on returns over N days into the future  

But keep in mind the following: pipeline is run every day after market close (precisely, before market open of the next day, but the data available to pipeline is the same), so after market close you can compute your latest factor values. However, since the returns are calculated as the % difference in open prices (because the algorithm enters/exits positions at market open), pipeline will only compute today's return (tomorrow's open / today's open) tomorrow after market close. This is why the algorithm discards today's pipeline factor and uses yesterday's pipeline factor with the pipeline returns calculated today (today's open / yesterday's open), which actually means running ML on yesterday's factor against yesterday's actual returns. This is what you get with PRED_N_FWD_DAYS = 1.

This is also the bug I reported above: the algorithm was intended to run ML on future returns, not current returns. That is, if you want to use 1-day future returns, the algorithm should use returns calculated today and factor values calculated 2 days ago (not yesterday).

Pravin Bezwada: Yes, that was a random guess. :)

Thanks Arshpreet.

@Luca, thanks for your valuable feedback. I now see how it works. Here is my attempt with ML. My take on this problem is that what happens tomorrow is anyone's guess. But you can always tell if today has been undervalued or overvalued. The attached algorithm has a few bugs and crashes but it is so far profitable with transaction costs and slippage. If someone could take a look and fix the bug, I can improve it further.

Clone Algorithm (87)
# Backtest ID: 58c9055040128517fc5d690e

@ Thomas/Quantopian support -

What is the status of this effort, in terms of releasing an API? Is that the goal? Is it still a work-in-progress on your end? Or is it complete, from your perspective?

The approach taken with the optimization API is nice, and presumably it will eventually be "released" and moved out of quantopian.experimental with the source code on github. Is that the game plan here, too? If so, what remains to be done?

Some issues I see:

  1. Memory constraints.
  2. Testing of ML on known (synthesized) datasets (to confirm its functionality independent of noisy, real-world factors).
  3. Encapsulation into an API with documentation and revision control on github.
  4. Other?

After further reviewing this algorithm, I realized that we pool the factors and returns across all securities for the prediction. I want to run multiple predictions, one per security, by regressing the time series of factors for each security against that security's returns. My problem is I don't know how to fetch a time series (say 60 days) of factor values per security. Sigh, pipeline is so hard.

@Grant: I see the same issues you do (especially memory). A lot of engineering work recently went into improving the infrastructure but hopefully we can work on removing these issues soon.

@Pravin: Correct. You don't need to change anything to pipeline, however, since it is already providing the ticker information. The classifier just ignores it. I recommend starting with the NB and static data: https://www.quantopian.com/posts/machine-learning-on-quantopian as it's much easier to intuit about what needs to be done.

@Luca: That logic seems correct to me. Returns will be open of day T-2 to open of day T-1, so really returns of day T-2. I'll fix the code in the various posts.

@Thomas: I'm glad to see this discussion is still active.

I have a question regarding the X values used as input for AdaBoostClassifier. When I was previously learning about AdaBoost, X would be the predicted results of the X classifiers and they'd be the same possible values as Y, i.e., 1 or -1. So for example, if there were three classifiers, in the case of the first training data item, you may have [-1, 1, 1], with the expected Y as [1].

In the case of your provided sample code, you're initially fitting the distributed ranking values (so they're between 0 and 1) as your X data.

My Question: Is there a point in the AdaBoostClassifier in which these X input values get classified into a 1 or -1? If so, where is that happening? Is that the job of the DecisionTreeClassifier?

Btw, this site has opened my eyes to quantitative trading. I really appreciate the vast amount of material available here.

Actually, never mind, I see what's going on. By default AdaBoostClassifier is using DecisionTreeClassifier as the base estimator.

@ Thomas -

I would add to the to-do list a re-consideration of your architecture that limits the ML to pipeline. My view is that a general ML API should accept alpha factors from both pipeline, and from the minutely algo. This doesn't mean that the ML would need to run during the trading day, but just that it would need to accept alpha factors derived from minute bars, for example. As it stands, I don't think there is a way to do that now, since one can't feed data into pipeline.

@Grant: The problem is still that pipeline only returns the most recent row, rather than the full history of factors required to train the classifier. I suppose we could somehow hack around that and store the history in a global variable and then train the classifier in before_trading_start(), or handle_data() if you want to collect minute data and feed that in as well.
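A minimal sketch of that global-variable workaround in plain Python. `on_pipeline_output` is a stand-in for code you would call daily from before_trading_start(), not Quantopian API; ML_TRAINING_WINDOW mirrors the algorithm's setting:

```python
from collections import deque

ML_TRAINING_WINDOW = 21
# Module-level ("global") rolling window of daily pipeline rows
factor_history = deque(maxlen=ML_TRAINING_WINDOW)

def on_pipeline_output(todays_factors):
    """Append today's pipeline row; return the full training window
    once enough days have accumulated, else None."""
    factor_history.append(todays_factors)
    if len(factor_history) < ML_TRAINING_WINDOW:
        return None  # not enough history yet to train
    return list(factor_history)  # oldest first, ready for the classifier
```

A deque with `maxlen` handles the eviction of stale days automatically, which keeps the memory footprint bounded.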

Ok, I understand where I was going wrong earlier.. Being new to both quantitative trading and machine learning, maybe it's just me, but I'll state what I was mixing up in case anyone else ran into the same thing.

I was mixing up the Pipeline factors with the AdaBoost classifiers. That is, I was thinking that the factor functions were the classifiers that AdaBoost was expected to use and improve upon. From a naive perspective, the factor functions look similar to what I would define as "weak learners."

But in reality, the weak learners (classifiers) are decision tree stump classifiers that are used by the AdaBoost classifier. The Pipeline factor ranks are the data fed to those weak learners (classifiers).

@ Thomas - If I understand correctly, within pipeline, we have access to the full history of factors to train the classifier, but one can't get pipeline to spit out trailing windows of factors. But you think globals will work? Maybe I'll try that at some point, just for yucks. Is there a time frame for upgrading pipeline to output multiple rows? Then you could elegantly park the ML outside pipeline to allow for combining pipeline and non-pipeline factors.

Good news! Since we doubled RAM in research and the backtester we can finally start running the ML workflow with longer lookback windows. In this backtest I run with a half-year window, training weekly over a 4 year period.

As you can see, the algorithm still does not perform magically better and my hunch is that the factors themselves do not have enough alpha in them, so I think exploring other factors in this framework would be a good next step.

Also, the next bottleneck that popped up is timeouts in before_trading_start. These happen if we use fundamentals (Morningstar), which unfortunately many factors rely on. I removed all of these from this backtest to make things work. There are plans to speed fundamentals up, alleviating this problem. Until then, there are many things that can be done in this framework already, including using factors based on other data sources.

Clone Algorithm (445)
# Backtest ID: 593128407ebbf351f09797f7

Sounds good thanks for the update!

Awesome!

Can someone explain why results can change between two backtests of the same algorithm with the same parameters? I am using AdaBoostClassifier with default parameters and there are two possible results. Is this an effect of the random state? How can I avoid it?
Thanks

Mat: I added a random seed now to the classifier instantiation. Can you see if the problem persists?

Clone Algorithm (445)
# Backtest ID: 594a7ce5230e5169ff2fe0cb

@Thomas, even with the random_state=1337 param set, there are still two results for my backtest with AdaBoostClassifier. I have no problem with GradientBoostingClassifier with default params, but the backtest result is less interesting.

@Mat: We have an open issue about this; I will try to fix it and post an update here.

You can check that self.clf.score(X, Y) between two backtest runs is not always the same. Maybe an effect of the default AdaBoostClassifier param algorithm=SAMME.R. For weekly training it does not seem to have any effect, but with yearly training you can see a huge effect on backtest returns.

@Mat - Are you sure that the results of ensemble.AdaBoostClassifier should be fully deterministic, even with random_state set?
I looked, yet couldn't resolve that question.
One could make a case for a slight bit of non-determinism based on the base classifier being a two-leaf tree (e.g., one if-statement), and the choice flipping one way or the other for "equal" values... though I'm not sure about that argument.

@Mat - From what I can tell googling, there was a problem with random_state in late 2016, which was fixed in the sklearn 0.18 release.
See: http://scikit-learn.org/stable/whats_new.html

Fix bug where ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor would perform poorly if the random_state was fixed (#7411). By Joel Nothman.

Fix bug in ensembles with randomization where the ensemble would not set random_state on base estimators in a pipeline or similar nesting (#7411). Note, results for ensemble.BaggingClassifier, ensemble.BaggingRegressor, ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor will now differ from previous versions. By Joel Nothman.

@Thomas - I can't tell what version of sklearn is being used anymore, so I don't know if the bug is fixed on the Quantopian Research Platform:

import sklearn  
print('The scikit-learn version is {}.'.format(sklearn.__version__))

produces

SecurityViolation: 0002 Security Violation(s): Accessing
sklearn.__version__ raised an AttributeError. No attributes with a
similar name were found.

alan

@alan: We run 0.16.1 (quite old), so this definitely seems like it could be the cause; we really need to upgrade. I think with the same seed it should behave reliably.

So it sounds like the bug is specific to AdaBoostClassifier, in which case we can instantly fix it by switching to e.g. RandomForestClassifier.

@Mat - Based on my understanding, AdaBoostClassifier uses the DecisionTreeClassifier by default, which produces "best" splits (using gini computation). I believe that in some cases, there are several splits that result in the same gini score. It's possible that, in such a case, the DecisionTreeClassifier randomly picks which split to use.

I found this: https://github.com/scikit-learn/scikit-learn/issues/2386, which seems to support the theory.

Sam

@Sam: Even if it's random, it should be deterministic with a set seed.

Thank you all for your help. I switched to RandomForestClassifier(n_estimators=300, random_state=10), which is deterministic but very sensitive to the training period. With a 5-factor model I have to fit the model on only a few months at the beginning of each year to find good results; maybe the classifier finds some factor relations that don't exist the rest of the year. Difficult to explain.

I understand that we're doing ranking on the features/alpha factors.
Is it possible to deploy a sector-neutralization technique at the signal level?
==> focusing the algorithm on the relative factor score and stock performance within each sector

E.g., the outperformers might have negative forward one-month returns as long as they outperform relative to their sectors.

Can you show me how to do the above, please?
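One common way to neutralize a signal at the sector level is to demean each stock's score within its sector, so only performance relative to the sector remains. A minimal pandas sketch (column names and values are illustrative, not from the original algorithm):

```python
import pandas as pd

df = pd.DataFrame({
    'sector': ['tech', 'tech', 'energy', 'energy'],
    'score':  [0.8, 0.2, -0.1, -0.5],
})
# Subtract the sector mean from each stock's score
df['sector_neutral'] = df['score'] - df.groupby('sector')['score'].transform('mean')
```

With this transform a stock can carry a positive score even in a sector whose absolute returns are all negative, which is exactly the relative-outperformance behavior described above.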

I noticed today on LinkedIn a nice post by Harish Davarajin at DeepValue on this subject. He looks at the combination of 32 alphas from the 101 Alphas publication using various classifiers in sklearn (all of which are available on Quantopian). He finds the best results with the SVM classifier (sklearn.svm.SVC) at default settings (e.g., radial basis kernel).

The post is here.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

I will be posting improvements shortly:

Thanks for the giggle: "random_state=1337"

@jacob: Great, eager to see improvements to this workflow. Re random_state: ;)

Hi Jonathan/Thomas -

Regarding the post by Harish Davarajin at DeepValue, is it something that could be done on Quantopian? If so, is there an outline you could provide of how to approach it?

I've taken an interest in learning some machine learning (perhaps as I age, I can augment my feeble mind with AI!), so perhaps this would be a good entry point.

Thanks,

Grant "Off-White Seal" Kiehne

Has anyone modified this framework to train on intraday (open to close) returns?

@jacob shrum come on then :D

If the entire market is in a bull run and all returns are positive for whatever universe we select, will this algorithm still short the worst performers?

@Thomas when do you think you might have a fix for the Morningstar related timeouts?

From what I can see in backtesting, pipeline appears to batch multiple days of factor calculations, and if the arbitrary batching breaches the before_trading_start timeout of 5 minutes, the whole thing fails. I wish there were a way for us to control how the batching gets done; even if we are sure that one day of factor calculations only takes, say, 100 secs, because of the arbitrary batching we seem to get seemingly random timeouts.

The example below resulted in a timeout even though the before_trading pipeline itself ran in under 250 secs for a 6-day batch job.

2017-01-03 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-03 08:45:00-05:00
2017-01-03 21:45 compute:482 INFO weekday 2017-01-03 00:00:00+00:00 1
2017-01-03 21:45 compute:484 INFO 2017-01-03 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.717320 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-04 00:00:00+00:00 2
2017-01-03 21:45 compute:484 INFO 2017-01-04 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.708877 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-05 00:00:00+00:00 3
2017-01-03 21:45 compute:484 INFO 2017-01-05 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.675035 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-06 00:00:00+00:00 4
2017-01-03 21:45 compute:484 INFO 2017-01-06 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.717600 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-09 00:00:00+00:00 0
2017-01-03 21:45 compute:484 INFO 2017-01-09 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.901481 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-10 00:00:00+00:00 1
2017-01-03 21:45 compute:484 INFO 2017-01-10 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.795630 secs
2017-01-03 21:45 before_trading_start:1089 INFO Time to run pipeline 246.63 secs
2017-01-03 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-03 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-03 21:45 WARN sklearn/cross_validation.py:417: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
2017-01-03 21:45 WARN sklearn/cross_validation.py:417: Warning: The least populated class in y has only 2 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
2017-01-03 21:45 WARN Logging limit exceeded; some messages discarded
2017-01-04 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-04 08:45:00-05:00
2017-01-04 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-04 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-04 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-04 21:45 before_trading_start:1151 INFO Time to run optimizer 33.54 secs
2017-01-04 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 2.0, maxweight : 0.005, -0.005
2017-01-05 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-05 08:45:00-05:00
2017-01-05 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-05 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-05 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-05 21:45 before_trading_start:1151 INFO Time to run optimizer 30.15 secs
2017-01-05 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 2.0, maxweight : 0.005, -0.005
2017-01-06 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-06 08:45:00-05:00
2017-01-06 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-06 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-06 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-06 21:45 before_trading_start:1151 INFO Time to run optimizer 29.93 secs
2017-01-06 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 1.99999999999, maxweight : 0.005, -0.00499999999999
2017-01-09 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-09 08:45:00-05:00
2017-01-09 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-09 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-09 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-09 21:45 before_trading_start:1151 INFO Time to run optimizer 32.78 secs
2017-01-09 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 2.0, maxweight : 0.005, -0.005
2017-01-10 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-10 08:45:00-05:00
2017-01-10 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-10 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-10 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-10 21:45 before_trading_start:1151 INFO Time to run optimizer 30.61 secs
2017-01-10 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 1.99999999998, maxweight : 0.00499999999998, -0.00499999999997
2017-01-11 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-11 08:45:00-05:00
End of logs.

This has been a wonderful example to learn from Thomas, thank you.

Is there a way to make the factors from make_factors() "window safe"? I find that if I remove .rank(mask=universe) from where the factors are instantiated, I get a NonWindowSafeInput ("Can't compute windowed expression") error.

    # Rather than Instantiate ranked factors, just get the actual values  
    # This works in notebook research environment but fails in the algorithm arena  
    for name, f in factors.iteritems():  
        #factors_pipe[name] = f().rank(mask=universe)  
        factors_pipe[name] = f()  

Which, honestly, I'm a bit confused by, because I would have thought that before .rank() could run, the underlying factor would first have to be evaluated.

Hi Thomas -

I'm following up on your comment above:

@Grant: The problem is still that pipeline only returns the most recent row, rather than the full history of factors required to train the classifier. I suppose we could somehow hack around that and store the history in a global variable and then train the classifier in before_trading_start(), or handle_data() if you want to collect minute data and feed that in as well.

Has there been any progress on upgrading Pipeline to return the full history of factors? In terms of a more generic architecture, my thinking is that within before_trading_start() (where we have access to a 5-minute compute window...well sort of, since Pipeline chunking can consume a lot of it), one would run the ML code, combining both Pipeline-derived alphas and alphas computed using minute data. Additionally, this would allow control over when the ML step is run (you are running it every day, correct?). Unfortunately, it could be a bit of a kludge, since I'm not sure schedule_function() is compatible with before_trading_start() so some sort of flag would be required.

Regarding the idea of using globals: I see them used pretty much willy-nilly in examples from Quantopian, but I thought they were bad practice in writing software. Are they confined to the sandboxed algo API, so it's copacetic? Or is there a risk that globals of the same name are used elsewhere in your software, creating a conflict?

Overall, what is the Q priority on this ML stuff? Is the code you published above running to your satisfaction? What are the next steps? Or is it on the back burner?

@Pumpelrod: Instead of .rank(), use .zscore().
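For anyone wondering why that works: both rank() and zscore() produce cross-sectionally normalized outputs, which is what lets Pipeline mark them window-safe as inputs to a downstream windowed factor like the ML CustomFactor. A minimal numpy sketch (made-up values) of the transform .zscore() applies each day:

```python
import numpy as np

def cross_sectional_zscore(values):
    """Z-score one day's factor values across assets, mimicking the
    cross-sectional normalization Pipeline's .zscore() applies."""
    return (values - np.nanmean(values)) / np.nanstd(values)

# Made-up one-day factor row for five assets
row = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
z = cross_sectional_zscore(row)
print(z.mean(), z.std())  # ~0.0 and ~1.0: scale-free, hence window-safe
```

In the loop quoted above, that would presumably become factors_pipe[name] = f().zscore(mask=universe).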

@Grant: Good questions. I agree that having access to pipeline output elsewhere is likely the way to go. Training does not happen daily, however. There is actually meaningful progress on this workflow, and I will post an update soon.

Global variables might not be an unreasonable stop-gap solution, even though they're not good practice. The algo is sandboxed, so there is no concern about overwriting some other variable.
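One way to avoid globals entirely is to hang a rolling buffer of pipeline outputs off the context object. Here's a self-contained sketch of that idea (the Context class and the before_trading_start signature are stand-ins for the real Quantopian ones, and pipeline_output() is simulated with a plain DataFrame):

```python
from collections import deque
import pandas as pd

ML_TRAINING_WINDOW = 21  # from the template's settings

class Context(object):
    """Stand-in for Quantopian's context object (hypothetical)."""
    pass

def initialize(context):
    # Rolling buffer of daily pipeline outputs; attaching it to
    # context avoids module-level globals entirely.
    context.factor_history = deque(maxlen=ML_TRAINING_WINDOW)

def before_trading_start(context, todays_factors):
    # todays_factors stands in for pipeline_output(...): a one-day
    # DataFrame of factor values per asset (real signature differs).
    context.factor_history.append(todays_factors)
    if len(context.factor_history) == ML_TRAINING_WINDOW:
        # Full training window available: stack it and (re)train here.
        return pd.concat(list(context.factor_history))
    return None

# Simulated run: after 21 "days" the full training panel is available.
ctx = Context()
initialize(ctx)
panel = None
for day in range(25):
    panel = before_trading_start(ctx, pd.DataFrame({"factor_1": [0.1]}))
print(panel.shape)  # (21, 1) -- the deque caps history at the window length
```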

I recently breezed through a machine learning/deep learning book:

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
by Aurélien Géron

It does cover AdaBoost, but later on, in the TensorFlow section, there is a chapter on recurrent neural networks, which are presented as particularly well-suited to time-series problems. Presumably RNNs are used in quantitative trading, but maybe that's beyond the scope here.

As I'd pointed out earlier, I'd recommend figuring out how to synthesize an input data set with a known output, to see if your ML code is actually working. For example, you could make up a bunch of factors that, when combined simplistically (e.g. linear combination, equal weights), gives a good result. Then, you could turn on the ML to see if it improves the result. Otherwise, if your factors are sketchy, volatile pieces of junk, you won't be able to test the potential benefit of ML.
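A bare-bones version of that sanity check, using only numpy and plain least squares in place of the AdaBoost classifier (all sizes and weights here are made up): construct returns as a known equal-weight combination of the factors plus noise, then verify the fit recovers the weights and achieves predictive power that junk factors cannot match:

```python
import numpy as np

rng = np.random.RandomState(0)
n_days, n_factors = 500, 4

# Synthetic factor panel: returns are a known equal-weight combination
# of the factors plus noise (look-ahead by construction -- that's the point).
factors = rng.randn(n_days, n_factors)
true_weights = np.ones(n_factors) / n_factors
returns = factors @ true_weights + 0.1 * rng.randn(n_days)

# "ML" step: ordinary least squares should approximately recover the weights.
fitted, *_ = np.linalg.lstsq(factors, returns, rcond=None)
print(np.round(fitted, 2))  # each weight close to 0.25

# Junk factors for comparison: their fitted weights carry no signal.
junk = rng.randn(n_days, n_factors)
junk_fit, *_ = np.linalg.lstsq(junk, returns, rcond=None)
corr_real = np.corrcoef(factors @ fitted, returns)[0, 1]
corr_junk = np.corrcoef(junk @ junk_fit, returns)[0, 1]
print(corr_real > corr_junk)  # real factors predict; junk does not
```

If the ML step can't beat the naive equal-weight combination even on data this clean, the problem is in the plumbing, not the factors.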

@Grant: You have your finger on the pulse: RNNs (specifically LSTMs) are very interesting in that regard. Check out this blog post: http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction

Some unit tests around the ML framework would also be very helpful; the code complexity is pretty high.

By the way, I'm sure you've thought of this, but if I were y'all, I'd set up a rig in your office with access to all of the Q data, and tools, plus whatever state-of-the-art ML tools are out there, and get a whip-smart Boston-area intern or starving grad student to play around with it. Presumably, the licensing terms with your data vendors would allow this. I bet for $5-10K, you could build a little supercomputer to play with. There have been numerous requests for high-end deep learning on Q, so you could gather support that it would be beneficial (or show that it stinks) and start to understand how to implement it (or drop it).

@Grant: I admire your thinking. The hardest part of ML for me is the feature engineering needed to make any ML model run productively and predictably (random state 1337 notwithstanding). That presumes the ability to aggregate and store data to be resampled into train-test splits, continually feeding the ML for forward predictions. Sessions on Q are transient batches of backtests and live paper trading; there is no meaningful way to retrieve the trading data (heaven forbid manually copying it from the log exhaust) for ex post analyses. I'm running both a DNN and an RFC for ASX trading here, and they continually generate, aggregate, and re-learn the data that the algorithm produces before, during, and after every trade. Of course, it is entirely possible that I'm doing it all in an antipodean way.

@ Thomas -

Regarding your comment above that some unit tests around the ML framework would be in order, how would you propose this be done? Presumably one would need to use the Q research platform and paste chunks of your code into it. Ideally, it would be nice to test the entire algo with controllable inputs. One could imagine Q devising and releasing a single toy data set for this purpose, from which multiple factors could be created. Using the magic of look-ahead bias, one could generate the toy data set from actual stock market data, controlling the degree of predictability of the data set. The standard, configurable toy data set would then be accessible from both the research platform and the backtester. It seems like the toy data set would drop right into the existing Q framework, no?
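A sketch of that "magic of look-ahead bias" idea (pure numpy, all numbers made up): blend tomorrow's returns into a factor through a single predictability knob, and verify the knob controls the resulting information coefficient:

```python
import numpy as np

def make_toy_factor(future_returns, predictability, rng):
    """Blend tomorrow's (look-ahead) returns with noise.
    predictability=1.0 -> perfectly prophetic factor;
    predictability=0.0 -> pure noise."""
    noise = rng.randn(len(future_returns))
    return predictability * future_returns + (1 - predictability) * noise

rng = np.random.RandomState(42)
future_returns = rng.randn(1000)

weak = make_toy_factor(future_returns, 0.1, rng)
strong = make_toy_factor(future_returns, 0.9, rng)

ic_weak = np.corrcoef(weak, future_returns)[0, 1]
ic_strong = np.corrcoef(strong, future_returns)[0, 1]
print(ic_weak < ic_strong)  # the knob controls the information coefficient
```

Feeding factors like these through the full algo would let you dial the signal up or down and check that the ML layer responds as expected.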

Thanks for the LSTM reference; I'll need to keep climbing the ML learning curve. Intuitively, it seems that if one has reasonable stationarity, one can learn from the recent past without any fanciness. By analogy, identifying pictures containing ducks will be the same problem today as it will be in a year. However, if in a year there is no such thing as a duck, but we are using a 1-year look-back, it won't work. I'd seen some comments on Q that, for the market, certain time-dependent ML techniques should be used. Is that what you are working to implement on Q?

Also, regarding your comment that "Training does not happen daily": it appears you are using this code within class ML(CustomFactor):

if (not self.init) or (today.weekday() == 0): # Monday  

I suggest sorting out how to move this control into a scheduled function, rather than burying it in the code, if you are developing a general framework/template/workflow. For example, to run the ML on Monday before trading starts, a function scheduled on Friday to set a flag would do the trick.
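A minimal pure-Python sketch of that flag mechanic (function names here are made up; in the real algorithm the Friday function would presumably be registered via schedule_function() with a date_rules argument):

```python
def should_retrain(today, initialized):
    """Mirror of the condition buried in ML(): retrain on the very
    first run, and every Monday thereafter."""
    return (not initialized) or today.weekday() == 0  # 0 == Monday

# Flag-based alternative: a scheduled Friday function sets the flag,
# and before_trading_start() consumes it on the next trading day.
class Context(object):
    retrain = False

def friday_scheduled_function(context):
    # In the algo this would be hooked up via schedule_function(...)
    context.retrain = True

def before_trading_start(context):
    if context.retrain:
        context.retrain = False  # consume the flag
        return "retrained"
    return "skipped"
```

This keeps the retraining cadence visible at the top of the algorithm instead of inside the CustomFactor's compute method.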

Also, I'm wondering if you could have a control for how much influence the ML has in setting the relative factor weights in the alpha combination step? For example, one could have:

alpha_combined = a*alpha_naive + (1-a)*alpha_ML  

By setting a, one could control the influence of the ML relative to a naive alpha combination (e.g. equal weights). In ML lingo, I gather a would be called a hyper-parameter, to be optimized as part of development.
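A tiny numpy sketch of that blend (the function name and example alpha vectors are made up; the renormalization step is one plausible choice, not the only one):

```python
import numpy as np

def blend_alphas(alpha_naive, alpha_ml, a):
    """Convex blend of a naive (e.g. equal-weight) alpha combination
    and the ML-derived one; a=1.0 ignores the ML output entirely."""
    combined = a * alpha_naive + (1 - a) * alpha_ml
    # Demean (keeps the blend dollar-neutral) and scale to unit
    # gross exposure so different values of a stay comparable.
    combined = combined - combined.mean()
    return combined / np.abs(combined).sum()

# Hypothetical alphas for four assets
alpha_naive = np.array([0.5, 0.1, -0.2, -0.4])
alpha_ml = np.array([0.3, -0.3, 0.2, -0.2])
weights = blend_alphas(alpha_naive, alpha_ml, 0.5)
print(weights)  # sums to ~0, absolute values sum to 1
```

Sweeping a from 0 to 1 in a backtest would give a direct read on how much the ML layer is actually adding over the naive combination.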

Another free tip: although you do a great job of commenting your code, some accompanying schematics (flow charts, swim lanes, state diagrams, etc.) would be really helpful.