Machine Learning on Quantopian Part 3: Building an Algorithm

This is the third part of our series on Machine Learning on Quantopian. Most of the code is borrowed from Part 1, which showed how to train a model on static data, and Part 2, which showed how to train a model in an online fashion. Both of those were research notebooks, so they weren't functional algorithms. I highly recommend reading them first, as it will make the code here much clearer.

It was pleasantly easy to copy over the code from research to make a functional algorithm. The new Optimization API made the portfolio construction and trade execution part very simple. Thus, with a few lines of code we have an algorithm with the following desirable properties:

• Uses Machine Learning on a Factor-based workflow.
• Retrains the model periodically.
• Trades a large universe of stocks, using the Q1500US universe (a subset of 1,000 stocks is used here).
• Beta-neutral by going long-short.
• Sector-neutral due to new optimization API.
• Sets strict limits on maximum weight of any individual stock.

I also tried to make this algorithm template-like. If you clone it, you should be able to very easily put in your own alpha factors and they will be automatically picked up and incorporated. You can also configure this algorithm with a few prominent high-level settings:

N_STOCKS_TO_TRADE = 1000 # Will be split 50% long and 50% short
ML_TRAINING_WINDOW = 250 # Number of days to train the classifier on, easy to run out of memory here
PRED_N_FWD_DAYS = 1 # train on returns over N days into the future
TRADE_FREQ = date_rules.week_start() # How often to trade, for daily, set to date_rules.every_day()


However, this is definitely not the be-all-end-all algorithm. There are still many missing pieces and some rough edges:

• Ideally, we could train on a longer window, like 6 months. But it's very easy to run out of memory here. We are acutely aware of this problem and are working hard to resolve it. Until then, we have to make do with being limited in this regard.
• As you can see, performance is not great. I actually think this is quite honest. The alpha factors I included are all widely known and can probably not be expected to carry a significant amount of alpha. No ML fanciness will convert a non-predictable signal into a predictable one. I also noticed that it is very hard to get good performance out of this. That, however, is a good thing from my perspective. It means that because we are making so many bets, it is very difficult to overfit this strategy by making a few lucrative bets that pay off big-time but can't be expected to recur out-of-sample.
• I deactivated all trading costs to focus on the pure performance. Obviously, costs would need to be taken into account to create a viable strategy.
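One practical mitigation for the training-window memory limit, sketched here as a hypothetical suggestion (not something the algorithm above does): downcast the factor history to float32 before handing it to the classifier, which halves the footprint of the training matrix.

```python
import numpy as np

# Hypothetical mitigation: downcast the training matrix from float64 to
# float32, roughly halving its memory footprint. sklearn estimators
# accept float32 input. Dimensions below mirror the settings above
# (250 training days, 1000 stocks, ~10 factors).
n_days, n_stocks, n_factors = 250, 1000, 10
X = np.zeros((n_days * n_stocks, n_factors))  # float64 by default
X32 = X.astype(np.float32)

print(X.nbytes // X32.nbytes)  # → 2
```

The precision loss from float32 is negligible for ranked factor inputs, so this is usually a free win when memory is the binding constraint.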

As I said, there is still some work required and we will keep pushing the boundaries. Until then, please check this out, improve it, and poke holes in it.

[Backtest attached. Backtest ID: 59f06653a7091643c96928d7]


Here is the tear-sheet.


This is great news! I'm excited to take a look at your code and give it a try.

Bummer about the memory restrictions though. That's a very limited amount of training data and I'm skeptical that we will be able to train a good model with so little data.

Out of curiosity, what were the changes you required of the infrastructure code for algorithms in order to be able to finally publish this?

@Rudiger: Yes, reducing memory constraints is definitely a top concern for us right now. There's also lots of optimizations we can do for fundamentals.

Scott Sanderson ultimately unlocked the workflow in its current form. Mainly there were some pipeline improvements. I'll let him chime in, but I think the main one was that pipeline was caching data for all pipeline nodes, even those that were no longer needed: https://github.com/quantopian/zipline/pull/1484

Thanks Thomas -

You describe the predictions as "probabilities ranging from 0 to 1." Is that before or after you perform this operation:

predictions -= 0.5


Presumably, you are shifting the mean to be centered around 0, so that predictions ranges from -0.5 to 0.5 as an input to objective = opt.MaximizeAlpha(predictions) (which requires a signed input, even though it has a market_neutral constraint)?

Are the predictions the relative forecast returns, by stock in the universe? The stock with the highest prediction value is the surest bet for a long allocation, and the stock with the lowest prediction value is the surest bet for a short allocation?

You are generating the alpha factors and applying the ML (alpha combination) using Q1500US(), which, as I understand, has sector diversification (see the help page, "...choosing no more than 30% of the assets from any single market sector"). But then, within the optimizer, a constraint is applied:

sector_neutral = opt.NetPartitionExposure.with_equal_bounds(
    labels=context.risk_factors.Sector.dropna(),
    min=-0.0001,
    max=0.0001,
)


I'm wondering if this is the right way to approach things? Wouldn't you want to start with a sector-diverse universe and let things play out? It seems like with the sector_neutral constraint, you are a priori taking it as beneficial to have sector neutrality over a potentially higher SR without the constraint. I guess the idea is that you can reduce "sector risk" but it seems like you'll end up undoing any advantageous sector tilt that might result from the prior steps. Or maybe in the world of long-short investing, it is well-established that sector tilt should be avoided?

As a structural consideration, I'd look at moving the optimization step into before_trading_start. All of your alpha factors are based on trailing daily bars, so intraday, you are only using the current portfolio state information in the optimizer. I guess the idea is that you'll do a better job of optimizing if the portfolio state is fresh (overnight gaps/drift could be significant). Or maybe run the alpha combination step intraday on daily bars plus the current minutely bars (or a daily VWAP up to the current minutely time)? Given the time scale of the algo, it might be just as well, and tidier, to do everything in before_trading_start and then, before the market opens, hand off the new portfolio weights to the O/EMS system that you are presumably developing. It just seems kinda muddled structurally to run the alpha combination before the market open on trailing daily bars, and then to do the optimization intraday.

Also, how are you approaching combining individual algos into the fund? I'd think each one would be a kind of alpha factor, right? So if you'll be combining them on a daily basis, you'll need all of them to report in before the market opens, right? You have schedule_function(my_rebalance, TRADE_FREQ, time_rules.market_open(minutes=10)), but there is nothing to prevent users from fiddling with the execution time, and then you'll have all of the algos spitting out portfolio updates willy-nilly, which would not seem to be the best scenario for combining the algos with your O/EMS system.

As a side note, you posted an example algo that requires a paid subscription to a data set over the period of the backtest (see https://www.quantopian.com/data/zacks/earnings_surprises ). It mucks up the paradigm of anyone being able to clone your algo, fiddle with it, and then posting a 1:1 comparison.

Hi Grant,

Presumably, you are shifting the mean to be centered around 0, so that predictions ranges from -0.5 to 0.5 as an input to objective = opt.MaximizeAlpha(predictions) (which requires a signed input, even though it has a market_neutral constraint)?

Exactly. Probabilities of < 0.5 indicate the stock dropping and will be converted to a short-signal after subtracting 0.5.
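The centering step can be shown with illustrative numbers (made up here, not from the algorithm's output):

```python
import numpy as np

# Illustrative numbers only: classifier probabilities P(up) in [0, 1]
# become a signed alpha in [-0.5, 0.5]; negative values are short signals.
probs = np.array([0.8, 0.5, 0.2])  # hypothetical predict_proba outputs
predictions = probs - 0.5          # → [0.3, 0.0, -0.3]
```

A probability of exactly 0.5 maps to zero alpha, so the optimizer has no incentive to allocate to that stock in either direction.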

Re Sectors: Yes, this is the same logic as being market-neutral. The optimization will go long/short equally within each sector to achieve sector-neutrality. The logic is the same: we do not try to predict market or sector movements here, so it's a good idea to remove that source of risk. If you had a model that accurately predicted how sectors move, you might of course want to introduce some sector tilt.
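The effect of the sector-neutrality constraint can be illustrated outside the optimizer by demeaning the signal within each sector. This is a sketch with made-up numbers, not the actual NetPartitionExposure implementation:

```python
import numpy as np
import pandas as pd

# Toy example: 4 stocks in 2 sectors, hypothetical alpha values.
alpha = pd.Series([0.3, -0.1, 0.2, -0.4], index=list('ABCD'))
sector = pd.Series(['Tech', 'Tech', 'Energy', 'Energy'], index=list('ABCD'))

# Subtract each sector's mean alpha, then normalize to unit gross leverage.
demeaned = alpha - alpha.groupby(sector).transform('mean')
weights = demeaned / demeaned.abs().sum()

# Each sector's net exposure is now ~0, mirroring what the optimizer
# constraint enforces.
print(weights.groupby(sector).sum())
```

The optimizer solves this jointly with the other constraints rather than demeaning directly, but the resulting exposures look the same: net weight per sector pinned to (nearly) zero.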

Re moving portfolio construction into before_trading_start: Not a bad idea, we could definitely do that, although I'm not sure there would be a lot of upside. The function right now does portfolio construction and execution simultaneously, which is convenient. But yeah, as we make portfolio construction more complex, that could make sense.

Extracting alpha factors is still ongoing research. Having the portfolio construction happen in before_trading_start might help with that. But one could also imagine us wanting to do our construction based on pure alphas in that approach.

Re earnings surprises: Thanks for alerting me to that, I wasn't aware. I'll update the algo to comment it out.

Thanks Thomas -

Personally, my biggest gap is with respect to the basics of ML. If there is a concise primer on the exact flavor you are using, it would be helpful (without having to read Python code line-by-line, which in the end is good, but doesn't give the big picture). Maybe I can get an article onto my Kindle and eventually educate myself on your modern computational voodoo.

By the way, as I mentioned to Scott, it'd be nice to run the optimizer on an expanding trailing window, and then combine the results, for smoothing. I think this would amount to running the alpha combination step N times per day, storing the predictions, running the optimizer on each prediction, and then combining the results in some fashion (e.g. a weighted average, weighted by expected return). Effectively, I think this means in before_trading_start, you'd have to be able to call the "alpha combiner" with a parameter specifying the trailing window size. Is this feasible?
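The combination step being proposed might look like the following sketch (the function name and weighting scheme are mine, purely illustrative):

```python
import numpy as np

# Hypothetical helper: combine per-asset predictions produced from several
# trailing window lengths via a weighted average.
def combine_predictions(pred_by_window, weights=None):
    """pred_by_window: dict of window_length -> per-asset prediction array.
    weights: optional per-window weights (e.g. expected return); defaults
    to equal weighting."""
    preds = np.array(list(pred_by_window.values()), dtype=float)
    if weights is None:
        weights = np.ones(len(preds))
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * preds).sum(axis=0) / weights.sum()

# e.g. equal-weight a 10-day and a 20-day model's predictions:
combined = combine_predictions({10: [1.0, 0.0], 20: [0.0, 1.0]})
```

Whether the smoothing helps would depend on how correlated the different window lengths' predictions are; if they are nearly identical, averaging buys little.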

Your comment makes intuitive sense:

As you can see, performance is not great. I actually think this is quite honest. The alpha factors I included are all widely known and can probably not be expected to carry a significant amount of alpha. No ML fanciness will convert a non-predictable signal into a predictable one. I also noticed that it is very hard to get good performance out of this. That, however, is a good thing in my perspective. It means that because we are making so many bets, it is very difficult to overfit this strategy by making a few lucrative bets that pay off big-time but can't be expected to re-occur out-of-sample.

The issue I see is that with the set of factors and real-world data you used above, you can kinda debug the prototype tool, answering questions like:

• How many factors can it handle?
• Will it produce sensible, stable results, or go wildly off the rails?
• Can a useful template be provided to the end-user, with a set of known real-world factors?

The problem, though, is that I think it is gonna be hard to tell if the tool is working properly, since there is a convolution of the noisy, uncertain inputs, and the unproven tool. You are basically saying "garbage in, garbage out" is to be expected. But the tool may be contributing to the "garbage out" and it seems like it'll be hard to sort its contribution, if any, to the "garbage out."

Say I wanted to show that my new, fancy method for fitting a straight line works. I guess I'd synthesize a data set with known characteristics (e.g. y = mx + b, with noise), and then apply my new-fangled algorithm to it. Could something analogous be done here? It would be nice to see that if a high SR output is expected (e.g. (1+r)^n type returns), one actually gets it.

Extracting alpha factors is still ongoing research. Having the portfolio construction happen in before_trading_start might help with that. But one could also imagine us wanting to do our construction based on pure alphas in that approach.

Well, it seems like a no-brainer. I think the idea is for users to research and develop long-short alpha factors, and implement them in pipeline, using daily bars. My sense is that most of the value-add will be at this step (although presently, the only way to get paid for the effort is to write a full-up algo). Y'all could grab the alpha factors from each user, and take it from there, doing a global combination and optimization (on a high-performance computing platform, so you don't have to fuss with memory limitations, etc.). Of course, the over-arching question is, with this sort of a relatively long timescale system and equities, what sort of SR is achievable?

Thanks for this incredible work Thomas ! I have eagerly begun playing around with it.

Do you know if its possible to plot the weights assigned to each factor during each period as the algo runs?

Grant: Good point regarding garbage in garbage out. Ideally we'd have some simulated data to demonstrate it works. I think this should be pretty straight-forward to do in research.

Lex: Yes, the commented-out line 286 (log.debug(self.clf.feature_importances_)) prints it out, albeit without the names of the factors associated with the importances. Perhaps you can store them in a global var, combine them with the factor names, and then plot them. That would be a great contribution.
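A minimal sketch of that suggestion (the names here are mine, not from the algorithm): append each retrain's feature_importances_ together with the factor names to a global list, then build a DataFrame from it in research and plot.

```python
import pandas as pd

# Global history, appended to on every model retrain.
importance_history = []

def record_importances(day, factor_names, importances):
    """Store one retrain's feature importances, labeled by factor name."""
    importance_history.append(
        pd.Series(importances, index=factor_names, name=day))

# After the backtest: one row per retrain, one column per factor.
# df = pd.DataFrame(importance_history)
# df.plot()
```

Inside the algorithm you would call record_importances right after fitting, passing the same factor-name list used to build the pipeline columns, so rows and columns stay aligned across retrains.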

Aside from synthetic data & factors that would be expected to yield consistent, high SR, you'd also want to feed it noise to show that it doesn't extract a spurious signal from the noise (i.e. "make a silk purse out of a sow's ear," or "overfit" per the standard lingo).

One point of confusion here is with respect to The 101 Alphas Project which seems to suggest that if approached correctly, lots of sketchy, mini-alpha factors can be combined into a mega-alpha that will be viable. Is there any overlap between what you've done, and what is claimed in the paper (that many "faint and ephemeral" alphas can be combined advantageously)? Maybe with the right set of 101 alphas, your algo would be a winner?

Yes, the 101 Alphas Project would be a fantastic place to start for adding new alphas. This post is really about Alpha Combination, not Alpha Discovery.

@ Lex -

Regarding "Do you know if its possible to plot the weights assigned to each factor during each period as the algo runs?" you can only plot 5 variables. However, a tedious hack would be to run the backtest N times, recording a different set of variables each time. Within the research platform, the recorded variables are accessible. For example, try (with your own backtest_id):

bt = get_backtest('584bf25a0ca16a64879c92f1')
dir(bt)
print bt.recorded_vars


Once you get all of the factor weights versus time into the research platform, you could then plot them together.

One tweak to the Quantopian backtester would be to allow recording of a larger number of variables, but perhaps still only allow plotting 5 per backtest. Then, in one backtest, all variables of interest could be loaded into the backtester in one shot.

I can't help but feel something's not quite right here. I ran a single factor (fcfy) with only 2 constraints (market neutral & gross leverage), set it to monthly rebalancing, a 20-day prediction horizon, and 20-percentile ranges. I was expecting results rather similar to my monthly long/short factor-rebalancing model for the same factor, but they are very different, even though the net leverage still hovers around 0. Why would the results deviate so much if only one factor is being passed?

Also, unless I significantly loosen the constraints, I find it difficult to achieve anything but flat-line results. Am I missing something here?

N_STOCKS_TO_TRADE = 200 # Will be split 50% long and 50% short
ML_TRAINING_WINDOW = 60 # Number of days to train the classifier on, easy to run out of memory here
PRED_N_FWD_DAYS = 20 # train on returns over N days into the future
TRADE_FREQ = date_rules.month_start() # How often to trade, for daily, set to date_rules.every_day()


Lex, there's not a lot to go on here. In general, the ML combines factors in a non-linear way, so it could end up completely different from the weighting you do. Perhaps remove that part, or replace it with something linear like LinearSVC?

The memory limitation on the Q platform is quite serious when it comes to really delving into machine learning techniques. A data cube with the many factors one needs to extract generalizations and keep forward-prediction variance down does not fit within the current memory limits. Additionally, the inability to serialize variables and models using the pickle library makes offline training impossible too.

I hope the Q team can see that in order to support the ML field beyond conceptual token examples, the platform needs to significantly improve in functionality and performance!

Hi Thomas -

I'm wondering how to handle "magic numbers" in factors, particularly with respect to time scales. For example, say I defined a simple price mean reversion factor as:

np.mean(prices[-n:,:],axis=0)/prices[-1,:]


At any point in time, there may be a sweet spot for n (and presumably, using alphalens, I could sort that out, and kinda ballpark an optimum). However, I could also code a whole set of factors, each with a different trailing window length for the mean, and then let the ML combine them. For example, I could write 8 separate factors, with:

np.mean(prices[-3:,:],axis=0)/prices[-1,:]
np.mean(prices[-4:,:],axis=0)/prices[-1,:]
np.mean(prices[-5:,:],axis=0)/prices[-1,:]
np.mean(prices[-6:,:],axis=0)/prices[-1,:]
np.mean(prices[-7:,:],axis=0)/prices[-1,:]
np.mean(prices[-8:,:],axis=0)/prices[-1,:]
np.mean(prices[-9:,:],axis=0)/prices[-1,:]
np.mean(prices[-10:,:],axis=0)/prices[-1,:]


This would seem to have the advantage that I've eliminated (or at least smoothed) a magic number factor setting. Also, perhaps the ML algorithm would then dynamically combine them versus time in an optimal way. Or would it just create a mess? If it would make sense, is there a more elegant way to code things, so that one does not have to write N identical factors, only differing by their settings (e.g. be able to call a single factor N times with different parameters)?
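One way to avoid the N copy-pasted definitions is a small factory, sketched here with a plain closure over numpy arrays. On Quantopian, the analogous idiom would be a single CustomFactor that takes window_length as a parameter, but this standalone version shows the idea:

```python
import numpy as np

def make_mean_reversion(n):
    """Return a factor function computing mean(prices[-n:]) / prices[-1]
    per asset (one column per asset)."""
    def factor(prices):
        return np.mean(prices[-n:, :], axis=0) / prices[-1, :]
    return factor

# One factor per trailing window length, generated in a loop instead of
# eight hand-written, near-identical definitions.
mean_reversion_factors = {n: make_mean_reversion(n) for n in range(3, 11)}
```

Each entry in the dict is an independent factor; the ML combination step can then weight the different window lengths against each other over time.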

On a separate note, does pipeline support CVXPY? Is it possible to write factors that include the application of CVXPY?

Regarding platform limitations, my two cents is that there is an odd unidirectional relationship between the research platform and the backtester. It seems that they could be better integrated/unified, so that data could pass freely between them, and backtests could be called from the research platform. It seems also that one should be able to output a set of data from the research platform, and it should be available to algos (both for backtesting and live trading). Trying to do everything online in the algo is limiting, but I guess in the end, you need stand-alone contributions to the Q fund; you can't have "managers" fiddling with them offline.

@Thomas - Thanks for continuing the machine learning thread here with working code...that is really useful! What is on the horizon next for this thread?

@Kamran - yes..without something like a powerful compute/memory nvidia-docker image or cluster to work with, you have no real way to compute massive regression-based or deep machine learning algorithms. Perhaps one could deploy already learned models with Q's infrastructure though?

Just finished reading a blog article that I found interesting:
https://medium.com/@TalPerry/deep-learning-the-stock-market-df853d139e02

Besides the detail involving the different ways to learn sequential information (Recurrent Neural Network, LSTM, etc.), I found the overall targets the author delineates useful, so I quote from that section of the blog article by @TalPerry:

In our case of “predicting the market”, we need to ask ourselves what exactly we want to market to predict? Some of the options that I thought about were:
1. Predict the next price for each of the 1000 stocks
2. Predict the value of some index (S&P, VIX etc) in the next n minutes.
3. Predict which of the stocks will move up by more than x% in the next n minutes
4. (My personal favorite) Predict which stocks will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.
5. (The one we’ll follow for the remainder of this article). Predict when the VIX will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.
1 and 2 are regression problems, where we have to predict an actual number instead of the likelihood of a specific event (like the letter n appearing or the market going up). Those are fine but not what I want to do.
3 and 4 are fairly similar, they both ask to predict an event (In technical jargon — a class label). An event could be the letter n appearing next or it could be Moved up 5% while not going down more than 3% in the last 10 minutes. The trade-off between 3 and 4 is that 3 is much more common and thus easier to learn about while 4 is more valuable as not only is it an indicator of profit but also has some constraint on risk.
5 is the one we’ll continue with for this article because it’s similar to 3 and 4 but has mechanics that are easier to follow. The VIX is sometimes called the Fear Index and it represents how volatile the stocks in the S&P500 are. It is derived by observing the implied volatility for specific options on each of the stocks in the index.

I like target number 4 also, yet wonder...
What do others think of these targets ?
alan

@Luca: Thanks so much -- those are two critical bugs. I'll update the original code too.

Not sure if you're doing this, because I haven't had time to look at your code carefully, but beware of lookahead bias. You should use a different timeperiod to run alphalens on your factors to the timeperiod where you test the efficacy of the "simple" model. By the same token, the ML model should be trained using data from the first timeperiod and tested using the second timeperiod.

@Rudiger Lippert, thanks for your comment but that's not an issue. I applied the same approach Thomas used in his ML part 2.

@Luca: That's a very insightful analysis. It's clear that the ML does not come off very well here. Since the factors are already pretty good and linear, perhaps this isn't too surprising as the ML would just muddle with that. So two experiments would be to try a linear classifier and adding a nonsense-factor which hopefully the ML would identify and not listen to when making predictions. I have already tried experiment 1 by replacing the ML algo with logistic regression. However, it didn't do much to improve metrics like Annualized IR. As another suggestion, I would condense the outputs and only store metrics like spread, IC and IR. Then display a big comparison table at the end comparing all the individual experiments. The full tear-sheets are too difficult to compare directly.

I also took another stab at improving the code. Specifically, I'm now tracking past predictions to evaluate the alpha model directly on held out data. Just looking at returns is too far detached from checking if the model actually works, as Luca and Grant highlighted. Thus, we can now track all kinds of metrics like accuracy, log-loss and spread. Surprisingly, these look pretty good on the held-out data (accuracy ~55%). This raises the question of why the algorithm isn't doing better than it is. Possibly because the spread is still pretty small, or because there are still bugs. In any case, I also simplified the portfolio allocation to try and get the trading of the algorithm closer to what is suggested by the alpha model.

Ultimately, we really need to develop tools and methods to evaluate each stage independently to track where things go wrong.
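The prediction-tracking idea can be sketched as follows (class and method names are mine, not from the algorithm): hold each day's probabilities until the forward-return horizon has passed, then score direction accuracy against realized returns.

```python
import numpy as np
from collections import deque

class PredictionTracker:
    """Hypothetical sketch: score stored P(up) predictions once their
    n_fwd_days return horizon has elapsed."""
    def __init__(self, n_fwd_days):
        self.n_fwd = n_fwd_days
        self.pending = deque()  # (day, probs) awaiting realized returns
        self.hits = 0
        self.total = 0

    def record(self, day, probs):
        """probs: predicted P(up) per asset, made on `day`."""
        self.pending.append((day, np.asarray(probs)))

    def score(self, day, realized_returns):
        """realized_returns: per-asset returns over the last n_fwd days."""
        realized = np.asarray(realized_returns)
        while self.pending and self.pending[0][0] <= day - self.n_fwd:
            _, probs = self.pending.popleft()
            correct = (probs > 0.5) == (realized > 0)
            self.hits += int(correct.sum())
            self.total += probs.size

    @property
    def accuracy(self):
        return self.hits / float(self.total) if self.total else float('nan')
```

The same structure extends to log-loss or spread: store whatever the model emitted, wait out the horizon, compare against what actually happened.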

[Backtest attached. Backtest ID: 585291f24ab362625a438fc6]

I'm not sure we're thinking about this problem correctly. It's not about whether "ML" works as means for alpha combination or not. Any technique that learns how to predict values from training data is technically ML, even if it is an extremely simple technique like linear regression.

I think it's more about the bias-variance tradeoff, which tells us that simpler models will do a better job of not fitting random noise, but can be systematically biased if the output function isn't itself simple (like trying to fit a polynomial with a line).

The problem of fitting to noise is particularly exacerbated by having very little data. A more complex model (like AdaBoost) tries to infer properties from limited examples and comes up with incorrect inferences. For example, if you had never seen a dog and were only given 5 pictures of dogs, each of which contained a tree, you might infer that the tree is the dog. Only by adding many more examples, where trees aren't present, do you have a chance of learning a better rule for identifying a dog. I think the biggest problem with this new ML factor is that it's training on so little data. The model needs to see what happens when the market is in different conditions, or it will assume that the last month is completely representative of how markets behave.

Ernie Chan commented in one of his books that financial data is so noisy that even if you have a lot of data, it makes sense to use very simple models or you're going to overfit. Steps like ranking features, ranking targets, changing the target into a classification problem, and using a linear combination of ranks are all attempts to tame the noise beast.

I wrote a bit about these kinds of issues in my blog post for my former company, if you're interested: https://www.factual.com/blog/the-wisdom-of-crowds

Rudiger: All great points. It's definitely worthwhile to start with linear classifiers and some robust factors. Note that we're already ranking features and changing the target into a classification problem. I think Luca's experiments are actually compatible with your thoughts. The simplest linear classifier is an equal-weighted one. If a learned weighting does worse, it means there's so much noise that we can't even infer stable combinations.

Ideally I'd like to test with "fake" factors that have great predictive power (like today's return, introducing look-ahead bias essentially). This would really show the benefit (can add non-linearities and noise on top), and be good for debugging and making sure everything works.

Here's Thomas' recent revision (first post in thread, Backtest ID: 58517784ee8d8363d0d9790d), with:

predictions -= 0.5 # predictions are probabilities ranging from 0 to 1
predictions *= -1 # flip sign of predictions, to see if return improves


Not sure what it tells us, but I recall that a SR of as low as 0.7 could qualify for the Q fund. If it gets funded, I stake my claim to a share of the profits!

One interesting thing is that beta is on the high side. Would there be a way to mix in a hedging instrument (SPY?) to knock it down? I suspect that we may just be seeing the effect of a finite beta.

[Backtest attached. Backtest ID: 5853beaf97d940625309c809]

Grant: That's funny ;). For some reason the original algorithm has negative beta; not sure why yet. I suspect the returns here are also due to the higher beta.

Hi Thomas & all,

Yesterday, I'd posted some comments that were judged to be off the main topic here and deleted by a Quantopian moderator. I moved them to:

Grant

A potential problem with this algorithm is the ML technique you are applying. AdaBoost is a meta-algorithm which requires a base learner as an argument (see the sklearn documentation on this). If you do not pass in a base learner it defaults to a decision stump, which may not be optimal (or even useful) for predicting returns.

How can we fix the memory problem, please? Can we pay a premium to expand our memory restriction? Or is it possible to train the learner offline? What's the setup for parallel computing on this?

@Nicholas: Note that we're not predicting returns directly, but rather if returns go up or down (relative to the other in the universe). Certainly this is a choice but I think it's a good default to start with (although I now start with linear classifiers to remove complexity). Having said that, it's certainly something to experiment around with. What base learner do you think would be better suited?

@Jun: We hope to be able to just increase memory for everyone.

@Thomas,

I was thinking about how to make sure you account for 1) securities that go broke and are removed from the entire stock universe and/or 2) securities that exit your particular screen. I'm not 100% sure your shifting approach would account for them. I've attached a Notebook with how I would approach getting accurate returns for a specific screen (Q1500US for my example).

Anyway, not sure it would be possible in the algorithm code but something to consider.


I'll get this posted in the next few days, work has been hectic -- I have been running an SVM on the S&P 500 that is showing promise as well. I am thinking that an approach that equal weights holdings from SVM, NB and Ensemble might do well. Stay tight.

ML combines the factors in a non-linear way. Does that help get rid of the problem of correlated factors (used in linear regression based factor models) to some extent?

Luca: Great catch, did you debug the function in a NB?

Bharath: In the case of correlated features, Random Forests will pick one feature to make predictions and then give the other a very low weighting. This should be taken into account when looking at feature importances. See e.g. http://blog.datadive.net/selecting-good-features-part-iii-random-forests/.
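A toy illustration of the importance-splitting effect Thomas describes (all data synthetic; the near-duplicate column stands in for a correlated factor):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
x1 = rng.randn(1000)
x2 = x1 + 0.01 * rng.randn(1000)   # near-duplicate of x1
x3 = rng.randn(1000)               # pure noise
X = np.column_stack([x1, x2, x3])
y = (x1 > 0).astype(int)           # only the x1 signal matters

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# The x1 signal's importance is split between the two correlated
# columns; the noise column gets almost nothing.
print(clf.feature_importances_)
```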

I still don't get any good results in the NB, even after this bug fix. So I believe it was just luck that the backtest improved.

Thanks Thomas

We need Python pickle serialize/deserialize.

Hi Thomas/Q team -

Above, it is stated regarding memory limitations "We are acutely aware of this problem and are working hard to resolve it."

What is the status? What approaches are being considered? Timeline for implementation?

I've been playing with ML locally using zipline for some time now and I'm not sure 6 months of Daily data is going to do much in the way of ML. I've found considerably better results with 15-20k data points. This being said, we definitely need a way to take a model trained locally and then import it into quantopian. If this could be done on a daily basis (Train at home, upload, and have the Q script fetch it), the idea of using ML with Quantopian would make a lot more sense. Just my thoughts, maybe people have had better luck with smaller datasets?

@Visser Agree with you on the importance of model import for the Q platform. Might be hard for them to do for a large user base...don't know...
We also haven't gotten good ML results with a small number of days over a small number of factors...which actually feels right.
It should take a large number of days over a large number of factors to eke out an alpha that is not easily detected by human inspection.

When you say "15-20k data points", do you mean 1.5-2k time-points over 10 factors, or 15-20k time-points over one factor?
Thanks!
alan

I'll second Andy's point - I know it's been asked for before. A method for taking a model trained locally and uploading it and using it with Quantopian data. The computation-heavy part of machine learning is the training of the model. If memory limitations are the problem, let us train the model outside the platform and import it in.

@Alan - the models I'm playing with locally are trained using 750k lines across 25-30 factors, so between 18M and 22.5M data points, covering two years of data. More factors means more training data is needed to get a model that generalizes well, although I'm probably using more than I need (I have another 500k lines or so I'm using as a testing set). I agree that using a small number of factors over a short time is likely to yield poor results, or results that only apply to a specific situation.

Training those models can take hours (most take a few minutes though), but predictions are much quicker (usually seconds). Offloading the computation heavy part to a local machine and keeping the predictions, which are relatively easy, would be the least painful solution for everyone.

On the other hand, I'm not a computer scientist, so I have no idea what's going on around the back end. :)

I think part of the problem may be how this would fit with the Q fund concept of being able to time-stamp algos once an initial backtest has been run, and then re-running the backtest in 6 months out-of-sample for evaluation. Would the ML model be stale by then? How often would uploads be required to keep it fresh? And how would Q manage the uploads in evaluating the algos for the fund? Q is all about the Q fund, so one has to consider everything in that context.

It seems like one solution might be to super-charge the research platform, and then work out the mechanics of launching an algo with the ability for the user to push data from the research platform to the algo (or perhaps call an automated script tied to the algo).

@Alan 15-20k lines with 75 or so features each. Your idea of 750k to 1mil is probably where I should be though ;)

EDIT
Sorry, I meant @John Cook

@Grant, that's an idea. I'd love to be able to do it all with Quantopian.

@John - Thanks for the info...sounds great!
Have you used multi-time-scale learning yet, with fusion...or is your time scale all in seconds ?

Hi,

Rather than using ML to predict tomorrow's return, I want to predict today's return and see if prediction is greater than or less than actual return. Go long if prediction > today's return and short if prediction < today's return. Could someone help me do this if it is not too much to ask?

Best regards,
Pravin

@Pravin Bezwada it seems easy. You need a simple comparison: create an array of today's returns (those you can just calculate with a % change function) and an array of tomorrow's predictions, shift the predictions by -1 so they line up with today, and then compare the two arrays. :) You can also do this with Pandas. Hope it helps.
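A hedged sketch of that comparison in pandas (prices and predictions are made-up numbers; the alignment convention is illustrative, not Pravin's final design):

```python
import pandas as pd

# Hypothetical prices for one stock (all numbers illustrative).
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0])
today_returns = prices.pct_change()

# Pretend each entry is the model's prediction for the NEXT day's
# return; shifting by -1 aligns each prediction with the day it
# was predicting.
predictions = pd.Series([0.015, -0.02, 0.03, -0.005, 0.01])
aligned = predictions.shift(-1)

# Long (+1) where the prediction exceeds the realized return, short (-1) otherwise.
signal = (aligned > today_returns).map({True: 1, False: -1})
print(signal.tolist())
```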

Thanks Arshpreet Singh. That could work; although I wanted to regress today's factor on today's returns instead of yesterday's factors on today's returns.

@Pravin, there is this variable

PRED_N_FWD_DAYS = 1 # train on returns over N days into the future


But keep in mind the following: pipeline is run every day after market close (precisely, before market open of the next day, but the data available to pipeline is the same). So after market close you can compute your latest factor values; but, because returns are calculated as the % difference in open prices (the algorithm enters/exits positions at market open), pipeline will compute today's returns (tomorrow's open / today's open) only tomorrow after market close. This is why the algorithm discards today's pipeline factor and uses yesterday's pipeline factor with the pipeline returns calculated today (today's open / yesterday's open), which actually means running ML on yesterday's factor and yesterday's actual returns. This is what you get with PRED_N_FWD_DAYS = 1.

This is also a bug, that I reported above, as the algorithm was intended to run ML on future returns, not current returns. That is, suppose you want to use 1 day future returns, the algorithm should use returns calculated today and factor values calculated 2 day ago (not yesterday).
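The factor/return alignment at issue can be sketched as follows (a simplified toy example that ignores the extra one-day open-price lag Luca describes; shapes and names are illustrative):

```python
import numpy as np

# Toy panel: one row of values per day for 3 stocks.
n_fwd_days = 1
rng = np.random.RandomState(0)
factors = rng.randn(6, 3)   # factor values computed each day
returns = rng.randn(6, 3)   # realized 1-day returns for each day

# To train on n_fwd_days-ahead returns, pair each day's factor row
# with the return realized n_fwd_days later:
X = factors[:-n_fwd_days]   # last rows dropped: their outcome is not yet known
Y = returns[n_fwd_days:]    # first rows dropped: no prior factor row exists

assert X.shape == Y.shape
assert np.array_equal(Y[0], returns[1])  # day-0 factors predict day-1 returns
```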

Pravin Bezwada: Yes that was random guess. :)

Thanks Arshpreet.

@Luca, thanks for your valuable feedback. I now see how it works. Here is my attempt with ML. My take on this problem is that what happens tomorrow is anyone's guess. But you can always tell if today has been undervalued or overvalued. The attached algorithm has a few bugs and crashes but it is so far profitable with transaction costs and slippage. If someone could take a look and fix the bug, I can improve it further.

Backtest attached (ID: 58c9055040128517fc5d690e). There was a runtime error.

@ Thomas/Quantopian support -

What is the status of this effort, in terms of releasing an API? Is that the goal? Is it still a work-in-progress on your end? Or is it complete, from your perspective?

The approach taken with the optimization API is nice, and presumably it will eventually be "released" and moved out of quantopian.experimental with the source code on github. Is that the game plan here, too? If so, what remains to be done?

Some issues I see:

1. Memory constraints.
2. Testing of ML on known (synthesized) datasets (to confirm its functionality independent of noisy, real-world factors).
3. Encapsulation into an API with documentation and revision control on github.
4. Other?

After further reviewing this algorithm, I realized that we pool the factors and returns across all securities for the prediction. I want to run multiple predictions, one per security, by regressing the time series of factors per security against that security's returns. My problem is I don't know how to fetch a time series (say 60 days) of factor values per security. Sigh, pipeline is so hard.

@Grant: I see the same issues you do (especially memory). A lot of engineering work recently went into improving the infrastructure but hopefully we can work on removing these issues soon.

@Pravin: Correct. You don't need to change anything to pipeline, however, since it is already providing the ticker information. The classifier just ignores it. I recommend starting with the NB and static data: https://www.quantopian.com/posts/machine-learning-on-quantopian as it's much easier to intuit about what needs to be done.

@Luca: That logic seems correct to me. Returns will be open of day T-2 to open of day T-1, so really returns of day T-2. I'll fix the code in the various posts.

@Thomas: I'm glad to see this discussion is still active.

I have a question regarding the X values used as input for AdaBoostClassifier. When I was previously learning about AdaBoost, X would be the predicted results of the X classifiers and they'd be the same possible values as Y, i.e., 1 or -1. So for example, if there were three classifiers, in the case of the first training data item, you may have [-1, 1, 1], with the expected Y as [1].

In the case of your provided sample code, you're initially fitting the distributed ranking values (so they're between 0 and 1) as your X data.

My Question: Is there a point in the AdaBoostClassifier in which these X input values get classified into a 1 or -1? If so, where is that happening? Is that the job of the DecisionTreeClassifier?

Btw, this site has opened my eyes to quantitative trading. I really appreciate the vast amount of material available here.

Actually, never mind, I see what's going on. By default AdaBoostClassifier is using DecisionTreeClassifier as the base estimator.

@ Thomas -

I would add to the to-do list a re-consideration of your architecture that limits the ML to pipeline. My view is that a general ML API should accept alpha factors from both pipeline, and from the minutely algo. This doesn't mean that the ML would need to run during the trading day, but just that it would need to accept alpha factors derived from minute bars, for example. As it stands, I don't think there is a way to do that now, since one can't feed data into pipeline.

@Grant: The problem is still that pipeline only returns the most recent row, rather than the full history of factors required to train the classifier. I suppose we could somehow hack around that and store the history in a global variable and then train the classifier in before_trading_start(), or handle_data() if you want to collect minute data and feed that in as well.
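A minimal sketch of that stop-gap in plain Python (on_before_trading_start is a hypothetical stand-in for the real hook; no Quantopian API is used here):

```python
from collections import deque

TRAINING_WINDOW = 250  # days of factor history to keep

# Module-level ("global") buffer -- the stop-gap described above.
# deque(maxlen=...) silently drops the oldest row once full.
factor_history = deque(maxlen=TRAINING_WINDOW)

def on_before_trading_start(todays_factor_row):
    """Hypothetical hook: append today's pipeline output and report
    whether enough history has accumulated to (re)train a classifier."""
    factor_history.append(todays_factor_row)
    return len(factor_history) >= TRAINING_WINDOW  # ready to call clf.fit(...)
```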

Ok, I understand where I was going wrong earlier.. Being new to both quantitative trading and machine learning, maybe it's just me, but I'll state what I was mixing up in case anyone else ran into the same thing.

I was mixing up the Pipeline factors with the AdaBoost classifiers. That is, I was thinking that the factor functions were the classifiers that AdaBoost was expected to use and improve upon. From a naive perspective, the factor functions look similar to what I would define as "weak learners."

But in reality, the weak learners (classifiers) are decision tree stump classifiers that are used by the AdaBoost classifier. The Pipeline factor ranks are the data fed to those weak learners (classifiers).
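A quick way to confirm this distinction (toy data; get_depth requires a reasonably recent sklearn, which is an assumption here):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 4)                                  # toy factor ranks
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)   # noisy toy labels

clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# The fitted weak learners are depth-1 decision trees ("stumps"),
# not the pipeline factors themselves -- the factor values are just
# the input data those stumps split on.
print({est.get_depth() for est in clf.estimators_})
```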

@ Thomas - If I understand correctly, within pipeline, we have access to the full history of factors to train the classifier, but one can't get pipeline to spit out trailing windows of factors. But you think globals will work? Maybe I'll try that at some point, just for yucks. Is there a time frame for upgrading pipeline to output multiple rows? Then you could elegantly park the ML outside pipeline to allow for combining pipeline and non-pipeline factors.

Good news! Since we doubled RAM in research and the backtester, we can finally start running the ML workflow with longer lookback windows. In this backtest I run with a half-year window, training weekly over a 4-year period.

As you can see, the algorithm still does not perform magically better and my hunch is that the factors themselves do not have enough alpha in them, so I think exploring other factors in this framework would be a good next step.

Also, the next bottleneck that popped up is timeouts in before_trading_starts. These happen if we use fundamentals (MorningStar), which unfortunately many factors rely on. I removed all of these from this backtest to make things work. There are plans to speed up fundamentals, alleviating this problem. Until then, there are many things that can be done in this framework already, also using factors based on other data sources.

Backtest attached (ID: 593128407ebbf351f09797f7). There was a runtime error.

Sounds good thanks for the update!

Awesome!

Can someone explain why results can change between two backtests with the same algorithm and parameters? I am using AdaBoostClassifier with default parameters and there are two possible results. Is this an effect of random state? How can I avoid this effect?
Thanks

Mat: I added a random seed now to the classifier instantiation. Can you see if the problem persists?

Backtest attached (ID: 594a7ce5230e5169ff2fe0cb). There was a runtime error.

@Thomas, even with the random_state=1337 param set, there are still two possible results for my backtest with AdaBoostClassifier. I have no problem with GradientBoostingClassifier with default params, but the backtest result is less interesting.

@Mat: We have an open issue about this; will try to fix it and post an update here.

You can check that self.clf.score(X, Y) between two backtest runs is not always the same. Maybe an effect of the default AdaBoostClassifier param algorithm=SAMME.R. For weekly training it does not seem to have any effect, but on yearly training you can see a huge effect on backtest returns.

@Mat - Are you sure that the results of ensemble.AdaBoostClassifier should be fully deterministic, even with random_state set?
I looked, yet couldn't resolve that question.
One could make a case for a slight bit of non-determinism based on the base classifier being a two-leaf tree (e.g. one if-statement), and the choice flipping one way or the other for "equal" values...not sure about that argument though.

@Mat - From what I can tell googling, there was a problem with random_state in late 2016, which was fixed in the sklearn 0.18 release.
See: http://scikit-learn.org/stable/whats_new.html

Fix bug where ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor would perform poorly if the random_state was fixed (#7411). By Joel Nothman.

Fix bug in ensembles with randomization where the ensemble would not set random_state on base estimators in a pipeline or similar nesting (#7411). Note, results for ensemble.BaggingClassifier and ensemble.BaggingRegressor differ from previous versions. By Joel Nothman.

@Thomas - I can't tell what version of sklearn is being used anymore, so I don't know if the bug is fixed in the Quantopian Research Platform.

import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))



produces

SecurityViolation: 0002 Security Violation(s): Accessing sklearn.__version__ raised an AttributeError. No attributes with a similar name were found.

alan

@alan: we run 0.16.1 (quite old) so this definitely seems like it could be the cause, really need to upgrade. I think with the same seed it should behave reliably.

So it sounds like the bug is specific to AdaBoostClassifier, in which case we can instantly fix it by switching to e.g. RandomForestClassifier.

@Mat - Based on my understanding, AdaBoostClassifier uses the DecisionTreeClassifier by default, which produces "best" splits (using gini computation). I believe that in some cases, there are several splits that result in the same gini score. It's possible that, in such a case, the DecisionTreeClassifier randomly picks which split to use.

I found this: https://github.com/scikit-learn/scikit-learn/issues/2386, which seems to support the theory.

Sam

@Sam: Even if it's random, it should be deterministic with a set seed.
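A quick check of that expectation on a toy dataset (RandomForestClassifier used for illustration, since that is what the thread eventually switches to):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(7)
X = rng.randn(300, 5)
y = (X[:, 0] > 0).astype(int)

# Two independently constructed classifiers with the same random_state
# should produce identical predictions run-to-run.
preds_a = RandomForestClassifier(n_estimators=50, random_state=10).fit(X, y).predict(X)
preds_b = RandomForestClassifier(n_estimators=50, random_state=10).fit(X, y).predict(X)
assert (preds_a == preds_b).all()
```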

Thank you all for your help. I switched to RandomForestClassifier(n_estimators=300, random_state=10), which is deterministic but very sensitive to the training period. With a 5-factor model I have to fit the model on only a few months at the beginning of each year to find some good results; maybe the classifier finds some factor relations that do not exist the rest of the year. Difficult to explain.

I understand that we're doing ranking on the features/alpha factors.
Is it possible to deploy a sector-neutralization technique at the signal level, i.e., focusing the algorithm on the relative factor score and stock performance within each sector?

E.g., the outperformers might have negative forward one-month returns, as long as they outperform relative to their sectors.

Can you show me how to do the above, please?

I noticed today on LinkedIn a nice post by Harish Davarajin at DeepValue on this subject. He looks at the combination of 32 alphas from the 101 Alphas publication using various classifiers in sklearn (which are all available on Quantopian). He finds the best results with the SVM classifier (sklearn.svm.SVC) at default settings (e.g., radial basis function kernel).

The post is here.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

I will be posting improvements shortly:

Thanks for the giggle: "random_state=1337"

@jacob: Great, eager to see improvements to this workflow. Re random_state: ;)

Hi Jonathan/Thomas -

Regarding the post by Harish Davarajin at DeepValue, is it something that could be done on Quantopian? If so, is there an outline you could provide of how to approach it?

I've taken an interest in learning some machine learning (perhaps as I age, I can augment my feeble mind with AI!), so perhaps this would be a good entry point.

Thanks,

Grant "Off-White Seal" Kiehne

Has anyone modified this framework to train on intraday (open to close) returns?

@jacob shrum come on then :D

If the entire market is in a bull phase, and all returns are positive for whatever universe we select, will this algorithm still short the worst performers?

@Thomas when do you think you might have a fix for the Morningstar related timeouts?

From what I can see in backtesting, pipeline appears to batch multiple days of factor calculations, and if the arbitrary batching breaches the before_trading_start timeout of 5 minutes, the whole thing fails. I wish there were a way for us to control how the batching gets done: even if we are sure that one day of factor calculations only takes, say, 100 secs, because of the arbitrary batching we seem to get seemingly random timeouts.

The example below resulted in a timeout even though the before_trading pipeline itself ran in under 250 secs for a 6-day batch job.

2017-01-03 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-03 08:45:00-05:00
2017-01-03 21:45 compute:482 INFO weekday 2017-01-03 00:00:00+00:00 1
2017-01-03 21:45 compute:484 INFO 2017-01-03 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.717320 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-04 00:00:00+00:00 2
2017-01-03 21:45 compute:484 INFO 2017-01-04 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.708877 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-05 00:00:00+00:00 3
2017-01-03 21:45 compute:484 INFO 2017-01-05 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.675035 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-06 00:00:00+00:00 4
2017-01-03 21:45 compute:484 INFO 2017-01-06 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.717600 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-09 00:00:00+00:00 0
2017-01-03 21:45 compute:484 INFO 2017-01-09 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.901481 secs
2017-01-03 21:45 compute:482 INFO weekday 2017-01-10 00:00:00+00:00 1
2017-01-03 21:45 compute:484 INFO 2017-01-10 00:00:00+00:00
2017-01-03 21:45 compute:489 INFO Training took 1.795630 secs
2017-01-03 21:45 before_trading_start:1089 INFO Time to run pipeline 246.63 secs
2017-01-03 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-03 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-03 21:45 WARN sklearn/cross_validation.py:417: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
2017-01-03 21:45 WARN sklearn/cross_validation.py:417: Warning: The least populated class in y has only 2 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
2017-01-03 21:45 WARN Logging limit exceeded; some messages discarded
2017-01-04 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-04 08:45:00-05:00
2017-01-04 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-04 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-04 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-04 21:45 before_trading_start:1151 INFO Time to run optimizer 33.54 secs
2017-01-04 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 2.0, maxweight : 0.005, -0.005
2017-01-05 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-05 08:45:00-05:00
2017-01-05 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-05 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-05 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-05 21:45 before_trading_start:1151 INFO Time to run optimizer 30.15 secs
2017-01-05 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 2.0, maxweight : 0.005, -0.005
2017-01-06 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-06 08:45:00-05:00
2017-01-06 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-06 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-06 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-06 21:45 before_trading_start:1151 INFO Time to run optimizer 29.93 secs
2017-01-06 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 1.99999999999, maxweight : 0.005, -0.00499999999999
2017-01-09 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-09 08:45:00-05:00
2017-01-09 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-09 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-09 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-09 21:45 before_trading_start:1151 INFO Time to run optimizer 32.78 secs
2017-01-09 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 2.0, maxweight : 0.005, -0.005
2017-01-10 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-10 08:45:00-05:00
2017-01-10 21:45 before_trading_start:1089 INFO Time to run pipeline 0.00 secs
2017-01-10 21:45 before_trading_start:1124 DEBUG Number of secruities to be considered: 200
2017-01-10 21:45 get_best_portfolio:1052 INFO one good optimization done
2017-01-10 21:45 before_trading_start:1151 INFO Time to run optimizer 30.61 secs
2017-01-10 21:45 before_trading_start:1156 INFO total weight: 1.0, total abs weight: 1.99999999998, maxweight : 0.00499999999998, -0.00499999999997
2017-01-11 21:45 before_trading_start:1086 INFO started running pipeline on 2017-01-11 08:45:00-05:00
End of logs.

This has been a wonderful example to learn from Thomas, thank you.

Is there a way to make the factors from make_factors() "window_safe"? I find that if I remove .rank(mask=universe) from when the functions are evaluated I get a NonWindowSafeInput: Can't compute windowed expression error.

    # Rather than instantiate ranked factors, just get the actual values.
    # This works in the notebook research environment but fails in the algorithm arena.
    for name, f in factors.iteritems():
        factors_pipe[name] = f()


Honestly, this confuses me a bit, because I would think the main function would have to be evaluated first, before .rank() could run.

Hi Thomas -

I'm following up on your comment above:

@Grant: The problem is still that pipeline only returns the most recent row, rather than the full history of factors required to train the classifier. I suppose we could somehow hack around that and store the history in a global variable and then train the classifier in before_trading_start(), or handle_data() if you want to collect minute data and feed that in as well.

Has there been any progress on upgrading Pipeline to return the full history of factors? In terms of a more generic architecture, my thinking is that within before_trading_start() (where we have access to a 5-minute compute window...well sort of, since Pipeline chunking can consume a lot of it), one would run the ML code, combining both Pipeline-derived alphas and alphas computed using minute data. Additionally, this would allow control over when the ML step is run (you are running it every day, correct?). Unfortunately, it could be a bit of a kludge, since I'm not sure schedule_function() is compatible with before_trading_start() so some sort of flag would be required.

Regarding the idea of using globals, I see them used pretty much willy nilly in examples from Quantopian, but I thought they were a bad practice in writing software. Are they confined to the sand-boxed algo API, so it is copacetic? Or is there a risk that globals of the same name are used elsewhere in your software, and there would be a conflict?

Overall, what is the Q priority on this ML stuff? Is the code you published above running to your satisfaction? What are the next steps? Or is it on the back burner?

@Pumpelrod: Instead of .rank(), use .zscore().

@Grant: Good questions. I agree that having access to pipeline output elsewhere is likely to be the way to go. Training does not happen daily, however. There is actually meaningful progress to this workflow and I will post an update soon.

Global variables might not be an unreasonable stop-gap solution, even though it's not good practice. The algo is sand-boxed so there is no concern of overwriting some other variables.

I breezed through a learning/deep learning book recently:

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
by Aurélien Géron

It does cover AdaBoost, but later on, in the Tensorflow section, it has a chapter on recurrent neural networks as being particularly well-suited for time series problems. Presumably, RNN's are used in quantitative trading, but maybe that's beyond the scope here.

As I'd pointed out earlier, I'd recommend figuring out how to synthesize an input data set with a known output, to see if your ML code is actually working. For example, you could make up a bunch of factors that, when combined simplistically (e.g. linear combination, equal weights), gives a good result. Then, you could turn on the ML to see if it improves the result. Otherwise, if your factors are sketchy, volatile pieces of junk, you won't be able to test the potential benefit of ML.
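A hedged sketch of such a sanity check, with a planted equal-weight linear combination as the known ground truth (all data synthetic; the classifier choice is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_days, n_factors = 2000, 5
factors = rng.randn(n_days, n_factors)

# Plant a known ground truth: an equal-weight linear combination.
signal = factors @ (np.ones(n_factors) / n_factors)
labels = (signal > 0).astype(int)  # "returns go up" iff the planted combo is positive

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(factors[:1500], labels[:1500])
oos = clf.score(factors[1500:], labels[1500:])
# Accuracy well above 0.5 confirms the ML plumbing can recover
# planted structure; near 0.5 would indicate a broken pipeline.
print(oos)
```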

@Grant: You have your finger on the pulse, RNNs (specifically LSTMs) are very interesting in that regard. Check out this blog post: http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction

Some unittests around the ML framework would also be very helpful, the code complexity is pretty high.

By the way, I'm sure you've thought of this, but if I were y'all, I'd set up a rig in your office with access to all of the Q data, and tools, plus whatever state-of-the-art ML tools are out there, and get a whip-smart Boston-area intern or starving grad student to play around with it. Presumably, the licensing terms with your data vendors would allow this. I bet for 5-10K, you could build a little supercomputer to play with. There have been numerous requests for high-end deep learning on Q, so you could gather support that it would be beneficial (or show that it stinks) and start to understand how to implement it (or drop it). @ Thomas - Regarding your comment above that some unit tests around the ML framework would be in order, how would you propose this be done? Presumably, one would need to use the Q research platform and paste chunks of your code into it. Ideally, I'd think that it would be nice to test the entire algo with controllable inputs. One could imagine Q devising and releasing a single toy data set for this purpose, from which multiple factors could be created. Using the magic of look-ahead bias, one could generate the toy data set from actual stock market data, controlling the degree of predictability of the data set. The standard, configurable toy data set would then be accessible from both the research platform and the backtester. It seems like the toy data set would drop right into the existing Q framework, no? Thanks for the LSTM reference; I'll need to continue to get up the learning curve on ML. Intuitively, it seems that if one has reasonable stationarity, then one can learn from the recent past without any fanciness. By analogy, identifying pictures containing ducks will be the same problem today as it will be in a year. However, if in a year, there is no such thing as a duck, but we are using a 1-year look back, it won't work. I'd seen some comments on Q that for the market, certain time-dependent ML techniques should be used. 
Is that what you are working to implement on Q?

Also, regarding your comment "Training does not happen daily", it appears you are using this code within class ML(CustomFactor):

    if (not self.init) or (today.weekday() == 0):  # Monday

I suggest sorting out how to move this control to a scheduled function, rather than burying it in the code, if you are developing a general framework/template/workflow. For example, to run the ML on a Monday before trading starts, scheduling a function on Friday to set a flag would do the trick. Also, I'm wondering if you could have a control for the degree of ML to be imposed to set the relative factor weights in the alpha combination step? For example, one could have:

    alpha_combined = a*alpha_naive + (1-a)*alpha_ML

By setting a, one could control the influence of the ML relative to a naive alpha combination (e.g. equal weights). In the ML lingo, I gather that a would be called a hyper-parameter, to be optimized as part of the development. Another free tip is that although you do a great job of commenting your code, some accompanying schematics/flow charts/swim lanes/state diagrams/etc. would be really helpful.

@ Thomas - I have spent quite some time testing different approaches for factor combination. One thing that I find difficult is determining the size of historical data to pass to the ML factor. Looking only at recent history (few historical samples) increases the chances of overfitting; too many historical samples make the ML factor slow to adapt to changes in market conditions. Do you know of any technique used to estimate the best history size (in this case the window_length of the ML factor) given how many days in advance we want to predict the prices (n_fwd_days)? It would be interesting to hear from Thomas, generally, how Quantopian intends to address the ML hyper-parameter optimization problem.
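Grant's a-weighted blend above could be sketched as follows. This is a minimal illustration, not code from the posted algorithm; `blend_alphas` and the standardization step are illustrative choices of my own, added so the two signals are on a comparable scale before blending:

```python
import numpy as np

def blend_alphas(alpha_naive, alpha_ml, a=0.5):
    """Blend a naive (equal-weight) alpha with the ML alpha.

    `a` is a hyper-parameter in [0, 1]: a=1 uses only the naive
    combination, a=0 uses only the ML output. Both inputs are assumed
    to be aligned per-security score arrays.
    """
    alpha_naive = np.asarray(alpha_naive, dtype=float)
    alpha_ml = np.asarray(alpha_ml, dtype=float)
    # Standardize each signal first so the blend weight is meaningful.
    z_naive = (alpha_naive - alpha_naive.mean()) / alpha_naive.std()
    z_ml = (alpha_ml - alpha_ml.mean()) / alpha_ml.std()
    return a * z_naive + (1 - a) * z_ml
```

Sweeping `a` over a grid of backtests would then be exactly the kind of hyper-parameter search discussed below.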
Each factor would have a characteristic optimal time scale, tau, so the problem sounds a bit gnarly (note tau is the trailing window of factor values, not the window length of data used within the factor). And the time scale may change with time, so we have tau_n(t), where n is the factor index. Within the algo, tau would need to be updated periodically, if I'm thinking about this correctly. One thing I've suggested in the past would be to run this R&D stuff off-platform, so that the constraints of the Q platform don't come into play. I'd think that the Q R&D team would be chomping at the bit to run on a high-performance platform (e.g. https://blog.quantopian.com/zipline_in_the_cloud/).

@Luca: Yeah, that's a tough question. I have done some research on this here: http://twiecki.github.io/blog/2017/03/14/random-walk-deep-net/. Note that PyMC3 is not available on Quantopian. But I think an approach like that could solve the problem. Otherwise you can of course try to analyze each factor to determine the longevity of the signal using alphalens.

@Grant: As you know, I've done quite some work on this in the past. I'm not super convinced, however, that it's really the right thing, as it can very easily overfit if you're not super careful. These recent slides from Lopez de Prado touch on these and related issues and are well worth a read: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3031282

Thanks Thomas, I'm not sure I understand the argument. It is something like "Let's not give sharp knives to chefs in training; they'll cut themselves." In the book on ML I read recently, I gathered that hyper-parameter optimization is a standard operation. Besides, for the Q fund, my assumption is that you have a super-sensitive overfit detector, and would easily filter out the nonsense. The Q platform kinda supports hyper-parameter optimization now.
Users can run an unlimited number of backtests in parallel (at least that used to be the case), load the results into the research platform, and do the analysis. And then pick the best hyper-parameter values for use in the algo for the next N periods, and then do it all over again (in theory one could feed in the hyper-parameters with fetcher, without having to stop the algo). It just seems like a relatively simple API wrapper around this existing process would be nice.

You're right, it's not really a good reason not to add it. We try to provide all capabilities necessary, and hyper-parameter optimization definitely counts as that. We still have some work to do to support the ML workflow properly; I think it makes sense to only then worry about hyper-parameter optimization.

Hi Thomas - I'd be curious how you are approaching developing a Q ML workflow. Are you constrained by the present hardware & software limitations of the Q platform, or can you run offline (but still have access to the data sets)? It just seems like you'd want to have an unconstrained platform for R&D, to show the path, versus the other way around. Also, as I'd mentioned above, it would be prudent to have the capability to synthesize inputs, for algo verification testing. Is it possible to do this in the research platform? Can an algo be run there, but with synthetic inputs?

@Farhan: Thanks for sharing, that looks great! I see you had to increase the window length, which makes sense and is enabled by the RAM upgrade. Let's see how it does out-of-sample. Can you paper-trade this? Curious, what other things did you change? I also have a version that uses new fundamentals and can run more factors without running into time-outs. Will post soon.

@Farhan because you changed both the algorithm settings and the list of factors in input to the ML model, it is difficult to say if you found the right combination of settings that makes the ML model work or if you were just lucky with your input factors.
You could try to reset the list of factors as it was in Thomas' version and compare the results. Then you can add your new factors AND keep the original factors, and see if the ML model is able to select your factors amongst all the factors; if your factors offer the best performance, they should be the ones selected, and the good performance should persist.

What does n_fwd_days control? It controls how many days in the future you want to predict the return for.

Hi Luca - I thought the point of the ML was to improve the alpha combination step, over a simple equal-weight approach (see https://blog.quantopian.com/a-professional-quant-equity-workflow/). So, I don't understand the idea of applying it to a single factor. Also, to get at whether the ML code is working as expected, it seems we would want to be working with synthetic factors, with controllable characteristics. Within the Q research platform, this would be straightforward, I think. What characteristics of the input data would be required for the ML code to work like a charm? Once it is shown that the code works under ideal conditions, then perturbations to those conditions (e.g. random-walk-type noise and other statistical wackiness) can be introduced. Basically, develop the tool first, and then apply it to the real world. This convolution of uncontrolled inputs with an unverified tool doesn't make sense to me. Sorta leaping ahead in my opinion, which can lead to a lot of murkiness. Another thought here would be to incorporate Alphalens into the ML. As I understand, it is a human-readable tool for assessing alpha factors, but couldn't it be made readable by a machine? As part of the ML routine, couldn't it periodically pass all of the factors to Alphalens, which would provide input to the ML, which would then provide point-in-time weights across all of the factors, for the alpha combination step?
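Grant's controllable synthetic factor could be generated along these lines: use look-ahead bias to build a factor directly from actual forward returns, with a knob controlling how much signal survives added noise. This is a sketch under my own assumptions; `synthetic_factor` and the `predictability` parameter are illustrative names, not part of any posted code:

```python
import numpy as np

def synthetic_factor(forward_returns, predictability, seed=0):
    """Build a factor with tunable predictive power via look-ahead bias.

    predictability=1 reproduces the forward returns exactly (a perfect
    factor); predictability=0 yields pure noise of matching scale.
    """
    rng = np.random.RandomState(seed)
    r = np.asarray(forward_returns, dtype=float)
    noise = rng.randn(*r.shape) * r.std()
    return predictability * r + (1 - predictability) * noise
```

Feeding such factors, at several `predictability` levels, into the ML factor would give exactly the kind of controlled verification test discussed here.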
@Grant: Yeah, using alphalens (specifically the Information Coefficient) to assess which factors should be used is a great idea. Ideally the classifier would pick up on the same thing, but definitely worth a try.

Another note: for factors like the Precog stuff, where there could be over-fitting, the algo should take into account the in-sample versus out-of-sample periods. The most conservative approach would be to use only the out-of-sample period (i.e. ignore the factor until the date it became live on Quantopian). Regarding the Information Coefficient (IC), are we actually talking about Spearman's rank correlation coefficient, or something else?

@Grant "I thought the point of the ML was to improve the alpha combination step, over a simple equal-weight approach. So, I don't understand the idea of applying it to a single factor." I fixed 3 bugs in the ML factor, each of which was preventing the ML factor from working at all. I guess showing that the code works is the first step toward more robust testing. So, here it is: debugged and cleaned-up ML factor code for you to use. "Also, to get at whether the ML code is working as expected, it seems we would want to be working with synthetic factors, with controllable characteristics." I agree with you, but what's the point of creating a synthetic factor if the Precog factor already performs as an ideal synthetic factor? That's the reason why I used Precog data: first the NB shows the performance of Precog itself, so that you know what kind of data you are using, then the Precog data is given as input to the ML factor. "Once it is shown that the code works under ideal conditions, then perturbations to those conditions (e.g. random walk type noise and other statistical wackiness) can be introduced." That's exactly what the NB does, so we completely agree on how to test the ML factor, and the fact that you have some doubts about that NB might be due to a misunderstanding of what the NB exactly does.
"This convolution of uncontrolled inputs with an unverified tool doesn't make sense to me." What is this unverified tool? "Another thought here would be to incorporate Alphalens into the ML." This makes perfect sense to me, but why would you do that? First, clear requirements for the ML factor are needed; then we have to test whether the ML factor satisfies those requirements, and only if we discover that the current design prevents the ML factor from achieving the desired results would it be worth trying another approach. I don't believe it is a good idea to take a trial-and-error approach; it would be better to understand whether the current ML algorithm (i.e. AdaBoost + DecisionTreeClassifier + the transformation applied to the input factors and returns) can actually learn what we want it to learn. If it is clear that the ML factor cannot achieve the desired goals by design, then it is worth trying something else, but that should be a "goals driven design". "Another note: for factors like the Precog stuff, where there could be over-fitting, the algo should take into account the in-sample versus out-of-sample periods. The most conservative approach would be to use only the out-of-sample period (i.e. ignore the factor, until the date it became live on Quantopian)." By design the ML factor can detect whether the input factors are overfitted, because the factor values matched with future returns won't be correlated. This is true only if you run the ML factor out of sample; in my NB I ran the ML factor on historical data, and in that case there is no way the ML factor can detect overfitting. In my NB I exploited this inability of the ML factor to detect overfitting so that I could use Precog data as if it were synthetic data. "Regarding the Information Coefficient (IC), are we actually talking about Spearman's rank correlation coefficient, or something else?"
Exactly, that's Spearman's rank correlation coefficient computed between factor values and future returns.

"I fixed 3 bugs in the ML factor, each of them was preventing the ML factor from working at all." Just making sure, these fixes are all in the currently posted version, right? Or did you find new bugs?

Hi Luca - My thinking regarding the IC (Spearman's rank correlation coefficient) was to address your observation: "Third test: the ML factor is run with the predictive factor from the first test plus four other noisy factors, which don't have alpha value. The goal of this test is to verify that ML is able to find alpha even among noise, so we expect the results to be at least as good as the first test, given the same good factor is present in this test too. Unfortunately this is not true and the results are good but not as good as the first test. This is another point where the ML factor needs improvements: being able to discard noisy factors." I figure if a human can use the IC to discard noisy factors, then a machine should be able to do it, too.

@Thomas "Just making sure, these fixes are all in the currently posted version, right? Or did you find new bugs." The 3 bugs are the "old" ones, not new ones. I stopped checking other people's code though, so I can only be sure about my latest NB code.

@Grant "I figure if a human can use the IC to discard noisy factors then a machine should be able to do it, too." That makes sense; I didn't get that that was the reason for your suggestion, sorry.

@ Luca - Regarding use of the IC in an ML algo (where ML is used for the alpha combination step), one could imagine several approaches upon computing the IC across all factors: 1. Set an IC cut-off and simply exclude some factors from the alpha combination step. 2. Weight the factors by the IC (or a ranking by IC or similar) within the alpha combination step (I'm thinking along the lines of a constraint to the ML algo, but I'm not sure of the mechanics here). 3.
Supply the IC as a feature for each factor. 4. Do a one-time screening of factors by IC (i.e. simply don't include factors that don't make the cut via Alphalens analysis). This is the current Quantopian baseline approach, I think.

@ Thomas - If this is something you'd like to continue pursuing, would it make sense to release code on GitHub? And perhaps quantopian.experimental? Or maybe this is just a simplistic illustration that isn't worth the effort (per your comment above "We still have some work to support the ML workflow properly"). If I ever get a chunk of time to dig into it, it would be nice to know which code is current, and have visibility on the changes and bug fixes.

Here is an updated version that uses new Fundamentals and allows for using way more data (252 days, but it probably can be extended) as well as many more factors. The constraining factor now is training the classifier, which can easily time out in before_trading_start(). Thus, we need to use something fast, and I changed the classifier to a linear one (which might also help with overfitting). The training code was also refactored by Joe Jevnik. Please let me know if you find any problems with it. If it seems fine, I'll replace the top algo with this one. @Grant: Yes, I like the idea of putting this on GitHub. # Backtest ID: 59eb4263e13c964451a1978a

@ Thomas - For the code you posted immediately above (Backtest ID: 59eb4263e13c964451a1978a), what is the earliest possible start date for a backtest?
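The IC cut-off idea from option 1 could be sketched as below. This is a minimal illustration under stated assumptions (no tied values; `rank`, `information_coefficient`, and `select_factors` are my own illustrative names); in practice `scipy.stats.spearmanr` or alphalens would compute the IC, including proper tie handling:

```python
import numpy as np

def rank(a):
    """Rank values 1..n (smallest gets 1); assumes no ties for simplicity."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

def information_coefficient(factor_values, forward_returns):
    """Spearman rank IC: the Pearson correlation of the two rank series."""
    rf = rank(np.asarray(factor_values, dtype=float))
    rr = rank(np.asarray(forward_returns, dtype=float))
    return float(np.corrcoef(rf, rr)[0, 1])

def select_factors(factor_panel, forward_returns, min_abs_ic=0.05):
    """Drop factors whose |IC| against forward returns is below the cutoff."""
    return [name for name, values in factor_panel.items()
            if abs(information_coefficient(values, forward_returns)) >= min_abs_ic]
```

Run point-in-time over a trailing window, this would implement the "periodically test and remove factors with IC too close to 0" idea discussed later in the thread.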
Even as late as 1/1/2004, I'm getting: KeyError: Timestamp('2001-06-28 00:00:00+0000', tz='UTC') There was a runtime error on line 466.

Also, regarding the time-out issue with before_trading_start, I understand that for the backtester/trading platform you have to pick something for a time limit (why 5 minutes?), but this is more of a research project at this point. Is there a reason backtests can't be run in the research platform without this constraint? It seems like one would want to be able to plunk the code into the research platform and then, with a simple run_backtest call, run the code using the current backtesting engine, but with control over the before_trading_start time-out.

I have a version of the algorithm using a random forest classifier, among other changes (I wish I could use a neural network, but that doesn't work on Quantopian), but backtests always fail with the before_trading_start timed-out error. Is there a workaround allowing me to simply wait for the longer training time?

@Grant "For the code you posted immediately above (Backtest ID: 59eb4263e13c964451a1978a), what is the earliest possible start date for a backtest?" Considering that the ML factor has a window length of 252 days, and that the input factor of ML with the longest length is also 252 days, you need at least 504 days of historical data (trading days).

@Thomas The refactored code is really nice except for the global variable PRED_N_FORWARD_DAYS used internally by the ML factor; that should be an ML initialization argument, to avoid too-easy misuse.

@Thomas Also, I noticed that the factors are created without the mask argument, so there is a lot of wasted computation and probably memory too (this depends on the pipeline implementation, so I don't know for sure). Considering we are already reaching the platform limits, it would be better to mask the factors by the algorithm universe.

@ Luca - Thanks.
I just started a backtest with a 1/1/2005 start date, and it is running; this is consistent with your advice. A note to the Q team: it sure would be nice to indicate the earliest possible start date as part of the error message.

@ Thomas - What is the intended purpose of quantopian.experimental? Regarding the before_trading_start time-out issue, another thought is that perhaps you could do a "save as" of the before_trading_start code, call it before_trading_compute (or similar), and eliminate the 5-minute time-out (or change it to a much longer time period). However, before_trading_compute would only be available by importing it: from quantopian.experimental import before_trading_compute. Then, presumably, you could keep it sand-boxed (e.g. backtesting only). Or maybe this would be a hack and a waste of time, since a more complete ML workflow is needed? Looking over the code, I'd assumed that the ML would have as its output a set of weights, by factor (e.g. a vector w with N elements, where N is the number of factors). Then, the combined alpha would be a linear combination of the N alphas, using the weights. Is this not how it works?

@Grant Not exactly. ML does the following:

- rank each input factor
- binarize the historical returns so that the 30% worst returns become -1 and the 30% best returns become +1
- discard returns in the 30%-70% range, and the ranked factor values corresponding to the returns in that range
- align factor values with binarized future returns according to n_fwd_days
- at this point the ML factor has a trailing window (window_length long) of ranked factors aligned with the binarized future returns
- Training: that trailing window is passed to AdaBoostClassifier, which will try to learn the relationship between the output, the binarized-returns one-dimensional array (one value per security, +1 or -1), and the input, the ranked-factors multi-dimensional array (multiple values per security).
As AdaBoostClassifier receives a trailing window of binarized returns/ranked factors, it becomes clear that a predictive factor will have a consistent relationship between its values and the binarized returns (a particular factor value will map to a binarized return of +1 or -1 every single day of the trailing window; it won't change), while a non-predictive factor will be inconsistent, and the same factor value will map to both binarized return values +1 and -1 on different days. This is how AdaBoostClassifier can learn how to predict future returns.

- Prediction: the already-trained AdaBoostClassifier is given as input today's ranked factors (that is, the multiple ranked factor values associated with each security), and the classifier outputs the probability for each security of being +1 or -1. The ML factor output is the probability of a security being +1.

So I wouldn't say "the combined alpha would be a linear combination of the N alphas, using the weights", because AdaBoostClassifier doesn't combine the multiple factors in a linear way; instead, the combined alpha is the probability of a security being +1, however AdaBoostClassifier predicts it.

@ Thomas Thank you so much for the updated algo. I have one quick question. What would we need to do to go about testing this algo using the different types of models that scikit-learn provides? Such as:

    reg = linear_model.LinearRegression(normalize=True)
    reg = linear_model.Lasso(alpha=0.001, normalize=True)  # can change options here to use linear or ridge regression instead of lasso
    reg = linear_model.Ridge(alpha=.5)
    reg = linear_model.LogisticRegression()
    reg = linear_model.tree()

Those are just a few possibilities; as you know, many more are possible. I am wondering where in the code you have provided I would need to make changes so that I can test the algo on different types of regressions or classifiers similar to what I list above?
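The steps Luca describes can be condensed into a rough sketch. This is my own illustration, not the actual algorithm's code: array names and shapes are assumptions (`factor_history` as (days, factors, securities), `return_history` as (days, securities), already aligned to the n_fwd_days horizon), and ties in the ranking are ignored for brevity:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def binarize_returns(returns, lower=30, upper=70):
    """Label bottom-30% returns -1, top-30% +1, and mask out the middle."""
    lo, hi = np.percentile(returns, [lower, upper])
    labels = np.where(returns <= lo, -1, np.where(returns >= hi, 1, 0))
    return labels, labels != 0

def train_and_predict(factor_history, return_history, factor_today):
    # Rank each factor cross-sectionally (per day, across securities).
    ranked = factor_history.argsort(axis=-1).argsort(axis=-1)
    X, y = [], []
    for day in range(factor_history.shape[0]):
        labels, mask = binarize_returns(return_history[day])
        X.append(ranked[day][:, mask].T)   # rows: securities, cols: factors
        y.append(labels[mask])
    clf = AdaBoostClassifier(n_estimators=50)
    clf.fit(np.vstack(X), np.concatenate(y))
    ranked_today = factor_today.argsort(axis=-1).argsort(axis=-1)
    # The combined alpha is the predicted probability of the +1 class.
    return clf.predict_proba(ranked_today.T)[:, list(clf.classes_).index(1)]
```

A quick sanity check with a look-ahead "perfect" factor (one whose values equal the forward returns) should produce higher +1 probabilities for higher-ranked securities.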
Would something need to be altered in the section of the code starting with 'class ML(CustomFactor):'? Thanks in advance for the help as always. Best, Sam

@Thomas I think I found it. Would we just alter the line of code self._classifier = linear_model.SGDClassifier(penalty='elasticnet') to whatever type of classifier we want instead?

@Thomas Sorry for the multiple messages! One last thing. Can you also please help me find where to locate all the classifiers and regression models Quantopian has to offer? Thus far I have found:

    reg = linear_model.LinearRegression(normalize=True)
    reg = linear_model.Lasso(alpha=0.001, normalize=True)  # can change options here to use linear or ridge regression instead of lasso
    reg = linear_model.Ridge(alpha=.5)
    reg = linear_model.LogisticRegression()
    reg = linear_model.tree()

But I am assuming there are many others I can play with? Such as XGBoost, LDA, QDA, Random Forest, SVM? Wondering where I can find a list of all the ML models we can run Quantopian algos on. Thanks in advance!

@ Luca - Thanks for the explanation. Basically, it sounds like for each security the code generates a new single alpha, based on its alpha values across factors. For example, if there were only two stocks, XYZ & PDQ, and two factors, A & B, we'd have:

    alpha(XYZ) = f(alpha_A(XYZ), alpha_B(XYZ))
    alpha(PDQ) = g(alpha_A(PDQ), alpha_B(PDQ))

Effectively, if I understand correctly, it finds the functions f & g, so that when the current alphas are plugged in, the resulting alpha is predictive of returns. Why was the problem formulated this way, instead of finding the weights for a linear combination of factors? It would seem like the logical baby step from an equal-weight alpha combination. Or maybe ML wouldn't work for this?
@Grant, I believe your interpretation is correct, but I would stick to the nomenclature used in the ML code:

    future_return(XYZ) = trained_AdaboostClassifier(ranked_factor_A(XYZ), ranked_factor_B(XYZ))
    future_return(PDQ) = trained_AdaboostClassifier(ranked_factor_A(PDQ), ranked_factor_B(PDQ))

f and g are actually the internal model used by AdaBoostClassifier, whose specific parameters were set during the training phase; alpha_A and alpha_B are the ranked input factors; alpha is the future expected relative return (actually the probability of the factor values associated with each security having a binarized return == 1). "Why was the problem formulated this way, instead of finding the weights for a linear combination of factors? It would seem like the logical baby step from an equal-weight alpha combination. Or maybe ML wouldn't work for this?" I'd love to hear from Thomas how and why he came up with this solution, but I personally find this approach very robust, and it makes much sense to me. If you dig into the documentation of AdaBoostClassifier and DecisionTreeClassifier (used by default by AdaBoostClassifier), it will make sense to you too.

@ Luca - Thanks for the elucidation. One potential problem I see is that there is nothing in the code to select, point-in-time, which features (alpha factors) are valid and which are not. A given feature could have a Spearman's rank correlation coefficient (information coefficient or IC) of zero and still be included. This would seem to be wishful thinking: that even though the IC is zero, the feature would be useful in combination with other features. There is nothing to eliminate extraneous features point-in-time.

@ Thomas - I'd recommend using a universal rebalance function that simply takes in a Quantopian-standard alpha vector and applies the optimize/ordering routine to it. It appears you have some stuff in there that is specific to your strategy.
My interpretation of https://blog.quantopian.com/a-professional-quant-equity-workflow/ is that the alpha vector should be fully formed prior to the portfolio construction step.

@Grant "One potential problem I see is that there is nothing in the code to select, point-in-time, which features (alpha factors) are valid and which are not. A given feature could have a Spearman's rank correlation coefficient (information coefficient or IC) of zero and still be included." Yes, there is no control on that. "This would seem to be wishful thinking, that even though the IC is zero, the feature would be useful in combination with other features. There is nothing to eliminate extraneous features point-in-time." Well, one could modify the code so that it periodically tests for and removes the alpha factors with IC too close to 0 and doesn't pass them to AdaBoostClassifier; that would be an interesting test.

@Sam Khorsand: We have scikit-learn 0.16 installed on Quantopian: http://scikit-learn.org/0.16/supervised_learning.html#supervised-learning

@Grant: The key thing to understand is that most classifiers are not linear. As such, the end result will not be a linear combination of individual factors, but a non-linear combination of individual factors. A decision tree might classify as: if (momentum_stock_i > 3) and (mean_reversal_stock_i < -2) and ...; then predict 1, else predict 0. Training finds these rules. Good point on using the rebalance function.

@Luca: We've been inspired by some research from Sheng Wang, formerly at Deutsche Bank. I agree that the workflow makes a lot of sense and breaks the problem apart quite nicely. Another sentiment I recently came across is to not rank stocks, as that makes them co-dependent (if you change the rank of one stock, you have to rebalance your whole portfolio, incurring large transaction costs). If instead you just try to predict the returns of individual stocks separately, you can trade every position independently.
The difference, of course, is that one is trying to predict absolute returns that way, while the ranking approach just tries to predict relative returns. Although one could try to just predict relative returns of individual stocks by subtracting out the beta component.

@Thomas Yes, awesome, I see that. But I'm more curious whether the syntax is the same as in ordinary Python in a terminal when trying to import scikit-learn and run different scikit-learn packages on Quantopian? I assume it is?

Here's a long backtest. I guess I'd reiterate my comment that it is hard to tell if the ML does anything (and perhaps it even kills returns). Maybe there needs to be a switch in the code, to change to a simple equal-weight, linear combination of factors to combine the alpha for comparison? # Backtest ID: 59f199de0226bc43e837c673

Perhaps it is in the code already, effectively, but I suggest applying a standard preprocessing step across all features prior to combining them, such that an individual feature could be input into the optimize API as-is (see the discussion on https://www.quantopian.com/posts/quantcon-nyc-2017-advanced-workshop regarding demeaning and scaling of alpha factors), or the features could be simply combined with a linear combination and then input into the optimize API. There is a discussion of approaches at http://scikit-learn.org/stable/modules/preprocessing.html.
I've used this function:

    from sklearn import preprocessing
    import numpy as np

    def preprocess(a):
        a = np.nan_to_num(a - np.nanmean(a))
        # return preprocessing.robust_scale(a)  # security violation
        return preprocessing.scale(a)

Basically, if you look at the nice workflow on https://blog.quantopian.com/a-professional-quant-equity-workflow/, all of the individual alphas (features) would be put through a preprocessing step, yielding the final set of alphas to be combined (either by ML or otherwise). The code should be modular, so that a user can easily change the feature preprocessing technique without having to dig into the code and unravel it.

Folks, I have a problem I'd like some help with. How does one explain the behavior (smoothness and positive monotonicity) of the cumulative returns in the included notebook for the "ML + predictive factor" cell? The notebook is @Luca's NB, slightly modified. The modifications are: 1. Changed the start/end dates to essentially the last two years. 2. The key change is to increase "n_fwd_days" to look about a half-year ahead (140 days). 3. Also changed the sign (+ to -) of the ML output, to match up with the

I run the "very predictive factor" and the "ML + predictive factor", with a window of 252 days and a 140-day forward look-ahead. What I was looking for in Alphalens, as per Q recommendations, is a strong separation of the red line (lower quintile) below the dark green line (upper quintile), to help produce strong dollar-neutral long/short portfolios when creating a trading algo for the contest. I've done this before with multiple factors, and with the right ones, the effect is the same. My current guess is that the effect might have to do with an integrative smoothing effect from the way the alphalens "periods" parameter is set up, but I'm really at a loss to explain this effect theoretically... which is why I'm asking for the "wisdom of the crowd"! Thanks! alan
@Alan - nice, I've never tried making an ML factor with such a long n_fwd_days :) I didn't get why you inverted the ML sign though; is it because you saw in a previous run that ML was predicting the opposite of what it should have? "How does one explain the behavior (smoothness and positive monotonicity) of the cumulative returns in the included notebook for the "ML + predictive factor" cell?" I tried running your NB in a different time period and I got different results, but are you saying that you are getting very similar results with many other factors? If so, what are the similarities with this NB (same time frame, same alphalens periods, same ML settings)? By the way, I am curious about this cumulative returns output; I'll double-check that it is correct. If you are curious about how the cumulative returns plot is computed by Alphalens, you can look at this thread; it's a long discussion.

Due to limitations in memory and the availability of other Python libraries, Q might be missing out on recent breakthroughs in deep learning. In particular, the Keras library with TensorFlow or Theano as backend provides the Long Short-Term Memory (LSTM) model, a type of recurrent neural network that is very suitable for time-series prediction, as it captures long-term dependencies in the temporal space. The 'deep' part, stacking many layers coupled with long training periods for better generalization, makes it computationally power- and memory-hungry. Having said this, I support calls to be able to do computations offline and upload predictions to the Q platform. This, to me, is a win-win compromise, as it takes away from Q the burden of providing computational resources that could accommodate highly complex models like deep learning, while reaping the innovative benefits of bleeding-edge technologies.

Hello all :) I'm new to Quantopian.
I read the API infos but I'm struggling with this code anyway (math degree, so I'm not a good coder :( ). Can someone add some more comments after the features part? I would love to help here; I'm thinking about modifying the code into an RL one, just keeping the features and some of the trading stuff.

@Thomas - I'd be interested in your take on how this kind of ML alpha combination approach can be put in the context of the new risk model. Is the idea here that even if many (or even all) of the factors individually would be considered too risky, the ML, via nonlinear combination of them, might spit out a combined alpha that reduces the risks that would result from a purely linear combination of alphas?

The other comment is, I'm wondering if, for technical indicators relying solely on OHLCV bar data, there will be enough persistent, repeatable (but still transient) alpha for them to make sense as ML features? As the saying goes, "You can't get blood out of a stone." Or is the idea that the ML, with some non-OHLCV data (fundamentals, or so-called alternative data), would find relationships that would not be apparent with a rank correlation analysis, for example?

Finally, could ML be used simply to find the coefficients for combining alphas as a polynomial with interaction terms? For example (generalizable to any number of alphas):

    alpha = b0 + b1*alpha1 + b2*alpha2 + b12*alpha1*alpha2

Basically, point-in-time, find the optimal b coefficients. Perhaps it would work, and not be subject so much to memory limitations and time-outs? Another thought to clean up the inputs to the ML algo would be to run each individually through the Optimize API with constraints first, and then do the training. This should tend to tame the ML inputs, and perhaps lead to better results.

Yes, great questions. One approach is: for every factor, regress it against the risk model and treat the residuals as the actual alpha factor you put into the ML model.
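Grant's polynomial-with-interactions idea can be sketched with scikit-learn. The data below is synthetic and the coefficient names follow his formula; everything else is illustrative, not the posted algo's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

# Hypothetical example: two alpha factors across 500 stocks, and the
# forward returns we want to predict (constructed with a known b1, b2, b12).
alpha1 = rng.randn(500)
alpha2 = rng.randn(500)
fwd_returns = 0.5 * alpha1 - 0.2 * alpha2 + 0.3 * alpha1 * alpha2 + 0.01 * rng.randn(500)

# interaction_only=True yields exactly the columns [alpha1, alpha2, alpha1*alpha2],
# matching the b0 + b1*a1 + b2*a2 + b12*a1*a2 form (b0 is the fitted intercept).
X = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(
    np.column_stack([alpha1, alpha2]))
model = LinearRegression().fit(X, fwd_returns)
b1, b2, b12 = model.coef_  # point-in-time estimates of the b coefficients
```

Refit on each rebalance date to get the "point-in-time" coefficients Grant describes; this is linear in the parameters, so it is cheap compared to a tree ensemble.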
Well, as I understand it, you are working on all of the machinery to manage risk and all of the other desired constraints (see https://www.quantopian.com/posts/a-new-contest-is-coming-more-winners-and-a-new-scoring-system), and presumably it will all reside within the Optimize API. So my thought is that each alpha factor would be run through calculate_optimal_portfolio first, and then be combined (and then finally stuffed into order_optimal_portfolio for a final tweaking). This way, each alpha factor is fully constrained and risk-mitigated individually, prior to the combination step. It would seem that this would tend to de-noise the features using exactly the same constraints that will be applied to the final combined alpha. Can the Optimize API be used within pipeline? I don't recall ever seeing an example.

The Optimize API can't be used like that. The process I outlined above gets at the same thing, just more directly. The idea is to "regress" every factor against the existing risk factors. There will definitely be exposures; for example, if you have a novel mean-reversion factor, it will be very highly correlated with the one we have in our risk model. But if you then regress out those factor exposures, you are left with a factor that's orthogonal to everything we have in the risk model.

Hmm? The Optimize API is based on CVXPY, and CVXPY can be used in Pipeline, so it seems like the Optimize API could be made to work within Pipeline (or something like the Optimize API). I guess I'm confused--why was the Optimize API developed in the first place, if there is a more direct way? A topic for a separate discussion thread, I suppose...

Hi Thomas - I think what you might need is to show explicitly in Jonathan's flowchart a preprocessing step for alpha factors. I'm not sure from an architectural standpoint whether it would be within each alpha factor (e.g.
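Thomas's suggestion of regressing each factor against the risk model and keeping the residuals can be sketched in a few lines. The exposures below are synthetic and `residualize` is a hypothetical helper, not a Quantopian API; the residuals of an OLS fit are, by construction, orthogonal to the regressors.

```python
import numpy as np

def residualize(factor, risk_factors):
    """Regress a cross-sectional factor on known risk-factor exposures and
    return the residuals, i.e. the part orthogonal to the risk model.

    factor: (n_stocks,) array; risk_factors: (n_stocks, n_risk) array.
    """
    X = np.column_stack([np.ones(len(factor)), risk_factors])  # add intercept
    betas, *_ = np.linalg.lstsq(X, factor, rcond=None)
    return factor - X @ betas

rng = np.random.RandomState(1)
risk = rng.randn(1000, 3)  # e.g. momentum, size, value exposures for 1000 stocks
raw = risk @ np.array([0.8, -0.5, 0.2]) + rng.randn(1000)  # factor contaminated by risk
clean = residualize(raw, risk)  # residuals: orthogonal to each risk column
```

Each `clean` vector would then be fed to the ML model in place of the raw factor.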
as a function called at the output of each alpha factor), or at the input stage of the combination step (applied to each alpha factor). If the preprocessing is common to all factors, then maybe it could be vectorized within the combination step? The other advantage of having it in the combination module is that the preprocessing can then be tailored to the details of the combination technique. Within the preprocessing step, all of the noise/clutter/extraneous junk could be removed prior to combination. Then, the Optimize API should be more of a nudge than a shove (ideally, one should be reasonably close to optimal prior to using any optimization routine). Is there an active effort to build up an ML API/framework on Quantopian? If so, it would be interesting to hear how you are approaching things.

Here's the Machine Learning on Quantopian algo with the risk model added as a constraint to the optimizer. I had to change the objective to a factor-weighted portfolio in order to pass the contest constraints. Additional research on what factors to include should help bring the cumulative returns to positive and allow anyone to submit it to the contest.

[Backtest attached; ID: 5a81e2757ada5f4237c8d5a9. There was a runtime error.]

Disclaimer: The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian.
In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action, as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Hi Maxwell - I'd recommended to Thomas W. to sort out how to provide a set of dummy factors that have been synthesized to have favorable characteristics for this type of algo. As it stands, it is very hard to tell if the algo is working properly, in my opinion. Basically, my thinking is one ought to compute a baseline with a simplistic/naive factor combination, and then compare it to combining factors via the ML routine. One would use idealized/toy factors, and systematically add in noise/randomness to see how the ML algo performs. Is this feasible? Could you put some factors up on quantopian.pipeline.experimental for this purpose?

Grant: Completely agree, it's a great idea. Mainly a question of prioritization.
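A minimal sketch of what such synthesized dummy factors could look like (the `make_toy_factor` helper and the IC-mixing construction are my own, not anything Q provides): a toy factor built from future returns plus tunable noise. Because it uses future returns directly, it has pure look-ahead bias and is usable only for validating the combination machinery against a known amount of alpha.

```python
import numpy as np

def make_toy_factor(fwd_returns, information_coefficient, rng):
    """Build a synthetic factor whose correlation with future returns is
    (approximately) the given IC. With both inputs standard normal, the mix
    ic*r + sqrt(1-ic^2)*noise has unit variance and correlation ic with r.
    """
    noise = rng.randn(*fwd_returns.shape)
    ic = information_coefficient
    return ic * fwd_returns + np.sqrt(1 - ic**2) * noise

rng = np.random.RandomState(42)
fwd = rng.randn(250, 500)                # 250 days x 500 stocks of future returns
strong = make_toy_factor(fwd, 0.3, rng)  # the combination step should find this
pure_noise = make_toy_factor(fwd, 0.0, rng)  # ...and down-weight this
```

Feeding a mix of `strong` and several `pure_noise` factors into the ML combination, and checking that the noise factors get little weight, is exactly the kind of baseline test Grant describes.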
Well, could the analysis be done in the research platform, where presumably one can construct factors with look-ahead bias (by time-shifting the Pipeline input data)? Maybe then Alphalens could be used to see if there is any benefit to using the ML-based alpha combination.

Yes, definitely, good idea. Would be great if someone wanted to give that a shot!

It would be good to have a starting point. Surely there are papers on the topic of synthesizing toy market factors with controllable characteristics. As far as priority, I'd think this would have broad applicability in the context of Quantopian. Maybe I'm thinking about things incorrectly, but the workflow would seem to be:

1. Create a set of toy factors with controllable characteristics, X.
2. Run the algo on the toy factors and confirm the algo works as anticipated.
3. Create real factors that are hypothesized to yield a similar set of characteristics, X.
4. Run the algo on the real factors and test the hypothesis.

This is kinda standard practice in science/engineering: systematically introducing randomness/noise into a model of the inputs and examining the outputs. I'm guessing something similar is followed in the world of finance? I guess it would be kinda cool to try this, but how would one go all the way through to the algo stage? Would there be a way to do it with quantopian.pipeline.experimental?

Question: I hacked together a version of this algo that seems to perform much better than what has been posted, but every so often it returns an error due to NaN input. Now, since Quantopian doesn't provide the full stack trace, I am completely at a loss as to what is going on, since the imputer should replace NaN values with 0. The key change I made was that instead of the factor rank or z-score, I use the raw value. Now, this is actually quite important to my algorithm's function, and it performs much worse using the factor rank or z-score. Does anyone have any ideas where things could be going wrong?

Does this resolve your NaNs definitively?
https://www.quantopian.com/posts/forward-filling-nans-in-pipeline I've seen 100% NaNs with a MACD factor. In that case there was a try/except; with nanfill, the try always worked, no NaNs.

@David: I'm sorry that the IDE is limiting you in your ability to debug the issue. We have a tricky balance to strike between transparent error messages and tight security practices, which is why we limit the error message to just the last line of the stack trace. Here are a couple of options for debugging the problem:

1) Comment out blocks of code. This is my preferred tactic. I usually comment out most of the code (everything but initialize, usually), then add back one chunk at a time, making sure the result is complete and matches my expectation. With a slower backtest, you should kick off a few backtests in parallel so you don't have to iterate through one section at a time.

2) Try using the built-in debugger by setting a breakpoint in your algorithm and stepping through it line by line. This is the ideal solution, but I know the debugger sometimes goes very slowly in algos that perform complex computations. I recommend you try it, and if it doesn't work, move on to 1) or 3).

3) Email in to [email protected] with your backtest ID, and we can try to get a better error trace for you. Sometimes this is helpful; other times it won't be, and your best bet is to go back to 1).

Regarding your specific error, are you dropping NaNs in the screen of your pipeline, or are you doing it somewhere else?
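One plausible source of the crash David describes: the imputer replaces NaN with a constant, but it does not catch infinities, which raw (unranked, un-z-scored) factor values can easily produce, e.g. from a divide-by-zero. A defensive sketch (`safe_training_matrix` is a hypothetical helper, not part of the posted algo):

```python
import numpy as np

def safe_training_matrix(features):
    """Replace all non-finite entries before handing features to a classifier.
    np.isfinite catches NaN *and* +/-inf, unlike a NaN-only imputer."""
    features = np.asarray(features, dtype=float)
    bad = ~np.isfinite(features)
    features[bad] = 0.0
    return features, bad.mean()  # also report the fraction imputed, for logging

X, frac = safe_training_matrix([[1.0, np.nan], [np.inf, 4.0]])
```

Logging the imputed fraction each rebalance also gives an early warning when a factor starts producing mostly-bad values, rather than failing silently.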
@Jamie: Thanks for the advice; I will try dropping NaNs in the screen, as I believed the imputer was sufficient to prevent NaN inputs to the classifier.

These 3 threads are very rich, but I often see references (especially by Thomas Wiecki) to something Luca said that does not appear in the thread. Have some of Luca's comments been deleted? Is there a parallel conversation somewhere else that everyone except me is privy to? (Maybe in Slack? Or on GitHub?) Since Thomas Wiecki frequently praises Luca's contributions for technical seriousness and relevance, it is all the more disconcerting. I feel like I'm reading Plato to figure out what Socrates actually said, or maybe like I'm reverse engineering data in search of an invisible factor. Those of us who are late to the conversation would be grateful to find the conversation still intact.
@John Strong, I delete my comments from time to time if I believe they are no longer relevant. In this thread I discussed with Thomas a few bugs in the ML code, but the code was eventually fixed and updated, so I decided to delete my comments. There was also a NB that I used to test whether the ML factor works, but I deleted that post too, due to lack of collaboration. I'll see if I still have the NB and I'll post it here.

Thanks, that's kind of you. My vote is: leave the comments in! Even when they're stale.

@John... it's a flaw in the way Quantopian runs the forum, as far as I'm concerned. No disrespect to @Luca or others who are just using the forum within the implicit rules, yet my belief is that in a forum like this one, the only entity that should be able to delete posts is the moderator, and that should only be in special circumstances, which should be spelled out to all of us. I realize that there may be IP/privacy issues at play, yet the alternative of "threads-with-holes" doesn't seem like a good solution either! So yes... at times it does feel like: "I feel like I'm reading Plato to figure out what Socrates actually said, or maybe like I'm reverse engineering data in search of an invisible factor." I've taken to immediately copying algos/notebooks that I'm interested in, in case they quickly disappear... alan

Here is the NB I used to test the ML factor. First, the ML factor is run with a single predictive factor as input. This is done to test that the ML is actually able to detect the alpha and to predict the future returns accordingly. Second, a slightly predictive factor is added to the mix. Finally, we analyze the performance of the ML with the predictive factor as input, plus 4 non-predictive factors, to verify that the alpha is still found when there is lots of noise.

[Notebook attached.]

@Luca: This is awesome (hopefully people won't wonder what this was in response to a couple of weeks from now ;))!
I don't remember seeing this, actually, but it's exactly what Grant and I have been looking for. It seems like the ML can still do a good job even with noisy factors present, which is highly reinforcing. It would be interesting to look at clf.feature_importances_ to see how much weight is given to the noisy factors. Also, I would be curious to compare this with different classifiers. Specifically, does a linear predictor do better? Does RandomForest do better than AdaBoost? Really exciting framework to make progress in.

"hopefully people won't wonder what this was in response to a couple of weeks from now"

Lol :) One thing about AdaBoost is that the documentation is not correct, and the default base_estimator is DecisionTreeClassifier(max_depth=1). I found that out by looking at the source code. If we want the DecisionTreeClassifier to take into consideration the interaction between multiple factors, max_depth cannot be 1.

@Luca - So what is the "executive summary" of your notebook posted above? If I understand correctly, it is not entirely clear that the ML provides an advantage over a simpler alpha combination. It says that there isn't a gross bug (good to know), but I can't tell if the ML is worth the trouble, or potentially disadvantageous compared to some naive alpha factor combination (e.g. sum of z-scores).

Here is what I think we should do:

1. Refactor the code to just return the alpha of the factor.
2. Run experiments on all kinds of different classifiers and combination methods (like the baselines outlined by Grant).
3. Create nice comparison plots of the different approaches/methods in the different scenarios (single strong signal, etc.).

Anyone here up for the task?

Regarding re-factoring, would it be possible to put the code on open GitHub, but have it available for import, via quantopian.pipeline.experimental?
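Luca's point above about AdaBoost's default depth-1 stumps can be demonstrated on toy data: a boosted sum of stumps is additive in the features, so a label that is a pure two-factor interaction (XOR) is largely invisible to it, while depth-2 trees can represent it. This is an illustrative sketch, not the posted algo's configuration; the base estimator is passed positionally since its keyword name has changed across scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(2000, 2)
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # label = pure interaction (XOR)

# Equivalent to the documented default (a depth-1 stump): each stump can
# only look at one factor at a time, so the ensemble is additive and
# cannot capture the interaction.
stumps = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                            n_estimators=50, random_state=0).fit(X, y)

# A depth-2 tree can split on one factor and then the other, so the
# boosted ensemble can pick up the interaction.
deeper = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                            n_estimators=50, random_state=0).fit(X, y)
```

Comparing `stumps.score(X, y)` with `deeper.score(X, y)` makes Luca's max_depth point concrete; `clf.feature_importances_` on the fitted ensemble answers Thomas's question about the weight given to each factor.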
Overall, it would be worthwhile to understand where Q is headed with the alpha combination step, as your former CIO Jonathan Larkin discusses in "A Professional Quant Equity Workflow":

"Lastly, modern machine learning techniques can capture complex relationships between alphas. Translating your alphas into features and feeding these into a machine learning classifier is a popular vein of research."

Before diving into re-factoring, I'd advise doing some system engineering and road-mapping to understand what the goals should be versus time. In the end, I'd think Q would want some combine_alpha API module that would be extensible and scalable (including compatibility with whatever computing power will be available down the road).

I think that's the wrong order. First you want to understand the problem (and we have all the tools to do so), and then you can provide a meaningful API.

"If you don't know where you are going, you might wind up someplace else." - Yogi Berra

Well, start with what you know. If it is a reasonable assumption that some sort of alpha combination API is desirable, and at least parts of it would be open-sourced, then the idea of re-factoring the code on GitHub and being able to import it via quantopian.pipeline.experimental seems sound. It is the 21st century, so source control with a collaborative tool like GitHub and code import should be a no-brainer, in my opinion. I'd also note that Jonathan posted the framework on August 31, 2016, so I'd think that some alpha combination framework has been percolating (if anything, in considering how to combine user algos for the fund). If not, well, I guess other work has taken precedence. But if so, it could be brought into the sunlight, along with any thinking about platform compute capability versus time.

@Grant - You understand the NB correctly. The ML factor is not perfect yet, but we have a robust ML method, debugged code and a NB for testing: the hard part is done.
Now it takes very little to have fun adding our own ideas to the framework.

Is there an appropriate GitHub location for the code? Also, is there any kind of flowchart/schematic or other existing visual documentation (e.g. in Visio or similar) that could be edited for a re-factoring effort?

Hi Thomas - Any feedback on my suggestions above? It seems that to move this forward in a substantive way, input and support from Q would be helpful. I wouldn't mind working on it, but I'd like to know how it would fit into the bigger picture. Also, as I mentioned, getting it on GitHub and integrating it with the Q platform in some fashion (e.g. as importable via quantopian.pipeline.experimental) would help a lot, as the forum is not well-suited for this sort of thing. Also, this would allow the resulting code to be used on Quantopian and zipline-live (although if zipline-live is successful, I suspect it will leave Q in the dust in terms of platform compute power, and so zipline-live users may favor other approaches... a framework for alpha combination of Pipeline factors might be useful, regardless).

Those are great suggestions, but they probably will not happen soon. As I said before, the workflow is supported (as proven by Luca's research). There will always be limitations we could sit around and wait to be resolved. I'm happy to upload the code to a public GitHub repo, I'm just not sure what that unblocks. The steps I outlined above can all be done today.

Getting the code to a public GitHub repo would just be better practice, I think. Or we can all clone your algo above, and then repost umpteen versions on this thread and elsewhere on the forum. I gather that you don't really have time to shepherd the thing; if there is interest, GitHub would seem to be the right tool for collaborative work on the code. My personal interest is that I see it as an opportunity to learn a little bit about ML, and to interact with folks more expert than I am.
The forum is really challenging in its present state for this type of work, in my opinion.

Thanks Thomas - I'll have a look at some point, to see what I can learn and contribute.

This is my first post ever on Q. This thread is AWESOME. Very near and dear to my heart, because I did stat arb in my past for a very well known fund. As an admitted "old timer" I'm always amazed to see how far and fast things have come. Absolutely awesome. BTW... what is this "advanced momentum" thing? What is the insight?

I took a first pass at re-factoring, starting with a framework for the factors/features: https://www.quantopian.com/posts/multi-factor-algo-template If you'd like to contribute, please have a look.

Hi Thomas - I'd be interested in your thoughts on the feasibility of doing part or all of the computationally intensive ML offline, and then uploading the results using the new API that is in the works. If I understand correctly, to do offline work there would be a real benefit in having offline access to the QTradableStocksUS symbols point-in-time. This way, the ML would be operating on the same universe as is required by the contest and preferred for the fund. If I'm thinking about it correctly, one does not want to do the ML on all stocks, for example, and then filter out symbols that are not in the QTradableStocksUS, since in effect the ML will be "contaminated" by the unwanted stocks (as it does not operate on individual stocks independently of the others). I'm highlighting this because I'd think that offline ML would be one of the primary use cases for the new data-upload API, but without access to the point-in-time universe, it would seem a bit hamstrung.

I just stumbled on a very interesting use of Machine Learning within Zipline. Professor Erk Subasi gave a talk two years ago at QuantCon 2016 entitled "Honey, I Deep-Shrunk the Sample Covariance Matrix!"
(a reference to a classic paper by Ledoit and Wolf), in which he shows how he used autoencoders to calculate the covariance matrix used in portfolio optimization. I really would like to understand his algorithm, but it is pretty deep waters for me, and I could use a study partner, in case anyone is interested. You can watch the video here: https://www.youtube.com/watch?v=jSAqgotWYYc&list=PLRFLF1OxMm_W-TiyLZoRYCHOnXG1CqTAL&index=25&ab_channel=Quantopian You can download his Jupyter notebook with the autoencoder code and Zipline integration here: https://github.com/erksubasi/AutoencoderCovShrinkage

@Alan Coppola, only in the sense that both approaches tackle the problem of ill-conditioned covariance matrices. But OAS is a standard shrinkage technique, whereas Erk Subasi is using autoencoders, which are a type of neural net. I had never heard of that before, and Subasi is a quant and portfolio manager; he's no dummy. I wish I were a seasoned researcher and could tell you I have made extensive comparisons of Ledoit-Wolf and OAS and GraphLasso and autoencoders, and could then offer you a list of pros and cons for each. Alas, I'm still a student, green, and struggling to invert any covariance matrix bigger than a couple hundred equities. But this is an educational thread, no? I thought someone might be intrigued by the idea of using autoencoders to regularize covariance matrices.

Hi all, great thread. However, if I understand correctly, my question has not come up. According to my understanding of the posted algorithm, on any given backtest date when the pipeline runs (say, I am backtesting for June 2015), it identifies the stocks that are part of the universe on that day, and then loads the history of various factors for these stocks. However, is that actually the data that I want as input to my ML algorithm?
If the algorithm looks at factors from one year earlier, shouldn't it consider the stocks that were part of the universe at that time (say, June 2014), not at the backtest date when the pipeline runs (June 2015)? In its current form, it appears to me that a look-ahead or survivorship bias creeps in, and it also results in NaN values being loaded. In Research I can build pipelines that behave the way I sketch above, but I can't find/hack a way to implement this in the Algorithm environment. Do I understand this correctly, and does anyone have advice or a reference for this? For example, on a given backtest date (say, in June 2015), can I load the pipeline output as it would have been at an earlier time (say, in June 2014)? Thanks

Hi Ja,

"According to my understanding of the posted algorithm, on any given backtest date when the pipeline runs (say, I am backtesting for June 2015), it identifies the stocks that are part of the universe on that day, and then loads the history of various factors for these stocks."

That's not the case. When you run the backtest starting in June 2014, it will load the universe for June 2014, compute the factors, train the model, make predictions, etc. Then in July 2014 it will load the universe for that month. So there is no look-ahead bias here. In general, we try to make it impossible to access future data in a backtest, so that you don't have to worry as much about look-ahead bias.

Hi Thomas,

Thanks. Yes, I understand that the pipeline always loads in the data that is available before the open for each date in the backtest. This is what makes the pipeline a great tool! Maybe I didn't communicate my observation as clearly as I hoped (and I should possibly have stuck with "survivorship bias", not "look-ahead bias"). Allow me to try once more, using an explicit example (which I checked!): Let's say I am running my backtest, and I set the algorithm to rebalance my portfolio every Wednesday.
One such rebalance date would be this last Wednesday, 25 April 2018. One of the stocks included in the Q1500US universe on that Wednesday is Equity(45212 [CTRL]). Now, for every stock in Q1500US (including Equity(45212 [CTRL])), the algorithm loads data from the 252 days prior to last Wednesday, that is, back to 25 April 2017, trains the ML classifier using this data, makes predictions, and rebalances my portfolio. Okay. However, Q1500US changed dynamically along the way from 25 April 2017 to 25 April 2018. For example, Equity(45212 [CTRL]) had a market cap of less than $500 million back then, and was not in the universe around 25 April 2017. I checked this. One example that could get me into trouble is if I use market cap as a predictor: all stocks that had a market cap below $500 million performed great in the ML training data. Otherwise, they wouldn't be in Q1500US on the rebalance date, 25 April 2018! This is precisely the case for Equity(45212 [CTRL]). So the ML algorithm might overweight market cap. Similarly for long-term momentum, such as your 39-week returns.

The solution to this issue would be that when the algorithm loads data from the previous 252 days for each rebalance date, it should take into account what the Q1500US universe was on each of those 252 days. I hope my point is clearer now. My question was whether this is possible. Thanks!

Hi Ja,

OK, I see what you're saying. It's not really look-ahead bias of the backtest but because of the way we apply the universe, the resulting ML model is biased in a certain way. That sounds right and it's a keen observation. I'm not quite sure how this could be addressed however. You would want to apply today's universe only to the historical data. I'm a bit surprised this isn't what's happening. Maybe Scotty has an idea?

Hi Thomas -

The impact of a limited universe would seem to depend on how the learning is done. If each stock is independent, it would seem not to matter whether you apply the universe filter before or after the learning. For example, say I have a stock market of three stocks: ABC, XYZ, & PDQ. At any point in time, I pick two of the three to be in my tradable universe. Out of convenience, I could only run my prediction on two (ABC & XYZ), ignoring the third (PDQ), or I could run my prediction on all three (ABC, XYZ, & PDQ), and then filter out the one (PDQ) that is not in my point-in-time universe. Either way, I should get the same result.

This is still a bit of a mind-bender for me, but I think the problem comes in if I need PDQ as part of my data set to predict ABC & XYZ (which, by policy, will be the only ones I'll trade going forward). Taking a horse race analogy, if the three horses are well-separated on the track, one could make the argument that each runs an independent race; the presence of the other horses is irrelevant. However, if they are bunched up, then potentially I have a very different problem; not including PDQ in the analysis could be problematic. Does the ML code that is the subject of this forum post include interactions, or does it effectively treat each stock as an independent horse on the track?
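One concrete caveat to the "independent horses" view: even if the model itself treats stocks independently, cross-sectional preprocessing (ranking or z-scoring across the universe) ties them together, so filtering the universe before vs. after the preprocessing changes the feature values. A toy illustration using the three-stock example above (illustrative numbers, not real data):

```python
import numpy as np
from sklearn.preprocessing import scale

# Factor values for three stocks: ABC, XYZ, PDQ.
raw = np.array([1.0, 2.0, 10.0])

all_three = scale(raw)     # z-score across all three, then drop PDQ afterwards
two_only = scale(raw[:2])  # drop PDQ first, then z-score the remaining two

# ABC's feature value differs depending on whether PDQ was in the
# cross-section when the z-score was taken, so filter order matters even
# without any model-level interaction between stocks.
```

So the horses interact through the normalization step alone, before any classifier "interaction" question even arises.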

Hi Ja,

Scotty provided a solution for the issue you identified. We need a universe of stocks that are in the QTU not just today but over the complete window duration. This can in fact be achieved quite simply with Pipeline: Any(inputs=[QTU()], window_length=500) creates a new filter that contains all stocks that were ever in the QTU over the last 500 days. You could also use All if you only wanted stocks that were in the QTU at all times during the last 500 days. You can use this filter in place of where QTU is currently being used.

Let me know if that makes sense and if you make progress with that.
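For anyone unsure what those filters compute: Any/All reduce to a column-wise any/all over a point-in-time membership matrix. A toy numpy illustration of the semantics only (not actual Pipeline code; the real filters operate on the QTU filter's trailing window):

```python
import numpy as np

# membership[t, s] = was stock s in the universe on day t?
membership = np.array([
    [True,  False, True],   # day 1
    [True,  True,  True],   # day 2
    [True,  True,  False],  # day 3
])

any_window = membership.any(axis=0)  # Any(...): in the universe at some point in the window
all_window = membership.all(axis=0)  # All(...): in the universe on every day of the window
```

For Ja's survivorship concern, Any is the relevant one: training data should cover every stock that was in the universe at any point in the lookback, not just today's survivors.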

Hi Thomas,

Thanks for investigating! Okay, I see what All and Any would do. So do you mean essentially that in my Pipeline, instead of screen=QTradableStocksUS(), I should set screen=All(inputs=[QTradableStocksUS()], window_length=500). But that raises a NonWindowSafeInput error when I run it. So am I misunderstanding? Thanks

Hi Ja,

That's a bug we just fixed that should be on the platform in a day or two.

Hi Thomas,

Thanks for following up. Indeed, it looks like it works as intended now.

Thanks,
Jakob

I've read about 1000 Quantopian messages and web pages and didn't see an answer to the following question, so I am posting:

Does the backtester support a 'burn in' phase?
That is, can one collect training data for N days before making predictions?

(I can get that to occur using an initial call to history() by simply going far enough back,
and then doing my own sliding-window calculations, but (apparently) not with pipeline() and using built-in factors and Fundamentals.)

I can get a 'burn in' to happen with pipeline by:

    def before_trading_start(context, data):
        ...
        # Store today's pipeline data, keeping the last context.historySize days of results.
        fillExamplesInCircularArray(context, data)
        # After the burn-in phase, (re)train daily (or use whatever period one wants).
        if context.countOfTradingDays >= context.historySize:
            create_model(context, data)

    def trade(context, data):  # Referenced in a call to schedule_function in initialize.
        # Don't trade until the burn-in phase completes.
        if context.countOfTradingDays < context.historySize:
            return
        ...


To collect, process, and store a year's data on a daily basis (i.e., N = historySize = 252) for the Q500US takes only about two minutes, but the weakness is that the backtester's GUI assumes I am 'all cash' for the first year.

It would be nice if the backtester supported NOT STARTING the GUI (and its calculation of returns, the benchmark's returns, etc.) until the 'burn in' phase completed (with, say, a 10-minute CPU-time limit on burn-in). It seems a reasonably straightforward change to the tester code. It is natural (and standard) to collect some historical data before fielding an ML system; in most cases there is no need to make 'real' predictions until one has a good amount of historical data for initial training. (I can correct the calculations reported in the backtester's GUI myself, but if I entered a contest, I could not use this 'burn in' idea. It would also be nice if the backtester's GUI supported burn-in, since it seems natural.)
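Outside of Quantopian, the burn-in pattern described above can be sketched in plain Python with a fixed-size circular buffer; all names here are illustrative, and the real algo would store pipeline output rather than plain numbers:

```python
from collections import deque

class BurnInTrainer:
    """Sketch of the circular-array burn-in pattern: collect examples
    for history_size days before training, then (re)train each day."""

    def __init__(self, history_size=252):
        self.examples = deque(maxlen=history_size)  # circular buffer
        self.model = None

    def on_trading_day(self, todays_example):
        self.examples.append(todays_example)
        # (Re)train only once the burn-in window is full.
        if len(self.examples) == self.examples.maxlen:
            self.model = self.train()

    def train(self):
        # Stand-in for a real fit; here just the window average.
        return sum(self.examples) / len(self.examples)

    def should_trade(self):
        # Don't trade until the burn-in phase completes.
        return self.model is not None

trainer = BurnInTrainer(history_size=5)
for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    trainer.on_trading_day(x)
print(trainer.should_trade())  # → True after the 5-day burn-in
```

The `deque(maxlen=...)` handles the sliding window automatically: once full, appending a new day silently drops the oldest one.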

Thanks,
Jude


I tried to use the universe = Any(inputs=[QTradableStocksUS()], window_length=500) as stated above, but I got an error saying Any is not defined. What do I need to import to get Any and All defined?

from quantopian.pipeline.filters import All, Any

I tried this with 3 factors that have alphas of about [0.10, 0.10, 0.05] which, when combined by simply ranking and summing them, give about a ~1.3 Sharpe and 4% returns.

When hand-picking and combining just the 2 best factors I get a ~1.8 Sharpe and 7% returns, but the ML algo doesn't seem able to discover that these two combined are the best: when I run it (with different hyperparameters), I often just get between -4% and +1% returns.
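For reference, the simple "ranking and summing" combination mentioned above can be sketched with pandas (the factor names and values here are made up):

```python
import pandas as pd

# Hypothetical cross-sectional factor values for five stocks:
factors = pd.DataFrame(
    {"value": [0.3, -0.1, 0.5, 0.2, -0.4],
     "momentum": [1.2, 0.8, -0.3, 0.4, 0.9],
     "quality": [0.1, 0.6, 0.2, -0.5, 0.3]},
    index=["A", "B", "C", "D", "E"],
)

# Rank each factor across stocks, then sum the ranks into one signal;
# higher combined rank = stronger long candidate.
combined = factors.rank().sum(axis=1)
print(combined.to_dict())  # {'A': 11.0, 'B': 10.0, 'C': 9.0, 'D': 6.0, 'E': 9.0}
```

This equal-weighted rank sum is the baseline the ML combination step is trying to beat.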

Has anyone who tried this with actual alpha factors been able to verify it works at all?

@Ivory: I think Luca's analysis above (https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm?utm_campaign=machine-learning-on-quantopian-part-3-building-an-algorithm&utm_medium=email&utm_source=forums#5aa480f9493f3f001d9f01d8) is the best test currently.

What classifier did you use? If it's a non-linear one, it could overfit, so I would try a linear one and look at what it's learning.
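The "try a linear model and look at what it's learning" suggestion can be illustrated with a least-squares fit on synthetic data (numpy; everything here is made up): the fitted coefficients directly show which factors the model relies on.

```python
import numpy as np

# Three synthetic "alpha factors"; only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                        # factor values
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=1000)   # forward returns

# Ordinary least squares: coefs are the learned factor weights.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, c in zip(["factor_1", "factor_2", "factor_3"], coefs):
    print(name, round(c, 2))
# factor_1 and factor_2 get weights near 1; factor_3 near 0.
```

With a non-linear classifier there is no such direct readout, which is why a linear baseline is a useful sanity check before blaming the factors.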

I am still seeing the NonWindowSafeInput error when I use the universe = Any(inputs=[QTradableStocksUS()], window_length=500) syntax. The error I get is as follows:
NonWindowSafeInput: Can't compute windowed expression Any((PrecomputedFilter(NumExprFilter(...), name='QTradableStocksUS_version_1'),), window_length=500) with windowed input PrecomputedFilter(NumExprFilter(...), name='QTradableStocksUS_version_1').

I find that All and Any work now. However, some features, such as market cap or earnings yield, raise a NonWindowSafeInput error unless they are used only in ranked form. Could this be happening to you?

Do we have a timeline for tensorflow/keras/theano...anything more robust than sklearn?

I was hoping we would see an integration with Compellon 20|20. Blows tensorflow/keras away and doesn't require a heavy coding background...

Does anybody work/play with Numerai? What you have to do there is exactly what the ML factor here tries to achieve: find a combination of predictive features (the equivalent of input factors for the ML) that explains the training data and use it to predict the future. It is really the same goal as the ML factor, and you face the same problems (though there are also many differences in the details). What is interesting is that sometimes I use Numerai ideas on Quantopian and sometimes I do the opposite. It is surprising how changing the context helps you think of new solutions to the same problem.

I do. Q and Numerai are similar in the sense that they are both crowd-sourced hedge funds. The difference is that Numerai provides authors with encrypted blind data, and algo authors are free to use whatever predictive quantitative methods or techniques they choose, with whatever computing power and resources are available to them. Q, on the other hand, provides authors with free commercial-grade data, limited analytical tools, and computational resources within the confines of their trading environment and framework.

As a concrete example, on Numerai I employ a multilayered deep neural network algo written in Python using the Keras/TensorFlow library, run on GPU-powered servers in the cloud for compute power. I don't think Q's framework can handle this kind of ML algo.

In the context of ML algos, Q's current framework and limited compute-time allocation per algo/user do not make it suitable for churning out meaningful results. While ML may be in their future pipeline, it is not a priority. Their focus is to fulfill their investors' objective, which is a low-volatility/risk market-neutral equity fund through their allocation process.

What I found limiting about Numerai is that we are given only a small number of validation tests to verify our model. Quantopian, on the other hand, has plenty of out-of-sample data, and we can test the ML model both with Alphalens and by running an algorithm. This is much more interesting.

The other open question is where the "magic" resides, whether it's in creating new features or the combination of features. Quantopian allows you to work on both parts (although it's true that the capabilities to do state-of-the-art combination are still limited on the platform).

@Thomas last July you stated that Quantopian was using version 0.16.1 of scikit-learn. It looks like this is still the case today (making it an even older version now)? When will this be upgraded? I'd love to experiment with sklearn.neural_network.*, but this is not supported in the very old version of scikit-learn used on Quantopian. ☹

Hi Thomas -

I've concluded that the alpha combination step (which would include the kind of ML approach you illustrate above) should be done in before_trading_start, where one has access to a full 5 minutes per trading day (or even within handle_data, where 50 secs. are allocated). Is this correct? Or is there a compelling reason to do alpha combination in Pipeline (where, for backtesting, there is a rather severe time limitation due to chunking)?

If I understand how Pipeline works, this would also decouple the potentially variable data loads from any heavy-duty alpha combination computations. If grabbing chunks of data from a database gets bogged down (e.g. because multiple algos are trying to get data from the same database at the same time), I wouldn't want to take a hit on my ability to do computations.

I've also concluded that rather than using a global to get the trailing data out of Pipeline (you'd suggested this above), one can simply add column labels corresponding to the trailing days, as I describe here.

Then, I think one is left with something like a 20-hour backtest time limit, plus whatever memory limitations are imposed.

Correct? Am I on the right track?

I'm not quite sure which timeout Pipeline runs under.
And yes, turning the dataframe into a one-row, columns-only format, while not great, is probably better than a global variable.
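The one-row trick can be sketched with pandas: flatten the trailing (days × factors) window so each (day, factor) pair becomes its own column, producing a single row that fits Pipeline-style output (the factor names and lags here are hypothetical):

```python
import pandas as pd

# A small trailing window: 3 days x 2 factors.
window = pd.DataFrame(
    {"factor_a": [0.1, 0.2, 0.3],
     "factor_b": [1.0, 1.1, 1.2]},
    index=["t-2", "t-1", "t"],
)

# Stack into (day, factor) pairs, then encode the lag in the label.
flat = window.stack()
flat.index = ["{}_{}".format(col, day) for day, col in flat.index]
one_row = flat.to_frame().T  # a single-row DataFrame

print(list(one_row.columns))
# ['factor_a_t-2', 'factor_b_t-2', 'factor_a_t-1', ...]
```

Reversing the transformation on the receiving side (e.g. in before_trading_start) recovers the full trailing window for training.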

You could talk with your peeps at Q... I recall that there is now a 10-minute timeout for Pipeline, and before_trading_start no longer runs Pipeline... it is basically a 5-minute compute window prior to the open.

Thanks,

Grant

Here's my take on the limitations of the Q ML algo

Hi Thomas -

I'd be interested in how Pipeline and its output work from a memory standpoint. If I output the alphas, etc., to before_trading_start to do the alpha combination there, does it effectively double the amount of memory used (since perhaps I'd be copying whatever is in Pipeline to the algo itself)? Or is the memory in Pipeline effectively freed up prior to the transfer?

I'm assuming here that there is an overall memory constraint for an algo, shared across Pipeline and the algo itself. But if they are separate, then maybe it doesn't matter?

EDIT - I'm not sure how to articulate it, but I'm wondering how the chunking works with respect to memory management. One could imagine input buffering, computations, and output buffering, after which the memory could be freed up for the algo. Ideally, after a chunked Pipeline cycle, one would be left with only data in an output buffer, with pointers provided to the algo. Other memory that was required to run Pipeline over the chunk would be freed up for the algo's use (obviously, Pipeline doesn't run in parallel with the algo, since the algo pauses for chunking, so it would seem that the memory could be freed up for the algo, no?).