Back to Community
Machine Learning on Quantopian Part 2: ML as a Factor

Recently, we presented how to load alpha signals into a research notebook, preprocess them, and then train a Machine Learning classifier to predict future returns. This was done in a static fashion: we loaded data once over a fixed period of time (using the run_pipeline() command), split it into train and test sets, and predicted inside the research notebook.

This leaves open the question of how to move this workflow to a trading algorithm, where run_pipeline() is not available. Here we show how you can move your ML steps into a pipeline CustomFactor where the classifier gets retrained periodically on the most recent data and predicts returns. This is still not moving things into a trading algorithm, but it gets us one step closer.

If you haven't yet, definitely read the notebook on the static workflow first. We will be reusing the same concepts and code but will not re-explain the logic of preprocessing the data.


The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances. All investments involve risk, including loss of principal. You should consult with an investment professional before making any investment decisions.

26 responses

That's really nice how you combined all that work into one single factor we can easily use. Unfortunately, I remember that algorithms are restricted to passing only certain factors as inputs to other factors (e.g. Returns and all the ones that are split/merge/dividend safe). Is that restriction still in place?

Great work Thomas et al.,

Would love to run the notebook, however, am running into "Timeout: 300 seconds" on block[9]. Any suggestions? Can I change the timeout rule on my notebook?

@Saad, thanks for reporting, we're investigating.

@Luca: Yes, although you can circumvent that. However, I'm not sure why you would want to, given that it seems like a helpful restriction. In any case, all the factors here are "window-safe".

Thanks Thomas, I didn't notice there was a call to ".rank()" before passing alpha factors to ML. That explains why they are "window-safe".

In case someone forgot this detail like myself, here is what David Michalowicz said about using factors as inputs to other factors:

"[...] is now allowed for a select few factors deemed safe for use as inputs. This includes Returns and any factors created from rank or zscore. The main reason that these factors can be used as inputs is that they are comparable across splits. Returns, rank and zscore produce normalized values, meaning that they can be meaningfully compared in any context."
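To make the "comparable across splits" point concrete, here is a small numpy-only sketch (not the platform's internal implementation): a uniform rescaling of raw factor values, as a split adjustment applies to a price-based factor, changes the values themselves but leaves the cross-sectional ranks untouched.

```python
import numpy as np

def cross_sectional_rank(values):
    # Double-argsort dense ranking: 0 = smallest value.
    return values.argsort().argsort()

# Hypothetical one-day factor values for five stocks.
raw = np.array([3.1, 0.4, 12.0, 7.5, 0.9])

# Rescaling the raw values (e.g. a 2:1 split adjustment on a price-based
# factor) changes the numbers, but not their cross-sectional ranks,
# which is why rank() output remains comparable across splits.
assert np.array_equal(cross_sectional_rank(raw), cross_sectional_rank(raw * 0.5))
```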

Hi Thomas -

You say:

The 'ML' column will contain for each day, the predicted probabilities of each stock to go up, relative to the other ones in the universe.

What does this mean? Say as a whole, the average forecast return across all stocks is zero. So does a relative probability of >50% mean that a stock's absolute price will likely rise, and a relative probability of <50% that a stock's absolute price will likely drop? And would I expect a stock with a relative probability of 80% to rise more than one with a relative probability of 60%, and if so, by how much?

Sorry, finding the various statistics, rankings, normalizations, and relative values very foreign and confusing...but then I haven't yet caught up sorting out every line of code.

To understand the ranking step a little better, I would suggest taking a look at the Spearman Rank Correlation lecture in the Quantopian lectures section of the Learn tab. In fact, I've found it helpful to work my way through all of the lectures.

To put the answer in my own words I would say that it is really hard to predict whether any individual stock will go up or down in the future. One of the main problems is that a stock's returns are usually heavily dependent on the movement of the whole market. Because the market is pretty efficiently priced (all known events have been priced in), it basically takes knowing what future events will occur to be able to predict it (i.e. a crystal ball).

However, it may be easier to say: "well, I don't know what the returns of stock A will be, but can I predict that it will have a high chance of doing better than stock B according to some factor analysis?" If the answer is yes you can short B and go long A to make money on their relative movements. So, by ranking the returns we're really looking at which stocks will do the best relative to the others and which will do worse.

Then we take it even further and say that even the specific rank of a stock is hard to predict. So we make it an even easier question: "is this stock in the top half of the rankings or the bottom half?" That should be easier for an ML algorithm to predict (apparently it's still pretty hard, since our accuracy is only 53%).

Then, instead of predicting the category (top half or bottom half), we output the percentage chance of being in the top or bottom. Why do we do that? Because it allows us to choose 10 stocks to go long (highest chance of being in the top half) and 10 to go short (highest chance of being in the bottom half). If we just had the categories as predictions, we'd have to select a random basket from the top-half category and another random basket from the bottom-half category.
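The three steps above (rank the forward returns, binarize into top/bottom half, pick baskets from predicted probabilities) can be sketched in plain numpy. Note the probabilities here are faked random numbers standing in for a classifier's output; all names are illustrative, not from the original notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
n_stocks = 100
fwd_returns = rng.randn(n_stocks)  # stand-in for next-period returns

# 1) Rank next-period returns cross-sectionally (0 = worst, 99 = best).
ranks = fwd_returns.argsort().argsort()

# 2) Binarize: was the stock in the top half of the ranking?
labels = (ranks >= n_stocks // 2).astype(int)

# 3) A trained classifier would emit P(top half) per stock; we fake
#    probabilities here just to show the basket-selection step.
probs = rng.rand(n_stocks)
longs = np.argsort(probs)[-10:]   # 10 highest P(top half)
shorts = np.argsort(probs)[:10]   # 10 lowest P(top half)
```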

As for why the input factors are ranked: well, for one, outliers. I've seen some crazy outliers in financial ratios that would really confuse an ML algorithm. Also, it goes back to data being very noisy. It's actually beneficial to not tell the algorithm the exact value for the factor. It is more helpful to just tell it how strong the factor value is for that stock relative to the others. There might be periods where the PE ratios of the market are low and there might be times where they are high, but what the algorithm needs to know is "how high is this PE ratio relative to the other stocks in the market", because that will help it figure out how well that stock would do relative to the other stocks, which is what we're actually trying to answer here.
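The outlier argument is easy to demonstrate with a toy example (hypothetical numbers, numpy only): ranking caps an absurd value's influence at "highest rank" instead of letting its magnitude swamp the feature.

```python
import numpy as np

# Hypothetical P/E ratios for five stocks, with one extreme outlier.
pe = np.array([12.0, 18.0, 25.0, 9.0, 4500.0])

# Fed raw, the outlier is ~180x the other values and would dominate the
# feature; ranked (0 = lowest, 4 = highest), it contributes only "4".
pe_ranks = pe.argsort().argsort()
```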

In the other thread you also seem to be focused on trying to fit a polynomial model to the factors. This is okay if you're dealing with a 1-D problem, but as you add more factors there's a combinatorial explosion of possible terms. In any case, you're just choosing a different ML model. I would suggest looking into the CART algorithm, random forests, gradient boosting machines, and AdaBoost. It's a bit of a rabbit hole, but worth knowing. By the way, any algorithm based on decision trees (like the ones I listed) can handle categorical factors. The reason is that a node in the decision tree can branch based on a category or based on a threshold of a numeric value.

Anyway, hope some of that helps.


Thank you so much for this work. I've found it very inspiring. I have two qualm/questions though.

1) Is there a plan to be able to utilize these ML workflows in an algorithm? If there is, what are the details and what is the timeline?

2) It would be great if there were an easier way to work with a larger timeframe of data. The run_pipeline function takes a start date and an end date. It would be so great if you could pass in a list of dates somehow or maybe a skip amount (so you could specify that you want just one day from each month).

It just seems like it'd be helpful to work with a long timespan and just sample one day out of every 30 days or something. I think I would want to be able to find factors that are stable over a long time. One might argue that having more recent data would be better, because stationarity can't be relied on and you should capture current sensitivities to factors. But, maybe it would make more sense to have a long timeframe and just sample more days from more recent times to get a balance between the two. Either way, we need more flexibility in specifying dates for the pipeline. The only workaround I can think of is to run the pipeline day by day and that would probably be slow.

Finally, I thought I'd offer a suggestion. Since cross-validation could be misleading due to lookahead bias, I'd suggest segmenting the timeline into 4 sections (tuning training set, tuning testing set, final model training set, and final model testing set). Each of these would be non-overlapping, not necessarily the same size and chronological in that order. The tuning sets would be used for tuning variables (or hyperparameters) like "how many days to be looking ahead" (which you would use for a rebalance period), "what fraction of data to remove from the middle section of the return rank", even "what factors to use" (you could automate removing low informative features, since models can improve with lower dimensionality). Then once you've found the optimal hyperparameters for your model using the tuning pair, you train using those hyperparameters on the final model training set and test it on the final model testing set. Since the final model pair was different from the tuning pair, you should get a pretty accurate test for how your hyperparameters would actually perform.
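The four-way chronological segmentation described above can be sketched in a few lines (the segment fractions here are illustrative choices, not from the post):

```python
import numpy as np

def chronological_splits(dates, fractions=(0.3, 0.2, 0.3, 0.2)):
    """Split an ordered sequence of dates into four non-overlapping,
    chronological segments: tuning-train, tuning-test, final-train,
    final-test. Segment sizes are illustrative assumptions."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    n = len(dates)
    # Cumulative boundaries for the first three segments.
    bounds = np.cumsum([int(f * n) for f in fractions[:-1]])
    return np.split(np.asarray(dates), bounds)
```

Tune hyperparameters on the first pair, then train and evaluate the final model on the second pair, which the tuning process never saw.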

Thanks Rudiger. --Grant

Thanks Rudiger.

1) Is there a plan to be able to utilize these ML workflows in an algorithm? If there is, what are the details and what is the timeline?

Yes, in fact just this morning I ran a 5-year backtest with weekly retraining. There were some efficiency improvements to pipeline that had to happen first, but engineering came through on that. I'm still tweaking things a bit but hope to post soon.

2) It would be great if there were an easier way to work with a larger timeframe of data. The run_pipeline function takes a start date and an end date. It would be so great if you could pass in a list of dates somehow or maybe a skip amount (so you could specify that you want just one day from each month).

Yes, I had the same thoughts when I started implementing this workflow. However, the way pipeline is coded is that all data required for the entire period is loaded into one big array. This makes things very fast. Moreover, as a lot of factors have considerable look-back (e.g. 3M vol), there would not be any memory saved if you only wanted to run pipeline e.g. weekly. In any case, the recent efficiency improvements should have lifted the limits even further so I think we can run over reasonable time-frames now. Do let me know which limits you run into.

Your suggestion of only running pipeline on individual days works quite well actually. However, there is no run_pipeline in the IDE so it doesn't translate. But there is probably a way to sample within pipeline.

I really like your suggestion for cross-validation. For the walk-forward presented here, the "final model testing set" would be the hold-out results already used herein for evaluating the model, correct? Please post any progress you make on that front!

Hi Thomas,

It sounds like the workflow will be exclusively focused on pipeline and its daily bars for computing returns. In my own work on a single mean-reversion factor, I see an advantage in smoothing the data using minute bars, versus using daily bars directly (similarly, there have also been requests for daily VWAP, computed from tick/minute-bar data). Will there be a way to input alpha factors both from pipeline and from the algo (derived from minute bars), so that they can be crunched by the ML alpha combination step? Or are you focused exclusively on alpha factors that would run within pipeline?


Can't wait to see the code for your 5 year backtest.

Regarding sampling within the pipeline: I've tried something like that before inside a factor's compute function. The problem is that if you specify a very large window, it still ends up running out of memory. Maybe there's a better technique?

Related question: is the lookback window that is specified in the custom factor in trading days or calendar days? I'm pretty sure it's trading days, but just checking.

And yes, the final model test set would be the same as the hold-out for the current notebook. However, there is another test set that is used for the parameter tuning. The key thing is I want to use completely different data to tune my hyperparameters than the one I use to test the final product.


I too wanted to work with longer periods, and I found it hard; often the cell doing the work would crash because it took too long. It's a hack, but the way I solved this was to run many smaller chunks and patch them back together. I hope this helps you out. The hack works on part 1 as well.


I was thinking the same thing about ranking vs. absolute values, so I developed a version of parts 1 and 2 that uses actual values and regression instead of ranking and classification. There are a couple of hacks from the original that I had to make to get it functioning, but it works. Unfortunately, so far I'm finding better results with rank and classification, but if you would like me to start threads on the community to work on those ideas, I can. I just need to clean them up a bit. I also have an example of an algo that uses absolute values and regression if you would like me to put that out there. The results are horrible though.


I'm really excited about this direction Q is taking with ML. Thanks for all your work on this. I have a prototype of an algo that will run a 10-year backtest if you would like to see it. It's a mess and I need to clean it up a bit before sharing it with the community, but if you would like me to share it so you can build on it, just let me know.


In case it's useful to anyone, here is the function I wrote to call run_pipeline() on single days multiple times. It's a bit specific to my use-case but you can probably easily adapt it. Just to make sure though: You won't be able to use this in the IDE as there is no run_pipeline() there. I thus recommend finding other ways, like the factor.downsample('week_start') method.

import numpy as np
import pandas as pd

def run_pipeline_intervals(pipe, start_date, end_date, freq='1BM', normalize=True, shift_returns=False):
    date_range = pd.date_range(start=start_date, end=end_date, freq=freq)
    factors = []
    for date in date_range:
        f = run_pipeline(pipe, start_date=date, end_date=date)
        fwd_returns = f['Cumulative Returns 1M']
        f = f.drop('Cumulative Returns 1M', axis='columns')
        if normalize:
            f = f.divide(f.groupby(level=0).count(), level=0)
        f['fwd_returns'] = fwd_returns
        factors.append(f)  # collect each day's output for concatenation below
    factors = pd.concat(factors)
    factors.index.rename(["date", "equity"], inplace=True)
    if shift_returns:
        # Workaround for groupby dropping nans
        factors['fwd_returns'] = factors.fillna(-999).groupby(level='equity').fwd_returns.shift(-1)
        factors.loc[factors.fwd_returns == -999, 'fwd_returns'] = np.nan
    return factors


Thanks for sharing that. One suggestion I'd have for your code above is to use trading dates and not calendar dates. I'm not sure what the best approach on Quantopian to do that is. Maybe you could just pull the price history for SPY and then pull out the index of the resultant Series.
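As a rough stand-in for the SPY-index trick, one could approximate trading dates with pandas business days (this ignores exchange holidays; on the platform itself you'd take the index of SPY's actual price history, as suggested above):

```python
import pandas as pd

# Business days as an approximation of trading days (holidays ignored).
bdays = pd.bdate_range("2016-01-04", "2016-06-30")

# Sample the first trading day of each month, e.g. for monthly pipeline runs.
month_starts = bdays.to_series().groupby([bdays.year, bdays.month]).first()
```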

I have a few other concerns that I've thought about, related to survivorship bias. It stems from the way you have been shifting the backward-looking returns in the pipeline output to be forward-looking returns for another date. Let's say on 2016-10-01 your universe has stocks [a, b, c, d], and on 2016-10-06 your universe has stocks [a, b, c, e]. Then the shifting operations would have no returns defined for d, and the returns of e would be unused. This may be an even bigger problem with the algorithm version that you're working on. If you define the universe based on today's date and then go get historical data for those stocks, then you're definitely introducing survivorship bias.
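The [a, b, c, d] vs. [a, b, c, e] scenario is easy to reproduce with a toy MultiIndex frame and a per-equity shift (a pandas sketch of the shifting step, not the notebook's actual code): d ends up with a NaN forward return on the first date, and e's later return is never used.

```python
import numpy as np
import pandas as pd

# Toy pipeline output: universe changes between dates (d drops out, e enters).
idx = pd.MultiIndex.from_tuples(
    [("2016-10-01", s) for s in "abcd"] + [("2016-10-06", s) for s in "abce"],
    names=["date", "equity"],
)
df = pd.DataFrame({"returns": np.arange(8, dtype=float)}, index=idx)

# Shift each stock's backward-looking return back one period
# to make it forward-looking.
df["fwd_returns"] = df.groupby(level="equity")["returns"].shift(-1)
```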

One potential fix for the Research version of the pipeline is to actually define your factors in terms of previous values. So, you would define a factor like this:

class EnterpriseMultiple(CustomFactor):
    # Second input reconstructed from the compute() signature below (ev).
    inputs = [morningstar.income_statement.operating_income,
              morningstar.valuation.enterprise_value]
    window_length = 5

    def compute(self, today, assets, out, operating_income, ev):
        out[:] = ev[-5] / operating_income[-5]

Then, you could define your returns factor the way you have before:

returns_factor = Returns(inputs=[USEquityPricing.close], window_length=5)
factor_ranks['Returns'] = returns_factor.rank(mask=universe)

This would help the returns be perfectly matched with the outputs of the "alpha" factors. The one other thing would be to shift the actual universe selection back 5 days, but that doesn't seem to be possible with Q1500US.

Thanks Rudiger. The look-ahead thing is an issue in the code I posted above but I don't think it's a problem in the notebooks I posted. There, we're not shifting returns forward but rather factor values backward. And we don't swap in returns if some go missing, in that case d and e would be dropped from the classifier. Classification also has no knowledge of a stock disappearing the next day so it will just provide a prediction and trade into that position.

So, just so I'm clear, in your first notebook where you specified a range of dates for the run_pipeline, it applies the Q1500US filter to each date individually? I guess my mistake was I was thinking it's applied once for all the dates. In that case, I guess you're right. But, I do think my pre-shifting idea is a nice way to avoid having to do the shifting manually :)

The Q1500US filter is updated at the start of every month. If a stock is delisted, it is dropped from the universe on that day. I like your idea, but it makes coding the factors more error-prone and complex.

This leaves open the question of how to move this workflow to a trading algorithm

Hi Thomas,

When you get the chance, I'd be interested in how the workflow will map onto a trading algorithm. In particular, I'm wondering if the alpha factors will come from both pipeline and user-defined functions from within the algo, when the alpha combination and portfolio construction (optimization) steps will be performed, etc. It sounds like ultimately you are aiming to support intraday cycling through the workflow ("This process is a loop and the cycle time could be once a week, once a day, every 15 minutes, etc."), but the indication from your posts is that the alpha factors and combination will be pipeline-based (and thus run before the market opens). Any thoughts yet on how this all might work?

Hi Grant,

Yes, now that the ML training and prediction part moved into pipeline, all factors have to be in pipeline too, which means that you're limited to daily bars. We're on the same page that it'd be preferable to do intraday stuff as well, not sure if it's possible to open pipeline to that or where that's on the roadmap.

Thanks Thomas -

I see. So, presumably, if I follow correctly, prior to the open (e.g. in before_trading_start or a new API), we would run everything with the exception of the "Execution" step in the workflow. Each day would start with an updated portfolio allocation vector, and then presumably, in the context of the hedge fund, this would be handed over to the order management system (OMS), which would receive N such vectors every day (where N is the number of algos in the fund portfolio). This supposes that the algo strictly follows the workflow; it would be possible to tack on allocations derived from minute bars, so long as there isn't a strict requirement to set the allocation prior to the open (although I suppose if those allocations used minute data from prior days, then they would be ready prior to the open).

As discussed on VWAP - Are there any plans to fix this?, it seems feasible to incorporate daily data derived from minute bars into pipeline, and you could still stick with the framework of running the workflow before the market opens on daily values (e.g. daily VWAP). It just seems like some smoothing over minute bars would be advantageous here, but I don't have data to support my intuition (other than my own experimentation with a long-short mean-reversion algo that seems to benefit from smoothing minute bars over a 5-day look back window, versus using daily bars).


I've been working on a Notebook that tries out the day by day sampling technique we discussed. Here's the code I used:

import numpy as np

# make_pipeline and run_pipeline_for_date are the user's own helpers.
def form_training_data(dates, universe, lag, percent_kept):
    pipeline = make_pipeline(universe, lag)
    X_list = []
    Y_list = []
    for date in dates:
        results = run_pipeline_for_date(pipeline, date)
        results_sans_returns = results.copy()
        returns = results_sans_returns.pop('Returns')

        X = results_sans_returns.values
        Y = returns.values
        X_list.append(X)  # collect per-date arrays (missing in the original post)
        Y_list.append(Y)
    X = np.concatenate(X_list, axis=0)
    Y = np.concatenate(Y_list)
    # Keep only the extreme tails of the return distribution.
    lower = np.nanpercentile(Y, percent_kept * 100)
    upper = np.nanpercentile(Y, (1 - percent_kept) * 100)
    upper_X = X[Y > upper]
    lower_X = X[Y < lower]
    upper_Y = np.ones(upper_X.shape[0])
    lower_Y = -np.ones(lower_X.shape[0])
    X = np.concatenate((upper_X, lower_X), axis=0)
    Y = np.concatenate((upper_Y, lower_Y))
    return X, Y

So, I ran this against 120 dates and it took around 24 minutes to finish (1461.63 secs). That's pretty damn slow considering that I wanted to reuse this function over and over as I varied both the lag (days to compute return over) and the percent_kept (the upper and lower percentile boundaries to keep). Let's say I want to test lags of 5, 10, 15, 20, and 25 days and percentiles kept of 50, 40, 30, 20. Then that would be a grid of 20 points that would take 8 hours to get through. Then I also wanted to automatically test pruning of less relevant features, so you can see how this is way too slow.

Is there anything obvious I'm doing wrong here (I'm new to python so I'm not aware of all the possible optimizations)? If I were working off of the Quantopian platform I'd consider parallelizing the for loop. I'm assuming that np.concatenate preallocates the full size of the output array so it shouldn't have issues with constantly having to allocate memory for larger and larger arrays as it adds each one by one.

How long does your new code that uses separated dates take?

As a followup, I thought of two ways I can improve the time it would take to test all those scenarios. One thing I could do is compute all the return lag scenarios at once (i.e. have a column for "Returns5", "Returns10", etc). The other thing is that the pipelines don't have to be run each time we change the percentile thresholds for what data is kept. That can be recomputed from the existing pipeline output.
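The first idea, deriving every return horizon from one data load, is the same effect pipeline achieves internally. A plain numpy sketch of the principle (array shape and function name are illustrative, not Quantopian API):

```python
import numpy as np

def multi_horizon_returns(prices, lags=(5, 10, 15, 20, 25)):
    """prices: (days, stocks) array loaded once. Compute the trailing
    return over a window of `lag` rows for each horizon, reusing the
    same array instead of reloading data per lag."""
    return {lag: prices[-1] / prices[-lag] - 1.0 for lag in lags}
```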

@Rudiger: That's a good idea. The way pipeline is written, you will get Returns5 almost for free if you also compute Returns10 (i.e. the data is only loaded once and then reused).

What you posted before: It's not surprising that it's slow since it's not doing any smart caching and reloading a lot of data. I think your second approach of using only one pipeline call is the way to go.

Check out Part 3 to see how to turn this workflow into an actual strategy.

All: Luca found two bugs in the code I posted, those should be fixed now in the top NB.