Machine Learning on Quantopian part 1: Basics

This is the first part of our series on Machine Learning on Quantopian. See Part 2 to see how to run this NB in a walk-forward manner and Part 3 for a fully functional ML algorithm.

Recently, Quantopian’s Chief Investment Officer, Jonathan Larkin, shared an industry insider’s overview of the professional quant equity workflow. This workflow comprises distinct stages: (1) Universe Definition, (2) Alpha Discovery, (3) Alpha Combination, (4) Portfolio Construction, and (5) Trading.

This notebook focuses on stage 3: Alpha Combination. At this stage, machine learning is an intuitive choice: we have abstracted the problem to the point where it is a classic classification (or regression) problem, which ML is very good at solving, yielding an alpha combination that is predictive.

In this notebook we will show how to:

• construct alphas in pipeline,
• preprocess alphas so that they can be fed into scikit-learn classifiers,
• train a ML classifier on these alphas with good performance on hold-out data,
• analyze which alphas were most predictive.

We also include:

• pointers on how to improve the accuracy,
• a small Machine Learning competition,
• resources to further your understanding of Machine Learning.

Note, however, that this workflow is still a bit rough around the edges. We are working on improving it and adding better educational materials. This serves as a sneak-peek for the curious and adventurous.



And here is a library of commonly used alpha factors. You can copy and paste some (or all?) of these into the above notebook to see if you can improve the accuracy of the classification. Ultimately, we will include these in a library you can just import.


Stunning, no other word for it.
It will take me a while to absorb it all since I am still currently working through various books. However, your work will make this entire area vastly easier for me to comprehend and put into practice.

One initial question:

Exercise: It is also common to run cross-validation on the training data and tweak the parameters based on that score, testing should only be done rarely.

I am aware of having made serious errors in this respect in the past. I still can't quite think my way through this properly. I understand the whole concept of cross validation on training data and so forth but wonder whether you could expand your thoughts a little on when to actually resort to the test data?

We will be coming up with countless different models using a combination of the factors. I'm just not quite clear on when to use any of those models on the test data.

Accuracy on test set = 67.26%

As you are well aware, predictions on price alone rarely exceed much above the 50% random level. This therefore seems like more than a "small edge in the market (or at least the test set)!"

It will be interesting to see as we implement models going forward whether this sort of accuracy can be maintained on fresh data. But it certainly looks promising, even if it has not yet been put into an actual system for back testing.

Interesting to see the low ranking of momentum.

Anyway, thanks to the whole team for this.

Thanks, Anthony.

Cross-validation: A common method is to do model development and comparison without ever touching the hold-out set. Once you have something you think works really well you put it to the final test to see how much you fooled yourself. A good mind-set is that you lose something every time you test on your hold-out set.

Accuracy: I'm a bit suspicious of that myself. Initially I got more like 53% which seems much more reasonable. I think this is due to having overlapping time-periods. The factors themselves are computed over overlapping windows and so are the returns to be predicted. Thus, probably just by forward-projecting you get pretty good predictability. Then again, the classifier does not have access to any forward returns so can't just forward-fill (except the first 4 days). I can post another version where I subsample the data to only train and predict once every week to avoid the overlap. This brings accuracy down significantly (i.e. to 53%). The ultimate test, as always, will be how well this predicts on future data (i.e. testing the classifier in a month from now).

@Anthony: The high accuracy was indeed driven by the overlap from the training day returns into the testing day returns. I updated the NB now to leave a 4-day delay between train and test (as we have 5-day returns) and the accuracy is down to a more reasonable 53.3%.
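The 4-day delay can be sketched as a simple chronological split; the function and variable names below are hypothetical, not from the notebook:

```python
import numpy as np

def delayed_split(n_samples, test_frac=0.3, n_fwd_days=5):
    """Chronological train/test split that drops the last n_fwd_days - 1
    training rows, because their 5-day forward-return windows overlap the
    test period and would leak label information."""
    split = int(n_samples * (1 - test_frac))
    train_idx = np.arange(0, split - (n_fwd_days - 1))
    test_idx = np.arange(split, n_samples)
    return train_idx, test_idx

train_idx, test_idx = delayed_split(100)
# There are now 4 dropped days between the last train row and the first test row.
```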

and the accuracy is down to a more reasonable 53.3%.

Got it, thanks Thomas. As you know, a typical unadorned trend-following system as practiced by the old-style CTAs rarely has "accuracy" of more than 40%, i.e. the win/loss ratio is around 40/60.

Given that, an accuracy of 53.3% ought to be quite high enough to profit handsomely when combined with the usual TF-type rules of letting profits run and cutting losses.

But more of that anon when I have a clearer view of ML and the whole process.

This all seems a huge step forward.

It seems like we need a way to save a trained ANN. Just like people, the intelligence of each network can vary based on the initial weights and the training set it was given.

Also, input data needs to be randomized to avoid over-learning a common segment. Not sure if that is done automatically using the tools given.

@ Anthony & others who may be interested - check out Udacity's free course 'Intro to Machine Learning' - it's specifically focused on Python and the sklearn module, and I've found it an excellent course.

@ Thomas thanks for the notebook, looks really cool, gonna get stuck in as soon as I have some time

it seems like we need a way to save a trained ANN

The example is AdaBoost, not a neural network. I'm not sure why this selection, as there are superior alternatives - boosting classifiers are known to overfit easily, especially with noisy datasets, and financial data clearly is noisy. For example, random forests or extremely randomized trees (i.e. ExtraTrees) are known not to overfit as easily.

I didn't go through the example, but if the data is not normalized and has an upward trend (as it has in past years), the classifier might just fit the trend rather than anything predictive. This is even more important when using classifiers like AdaBoost, which are known to overfit easily, especially with unbalanced datasets.
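Mikko's point can be illustrated on synthetic data; the dataset, noise level, and hyperparameters below are all made up for illustration, not taken from the notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# flip_y=0.3 injects heavy label noise, mimicking noisy financial labels.
X, y = make_classification(n_samples=2000, n_features=18, n_informative=4,
                           flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for clf in (AdaBoostClassifier(n_estimators=100, random_state=0),
            ExtraTreesClassifier(n_estimators=100, random_state=0)):
    clf.fit(X_tr, y_tr)
    # Compare train vs. test accuracy: a large gap suggests overfitting.
    results[type(clf).__name__] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
print(results)
```

Comparing the train/test accuracy gap of the two ensembles on the same noisy data is a quick way to judge which one memorizes the noise.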

Hi Thomas,

You Q folks are doing great work, I'm sure, but for me, it's a bit like trying to eat a nice steak in one bite (and perhaps trying to take in a whole potato, as well).

My understanding is that you are interested in "coming up with an alpha combination that is predictive" (in 'combination' you presumably include interactions, up to some effective polynomial order?). Presumably, you are trying to predict forward daily returns, based on daily prices (open or close...take your pick). There is a vector of Y's, which consists of the daily returns of N stocks (not per unit wall-clock time, since early closes/weekends/holidays are ignored). So, the name of the game is to find the set of M X's that, in combination, best predict the Y's (on any timescale...I just need predictability to trade). The X's consist of a set of "commonly used factors" (fundamentals, technical indicators, and their combinations, etc....whatever one dreams up).

So, if I follow, there are returns for N stocks (Y), and the supposition is that M factors (X) can be used to predict those returns. So, in effect, are you creating N independent models? Are you finding Yn = fn(Xn), where n corresponds to the nth stock?

Or is it that you are building a model that best predicts the overall return (R) and the model consists of the M factors and the weightings of N stocks in the portfolio? Something like R = f(X,W), where R is the scalar overall return of my optimized portfolio, X is a vector of M factors, and W is a vector of N stock weights?

Or maybe you are doing something else altogether?

On a separate note, I can't tell from your notebook whether you are computing the feature importance for a single period, or whether these features would be considered persistent over a long time span (e.g. back to 2002 or whatever). It would seem the question to answer is: are there any factors that are consistently predictive over many market cycles (even if their relative importance varies with time)? In other words, which factors are total junk and which ones are the gems? Should I include the age/gender/nationality/favorite food of the CEO? Or do these factors never have any predictive power whatsoever, and just over-complicate the model and introduce a risk of over-fitting?

What is the chart of "Feature importances" telling us?

Another question is, it seems that your workflow of screening individual factors using alphalens, followed by combination (using ML), would assume that you don't have terms like y = c*x1*x2 (where c is a constant)? If the response, y, is a function of the product of the factors x1 & x2 only, then they won't survive to the combination step; they would be ruled out using alphalens (since it only looks at individual factors, right?). Is this a correct understanding? Or would alphalens include a model up to, say y = a*(b + x1)*(c + x2), which includes the cross-term?

What is the chart of "Feature importances" telling us?

In ensembles of trees, "feature importance" basically tells you how many times a given input is used in the tree ensemble.
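A small sketch of reading these importances out of a fitted tree ensemble. The factor names are invented; in scikit-learn, `feature_importances_` is the normalized mean impurity decrease per feature, which closely tracks how often and how effectively a feature is used for splits:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 3)                 # three fake "factors"
y = (X[:, 0] > 0.5).astype(int)      # only the first factor carries signal

clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
# Importances sum to 1; the informative factor should dominate.
print(dict(zip(["fake_value", "fake_momentum", "fake_noise"],
               clf.feature_importances_)))
```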

Mikko -

how many times a given input is used in the tree ensemble

Is it a good thing for a factor to be used many times in a tree ensemble? Or bad? Presumably good.

Also, what is meant by "a non-linear combination of features" as used in the notebook? Is this in the sense of the quadratic and cubic models described on http://www.itl.nist.gov/div898/handbook/pri/section3/pri336.htm ?

Uggh...lots to learn here.

Mikko M
Exactly what type of normalisation are you suggesting, and at what stage? Is each and every factor to be scaled between 0 and 1 at the very beginning? Or standardised? Or what?

Every factor is first ranked and then scaled to be between 0 and 1. Scaling matters for some classifiers but not tree-based ones as used here. Returns are converted to -1 and 1 by median-splitting.

Scaling matters for some classifiers

Yes, so I understand. Does this scaling you have put into place here suffice for all classifiers? Or would some prefer other methods? And de-trending (as Mikko M seems to be suggesting)? Or does this amount to the same thing in the way you have done it?

In any event, the framework is there and I guess people can experiment in any way they want, which has to be a good thing. It's a great framework, and given the fundamental data from Morningstar, it's all there, wide open for experimentation.
A

Hi Thomas,

If it is not obvious from my comments here and elsewhere, I'm interested in the overall workflow you folks are putting together, but I'm finding it challenging to get the bigger picture. Your efforts seem to be based on an established intellectual edifice that is foreign to me. Jonathan Larkin provides a decent 30,000 foot view on https://blog.quantopian.com/a-professional-quant-equity-workflow/. We have the Q500US & Q1500US (along with the more general make_us_equity_universe), and then alphalens, which is a manual tool for analyzing individual factors. Followed by your work above, to combine the factors. Etc.

At a level lower than 30,000 feet, but above the ground level of individual lines of code (let's say 10,000 feet), there is a kind of overall point-in-time optimization problem that is being solved, which includes the universe selection, and the factors and their interactions. There's also the issue of how the universe selection interacts with the factors, which I don't think is addressed in the workflow (at least not as a factor to be automatically optimized). In the end, it would seem that as an output of the factor combination step, I should have a model of how to forecast returns for every stock independently, no? Is that where we are headed? Then the portfolio construction step would be to sort out how to apply those forecasts and construct portfolio weights (i.e. how much of my portfolio should be in XYZ at any given point in time)?

Perhaps you could provide a sense for the overall optimization problem that is being solved (across the entire workflow) and how your factor combination step fits in? Presumably, your "library of commonly used alpha factors" was run through alphalens and somehow vetted? Each one was shown to have some predictive power individually, and now you are combining them in a fashion that accounts for interactions ("a non-linear combination of features"). Sorry, as I said, I'm missing the 10,000 foot view. What are you doing?

To provide some background, I am familiar with the so-called "response surface methodology" used in design and process engineering. My fuzzy picture is that you are building a multi-dimensional response surface that can be used for optimization, subject to constraints? If so, what order of model are you constructing (first-order, second-order)? Is the ML model effectively the response surface? Or has it already done the optimization? Again, I'm obviously confused...

Best regards,

Grant

Anthony: Yes, this normalization works for all classifiers.

Grant:

I should have a model of how to forecast returns for every stock independently, no?

By "independently", do you refer to the fact that we are ranking the stocks here, so that each rank depends on the other stocks in the universe?

The idea is: we have factors, we combine them in such a way to be predictive of future relative returns. When we can predict that, we just need to long stocks that the model predicts to go up, and short the ones that go down. The other stuff is just icing and there are many ways to skin the cat.

Regarding normalization, as I said before I didn't go through the code and now after explanation by Thomas it sounds like normalization is done "properly" for long-short purposes.

Thanks Thomas,

Not sure that makes things any clearer, but I'll take what I can get. I'll have to take another pass through your workbook, to see if I can understand it. All of this ranking talk, I don't understand. For example, if I assign a value of 1-10 to foods that I like, 10 being the highest, then by default, I can rank them. It is the value assignment that would seem to matter.

I guess you are saying that the underlying assumption is that there is a monotonic relationship between the values and their utility for trading. So, if I rank them, I can pick the top/bottom 10% or whatever. Still, it is not the ranking, but the assignment of values in such a way that the monotonicity holds, right? The ranking is just turning the crank, so to speak.

Grant

Hi Thomas -

Under your section "Run the pipeline" you end up with a table of numbers, resulting from results.head(). How did you derive the table? I see factors as column labels (except for "Returns"?), and then for each day, there is a list of securities (presumably every security in the universe). And integers within the table (except for the "Returns" column). How are you determining the integers? How are the returns calculated? Are the returns used to determine the integers? The integers are ranks, but is the ranking by security (row-wise), but not normalized? Or by factor (column-wise)? Or something else? Are you ranking the relative predictive power of each factor, relative to other factors, by security? For a single security, I read across, and for a single factor, I read down?

Also, I don't understand: if a factor doesn't directly predict the 5-day return, then how are you applying it? For example, Working_Capital_To_Assets doesn't tell me, given a stock price for XYZ, what the price will be 5 days later. Wouldn't all of my factors need to be formulated to actually forecast prices?

If you could fill in some missing pieces on how you are doing the forecasting and relative ranking, it might help (me at least). What is the recipe (in words, mostly, since I don't have the patience to unravel Python/Q API code...which I can do later, once I understand the big picture)?

Thanks,

Grant

The integers are the rank of that factor for that security on that day, relative to the rest of the universe. Returns are calculated over a 5-day window. The computation of returns is not influenced by the factor rank, or vice versa.

For example: on day 1 we have four stocks in our universe.
1. On that day Working_Capital_To_Assets values of all stocks are [0.2, 0.3, 0.1, 0.5].
2. We rank them which will become [1, 2, 0, 3].
3. We then normalize so that it becomes [0.33, 0.67, 0., 1.].
4. Assume 5-day-returns from day 2 to day 6 of each stock of [.01, 0.05, -.01, 0.02].
5. We then binarize the returns by a median-split which gives [-1, 1, -1, 1].
6. We then feed this into a classifier which will try to find a way to predict the labels from the input. In this case, we could use the decision rule: if normalized_ranked_factor_of_stock_i > 0.5 predict 1, else predict -1. In this case, it would get 100% of the training data correct.

Intuitively, a high rank in Working_Capital_To_Assets is predictive of a stock going up more than the other stocks in its universe.

If we were to act on our prediction on day 2, we would long stocks 2 and 4 and short stocks 1 and 3, exiting the positions on day 6. This would be a profitable trade.
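The six steps above can be written as a short runnable sketch, using the same numbers as the example; the scaling used here is rank/(n-1), one common choice:

```python
import numpy as np

factor = np.array([0.2, 0.3, 0.1, 0.5])            # Working_Capital_To_Assets, day 1
fwd_returns = np.array([0.01, 0.05, -0.01, 0.02])  # 5-day returns, day 2 to day 6

ranks = factor.argsort().argsort()                 # step 2 -> [1, 2, 0, 3]
scaled = ranks / (len(ranks) - 1.0)                # step 3 -> [0.33, 0.67, 0., 1.]
# Step 5: median-split the returns into -1 / +1 labels.
labels = np.where(fwd_returns > np.median(fwd_returns), 1, -1)

# Step 6's decision rule: predict +1 when the scaled rank exceeds 0.5.
preds = np.where(scaled > 0.5, 1, -1)
accuracy = (preds == labels).mean()                # 1.0 on this toy data
```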

We then binarize the returns by a median-split which gives [-1, 1, -1, 1].
if normalized_ranked_factor_of_stock_i > 0.5 predict 1, else predict -1.

I am going to try a probability based approach rather than stepwise binary.

Find the top and bottom 30 percentile stocks by their returns. Essentially, we only care about the relative movement of stocks. If we later short stocks that go down and long stocks that go up relative to each other, it doesn't matter if e.g. all stocks are going down in absolute terms. Moreover, we are ignoring stocks that did not move that much (i.e. 30th to 70th percentile) to only train the classifier on those that provided a strong signal.

Nice idea but I suspect it may perhaps introduce an element of bias. I would prefer to rank later based on the probability of the 1/-1 forecast and not load the dice by momentum at this stage.

My initial gut feel is that selecting / filtering at this stage based on just momentum is tainting the procedure. Patterns affected by the other Factors we are calculating may have an independence from Momentum. At least I am hoping so.

Sorry if my thoughts are not very well expressed. I will of course be coming back in Notebook form once I am further forward in my thinking.

@Anthony: Cool! Most classifiers in sklearn support .predict_proba() as well.
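A minimal sketch of the probability-based ranking Anthony describes, using `predict_proba()` on toy data; all names and data here are illustrative:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                          # toy scaled factor ranks
y = np.where(X[:, 0] > 0.5, 1, -1)            # toy +1/-1 labels

clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
# Column order follows clf.classes_; pick the column for the +1 class.
proba_up = clf.predict_proba(X)[:, list(clf.classes_).index(1)]

# Rank candidates by P(+1): long the most confident, short the least.
order = np.argsort(proba_up)
```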

Thanks Thomas,

Makes more sense now. I'm still kinda fuzzy on the ranking step 2, since it seems you could just skip it, and normalize.

By the way, your binarize reminded me of this:

http://www.cs.ucr.edu/~eamonn/SAX.pdf

You are basically reducing the dimensionality in a similar fashion, except just chopping the returns in half, +1 or -1, instead of multiple levels.

Grant

Hi Thomas -

I am very much in favor of the structured workflow y'all are pulling together. I've seen Jonathan Larkin referenced several times (by you above, and on other posts). He is blogging on the topic (https://blog.quantopian.com/a-professional-quant-equity-workflow/ ) at a high level, but I'm guessing he's more of an internal stakeholder than a nitty-gritty, detail-oriented system engineer/architect. Is there anyone in this role on your end who could speak to how the whole thing might go together, at the 10,000-foot level and below? I think it'd be really productive to have an open back-and-forth discussion with users on the topic, with someone on your end who has the vision for the end "product," knows the deep-dive details of your platform (and where it might go in the future), and is at a high enough level in the organization to decide what can and can't be discussed on a public forum. Does such a person exist? Would this approach be beneficial?

As you might recollect, back in the day, folks like Fawce, Dan D., others would field questions and interact with users on the forum, and I think the openness/transparency was better. Maybe it is just a natural evolution, from start-up to actual business. Something has been lost, unfortunately.

Grant

The ranking serves a purpose. If I change the example to have Working_Capital_To_Assets values of all stocks to be [0.2, 0.3, 0.1, 100.](perhaps due to a data error or just an outlier) the normalization would squash the first three values.
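The outlier example in code: min-max scaling squashes the first three values together, while rank-based scaling is unaffected:

```python
import numpy as np

vals = np.array([0.2, 0.3, 0.1, 100.0])   # 100.0 is the outlier / data error

minmax = (vals - vals.min()) / (vals.max() - vals.min())
# The first three values land within ~0.002 of each other: nearly indistinguishable.

ranked = vals.argsort().argsort() / (len(vals) - 1.0)
# -> [0.33, 0.67, 0., 1.]: spacing is unaffected by the outlier.
```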

@ Thomas W. -

I'll have to noodle on this ranking business more. It is not obvious why it is necessary, what problem it solves, and what the alternatives might be. Normally, when dealing with extreme values, one might work in logarithms, or some such thing. It seems that the ranking is a form of summary statistic that is then used for your horizontal axis of factors, which drive the responses (e.g. the returns in this case, but there could be multiple responses). Some information is lost, I think, but perhaps you need to put all factors on the same scale, effectively, to be able to manage the analysis in a simple fashion, hence the ranking (although there must be other approaches, right?).

For uniform sampling of the factor, ranking is trivial, right? If my factor values are A = [0.1, 0.2, 0.3, 0.4], and I rank, then I just get A_ranked = [0, 1, 2, 3], which is the same as subtracting the minimum, dividing by the maximum (offset the axis, then normalize), and then scaling by the total number of elements minus one (re-normalizing). I can use A or A_ranked; they are equivalent, I think. But if you don't sample factors uniformly, then you are doing something different, as you suggest. You are pulling in outliers, making it seem as if you used uniform sampling when actually you didn't. So, something is lost. There is a big leap in the analysis, I think.

In my most recent post above, am I correct about Jonathan L.'s role? Or is he fielding user comments/questions/feedback? If not, who? Or maybe you don't want to approach it in this fashion? Worker bees will be assigned pieces of the puzzle, and interact selectively with users, keeping the meat of the development process internal?

Your contest idea sounds interesting, but would it just be for bragging rights? A "prize" of some sort? Could you formulate it like Numerai, so that users would have the option of running offline, or using the Q platform?

Grant

Definitely, information is lost. You don't have to rank your stocks this way. These aren't strict rules to follow but rather some initial default settings. If it doesn't make sense to you, the best thing is to try what accuracy you get without the rank; maybe it works even better.

The mini-contest I mention is just for fun and learning right now. We might turn this into a real thing later.

Thanks Thomas -

That's what I thought, regarding losing information. I'll have to think about it a bit more.

Grant

I feel compelled to understand the ranking of factors better (versus using factors directly, which in my experience, is the typical thing to do, e.g. in science/engineering). I found this:

http://www.cs.cmu.edu/~ggordon/rank.pdf

On page 16 (slide 15), it says:

Rank vector R = (R(1), R(2), ..., R(N)) is a maximal invariant statistic under monotone transforms.
That is, any statistic unaffected by monotone transforms is a function of the rank vector.

I'm not sure what all this means yet, but it is not clear that any information is lost, under certain assumptions (e.g. as I discuss above, ranking is equivalent to an axis offset and normalization/scaling, for equi-spaced (uniformly sampled) factor values). Maybe if one assumes a monotonic relationship between the raw factors ( X) and the responses (Y), then no information is lost in converting to ranks of the factors (X_rank)? But it reduces the dimensionality, since I can now work with discrete integer values (levels) instead of continuous float values (or I don't need to interpolate the missing data)? But the risk is that if the monotonic relationship does not hold, then I've thrown a monkey wrench into the whole thing (e.g. if my response Y peaks at extreme values of X and then rolls off)?

Maybe one has to show that the monotonic relationship actually holds, prior to ranking? Just guessing...

Anybody understand the ranking?
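The "maximal invariant" property quoted from the slides can be checked directly in numpy: ranks are unchanged by any strictly increasing transform of the raw factor values, and reversed by a decreasing one:

```python
import numpy as np

def rank(a):
    # Double argsort gives each element's rank (0 = smallest).
    return a.argsort().argsort()

x = np.array([0.2, 0.3, 0.1, 0.5])

assert (rank(x) == rank(np.log(x))).all()        # log is strictly increasing
assert (rank(x) == rank(x ** 3)).all()           # cubing is increasing for x > 0
assert (rank(-x) == len(x) - 1 - rank(x)).all()  # negation reverses the ranks
```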

Why is the notebook not working at the train-test split?

/usr/local/lib/python2.7/dist-packages/IPython/kernel/__main__.py:5: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

print X_train_shift.shape, X_test_shift.shape
(0, 18) (0, 18)

I'm having the same problem as Laszlo - initially the notebook was working fine, but now it seems to be breaking down at the point he mentions...

I'll investigate, thanks for reporting.

Small update: I fixed the breaking change by replacing np.percentile() with np.nanpercentile(). This is due to the numpy upgrade (https://www.quantopian.com/posts/soon-upgrade-to-pandas-0-dot-18). Unfortunately, the shape of the resulting data also changed, though I'm not sure why. It also negatively affected the OOS performance of the classifier, as you can see. I will try to get to the bottom of why the two would be different, but others are free to explore too and see if they can get a better score.
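The difference between the two calls, in brief: np.percentile propagates NaNs, while np.nanpercentile ignores them, so the threshold stays usable when some values are missing:

```python
import numpy as np

returns = np.array([0.01, np.nan, -0.02, 0.03, 0.00])

print(np.percentile(returns, 70))     # nan: a single NaN poisons the result
print(np.nanpercentile(returns, 70))  # 0.012, computed from the non-NaN values
```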


Thanks Thomas!

@Grant Kiehne

Hi Grant,

I am new to Quantopian and spotted that you have been using machine learning for trading strategies.

I am not yet familiar with the environment of Quantopian and would like to see if you could help giving some insight using ML on the development environment of Quantopian.

I myself have more than 30 years’ experience in proprietary trading (discretionary approach, not systematic one) while my partner is more into quantitative approach. We look to collaborate with someone with more AI expertise and experience of algo development (someone like you we believe). We hope our domain knowledge in proprietary trading could also be good for your personal use.

Just drop us a message. We look forward to communicating with you further.

Hi Thomas,

I'm still working my way through your notebook above. In your results.head() what is the Returns column? Is it the trailing returns? Is that why you then have to "Shift factor ranks to align with future returns n_fwd_days days in the future"? I suggest simply doing the alignment within pipeline, so that one can directly read across the rows, as factor-response data. Is this possible?

You say:

Find the top and bottom 30 percentile stocks by their returns.

Are you setting up for the ML to predict which stocks fall into the two buckets, the top 30 percentile, and the bottom 30 percentile? For example, if you code +1 for the top bucket, and -1 for the bottom bucket (and 0 for not in a bucket), then a given stock XYZ could take on values (-1,0,+1)? In other words, if my ML ends up predicting a +1 for XYZ, then I'd expect its return in 5 days to be positive, and better than the mass of stocks falling between the 30 to 70 percentiles?

Another question: presumably this is all aimed toward writing glorious, scalable long-short algos for your Point72 buddies. What if all the stocks go up and I have nothing to short? Is there an implicit assumption here that I'll end up with a basket of stocks to long, and a basket to short? I don't see anything in the analyses that would imply a market-neutrality constraint. Or is it in there, somehow?

And how would one handle categorical factors (e.g. color of CEO's sports car, exchange on which stock is listed, etc.)?

Can this be extended to handle multiple responses (Y's)? You only do returns, but wouldn't one want a model for other stuff, too? The Quantopian contest and pyfolio are concerned with lots of responses, which all have some importance, right? It seems that in the end, to do constrained optimization, you need more than one response (e.g. maximize overall return under the constraint that SR > 1.0, for which you'd need to predict the return and its variance).

Thanks,

Grant

@ Alpha Seeker - I'm definitely not your guy for ML, but there should be some whiz kids here on Quantopian who can guide you (including Thomas W. & Co.).

Grant, those are great questions.

In your results.head() what is the Returns column? Is it the trailing returns?

Yes, trailing 5-day returns.

Is that why you then have to "Shift factor ranks to align with future returns n_fwd_days days in the future"? I suggest simply doing the alignment within pipeline, so that one can directly read across the rows, as factor-response data. Is this possible?

Yes, that would be ideal, but not possible currently. What we need would be a .shift() pipeline operator. It's on the feature wish list.
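Pending a pipeline .shift() operator, the alignment can be sketched outside pipeline with pandas; the column names below are hypothetical, not from the notebook:

```python
import pandas as pd

n_fwd_days = 5
# Toy daily frame: a factor rank and the trailing 5-day return per row.
df = pd.DataFrame({"factor_rank": range(10), "trailing_5d_return": range(10)})

# The trailing return observed at day t is the forward return of the
# factors known at day t - n_fwd_days, so shift the factors forward.
aligned = pd.DataFrame({
    "factor_rank": df["factor_rank"].shift(n_fwd_days),
    "fwd_5d_return": df["trailing_5d_return"],
}).dropna()
```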

Are you setting up for the ML to predict which stocks fall into the two buckets, the top 30 percentile, and the bottom 30 percentile? For example, if you code +1 for the top bucket, and -1 for the bottom bucket (and 0 for not in a bucket), then a given stock XYZ could take on values (-1,0,+1)? In other words, if my ML ends up predicting a +1 for XYZ, then I'd expect its return in 5 days to be positive, and better than the mass of stocks falling between the 30 to 70 percentiles?

Yes, that's exactly the right understanding. But note that there could be a case where you correctly predict +1 but the 5-day forward return is not positive in absolute terms. E.g. it could be that the market is tanking and all stocks are negative, just this one less so than the others (i.e. it's still in the top 30th percentile).

Another question is presumably this is all aimed toward writing a glorious, scalable long-short algos for your Point72 buddies. What if all the stocks go up and I have nothing to short? Is there an implicit assumption here that I'll end up with a basket of stocks to long, and a basket to short? I don't see anything in the analyses that would imply a market neutrality constraint. Or is it in there, somehow?

As we train the model to predict relative stock movements it's very unlikely to not have shortable stocks. As I said above, they can still go up for this to be profitable, just less than the others. So a very simple next step would be to just short the N stocks the classifier is most certain are -1's and long the N stocks the classifier is certain are +1's. That way I'm always market neutral.
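That book construction can be sketched as follows; the probabilities, N, and the equal weighting are all illustrative:

```python
import numpy as np

proba_up = np.array([0.9, 0.2, 0.55, 0.1, 0.8, 0.45])  # P(label == +1) per stock
N = 2

order = np.argsort(proba_up)
shorts, longs = order[:N], order[-N:]     # least / most confident +1's

weights = np.zeros_like(proba_up)
weights[longs] = 1.0 / N                  # equal-weight long book
weights[shorts] = -1.0 / N                # equal-weight short book
# Net exposure is zero: the book is dollar market neutral.
```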

And how would one handle categorical factors (e.g. color of CEO's sports car, exchange on which stock is listed, etc.)?

That's a great question and not one with a single obvious answer. One idea with e.g. sectors would be to train a separate classifier for each sector independently. But there are also other methods to include categorical data in machine learning. This is really where someone can add value by being creative.
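As one example of such a method, a categorical factor can be one-hot encoded so it sits alongside the numeric factors (the tickers and sectors below are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical factor: the sector of each stock.
sectors = pd.Series(["tech", "energy", "tech", "financial"],
                    index=["AAPL", "XOM", "MSFT", "JPM"])

# One-hot (dummy) encoding turns the category into numeric columns
# that any scikit-learn classifier can consume alongside other factors.
dummies = pd.get_dummies(sectors, prefix="sector")
```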

Can this be extended to handle multiple responses (Y's)? You only do returns, but wouldn't one want a model for other stuff, too? The Quantopian contest and pyfolio are concerned with lots of responses, which all have some importance, right? It seems that in the end, to do constrained optimization, you need more than one response (e.g. maximize overall return under the constraint that SR > 1.0, for which you'd need to predict the return and its variance).

Technically, those ML algorithms exist but I don't think they would be useful here. That's not to say that those other things you mention are not important, but there are other ways to include them. The ML prediction part is just one piece of the trading algorithm. The prediction just gives me a long and a short book. Do I just equal weight each stock in each book? Probably not, so that's where a risk model would come in handy. For example I could do inverse-vol weighting, but then you also want to reduce exposure to certain risk factors etc. Or also "maximize overall return under the constraint that SR > 1.0" could be included at this level as well. You could also train a separate classifier to predict vol and feed that into the portfolio optimization, but my hunch is that historical vol will be good enough in that case.

@Grant Kiehne

Thanks Grant for the introduction. Let us keep in touch.

Thanks Thomas. More questions to follow, as I continue my effort to understand what you are doing.

As we train the model to predict relative stock movements it's very unlikely to not have shortable stocks.

Not so clear on this one. To go long-short, one needs absolute returns to be forecast as positive-negative, respectively, right? Or are you assuming some additional market neutralization machinery at a later step, should one end up with only shorts (the statistics don't support long positions with any confidence)? For example, when the whole market was crashing in 1929 or 2008-09 (or other such "corrections"), maybe going long can't be justified, based on the data? Then what?

It's important to realize that we do not attempt to predict market movements here and are not interested in them. A market-neutral strategy tries to make money on relative price movements. A couple of examples with simplified math:
1) long book returns: 5%, short-book returns: -5% -> profit: 5% + (-1)*-5% = 10%
2) long book returns: 15%, short-book returns: 5% -> profit: 15% + (-1)*5% = 10% (bull market)
3) long book returns: -5%, short-book returns -15% -> profit: -5% + (-1)*-15% = 10% (bear market)

As you can see, we're making profit on the spread between the long and short book. We don't actually care about the total movement, only about the difference between them. The classifier is trained in that manner too: it only tries to predict returns relative to others, not absolute returns. As such, the classifier should never predict all stocks go down relative to each other, as that's impossible, even in 2008.

In each of these examples, the classifier could have correctly predicted +1s for the long book and -1s for the short books as the ones in the long book are relatively better.
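A quick sanity check of those scenarios in code (equal-weight, dollar-neutral book, so profit is simply the long-leg return minus the short-leg return):

```python
def spread_profit(long_ret, short_ret):
    """Return of a dollar-neutral book: long leg minus short leg."""
    return long_ret - short_ret

scenarios = [
    (0.05, -0.05),   # flat market
    (0.15, 0.05),    # bull market: the short leg also rises
    (-0.05, -0.15),  # bear market: the long leg also falls
]
# All three scenarios yield the same 10% spread profit.
profits = [spread_profit(l, s) for l, s in scenarios]
```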

Why is Merton's Distance to Default commented out in the 2nd notebook? Also, why are some of the defined factors not added to all_factors?

Suminda: I got an error with Merton, maybe it cleared up. The missing factors are an oversight, want to add them?

It also negatively affected the OOS performance of the classifier, as you can see. I will try to get to the bottom of why the two would be different but others are free to explore too and see if they can get a better score.

I would guess that the first results were mostly due to "good" (even though incorrect) values for the test set, and by changing the values to represent the data differently you just got a different result.

I also note that you raised the number of stumps from 60 to 150. This might make a big difference. I would also think that if you ran the same example (with a different seed) n times you would get different results, but then you could calculate the mean and variance and understand what kind of results you are really getting. AdaBoost is a randomized, not "static", classifier, i.e. the results vary, especially with larger datasets.

Just for fun, here is Naive Bayes. It gets slightly better results. Naive Bayes is usually a good baseline.

edit: I accidentally posted a training set prediction, test set prediction was
Accuracy on test set = 50.49%
Log-loss = 0.70200
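For reference, a Naive Bayes baseline of this kind is only a few lines of scikit-learn; the data below is synthetic, standing in for the notebook's factor matrix and labels:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, log_loss

rng = np.random.RandomState(0)

# Synthetic stand-in for (factor matrix, +/-1 labels); in the notebook
# these come from the preprocessed pipeline output.
X = rng.randn(400, 6)
y = np.where(X[:, 0] + 0.5 * rng.randn(400) > 0, 1, -1)

# Hold out the last quarter as a test set (no shuffling).
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

clf = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
ll = log_loss(y_test, clf.predict_proba(X_test))
```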

Here is another variation (Naive Bayes again) where we only pick the top 10% and bottom 10% for the training set. The idea behind this is that by picking just the best/worst performers we might better find an anomaly that has to do with extreme returns/losses. It seems that this is helping.
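That filtering step might look like the following sketch (names hypothetical):

```python
import numpy as np

def extreme_movers_mask(fwd_returns, pct=10):
    """Keep only the top and bottom `pct` percent of forward returns."""
    lo = np.percentile(fwd_returns, pct)
    hi = np.percentile(fwd_returns, 100 - pct)
    return (fwd_returns <= lo) | (fwd_returns >= hi)

fwd_returns = np.linspace(-0.10, 0.10, 100)   # stand-in returns
mask = extreme_movers_mask(fwd_returns, pct=10)
# X_train, y_train = X[mask], labels[mask]  # train only on extremes
```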

@ Thomas -

Sorry, I must be missing something. Say I decide to do one of these long-short thingys with only two stocks, ABC & XYZ. So, no matter what, I'll go long ABC & short XYZ (or vice versa). What if ABC & XYZ are highly correlated? If my model says that both will go down, shouldn't I short both, and then go long an ETF, as an overall market hedge?

I guess I don't get the concept of going long-short if my model predicts everything (or almost everything) will go in a certain direction. Does it all work out somehow in the limit of a relatively large number of volatile stocks, over many market cycles? The whole scheme you guys are cooking up would seem to require certain assumptions about how the market works, no?

Grant, in your case of ABC and XYZ, think about it as a pair trade. Correlation is a fine thing, so long as a) they aren't 1.0 perfectly correlated and b) you know which one to long and which one to short.

It's OK if your long-short model predicts everything to move in the same direction. The important part is that your model has to have different predictions for each stock (or at least predictions for a top book and bottom book). Look again at Thomas's examples:

1) long book returns: 5%, short-book returns: -5% -> profit: 5% + (-1)*-5% = 10%
2) long book returns: 15%, short-book returns: 5% -> profit: 15% + (-1)*5% = 10% (bull market)
3) long book returns: -5%, short-book returns -15% -> profit: -5% + (-1)*-15% = 10% (bear market)

Summarizing: If your model predicts every equity to behave exactly the same, it's useless. If your model accurately predicts differentiation between equities, then it has alpha.

Grant,

The great thing about maintaining these long and short baskets is that, irrespective of the market, you can make money. As Thomas illustrated above, it is all about the spread. I understand what you are saying: if you know things are going down or going up, why not just do one leg with a hedging instrument? The answer is that by going long and short based on our predictive scheme, we are completely isolating the performance of the algorithm to just our ability to predict. As long as we are confident in our model (in this case our ML classifier), then we don't care what the market is doing. We don't want to bet on the market - we want to bet on our model.

As far as correlation goes, let's take those two stocks and extrapolate to 200 stocks. Now we have two baskets (long and short), each with a beta of approximately 1.0 (assuming here that on average 100 stocks would have a beta about equal to the market). Now, by putting equal amounts of the portfolio value in each leg, we get 1.0 + (-1) * (1.0) = 0, for a low-beta algorithm. It is this property that is so great, because it removes the need for a hedging instrument entirely.
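The beta arithmetic above, spelled out (assuming, as stated, each basket averages a beta of about 1.0):

```python
# Equal-weight dollar-neutral portfolio: half the capital long, half short.
w_long, w_short = 0.5, -0.5
beta_long, beta_short = 1.0, 1.0   # assumed average beta of each basket

# Portfolio beta is the weighted sum of the leg betas; the short
# weight is negative, so the market exposures cancel.
portfolio_beta = w_long * beta_long + w_short * beta_short
```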

I think Grant's question is more concerned with the corner case of the classifier predicting -1 for everything. That's valid, but in reality we are not using the binary label but rather have the classifier predict probabilities of the stock going up or down (in the notebook it's .predict_proba()). In that case you get a continuous measure which you can rank, and then short the bottom and long the top, even if the probabilities for all stocks are < 50%.
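A sketch of that ranking approach, using a synthetic model for illustration (the notebook's actual classifier and features differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X = rng.randn(200, 4)
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Probability of the "up" class for a fresh cross-section of stocks.
X_today = rng.randn(20, 4)
p_up = clf.predict_proba(X_today)[:, 1]

# Rank and take the extremes -- this works even if every p_up < 0.5.
order = np.argsort(p_up)
n = 5
shorts = order[:n]        # lowest predicted probability of going up
longs = order[-n:]        # highest predicted probability of going up
```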

@Mikko: Thanks, I completely agree. After all, this is still a toy example and finding a significant edge with well-known factors and a simple classifier in the current market regime is probably a lot to ask. Your NB is very interesting, especially only using top and bottom 10% movers. I wonder if a more sophisticated classifier does better with these inputs.

Thanks Dan, James -

Just trying to understand the assumptions here, how the thing actually works (and why it might not work under certain conditions). Taking Thomas' example again, assume that my forecasting is 100% accurate. Then I could do:

1) long book returns: 5%, short-book returns: -5% -> profit: 5% + (-1)*-5% = 10%
2) long book returns: 15%, short-book returns: 5% -> profit: 15% + (-1)*5% = 10% (bull market), or 20% if all long
3) long book returns: -5%, short-book returns -15% -> profit: -5% + (-1)*-15% = 10% (bear market), or 20% if all short

Why wouldn't I do this? Instead of a blended profit of 10%, I end up with 17% (assuming scenarios 1, 2, & 3 are equally probable). Is the implicit assumption here that I can forecast individual stocks but not the entire market (which would be nice, since then I could write a much simpler algo that would just go either long or short in SPY)? I'll never get the additional 7%? There's too much risk in going all long or all short, since my ability to forecast the market stinks? I'd agree with Dan's statement "If your model predicts every equity to behave exactly the same, it's useless" except that if my model is actually predictive, then I'd just chuck the market neutrality constraint, and go either long or short, but not both equal weight.

Perhaps the confusion lies in the fact that you have an implicit market neutrality constraint that, as James points out, one hopes can be met if enough stocks of the right flavor are included in the long-short portfolio. The ranking and then going long-short (top-bottom percentiles) equal-weight will satisfy this constraint under certain assumptions. So, if I play my cards right, I'll break even, worst-case (on average, I'll make no profit), although I have to make enough to recover my Vegas flight and hotel. On top of that, there is some magical uncorrelated profit (the so-called "alpha"). By the way, since "beta" is so undesirable, if Q makes any money this way, I'd be glad to take it off your hands.

Jonathan L. suggests that there may be a deeper understanding:

Today I’ll add substance to that philosophy by giving you a detailed tour of the investment process for a popular and deep area of the quantitative investment world: cross-sectional equity investing, also known as equity statistical arbitrage or equity market neutral investing. This approach to equity investing involves holding hundreds of stocks long and short in a tightly risk-controlled portfolio with the goal of capturing transient market anomalies while exhibiting little to no correlation to market direction or other major risk factors.

Perhaps there is a really good (not highly mathematical/theoretical), short tutorial out there that fills in the gaps (and shows that it actually can work)? Sounds wonderful, but it is hard to picture, particularly in a statistical sense. Why is it called "equity statistical arbitrage"? How would I know that I'm doing "arbitrage", i.e. printing money, and not something else? If beta is zero, then is it automatically "arbitrage" by definition? Or do I have to show some form of point-in-time cointegration for the long-short baskets?

Is the implicit assumption here that I can forecast individual stocks but not the entire market (which would be nice, since then I could write a much simpler algo that would just go either long or short in SPY)? I'll never get the additional 7%? There's too much risk in going all long or all short, since my ability to forecast the market stinks?

Yes, that's exactly right (no snarkiness intended).

Well, is this assumption justified? What if I'm able to forecast the overall market, even a little bit? Wouldn't your workflow and alpha combination step be leaving money on the table? Or does it allow for a point-in-time market "tilt" to the portfolio (i.e. beta not exactly zero)? Does it allow for beta to be between -0.3 and +0.3, for example, as the contest allows?

It's just the assumption of the workflow outlined here. If you can forecast the overall market you should absolutely do so. Perhaps there's a way to combine the two (long/short and market-forecast) by having a separate macro-market model that introduces said tilt.

@Thomas I actually also tried random forests, extremely randomized trees, and gradient boosting, but they all got worse accuracy than Naive Bayes. It's also worth noting that log loss went up with Naive Bayes, so it's not clear if it's "better". This should be tried in a trading environment to make sure which one finds better trades. I would assume that if we use predict_proba and trade on the probabilities, then the random forest would do better (but I would also guess that neither would be profitable).

By the way, sklearn has this thing called a classification report, which might be better than the simple accuracy score:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report

I also like Cohen's kappa a lot (the classification report does not include it, and neither does any older sklearn version):
https://en.wikipedia.org/wiki/Cohen's_kappa
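For reference, both are one-liners in scikit-learn (cohen_kappa_score only exists in newer sklearn.metrics versions, matching the caveat above); the labels below are made up for illustration:

```python
from sklearn.metrics import classification_report, cohen_kappa_score

# Made-up +/-1 labels and predictions for illustration.
y_true = [1, -1, 1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, -1, 1, 1, -1]

# Per-class precision, recall, and F1 in one text table.
report = classification_report(y_true, y_pred)

# Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level.
kappa = cohen_kappa_score(y_true, y_pred)
```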

Sklearn also has its own pipeline that can be used to get rid of all that transformation stuff (see the attached notebook, which uses the same transformations and a Random Forest for classification).
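A minimal sketch of such an sklearn Pipeline (the steps here are illustrative, not the attached notebook's exact transformations):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# The preprocessing and the model live in one object, so the same
# transformations are applied consistently at fit and predict time.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X, y)
preds = model.predict(X)
```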

And here is a confusion matrix for the same prediction:
https://en.wikipedia.org/wiki/Confusion_matrix

This usually conveys important information, especially when working with unbalanced datasets (not the case here) or with more than two classes.
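A small worked example (made-up predictions; rows are true classes, columns are predicted):

```python
from sklearn.metrics import confusion_matrix

# Made-up +/-1 labels and predictions for illustration.
y_true = [1, -1, 1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, -1, 1, 1, -1]

# `labels` fixes the row/column ordering: first -1, then +1.
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
```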

By the way, it's worth studying which method gets you better results in trading: predicting relative movement, or predicting the probability of going up/down and going long/short a percentile of those predictions. Because of volatility clustering, the usual machine learning statistics often don't tell you the whole truth about classifier robustness in a real trading environment.

Hey Mikko - I'm lurking in this thread as I'm new to Quantopian. But I saw you mention Random Forests and was curious if you happen to have an example of that you'd be willing to share. I'm trying to find a decent example of running a list of vectors through a simple RF with Quantopian. Thus far I have only worked with ANNs and SVMs (outside of Quantopian) and I'm not sure what the appropriate dot file format is, appropriate first settings, etc.

Maybe you (or someone else) can help out? Any pointers would be welcome. Thanks in advance! :-)

@Evil @Mikko

Just in case you guys look to take a RF model to the IDE, I want to remind you that random number generation is not currently supported in live trading (per the live trading guidelines).

@Jamie: My current approach with SVMs (outside of Quantopian) is to produce the model externally and then use it to parse during live trading, which does not use much processing at all. I would draw the same separation if I ended up using Quantopian for SVMs, ANNs, or RF models. Thanks for the pointer.

@Jamie

Now I'm a bit confused: almost every more advanced ML method uses randomness as part of training, including AdaBoost (used in this example), neural networks, and random forests (and almost every other recent method).

If there is no way to load classifiers trained elsewhere to live environment and there is no way to train a classifier/regressor in live trading environment then why are we even talking about machine learning?

Hi Mikko,

Sorry, you're right, I wasn't clear at all. Currently, live trading only supports deterministic algorithms. As a result, if you want to live trade with an algorithm that depends on random number generation, you have to specify a seed. For example, with the RandomForestClassifier in sklearn, there's a random_state parameter that you can use to specify the seed. This will make the algo deterministic. This is only necessary in live trading.

Let me know if this helps.
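The point about the seed can be checked directly: with random_state fixed, two identically configured RandomForestClassifier fits produce identical predictions (synthetic data below for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)

# Two classifiers built with the same fixed seed yield identical
# predictions, which is what live trading's determinism requires.
clf_a = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, y)

same = np.array_equal(clf_a.predict(X), clf_b.predict(X))
```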

Okay this is good to hear. I was afraid that it's not possible to use random at all. Having to specify the seed sounds very reasonable.

I'm still hoping for a pointer regarding the RF classifier. So sklearn is it then?

See the notebook I attached to the post; it uses a random forest (from scikit-learn, i.e. sklearn, which is a widely used library).

Here is the ensemble page with some other alternatives too; from there you can find the link to the Random Forest API as well.
http://scikit-learn.org/stable/modules/ensemble.html

@Mikko - outstanding, much obliged! :-)

@ Thomas -

That's valid, but in reality, we are not using the binary label but rather have the classifier predict probabilities of the stock going up or down (in the notebook is .predict_proba() for that). In that case you get a continuous measure which you can rank and short the bottom and long the top, even if the probabilities for all stocks is < 50%.

So presumably this would allow one to go more long than short (or vice versa) and then neutralize the beta with SPY. This is what I do in an algo I'm working on. My general advice is that the workflow should support both the long-short equity only style, and one that accommodates one or more hedging instruments (e.g. ETFs). Or is there a compelling reason to be in all equities?

@Thomas: since you cannot use something done in research directly as an algorithm, if you were to port this to the algo pipeline side, how would the code look?

Hi Thomas -

I don't quite understand the desired output of this alpha combination step. In my mind, the most general approach would be to input a universe of 500 (or 1500) stocks, point-in-time, and then output a corresponding vector of 500 weights, normalized as a portfolio weight vector (e.g. ready for order_target_percent). Or does this happen at the next step of portfolio construction? And if so, how do the two fit together? I'm confused, since you seem to be setting some of the weights to zero (by only working on the extreme percentiles). Shouldn't every stock get some weight, and then the portfolio construction step would optimize subject to constraints, taking risk and other factors into account? Or is it more like the alpha combination step passes synthetic securities to the portfolio construction step? You've clumped them together; they are no longer individual instruments? You end up with a minimum of two clumps, the longs and the shorts?

Grant

Why is working capital to assets the most predictive feature? This does not make any sense. But I see a more basic issue with this ML approach.

ML works best when features are spatial, not temporal, as in character and picture recognition. If you have a picture with the features changing in time ML would produce random results.

As someone mentioned, long/short makes no sense, and IMO "market neutral" is a misnomer. Funds learned that the hard way during the 2008 crash. When everything goes down, your spatial ML classifier will fail to predict it. Random profits from several years will vanish into thin air in less than a month.

Numerai is already doing this and results are not good based on payout for best predictions.

Not to mention data snooping.

I would like to see a good explanation as to why ML works with time series and time-based features. Until then I will trade SPY long and I am doing 12% this year with simple strategy and limited risk.

First off, this is a great intro. But I have two comments/questions:

1: It can be argued that access to tools like sklearn has only been widespread for a few years (sklearn didn't even have its initial git commit until 2011). If we see alpha from using ML techniques on backtests, couldn't it be argued that the alpha only exists because most market participants didn't have access to ML at that time, and that in the current era, when anyone with a laptop can do this analysis, we can expect it to rapidly decay?

2: The post mentions briefly using cross validation, and then links to the sklearn cross validation tools, but I would like to point out that those tools are unusable for time series. The sklearn cross validation suite just picks random samples for training and random samples for testing, but it should be obvious that this introduces look-ahead bias.

Thanks

@Brom - my 2 cents:

1) I have been developing trading strategies for the better part of the past decade, and let me assure you that having access to ML tools/libs doesn't mean being able to produce and then maintain an edge. Besides, running simulations in Mathematica or now Quantopian/QuantConnect is one thing - integrating various ML techniques into a viable trading strategy is quite another. The devil lurks in the details...

2) How is randomized testing associated with look-ahead bias? Please explain how this is obvious. IMO randomization reduces curve fitting and especially market cycle dependencies.

@Evil Speculator

I think we agree on point 1.

On point two: if you train your ML model on data from 2016, then test it on data from 2015 and get a high accuracy score, it seems like you shouldn't really trust it, since you wouldn't have had the data from 2016 if you were trading in 2015.

Cross validation / look-ahead bias: I'm not sure it really matters, as long as at the point of running the training you are using data which would have been available at the end of the training period. Unless, of course, you then integrate the trained weights into a backtest over the training period. You should/could use the trained weights thus obtained, but only for the validation and test sets. And going forward, the same thing applies. That is my tentative reading at this stage, anyway.
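Newer scikit-learn versions ship a splitter that respects time ordering, TimeSeriesSplit, where every training index strictly precedes every test index, so the model never sees "future" data at fit time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(10, 2)   # 10 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

# In every fold, all training indices precede all test indices.
leak_free = all(train.max() < test.min() for train, test in splits)
```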

Interesting point regarding spatial versus temporal. I gather that Thomas may be doing something analogous to determining a 2D response surface by sampling the X's over many periods. For example, say I have factors X1 & X2. To sample X1 & X2 and determine a response surface, I have to let X1 & X2 vary over time. However, if I then ignore the fact that the X's were spread out in time, there's potentially something wrong with the analysis, unless every week is like every other. Does stationarity have to be assumed for this to work?

Nice work here on ML, but I also have doubts about the effectiveness of the method. As someone already mentioned, ML does a good job when the features are in the space domain. In the time domain, ML does not appear to do a good job. In fact, ML was popular in the 1980s, but traders abandoned it for much simpler algos as soon as they understood the problems.

As far as cross validation goes: there is an issue because it assumes i.i.d. data. But the most important issue with ML is that each time you update your data the classifier must run again, and that means data snooping is inevitable due to reuse of old data.

Unless you are going to use ML only once for a single prediction - something that is not practical of course, but is what is done with text and face recognition - then each time it runs with updated data, the probability of a Type I error is higher. This must be understood, and ML is immediately disqualified unless there is a sound method for determining the significance of results that can compensate for the bias.

I have written a post about Numerai. It is interesting to see how more than half of the predictions are worse than random.

It is also important to realize that better ML metrics (logloss, accuracy, etc.) do not necessarily mean better predictions. Most top performers in terms of logloss in Numerai have probably achieved that via "boosting" based on knowledge of the public leaderboard. On the other hand, good predictions can be generated by random participants. Due to that, Numerai takes many predictions into account (controlling capital) in an attempt to average out the noise. This is smart but at the same time risky, as the probability of ruin is not zero and at some point will occur; we just do not know when - hopefully in a long time, and after Elon Musk has already colonized Mars. :)

Conclusion: ML for markets can be an exercise in futility. I have spent many many hours testing different ML algos:

Logistic regression (with and without L1, L2 and k-fold validation)
Random forests (with and without k-fold validation)
Gradient Boosted Trees (with and without k-fold validation)
Decision trees (with and without k-fold validation)
SVM with RBF and sigmoid kernel

None of the above worked for me unless I already knew which feature would work. But if I know that, ML is not really required; I can use a simple scan. I do not discount the possibility that it can work well for someone else and that I was not doing it right. But that was my experience.

ML is good at telling you what IS. I suspect no method, including ML is very good at predicting what will be.

People can't have it both ways. If the markets have no memory and autocorrelation does not exist, then cross validation on time series is perfectly valid - provided it is based on data available at the time of testing, with no look-see as to tomorrow.

The exercise would nonetheless be futile by definition.

This must be understood, and ML is immediately disqualified unless there is a sound method for determining the significance of results that can compensate for the bias.

I absolutely agree with the above. Fitness method is absolutely the most important aspect when working with machine learning.

I'm also not sure that using notebooks to find results in a general way, without time series analysis, is a good idea at all. The accuracy of the classifier might tell you that everything is better than great, but the actual trading results can be catastrophic due to volatility clustering, which this kind of analysis does not take into account.

I have done a lot of work with genetic algos in the past (some discussion in this old thread http://www.forexfactory.com/showthread.php?t=167720 - please note that this was before the current ML/deep learning hype), and my observation, based on lots of studies of actual trading results, is that the best out-of-sample results (it's very easy to get good in-sample results) come from a very simple fitness method: normalized_returns/maximum_dd (and you might want to include some correlation analysis if you are trading multiple methods). But you always want to get the fitness score from real market data, so that volatility clustering is taken into account.

Does ML work? This is like asking if it is possible to find trading methods at all. IMHO a computer can find methods as well as a human can, but you had better be sure that you know what your requirements are, as the computer will find you the methods you have asked for, not the ones that you want.

One way to think about this is to suppose that everything is static, going back to the Stone Age (like physics). Then, I can assume a model Y = f(X1,X2), and the problem becomes finding the unknown function f. Say I'd like to know how long it takes to bring a water/ice mix to boil on my stove top, with a constraint on the total mass of water in the pot. So, X1 could be the mass ratio of water-to-ice, and X2 could be the setting on my burner, 1-10. And Y would be the time it takes for the water to come to a boil (I could get fancy and use a thermometer, or just look for bubbles). I could control the experiment by stirring the pot in a certain way, etc. I would then hire a graduate student, or more likely, an unpaid middle-schooler to conduct N measurements to find the relationship, f. If I know what I'm doing, I'll find f and I'll be able to predict Y, the time to boiling, just fine (most likely, with some finagling, I can get a model in the form of a multivariate polynomial in X1 & X2, over some range of X1 & X2).

For the stock market, one would hope to apply the same approach, but of course controlled measurements are not possible (and would probably be illegal if they were), unless X1 & X2 naturally vary in time over the parameter space of interest (which of course they do). Then, I could say 'Aha!' I'll just let the market do the work for me, and once I have a sufficiently large data set of (X1,X2,Y), I'll figure out Y = f(X1,X2). If changes are slow, I can simply update Y = f(X1,X2) as new data become available, making incremental adjustments. The adjustments don't really play into my short-term forecasts; they just keep the model in line with reality over the long term. This is not unlike the water/ice boiling model construction, where I might have some long-term aging effect (e.g. gradual deterioration of my heating element) that I can compensate for by repeating the measurements on an appropriate time scale.

One problem, it would seem, is that I'd like to trade at least weekly, if not more frequently, but there are only 252 trading days in a year. And I only have one noisy data point per day (if I'm following, the proposed workflow is based on pipeline and daily OHLCV bars). But then I'd like a model that is Y = f(X1,X2,...XN), where N is a large number and I'd like to include non-linearity and interactions up to some order. My intuition is that unless the signal-to-noise is really good and the market mechanisms are stable over many years, I won't have enough data to trade frequently (which I'd like to do on an effectively continuous basis, to have tight control over the portfolio return, to increase the Sharpe ratio).

My intuition is that basing the workflow on daily OHLCV bars doesn't make fundamental sense, when a historical database of minute OHLCV bars is available. It doesn't mean that the input data would be greater in dimension (i.e. one could still use 252 data points per year), but they could be significantly better quality in terms of signal-to-noise (e.g. one could compute daily factor values using minute OHLCV bars or summary statistics derived from minute OHLCV bars). For some reason, this basic point has not gotten traction. I'm befuddled. Am I wrong in my intuition? Whether I use ML or not, I'll be much better off with the greatly improved signal-to-noise.

Grant, if you look at the code in the notebook you'll notice that the factors used for prediction are quite long-term ones, derived from both fundamental and price data. For these kinds of factors I don't think there is much sense in using minutely data for aggregation, but I might be wrong.

If we were regressing price data, then the question would be whether data aggregated from 1-minute bars (or an even lower timeframe!) would give us better results than using daily data. I don't think we can answer this question without actually implementing both methods and checking the results.

We aren't actually using daily data, unless the signal-to-noise of a single trade in a given day is the same as that of the mean price for the day, for example. The mean must be a lot less noisy, since it would be the average of 390 individual trades; the noise will be knocked down by a factor of 1/sqrt(390). And the mean would be a better representation of the price for the day; a single trade of unknown volume could be pretty wacky.
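Grant's 1/sqrt(390) point is easy to verify with a toy simulation. The i.i.d. noise here is an assumption; real intraday noise is autocorrelated, so the actual improvement would be smaller:

```python
import numpy as np

rng = np.random.default_rng(0)
true_price = 100.0
n_days, trades_per_day = 10_000, 390  # ~390 minutes per trading day

# Each "trade" is the true price plus i.i.d. noise with std 1.0.
trades = true_price + rng.normal(0.0, 1.0, size=(n_days, trades_per_day))

close_noise = trades[:, -1].std()       # use only the last trade of the day
mean_noise = trades.mean(axis=1).std()  # use the daily mean instead

print(close_noise)  # ~1.0
print(mean_noise)   # ~1.0 / sqrt(390), i.e. about 0.05
```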

Re stationarity: That's certainly the assumption within the window of the training data. But if you don't assume some form of stationarity over a limited time period, how would you do predictions of any nature in the first place? What the classifier is trying to learn is that if a stock has this pattern of factors (lower vol, higher earnings quality, etc.) it will do better than the others in its universe.

I would not expect these patterns to hold indefinitely. One way to get around that is to retrain the classifier e.g. every week.

@Brom: Cross-validation is a way to see how robust your classifier is, and it allows you to tweak hyper-parameters. It also gives you some guidance on what to expect for the hold-out. It's true that we mix future and past data there, but that's why there is the hold-out set with only future data. That is the only number we should really care about, as I write in the NB. Cross-validation is just a tool to get more out of our training data.
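In scikit-learn terms, that division of labor looks roughly like this (synthetic data; the classifier choice is just for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # 8 synthetic "factors"
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# shuffle=False keeps the hold-out strictly "later" than the training data.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, shuffle=False)

# Cross-validation on the training data only: guidance for tuning.
cv_scores = cross_val_score(GaussianNB(), X_train, y_train, cv=5)
print(cv_scores.mean())

# The hold-out score is the number that actually matters.
clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_hold, y_hold))
```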

@Suminda: I have a notebook almost ready to go that shows this.

@Michael Harris: I do know of systems not unlike the one I present here (albeit more complex) that have shown very competitive performance over long periods of time. I do agree that it's all about the factors (garbage in garbage out) and any system that does not allow you to innovate at that level will be limiting. The ML or factor aggregation step however can turn many weak signals into a stronger one, especially if they are uncorrelated. But it can not create signal where there is none. I don't think that the well-known factors I used here would be expected to carry much alpha. That's where the Quantopian data sets get interesting.

@Grant

In my mind, the most general approach would be to input a universe of 500 (or 1500) stocks, point-in-time, and then output a corresponding vector of 500 weights, normalized as a portfolio weight vector (e.g. ready for order_target_percent). Or does this happen at the next step of portfolio construction? And if so, how do the two fit together?

That's exactly right. Transforming the long and short baskets into portfolio weights is part of the portfolio construction step. Given the baskets (just binary long/short signals or probabilities), it would run portfolio optimization to, e.g., (i) minimize volatility, (ii) minimize exposure to certain risk factors, (iii) take the previous portfolio into account to reduce turnover, (iv) take transaction and borrow costs into account, etc., to determine the weight vector for order_target_percent(). We haven't discussed this step at all yet but it is in the works.

@Mikko: Thanks, love your contributions. I didn't know about the classification report which looks very useful. I do like the sklearn pipeline approach, it makes the fitting and prediction steps much more concise. I also agree that a classifier should be evaluated in a broader context and over longer periods of time. My next NB will show how to do that.

@Anthony,

"ML is good at telling you what IS"

This is interesting. The problem is that "what IS" is always relative to some metric. In some domains (for example, relativistic physics) the metric is constant over a wide range of parameters (in the case of relativity it is the spacetime metric, which breaks down only at the quantum level). In the market, the metric changes all the time depending on conditions. "What IS" changes constantly, and this happens naturally due to liquidity constraints, among other things.

@Mikko,

"...before the current ML/deep learning hype..."

The hype is there because the industry must morph to maintain its profitability. There is always a new promise. In the 1990s the industry offered chart patterns and simple indicators. Most traders were ruined. Now the industry offers GP, machine learning and promises of "deep learning." Many more will get ruined. Then the industry will morph to something different. I expect home HFT to be available soon for a few dollars a month.

There are good applications for ML and deep learning in other domains where features are well-defined and there is no risk of ruin. I remember someone telling a story in a forum that their ML identified a pattern in stolen credit cards: the thief goes to buy gas immediately after stealing the card, with high probability. You can then inform the gas stations to be careful in accepting credit cards. Here the application limits losses. Face recognition and averting a crime may limit losses of human lives. But asking ML to generate pure alpha in the markets is too much. Maybe its best use is for risk management, not for alpha.

@Thomas,

AMZN with a P/E of about 205 (TTM) (reminiscent of the dot-com bubble, BTW) is moving the tech market. The classifier could easily get fooled into thinking that gains in some stocks are due to some factors when in fact they are due to index correlation. I do not know how to solve this problem (de-correlate the data), and I am not even willing to try when a simple mean-reversion algo that I use, based on a simple formula out of a college text on probability theory, is performing quite well YTD while ML algos are struggling to stay in the black.

Please do not get me wrong, I find high value in Quantopian and I actually use it now for backtest sanity checks. I think this platform is quite valuable for traders and everyone should use it and contribute. I have doubts about ML and long/short systems with the designation "market neutral." There is no such thing, and if there is, it is only temporary, and the risks are much larger than those of directional algos. One should then be skeptical about why some people are in favor of these market-neutral strategies. Note that this market is notorious for squeezing out shorts and has done so repeatedly in the last 7 years. Why would anyone want to go short individual stocks when the risk is so high? I could think of some reasons but this is not the place to discuss them.

Michael

"ML is good at telling you what IS"

What I meant (but did not specify) is that ML is great for such things as text or image processing. But probably less great at predicting the future state of a chaotic system.

Michael

As usual I find myself hard pushed to disagree with much of what you say. The only slight difference is that if that is what managers like Numerai want and are willing to pay for... well, am I going to complain?

Although even then I still find myself in a slight moral quandary.

Disclaimer - I run a trading blog and have had the dubious pleasure of seeing several thousand people come and quickly go (i.e. wipe out) over the past eight years plus. And quite frankly the shake out rate is equally high among regular traders and highly intelligent ones with programming or even quant/math backgrounds. Intelligence and system complexity are not guarantors of success! Otherwise we'd see a lot less white papers on trading algos and more yachts parked over here in the Mediterranean ;-)

Machine Learning is just that. The wiki describes it as 'a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.' It says nothing in there about predicting future market activity. Which, just like the future, you CANNOT do. The best you can do is to look for discrepancies, to parse for subtle repetitive patterns, to develop a small edge based on market inefficiencies. You work with the factors and markets you KNOW and NOT with the ones YOU THINK YOU KNOW or THINK/WISH WILL HAPPEN.

Most of you guys probably have forgotten more about mathematics and machine learning than I will ever manage to grasp in my life. I'm literally standing on the shoulders of giants here and I'm more an ML groupie than a quant. But over the past years I nevertheless have managed to hold my own and even help many others develop and maintain a profitable edge. And for the sole reason that I DO NOT EVER ATTEMPT TO PREDICT THE FUTURE. Having spent the better part of the last decade developing trading systems (for myself and clients) I assure you that anything that may appear to be predictable will quickly cease to be. I cannot tell you how many systems I have seen thrive for a few weeks or even months and then nosedive into oblivion. It's probably the one reliable pattern I am confident about when it comes to trading systems.

Part of the problem is that markets are cyclical and that the nature of those cycles is not easily determined. The best indicator is your P&L curve, and one of the most promising techniques for improving my systems has been knowing when to turn them off and then on again. There are many very profitable systems out there - you just need to figure out when they work and when they don't. That is where I believe the rubber meets the road.

Now machine learning is extremely fascinating to me, but in my mind it is also a double-edged sword. I do not believe that complexity equals success - quite the contrary, actually. Give me a simple system with simple (human-)observable rules any day, instead of having to run three parsers and decision trees in order to produce an entry or exit. The ML systems I am interested in building are simple by design, but their patterns require ML techniques to be unveiled. I would not feel comfortable trading via a black box that spits out rules I cannot understand.

FWIW this is one of the better discussions I have seen on this topic and it reflects the high caliber of the people who participate here.

Anthony,

One cannot know the true intentions of a manager like Numerai. The concept is smart and the manager appears also very smart. But I have no idea what he is doing in the background. Numerous tests I have done on the training and tournament data show that they are i.i.d. despite claims to the contrary, and that the highest possible edge without any clever boosting is on the order of 3%, with a logloss higher than 0.6900. Of course, some participants can lower logloss via boosting and get a top placement, but that does not increase prediction accuracy.

Now it is possible to profit with a 3% edge if there is a decent payoff ratio and a large number of low cost transactions or a high payoff ratio with a lower number of transactions at moderate cost. The problem with this ensemble approach (expectation) is possibility of ruin in the time domain. It is gambling in essence. He may have found a way to minimize risk of ruin but it is still there, it is finite and as a result, ruin will occur at some point (in this sense, ruin even takes place when a cumulative stop-loss is reached after which trading stops.)

However, these problems are not particular to ML but present in all trading methods. ML has additional issues with high data-snooping, selection bias and overfitting. Note that all these are in play when one uses ML. In essence then, probability of success is very low.

@Brom: "On point two, if you train your ML model on data from 2016, then test it on data from 2015 and get a high accuracy score, it seems like you shouldn't really trust it since you wouldn't have had data from 2016 if you were trading in 2015."

Obviously not. The better approach would be to train your model on data from 2008, 2010, 2014, etc. and then test it on 2016. Most likely this will break your model (completely different market cycles), and so your real task begins: to figure out the conditions under which your model functions within your defined parameters. As a side note, and I concede that this may sound horribly unscientific to an audience of quants: the root of all evil in system development may be in attempting to produce systems that work continuously. There may be some exceptions to this rule, e.g. the time-insensitive systems discussed further above, but I suspect that many models fail because they are based on the assumption that market patterns/fractals/cycles are continuous. I think our time may be much better spent modeling what sometimes works well, as opposed to what always works.

@ Thomas -

Transform the long and short baskets to portfolio weights is part of the portfolio construction step. Given the baskets (just binary long/short signals or probabilities) it would run portfolio optimization

My point is that the workflow that is being discussed and engineered is a kind of system with interfaces. There is a step-by-step flow of inputs and outputs (for a high-level sketch, see https://blog.quantopian.com/a-professional-quant-equity-workflow/ ). As a side note, I would encourage the whole financial industry to stop using the term "alpha" and use "factor" instead, in this context. For one thing, it pre-supposes that after the universe definition step, you've magically found factors that have been expunged of any overall market influences, and can print money in an uncorrelated fashion. Can "alpha" as it is conventionally defined, even be calculated accurately until the backtest step anyway? Save the buzz-words for the marketing guys.

Back to my point...as I understand, you are working to put together a factor combination step using ML that will be general, and applicable to the vast majority of users. You want a kind of module (like the Q500US & Q1500US universes) that is broad brush and configurable, and that will provide the most general set of outputs to the next step in the workflow, which as you describe, is another optimization. So, my sense is that if your factor combination step takes in N securities, it ought to spit out N securities, as a normalized portfolio weight vector (e.g. a series of calls to order_target_percent(stock,weight[stock]) would update my portfolio). There would be no baskets of long-short; the output of the factor combination step could be all long or all short, for example.

The next step in the workflow, portfolio construction, would operate on the portfolio weight vector, as you say:

run portfolio optimization to e.g. (i) minimize volatility, (ii) minimize exposure to certain risk factors, (iii) take the previous portfolio into account to reduce turn-over, (iv) take transaction and borrow costs into account etc

At this step, for example, you would apply the market neutral constraint, which is a risk factor to be managed. If you impose the market neutral constraint by design earlier in the workflow, then one can't apply it in the portfolio construction step, where it would seem to reside. Your factor combination step, I think, should be an unconstrained maximization of the overall portfolio return, point-in-time (which may be a problem, since there is no constraint on the new portfolio vector bearing any resemblance to the old one...this is handled in the OLMAR algo by minimizing the square of the Euclidean distance between the old and new portfolio vectors, subject to an overall forecasted return inequality constraint). In the factor combination step, are you effectively finding the point-in-time overall return response surface and then finding the peak, with no constraints in the optimization (under a stationarity assumption, I gather)? Is the idea that you can sorta patch things up in the next step of portfolio construction, by re-normalizing the portfolio vector output from the factor combination step?

I'm wondering if you can actually approach things as you are. Your combination and optimization steps may be better thought of as one step. You are wanting to solve a constrained global optimization problem, I think, to control the overall portfolio return versus time. Shouldn't you be combining the factors to maximize the forecast return for the next portfolio update, subject to a set of equality and inequality constraints? I don't understand how you can break things up, and still get the right answer.

Grant

EDIT: Maybe what you should be passing to the portfolio optimization step is effectively a point-in-time response surface model for the overall return (the "mega-alpha")? Is that what you are doing? Then you can use the model in the constrained portfolio optimization problem? This would seem to make sense.

@ Thomas -

As a baseline, to see if this fancy ML jazz is buying you anything, you might try the old-school response surface methodology, which amounts to a multivariate polynomial fit to predict the response. If you have stationarity and your factors vary enough, you'll be able to establish the point-in-time response surface of the forecast return and find the maximum. Computationally, it should be easier. If you get the same (or a better) answer than ML, then chuck the ML (or find a ML approach that is superior to the multivariate polynomial fit).
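A sketch of that baseline with scikit-learn's PolynomialFeatures (synthetic factors and returns, purely to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))  # three synthetic factors
# "Return" with a linear term and an interaction term, plus noise.
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] * X[:, 2] + 0.2 * rng.normal(size=500)

# Second-order response surface: linear, interaction, and squared terms.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression())
model.fit(X[:400], y[:400])

r2 = model.score(X[400:], y[400:])  # out-of-sample R^2
print(r2)
```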

A problem, I think, is that at each point-in-time, your multivariable polynomial model will change. Not all terms will be statistically significant at all times. So, you'd need some way to evaluate which terms to keep at any given time. Manually, this is done by assuming a higher order model and then reviewing which terms should be dropped based on their fit statistics. Then a re-fit is done, with a simpler model, to avoid over-fitting.

@ Thomas -

Re stationarity: That's certainly the assumption in the window of the training data. But if you don't assume some form of stationarity over a limited time-period, how would you want to do predictions of any nature in the first place?

So aren't you then obligated to test for stationarity in some fashion? Or would this have been done in picking the factors in the first place (with alphalens)? But then, what if the factors go in and out of stationarity?

Sorry, I still continue to be confused how all of this fits together...

@ Thomas -

Another thought is, shouldn't you include SPY in your analyses, as a proxy for the entire market? You would run as many factors against it as you can (should be able to augment the Morningstar data with aggregate numbers for SPY, right?). Maybe some factors run against certain stocks are simply yielding the same results as SPY, and they could be rejected? For stock ABC, if I compute factor_ABC and it is indistinguishable from factor_SPY, then I should reject the use of the factor on ABC, since I'm not interested in the market.

@Mikko M, I'd be interested in hearing more about Cohen's Kappa.

The Wikipedia page suggests that Youden's J statistic is better for supervised learning. Did you use either of these? I'd be interested in seeing how the Adaboost and NB implementations compared with these stats instead of just using accuracy.
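Both statistics are cheap to compute from a confusion matrix; here is a small worked example (toy labels, not results from the notebook):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
youden_j = sensitivity + specificity - 1   # 0 = useless, 1 = perfect
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement

print(youden_j, kappa)  # both come out to about 0.4 on this toy example
```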

Hi Thomas -

Having thought about this a bit, for the next step in your workflow, portfolio optimization, it seems like you'll want a model for the returns. For example, up to quadratic order, you'd need this:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + β12 X1 X2 + β13 X1 X3 + β23 X2 X3 + β11 X1^2 + β22 X2^2 + β33 X3^2

where Y is the forecast return and X1, X2, & X3 are 3 factors. For each security, you would have such an equation.

Is this effectively what you are doing with the ML? Combining factors in a nonlinear fashion? Why choose ML over a polynomial fit? Is it expected to do a better job?

And would you also need to forecast the volatility in returns? If so, shouldn't that be included in the modelling?

Grant

Grant,

In trading we are looking for a model with categorical dependent variable: {Buy, Sell} or {1,0}

i.e., we want to know when to buy and when to sell/short

Although linear regression could model this problem by looking at feature changes and using dummy variables to represent the categories, a more efficient way is machine learning classification, where we try to find a model that uses the features to predict the categorical dependent variable with high enough accuracy. Essentially, this model provides the buy and sell/short signals when new data become available. These signals are necessary in practice.

I hope this makes sense.
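The distinction can be sketched as follows: a classifier maps factor values directly to {buy, sell} labels, with predict_proba giving a confidence behind each signal (synthetic data; logistic regression chosen only because it is the simplest classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))  # factor values per stock-day
# Categorical target: 1 = buy, 0 = sell/short, driven by the first factor.
y = (X[:, 0] + 0.3 * rng.normal(size=600) > 0).astype(int)

clf = LogisticRegression().fit(X[:500], y[:500])
signals = clf.predict(X[500:])            # hard {1, 0} buy/sell signals
probs = clf.predict_proba(X[500:])[:, 1]  # confidence behind each signal
acc = clf.score(X[500:], y[500:])
print(acc)
```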

At a level lower than 30,000 feet, but above the ground level of individual lines of code (let's say 10,000 feet), there is a kind of overall point-in-time optimization problem being solved, which includes the universe selection, and the factors and their interactions. There's also the issue of how the universe selection interacts with the factors, which I don't think is addressed in the workflow (at least not as a factor to be automatically optimized). In the end, it would seem that as an output of the factor combination step, I should have a model of how to forecast returns for every stock independently, no? Is that where we are headed? Then the portfolio construction step would be to sort out how to apply those forecasts and construct portfolio weights (i.e. how much of my portfolio should be in XYZ at any given point in time)?

Perhaps you could provide a sense for the overall optimization problem that is being solved (across the entire workflow) and how your factor combination step fits in? Presumably, your "library of commonly used alpha factors" was run through alphalens and somehow vetted? Each one was shown to have some predictive power individually, and now you are combining them in a fashion that accounts for interactions ("a non-linear combination of features"). Sorry, as I said, I'm missing the 10,000 foot view. What are you doing?

To provide some background, I am familiar with the so-called "response surface methodology" used in design and process engineering. My fuzzy picture is that you are building a multi-dimensional response surface that can be used for optimization, subject to constraints? If so, what order of model are you constructing (first-order, second-order)? Is the ML model effectively the response surface? Or has it already done the optimization? Again, I'm obviously confused...

Best regards

Here is a version with cross-validation implemented. Note that with cross-validation your training error is very close to your test error, so you can tune your classifier and your final results will be similar to your training results. The RandomForests classifier had almost 99% training accuracy, but its test accuracy was much closer to 50%. Now, with cross-validation, the RandomForests training error is more realistic.

Additionally I have implemented another performance metric: Youden's J-stat, which ranges from -1 to 1. A J-stat of 0 says your classifier is useless. I ran it for Adaboost, NB, and RandomForests giving: -0.04, -0.06, -0.03.


I noticed that there were 20 factors defined but only 18 were being used; upon further inspection I saw that two were defined twice, so all 18 unique factors were in fact in use. The definitions were identical, so the duplication didn't change the results. Attached is a version with the duplicates removed.


@ Michael Harris -

In general, for the portfolio optimization step, I think one wants, at a minimum, the projected absolute return of each stock in the universe, using multiple factors combined to make the forecast. It also seems like it would be handy to have a measure of the expected variance in the forecast return. Additionally, projected stock-to-stock pairwise returns correlations might be useful. In the end, it is something like minimizing the rotation angle of the portfolio vector in its N-dimensional space (since there are N stocks in the universe), subject to a minimum Sharpe ratio constraint (e.g. SR ~ 1.0), and a constraint that the beta of the portfolio be within a range about zero (or not, if one wants a strong beta tilt or long-only or short-only).

If the output of the alpha combination step is just baskets of long and short stocks, with no other information, it seems like the optimization step won't be performed optimally.
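A toy version of such a portfolio optimization makes the point: the optimizer needs forecast returns and covariances, not just long/short baskets. All inputs below are random placeholders, and the Sharpe and beta constraints described above are omitted for brevity:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 5
mu = rng.normal(0.05, 0.02, size=n)   # placeholder forecast returns
A = rng.normal(size=(n, n))
cov = A @ A.T / n + 0.01 * np.eye(n)  # placeholder forecast covariance

# Minimize portfolio variance subject to full investment and a
# minimum forecast return (shorts allowed, so no sign bounds).
constraints = [
    {'type': 'eq', 'fun': lambda w: w.sum() - 1.0},
    {'type': 'ineq', 'fun': lambda w: w @ mu - 0.04},
]
res = minimize(lambda w: w @ cov @ w, np.ones(n) / n,
               constraints=constraints)
weights = res.x
print(weights.round(3), weights @ mu)
```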

But admittedly, I'm still trying to get my head around this whole thing...

Just a simple question.

How long does it take to run the last pipeline?

I think it would be nice if authors of cloneable notebooks gave an estimate of how many hours will be spent running a pipeline and waiting for a response.

Some kind of progress indicator might be fine too.

@Peter: Thanks, those are great contributions. I updated the original NB with your fixes to the factor library. It's also validating to see that cross-validation gives similar (poor) results to the hold-out, suggesting it does not overfit.

I think the next steps would be to use less common factors to try and improve accuracy.

@Guy: I updated the original NB with a timer for running the pipeline, it took around 3 minutes.

@Grant: You can certainly experiment with different classifiers, like a polynomial model. In general, these give poor out-of-sample performance due to overfitting. Beta-to-SPY is a great factor to include as well. Your comments in general are valid ways to experiment with the workflow. You shouldn't understand this as "the one and only way to do algorithmic trading on Q", but rather, "here is a starting point that might be a useful template to extend from".

We just published part 2 of this workflow which retrains the model periodically: https://www.quantopian.com/posts/machine-learning-on-quantopian-part-2-ml-as-a-factor

@Thomas,

You might have missed that but @Grant talked about regression (similar to a Probit model maybe) and portfolio optimization, not classification.

You said that "In general, these give poor out-of-sample performance due to overfitting." and I agree. But this also holds for machine learning classification.

It is not entirely clear how classification is better than either simple regression or probit regression for large long/short portfolios in the absence of a study. At the end of the day, classification is also an optimization method.

I would start by comparing the results of machine learning to a simple cross-sectional momentum model, i.e., buy the strong performers and short the weak performers. If the excess alpha is not significant, then this may be overkill, but it is nevertheless a good analytical exercise.

Data-snooping is hard to overcome and it is a serious issue in machine learning. As the number of factors and securities increase, it becomes difficult to assess the significance of the results due to the multiple comparisons problem.

Thanks Thomas -

You can certainly experiment with different classifiers, like a polynomial model. In general, these give poor out-of-sample performance due to overfitting.

I think over-fitting can be avoided by excluding terms not supported by the fit statistics. As I recall, one ends up with p-values for each polynomial term. One accepts/rejects terms based on their p-values. For example, if I have only two points (factor-response pairs) and I try to fit a quadratic to them, the quadratic terms should be flagged as over-fit; I can only fit a straight line (or a constant?). Given that one could manage the over-fitting problem, is there a reason this approach would still be a bad idea, over ML or something else?

The stationarity issue would seem to be the sticky one. If all of my factor-response data represent one unchanging, underlying market going back 20 years, then I have a shot at sorting things out. I gather, though, that we are perhaps dealing with transient stationarity. Is this correct? Don't I also need to know which factors are valid, point-in-time? Perhaps this was the point by others above, that a static ML model (or any static model, for that matter) might not work. I need to know the validity of each factor as a function of time? Normally, one would want to know if a given factor-response is "in control" over the look-back period, prior to attempting a model.

Here is a version of the notebook that uses sector codes. The sector codes are then binarized so that they can be used in an algorithm that cannot handle categorical data, SVM for example. There was no improvement in the training performance, so I didn't implement the categories in the test.
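The binarization Peter describes is standard one-hot encoding; with pandas it is a one-liner (the sector codes and tickers below are made up for illustration):

```python
import pandas as pd

# Hypothetical integer sector codes (Morningstar-style) for four stocks.
sectors = pd.Series([101, 206, 101, 309],
                    index=['AAPL', 'JNJ', 'MSFT', 'XOM'])

# One-hot encode so a distance-based learner (e.g. an SVM) doesn't
# treat code 309 as "three times" code 101.
sector_dummies = pd.get_dummies(sectors, prefix='sector')
print(sector_dummies)
```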


Gentlemen,
The efficacy or otherwise of ML can only be tested by applying the predictions to past data and looking at the profitability of the system. Note my extreme scepticism of backtesting after many years of experience; however, it is all we have. The predictive capability of backtesting as to the future profitability of a given system is... probably rather feeble.

As we all know, however, given a 50/50 win/loss ratio it is quite possible to make a substantial profit from long-term trend following. CTAs usually make do with 40/60 against.

Trend following is not what Q wants to do. Fair enough. But the rest of us may have an interest in this long-standing apparent anomaly.

@Peter: Interesting! It seems to be doing a little bit better (but probably not significantly). Personally, I'd be interested to see how it does on the hold-out. Sectors are intuitively an important thing to consider. An alternative approach would be to train a classifier for each sector.

I cleaned up the code so it works on both training and testing. Using sectors in this manner did not improve the performance of the classifier. I am going to try new factors, perhaps with better factors the sector data could be used to improve performance.


@Peter

But this repeated testing increases data-mining bias considerably. At some point the model will validate on the hold-out but could be a random result. A method to quantify significance due to multiple testing is also required, apart from hold-out performance.
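The point about quantifying significance under multiple testing can be made concrete with a minimal, illustrative adjustment. The numbers here are hypothetical; a Bonferroni-style correction is only one of several possible methods.

```python
# The more model variants tested against the same hold-out, the weaker any
# single "significant" result becomes. A Bonferroni-style correction scales
# the raw p-value by the number of trials. All numbers are hypothetical.
raw_p = 0.01            # apparent significance of the best variant found
n_variants_tested = 20  # how many models touched the same hold-out
bonferroni_p = min(1.0, raw_p * n_variants_tested)
print(bonferroni_p)     # no longer significant at the 5% level
```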

I wonder whether it is better to start from fundamental analysis rather than end with it.

Testing each and every factor is to begin with fundamental analysis and to let ML decide which of those factors has predictive ability. Michael Harris and others have pointed out the dangers of over-optimisation and of repeatedly testing new ML versions over the same data. This is the paradox of back testing; ultimately it is useless.

The determinants of price in the LONG term must surely be obvious: economic growth at the macro level and earnings growth at the individual corporate level. Coupled with a strong balance sheet.

In the long term these factors will win out and these companies will enter into and remain in the relevant big cap stock indices until their success falters and they begin the inevitable decline which is the fate of all things.

Anything else is mere noise. So perhaps ML ought to be secondary to one's judgement of what makes markets tick.

Unfortunately of course this is hardly a sexy approach for a hedge fund.

@Michael: Certainly that's true, although we have quite a bit of hold-out data here which dampens the severity. The other benefit is that everyday we accumulate new hold-out data. Like with algorithms, the proof is in paper-trading.

@Anthony: I definitely think intuition and statistics should complement each other. For example, a quality-earnings factor has some intuitive appeal, so I include it in the model. Then, when training the classifier, I find an inverted-U-shaped relationship between it and returns. That could lead to the insight that obviously bad earnings are bad, but perhaps overly positive reports are suspicious too and might try to mask a more fundamental problem. You seem to have gained a lot of market intuition over the years, which is highly valuable, so I encourage you to compile this knowledge into factors that ML / data-science people can play around with and try to optimally employ and combine.

@ Michael Harris -

You might have missed it, but @Grant talked about regression (similar to a probit model, maybe) and portfolio optimization, not classification.

You said that "In general, these give poor out-of-sample performance due to overfitting." and I agree. But this also holds for machine learning classification.

It is not entirely clear how classification is better than either simple regression or probit regression for large long/short portfolios in the absence of a study. At the end of the day, classification is also an optimization method.

My intuition is that if the conditions are right for ML to work, then multivariate polynomial regression for forecasting returns will also work, if applied correctly. If the underlying system is stable in time, all of my factor-response pairs are stationary, and all is copasetic, then I should be able to find an empirically-derived multi-factor response surface. There are areas of science and engineering where this works just fine; even if I don't have a scientific model (e.g. from Maxwell's equations in E&M), with the right experimental data sets, I can build a perfectly valid, predictive multi-factor empirical model. By analogy, in finance, if the underlying system is stable, then it should be predictable, given the right data sets.
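The multivariate polynomial regression analogy above can be sketched with scikit-learn. The two factors, the planted interaction term, and the ridge penalty are all assumptions made for illustration; this is not the notebook's actual model.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# When the underlying system is stable and the data stationary, an empirical
# degree-2 response surface is recoverable from factor-response pairs.
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 2))
y = (1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.8 * X[:, 0] * X[:, 1]
     + rng.normal(0.0, 0.1, size=500))

model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X[:250], y[:250])
r2 = model.score(X[250:], y[250:])  # out-of-sample R^2 on the held-out half
print(round(r2, 3))
```

Under these idealized, stationary conditions the held-out fit is excellent; the debate in this thread is precisely about whether market data ever satisfies those conditions.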

@Anthony,

Nice summary. It should be used as a guide for any future work.

The clash is between unique hypotheses versus hypotheses suggested by the data. I still do not understand whether the objective is asset allocation or short-term anomaly arbitrage. Using fundamental factors for the latter makes no sense.

Can anyone clarify the above? For example, the probability of AMZN selected as a longer-term holding based on fundamental factors is quite low.

On the other hand, as you said:

"Unfortunately of course this is hardly a sexy approach for a hedge fund."

except in the case that they will act as your sell side to test market strength with a small investment while leveraging the faded trade with a bigger investment.

@Thomas

"we have quite a bit of hold-out data here which dampens the severity."

This is true, i.e., a large hold-out increases the power of the test and minimizes Type I error. However, what is important is the actual sample of predictions, not the hold-out length. If that is a sufficient sample, then there is a chance. But a large hold-out limits the exposure of ML to a variety of market conditions. In my book I recommend doing away with the hold-out and using two distinct universes, one for learning and the other for testing, both spanning the same timeframes. The test universe should be chosen in advance and never changed, to minimize data-snooping. This certainly decreases the probability of p-hacking, but never to zero. There is always the possibility of a random fit.
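The two-universe split described above can be sketched in a few lines. The tickers and the seed are illustrative; the essential point is that the split is made once, in advance, and then frozen.

```python
import numpy as np

# Split the universe once into two fixed halves spanning the same timeframe:
# learn on one, test on the other, and never change the test half.
universe = np.array(["AAPL", "MSFT", "XOM", "PFE", "JPM", "GE", "T", "KO"])
rng = np.random.RandomState(7)        # fixed seed: chosen once, then frozen
perm = rng.permutation(len(universe))
half = len(universe) // 2
train_universe = set(universe[perm[:half]])
test_universe = set(universe[perm[half:]])
print(sorted(train_universe), sorted(test_universe))
```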

What are the objectives of Point72 in terms of trading timeframes? Intraday, short-term, medium-term or longer-term? That can make a huge difference in the approach.

Best.

As I've mentioned above, my sense is that the timeframe is largely driven by the exclusive use of daily OHLCV bars in conjunction with pipeline. The game plan, as I see it, is to see how far that will take us, without additional development and operational costs.

@Michael: What I never understood about the two-universe hold-out method is how correlations do not invalidate it. If correlations are high and we are in a bear-market, the classifier will easily predict all stocks to go down and get a high score on the other 50%. It is also much more complex to find an equal split across industries etc. It also just seems more detached from what we're actually trying to measure (performance going forward, rather than performance if we knew half of the universe's returns).

@Thomas: All valid points. There is a trade-off between determining the integrity of the ML process and reducing universe size with the two universe hold-out method.

The problem is - and as far as I know it is a misunderstood issue - that validation on a hold-out is not enough to determine the integrity of the ML process, due to multiple comparisons. For example, for the simple single-ETF strategies I develop with my ML program, I use validation on an anti-correlated security. The result is used to determine whether the predictions made by ML are merely suggested by the data or stem from more fundamental factors.

With the two-universe hold-out method the power of the test increases significantly, and this is what can help in minimizing Type I error. Your point about correlation is valid, and this is why the two universes should have sufficient history and include a wide variety of market conditions. I agree there are issues and your points are valid, but in my opinion the single hold-out method is doomed in the first place unless you intend to run the ML algo only a few times and not make many changes to features, etc., motivated by knowledge of performance on the hold-out, which is data-snooping.

Actually, as I have written in one of my articles, analyzing the integrity of the ML process is the true edge, not the predictions.

I agree with Anthony's point of view. Fundamental data on a day-to-day basis has very little predictive power. And, as expressed by Grant, a multivariate polynomial regression is the same thing as using multiple factors.

All real life trading decisions have to be taken at the right edge of a chart.

You can't decide backwards. Looking at past data can only be useful to validate a concept, to show that indeed using such and such trading methods would have produced something over a particular data set over a particular period in time.

The main idea has always been to extend to the future what has been found to work in the past. However, if I manipulate the data in the past just to show good results (over optimize), I should not expect the market to comply with my “better” trading methodology going forward.

That is the problem, at the right edge of a chart, the future remains this stochastic sea of stock variances generating this buzzing cloud of price variations where a quasi random walk infrastructure might prevail. And it might be quite independent of its past.

Maybe what best describes what I am saying is the following chart:

In the beginning, fundamental data is a lesser part of price variations while over the long term, it is what will have dominated the scene. Without some positive fundamental data to support a stock over the interval, a stock might not even survive to appreciate the notion of long term.

@ Thomas -

One problem I see here is that you are trying to show that everything works on real data from the get-go. Your methodology would benefit from synthesizing stock data with known characteristics. Then, you can see if the known inputs generate the expected output. Then, you can take the next step, and say "If the real-world input data have characteristics similar to my synthesized data, then my tool will work" and proceed to try real data.

As Michael Harris touches on above, you first need to prove "the integrity of the ML process" and then give it a go on real-world data. For example, if you were developing a ML system to recognize faces, you might use some 3D rendering software to synthesize a set of faces for training, and then see if you could extract known features, before moving on to applying your ML system on actual faces.

Is this kinda thing done in the quant world?
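The kind of check described above can be sketched with fully synthetic data: plant a known predictive factor next to a pure-noise factor and confirm the tool recovers the planted input. Everything here is synthetic by construction; the 0.3 noise level and logistic model are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Plant one genuinely predictive factor and one noise factor, then verify the
# classifier assigns weight to the planted input and predicts out-of-sample.
rng = np.random.RandomState(0)
n = 2000
signal = rng.normal(size=n)   # the planted, genuinely predictive factor
noise = rng.normal(size=n)    # a factor with no relationship to the label
up = (signal + 0.3 * rng.normal(size=n)) > 0   # planted up/down rule

X = np.column_stack([signal, noise])
clf = LogisticRegression().fit(X[:1000], up[:1000])
acc = clf.score(X[1000:], up[1000:])
w_signal, w_noise = clf.coef_[0]
print(round(acc, 2), abs(w_signal) > abs(w_noise))
```

If the tool fails even on data where the answer is known, there is no reason to trust it on real prices, which is exactly the "integrity of the ML process" argument.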

@ Grant -

Here's an example of someone doing something kind of similar to what you're describing in terms of using synthetic data series:

One of the challenges I devised was to create data sets in which real and synthetic stock series were mixed together and given to the system to evaluate. To the human eye (or analyst's spreadsheet), the synthetic series were indistinguishable from the real thing. But, in fact, I had "planted" some patterns within the processes of the synthetic stocks that made them perform differently from their real-life counterparts. Some of the patterns I created were quite simple, such as introducing a drift component. But other patterns were more nuanced, for example, using a fractal Brownian motion generator to induce long memory in the stock volatility process.

It was when I saw the system detect and exploit the patterns buried deep within the synthetic series to create sensible, profitable strategies that I began to pay attention. A short time thereafter Haftan and I joined forces to create what became the Proteom Fund.

It should be noted that in this case he deliberately throws in patterns that are different from the patterns in the actual stock data, rather than making them as similar as possible. Figured you still might be interested though.
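The simplest planted pattern mentioned in the quote, a drift component, can be sketched as follows. The volatility, drift size, and horizon are illustrative assumptions, not parameters from the quoted work.

```python
import numpy as np

# Plant a drift in a synthetic daily-return series and show that a simple
# t-statistic on the mean return exposes it over a long enough sample.
rng = np.random.RandomState(1)
n_days, daily_vol, drift = 2520, 0.02, 0.002   # ~10 years of daily returns

planted = drift + rng.normal(0.0, daily_vol, size=n_days)

t_stat = planted.mean() / (planted.std(ddof=1) / np.sqrt(n_days))
print(round(t_stat, 1))
```

A system that reliably flags the planted series but not driftless ones demonstrates it can find structure that is actually there, which is the point of the exercise.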

@Graham,

So author of the article sees his system fitting to random drift and fractal Brownian motion and he thinks he's got something. There is no proof from his results that the system did not fit but instead identified some genuine patterns. Double blind out-of-sample does not solve this problem at all. He was probably fooled by noise and genetic programming.

Whatever happened to the Proteom fund? Website is not operational: www.proteomcapital.com

It would be nice to see the performance for a reality check. Maybe they made it big, maybe not.

@ David -

I just shared that as an example somewhat along the lines of what Grant was describing in terms of using synthetic data. I can't speak to the effectiveness of the author's use of it; the topic is beyond my capabilities, so I can't comment personally. However, I've read some of his research via his blog for a while - http://www.jonathankinlay.com - and he seems to know his way around a trading model, so I tend to take notice whenever he writes about something. Obviously, as you pointed out, one can never know without access to his historical returns, which I also don't have. I recommend you read some more of his writings and decide for yourself.

@Graham,

The high Sharpe ratios in excess of 5 indicate high probability of over-fitting to noise. The inclusion of a drift component in synthetic data is possibly responsible for that. And thanks but no - I will not be spending any time in a blog like that. The author gives the answer about the quality of his work himself: "Even so, given the enormous number of models evaluated, there remains a significant risk of over-fitting."

Note that the approach followed in that article is quite different from the sound ML approach of Quantopian with models based on classifiers of economically sound factors. It is hard to incorporate these factors in synthetic data and for this reason the data are useless for the purpose of this modeling approach here.

I disagree with some posters who have questioned the integrity of the ML process followed by Quantopian. This is a valid process based on classifying features that have economic value. Synthetic data and two universe hold-out are not required. All that is needed is forward testing of about three months and then go live if it works.

All that is needed is forward testing of about three months and then go live if it works.

For weekly trading, that's only ~ 12 trades (sets of orders). Doesn't seem like enough data.

@Grant: 12 * n_stocks, and it's not unreasonable to assume n_stocks > 200, so at least 2400 data points, which is not nothing.

@ Thomas -

I was thinking in terms of the number of times the portfolio is rejiggered. There will be a return for each rejiggering (week-to-week). I end up with a timeline of 12 rejiggers, which in my mind is still in the realm of small-number statistics. I guess it all depends on how quiet the backtest data are. If I end up with (1+r)^2 returns, and the trend continues, then o.k. But in all likelihood, the backtest will have zig-zags over any 12 week period. So, it'll be hard to say if the overall trend is continuing, or if the dreaded over-fitting may be at play.

I don't see how the number of stocks matters, but then all of this is still sinking in. Unless you are thinking that lots of stocks will smooth out returns, making it easier to detect a change in slope, at the transition from in-sample to out-of-sample?

Thank you Thomas for this awesome piece of work! This notebook and also the one where you put all into one factor are a great way to easily apply all machine learning ideas on Quantopian.

I thought this approach should be close to the partner data set PreCog Top 100 Securities, so I took your notebook and included it as an additional factor. It didn't change the result at all (still a little under 50%). In the predictor importance it ranked only mid-field (see predicted_five_day_log_return).

Do you think this data set is worth its $100?

I also did the alphalens analysis on it.


For weekly trading, that's only ~ 12 trades (sets of orders). Doesn't seem like enough data.

12 * n_stocks, and it's not unreasonable to assume n_stocks > 200, so at least 2400 data points, which is not nothing.

@Grant @Thomas

The larger the number of data points and trading days, the more confident you can obviously be.
But I would add another point here which I don't think has been mentioned: if after 3 months of live trading your backtest closely tracks your real account activity, then you can be fairly confident that you are on the right track. If you made a bunch of money in those 3 months but your backtest doesn't track your live activity well, you need to rethink what it is you are doing.

@Thomas
Thank you Thomas. Compared with AdaBoost, what do you think of hidden Markov models or support vector machines with the fundamental data?

Any ideas why I keep getting import errors for sklearn modules?

"InputRejected: Importing SelectFromModel from sklearn.feature_selection raised an ImportError. Did you mean to import SelectFdr from sklearn.feature_selection?"

I've also had problems with the below:

sklearn.model_selection.KFold
sklearn.model_selection.GroupKFold
sklearn.model_selection.StratifiedKFold

@Leigh: It seems we haven't whitelisted the new namespace. In the meantime you can still get them via:
from sklearn.cross_validation import StratifiedKFold

@Thomas - Running into the same issue with sklearn.model_selection (trying to use sklearn.model_selection.GridSearchCV ) - would there be an ETA for whitelisting the namespace?

(For now, using sklearn.grid_search.GridSearchCV which is going to be deprecated)

Many thanks!
boris

It will definitely be fixed when we upgrade sklearn next so not to worry about the deprecation warning.

Check out Part 3 of how to turn this workflow into an actual strategy: https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm

Dear Thomas,

Yesterday Jonathan Larkin gave a talk to Imperial students at Stevens Institute of Technology. I am one of the students from the group. He presented an "L1" risk layer, which we would like to implement in our algorithms. It would be great if you could direct us to the relevant topic or piece of code.

Kind Regards,

Raid

In trying to understand this notebook a bit better, as well as wrap my head around the implications of rank vs not ranked factors, I found that I'm able to remove the .rank(mask=universe) from the notebook research environment, however when I tried to do this within an Algorithm it resulted in an error: NonWindowSafeInput: Can't compute windowed expression

I suspect this is caused by the ML factor calling f() on the previously declared factors, however I'm still getting my feet wet with understanding how they play together. I understand why it's important for the algo to be "window_safe"; however, I'm not sure how to go about ensuring that each of the factors declared within make_factors actually is window-safe.

You're right, that's a nice side-effect of ranking that I hadn't thought about before. The problem can be seen as follows: if one of our factors is just the price of the asset and there is a split, the price would halve inside of your window causing the classifier to be confused. I'm not quite sure what one would have to do to ensure everything is window-safe otherwise. Maybe Scotty can advise on which operations make an input window-safe.

But doesn't the function need to run before it can compute the rank in the first place? If that's the case it seems like we should be able to get the pre-ranked values out.

It looks like a workaround might be to feed .top() the full list of stocks. For example, if you're using the Q500US then something like this avoids the error; however, I don't yet know if it also adjusts for splits.

    # Instantiate factors using top() to avoid the NonWindowSafeInput error
    for name, f in factors.iteritems():
        ...  # loop body elided in the original post


@Pumplerod

If you really know what you are doing you can remove the window_safe exception like this:

    for name, f in factors.iteritems():
        factor = f()
        # some factors don't have window_safe=True even if they should
        factor.window_safe = True
        factors_pipe[name] = factor
    [...]


Please make sure you understand the implications of bypassing the NonWindowSafeInput exception, or you might run into trouble in the future.

By the way, did you know that you can use the zscore function instead of rank?

By the way, here is what David Michalowicz said about using factors as inputs to other factors:

"[...] is now allowed for a selected few factors deemed safe for use as inputs. This includes Returns and any factors created from rank or zscore. The main reason that these factors can be used as inputs is that they are comparable across splits. Returns, rank and zscore produce normalized values, meaning that they can be meaningfully compared in any context."
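The "comparable across splits" claim in the quote can be sketched numerically. The prices here are illustrative; the point is that a split re-adjusts an asset's entire historical price series by a constant factor, which leaves returns unchanged.

```python
import numpy as np

# A 2:1 split re-scales the whole historical price series by 0.5 when viewed
# from a later date. Raw prices therefore change under a windowed factor's
# eyes, but daily returns do not, which is the intuition for why Returns
# (and per-day rank/zscore outputs) are safe as windowed inputs.
prices = np.array([100.0, 102.0, 101.0, 104.0])
adjusted = prices * 0.5                    # same history seen after the split

returns_before = np.diff(prices) / prices[:-1]
returns_after = np.diff(adjusted) / adjusted[:-1]
print(np.allclose(returns_before, returns_after))
```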

That's great, thanks for digging that up. Z-scoring should be done in any case, so if it also solves the window_safe issue it's the preferred way if you don't want to rank, Pumplerod. Please report back if that improved the prediction.

Hi to all. I am writing a dissertation on the idea that feature extraction (PCA) can improve classification metrics.
I changed the following:
1. added Momentum as a factor
2. added several trailing returns as factors
3. divided the period into chunks to run the pipeline over a longer period
4. used zscore instead of rank
5. used a random forest instead of AdaBoost
6. varied the number of PCA components and compared results
7. tried 5 PCA components separately for different sectors

What I see now is that we get different results for different sectors, and I think that is the main idea!
One would need to create a machine learning algorithm for each sector, or even for each company.
Possibly the different results are due to some bias.
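The PCA-plus-random-forest setup in steps 4-6 can be sketched as a scikit-learn pipeline. The synthetic factor matrix, its low-rank structure, and all the sizes below are assumptions made so that a handful of principal components can carry the signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Synthetic data: 20 observed factors generated from 3 hidden drivers, with
# the up/down label tied to one driver, so PCA can extract the signal.
rng = np.random.RandomState(0)
latent = rng.normal(size=(1000, 3))                        # hidden drivers
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + 0.1 * rng.normal(size=(1000, 20))  # observed factors
y = latent[:, 0] > 0                                       # label: one driver

model = make_pipeline(PCA(n_components=5),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X[:800], y[:800])
acc = model.score(X[800:], y[800:])
print(round(acc, 2))
```

Fitting one such pipeline per sector, as suggested above, is just a matter of subsetting the rows by sector code before training.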


@Vladislav: This is a very comprehensive study and your findings on sector-differences are highly intriguing. Thanks for sharing this!

I am new to Quantopian. Can I use Keras with a TensorFlow backend and sklearn on Quantopian?

Does Quantopian allow users to install packages or load their own packages?

@Thomas,

Thanks for the nice post. I see Sheng Wang's name in the notebook, so I roughly get where your original 30-40-30 split for the 1/-1 labels came from :-) I have read his original reports and tried to duplicate his results, without that kind of success, using my local data.

I do have some general questions especially about the data part.
1. In your notebook you get the QS1500 stocks and calculate the returns for the past 5 days, then shift the returns to align them as future returns. This step could introduce some bias: when you construct the Xs and Ys at time t, you are using stocks guaranteed to have a close price at time t+5 and to be in the QS1500 at time t+5. This seemingly small look-ahead bias brought huge performance differences when I was testing Sheng's original monthly schemes.
2. Don't you think a window length of 1 year is too small? At least we should include some minor pullback (10%) of the SPY to make the ML able to learn about the relative performance of stocks around the turning point so the underlying model may not be totally dominated by momentum type of behavior for instance.
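The label-construction step discussed in point 1 can be sketched in pandas. The tiny price panel and ticker names are illustrative; the mechanics of `pct_change(5).shift(-5)` are the point.

```python
import numpy as np
import pandas as pd

# A trailing 5-day return shifted back by 5 days becomes the forward-return
# label used for training. Tickers and prices are made up for illustration.
prices = pd.DataFrame(
    100.0 + np.arange(12)[:, None] + np.array([0.0, 5.0]),
    index=pd.date_range("2016-01-04", periods=12, freq="B"),
    columns=["AAA", "BBB"],
)
ret_5d = prices.pct_change(5)   # trailing 5-day return, known at time t
fwd_5d = ret_5d.shift(-5)       # label: return over (t, t+5]

# The last 5 rows are NaN: exactly the dates where requiring a known t+5
# price drops the newest observations and can introduce look-ahead bias.
print(fwd_5d.tail(5).isnull().all().all())
```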

Thanks.

@qi chen: Yes, we've been greatly inspired by the Sheng Wang paper. Good catch :)

1. Good point. We should probably add back in those that didn't have prices at t+5.
2. Potentially. It's been difficult to make it scale but now it seems like the platform is fast enough. The current bottleneck is the actual training of the classifier, but with a linear SGD we could probably run longer windows. How do you mean "include some minor pullback of the SPY"?

@Thomas Wiecki

I meant that instead of using a 1-2 year window, which may not see any SPY corrections, we should use a longer window (~5 years) that includes a market correction, to make sure the training sample has experienced both cycles. Otherwise, if the ML algo is smart enough, using only the past year's data it may just give you a buy-the-dip model.

@Qi,

I agree with you; most ML algos use long training periods to capture price dynamics meaningfully. Unfortunately, under the Q framework there is a time limitation on computational resources (I think it is 5 minutes, if I'm not mistaken). So any attempt at ML within this time limitation is, in my opinion, an exercise in futility.

@qi chen, @James Villa

I believe that the training period length is related to how far into the future the ML is predicting asset performance. If you are using ML to predict tomorrow's prices you probably won't need 5 years of data, but if you would like to predict next month's price you will likely need a long training period. We face a trade-off with ML: on one side we want a long training period to avoid overfitting, but at the same time we want a short training period so that ML can predict prices for the current market regime. As we keep updating the ML model on a rolling basis, we can take advantage of that and focus on transient market features that would not be recognized using a long training period.
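The rolling re-training idea above can be sketched as a walk-forward loop. The window and step sizes, the SGD classifier, and the synthetic data are all arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Re-fit on a trailing window at each step so the model tracks the current
# regime; score each fit on the period immediately following its window.
rng = np.random.RandomState(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.2 * rng.normal(size=600)) > 0   # synthetic up/down labels

window, step, scores = 250, 50, []
for start in range(0, len(X) - window - step + 1, step):
    train = slice(start, start + window)
    test = slice(start + window, start + window + step)
    clf = SGDClassifier(random_state=0).fit(X[train], y[train])
    scores.append(clf.score(X[test], y[test]))
print(len(scores), round(float(np.mean(scores)), 2))
```

Shrinking `window` trades statistical power for responsiveness to the current regime, which is exactly the trade-off described above.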

@Luca,

As we keep updating the ML model on a rolling basis, we can take advantage of that and focus on transient market features that would not be recognized using a long training period.

Sorry to disagree with your statement above. On the contrary, these transient market features, especially black swan events like the 2008 mortgage crisis and subsequent recession, will be recognized and may give you ample warning, if 2008 data was included in training and the ML model was trained well. Under your scenario, say you trained the ML algo only on the last two years of an upward, bullish market; if a sudden black swan event occurred, your algo would not be able to react because it has not seen or been trained for such an event, and you would be wiped out! How can your model give you a meaningful prediction on data or events it has never seen before? Does that make sense?

@James Villa

I understand your point and it might be correct, but it depends on what you are trying to predict, how far into the future, and what features you are using. As I said earlier, we have a trade-off here.

Sorry to ask a stupid question, but I was wondering whether any of you wizards has tried this in real markets and gained something. I ask this from a retail solo investor/trader point of view.

Is predicting 5-day returns appropriate here? Fundamental data is updated very infrequently, so should we be testing its ability to predict longer-term returns?

Also- can someone provide the link to the Sheng Wang paper?

@Zak: A valid point. Treat the algo more as a template, and definitely experiment with longer time horizons.

Unfortunately the Sheng Wang paper is not public.

Very useful guide on getting started making your own projects and strategies. It helped a lot in clarifying some things about using this library.