Tackling overfitting via cross-validation over quarters

Overfitting is probably the biggest potential pitfall in algorithmic trading. You work on a factor or algorithm that looks pretty good, and you have some ideas to improve it so that it looks even better. Excitedly, you turn the algorithm on, only to be dismayed when the out-of-sample performance doesn't look nearly as good as your backtest. We at Quantopian certainly observe this pattern a lot when evaluating algorithms for inclusion in the fund. If you want your algorithm to do well in the contest, you need to be very careful in this regard.

In this notebook you will learn a simple technique to detect whether you have overfit your factor, by splitting your data into odd and even quarters.
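Since the notebook preview is not shown here, the following is a minimal sketch of the odd/even quarter split, not the notebook's actual code. The function name `split_by_quarter_parity` and the toy `factor` frame are hypothetical, and it assumes "odd" means calendar quarters Q1 and Q3; the notebook's own labelling of odd and even may differ.

```python
import numpy as np
import pandas as pd


def split_by_quarter_parity(data):
    """Split a date-indexed frame into (odd-quarter rows, even-quarter rows).

    Assumption: the first index level holds the dates, and "odd" means
    calendar quarters Q1 and Q3 (Jan-Mar, Jul-Sep).
    """
    dates = data.index.get_level_values(0)
    is_odd = np.asarray(dates.quarter % 2 == 1)
    return data[is_odd], data[~is_odd]


# Hypothetical stand-in for real factor data: one value per business day.
dates = pd.date_range("2016-01-01", "2017-12-31", freq="B")
factor = pd.DataFrame({"factor": np.arange(len(dates), dtype=float)}, index=dates)

odd_quarters, even_quarters = split_by_quarter_parity(factor)
print(len(odd_quarters), len(even_quarters))
```

You would develop and tweak the factor on one half only, and look at the other half once at the end.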


21 responses

@Thomas, Thanks for your work on this, it helps!

I looked this over a few times, and while I think I understand the concepts and the code, I'm not sure I understand the results...basically why the results are so different IN THIS CASE for the train vs the test...so here are some questions:

1. Holding out every other quarter is a potentially biased decision itself, as it could introduce some time-of-year bias (e.g. Even (Jan-Mar, July-Sept) vs. Odd (Apr-June, Oct-Dec)). So is the difference due to time-of-year bias?

2. OTOH...the two Quantile statistics tables are close, whereas the Returns Analysis metrics are not...what's up with that?

Even Quantiles Statistics  
factor_quantile        min        max       mean       std  count    count %  
1                -1.730268  -0.325803  -1.203210  0.267923  36123  20.019730  
2                -1.025159   0.279893  -0.430024  0.295982  36072  19.991465  
3                -0.415741   0.960416   0.336761  0.289696  36064  19.987031  
4                 0.371067   1.353257   0.993832  0.176791  36072  19.991465  
5                 1.063428   1.731804   1.474874  0.140226  36106  20.010308

Even Returns Analysis  
10D  
Ann. alpha  0.059  
beta    -0.113  
Mean Period Wise Return Top Quantile (bps)  14.221  
Mean Period Wise Return Bottom Quantile (bps)   -27.574  
Mean Period Wise Spread (bps)   41.794

Odd Quantiles Statistics  
factor_quantile        min        max       mean       std  count    count %  
1                -1.729265  -0.360678  -1.210830  0.260762  35387  20.023086  
2                -1.059931   0.304455  -0.452029  0.292576  35328  19.989702  
3                -0.439530   0.929758   0.318451  0.292419  35326  19.988570  
4                 0.288471   1.349952   0.983764  0.190968  35328  19.989702  
5                 1.074236   1.731799   1.477151  0.141219  35362  20.008940

Odd Returns Analysis  
10D  
Ann. alpha  -0.008  
beta    -0.081  
Mean Period Wise Return Top Quantile (bps)  -7.183  
Mean Period Wise Return Bottom Quantile (bps)   13.801  
Mean Period Wise Spread (bps)   -20.985  

3. Bottom line, I don't know at this point whether the factor is flawed, as you suggest, or there is an error in the way I'm looking at it.
Certainly, I can try to split the train-test sets temporally using something other than odd/even quarters and see what happens (e.g. random quarters)...or I can try to explain why the statistics for a simple factor like momentum are so non-continuous...any ideas gladly taken!

alan

Can someone also demonstrate in backtest form?

Hi Thomas,

I think this is a novel idea and quite a stringent stress test of overfitting, so I thought I'd give it a test drive with an algo I have been working on lately. I cloned your notebook, kept the same dates, and replaced the simple momentum factor with my alpha combination factor. Of course, I wouldn't want to give away my secret sauce, so I have just cut and pasted the relevant stats:

TRAINING

Quantiles Statistics  
factor_quantile        min        max       mean       std  count    count %  
1                -2.726328  -0.827215  -1.470242  0.434959  34728  20.035539  
2                -0.944471  -0.161090  -0.534089  0.188538  34643  19.986500  
3                -0.281707   0.387344   0.064343  0.163708  34656  19.994000  
4                 0.274276   0.933425   0.609198  0.159646  34638  19.983615  
5                 0.844164   2.635445   1.333821  0.343627  34667  20.000346  
Returns Analysis  
10D  
Ann. alpha  0.061  
beta    -0.229  
Mean Period Wise Return Top Quantile (bps)  6.621  
Mean Period Wise Return Bottom Quantile (bps)   -20.563  
Mean Period Wise Spread (bps)   27.184  
Information Analysis  
10D  
IC Mean 0.020  
IC Std. 0.112  
Risk-Adjusted IC    0.178  
t-stat(IC)  1.720  
p-value(IC) 0.089  
IC Skew 0.135  
IC Kurtosis -0.087  

TEST

Quantiles Statistics  
factor_quantile        min        max       mean       std  count    count %  
1                -2.725484  -0.791198  -1.465107  0.425436  34006  20.029096  
2                -0.932438  -0.143606  -0.539442  0.188026  33952  19.997291  
3                -0.296662   0.372270   0.058329  0.165776  33933  19.986100  
4                 0.289199   0.920854   0.607867  0.159483  33949  19.995524  
5                 0.854588   2.642809   1.341129  0.345461  33943  19.991990  
Returns Analysis  
10D  
Ann. alpha  0.021  
beta    -0.217  
Mean Period Wise Return Top Quantile (bps)  -5.305  
Mean Period Wise Return Bottom Quantile (bps)   4.986  
Mean Period Wise Spread (bps)   -10.292  
Information Analysis  
10D  
IC Mean 0.004  
IC Std. 0.123  
Risk-Adjusted IC    0.033  
t-stat(IC)  0.312  
p-value(IC) 0.755  
IC Skew 0.188  
IC Kurtosis -0.541  
Turnover Analysis  
10D  
Quantile 1 Mean Turnover    0.281  
Quantile 2 Mean Turnover    0.435  
Quantile 3 Mean Turnover    0.481  
Quantile 4 Mean Turnover    0.459  
Quantile 5 Mean Turnover    0.319  
10D  
Mean Factor Rank Autocorrelation    0.762  

I am not a big user of AlphaLens but understand the concepts and the stats. It seems to me that the results show a case of not overfitting, at least on 10-day forward returns. I would appreciate your comments and feedback, thanks.

It would be of interest to know how Q's low beta, low vol market neutral algos have weathered the minor storm of the past few weeks. The FT reports some fairly poor performance from Long / Short hedge funds in general. It would be nice to think that all your hard work has paid off and that your shorts balanced out your longs, or at least that you have suffered a smaller decline than the market.

Equally, I assume Q will have at least run its algos against the 2007/8 crisis.

In a market crash, what exactly does "market neutral" entail?

Let me clarify: Thomas, if the market were to decline 50% over the next two months, how would you hope your fund would perform? Not because I have any view on the matter but merely because I am putting together one or two algos based on fundamentals which I may or may not submit. Out of interest I will run them through the 2007/8 period. How should I expect / hope my algo would behave?

How would YOU wish such an algo to behave in such conditions?

Cross validation over quarters may need to spread its wings rather more widely than looking at the past two years of bull markets?

Incidentally, the platform seems to have come on in leaps and bounds since I last took a serious look at it. Zipline and its related software are indeed a formidable weapon. Your pipeline tutorials are excellent, a far remove from the days when pipeline was first introduced.

Hi @Thomas,

Really great NB - thank you for putting it together and sharing it. Will help me quite a bit in my research.

I have a few questions as well:

  1. Would training on even quarters every year be prone to 'over-training' on seasonal trends (e.g. sell-in-May, the January effect, etc), and if so, is there a better way of dividing up the training and testing sets?

  2. I 'tweaked' your simple_momentum factor (quite a bit) to get the below figures on the 'training set' (all else in the NB is the same); I haven't run the 'testing set' yet.

Returns Analysis  
10D  
Ann. alpha  0.055  
beta    -0.032  
Mean Period Wise Return Top Quantile (bps)  18.806  
Mean Period Wise Return Bottom Quantile (bps)   -20.839  
Mean Period Wise Spread (bps)   39.645

Information Analysis  
10D  
IC Mean 0.026  
IC Std. 0.080  
Risk-Adjusted IC    0.320  
t-stat(IC)  3.084  
p-value(IC) 0.003  
IC Skew 0.232  
IC Kurtosis 0.682  

Before I run and look at the 'testing set' output, how much 'variance tolerance' should I accept before I call it 'overfit'? For example, if I set my p-value cutoff on the 'testing set' to <0.05, I may say that this factor model is 'overfit' if the p-value is above 0.05 on the test set, correct?

However, if I do get a p-value of below 0.05 on the test set, how much variance in the figures below should I 'tolerate' before I call the model 'overfit'?

  • Mean IC
  • Ann. Alpha
  • Mean Period Wise Spread (bps)

@Alan:

  1. I'll answer that below with Joakim's question.
  2. The quantile statistics are not what you want to look at there. Those just tell you how much data is in your quantiles and what range the quantiles cover, not what is actually happening to the stock returns. The way I constructed my factor here (ranking and then z-scoring), the quantile stats are pretty meaningless because I know that by construction I will get uniform quantiles over a specific range. The things to look at instead are the performance metrics like alpha, which is positive for even but negative for odd quarters. Same for mean IC. The mean returns of the quantiles also go nicely from negative to positive for even quarters but are all over the place for odd quarters.
  3. As outlined in 2, it is bad: you want a factor that works rather evenly across time. Although that is very hard in general, if you find that your factor only works on the time-period you looked at but not on your testing set, it's a pretty strong sign you overfit.
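On point 2, here is a minimal sketch of why the quantile statistics say nothing about performance for a ranked and z-scored factor. The random `raw` values are a hypothetical stand-in for one day of momentum scores, and the exact ranking and z-scoring calls are assumptions about how such a factor might be built, not the notebook's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.Series(rng.standard_normal(500))  # hypothetical raw scores for 500 stocks

# Rank, then z-score the ranks (population standard deviation).
ranked = raw.rank()
zscored = (ranked - ranked.mean()) / ranked.std(ddof=0)

# Bucket into five quantiles and summarise, similar in spirit to the tables above.
quantile = pd.qcut(zscored, 5, labels=False) + 1
summary = zscored.groupby(quantile).agg(["min", "max", "mean", "std", "count"])
print(summary)
print(zscored.min(), zscored.max(), np.sqrt(3))
```

Whatever the raw scores are, the ranks are evenly spaced, so every quantile gets the same count and the z-scores are bounded by roughly plus or minus the square root of 3 (about 1.73). That matches the min and max in Alan's tables for both even and odd quarters, which is why those tables look alike even though the returns differ.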

@Blue Seahawk: [How would one do that in a backtest?]

I don't think it's currently possible to do this in a backtest, unfortunately. However, I would treat the backtest just as a last step where you already know your factor works from the research env and alphalens. The backtest then is mainly to make sure it's not killed by turnover and doesn't have other undesirable properties like high exposures. Ideally you'd spend 90% of your time in research designing the factor and then, when you're done, run a single backtest to make sure it also works when actually placing trades. If you don't want to do that, you can just leave out the last 2 years when you run backtests and only test that time-period once at the very end. Personally I'm quite excited about the aspect that the FactSet fundamentals enforce a 1-year hold-out period we can use to evaluate the strategy over.

@James Villa: [Tried it on his own factor]
Thanks for trying this out! I think you already know this and just wanted to test it out, which is great, but I'll say it anyway: I guess you developed that factor beforehand, which renders this test less meaningful because you already tweaked it on the test set, and a test set can only be used once. Even then, though, the factor doesn't seem to be significant (p-value) in either period. The mean returns of the top and bottom quantiles also seem to flip between train and test, which is not a good sign either.

@Zenothestoic: [Does being market neutral pay off for corrections like we just experienced?]

This is definitely a market period for which market-neutral was developed. In theory, the market tanks but because you are well hedged your portfolio should not be influenced. Those periods can even be especially lucrative because that's when opportunities open up. So I would hope that our fund would do especially well if the market were to drop 50%. Unfortunately, it seems that for most funds that hasn't been true in this most recent correction: https://twitter.com/robinwigg/status/1055802622739968000?s=21

Thanks for the feedback, I agree that the tools have improved a lot, and even more good stuff is coming :).

@Joakim

Would training on even quarters every year be prone to 'over-training' on seasonal trends (e.g. sell-in-May, the January effect, etc), and if so, is there a better way of dividing up the training and testing sets?

This is similar to Alan's question above. I think the key thing to ask there is whether you designed your factor to exploit any of these effects. If it does exploit seasonal patterns, then this type of testing might not be the right choice. However, if that's not the case, there is no good reason it should behave in this way. Probabilistically speaking, the probability that the factor is overfit is much higher in that case. Finally, would you even want a factor that only works in these certain time-periods? We certainly wouldn't. Having said all that, you could also sample your quarters randomly if it's a concern, or flip every year from even to odd.
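The two alternatives mentioned at the end could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `quarter_id`, `random_quarter_mask` and `alternating_quarter_mask` are hypothetical, and each returns a boolean mask over the dates where True means "training set".

```python
import numpy as np
import pandas as pd


def quarter_id(dates):
    """One consecutive integer per calendar quarter (year * 4 + quarter - 1)."""
    return np.asarray(dates.year * 4 + (dates.quarter - 1))


def random_quarter_mask(dates, train_fraction=0.5, seed=0):
    """Assign whole quarters to the training set at random."""
    qid = quarter_id(dates)
    unique = np.unique(qid)
    rng = np.random.default_rng(seed)
    n_train = int(round(len(unique) * train_fraction))
    train_quarters = rng.choice(unique, size=n_train, replace=False)
    return np.isin(qid, train_quarters)


def alternating_quarter_mask(dates):
    """Train on Q1/Q3 in even-numbered years and on Q2/Q4 in odd-numbered years."""
    year = np.asarray(dates.year)
    quarter = np.asarray(dates.quarter)
    return (quarter + year) % 2 == 1


dates = pd.date_range("2016-01-01", "2017-12-31", freq="B")
random_train = random_quarter_mask(dates)
flipped_train = alternating_quarter_mask(dates)
print(random_train.sum(), flipped_train.sum())
```

Flipping every year means each calendar quarter lands in the training set in one year and in the test set in the next, so a seasonal effect cannot hide in only one of the two sets.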

I 'tweaked' your simple_momentum factor (quite a bit) to get the below figures on the 'training set' (all else in the NB is the same); I haven't run the 'testing set' yet.

Thank you, that is excellent. Yes, this looks like it could be a great factor just from looking at the stats, which makes it an even better example. I like your proposal of also looking for p < .05 for the test set performance.

@all

One question a few of you touch on is how to actually say one way or the other. Certainly if the factor performance goes from positive to negative between train and test it should be rather obvious, but what if it's not as clear cut? What if it's still positive but maybe not quite as much? This is something I haven't done too much thinking / experimenting with yet, so I hope I will have a better answer at some point. For now, probably the simplest thing which sounds very reasonable is to require the p-value to be < 0.05 in both periods, as suggested by Joakim.
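As a rough sketch of that rule, assuming the reported t-stat(IC) and p-value(IC) come from a one-sample t-test of the per-period IC values against zero. The helper names and the two IC series are hypothetical; the sign check is an extra condition added here for illustration.

```python
import numpy as np
from scipy import stats


def ic_summary(ic_values):
    """Mean, std, t-stat and p-value of a series of per-period IC values."""
    ic = np.asarray(ic_values, dtype=float)
    t_stat, p_value = stats.ttest_1samp(ic, 0.0)
    return {"mean": ic.mean(), "std": ic.std(ddof=1),
            "t_stat": float(t_stat), "p_value": float(p_value)}


def passes_both(train_ic, test_ic, alpha=0.05):
    """True only if the IC is significant in both sets and keeps its sign."""
    train, test = ic_summary(train_ic), ic_summary(test_ic)
    same_sign = np.sign(train["mean"]) == np.sign(test["mean"])
    significant = train["p_value"] < alpha and test["p_value"] < alpha
    return bool(same_sign and significant)


# Hypothetical IC series: same spread, but only the first has a nonzero mean.
spread = np.linspace(-0.1, 0.1, 90)
train_ic = spread + 0.03
test_ic = spread

print(ic_summary(train_ic))
print(ic_summary(test_ic))
print(passes_both(train_ic, test_ic))
```

The important part is to fix `alpha` (or any other threshold) before looking at the test set.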

Hi Thomas,

Thanks for your feedback. As I mentioned, I'm not a big user of AlphaLens, mainly because metrics like t-stats and p-values are based on the assumptions that the time series are stationary and normally distributed. For financial time series, I don't think these assumptions hold true, and that is why I really don't pay much attention to those metrics. The novelty of your idea lies in splitting the training and testing of factors (tweaked or untweaked) between odd and even quarters for a "no more excuses" outcome. Having said this, I actually just look at one metric, Ann. Alpha. What I want to know is whether the factor will make money on both the even and odd quarters, formatted as train and test. If it does, chances are the factor is not overfitted, provided one has a long enough dataset to analyze. I'm just an ordinary Joe with a simple mind who just looks at the bottom line. I still see the utility of your routine and for me, it is a validation mechanism.

@James: Yeah, p-values are less than ideal here, although we do report other statistics to say how non-normal the data is (skew, kurtosis). Using Ann. Alpha should work just as well, but I would set a predefined threshold before evaluating on the test set and be self-disciplined about the outcome.

Thanks @Thomas,

I decided I was done 'training' the factor, so ran the 'test set' to see how overfit it might be (see below). Looks like it was quite overfit? So, now that I've looked at the test set data, I should 'throw away' this factor (since it was overfit) and try something quite different? Unless I've held out more data that I could test on again?

One thing I find very strange is the Turnover Analysis. I didn't include it earlier, but it was all 0.0 for all the quantiles' mean Turnover, and 1.0 for AutoCorrelation in the 'training set' (see very bottom). I didn't include it earlier because I thought that might be 'normal' for a Momentum type factor with 10D holding period. However, the result from the 'test set' was quite different, which seems quite odd to me and makes me a bit suspicious...

Returns Analysis (Test set)

10D  
Ann. alpha  0.009  
beta    -0.045  
Mean Period Wise Return Top Quantile (bps)  0.821  
Mean Period Wise Return Bottom Quantile (bps)   2.573  
Mean Period Wise Spread (bps)   -1.751

Information Analysis (Test set)

10D  
IC Mean 0.007  
IC Std. 0.083  
Risk-Adjusted IC    0.081  
t-stat(IC)  0.769  
p-value(IC) 0.444  
IC Skew 0.011  
IC Kurtosis -0.462  

Turnover Analysis (Test set)

10D  
Quantile 1 Mean Turnover    0.604  
Quantile 2 Mean Turnover    0.729  
Quantile 3 Mean Turnover    0.722  
Quantile 4 Mean Turnover    0.741  
Quantile 5 Mean Turnover    0.557

10D  
Mean Factor Rank Autocorrelation    0.365  

Turnover Analysis (Training set)

10D  
Quantile 1 Mean Turnover    0.0  
Quantile 2 Mean Turnover    0.0  
Quantile 3 Mean Turnover    0.0  
Quantile 4 Mean Turnover    0.0  
Quantile 5 Mean Turnover    0.0

10D  
Mean Factor Rank Autocorrelation    1.0  

@Joakim: Yes, quite overfit. Personally I would expect that most ideas do not work but can be made to look good by tweaking. Now, however, you can tell more quickly when you're fooling yourself and work on more promising ideas instead.

And yes, the turnover issue looks very weird so I investigated. Turns out the turnover calculation is wrong here due to a bug (https://github.com/quantopian/alphalens/issues/323) and can thus not be trusted. The factor is doing the right thing, it's just not shown correctly.

Unfortunately, not all ideas, however good, will work all the time. We have seen this very clearly over the years where fashion seems to dictate which factors will lead to out-performance in different periods. All things are driven by fundamentals at the end of the day but look at the varying fortunes of value stocks over time.

Yes, most things will turn out to be temporary fads or curve fit anomalies. What we can count on in the very long term is increasing earnings per share coupled with a strong balance sheet to drive share price. Which in turn depends on a robust economy in the geographical area or areas where the company makes its sales.

Without those two fundamental factors (or three including the general economy) you can forget long term share price appreciation.

As regards Long / Short, the big question I have is "was Alfred Jones right?"

In fundamental terms those stocks with good earnings increases and strong balance sheets should outperform those with the reverse. In the long term.

It would be foolish however to imagine they will do so every quarter or year. I suspect it would also be over optimistic to expect a long short model built on such principles to retain a lack of correlation to the general market in a crash where the baby is inevitably thrown out with the bathwater.

But I will use your excellent platform to attempt to prove myself incorrect.

@Zenothestoic: For the long term, I'm with you, but for shorter time-periods you would expect cognitive biases like herd behavior, loss aversion etc to be present, no?

Thomas
I agree that it is over the long term that value holds out. I agree that in the shorter term other factors can reign supreme for a while. The problem is that benefiting from such shorter term factors then becomes a matter of luck or "market timing" - the latter two may amount to the same thing.

In the long term long short / market neutral may have great validity if based on the correct fundamental factors. It may be as good as Alfred Jones thought it was; in theory at least.

And of course "short term" can seem a long time in the context of a traditional career span of say 25 years. Take the success of the commodity trend followers who had a good run for many years before (perhaps) their trade was spoiled by sheer weight of competition.

Of course short term can also be just that - a couple of years. Take the guys who profited so handsomely but so briefly from the Big Short. Atradis went from a small player to the biggest hedge fund manager in Singapore. Its managers made hundreds of millions from fees. But they turned out to be a one trick pony and their fund lost 60% one year after the crisis and closed.

I think your aims at Q are entirely correct in a sense. You are looking to eschew short term-ism and are looking to run with something you hope makes sense "forever". In bull and bear markets. I have never been comfortable with leverage but leaving that aside it may be you are creating something with as much long term utility as the basic stock index. Hats off to you.

So in a way it all depends on what type of investor you are. If you are a pension fund, even 25 years may be short term. If you are a hedge fund manager looking to profit personally from a quick buck for a few years from management and performance fees then it makes sense to be short term. IF you are lucky. If your strategy hits the right market at the right time.

And of course it may be that you are looking to patch together or use many shorter term factors in your long short trading which, in aggregate, you hope will give longevity. But in that case there is the age old question of when to abandon a strategy.

At least if you know the strategy/ factor is in tune with the very long term, it may go out of fashion temporarily but you are pretty sure it holds over the long and very long term.

I hope this makes sense. As ever I fear my rambling tends almost more towards philosophy than day to day investment.

Thanks @Thomas,

I'm curious if you've tested the NB on a factor that is 'known' (or at least very likely) not to be overfit? E.g. perhaps the ExtractAlpha factor?

If not, how can we be sure that the NB is working as intended? I'm only asking because I tend to get 'too good' results on the 'train set' and 'too bad' results on the 'test set' on all factors I've tested so far. Perhaps they are all overfit, but how can I be sure that the NB actually does work as intended on factors that are not overfit?

Just a suggestion, but a good QA process might be to write a test case (with expected results) for both a factor that is 'known' to be overfit, and another test case for a factor that is 'known' to not be overfit, and then compare the actual test results with the expected results in each test case?
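A minimal sketch of such a test case on synthetic data, where the expected result is known by construction. Everything here is hypothetical: `genuine_factor` has a weak but stable signal built into the simulated returns, `random_factor` is pure noise, and the IC is computed by hand as a per-day Spearman rank correlation rather than through Alphalens.

```python
import numpy as np
import pandas as pd


def daily_ic(factor, forward_returns):
    """Spearman rank correlation per date (row) between factor and returns."""
    f = factor.rank(axis=1)
    r = forward_returns.rank(axis=1)
    f = f.sub(f.mean(axis=1), axis=0)
    r = r.sub(r.mean(axis=1), axis=0)
    return (f * r).sum(axis=1) / np.sqrt((f ** 2).sum(axis=1) * (r ** 2).sum(axis=1))


def mean_ic_by_set(factor, forward_returns):
    """Mean IC on odd quarters ('train') and even quarters ('test')."""
    ic = daily_ic(factor, forward_returns)
    is_odd = np.asarray(ic.index.quarter % 2 == 1)
    return pd.Series({"train": ic[is_odd].mean(), "test": ic[~is_odd].mean()})


rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", "2017-12-31", freq="B")
assets = [f"A{i}" for i in range(200)]
shape = (len(dates), len(assets))

signal = pd.DataFrame(rng.standard_normal(shape), index=dates, columns=assets)
noise = pd.DataFrame(rng.standard_normal(shape), index=dates, columns=assets)
forward_returns = 0.1 * signal + noise  # returns contain a weak, stable signal

genuine_factor = signal  # expected: similar positive IC in both sets
random_factor = pd.DataFrame(rng.standard_normal(shape), index=dates, columns=assets)

results = pd.DataFrame({
    "genuine": mean_ic_by_set(genuine_factor, forward_returns),
    "random": mean_ic_by_set(random_factor, forward_returns),
})
print(results.round(3))
```

If the split works as intended, the genuine factor should show a similar positive mean IC on both sets and the random factor should sit near zero on both. A factor "known" to be overfit additionally needs a tuning step on the train set, since overfitting comes from the selection, not from the factor alone.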

@Joakim: I have not. If you use this test on existing factors it's more of a stability test, rather than an overfitting one. Remember, it's only a test-set if you haven't used it before. Something that we know works well, like ExtractAlpha, will do well in this test but be useless because the author had access to the time-period in our testing set. It is a bit surprising though that all your factors seem to fail that test. At the least, if you pass in factors that you already developed using the whole time period, performance on train and test should be similar (or rather, random as to which set it performs better on).

I think you provided the perfect test case above where you tweaked my factor to where it looked great on train but then failed on test.

@Thomas,

Rightio, fair enough. I actually only tried the NB on two of my factors (both of which I had used most of this period to 'train' them), and got a bit depressed (and gave up) when they both failed the 'test set'.

However, I just tried on two other factors (also 'trained' and 'known' to work quite well during this period) and they both passed the 'test set' with flying colors (roughly 10D Mean IC of 0.03 with p-value of 0.001 for both of them, both during the 'train' and 'test' sets). This makes me 'trust' that the NB does indeed work as intended (even though both factors were already 'trained' on the full period).

Thanks again for the NB and taking the time to answer my questions. It will help me quite a bit when researching and testing new factors going forward.

Hi Thomas,

If we analyzed the factor year over year from training to test, wouldn't our conclusion be different from that presented in the notebook? I am assuming the odd and even quarters in a year probably average out (when summed together) to some kind of medium-ish metric for all years in AL.

It would be helpful to know how Quantopian sees the difference between the overfitting that you are tackling in this thread and the hypothesis testing that you referred to in this thread: Alternative Test For Overfitting.

Thanks.
Leo

Sorry, another question. In the attached notebook, are you testing a hypothesis (a definition of momentum) or illustrating an overfit factor? The reason I am asking is that I didn't see any steps in the notebook that can be considered overfitting as defined in this post: portfolio structure and overfitting, although you have mentioned that one could try to improve the factor. It appears the raw factor itself is predictive only in some quarters, and one cannot conclude with confidence that any improvements to the factor have overfit if, without any changes, the factor is highly predictive in the training set and not predictive in the test set to start with.

Leo: The idea is that you iteratively improve the factor on the train set (this is implicit in the NB). By doing so it will probably start to look pretty good, but you might have just overfit. Then you run it on testing and if it doesn't do as well there, you know that you overfit. The factor is just for illustrative purposes.
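To make the implicit step concrete, here is a toy simulation under stated assumptions: the "tweaks" are 100 hypothetical factor variants that are all pure noise, the returns are random, and the variant with the best mean IC on the train quarters is then checked on the test quarters. This is not the notebook's code, just a sketch of the selection effect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.date_range("2016-01-01", "2017-12-31", freq="B")
n_days, n_assets, n_variants = len(dates), 50, 100


def ranks(x):
    """Rank along the last axis (values are continuous, so there are no ties)."""
    return x.argsort(axis=-1).argsort(axis=-1).astype(float)


def daily_ic(factor, returns):
    """Per-day Spearman rank correlation; broadcasts over leading axes."""
    f, r = ranks(factor), ranks(returns)
    f -= f.mean(axis=-1, keepdims=True)
    r -= r.mean(axis=-1, keepdims=True)
    return (f * r).sum(axis=-1) / np.sqrt((f ** 2).sum(axis=-1) * (r ** 2).sum(axis=-1))


returns = rng.standard_normal((n_days, n_assets))
variants = rng.standard_normal((n_variants, n_days, n_assets))  # pure-noise "tweaks"

ic = daily_ic(variants, returns)  # shape: (n_variants, n_days)
train = np.asarray(dates.quarter % 2 == 1)  # odd quarters used for tuning

train_ic = ic[:, train].mean(axis=1)
test_ic = ic[:, ~train].mean(axis=1)

best = int(train_ic.argmax())  # the variant we would have "discovered"
print("best variant train IC:", round(float(train_ic[best]), 4))
print("same variant test IC: ", round(float(test_ic[best]), 4))
```

None of these variants has any predictive power, yet picking the best one on the train set produces a respectable-looking train IC. Only the held-out quarters give an honest estimate for that pick.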

Thomas, thanks for clarifying that the factor is for illustrative purposes.