Alternative Test For Overfitting

Inspired by Dr. Thomas Wiecki's work here (tackling-overfitting-via-cross-validation-over-quarters) and the new FactSet data with its mandatory one-year holdout period, I want to share with the community my alternative methodology for tackling overfitting, using what I call "backward and forward holdout period backtests". The essence of the technique is to check whether the results of the "training phase" of the backtest generalize consistently to data the strategy has not seen before: in this case, the backward and forward dates/periods.

I start with a two-year backtest from 10/12/2015 to 10/13/2017. This algo makes use of FactSet and Morningstar fundamental data together with some OHLCV technical indicators. The results look quite dismal and boring from the outset, and one might be tempted to dismiss the strategy and throw it away.

Here's the returns tearsheet:


I now apply this to the "backward holdout period" backtest. I went as far back as data availability would allow, which was 2005, purposely because this span covers several market regime changes. I wanted to see how my alpha combination factors would perform through such grueling regime shifts.

As you can see, it held up quite nicely! Here's the returns tearsheet:


Now for the "forward holdout period" backtest. The caveat here is that you have to enter the contest to take a "peek" at the one-year holdout period. And so I did; here's the returns tearsheet:

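To make the three phases concrete, here is a minimal sketch (plain pandas, outside the Quantopian backtester) of the period split. The training dates are the ones from the post; the exact backward and forward boundary dates are illustrative assumptions:

```python
import pandas as pd

# Training dates are from the post; the backward window runs from the
# earliest available data (2005) and the forward window is the contest
# holdout year. Boundary dates other than the training ones are assumed.
training = (pd.Timestamp("2015-10-12"), pd.Timestamp("2017-10-13"))
backward = (pd.Timestamp("2005-01-03"), pd.Timestamp("2015-10-09"))
forward  = (pd.Timestamp("2017-10-16"), pd.Timestamp("2018-10-16"))

def windows_are_disjoint(*windows):
    """True if no window overlaps another, i.e. no data leaks between phases."""
    spans = sorted(windows)
    return all(prev_end < next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))

print(windows_are_disjoint(training, backward, forward))  # True
```

The disjointness check matters: if any holdout window overlapped the training window, the "out-of-sample" results would be contaminated.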

To summarize the results of the different phases:

                    Training Phase   Backward   Forward

Annual Returns           0.8%          3.9%       2.8%
Annual Volatility        1.3%          1.5%       1.3%
Max Drawdown            -2.3%         -1.8%      -1.2%
Sharpe Ratio             0.61          2.59       2.17
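For readers who want to reproduce a table like this from a daily returns series, here is a rough sketch of the four metrics. These are simplified formulas (geometric annualized return, arithmetic Sharpe, zero risk-free rate), so the numbers may differ slightly from pyfolio's exact output:

```python
import numpy as np
import pandas as pd

def summary_stats(returns, periods_per_year=252):
    """Annualized return, volatility, max drawdown and Sharpe from daily returns."""
    ann_return = (1 + returns).prod() ** (periods_per_year / len(returns)) - 1
    ann_vol = returns.std() * np.sqrt(periods_per_year)
    sharpe = returns.mean() / returns.std() * np.sqrt(periods_per_year)
    equity = (1 + returns).cumprod()          # cumulative growth of $1
    max_dd = (equity / equity.cummax() - 1).min()  # worst peak-to-trough drop
    return {"annual_return": ann_return, "annual_volatility": ann_vol,
            "max_drawdown": max_dd, "sharpe": sharpe}

# Toy example on synthetic daily returns (roughly two years of trading days).
rets = pd.Series(np.random.default_rng(0).normal(0.0002, 0.001, 504))
print(summary_stats(rets))
```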


While the two-year "training phase" exhibits below-par risk-adjusted returns, both the 10+ years of backward holdout and the forward holdout period show some nice results. Overfit models exhibit the opposite behavior: the in-sample period normally shows nice risk-adjusted returns, only to falter when presented with out-of-sample data. To further verify that my in-sample results are not just a fluke, I go back to the long backward holdout backtest and check whether some years exhibit similar behavior. Indeed, 2010 and 2013 show behavior and results similar to my in-sample / training phase.

It is also interesting to note that portfolio volatility seems fairly consistent across the board, skewed only during the 2008 Global Financial Crisis, when market volatility was extremely high. It was also during this period that the algo achieved its highest returns! This further boosts my confidence in the system: assuming history repeats itself, the trading strategy should not only weather such a storm, it could possibly excel in it!

I hope this helps, especially the young guns out there just starting out in this complex world of financial trading!

I see your reasoning and agree with it. The only question I have concerns that period of increased negative correlation to the benchmark in 2008. I'm not sure from the charts whether that period still complies with the beta-to-S&P rules?

In any event, I wonder whether the reason for that out-performance needs further consideration? I get the impression that the straighter the curve the better pleased Quantopian will be.

I wonder whether there is some hint of an anomaly here which might cause harm rather than benefit in future market conditions?

Hi Anthony (Zenothestoic),

To answer your first question: the strategy passes all contest metrics for 2008 during the GFC. Its beta-to-SPY is only -0.02, well within the ±0.30 threshold.

I think the ideal returns curve for a low-beta, low-volatility, market-neutral long-short strategy is a smooth upward curve: one that outperforms the market during times of negative returns and high volatility, and underperforms (while still being positive) during times of strong upward momentum. I believe this strategy exhibits that behavior, and I doubt there is some anomaly at play here. The smoothness of the returns curve is somewhat distorted during this period by the extraordinarily high volatility, where one can amass good returns by betting on the right side. I attached the full round trips tearsheet to illustrate this point. During this period the algo had a 55% hit rate on shorts and a 50% hit rate on longs, and I know for a fact that the opposite is true during momentum/trending periods. This leads me to believe that this particular algo is able to adapt to different market regimes, which is in my design and worked as intended.
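To illustrate what a hit rate means here: given a table of closed round trips (the real ones come from pyfolio's round-trip tearsheet; the numbers below are made up), the per-side hit rate is just the fraction of winning trades:

```python
import pandas as pd

# Hypothetical round-trip P&L records, one row per closed trade.
trades = pd.DataFrame({
    "side": ["long", "short", "long", "short", "short", "long"],
    "pnl":  [120.0,  -40.0,  -80.0,   60.0,    30.0,   200.0],
})

# Fraction of profitable round trips, grouped by side of the trade.
hit_rate = trades.groupby("side")["pnl"].apply(lambda pnl: (pnl > 0).mean())
print(hit_rate)  # long and short hit rates as fractions
```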

Sorry I'm having problems attaching the notebook, some technical difficulties on Q's end. I will re-try later on.

Sounds hugely fruitful, James. As a fundamental analyst by training, I have long meant to quantify the value of the long-standing and much-vaunted ratios which are now available for easy backtesting.

There seems to be a subtle shift at Quantopian, and for the better. Anecdotally there seems to be less day-to-day activity in the forum, but it seems to me to be of better quality than when I was last using the backtester.

I am now coming to believe more strongly in their mission statement: to democratize Wall Street. Given the huge advances they have made with Alphalens, Pyfolio, Zipline and the datasets, I am coming to believe that great things may be done here.

I'm glad you feel the same way I do. Yeah, it seems like great things are now coming to fruition. Best!

Here's the full round trip tearsheet for 2008:


Hi James,

Great post, love the ingenuity! What I like is that this method can also be used in the backtester. A couple of questions:

• If the factor didn't look good on your training period, why move forward? Technically you couldn't have known it would perform well on these periods. I assume this example was chosen for illustration (as I did in my other post, and that's totally fine), but I think it's important to think about what protocol should be followed here. Certainly, you are not allowed to make any changes to the factor after you move on from the test set. As the factor does not do well in the training set, why didn't you tweak it further? What would you have done if the factor failed the backward or forward test?
• What do you think is the benefit of this vs. the quarter approach? (Not trying at one-upmanship here; I think more options are good, but we should try to compare them.)

In any case, the exact method to do hold-out testing is not as critical as doing hold-out testing in the first place, so posts like these are very valuable in teaching this mindset.


Hi Thomas,

Thanks. This methodology is actually what I use in my deep learning models to gauge overfitting and is anchored on the assumption that financial time series are non-stationary.

The training period was chosen as the last two years available per the FactSet data availability, to be as close as possible to the forward one-year holdout period. It was an arbitrary choice on my part; it could be any middle chunk of the whole dataset.

Due to the non-stationarity of the time series, this method actually forces you to move forward; unless, of course, the results of the training phase are ridiculously dismal, in which case proceeding might just be a waste of time. In terms of protocol, it's really a judgement call. The reason to proceed with the backward and forward holdout backtests even with below-par performance in the training period is to test the generalization ability of the factors on data they have not seen before. This is also a way to handle the non-stationarity of the financial time series without tweaking the factor(s) after the fact. In other words, the validation of how well the factors perform (or overfit) rests more on the results of the holdout data. One can then evaluate the results of the training set against what happened in the past and what could happen in the future.

Financial modeling is as much an art as it is a science. Since the choice of training set is quite arbitrary, after running the backward and forward holdout periods the modeler should ask questions like: was the behavior of the training set similar to any occurrences in the past or in the "future"? If it was, acceptability is easier to judge. In the other case, where the strategy fails the backward and forward tests, it is also easier to conclude that the factors overfit that particular training period, since they did not generalize well to data they had not seen before.

Honestly, I really don't know what the benefits of this method of holdout validation are vs. your quarter approach. I'm sure there are pros and cons to either method. I just view it as another alternative test for the veracity of factors and/or overfitting.

P.S. - Perhaps the more profound difference in doing this type of holdout testing is the use of the backtester vs. Alphalens. The main advantage of the backtester is that it already includes transaction and slippage costs, and it incorporates the risk model framework.

As the factor does not do well in the training set, why didn't you tweak it further?

Ouch. Sad to relate, in my view and in my experience of trading futures, "tweaking" is the very spawn of Beelzebub. I have made that mistake so often. I prefer James' approach: if one period looks bad, move on to other periods and see if that one period was anomalously bad in a sea of goodness.

If all is still looking dismal, dump it.

I do think these guys have it largely right: Mathematical Investor

Let me expand a little. I have been working hard on the various Q tools over the past week and thinking hard also about what they want. When you take a look at the Example Long Short Equity Algorithm, you become very aware of how tightly Q has tied up what they want.

They have already set so many parameters: dollar neutral, sector neutral, factor neutral, choice of universe, max gross leverage, winsorisation, optimisation with constraints. All you have to do is choose the factors, one at a time in the case of Alphalens, at least when you are making your choice for eventual input into a "combination factor".
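As a toy illustration of how tightly those parameters pin things down, here is a sketch that checks a hypothetical weight vector against a few of the constraints mentioned. The thresholds are made-up stand-ins, not Quantopian's actual contest limits:

```python
import numpy as np

weights = np.array([0.02, -0.015, 0.01, -0.02, 0.005])  # toy long-short book

# Illustrative constraint checks (threshold values are assumptions):
dollar_neutral   = abs(weights.sum()) < 0.01       # longs roughly offset shorts
gross_leverage   = np.abs(weights).sum()           # total exposure, long + short
concentration_ok = np.abs(weights).max() <= 0.05   # per-position size limit

print(dollar_neutral, gross_leverage <= 1.0, concentration_ok)  # True True True
```

In the real workflow the Optimize API enforces these at order time; the point here is just that the factor author controls almost nothing except the alpha itself.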

So take EPS growth by way of example. It is a quarterly event (unless you convert it into some daily metric, e.g. the related PER). I'm not sure I am happy fiddling much further with it. It either drives price or it does not. Perhaps it does not drive price (or is ignored in one particular year), but if so I prefer the approach of trying out a few different years to see whether the assumption holds overall.

I can see that moving averages may be of use (with PER, for instance, although that will take us ominously close to the already closely related factor of momentum). But I'm not too sure about tweaking much further. The parameters are already so many and so tightly set.

I may be talking in ignorance but it seems you have already gone so far down the line in terms of rigid system design that further tweaking seems a bit ominous. Perhaps I will veer from the paths of righteousness as my journey progresses.

if one period looks bad, move on to other periods and see if that one period was anomalously bad in a sea of goodness. If all is still looking dismal, dump it.

I don't see a reason not to just test it on the whole time period, then. You only need to protect against overfitting if you're actually fitting something. Of course, an approach where you define your idea ahead of time, test it once (on all the data; staggered doesn't make a difference) and then either dump it or submit it to the contest is probably the most robust way, as there is no chance to overfit. I would call this a pure hypothesis-driven approach. For other approaches, where you look at the data to guide your strategy, you need protection against overfitting in the form of cross-validation. We might call this data-driven.
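A minimal sketch of that cross-validation idea for data-driven strategies: expanding-window (walk-forward) splits, where each fold validates strictly on data that comes after its training window. scikit-learn's `TimeSeriesSplit` does the same thing; this stdlib version is just for illustration:

```python
def walk_forward_splits(n_samples, n_splits):
    """Expanding-window splits: fold k trains on everything before its test chunk."""
    chunk = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = range(0, k * chunk)
        # Last fold's test chunk absorbs any leftover samples.
        test = range(k * chunk, (k + 1) * chunk if k < n_splits else n_samples)
        yield list(train), list(test)

for train, test in walk_forward_splits(100, 4):
    print(f"train: 0..{train[-1]}, test: {test[0]}..{test[-1]}")
```

Note that the training window only ever grows forward in time, so no fold can peek at its own future, which is the whole point of the protection.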

@Zenothestoic
That all sounds fine to me. Just note that a factor does not have to be simple, or even single-factor. The final factor might be a non-linear (e.g. via ML) combination of several factors, as I exemplified in the ML on Q posts. I would imagine these types of multi-factor alphas are less likely to have been discovered already.

No. But the initial testing in Alphalens is single-factor, then multi-factor. Perhaps a crappy single factor fits well with other factors through non-correlation. As to the discovery of new factors, I guess one always has to beware of causation vs. correlation arguments. But of course most of the metrics are well tried and tested. Ecclesiastes had it dead right. As did that memorable phrase from Battlestar Galactica: it has happened before; it will happen again.

Right, although you can also put a combined multi-factor alpha into Alphalens.

Hi, I'm new here. How do I see the Python code used to generate your validation strategy? All I'm seeing is plots.

Thanks Thomas. Yes I am getting the hang of it. A powerful tool.

@Corey,

Welcome to Quantopian!

Unfortunately, I can't share the Python code for this algo since it is a contest entry vying for a fund allocation. It is also proprietary, and thus considered the intellectual property of the author. For Python code examples, I suggest you go through the Quantopian Lecture Series; there is a ton of educational content and resources in the Help section. Additionally, Quantopian has provided the community with some very powerful tools: the Backtester, which is tailored to guide authors through the contest criteria; Pyfolio, which analyzes results; Alphalens, which analyzes factors; and the plots you see here, called tearsheets, which give you an in-depth analysis of your algo's results. My advice is to be patient and spend some time learning the different facets of financial trading through the resources provided on the Quantopian website and community forum.

@Thomas, @Zeno,

Let me expound a little more on the importance of recognizing the non-stationarity of financial time series as it relates to holdout validation. I'm sure you are both familiar with this phenomenon, but for the benefit of those who aren't: non-stationarity of a time series refers to statistical properties, such as the mean, variance and autocorrelation of a data series, that change over time. This phenomenon is why prediction is so challenging. Non-stationary behavior manifests itself as trends, cycles, random walks or combinations of the three; we often refer to these as changes in market regimes or conditions.
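A quick numerical illustration of the point: for a stationary series, summary statistics computed over different sub-periods agree, while for a trending (non-stationary) series they drift. A formal check would use something like the augmented Dickey-Fuller test in statsmodels; this sketch just shows the intuition:

```python
import numpy as np

rng = np.random.default_rng(42)
noise = rng.normal(0.0, 1.0, 2000)        # stationary: stable mean and variance
trend = 0.01 * np.arange(2000) + noise    # non-stationary: the mean drifts upward

def half_means(x):
    """Mean of the first vs second half; a large gap hints at non-stationarity."""
    h = len(x) // 2
    return x[:h].mean(), x[h:].mean()

print(half_means(noise))  # both halves close to 0
print(half_means(trend))  # roughly 5 vs 15: the mean changes over time
```

A model fit to the first half of the trending series would systematically misestimate the second half, which is exactly why the holdout periods are informative.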

To answer Thomas' question (if the factor didn't look good on your training period, why move forward?): since the choice of training period is arbitrary, one might have chosen a period that is highly random and therefore most difficult, if not impossible, to predict. So under my methodology, one is almost always forced to move ahead and test on holdout data to see how the factors behave in different time periods and conditions. As in my example above, the results of the training period are dismal in the risk-adjusted-returns sense, but when the factors were presented with holdout data, both backward and forward, the results were generally quite satisfying. Further, I can look back at the holdout periods to confirm whether the behavior of the training period also occurred there; in this case it did.

In summary, since the non-stationarity of financial time series manifests itself as trends, cycles, randomness or a combination of the three over time, the quant modeler must be cognizant of these phenomena to make a better judgement about whether the algo overfits the training period. There is really not much a quant can do in periods of randomness, because the level of predictability is very low, if not nil. However, I believe the key is to find factors that are able to adapt to both trending and cycling (mean-reverting) periods, where predictability exists.

I have been using a variant of James method quite effectively for a few months now. In my approach I use chunked periods backward and forward.

Hi Leo M,

Thumbs up!