Avoiding Overfit Bias -- An Overlooked Dimension of Holdout Data

I apologize if I'm stating the obvious or if this has been covered already. I hadn't seen it discussed, so I thought I'd start a discussion on it.

Algorithms posted to these forums with stellar backtest results typically do not hold up when tested later out of sample. The degradation is usually so stark and immediate that the issue can't simply be alpha decay. This shows how ubiquitous the problem of overfitting is.

One common technique to help avoid overfitting is a holdout period -- a time frame set aside that you have not fit your model on. Ultimately, though, this can suffer from the same problem: algorithms that survive the holdout period may do so by chance, and those are the ones selected to move forward.

Another problem is that some datasets aren't old enough for a holdout period to be viable. Sentiment data, for example, has maybe three years of statistically viable history, and anything before that is suspect.

The other day I was working on a sentiment-based algorithm and was getting unbelievably fantastic results -- consistent alpha, 50% CAGR. Since a timeframe holdout wasn't viable, I dug into returns attribution, which is where I discovered that my HMNY trades cumulatively produced 10x more gains than any other position. Eliminating this one company, the algorithm meandered and lost money. If one position makes the difference between fantastic and terrible performance, that tells you something. HMNY was the short of a lifetime, but it's not coming back.

One thing I like to look for is robust alpha: there must be breadth in the returns attribution. If most of the returns can be attributed to rare, one-off opportunities, that's a good sign the backtest will not be predictive of future performance.

Just as you might set aside a two-year period as holdout data, I think you should also keep a holdout stock universe. QTradableStocksUS has enough stocks that you should be able to hold out 25% or even more, no? Here's some pseudo-code for Pipeline:

salt = 397439872  # changing this number generates a new random stock-universe holdout, deterministically
mask = crc32(str(sid + salt).encode()) % 3 != 0  # drops the 1/3 of stocks hashing to 0; change 0 to 1 or 2 to swap the holdout
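To make the idea concrete outside of Pipeline, here is a minimal self-contained sketch of the same crc32 trick. The ticker list and the function name `in_training_universe` are illustrative, not part of any API; the point is only that the split is pseudo-random yet fully deterministic for a given salt:

```python
from zlib import crc32

SALT = 397439872  # any fixed integer; changing it reshuffles the split deterministically

def in_training_universe(symbol, salt=SALT, n_buckets=3, holdout_bucket=0):
    """True if the stock stays in the development universe.

    Hashing symbol+salt with crc32 gives a stable pseudo-random bucket,
    so the same stock always lands in the same partition for a given salt.
    """
    return crc32(f"{symbol}:{salt}".encode()) % n_buckets != holdout_bucket

# Illustrative universe -- in practice this would be your tradable universe.
universe = ["AAPL", "MSFT", "AMZN", "GOOG", "JNJ", "XOM", "JPM", "PG", "V", "HD"]
train = [s for s in universe if in_training_universe(s)]
holdout = [s for s in universe if not in_training_universe(s)]
```

Develop only on `train`; touch `holdout` once, at the end, the same way you would a time-period holdout.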


You should be targeting generalities with your model, not one-offs. One-offs are not repeatable and are not statistically meaningful. So if an algorithm doesn't hold up to a random change in the stock universe's constituents, it lacks the breadth needed to show it isn't overfit.

This technique won't eliminate overfitting, but it could be a useful tool to help.

I would like to see more returns-attribution breadth analysis added to Alphalens/tearsheets. A red flag like HMNY in my example would be easy to catch, and while that's the most extreme case, I think this is quite common in overfit algorithms. Holding a basket of 500 stocks can mislead you into thinking you're working with statistically viable data, but if the returns attribution lacks breadth -- say, a vast majority of your gains come from only 20 positions -- you're easily in the realm of overfitting, since 20 is not a statistically viable sample size.
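Pending something in the tearsheets, a crude version of this breadth check can be run on any backtest's per-position PnL. The function and the example data below are hypothetical, just to show the shape of the metric:

```python
def attribution_breadth(pnl_by_position, top_n=20):
    """Fraction of total positive gains contributed by the top_n winners.

    A value near 1.0 means a handful of positions drive everything --
    the HMNY red flag -- while a lower value indicates broader alpha.
    """
    gains = sorted((p for p in pnl_by_position.values() if p > 0), reverse=True)
    total = sum(gains)
    if total == 0:
        return 0.0
    return sum(gains[:top_n]) / total

# Hypothetical example: one outsized winner next to 100 small ones.
pnl = {"HMNY": 1_000_000}
pnl.update({f"STOCK{i}": 1_000 for i in range(100)})
print(attribution_breadth(pnl, top_n=1))  # ~0.909: one position drives >90% of gains
```

A threshold for "too concentrated" is a judgment call, but anything where a single position dominates like this deserves the HMNY treatment: rerun the backtest without it.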

I would also like to see a stock-universe holdout built into the IDE, to encourage people to develop algorithms with holdout data.

2 responses

It's an excellent idea. Thank you for raising this.

Possibly another way to implement might be to allow partitioning the data set into N (some user-selected number, maybe 2 to 5 or so) equally-sized subsets, running the algo on all of them, and then comparing the results to see the range of uncertainty arising solely from different sub-sample selection.
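Extending the crc32 trick from the original post, this N-way version might look like the sketch below. The `run_backtest` call in the comment is a stand-in for whatever backtest harness you use, not a real API:

```python
from zlib import crc32

def partition(symbols, n, salt=397439872):
    """Split a universe into n deterministic pseudo-random subsets."""
    buckets = [[] for _ in range(n)]
    for s in symbols:
        buckets[crc32(f"{s}:{salt}".encode()) % n].append(s)
    return buckets

# Run the algo once per subset; the spread of the results estimates the
# uncertainty arising from sub-sample selection alone, e.g.:
#   results = [run_backtest(universe=b) for b in partition(all_symbols, 4)]
```

If the per-subset results vary wildly, the "alpha" is probably riding on a few names rather than a general signal.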

Good idea. Just replace the != with == in the code above to select a single subset, and change the modulus from 3 to N to control how many subsets there are.