Best Practices for Out of Sample Testing with Multi-Factor Models - Question

When testing single factors, it seems like best practice would be to reserve a “test set” for the final model once multiple factors have been combined. This way, you can be confident that your model has not “seen” the out-of-sample data. However, without seeing the performance of individual factors in the test set, it becomes difficult to distinguish between robust factors and overfit/decayed factors.

If the model performs poorly, one may be tempted to throw out all factors in the model in future development efforts in order to avoid multiple comparisons bias. If the model performs well, the poor factors that are weighing down the model may be kept even though they have no predictive value. What is best practice in striking this balance between getting good individual factor attribution data in the test set vs. keeping a clean holdout set for the multi-factor test?

(I elaborate on this question below in the hope of making it clearer. I also propose a possible, albeit unsatisfying, solution to the problem.)

A simplified process of developing a multi-factor model may look like this:

1. Develop and tweak a single alpha factor on a “training set”.
2. Repeat for any number of factors.
3. Combine factors in training set.
4. If it performs well, run combined model on a “test set”.

After step 4, if the model performs poorly, you must throw out the model and go back to the drawing board to avoid multiple comparisons bias. However, if you start from scratch, you risk throwing out good factors that were being weighed down by other factors that either decayed due to arbitrage or were overfit in the first place.

An alternative would be to test each factor individually on the out-of-sample data (after tweaking and testing on the training data). This way, you keep the factors that hold up and throw the others into the trash. However, once you move on to the factor combination step, the result will most likely look good no matter how you combine the factors, because the individual factors have already “seen” the out-of-sample data. Therefore, you have lost the out-of-sample data for the factor combination step.

If this alternative approach is taken, maybe it makes sense to set a higher bar for the factor combination step. Instead of raw risk-adjusted performance, the benchmark could be an equal-weighted combination of the individual factors: any combination methodology (mean-variance optimization, machine learning, etc.) must beat the equal-weighted combination in the out-of-sample step. (I’m not sure this approach is entirely satisfying either.)
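To make the equal-weighted benchmark idea concrete, here is a minimal Python sketch. It assumes `factor_returns` is a pandas DataFrame of daily out-of-sample returns with one column per factor portfolio. The function names `annualized_sharpe` and `beats_equal_weight` are made up for illustration, the risk-free rate is taken as zero, and 252 trading days per year is an assumption.

```python
import numpy as np
import pandas as pd


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate assumed to be 0)."""
    std = returns.std(ddof=1)
    if not np.isfinite(std) or std == 0:
        return float("nan")
    return float(np.sqrt(periods_per_year) * returns.mean() / std)


def beats_equal_weight(factor_returns, weights, periods_per_year=252):
    """Compare a candidate factor weighting against the equal-weighted benchmark.

    factor_returns: DataFrame, one column of out-of-sample returns per factor.
    weights: candidate weights, in the same order as the columns.
    Returns (candidate_wins, candidate_sharpe, equal_weight_sharpe).
    """
    n_factors = factor_returns.shape[1]
    equal = pd.Series(1.0 / n_factors, index=factor_returns.columns)
    candidate = pd.Series(list(weights), index=factor_returns.columns, dtype=float)
    candidate = candidate / candidate.sum()  # normalize so the weights sum to 1

    sr_equal = annualized_sharpe(factor_returns.dot(equal), periods_per_year)
    sr_candidate = annualized_sharpe(factor_returns.dot(candidate), periods_per_year)
    return sr_candidate > sr_equal, sr_candidate, sr_equal
```

A combination method would only "pass" if the first returned value is True on the out-of-sample window. This does not remove the contamination from having already looked at the individual factors; it only raises the bar for the combination step.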

In the end, my question is the following:

How do you balance the trade-off between 1) keeping a clean out-of-sample testing set for the final factor combination/algo backtest step, and 2) the need to have individual factor attribution data, so that you don’t keep using decayed or data-mined factors in future models, and/or so that you don’t throw out good factors because of a poor multi-factor test?


@Michael,

This is something I've been thinking about a lot as well. I don't claim to have a 'correct' answer, but I can tell you what I do, and my rationale for it, and you or others can tell me how I'm fooling myself.

First, testing on OOS data only once to determine keep/disregard, in my opinion might be overly conservative. A few times might be ok when you're 'tweaking' individual factors. I think one is quite likely to fit at least some noise in the training set, so looking at the test results a few times after trying to make the factor more 'general' might be ok?

Second, one can have multiple periods held out for OOS testing, right? What I tend/try to do is to have some OOS data to test on when I develop my individual factors, a different OOS period for when deciding which factor to keep/disregard in the combined model, and a final OOS period for the final test.
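A rough sketch of carving one history into several consecutive hold-out windows, as described above, might look like this in Python. It assumes the data is a pandas Series or DataFrame with a timezone-naive DatetimeIndex. The helper name `split_by_dates` and the window names in the usage note are hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_dates(data, cut_dates, names):
    """Split DatetimeIndex-ed data into consecutive, non-overlapping windows.

    cut_dates: chronologically sorted dates; each one starts a new window.
    names: one label per window, so len(names) must equal len(cut_dates) + 1.
    Returns a dict mapping each name to its slice of the data.
    """
    cuts = [pd.Timestamp(d) for d in cut_dates]
    if cuts != sorted(cuts):
        raise ValueError("cut_dates must be in chronological order")
    if len(names) != len(cuts) + 1:
        raise ValueError("need exactly one more name than cut dates")

    edges = [None] + cuts + [None]  # None means open-ended on that side
    windows = {}
    for name, start, end in zip(names, edges[:-1], edges[1:]):
        mask = np.ones(len(data), dtype=bool)
        if start is not None:
            mask &= np.asarray(data.index >= start)
        if end is not None:
            mask &= np.asarray(data.index < end)
        windows[name] = data.loc[mask]
    return windows
```

For example, three cut dates with the names `["train", "factor_oos", "selection_oos", "final_oos"]` would give one training window and three separate hold-out windows, each of which can be given its own rule about how often it may be looked at.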

When researching individual factors I like to use Thomas' odd/even quarters to train/test my factors. If a factor looks ok on the test set, but still a bit overfit, I allow myself to test a few more times after trying to generalize the factor on the training set. Not too many times of course, but 3-4 times might be ok?
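For readers who want to try the odd/even quarter idea, here is a minimal Python sketch of one reading of it: Q1 and Q3 of every year go to one set, Q2 and Q4 to the other. That reading is an assumption rather than a copy of Thomas' code, and the function name `split_odd_even_quarters` is made up for illustration. The data is assumed to be a pandas object with a DatetimeIndex.

```python
import numpy as np
import pandas as pd


def split_odd_even_quarters(data, train_on_odd=True):
    """Split DatetimeIndex-ed data into alternating-quarter train and test sets.

    With train_on_odd=True, Q1 and Q3 form the training set and Q2 and Q4
    form the test set; with False, the roles are swapped.
    """
    if not isinstance(data.index, pd.DatetimeIndex):
        raise TypeError("data must be indexed by a pandas DatetimeIndex")
    is_odd_quarter = (np.asarray(data.index.quarter) % 2) == 1
    train_mask = is_odd_quarter if train_on_odd else ~is_odd_quarter
    return data[train_mask], data[~train_mask]
```

Because Q4 is followed by Q1, the alternation carries across year boundaries, so both sets sample every year in the history.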

The second OOS test I tend to do is usually when testing the combined factor model in the backtester. This is only to test which factors to keep and which to disregard in the combined model. I don't allow any tweaking of factors during this stage, but I do allow multiple tests to determine which factors to keep. I tend to use 'live_start_date' with Pyfolio to separate my IS and OOS stats.

Once I'm happy with the combined model, I submit it to the Q Contest for my last OOS test (this one I only allow once). I normally use at least some FactSet data, so after a few days I get to see how well/poorly it held up during the FS holdout period (last 1 year). I don't really have a threshold for saying when it's overfit or not (so far I've been lucky), but if average Sharpe during the hold out period is less than half of the IS average Sharpe, I'd say it's at least partially overfit. Maybe there's a better way of setting this threshold? E.g. flagging it if the hold-out average Sharpe is more than 1 standard deviation below the average Sharpe during the IS training period?
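The two rules of thumb above can be written down as a small Python check. This is only a sketch under assumptions: `returns` is a daily pandas Series with a timezone-naive DatetimeIndex, the in-sample Sharpe is positive, and "average Sharpe" is interpreted as a rolling Sharpe with a 126-day window (both the interpretation and the window are assumptions). The function name `sharpe_decay_check` is hypothetical.

```python
import numpy as np
import pandas as pd


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate assumed to be 0)."""
    std = returns.std(ddof=1)
    if not np.isfinite(std) or std == 0:
        return float("nan")
    return float(np.sqrt(periods_per_year) * returns.mean() / std)


def sharpe_decay_check(returns, live_start_date, rolling_window=126,
                       periods_per_year=252):
    """Compare in-sample (IS) and out-of-sample (OOS) Sharpe around a split date."""
    live_start = pd.Timestamp(live_start_date)
    in_sample = returns[returns.index < live_start]
    out_of_sample = returns[returns.index >= live_start]
    sr_is = annualized_sharpe(in_sample, periods_per_year)
    sr_oos = annualized_sharpe(out_of_sample, periods_per_year)

    # Mean and spread of the rolling IS Sharpe, for the one-standard-deviation rule.
    roll = in_sample.rolling(rolling_window)
    rolling_sr = (np.sqrt(periods_per_year) * roll.mean() / roll.std()).dropna()
    one_sd_floor = rolling_sr.mean() - rolling_sr.std()

    return {
        "is_sharpe": sr_is,
        "oos_sharpe": sr_oos,
        # Rule 1: OOS Sharpe is less than half of the IS Sharpe.
        "below_half_of_is": bool(sr_oos < 0.5 * sr_is),
        # Rule 2: OOS Sharpe is more than one standard deviation below the IS average.
        "below_is_mean_minus_1sd": bool(sr_oos < one_sd_floor),
    }
```

Neither flag is a statistical test; a one-year hold-out gives a noisy Sharpe estimate, so both are better read as prompts for a closer look than as verdicts.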

Lastly, best OOS testing is of course on future data, so I try to force myself to keep a strategy (that's held up so far) running for at least 3 months in the contest (to accumulate a 'full score'). 3 months may not be long enough, as a strategy doesn't necessarily have to be overfit even if it doesn't perform well for 3 months, but I tend to be too impatient to wait any longer.

The biggest challenge for me is discipline and to actually force myself to adhere to these 'rules' I've set for myself. Easier said than done...

Hi Joakim,

Thanks for responding. I doubt there is a “correct” answer, so I appreciate any well thought-out feedback. Here are some of my thoughts on your responses:

First, testing on OOS data only once to determine keep/disregard, in my opinion might be overly conservative.

1) It might be ok, but I think the key principle is that the more you use the holdout set, the more likely you are to make a false discovery. It can be very easy to take the information that you gain from the OOS test and create a good-looking factor that is fit to noise. But maybe it is just a trade-off that the researcher has to balance for him/herself.

Second, one can have multiple periods held out for OOS testing, right?

2) I agree with you here. This is similar to the machine learning method of having a training set, a cross-validation (CV) set, and a test set. With ML, you train the model on the training set, tune your hyperparameters on your CV set, then test your best model on the test set. This sounds like what you are doing in your manual development approach. However, I tend to believe that the test set needs to be pass/fail, for reasons similar to those I mentioned in point 1.

It also doesn't completely address the issue of alpha decay of individual factors in the most recent periods (at least until more out-of-sample data becomes available as time passes).

When researching individual factors I like to use Thomas' odd/even quarters to train/test my factors.

3) I think Thomas’s solution is interesting, particularly in helping to make sure that you don’t miss a certain type of market regime in the training set that exists in the test set. A drawback is that you don’t get a clean continuous test set (unless you still hold out a continuous set of data in the most recent period). This can create some difficulty in interpreting the statistics in the final test. For example, annualized stats (e.g. Sharpe ratio, alpha, annualized return) will be off if they are computed over the full calendar span. Volatility metrics may also be understated because of the periods in which the strategy is not traded. That said, these statistics could probably be adjusted to overcome this obstacle.
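One possible adjustment, sketched in Python under assumptions, is to compute the statistics only over the days that belong to the test quarters instead of over a full-span series in which the untraded days show up as zero returns. The names `stats_on_traded_days` and `traded_mask` are hypothetical, the risk-free rate is taken as zero, and the annualized return here is a simple arithmetic one (mean daily return times 252).

```python
import numpy as np
import pandas as pd


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate assumed to be 0)."""
    std = returns.std(ddof=1)
    if not np.isfinite(std) or std == 0:
        return float("nan")
    return float(np.sqrt(periods_per_year) * returns.mean() / std)


def stats_on_traded_days(returns, traded_mask, periods_per_year=252):
    """Annualized stats using only the days on which the strategy was traded.

    traded_mask: boolean array, True for days inside the test quarters.
    """
    traded = returns[np.asarray(traded_mask, dtype=bool)]
    return {
        "n_days": int(len(traded)),
        "annualized_return": float(traded.mean() * periods_per_year),
        "annualized_volatility": float(traded.std(ddof=1) * np.sqrt(periods_per_year)),
        "sharpe": annualized_sharpe(traded, periods_per_year),
    }
```

Dropping the untraded days avoids the dilution from zero-return days, but it also stitches together non-adjacent periods, so drawdown and autocorrelation statistics would still need separate care.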

Lastly, best OOS testing is of course on future data

4) I agree with you that forward testing is the only true OOS test.

I don't really have a threshold for saying when it's overfit or not (so far I've been lucky), but if average Sharpe during the hold out period is less than half of the IS average Sharpe, I'd say it's at least partially overfit. Maybe there's a better way of setting this threshold?

5) Regarding the acceptability of alpha decay, I would think that as long as the risk-adjusted performance is strong, the percentage decay isn’t so important. For example, let’s say that in my testing phase, strategy A has a Sharpe Ratio (SR) of 2.0, and strategy B has a SR of 1.5. In my live test, both strategies have a SR of 1.0. Strategy A decayed by 50% and B decayed by 33%. Despite A having a greater decay than B, I would argue that both are worth keeping (assuming the SR is the only metric we are using to evaluate the algorithm).

6) Another interesting point was in Marcos Lopez de Prado’s recent book, Advances in Financial Machine Learning, where he recommends recording every backtest conducted on a dataset so that you can estimate the probability of overfitting. This also enables you to deflate your Sharpe ratio by the number of trials carried out.

For anyone following along, Thomas’s excellent post, as mentioned by Joakim, is relevant to the discussion and worth a read.

Thanks Michael,

I really appreciate getting someone else's feedback and perspective of what I'm doing!

I'm currently reading MLdP's book, which I've found excellent so far. I also found his speech from QuantCon 2018 extremely helpful. I'm sure I'm susceptible to and a victim of all the 7 pitfalls, even though I don't really do any ML stuff.

I especially found the deflated Sharpe Ratio (SR) interesting, which you mentioned. If I do 500 backtests, the expected max SR is 2, even if there's no underlying strategy!! And of course I'm going to choose the backtest with the highest SR... Basically backtest (selection) overfitting. Made me realize I need to spend a lot more time in Research and a lot less in the backtester. Essentially I've been overfitting without even realizing it...
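The selection effect can be put into rough numbers. The Python sketch below uses the approximation for the expected maximum Sharpe ratio across N independent trials with no true skill, as I understand it from Bailey and López de Prado's work on the deflated Sharpe ratio. The function name `expected_max_sharpe` and the parameter `sharpe_std` (the standard deviation of Sharpe ratios across trials) are illustrative choices, and independence of the trials is an assumption that real backtest variations rarely satisfy.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials, sharpe_std=1.0, mean_sharpe=0.0):
    """Approximate expected maximum Sharpe ratio over n_trials independent trials.

    Uses E[max] ~ mean + std * ((1 - g) * Z^-1(1 - 1/N) + g * Z^-1(1 - 1/(N * e))),
    where g is the Euler-Mascheroni constant and Z^-1 is the inverse standard
    normal CDF. With mean_sharpe=0 this is the "no underlying strategy" case.
    """
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    inv_cdf = NormalDist().inv_cdf
    g = EULER_MASCHERONI
    max_z = ((1 - g) * inv_cdf(1 - 1.0 / n_trials)
             + g * inv_cdf(1 - 1.0 / (n_trials * math.e)))
    return mean_sharpe + sharpe_std * max_z
```

The result scales linearly with `sharpe_std`, so the level reached after a given number of backtests depends on how widely the Sharpe ratios of the trials are dispersed, and it keeps growing (slowly) as more trials are added.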

Bumping to see if anyone else has any thoughts on this:

How do you balance the trade-off between 1) keeping a clean
out-of-sample testing set for the final factor combination/algo
backtest step, and 2) the need to have individual factor attribution
data, so that you don’t keep using decayed or data-mined factors in
future models, and/or so that you don’t throw out good factors because
of a poor multi-factor test?