Overfitting Questions - What are your thoughts?

I would like to hear others' opinions about overfitting, because I want to avoid being guilty of it, but at the same time I want my model to be as good as it possibly can be. So here goes... In my mind, choosing stocks like Apple, Netflix, Google, and Amazon would be overfitting because past performance of those stocks isn't a guarantee of future results.

HOWEVER, in the algorithm I am working on now, I use a 5% trailing stop loss. I tried backtests at 4%, 5%, 6% and 7%, but settled on 5% because it had the best backtest result. In your opinion, is this overfitting? The idea of a stop-loss is completely reasonable, and I used the tools available to me to figure out where that number should be.

So am I overfitting, or being reasonable?

3 responses

There should be a theoretical basis behind all your algorithms. The basis should give you a risk premium. Everything else is just fluff. If your algorithm simply does well because of overfitting stop losses, then you're setting yourself up for trouble.

Minh - I definitely agree with that. Assume I have a theory/strategy that I am using, and as a part of that I feel like stop-losses are a good risk-management tool. I'm not overfitting just by trying to figure out what my stop loss percentage should be, am I? Thanks for your input.

Nothing wrong with trying some things and going with what works best. Seems reasonable.

I generally use a few rules of thumb to check (though not ensure) that I'm not fooling myself and 'overfitting' a particular parameter. These aren't exactly statistically sound but more 'seat of the pants'.

1. Small parameter changes should result in small output changes. Vary the parameter a little bit. If the output results change a lot, be suspicious.
2. Directional parameter changes should result in consistent directional output changes. Vary the parameter up, then down. If the output consistently follows the direction (or the inverse direction), then good. If the output direction changes, be suspicious.
3. Inverting the parameters should invert the results. This may not be applicable to all parameters (maybe not the stop loss described above), but try flipping the logic. If a high factor value generates high returns and a low factor value generates low returns, then good. If a low factor value also creates high (or even medium) returns, be suspicious.
4. Parameters should be independent of timeframe. Try backtesting across different time frames. If the same parameter choices come up as optimal, then good. If parameters work in some periods and not others, be suspicious.
5. Parameters should be independent of the assets. Try backtesting across different independent samples of assets. If the same parameter choices come up as optimal, then good. If parameters work with some samples and not others, be suspicious. (See below for a custom classifier to use to generate random samples using a pipeline. Set a filter for 'universe 0', then do the same but filter for 'universe 1'. The output results should generally be the same for both samples.)
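
As a rough illustration of rules 1 and 2 (this is separate from the RandomUniverse classifier for rule 5), here is a minimal sketch in plain Python. The function name check_parameter_stability, the max_jump threshold, and the numbers in the usage lines are all hypothetical placeholders, not real backtest results; you would feed in your own metric from each backtest run. Like the rules themselves, it is a seat-of-the-pants check, not a statistical test.

```python
def check_parameter_stability(results, max_jump=0.25):
    """Apply rules 1 and 2 to a parameter sweep.

    results:  dict mapping a parameter value (e.g. stop-loss fraction)
              to one backtest metric (e.g. total return).
    max_jump: hypothetical threshold; the largest change between
              neighbouring parameter values that is tolerated, relative
              to the largest absolute metric in the sweep.
    """
    if len(results) < 2:
        raise ValueError("need at least two parameter values to compare")

    params = sorted(results)
    values = [results[p] for p in params]
    # Change in the metric between each pair of neighbouring parameters.
    diffs = [b - a for a, b in zip(values, values[1:])]

    # Rule 1: small parameter changes should give small output changes.
    scale = max(abs(v) for v in values) or 1.0
    smooth = all(abs(d) / scale <= max_jump for d in diffs)

    # Rule 2: the output should move in one consistent direction.
    signs = {d > 0 for d in diffs if d != 0}
    consistent_direction = len(signs) <= 1

    return {"smooth": smooth, "consistent_direction": consistent_direction}


# Placeholder numbers for illustration only (not real backtest output).
steady_sweep = {0.04: 0.10, 0.05: 0.11, 0.06: 0.12, 0.07: 0.13}
jumpy_sweep = {0.04: 0.10, 0.05: 0.30, 0.06: 0.08, 0.07: 0.12}

steady_check = check_parameter_stability(steady_sweep)
jumpy_check = check_parameter_stability(jumpy_sweep)
```

Note that a sweep with a peak in the middle (such as 5% beating both 4% and 6%) fails the rule 2 check as written here. That is a reason to look closer, not proof of overfitting.
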
from quantopian.pipeline import CustomClassifier
import numpy as np

class RandomUniverse(CustomClassifier):
    inputs = []
    window_length = 1
    dtype = np.int64
    missing_value = 9999
    universes = 2
    seed = 1
    size = 10000
    np.random.seed(seed)
    random_buckets = np.random.randint(universes, size=size)
    def compute(self, today, assets, out):
        # get the asset ID into a range 0-9999