Random Forest unable to predict outside of training data

I have been using Random Forest Regression to make price predictions, and I recently realised a major limitation: unlike standard regression, it can never predict values larger or smaller than the maximum and minimum targets in the training data (for some background see here http://stats.stackexchange.com/questions/235189/random-forest-regression-not-predicting-higher-than-training-data). It seems RF does very well under normal conditions but could lose everything in a fat-tail, 2008-type situation.
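To make the limitation concrete, here is a minimal sketch (with made-up data, not the actual price series) showing that a forest trained on a simple linear trend cannot predict beyond the largest target it saw, because tree predictions are averages of training-set leaf values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: y = 2x on [0, 10], so the largest training target is 20.
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Ask for predictions well outside the training range.
X_test = np.array([[15.0], [20.0], [100.0]])
preds = rf.predict(X_test)
print(preds)
# Every prediction is capped near 20: leaves can only return
# averages of targets that appeared in the training data.
```

However far out you query, the inputs all fall into the same rightmost leaf of each tree, so the prediction flatlines instead of following the trend.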

So my question is: which machine learning methods do not fall victim to this shortcoming? Or is it possible to patch up RF to alleviate this issue (maybe by using a regime-change detector)?

Any help is much appreciated!

4 responses

Hi Warren,

Yes, it's certainly true that an RF can't extrapolate to regions it has not been trained on (or rather, it extrapolates in a very crude way). I suppose it's a philosophical question whether you consider that a bug or a feature. For example, you could argue that a higher-order regression is much worse because it is likely to behave extremely in regions without data. Personally, I'd rather have a constant prediction when I go to the edges of the known.

As to your question, any linear model will extrapolate (Ridge, Elastic Net, etc.), as will SVM regression and neural nets. The last two might be the most powerful tools at your disposal with that property. All the decision-tree-based methods, like RFs or Boosted Regression Trees, will not extrapolate.
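A quick side-by-side sketch of the contrast (toy data again): the linear model continues the trend beyond the training range, while a tree ensemble flattens out near the boundary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# Noiseless toy relationship: y = 3x + 1 on [0, 10).
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 3 * X_train.ravel() + 1

linear = LinearRegression().fit(X_train, y_train)
trees = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

X_out = np.array([[20.0]])          # far outside the training range
print(linear.predict(X_out))        # ~61, follows the trend
print(trees.predict(X_out))         # stuck near the largest training targets
```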

Remember, though, that linear regression models the conditional probability distribution, as opposed to the joint probability distribution (the multivariate domain).

Have you experimented with least absolute deviations regression? I often use it to address fit failures (large errors).

I am not, however, a big fan of Ridge Regression: it seems there is very little to back up the claims you hear ad nauseam about its great constrained optimization.

Thanks guys, I'll start looking at the details of the methods mentioned. I'm more familiar with time-series-analysis methods, so I'm still getting acquainted with the ML world and with which model fits my needs.

@Thomas I agree with your sentiment to a certain extent, but I also feel that one very bad day in a month can wipe out the other 29 good days (or the 364 good days in the year!). RF is still a great tool, but it seems to me that you need a way to know exactly when it breaks down and ceases to be relevant.

@Steve I was looking into using MAE instead of MSE as the error measure (which became available in scikit-learn 0.18), as it seems to help with the outlier issues. Is that what you were referring to?

@Warren: That is one of the reasons I like Bayesian machine learning techniques: by also providing uncertainty in the prediction, they can tell you "I don't know, never seen that before", rather than being forced to always provide an answer.
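A small sketch of that "I don't know" behaviour, using a Gaussian Process on toy data (one of the simplest Bayesian regressors available in scikit-learn): the predictive standard deviation is small near the training data and grows toward the prior as you move away from it.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy data: a sine wave observed on [0, 10].
X_train = np.linspace(0, 10, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_train, y_train)

# return_std=True gives the per-point predictive standard deviation.
_, std_in = gp.predict(np.array([[5.0]]), return_std=True)    # inside the data
_, std_out = gp.predict(np.array([[25.0]]), return_std=True)  # far outside
print(std_in, std_out)   # std_out >> std_in: the model flags the extrapolation
```

You could use exactly that kind of uncertainty estimate as the "regime change detector" Warren asked about: refuse to trade (or fall back to a default) whenever the predictive uncertainty crosses a threshold.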