Q paper - All that Glitters Is Not Gold: Comparing Backtest and Out-of-Sample Performance on a Large Cohort of Trading Algorithms

All that Glitters Is Not Gold: Comparing Backtest and Out-of-Sample Performance on a Large Cohort of Trading Algorithms

Thomas Wiecki
Quantopian Inc.

Andrew Campbell
Quantopian Inc.

Justin Lent
Quantopian Inc.

Jessica Stauth
Quantopian Inc.

March 9, 2016

Abstract:
When automated trading strategies are developed and evaluated using backtests on historical pricing data, there exists a tendency to overfit to the past. Using a unique dataset of 888 algorithmic trading strategies developed and backtested on the Quantopian platform with at least 6 months of out-of-sample performance, we study the prevalence and impact of backtest overfitting. Specifically, we find that commonly reported backtest evaluation metrics like the Sharpe ratio offer little value in predicting out of sample performance (R² < 0.025). In contrast, higher order moments, like volatility and maximum drawdown, as well as portfolio construction features, like hedging, show significant predictive value of relevance to quantitative finance practitioners. Moreover, in line with prior theoretical considerations, we find empirical evidence of overfitting – the more backtesting a quant has done for a strategy, the larger the discrepancy between backtest and out-of-sample performance. Finally, we show that by training non-linear machine learning classifiers on a variety of features that describe backtest behavior, out-of-sample performance can be predicted at a much higher accuracy (R² = 0.17) on hold-out data compared to using linear, univariate features. A portfolio constructed on predictions on hold-out data performed significantly better out-of-sample than one constructed from algorithms with the highest backtest Sharpe ratios.

Wiecki, Thomas and Campbell, Andrew and Lent, Justin and Stauth, Jessica, All that Glitters Is Not Gold: Comparing Backtest and Out-of-Sample Performance on a Large Cohort of Trading Algorithms (March 9, 2016). Available at SSRN: http://ssrn.com/abstract=2745220 or http://dx.doi.org/10.2139/ssrn.2745220

76 responses

very interesting, thanks for sharing

Hi Thomas, et al.,

I read through the paper to get the gist of it. Regarding overfitting, my algo that won the first Quantopian contest is a nice example (see https://www.quantopian.com/posts/winning-algo-drops-below-$90k?c=1). You mention it in the paper. I picked the stocks by googling for lists of stocks that had performed well over the backtest period required by the contest (e.g. "best performing stocks of 2013" - that sort of thing). There was no basis for this approach, other than to give the long-only strategy a running start to produce good backtest results. I did fine-tune one parameter, context.eps. I adjusted it to minimize the drawdown over the backtest period (drawdown was highly sensitive to context.eps while Sharpe, volatility, alpha, beta, and return were largely unaffected). The strategy performed splendidly for the 1-month OOS contest period, and then fell apart as the overall market deteriorated. In my opinion, the contest rules at the time strongly incentivized over-fitting backtests and long-only strategies. The prospect of risk-free gambling with $100K of Quantopian's money was enticing, by design. At the time, my sense is that Quantopian needed publicity and engagement in the contest, and so it was a logical gamble on their part.

One fuzzy thought on the overfitting conundrum would be to provide guidance and tools for letting the computer do the work that is inevitably biased by humans. Give it all over to HAL 9000 and let him sort it out. To my knowledge, there is no practical way to do this sort of thing on Quantopian on a rolling basis. Years ago now, there was talk of enabling walk-forward optimization on Quantopian (see http://blog.quantopian.com/parameter-optimization/). And also some hints of parallel processing (see http://blog.quantopian.com/zipline_in_the_cloud/). I guess pipeline and Q2 are steps in the right direction, but the path toward modern, high-performance computing is still not clear. Maybe the cost-benefit analysis doesn't justify it? And maybe it would only make the overfitting problem worse?

Another thought is that perhaps overfitting is not a problem for Quantopian? Just use the 6 month (or more) OOS data to pick algos for the fund. Ignore backtests altogether, if they are useless as the paper seems to suggest. You may be fooling yourselves when an algo performs well OOS and the backtest is consistent. Maybe it is consistent just by chance? With 60,000+ monkeys on typewriters, maybe you get a handful of nice sonnets for your collection by just using the OOS data?

Grant

Hi Grant,

Optimization: As you say, it's very possible that allowing parallel and efficient parameter search would exacerbate the problem as now every strategy can look good in a backtest by flipping the overfitting-switch. The walk-forward optimization might provide some help but it's probably not a silver bullet either. I think pipeline solves a large part of the problem. Once you have your factors in an array of reasonable size, fitting an ML model to it is not all that computationally costly and can already be done.

Selecting strategies: This is exactly what I'm thinking about. A backtest is useful in evaluating whether a strategy is built properly -- is it beta-neutral, does it take large sector/stock bets, rebalance-frequency etc. Whether something is actually interesting is probably best evaluated on unseen data. We're just starting with a new research project that tries to determine how to best make use of the OOS data. For example, are 6 months always required or can we be more aggressive?

I do advise everyone on Quantopian to adopt a more principled approach. First, run fewer backtests and do more validation in the research environment. Second, if you do run backtests, keep a hold-out set of the most recent 6 months that you only run on once, at the very end. If the strategy doesn't do well there, discard the idea.

Best,
Thomas
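The hold-out advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_holdout` and `annualized_sharpe` are hypothetical (they are not part of the Quantopian API), the hold-out is taken as the last 182 calendar days, and the risk-free rate is assumed to be zero.

```python
from datetime import date, timedelta
import math
import statistics


def split_holdout(daily_returns, holdout_days=182):
    """Split (date, return) pairs into a development part and a hold-out part.

    The hold-out part is the most recent `holdout_days` calendar days
    (roughly 6 months); it should be evaluated once, at the very end.
    """
    last_day = max(d for d, _ in daily_returns)
    cutoff = last_day - timedelta(days=holdout_days)
    development = [(d, r) for d, r in daily_returns if d <= cutoff]
    holdout = [(d, r) for d, r in daily_returns if d > cutoff]
    return development, holdout


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)
```

The idea would be to tune only on `development`, then compute `annualized_sharpe` on the hold-out returns a single time and drop the strategy if it fails there.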

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Thanks Thomas,

Ha! Well, on https://www.quantopian.com/about, Quantopian touts "So far, our users have performed more than 6,500,000 years' worth of backtests!" And you are saying (and supporting with your research) that the backtests are pretty much worthless. And your financial backers must be wondering why they are paying for them. One would have to wonder if all of that computing power could be re-directed more productively.

I have to speculate if Quantopian, other than making money off of your fund, is thinking of monetizing its growing user base of 70,000+? For example, if you want to grow the number of users, and have them make lots of trades, then running lots of backtests and over-fitting might be beneficial. It is more of a gambling venue model. If everyone took a principled approach, would your business model work? Are you just a crowd-sourced fund? Or are you relying on a parallel business model, too?

The research environment is great, but it is in the stone age as a computational platform. I realize that you can't deploy an expensive platform to the masses, but would it be possible to do some data crunching on a separate HPC platform, to aid users? Share the results as guidance for strategy development. Just a thought. It would be kinda cool.

Best wishes,

Grant

Frankly the entire history of stock market data is insufficient to establish whether or not a given strategy will work going forward. Please do not mistake my comment for levity. I have first-hand experience of curve fitting stretching back over many years and have made many costly errors. But survived well.

My caution and scepticism is now extreme. To the extent that my only real belief in market strategies is in asset allocation.

And even then you will certainly not be able to achieve the super smooth ATM effect almost all neophytes are seeking.

In the long term all things are born and then die. Most stocks end up worthless and only the stock index can give the impression of continuing growth.

And a correct impression until and unless the two to three hundred year old age of enlightenment ends.

Look at the short termism in the industry. Look at the lack of performance persistence over the long term in every form of investment product.

Hedge and active mutual funds are a waste of time and money over the long term unless they follow a simple strategy to follow asset growth. The more mechanical that can be the better.

The vast majority of participants in the financial markets are fooling themselves and their clients.

But well done Q on that paper!

Grant,

I have to politely disagree with your characterization of Thomas' answer here. He did not say "backtests are pretty much worthless." Rather, he pointed out that, from our perspective looking to select outperforming algorithms, backtests are "useful in evaluating whether a strategy is built properly".

From the quant's perspective of course backtests are good for a whole host of reasons, they let you test your idea on historical data in a systematic and hopefully deterministic fashion. They help invalidate ideas that are patently unprofitable. They can help you catch edge cases and misbehavior in your algorithm that you might not have noticed by just watching a short period of paper trading, and so on.

I think the key insight, and we are certainly not proposing that this is a novel insight of ours, is that while in-sample data is great, you really can't trust a model until you validate it on hold-out data. And as Anthony says, even then there are always grounds for healthy skepticism.

Jess

In addition to Jess' points, I do think there is a silver lining provided by the ML results. The predictability of a backtest is much much higher if you look at a (non-linear) combination of various traits. The patterns that seem to emerge are that tail-behavior and the amount of backtesting you do are informative. Granted, this is just a first step and more analysis needs to be done but a backtest-derived metric combined with OOS data (which we haven't done yet) might improve this even further. For me, the read of our paper is that you can't take a backtest Sharpe ratio at face value -- something that has been suggested for a long time and is now confirmed empirically. That, however, does not automatically suggest that a backtest is not informative and that the practice is useless.

I wish there was a "random" button on the backtester. Something that allows a backtest to be run on a completely unknown time period. Would be nice to preserve the purity of the in sample data.
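A "random" button of this kind could look roughly like the sketch below. It is purely illustrative: `random_backtest_window` is a hypothetical helper, not a feature of the Quantopian backtester, and it only picks the dates; running the backtest on them is left to whatever tool is in use.

```python
import random
from datetime import date, timedelta


def random_backtest_window(first_day, last_day, window_days, rng=None):
    """Pick a random contiguous window of `window_days` days inside
    the inclusive range [first_day, last_day]."""
    rng = rng or random.Random()
    span = (last_day - first_day).days + 1
    if window_days > span:
        raise ValueError("window is longer than the available history")
    # Offsets 0 .. span - window_days keep the window inside the range.
    offset = rng.randrange(span - window_days + 1)
    start = first_day + timedelta(days=offset)
    end = start + timedelta(days=window_days - 1)
    return start, end
```

Passing a seeded `random.Random` makes the draw reproducible, which helps if the same unseen window has to be revisited later.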

Hi Frank,

I recently came to the realization that the setup we have created with the partner data can act as a nice forcing function on this. Each partner's data set has a time period for which you're required to subscribe in order to get access. For some it is 1 year, for others it is 2 years. So you can build and backtest your strategy with the free sample and then run a backtest on the held-back data once you've subscribed (and then also live trade, if you wish).

Hope that helps,
Josh

Good idea Josh. Thanks.

It's really wonderful to see a formal write-up on this topic; any plans to submit to a journal for peer-review?

I am of the opinion there are relatively few sources of alpha that are material and long lasting enough to be worthy of a large fund. The backtesting in my mind is a good way to establish how efficiently you are mining the alpha, and how pure it really is: have you rediscovered a combination of existing risk premia? If you're getting a huge Sharpe ratio, then it's probably overfitting. If your return is huge, without drawdowns, then your backtest may not cover the right market conditions to really test it.

@ Jess and Thomas,

I'll have to give the paper another read, when I get the chance. I just recall a bunch of blobs of data, and even after what sounded like a heroic effort, an R^2 ~ 0.2. Nothing to get too excited about, but maybe you find it informative. Yeah, perhaps "worthless" is a strong word, but if you have to wait 6 months or more to invest a dime, then there would seem to be an awful lot of latency in the Quantopian discovery process. But maybe it is standard practice in the industry, even if you have professional quants cranking out algos?

By the way, I am reading Systematic Trading by Robert Carver. He suggests that persistent Sharpe ratios of greater than about 1.0 are unrealistic ("In reality SR consistently greater than 1.0 are rarely achieved, even by sophisticated institutional investors" p. 47). What if you filter out backtests with Sharpe > 1.0? Are the backtests then more predictive of OOS performance? Maybe there is a cohort of your users who aren't over-fitting and you can identify them by rejecting unrealistic backtests.
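The filtering question above could be checked along these lines. This is a minimal sketch, not the paper's method: `r_squared` and `r_squared_below_cap` are hypothetical helpers, R² is taken as the squared Pearson correlation between backtest and OOS Sharpe, and the input pairs would have to come from real backtest and out-of-sample records.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def r_squared_below_cap(pairs, sharpe_cap=1.0):
    """Keep (backtest_sharpe, oos_sharpe) pairs whose backtest Sharpe is
    at or below `sharpe_cap`; return (number kept, R^2 on the kept pairs)."""
    kept = [(b, o) for b, o in pairs if b <= sharpe_cap]
    return len(kept), r_squared([b for b, _ in kept], [o for _, o in kept])
```

Comparing `r_squared` on all pairs against `r_squared_below_cap` on the filtered pairs would show whether rejecting "unrealistic" backtests makes the remainder more predictive.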

Sharpe ratios of >1 can be achieved. Not just by mining alpha, but by diversification and good risk management. In finance you learn the hard way that the only free lunch is diversification. And you need to use it to the maximum possible extent. You need to run 100's of strategies across multiple horizons, asset classes and geographies and alpha sources. There is a well known fund with >$35 billion AUM that has a Sharpe ratio of >2 since inception in 1990. They have 300+ diversified trading teams, each running multiple strategies. Old alphas will die, new ones will show up. They also practice good risk management by having strict drawdown limits. Trading teams at that fund do have turnover, but each team manages a small capital allocation. That's how they stay in business.

I was taught that there is no such thing as a free lunch....So I would label diversification and good risk management as the next best thing...a tax deductible lunch, tip included :)

I think one thing Q misses out on, in terms of establishing likelihood of overfitting, is knowing the number of free parameters in an algorithm. Only the algo writer knows this, and to be honest, it's easy to overlook implicit parameters, such as: asset selection, lookback for risk measurement, number of positions held, etc. I work hard to pull these out, and test each one for robustness by seeing if a range of sensible values produces similar behaviour. It would be cool if the platform could do this automatically. It could also give Q some insights into overfitting as it would see both the number, and potentially the nature, of the free parameters. We could also run several algo variants, with different parameters, to achieve some small degree of diversification.

One has to wonder if the whole idea of black-box algos is doomed. Personally, I don't see the point. As I understand, Q is looking for institutional-grade, diversified algos that scale to ~$5M-$25M each--not the kind of thing individuals would be putting their own money into in the first place. So, it's not like users would develop the algos for their own capital. So what would be the risk of sharing strategy details and code with Q? There is a case of the strategy being lifted from a user's professional experience, going against an NDA. So, there would be additional risk to the user if the strategy details were revealed. And perhaps this keeps Q out of legal hot water, as well.

What if Q killed the sacred cow of algo secrecy, and looked at code, and collaborated with users? Instead of collecting thousands of black-box algos, in a massive social engineering and data mining effort that results in a bunch of fuzzy plots that are hard to interpret, engage a more limited set of users at a fundamental level, with your very best people. Simply require transparency, and in exchange, be more transparent. And figure out how to reward participants who make incremental contributions, versus "You too could be the next Quantopian manager! Join the select few, and make millions!" (which of course incentivizes over-fitting). But maybe I'm missing some basic "trader" psychology here.

I would certainly be more interested in Grant's approach. I have been badly burnt over the years by people wanting something for nothing. So much of my time has been taken up by fools promising much and delivering nothing. I approached Q sometime last year about the possibility of hosting some algos for rent on the website but at the time that did not fit into the plans. I am not interested in Q's current business model but admire what they have done and appreciate the skills the team has to offer. And indeed the skills evident in various other participants. All it needs is a way for participants and Q to profit from the Enterprise. And to distinguish this excellent site from the bunch of frauds and snake-oil salesmen who dominate the trading and investment world.

Personally, I like the current approach. I know I would not have participated if Q had open access to my code. After all, all I can offer is an idea, a small edge. I will never know if Q take it from me, but if they did, I would have nothing. Their current approach as described, and in subsequent interactions, gave me the confidence they would not do that.

With a bit of work, they can work around the black box nature of our algos. They can ask: what factor/inefficiency/risk premia are you going after? They can then check this in and out of sample using linear regression against other factors/algos. They can assess if this is additive to the fund, or more of the same. They can assess if the algo is efficient vs other algos going after the same factor. They end up with a collection of alpha sources they can readily explain to investors.

Speculating some more, I reckon huge hedge funds like Renaissance are so protective and secretive of algos internally that their setup won't be a million miles away from Q. Why would the capital allocation and risk management team actually need to see the code of an algo when they can see the return stream?

I also have a different opinion on diversification than some of the comments above. Diversification (across factors or asset classes) is THEIR job, not our job. Sure, if I wanted to manage my own money, I would do it and get my free lunch, but this is not the case when submitting an algo to the fund. I'm sure they want some pure source of alpha, unmuddied. They can then put these sources together in different ways, depending on the investor preference.

As far as the business model goes, I've yet to grasp the "crowd-sourced" concept. The read is that funding may be available to a select few (30 or so people, perhaps). Basically, it becomes a recruiting effort. The "managers" sign agreements and then Q can start working with them formally on their next genius ideas (perhaps still in a black-box fashion). It is not clear how the remaining 70,000+ users fit in, unless there is a parallel business model. From the paper, it sounds like the first attempt at crowd-sourcing fell flat. Most users produce junk (myself included) and don't justify paying the electric bill to run the servers. So, if I were the Q team, I'd be doing a head-scratch: "How can we get users to use the resources we are paying for more productively?" Or maybe they already have their 30 or so managers lined up, and are all set? As I said, the crowd-sourced concept is lost on me.

My head is full of trading ideas and factors, and I could write a notebook full if I had enough coffee. My problem is that I cannot code to save my life. I am a total hack. In my humble opinion the problem here is that Quantopian appeals to the programmer where they should try to appeal to the traditional finance crowd. I have discussed the Quantopian platform with my peers in class, and the hurdle remains that none of us can write code. We are Master of Science in Finance students, studying finance at the highest levels, but this platform is out of reach. My peers couldn't understand the simplest line: print "Hello, world!" #and I still struggle with py2 vs py3.....

Programmers can build a hyper-fitting algo in hours but will miss basic market fundamentals by miles. I've seen this in action in other threads, and even other code samples. Partnering with a programmer is a non-starter because I don't want to generate ideas and have them stolen. I am trying to learn Python as fast as I can, but I reckon that I am still a year or more away from being able to write a robust algo. Will Quantopian exist then as it does now?

@Jon learn Python! It's a great tool that can be used everywhere, not just Q. The actual code you need to write factors is minimal.

@Dan oh I am well on my way. 4 months in and I was able to incorporate scripting on a valuation project that saved me immense amounts of time. I've been forcing myself to use Python instead of Excel :)

@ Jon - Surprised they don't offer some type of coursework that includes Python or R in your program...

@frank we use Bloomberg terminals for security/portfolio analysis

Jon, gotcha....well I understand your sentiment completely. I think of Python as a travelling salesman approach to algo writing. It is the fastest language to learn, and probably as powerful as the rest...But it is still incredibly frustrating not being able to immediately test a hypothesis in the backtester! After hours of researching to get to the point where you can implement your code, your idea can and usually will fail miserably in the backtester. A waste of time, and more importantly...hope. I bet this drives a lot of the over-fitting that occurs in Q by finance professionals that are new to programming. Unwilling to label the time wasted as a sunk cost, one will ultimately fidget with their algo until it no longer reflects the economic logic they started with, but does do well in backtesting. Basically a lottery ticket to be deployed in the Open as a Hail Mary. This along with the hyper-fitting algos produced by more seasoned programmers probably does a good job of generating a qualitative summary of the data revealed in "All that sizzles is not steak"...excuse me, "All that glitters is not gold".

In regards to your issue, I guess there are really only two solutions...1) Find a programmer you trust, or 2) Keep working on learning Python. In the end, I am just thankful Q exists because without it, our ideas would probably be thrown out with the dirty diapers. Good Luck Man!

@ Jon - Indeed, there is a steep learning curve, and an assumption on the part of Quantopian developers that users are fluent in Python, and the API gets more complex with every release (I've been with Quantopian since the early beta days--officially, I think they are still in beta?). One approach is to post a strategy idea on the forum, in a form that can be easily coded. If certain steps are hard to explain in words, then use pseudo-code. Someone might just code it for you. Then you can clone it, and play around.

My working theory is that there is probably little if anything to keep secret in the world of finance. There may be some niche arbitrage opportunities, but they probably aren't that scalable. Is there any evidence that strategies have been leaked and their viability evaporated? I suspect much of this secrecy business is not justified, and is more part of trader psychology, akin to every alchemist thinking that he alone holds the key to turning base metals into gold.

Yes indeed Grant Kienhe......the Emperor has no clothes! Quite right.

Yeah, so the question in my mind is if the crowd-sourced concept has any merit, how can 70,000+ users all contribute and get paid in proportion to their contributions. Are there fund construction problems that would benefit from a global crowd? The paper suggests that asking them all to write individual, full-up algos might not be the most productive approach, since if there were any skill, backtests would be much more predictive of OOS performance.

Grant, to me the Quantopian manager's model is quite clear. Find people who can write profitable algos, invest in those algos. the rest of the users they don't really care about. only problem is that they haven't really found anyone or any*thing* sufficiently profitable, insofar as i know.

Regarding monkeys at typewriters, if you get enough monkeys, even with a 6 month paper-trading period, i think it's really easy to get shakespeare for 6 months and gibberish thereafter.

regarding collaboration through Quantopian. I don't think it'll work. 1. i think it's a logistic nightmare to reward players in such a game, since almost all players will believe they contributed more to the end product. 2. High quality player drop-out and low quality player drop-in. this is sort of demonstrated in the failures of communes that are not extended-family or religion based.

regarding secrecy, I think you're probably right. But I'm still going to waste a bunch of time in my secret data sorcery laboratory to prove to myself that you are right.

Dan. Regarding knowing free-parameters, I don't think it's possible, and i don't think it's helpful. for example, what if i included 4000 specific securities in my universe? how many free params is that? what if i included just 1, aapl.

Jon. Like others have said, learning python yourself is probably the easiest way. what i think you should be prepared for is the surprising number of perfectly reasonable and clever ideas that don't work.

regarding over-fitting. Am I the only one who saves the last 2 years' worth of data for validation only? really simple right? do all of your development for 1992-2013, check your work with 2014-2016.

@Jon -

Partnering with a programmer is a non-starter because I don't want to generate ideas and have them stolen.

And if you're a programmer, partnering with a trader is a non-starter because you don't want him to take your code and make $gazillions while you get little or nothing.

To succeed, you need finance, math including probability and statistics, and programming skills. Either form a group of specialists, if you can find some who will trust one another, or get educated in all three and go it alone. There is no other way.

@toan I would say 4000 specific stocks is about 5 free parameters. They are going to be so correlated, and you've removed any event driven risk, so you're probably just making some calls on size, industry, growth, value etc. You would then just flex these parameters to see how sensitive your algo is to those. For example, split your 4000 stocks into small and large cap. Does it work equally well on both?

If you've just chosen AAPL, then you've got one free parameter that is very easily flexed. Does your algo work with another large tech company? Does it work on all tech companies equally weighted? Does it beat the Sharpe of just buying and holding AAPL since 2003 (>2.5)?

If you find your algo falls apart with small flexes to the free parameters, then you've overfitted. If it's robust, can you find a way to remove that free parameter?
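The "flex the free parameters" check described above can be sketched as follows. The names `flex_parameter` and `is_robust`, the ±10%/±20% steps, and the 25% tolerance are all assumptions made for illustration; `score_fn` stands in for whatever runs the backtest at a given parameter value and returns a score such as the Sharpe ratio.

```python
def flex_parameter(score_fn, base_value, rel_steps=(-0.2, -0.1, 0.1, 0.2)):
    """Score the algo at `base_value` and at nearby values.

    Returns a list of (value, score) pairs; the first entry is the base case.
    """
    values = [base_value] + [base_value * (1 + step) for step in rel_steps]
    return [(v, score_fn(v)) for v in values]


def is_robust(results, max_rel_drop=0.25):
    """True if no flexed score falls more than `max_rel_drop` (as a
    fraction of the base score) below the base score."""
    base_score = results[0][1]
    if base_score <= 0:
        return False  # nothing worth protecting if the base case is poor
    worst = min(score for _, score in results[1:])
    return (base_score - worst) / base_score <= max_rel_drop
```

A score that collapses as soon as the parameter moves by 10% is the overfitting signature described in the post; a flat response suggests the parameter could perhaps be removed altogether.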

@Grant it's a recruiting effort, where the downside of missing out on the gig is not so bad. It doesn't cost you any money, and you learn a great deal on the way. It's better than the Hollywood system, where the film stars earn the big bucks, and everyone else busses tables.

@ Dan H

it's a recruiting effort

It would seem so. The aspiration is to get to $10 B, so assuming $5 M per manager, that's still only 200 managers out of 70,000, or 0.3% (an upper limit, since by the time the Q fund gets to $10 B, there will be more than 70,000 registered users). Of course, there could be a lot of manager turn-over, which perhaps is part of the model--to simply draw a new candidate from the large pool. But as the paper suggests, the pool may only have a few keeper fish. [CORRECTION: To get to $10 B @ $5 M per manager, 2000 managers are required, not 200. So, the upper limit participation rate would be 3%. Guess it could work, but think about it--you've got 2000 people from around the globe, having written black-box algos, managing $10 B. I'm still not grasping the crowd-sourced concept.]
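The corrected arithmetic is easy to verify. The figures below are the ones quoted in the post ($10 B target, $5 M per manager, 70,000 users); the variable names are just for illustration.

```python
fund_target = 10_000_000_000  # $10 B aspiration quoted in the post
per_manager = 5_000_000       # $5 M per manager, as assumed in the post
users = 70_000                # registered users at the time

managers_needed = fund_target // per_manager
participation = managers_needed / users

print(managers_needed)                # 2000
print(round(participation * 100, 1))  # 2.9 (percent, i.e. roughly 3%)
```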

I'd thought it would be kinda cool if something other than a recruiting effort could be done. For example, for the 888 algos in Thomas W.'s study, could they somehow be cobbled together into a long-short portfolio, treating each as an instrument? Something along these lines was my initial impression of the crowd-sourced fund concept.

@ Toan -

There have been allocations. See https://www.quantopian.com/posts/quantopians-first-discretionary-capital-allocations.

Regarding "...the rest of the users they don't really care about" frankly I don't know. My assumption is that if they can make money in other ways off of the rest, then they won't turn it down.

Dan, it appears that in quantifying "free parameters," you've followed up the solution with more questions. My point here is that this strategy of counting params seems to bring more confusion than clarity. Moreover, I had to read your paragraph regarding the 4000 securities 5 or 6 times, because your assumption of what some people would do with that set is very different from what I'm doing with that set, and still at this point, I don't really understand it. Wouldn't defining over-fitting as an algo that does well with in-sample data and poorly on out-of-sample data be a better metric than saying "I think this is 5 free params"?

Grant, thanks for the interesting link. i would say that they would care about utilizing the rest of the field if they could make more money than the simple approach after you subtract the cost of execution of said venture. when i said that they "don't care," i meant to say the (extra alpha)*(prob. of success) is sufficiently low that imo, they should not seriously consider it.

Andre', i don't think it's quite true. there are so many programmers out there more than happy to partner up with an ideas guy (since they really don't have any ideas of their own!), but this is not to say that i disagree with the gist of your statement--it's just that i tend to believe that ideas are mostly worthless (and i'm an ideas guy).

One thought here is that there is probably an interplay between the perception that Q is looking for exceptional (potentially unrealistic) algos and the over-fitting problem. I recall that there was a pretty wide range of performance that was desired, when the fund was first discussed. Something like SR ~ 0.7 and 7% annual return as still being acceptable. I have to wonder if the contest is providing the wrong kind of feedback to participants. For example, looking at Kevin Q.'s algo stats from Contest 11 (https://www.quantopian.com/leaderboard/11/55e093d16c958400f40000b1), his algo ranked 60th based on backtest results (out of 99 with all "badges"), but 1st in OOS paper trading. His in-sample SR was ranked 313th. Looking over his other in-sample rankings, it kinda says "Hey, you have some work to do. You may not do so well in the contest." Perhaps he resisted the temptation to over-fit, or maybe just got lucky, or a bit of both. The point is that it is not clear that the feedback provided by the contest, based on backtest results and ranking is the best approach to discourage over-fitting. It would seem just the opposite, since naively, one would want to fiddle with a strategy to improve its backtest ranking. If it is basically a contest to see who is the best over-fitter, then it is working against the objective of getting a decent match between OOS performance and backtested performance. Quantopian may be getting what they ask for, by providing the wrong feedback. Maybe there needs to be a "likelihood of being overfit and unrealistic" metric for the contest?

Grant, if we can create a working "likely over-fit" detector, i'd be all for it, but really i think the easiest way to encourage proper algo dev is to separate the data into two sets, a back-testing set and a validation set. you can back test on all data up to the last 3 years' worth, and when you think you've got good results, you can hit the "submit algo for validation" button. this one should run like a paper test but could return the results overnight.
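A rough sketch of the split proposed above, assuming the backtest result is available as a date-indexed pandas Series of daily returns. The helper name `split_holdout` and the synthetic data are hypothetical, and nothing here reflects how Quantopian actually stores or serves backtests.

```python
import numpy as np
import pandas as pd

def split_holdout(returns, holdout_years=3):
    """Split a date-indexed return series into a development part and a final holdout part."""
    cutoff = returns.index.max() - pd.DateOffset(years=holdout_years)
    development = returns[returns.index <= cutoff]
    holdout = returns[returns.index > cutoff]
    return development, holdout

# Synthetic daily returns on business days, for illustration only.
dates = pd.bdate_range("2006-01-02", "2015-12-31")
rng = np.random.default_rng(1)
returns = pd.Series(rng.normal(0.0003, 0.01, len(dates)), index=dates)

development, holdout = split_holdout(returns)
print(development.index.max().date(), holdout.index.min().date())
```

The split only helps if the holdout part is looked at once. Looking at it and then going back to tweak the algo turns it into in-sample data again.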

I think we should be careful and not jump to conclusions too early. As it is written in the paper, there was a significant market regime change in most of OOS data. Although it is indeed very interesting research, we should probably wait for 2-3 more years if not more to get solid results, IMO. Also, regarding the total number of backtests being predictive of worse results, I believe we should also be careful since we have absolutely no control over how a user organizes his algos. Because it is not possible to re-use code/classes, or do some kind of versioning ourselves (unless we use the not-so-evident way of looking at the code associated to a full backtest), I'm pretty sure most of us do a lot of copy/pasting between "algos", just to keep a progress record. I sure did. An isolated algo could look very good with only 10 backtests on it, but 20 to 30 other algos with hundreds of backtests could have been used to actually get there.

@Charles - True. Almost all my algorithms form a linear evolutionary tree.

Just looked. My latest contest algorithm has 13 backtests, plus some I have deleted. The total number is 562 backtests, though that includes some quick sketches, clones and replies to forum posts.

@ Toan - Q already provides a great "likely over-fit" tool. Check out Pyfolio.

@ Frank -

I don't think that there is any one tool, such as Pyfolio, that is a panacea to overfitting. I've been reading Systematic Trading by Robert Carver (http://www.systematictrading.org/) and I kinda gather that the overfitting problem is endemic to the industry and hard to avoid, even for professionals. If Pyfolio is used as feedback to improve a fit, without justification (e.g. by adding more free parameters), then it won't help. It's more of a process problem, akin to step-by-step, disciplined product development processes, with gate reviews by peers and stakeholders. Done right, you end up with great products, and the bad ones never see the light of day. Quantopian is at a disadvantage in this regard. They just get a big, steaming pile of algos, and have to dig through it to find the gems.

@ Toan -

The problem with dicing up the historical period into in-sample and out-of-sample regimes is that it still ends up being a process discipline problem. Once the out-of-sample test is run, if the performance is examined and used to tweak the algo, then the risk of overfitting creeps in. And if one is familiar with market conditions, world events, etc., within the out-of-sample regime, then those facts can creep into the strategy at the time of construction.

@ anonymous -

Perhaps there is a stand-out firm that has a SR ~ 2, but is it typical? Realistic? With a quick search, I found http://www.msrinvestments.com/Sharpe%20Ratios%20Reported%20by%20Hedge%20Fund%20Indices%20Underestimate%20Annual%20Standard%20Deviation.pdf. The table on page 5 is consistent with the guidance in Carver's book that SR > 1 may indicate a lack of realism. My guess is that Quantopian has grand aspirations and needs to claim that somehow their crowd-sourced approach will be different. In 10-20 years, we'll have the answer, I suppose.

A study of hedge fund performance:

Figure 9: Average Annual Return and Standard Deviation and also Sharpe Ratio for All Hedge Funds suggests that at least on average, a SR > 2 may be unrealistic.

@ Grant

I appreciate your sentiment and I am not trying to be argumentative, but...

"Quantopian is at a disadvantage in this regard." - Against whom? I would give my left pinky toe for a portion of the Q data.

Also, great article. Thanks for the share.

I kinda gather that the over-fitting problem is endemic to the industry and hard to avoid, even for professionals.

Yes, I am a professional and it is.

Most professionals in the industry have their noses planted right up against the window and thus have no focus. Stand back, look at history - and not just the past 30 years. If people got grown up about it instead of focussing on getting 2 and 20 out of clients NOW and then closing their fund when it almost inevitably crashes and burns we might have a better and more honest finance industry.

Case in point the biggest Singapore hedge fund a few years ago closed its doors down 60%. Who knows, maybe those in at the start got out with a profit. But the managers certainly did!

I don't think these people are necessarily either fools or knaves. I just believe in their hurry to achieve "loads a munny" they rush their fences and kid themselves and their clients.

"Where are the customers' yachts?"

@ Frank,

What I mean is that normally the stakeholders would have a say in the development process to be followed, and the deliverables at each stage. They pay their underlings, and so get to dictate how things are done, and can check the intermediate steps, with go/no-go gate reviews. Quantopian doesn't have this level of control. It just gets the output (although as the paper shows, it can use a form of surveillance to characterize the likely process followed, e.g. number of backtests run suggestive of overfitting).

Regarding giving your left pinky toe and a slice of your liver, the question is would you put up $23.8 M of your own money (see https://www.crunchbase.com/organization/quantopian#/entity)?

Q does get the algos, they just don't look at them. But they can run them as black boxes. And can profile users, to augment the information provided by algo exhaust.

@ Grant

So if I understand correctly, you are drawing a comparison between a more conventional product development process (i.e. a widget), and Q authored algos as a form of a product? And your theory is that the algos as a product will be inferior to the widget because of the lack of transparency into the algo creation process relative to the widget creation process?

I was not really focusing on that aspect, I guess I was focusing more on the value of the aggregate data being sourced by Q, which I think is its own beast entirely.

Pretty interesting to think about Q's business model in general. So many different dynamics and possible opportunities.

your theory is that the algos as a product will be inferior to the widget because of the lack of transparency into the algo creation process relative to the widget creation process

@ Frank,

The point is that the crowd-sourced development process is different. There are books/seminars/gurus on how to develop products in the traditional fashion.
The Q approach is different in that they take in a slew of candidate products, and then have the task of finding the decent ones. For Q, the focus is on the individual contributor versus a team. Incentives are different, etc. The process is different but it doesn't necessarily mean that the product will be inferior, nor superior.

If it is more of a recruiting effort, a kind of test/filter, to get hired on (effectively), then as I suggested above, the next step would be to start working actively, in a traditional sense, with a smaller cohort of quants. Presumably, the black-box agreement stays in place for under-contract managers (no requirement to share code or strategy details), but I'm not sure it makes sense, unless the Q contract really does state that the algo is the free-and-clear intellectual property of the originator. Then, if things don't work out with Q, one could seek out other sources of funding, with some assurance that the details hadn't been revealed.

@ Anthony,

As a professional, what process do you follow to avoid overfitting?

I think Quantopian's approach is fine. They are taking a two-pronged approach of both researching new models for predicting OOS performance of algos based on their exhaust only, and author education and "cattle guards" like Pipeline and Data to encourage best practices and best chances for success. IMO they can't really start doing high-touch algo development with prospective authors until the very last stages; not only would they contaminate their pool with their prejudices (more than they already have with the long-short hedging factor stuff, regrettably), but that would present a major scaling problem for them, manpower-wise.

The fact remains that, as someone mentioned, to do well you really need to know finance, programming and maths, and there's really no way around that.
I am a little surprised that MSc's aren't expected to learn how to program, but nonetheless, I'd expect that many of the successful algo authors come either from a maths/science background, or from a programming background with a pickup in maths and finance. Those with a finance background might need a partner, but who knows!

Regarding the background required to be successful, it may be asking for a lot to have it all in one individual contributor, probably working in isolation. I have a decent background, and have been with Quantopian since the early days, and it is hard to keep up, on a casual, hobbyist, DIY/maker basis. I'd say if one can't program reasonably well, and be able to learn more complicated parts of the API (e.g. pipeline), it's gonna be hard to get going.

@Simon - That "someone" was me replying to Jon above. Thank you for noticing.

@Grant - It's going to be even harder if Quantopian keeps changing the API under us every couple of months at very short notice and disqualifying algorithms that don't use the latest version.

@ Simon @ Grant @ Andre

Probably tripping over a quarter to pick up a dime on this one, but I would add economics as a good background skill as well (I don't think econ and finance should be lumped in together). I wonder if the term "Quantonomist" has ever been used.....

As I continue reading Systematic Trading by Robert Carver, he repeatedly emphasizes that Sharpe ratios (SR) > 1.0 are probably unrealistic. Any more feedback from the professionals out there if it is a strong indication of dope smokin' if SR ~ 2-3?

@ Thomas,

I do advise everyone on Quantopian to adopt a more principled approach.

I'd be interested in guidance and examples of how to do so-called bootstrapping in Quantopian. Along the same lines, I think, would be Monte Carlo type simulations. I'm not sure that running fewer backtests is necessarily the right guidance.
It seems that if done properly, a relatively large number of backtests could be used to understand sensitivity to parameters and the statistical validity of a strategy.

Also, I gather that something like 10-20 years of data is required to understand what's going on with a given strategy. So, encouraging a 2-year backtest for the contest may be enabling overfitting. It is just not enough data, in most cases, as I understand. Unless one really has the inside scoop on an actual market inefficiency (and would also have the inside scoop on when it is no longer extant).

Grant

I never really look at Sharpe to be honest but looked a lot at MAR: CAGR / max DD. Among CTAs the average figure is horribly low. Maybe 0.25, i.e. a 10% CAGR gives a 40% max DD on average. Something like that anyway. CTA Dean Hoffman did some research a while back. This may help to show what useless nonsense short-term backtests showing high risk-reward ratios are!

@ Anthony - Thanks. Sounds bad. To get a 10% return, I'd have to suffer a 40% drawdown.

I'll try and dig out the research for you. I think it was put out on the private Trading Blox forum. IASG.com is a good site for CTA returns but suffers from survivorship bias.

Yes, I fear that this sort of ratio very probably holds for HFs as a whole and not just CTAs. Could be the subject of a book! The industry as a whole is so laughably disappointing. No wonder something like 30% of US institutional money is simply indexed. Good old Bogle!

If I were Q I'd flip the privacy flag: a freemium model where open is free and you have to pay a subscription to run algos without disclosing the source. Then you get both some funding from the hermits, and self-selection: I'd expect the bulk of solo players obsessed with privacy to produce mostly random number generators and mediocre variants of well known things (it's extremely difficult to have an original idea in this field, but it's very easy to reinvent something and be too proud to look up prior art).
The average quality of the open algos should be significantly higher due to that and the productivity leverage from collaboration. As to how you reward the open-space collaborators: they get an immediate reward in terms of free CPU cycles and a meeting place, and in getting to see algos they can run with off-platform money. There's a theoretical risk of free-loading but then judging what a good algorithm is is pretty tricky (the title theme of this thread after all) so a freeloader looking at all the open work will be none the wiser...

pay a subscription

Well, the question is how much would your typical user be willing to pay? I certainly wouldn't pay for the privilege of working as a Q fund developer. Would anyone? And for personal trading, the issue is, how much could they charge? Say your typical account size is $30K. Would the market bear 1% per year, or $300? That's a pretty big hit to returns.

A related question is why would anyone pay for the premium data offered by Q? The business model is that Q is building a crowd-sourced fund. So, why would I be willing to pay for data? Unless there is more to the business model? Not sure how Robinhood fits in either, unless the idea is that Robinhood would somehow attract users who would also write fund-worthy algos? All not so clear to me.

Without reference to Q, whose research I find extremely valuable, in general people always want something for nothing. I was contacted by a bunch of conference organisers the other day who asked if I would give a 45 minute presentation and then separately a 1 or 2 case course. Based on my considerable work over the years with ETFs. I asked what fee they would pay me. Funny old world, I never heard back. Is this just a crap industry or is all commerce like this? These guys charge £2000 a day for attendees. And the speakers get paid sweet FA. Funny old world indeed!

@ Anthony - Well, maybe they figure whether they pay you or not, you'll just be promoting yourself/company.
So, they might as well not pay you. Just guessing.

Regarding Bogle, I've read some of his books. One point he made is that the corporate structure matters. As I recall, he set up Vanguard with a structure that was intended to put incentives in the right place (see https://about.vanguard.com/what-sets-vanguard-apart/why-ownership-matters/). Maybe Q will eventually go this route, but I doubt it. Presently, I have to imagine, it is all about the investors getting their $23.8 M back. My bet is that Q will repeat sins of the past, due to the fundamental structure of their business. Seems inevitable, but I could be wrong.

@Anthony it's the market for attention. If more people want to speak at a conference than there are speaker slots, the price of a speaker slot should be negative. Many content markets are like that, you get paid in units of fame and glory. If you're not interested in that, you can at least understand content buyers will tend to pick the guy behind you who is.

Norbert, hilarious. I did not realise this was common practice. Needless to say since I have nothing to sell they can go whistle.

@Grant re sharpe ratios,

Carver's book is great and I think he paints a pretty honest picture. To give things a little more perspective, RenTec's Medallion Fund, probably the best performing hedge fund of all time up to this point, had an annualized Sharpe ratio of 1.68 (net of fees) from 1993 to January 2005. This is while averaging something like 35-40% annually (again, net of fees) with only 3 losing quarters and ~89% of all months being profitable...

@ Graham -

That's kinda what I thought. Seems like a good sniff test. If an algo has a SR > ~ 1 in backtesting, then it might be too good to be true, and the model may be overfit. It'll be interesting to see how the contest plays out over the next year. The recent contest winner (https://www.quantopian.com/posts/contest-11-winner-kevin-quilliam) had a backtest SR = 0.8922, and a paper trading SR = 1.598, with annual returns of 5.458% and 9.333%, respectively (see https://www.quantopian.com/leaderboard/11/55e093d16c958400f40000b1).
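On the bootstrapping question raised earlier in the thread, and on how much weight a Sharpe ratio measured over a few months of paper trading can carry, here is a minimal sketch of a percentile bootstrap for the Sharpe ratio. It runs on synthetic daily returns, not on any contest data, and it resamples days independently, which ignores autocorrelation and volatility clustering. A block bootstrap would be the usual refinement. All names are hypothetical.

```python
import numpy as np

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio (risk-free rate assumed to be 0)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def bootstrap_sharpe_ci(returns, n_boot=2000, alpha=0.05, seed=0, periods_per_year=252):
    """Percentile bootstrap interval for the Sharpe ratio (i.i.d. resampling of days)."""
    rng = np.random.default_rng(seed)
    r = np.asarray(returns, dtype=float)
    # Each row of idx is one resampled "history" of the same length as the original.
    idx = rng.integers(0, len(r), size=(n_boot, len(r)))
    samples = r[idx]
    stats = np.sqrt(periods_per_year) * samples.mean(axis=1) / samples.std(axis=1, ddof=1)
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

# About six months of synthetic daily returns.
rng = np.random.default_rng(42)
six_months = rng.normal(0.0004, 0.01, 126)

point = sharpe(six_months)
low, high = bootstrap_sharpe_ci(six_months)
print(f"Sharpe {point:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

With only about six months of daily data the interval comes out several Sharpe units wide, so a gap between a backtest SR and a paper-trading SR of the size quoted above is hard to distinguish from noise on its own.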

Maybe I missed it after a quick read-- did the paper address the heteroscedastic tendencies in their regressions? I didn't see anything. It is intuitive to grasp why results would scatter proportionately over a more elongated backtest--but clearly there is a conclusion that can be drawn from the divergence..

@ Daniel -

Looking at SR in Figure 1, the IS is centered around ~ 1, and the OOS is ~ 0, so by eye, it looks to me that there was a systematic shift in SR, as a population.

Hi Thomas

With regards to this paper you guys published All that glitters is not gold

I scanned through it and also watched your presentation on youtube. I am wondering what you did with regards to slippage in the backtesting process. I found very little mention about that in the paper, and so i was wondering what the slippage/commission/transaction costs parameters were in the backtesting process of the IS & OOS periods respectively.
My intuition is that a big part of the low predictive power of an IS backtest to true OOS execution very much depends on whether one has tested the stability of the performance of the backtest to shifts in slippage/transaction costs/bid-ask spreads.

As a thought, maybe you should devise an IS measure of the change in Sharpe (or performance or weighted avg of a bunch of "predictive" factors) with respect to a change in slippage/transaction cost/spreads mode, and see what the predictive power of that is to OOS performance.
(If you have already done that, could you give more insight into the results you got?)
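One way the suggested measure could be sketched: recompute the Sharpe ratio of the same gross return stream under a grid of cost assumptions and look at how fast it decays. This assumes a deliberately crude cost model (cost per day equals turnover times a flat rate in basis points), which is not the Quantopian slippage model, and it uses synthetic data. The function name and the inputs are hypothetical.

```python
import numpy as np

def sharpe_vs_cost(gross_returns, turnover, cost_bps_grid, periods_per_year=252):
    """Annualized Sharpe ratio of net returns for each assumed cost level (in bps of traded value)."""
    gross = np.asarray(gross_returns, dtype=float)
    turn = np.asarray(turnover, dtype=float)
    curve = {}
    for bps in cost_bps_grid:
        # Crude assumption: daily cost = fraction of the book traded * flat cost rate.
        net = gross - turn * bps / 1e4
        curve[bps] = float(np.sqrt(periods_per_year) * net.mean() / net.std(ddof=1))
    return curve

# Synthetic inputs: daily gross returns and daily turnover as a fraction of the book.
rng = np.random.default_rng(7)
gross = rng.normal(0.0008, 0.01, 1000)
turnover = rng.uniform(0.2, 0.8, 1000)

curve = sharpe_vs_cost(gross, turnover, [0, 5, 10, 20])
for bps, sr in curve.items():
    print(f"{bps:>3} bps: Sharpe {sr:.2f}")
```

The drop in Sharpe between the lowest and highest cost level (or the slope of the whole curve) could then be used as one more in-sample feature and tested against OOS performance. Whether it has any predictive power is an open question that this sketch does not answer.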

Hi Thomas

Also we can make another interesting point here that we have to be true to each strategy if we are to reach reasonable conclusions and give each strategy the merit it deserves. i.e. typically a strategy will tend to exploit certain characteristic(s) of the particular market mode/regime in order to gain a statistical advantage.
Example is if you have a mean-reverting trading strategy but the market regime is in a breakout mode/trend mode then this strategy is (in a statistical sense) not going to fare very well under the specific market behaviour.
So i think it would be very interesting to deepen your analysis and see what is the IS predictive power with respect to OOS_similarmarketmode & OOS_differentmarketmode.
Then the question arises as to how you classify an OOS period as to whether it's OOS_similarmarketmode or OOS_differentmarketmode.
I guess you somehow have to classify a given strategy as mean-reverting, trend-following etc, which i am not sure is that easy based on the data you have at your disposal. But assuming you can get over that hurdle you could do some pretty interesting analysis and see whether in a given period you have a statistically significant % of strategies of a given "class" that have performed better than others, and draw conclusions about the predictive power assuming the market regime remains similar or changes.

I hope i'm making sense.

I don't understand why there is such a high concern regarding algorithms working well with out-of-sample data (and backtests). Any good performing algorithm exploits some market inefficiency that is currently present in the markets, and it is perfectly normal that it does not work well even on backtests depending on the time frame (the inefficiency could just not be present in the period being tested, or was of lesser magnitude for instance). It is also therefore perfectly normal that even a very good algorithm stops working well after a while; after all, the inefficiency has to vanish at some point. It does not matter whether market participants are or are not looking specifically for it the same way the algorithm is, there is simply no endless mine of gold out there and if that is what you are looking for, I am sorry but you are not going to find it. That is why it is very important for any of us trying to produce something of value here to understand the economic substance of what we do, so that we can understand why what works works and when it is likely to stop working. People often prefer algorithms to "traditional" financial analysis because they think it removes the subjectivity aspect and provides a more "scientific" approach to investing but that is just not true, both walk together and for any successful algorithm ever (to be) developed, as soon as its economic explanation is found, analysts will come up with a thousand solutions to exploit it better (and faster) than you do - and then it will stop working.

Hi Luiz,

Being able to assess predictive power to OOS data is actually a deeper level of analysis that is very important when you are trying to run an institutional business and not have an algo that will only work for a month or so and subsequently will lose money.
To say it another way, suppose Quantopian has 2 strategies (that they cannot see the code and don't know what they do) and run a backtest for the same set of data and they get exactly the same headline statistics (Sharpe, performance, drawdown, drawdown recovery time etc); how can you tell which one of the 2 will actually perform better with real money? Which one would you invest your money in?
What's worse, how can you tell whether one of the 2 backtests is the real deal or has been overoptimized by the quant that developed it?
The point is it's very easy to overoptimize a set of data and come up with something that you think is the "bomb" only to put real money in it and it starts to lose money.

The paper that Thomas co-authored is trying to answer some of these questions and is actually a very interesting paper as it gives a lot of insight.

@Photi

Yes of course, I am not overlooking or disdaining the paper, on the contrary. Many of its conclusions are actually very much in line with what I said (and personally think), particularly "the more backtesting a quant has done for a strategy, the larger the discrepancy between backtest and out-of-sample performance". Like I said, it is normal (or expected) that a given investment opportunity ceases to exist after a while or did not take place before some point in the past. Thus, the longer the time frame of the backtest and the more "well tested" a strategy is, the more likely it is to be a result of overfitting rather than some real investment strategy. What I am saying is that relying too much on backtests and out-of-sample performance to identify successful strategies can be unproductive because in the end what we want is to differentiate luck from skill, and doing backtests and testing for out-of-sample performance is (sometimes) too indirect a way to do that (just assume for instance that, for a given strategy, both the backtests and the out-of-sample performance look very good: is that enough for us to correctly predict it will work well from now on? I don't think so, and I don't think that I would prefer such a strategy to one that does not perform well on the backtest and/or out-of-sample but for which I have a good understanding of why that is).

Regarding your example, if you have two strategies that have exactly the same parameters (including exactly the same returns at the tiniest time frame range for all points in time), for any time range possible, then both are the same strategy - the probability that they are different is virtually zero. If just "key" statistics inferred from the returns pattern are the same however, then that's another story. I can't tell you which would perform better with real money, but I do not think that this is a hard problem to solve (I mean, in the end what you need is a practical solution, and there is a practical solution for this problem). You would just need to find out the correlation between the two and then a simple MVO would tell you how much to invest in each. The challenge would be to find out the "true" correlation between them, but that would be actually a "pleasant" task to perform (because we know how to do it) - you could just test which (correlation) would produce the best performing results (from the inferred weights you would get by using each correlation value as input - which are bounded by -1 and 1), and pick this one. You are assuming from the beginning that the backtests have some predictive value on these two strategies, and this would thus be the most consistent with your assumption because it relies on precisely the same thing, a backtest of a composition of both.
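A small sketch of the two-strategy mean-variance optimisation (MVO) mentioned above, assuming the textbook unconstrained solution with weights proportional to the inverse covariance matrix times the expected returns, rescaled to sum to one. The inputs are made-up annualized figures and the function name is hypothetical.

```python
import numpy as np

def two_strategy_mvo_weights(mu, sigma, rho):
    """Fully invested mean-variance weights for two strategies.

    mu    : expected returns of the two strategies
    sigma : their volatilities
    rho   : assumed correlation between them (strictly between -1 and 1)
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    off_diag = rho * sigma[0] * sigma[1]
    cov = np.array([[sigma[0] ** 2, off_diag],
                    [off_diag, sigma[1] ** 2]])
    raw = np.linalg.solve(cov, mu)  # proportional to inverse(cov) @ mu
    # Rescaling assumes raw.sum() is positive, which holds for the inputs used below.
    return raw / raw.sum()

# Identical headline statistics, three different correlation guesses.
for rho in (-0.5, 0.0, 0.5):
    print(rho, two_strategy_mvo_weights([0.08, 0.08], [0.10, 0.10], rho))

# Same expected return, different volatility, zero correlation.
print(two_strategy_mvo_weights([0.08, 0.08], [0.10, 0.20], 0.0))
```

In this simplified setup, two strategies with identical expected return and volatility get a 50/50 split whatever correlation is plugged in. The correlation changes the risk of the combined portfolio rather than the weights, so the weights only start to depend on it once the headline statistics differ.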

Hi Luiz

I totally agree with many of your points, but when you are running an institutional fund and from Quantopian's view are relying on sourcing strategies from quants that you don't know and you don't have the IP of the strategies, you are trying to solve a complicated multivariate problem.
There is not one single factor to look at (Sharpe ratio, drawdown, etc) for deciding the selection strategy and the capital allocation over all those strategies; it is rather an optimisation problem.
So which factors are you going to optimise over? You should optimise over all factors that have predictive power/value for what your future returns will be like as a fund.

Our job as quants is to build very good algos and Quantopian's job is to develop a framework and model for selecting the strategies that will actually perform in real life. Ultimately this is something that benefits everyone, because if you have a "good algo" (however you define good) you want its merit to be quantified in such a way that you get a higher capital allocation (and thus get paid more) compared to a "bad algo" that looks good (i.e. has good stats just because it has been overoptimised, but would not necessarily perform well with real money).

The thesis presented by Thomas is just a small part of the puzzle that is solving how to do an efficient capital allocation.