Bug in consistency score?

I think there is either a bug in the consistency calculation between backtest and paper trade, or the logic is flawed. My understanding is that it should be based on how similar the distributions of out-of-sample daily returns and backtest daily returns are.

But look at some of the "most consistent" algos:

Name, score, out-of-sample return vs backtest return
Leaf Jiang, 0.957778084, 4.5% vs 100.8%
Grant Johnson, 0.957493622, 3.7% vs 43.7%
David Conroy, 0.953979891, -19.3% vs 16%
....

vs least consistent algos:

Rut Kid, 0.530878654, 15.4% vs 66.3%
Lucas Silva, 0.636548658, 11.7% vs 37.2%
Filippos Livieratos, 0.626546325, 0.2% vs -1%

The least consistent algos actually look more consistent to me.

I suggest Quantopian make the source code of the consistency calculation public, so that:
- Users can recommend improvements
- Users can check whether there is a bug
- Users can verify that their own consistency score is calculated correctly. Right now this is possible for all other metrics except the consistency score.


My subjective conclusion: by applying the consistency factor the way it is now, Quantopian promotes managers who do nothing or are consistently losing money.

https://www.quantopian.com/posts/how-consistent-is-consistency-factor

I could say the same about the Stability Factor:
https://www.quantopian.com/posts/how-stable-is-stability-calculation

I'm confused about the exact implementation too, but I don't think that merely the difference in annual return values is representative of consistency.
Quoting Dan Dunn from https://www.quantopian.com/posts/scoring-changes-for-the-april-quantopian-open , "we're computing the consistency score using a kernel-density estimate using Gaussian kernels found in the Python scipy package. Both the backtest daily returns and the paper trade daily returns are each pushed through the function to fit them to a distribution separately. The difference between the areas of each of the distribution curves is used for the consistency score."
This means that if somebody has a high consistency score, then perhaps their daily returns are, say, normally distributed in both the backtest and paper trading. For instance, if in the backtest I had daily returns of [0,0,0,0,0,100,0,0,0,0,0], this would be scaled by 100 while generating the distribution. Now in live trading, if I had daily returns of [0,1,0], this distribution (which is just a solitary spike) would have 100% overlap with the backtest spike.
I'm assuming that both distributions are first scaled by the total return amount, as is done when computing a probability distribution so that the area under the curve is unity, and that the fraction of overlapped area then gives the consistency. I'm curious whether this is indeed the case. As one can imagine, this sounds wrong, since scaling the returns ends up awarding a consistency score of 1 to somebody whose algo performs a hundred times worse trading live.
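That scaling concern is easy to demonstrate with a short sketch (synthetic data, and assuming a z-score "scale" transform; this is an illustration, not Quantopian's actual pipeline): standardizing the two return series strips out their magnitude entirely, so a live record 100x smaller than the backtest yields an essentially identical standardized distribution.

```python
import numpy as np
from scipy import stats
from sklearn import preprocessing

rng = np.random.default_rng(0)
bt_returns = rng.normal(0.001, 0.01, 500)   # synthetic backtest daily returns
live_returns = bt_returns / 100             # live algo does 100x worse

# z-scoring removes both the mean and the scale, so the two series
# become numerically indistinguishable after standardization
bt_z = preprocessing.scale(bt_returns)
live_z = preprocessing.scale(live_returns)

grid = np.linspace(-4, 4, 100)
kde_bt = stats.gaussian_kde(bt_z)(grid)
kde_live = stats.gaussian_kde(live_z)(grid)

# normalized absolute difference between the density estimates (0 = identical)
kde_diff = np.abs(kde_bt - kde_live).sum() / (kde_bt.sum() + kde_live.sum())
```

Under this transform, `kde_diff` comes out at essentially zero even though the live algo earns 1/100th of the backtest returns.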

I agree with this thread. I hope Dan or someone from Q sheds some light on this. Gaussian kernels sound fancy, but the result is apparent to all of us. For example, I saw an entry in the top 10 today with a backtest score of 40 and a paper trading score of 80. The consistency is 86%, which makes me wonder why the consistency of my algo is so low.

Hi folks,

Arshdeep has it correct in linking to Dan's post on scoring changes. The consistency score is arrived at by creating distributions of the daily returns data and then computing the overlapping region under the curves.

You can read more detail in Dan's post linked above, and you can nerd out on the Gaussian KDE function if you like. But in the limit of having many daily return data points for both the backtest and the paper trading period, the choice of distribution function shouldn't be all that critical.

This method is by no means perfect. One weakness shows up when there are very few days of paper trading data to fit the 'out of sample' distribution, in which case the fit might be overly generous. It should, however, be a lot better than just comparing the annualized returns of the backtest and paper traded results, since the annualized return of a paper traded record of just a few days, or even weeks, can be extremely noisy.
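To illustrate why annualizing a few days of paper trading is so noisy, here is a quick synthetic sketch (the return series and the geometric-annualization helper are illustrative, not Quantopian's code):

```python
import numpy as np

rng = np.random.default_rng(0)
daily = rng.normal(0.0005, 0.01, 504)  # ~2 years of synthetic daily returns

def annualized_return(returns):
    # geometric annualization: compound the sample, then scale to 252 days
    total = np.prod(1 + np.asarray(returns))
    return total ** (252 / len(returns)) - 1

one_year = annualized_return(daily[:252])
# annualize every non-overlapping 5-day window of the same series
five_day = [annualized_return(daily[i:i + 5]) for i in range(0, 500, 5)]
spread = max(five_day) - min(five_day)
```

The 5-day annualizations scatter over an enormous range even though every window is drawn from the same return process, whereas the full-year figure stays in a plausible band.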

We are working on an open source risk library that will expose this calculation along with all the other ones we're using to evaluate algorithms in the contest. In the meantime, Justin shared a clone-able notebook last week that has a bunch of research code you can reuse. The function we use to compute consistency is called out_of_sample_vs_in_sample_returns_kde. You can clone that notebook and use it directly in the research platform, or if you just want to see the function, I've included it below along with a sample plot for a visual of what a 0.85 consistency score looks like.

    import numpy as np
    from scipy import stats
    from sklearn import preprocessing

    def out_of_sample_vs_in_sample_returns_kde(bt_ts, oos_ts,
                                               transform_style='scale',
                                               return_zero_if_exception=True):
        # Daily percent returns from the backtest and out-of-sample series
        bt_ts_pct = bt_ts.pct_change().dropna()
        oos_ts_pct = oos_ts.pct_change().dropna()

        # Column vectors for the sklearn preprocessing functions
        bt_ts_r = bt_ts_pct.values.reshape(len(bt_ts_pct), 1)
        oos_ts_r = oos_ts_pct.values.reshape(len(oos_ts_pct), 1)

        if transform_style == 'raw':
            bt_scaled = bt_ts_r
            oos_scaled = oos_ts_r
        elif transform_style == 'scale':
            bt_scaled = preprocessing.scale(bt_ts_r, axis=0)
            oos_scaled = preprocessing.scale(oos_ts_r, axis=0)
        elif transform_style == 'normalize_L2':
            bt_scaled = preprocessing.normalize(bt_ts_r, axis=1)
            oos_scaled = preprocessing.normalize(oos_ts_r, axis=1)
        elif transform_style == 'normalize_L1':
            bt_scaled = preprocessing.normalize(bt_ts_r, axis=1, norm='l1')
            oos_scaled = preprocessing.normalize(oos_ts_r, axis=1, norm='l1')

        X_train = bt_scaled.reshape(len(bt_scaled))
        X_test = oos_scaled.reshape(len(oos_scaled))

        x_axis_dim = np.linspace(-4, 4, 100)
        kernel_method = 'scott'
        try:
            # Fit a Gaussian KDE to each series and evaluate it on the grid
            scipy_kde_train = stats.gaussian_kde(X_train,
                                                 bw_method=kernel_method)(x_axis_dim)
            scipy_kde_test = stats.gaussian_kde(X_test,
                                                bw_method=kernel_method)(x_axis_dim)
        except Exception:
            return 0.0 if return_zero_if_exception else np.nan

        # Normalized absolute difference between the two density estimates:
        # 0 means identical distributions, 1 means no overlap at all
        kde_diff = sum(abs(scipy_kde_test - scipy_kde_train)) / \
            (sum(scipy_kde_train) + sum(scipy_kde_test))

        return kde_diff

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Thank you for the code, Jessica!

I don't have the daily returns of those algos, so I just used the total return; maybe I will start downloading the daily score-sheet and computing their daily returns.

I understand nothing is perfect and there will be trial and error. Using the daily return distribution is a start. However:
- Practically, consistency should also account for the magnitude of the total return. The 0.957778084 consistency score for 4.5% vs 100.8% is a good example of why magnitude matters. Maybe the distributions match, but if the magnitudes are way off, it is not a consistent algo.
- Why daily? What if an algo only rebalances weekly or monthly? In those cases, daily returns might not reflect the fundamental characteristics of the algo.

Plus, with the limited number of out-of-sample data points, which is the situation we face now, perhaps a different approach would be more appropriate than comparing return distributions.

How about ranking the percentage difference between backtest and out-of-sample metrics (return, drawdown, Sharpe, ...) and averaging out the rankings, just like the overall algo scoring process? Maybe give different weights to metrics that are better or worse out of sample? I think this would work better with a smaller set of data.
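As a purely hypothetical sketch of that proposal (the metric names, the relative-difference formula, and the clipping are illustrative choices, not anything Quantopian uses):

```python
import numpy as np

def metric_agreement(bt_metrics, oos_metrics, weights=None):
    """Average agreement between backtest and out-of-sample metric values.

    bt_metrics / oos_metrics: dicts of metric name -> value.
    Returns 1.0 for perfectly matching metrics, approaching 0 as they diverge.
    """
    names = sorted(bt_metrics)
    diffs = []
    for name in names:
        bt, oos = bt_metrics[name], oos_metrics[name]
        denom = max(abs(bt), abs(oos), 1e-9)        # guard against divide-by-zero
        diffs.append(abs(bt - oos) / denom)          # relative difference
    diffs = np.clip(diffs, 0, 1)                     # cap each metric's penalty at 1
    w = np.ones(len(names)) if weights is None else np.asarray(weights)
    return 1 - float(np.average(diffs, weights=w))

# Example using the thread's 4.5% vs 100.8% return case plus two made-up metrics
score = metric_agreement(
    {"return": 1.008, "sharpe": 2.0, "max_dd": -0.14},
    {"return": 0.045, "sharpe": 0.3, "max_dd": -0.06},
)
```

Unlike the KDE approach, this penalizes the 4.5% vs 100.8% case heavily, because magnitude differences feed directly into the score.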

I was looking at the rankings and cannot understand how the rank-2 entry is above the rank-3 one.
Does the consistency calculation take volume and trade frequency into account? If a strategy does few trades, or trades a small volume and then a large one, can it differentiate, or will it weight both trades equally? Did the scoring algorithm change during the competition?

@Jessica

Is there any justification for using a Gaussian KDE on 1 to 22 out-of-sample data points and comparing them with 504 in-sample data points?
What will the daily return similarity be if in sample all of the returns were near zero and out of sample all were near zero?
I suppose 1.0. You promote an algo for doing nearly nothing.
What will the daily return similarity be if in sample all of the returns were near zero and out of sample all were positive?
I suppose 0.3. You punish an algo for doing better than in sample.
Is that right?
Similarity of losing or doing nothing is not the same as similarity of positive returns.
If the Gaussian KDE function does not understand the difference between positive and negative similarity, it must be removed from the algo scoring.

I could say the same about the Stability Factor, which behaves much worse than Consistency.
Look at this algo by Pravin Bezwada.
Point me to where you see instability when 7 major out-of-sample metrics are positive and equal to or better than in sample.
Why is the stability 0.2050 (score 347)?
What are you punishing this algo for?

Whether better out-of-sample performance should be treated differently from worse is up for debate, but the issue right now is that algos with a bigger difference in performance (on the metrics used for scoring) seem to have higher consistency scores than algos with a smaller difference. This is exactly the opposite of "consistency".

Quick response to some of the questions:

It's important to think about the metrics used for the contest ranking in a more holistic manner.

No single metric can distinguish a great algo from a poor one, and it's important to analyze algos across a breadth of different metrics in order to get a 360-degree view of their past, and hopefully future, performance.

We recognize the issue with attempting to judge an algorithm based on a few datapoints in paper trading, and we hope to roll out improved methods for this in the future (e.g. possibly a different type of "consistency" or "stability" metric). Regarding the current consistency metric, though, the main intention is to provide insight into the shape of the distribution of returns rather than their sheer magnitude. We have a couple of other metrics that rank based on magnitude and do not account for distribution (e.g. Annual Return, Sharpe Ratio), and thus the consistency metric was born out of a desire to home in on another single dimension of an algo's performance, all within the context of evaluating the algo across the other metrics as well.

Again, though, the way I view analyzing algos is in this more holistic manner, recognizing that any single computed metric on a backtest is likely going to differ (to a degree) out of sample, but hopefully fall into some reasonable expected range. Thus, considering cutoff values for each metric and ensuring an algo meets some basic criteria across the entire breadth of metrics is possibly more important than precisely computing any single metric value. It's sort of like thinking: I'd rather have an algo that is in the top 10 of all 8 metrics than the algo that is ranked #1 in Sharpe Ratio (but ranked 200 in all of the other 7 metrics).

Perhaps a sports analogy helps: consider individual sports such as golf or tennis, where most of the time the #1-ranked player in the world is above average across the breadth of shots required in the sport, while many lower-ranked competitors are the best at some single stroke (serving, forehand, backhand, driving off the tee, putting on the green, etc.). Being the best in investing is analogous: building a portfolio, in particular, is about selecting components that are all above average across a breadth of characteristics rather than getting any single metric perfect.


@Justin Lent

You should recognize that the experiment with the non-industry-standard performance metrics named "consistency" and "stability" has so far shown only negative results, even though they are supposed to capture the distribution of returns rather than their sheer magnitude.

But there is the Omega ratio
https://en.wikipedia.org/wiki/Omega_ratio
https://faculty.fuqua.duke.edu/~charvey/Teaching/BA453_2005/Keating_An_introduction_to.pdf
which is considered an industry standard, is based on the distribution of returns, is natural from the standpoint of probability and statistics, and is appealing in its financial interpretation.
I recommended using it in the numerator of the single-algo performance metric "QuantopVYan Score" a couple of months ago
https://www.quantopian.com/posts/request-real-world-strategy-scoring-metric
and never got a response to that post.
What do you think about that?
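For reference, the discrete form of the Omega ratio mentioned above can be sketched as follows (a minimal approximation of the Keating and Shadwick definition, not anything from Quantopian's scoring):

```python
import numpy as np

def omega_ratio(returns, threshold=0.0):
    # Discrete Omega ratio: the sum of return excesses above the threshold
    # divided by the sum of shortfalls below it
    excess = np.asarray(returns, dtype=float) - threshold
    gains = excess[excess > 0].sum()
    losses = -excess[excess < 0].sum()
    return np.inf if losses == 0 else gains / losses

ratio = omega_ratio([0.01, -0.005, 0.02, -0.01, 0.015])  # gains 0.045 vs losses 0.015
```

Because both gains and losses enter the ratio unscaled, the metric stays sensitive to the magnitude of returns, which is exactly what the z-scored KDE comparison discards.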

Quantopian Open June 2015: average metrics of the top 10 by stability (of losing) and consistency (of doing nothing)

Quantopian open June 2015   annRet   annVol   maxDD   sharpe    calmar  stability  consistency
Stability Best 10_pt       -130.52%  10.64%  -31.62%  -14.226   -4.835  0.973      0.807
Stability Best 10_bt        -36.15%  14.27%  -73.24%   -3.590   -0.493  0.876
Consistency Best 10_pt        1.31%  12.74%   -6.44%   -0.258   -0.194  0.128      0.962
Consistency Best 10_bt       38.59%  15.54%  -13.62%    2.175    3.325  0.780