Some code from Ernie Chan's new book implemented in Python

If you have a copy of Ernie Chan's new book you might find it a bit tricky that all of his code is written in Matlab. I like Matlab, but it's proprietary and some of the toolboxes are actually quite expensive. Octave is the (even slower) alternative, but it does not have all the wonderful stats tools that Ernie uses in the code in his book.

In order to do the exercises I started to write out Python versions of his code. So far I've done the Hurst exponent, the variance ratio test, the EWC-EWA example with the ADF test, a neat implementation of the Johansen test (although the backbone of the Johansen code is from another source), the half-life of mean reversion, the EWA-EWC example with Johansen, and the self-correlation calculation for his inter-day momentum strategies.

There might still be a bug here and there and the code is not always pretty but I felt that it was worth sharing. If you find some bugs or have any comments or requests please let me know.

36 responses

Tom,

I'm interested in the Hurst exponent but Firefox gives a warning about the site you are linking to, as follows:

www.leinenbock.com uses an invalid security certificate.
The certificate is not trusted because no issuer chain was provided.
(Error code: sec_error_unknown_issuer)


P.

Peter,

This is the code Tom Starke had posted for the Hurst exponent, in case you are having trouble accessing the site:

In a nutshell, this nifty little number H tells us if a time series is a random walk (H ~ 0.5), trending (H > 0.5) or mean reverting (H < 0.5). I have found a good code example for this in Octave as shown below. Because of the finite sample size we also need to know the statistical significance of our calculation. For this we use the vratio test which can be found here.

import matplotlib.pyplot as py
from numpy import *

def hurst(p):
    tau = []; lagvec = []
    # Step through the different lags
    for lag in range(2, 20):
        # Produce price difference with lag
        pp = subtract(p[lag:], p[:-lag])
        # Write the different lags into a vector
        lagvec.append(lag)
        # Calculate the variance of the difference vector
        tau.append(sqrt(std(pp)))
    # Linear fit to double-log graph (gives power)
    m = polyfit(log10(lagvec), log10(tau), 1)
    # Calculate hurst
    hurst = m[0]*2
    # Plot lag vs variance
    #py.plot(lagvec, tau, 'o'); py.show()
    return hurst

if __name__ == "__main__":
    # Different types of time series for testing
    #p = log10(cumsum(random.randn(50000)+1)+1000)  # trending, hurst ~ 1
    #p = log10((random.randn(50000))+1000)          # mean reverting, hurst ~ 0
    p = log10(cumsum(random.randn(50000))+1000)     # random walk, hurst ~ 0.5
    print(hurst(p))

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

nice!

@Peter,

I'm not sure why it's giving you problems. It must have something to do with the SSL certificate. I'll try to sort it out.

Tom

Thanks, Tom. I ignored Firefox and went to the site anyway!

P.

This is pretty nifty Tom! Do you think you'll be writing these up in algos here on Quantopian? Since you're sharing the code, I'm assuming you don't mind if we do it, too?

Ernie's stuff is always interesting, and getting it running in Quantopian has been on my mind for a while.


@Peter: Great that you could sort it out! I'm pretty sure Chrome doesn't give this error.
I just read your earlier post on the Hurst exponent. I use it, but I recently found a paper that somewhat weakens the case for the Hurst (I added a link on my website).
Personally, I wouldn't just use the Hurst as a mean reversion indicator, and much less so as an indicator for trends. I recently set up a neural network with several other indicators that is much more reliable. What it can do is filter out the clear cases of mean reversion and momentum, which is much harder if you just use linear regression.

@Dan: Personally, I don't use Quantopian much because I work a lot with options. I have played a bit with zipline though; it's really nice. I recommended Quantopian as a great way to get started in algo trading at my recent talk and workshop at OHM2013, which was attended by several hundred people. You will probably have noticed a significant spike in sign-ups on July 31. I also ran a brief demo there of a simple backtest. With all the people in the workshop it slowed the website right down (sorry), so I told everyone to stop and just watch the screen :-)
I don't mind you sharing the code (wouldn't have published it otherwise). I like Ernie's stuff as well, but like most people he doesn't publish anything that hasn't been out in the public domain before, which is perfectly adequate. Like the first, his second book is great and contains a lot of good ideas explained in a very understandable way.

Tom, that explains the mystery. We definitely did note the burst of people who all showed up and cloned the same algorithm in a 5-minute span. I was baffled as to the root cause. Great to know what it was.

FYI, we have an auto-scaling server farm. When load goes up, servers are automatically spun up and put into the pool. Unfortunately, your presentation brought dozens of people in under 5 minutes, and the farm doesn't spin up that fast! If they had come in over 15 minutes instead of 5 we would have responded better. If by chance you give the presentation again, feel free to let us know so we can have some servers warm for you ;)

On the presentation itself: that's a great one. That's great content. I'm going to forward that around.

Mystery! :-) So sorry. The story was that I wanted to just present it on the beamer anyway, but then the projector did not work and I told everyone to do it themselves. I was a bit busy with the projector until I realized shortly after that this would probably exceed your capacity. So I told everyone to stop. Next time I will let you know, promise.
I was going to let you know before, but I had a bit of trouble as my computer got hacked, like everyone else's :-)
Glad you like the presentation. It's really basic, but it was for a non-trading crowd. Having said that, a lot of people there are super quick at picking things up. It was great, and a bit of advertising for you is always a bonus.

It was the type of mystery we love to have. Feel free to create more of them ;)

Tom,

I've implemented your version of the Hurst Exponent to work with Quantopian.
It's pretty neat, except I was a bit confused about what values to set for lag and batch_transform, but for now here it is!

-Seong

'''
Hurst Exponent implemented from Ernie Chan's 'Algorithmic Trading: winning strategies
and their rationale'. One of several tools to test whether a strategy is mean-reverting
'''
import numpy

def initialize(context):
    context.past_prices = []
    context.spy = sid(8554)

def handle_data(context, data):
    log.info(hurst(context, data, context.spy))

@batch_transform(window_length=40)
def gather_data(data):
    # Gathers data for an arbitrary number of days and refreshes
    # Also serves to make sure that data exists
    return data

'''
Adjusts a list of past prices to the number of periods you want.
So if you want the number of prices in the last forty days, set period = 40
'''
def gather_prices(context, data, sid, period):
    context.past_prices.append(data[sid].price)
    if len(context.past_prices) > period:
        context.past_prices.pop(0)
    return

'''
Hurst exponent helps test whether the time series is:
(1) A Random Walk (H ~ 0.5)
(2) Trending (H > 0.5)
(3) Mean reverting (H < 0.5)
'''
def hurst(context, data, sid):
    # Gathers all the prices that you need
    gather_prices(context, data, sid, 40)
    # Checks whether data exists
    data_gathered = gather_data(data)
    if data_gathered is None:
        return

    tau, lagvec = [], []
    # Step through the different lags
    for lag in range(2, 20):
        # Produce price difference with lag
        pp = numpy.subtract(context.past_prices[lag:], context.past_prices[:-lag])
        # Write the different lags into a vector
        lagvec.append(lag)
        # Calculate the variance of the difference
        tau.append(numpy.sqrt(numpy.std(pp)))
    # Linear fit to a double-log graph to get power
    m = numpy.polyfit(numpy.log10(lagvec), numpy.log10(tau), 1)
    # Calculate hurst
    hurst = m[0]*2

    return hurst


The Matlab version of Hurst sets some arbitrary lag value, and I have no idea how it was determined. I guess they just had to fix some number. I played around with it for a while and I'm not sure what a good set of values is either. I think it's really not straightforward and it has to be used with great care.
I mentioned it in the other post, but for completeness I just wanted to point to a neural-network-based domain identifier which I've built recently. This seems to do a much better job of distinguishing between different types of time series. But have a look yourself. I would be interested in some feedback on how to improve this.

Hi Tom, everyone,

result = stat.OLS(y, x).fit()

I have two curious observations:

1) if y=x, the results do not show they are very cointegrated:

start = dt.datetime(2000, 1, 1)

Out[43]:
(-1.7968325797070679, 0.38203535993053783,
26,
2013,
{'1%': -3.4336026867364144,
'10%': -2.5675349256978177,
'5%': -2.8629768669694458},
-130203.8094987915)

2) Supposedly highly correlated lists are also not very cointegrated:

start = dt.datetime(2012, 1, 1)

Out[48]:
(-2.4546114609901784, 0.1269119048357274,
5,
590,
{'1%': -3.4414821678603946,
'10%': -2.5693855271473716,
'5%': -2.8664511716874657},
-507.45202121209627)

Is my understanding of cointegration and cadf totally wrong?

Thank you.

Guys, sorry for asking, but I think the code might be wrong: it doesn't calculate the variance of the lagged differences, and it takes the square root of the standard deviation, which is statistically meaningless.

Hi Victor, there were indeed some problems with that code and this has been corrected now including tests. Please have a look at:
http://drtomstarke.com/index.php/calculation-of-the-hurst-exponent-to-test-for-trend-and-mean-reversion/
Thanks for pointing it out.

Hi Tom,

I am trying to get to your links and I am getting a database error for all of them.

Sorry Brian, I had so many requests to my site in such a short time that my server collapsed. I host it on EC2 and the Micro-instances are not laid out for all that traffic. Never expected that this stuff would get that much interest, haha. Anyway, it should be back up now. Thanks for letting me know.

Just want to say, great site Tom, excellent resource, your work is greatly appreciated.

Hi Tom, I am playing around with the code from your blog where three tests were presented. Those tests work fine and make sense to me. However, when I apply it to the USDCAD price between 20070722 and 20120328 as Ernie did in his book (p. 45), I get 0.9886, totally different from the 0.49 in his book. If I use the code posted by Seong Lee earlier in this thread, it gives me 0.4642. Am I missing something?

Hi Alan,

I think a lot of it may have to do with both the input prices that we use and the implementation (Python vs. what Ernie uses). Tom may have also implemented Ernie's code a little differently with his own variations (just a guess, not 100% sure), so that's where the numbers might be off as well.

Thoughts @Tom?

Seong

Hi Alan,

I'm a bit busy right now to explore that in depth, but just one quick comment: the code was adapted from Chu Chen's work on Matlab Central, and he provides a whole number of different ways of calculating the Hurst exponent. The problem with the Hurst, and autocorrelation in general, is that you need to set a limit on the variability of the seasonality. In plain English, you have to set a limit for the differencing of [t_i - t_{i-n}]. In the code that I posted this is done with the line "mlarge = floor(N/5.0)". Notice that 5.0 is arbitrary and could be replaced with anything. This is not really a problem for infinite data sets, but in finite datasets the choice of this number may be crucial. In that sense it would be nice to calibrate it against other Hurst implementations (which I haven't done). With all these things you have to be a bit careful using out-of-the-box implementations; they may not always be appropriate for your datasets.
On another note, be aware that Ernie's Hurst of 0.49 is way too close to 0.5 (considering the limited number of data points) to make clear assumptions on the autocorrelative properties of the data. It's not incorrect but I wouldn't bet my life savings on it (but don't forget to thank Ernie for bringing the Hurst to our attention).
I hope that makes sense and I will have a further look into that when I find some more time. Let me know if you have any more questions.
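To illustrate how much that arbitrary cap can matter, here is a toy version (my own simplified estimator, not Chu Chen's code; it fits the std of the lagged differences directly, so the fitted slope is H itself):

```python
import numpy as np

def hurst_capped(p, max_lag):
    # Slope of log(std of lagged differences) vs log(lag), up to an arbitrary cap
    lags = list(range(2, max_lag))
    tau = [np.std(p[lag:] - p[:-lag]) for lag in lags]
    return np.polyfit(np.log10(lags), np.log10(tau), 1)[0]

np.random.seed(0)
p = np.cumsum(np.random.randn(10000))  # pure random walk, so H should sit near 0.5

h_small = hurst_capped(p, 20)           # tight cap, like the 2..20 range used earlier
h_big = hurst_capped(p, len(p) // 5)    # floor(N/5)-style cap
print(h_small, h_big)
```

On an infinite random walk both numbers would converge to 0.5; on finite data the large-lag estimates rest on very few independent samples, which is exactly why the choice of cap becomes crucial.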

Tom

Hi Seong/Tom,
Thanks for the quick response. It seems I have to debug the code to see what the difference is.
I don't think it is a problem with the price data. I was able to use prices from Yahoo Finance in my Python code to reproduce many examples in Ernie's book, particularly chapter 3. Sometimes I have to debug my Python and his Matlab code side by side to resolve the differences. One catch is that most price histories he used in the book actually start from 20060426 rather than what he stated in the book, mainly because some algorithms require a look-back window.
I may debug this when I have time. But since the Hurst exponent is a very weak indicator, I would suggest we don't spend too much time on it.
Alan,

BTW, Tom, I forgot to say thank you very much for your python code. That helps a lot!

Thanks Alan. I agree that Hurst is not a particularly strong indicator and probably performs better in cardiology. As I said, I believe that the number of periods that is taken into account matters significantly. If you have strong auto-regression for t-n but your calculation only takes into account periods from t-1 to t-k with k < n, then Hurst would probably not give a strong signal. From what I can see, some of the Matlab implementations only take a relatively small range of periods into account. That might be worth exploring.

Tom

Hi,

I'm working on it too; however, I failed to reproduce the results of Example 2.1 (ADF test) and Example 2.4 (half-life).
statsmodels is used for the ADF test and OLS fitting.

import Quandl as qd
import statsmodels.api as sm
import statsmodels.tsa.stattools as ts

startdate = '2007-07-22'
enddate = '2012-03-28'

df = qd.get(qcode, trim_start=startdate, trim_end=enddate)

# calculate half-life
mdf = md.fit()
half_life = -np.log(2)/mdf.params[1]
lookback = np.round(half_life)


However, the result is quite different from the book.
(-2.3466233030145998, 0.15737142688333444, 1, 1221, {'5%': -2.8639100200710828, '10%': -2.568031835031368, '1%': -3.435716995109265}, -9562.9337394440736)
and my half-life is 218 days.

Any ideas of why is this? Thanks!

import statsmodels.tsa.stattools as ts
import numpy as np
import pandas as pd
from math import log

def adf(x):
    """
    adf(x) - Calculate the ADF stat, p-value and half-life of the given series. This test
    will tell us if a time series is mean reverting.
    :param x: A pandas.Series of data.
    :return: (stat, p_value, half_life)
    Reference
    http://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test
    """
    # Augmented Dickey-Fuller test on the series
    result = ts.adfuller(x)

    # Calculate the half-life of reversion
    price = pd.Series(x)
    lagged_price = price.shift(1).fillna(method="bfill")
    delta = price - lagged_price
    beta = np.polyfit(lagged_price, delta, 1)[0]  # Use price(t-1) to predict delta
    half_life = (-1 * log(2) / beta)
    return result[0], result[1], half_life

Out[5]: (-1.6228164049914919, 0.47125307555839901, 115.71732788220135)


The usdcad.csv is from Bloomberg data. 20070723 ~ 20120328. You can switch to your own data and see if it is code problem or data problem.

I just tried Quandl data for the same range and got:

In [59]: adf(y['USD/CAD'])
Out[59]: (-1.851393181282041, 0.35526109673432998, 259.27118869559695)

It seems the data is definitely one of the differences.

Hi Alan,

Thanks for your reply. I tried your code and it works the same as mine, except that I set the lag order to 1 in the ADF test. There are minor differences in the linear-fit implementation, but the result should be the same, so I think the data is the main reason.

Hi guys,

I had a closer look at the Hurst again. The inconsistencies we are getting clearly come down to the fact that H explores the variances of seasonal returns. If we have Brownian motion they should increase in a well-behaved manner with an exponent of 0.5 ( sqrt(t) ). If it's larger, positive autocorrelation forces us into a trending scenario and with H < 0.5 we are forced into a mean-reverting situation. We all know that a price series may be trending in the long range but mean-reverting in short time scales or vice versa. Now, if we look at most implementations of H calculators, we have fixed lags. If we look at SPY for example, a lag range of 2 to 20 will give us H ~ 0.43 but a range of 200 to 600 will give H ~ 0.57. If you look at the graph of log(lag) vs log(standard deviation), these two regimes are very distinct. It shows that we cannot simply throw the Hurst at anything, we really need to look at the scales involved. In simple terms, the lags should be in accordance with your trading signals, so if you trade on a 20-day MA-crossover it makes no sense to have a lag range of 2 to 600 but if you use fundamentals you may be better off neglecting the short lags.
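The two-regime effect is easy to reproduce on synthetic data (this is not SPY, just a mean-reverting AR(1) stacked on a random walk, measured with my own windowed estimator that fits the std of differences directly):

```python
import numpy as np

def hurst_window(p, lag_min, lag_max):
    # H estimated only from lags in [lag_min, lag_max)
    lags = list(range(lag_min, lag_max))
    tau = [np.std(p[lag:] - p[:-lag]) for lag in lags]
    return np.polyfit(np.log10(lags), np.log10(tau), 1)[0]

np.random.seed(3)
n = 20000
# Mean-reverting AR(1) noise on top of a random walk: short lags see the
# mean reversion, long lags see the random walk
ar = np.zeros(n)
shocks = np.random.randn(n)
for t in range(1, n):
    ar[t] = 0.7 * ar[t-1] + shocks[t]
p = 0.3 * np.cumsum(np.random.randn(n)) + ar

h_short = hurst_window(p, 2, 20)
h_long = hurst_window(p, 200, 600)
print(h_short, h_long)
```

The short-lag estimate comes out well below the long-lag one, which is exactly the point: the lag window selects which dynamics the Hurst exponent sees.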

Hi Tom

When I click on the links you provided I'm still getting the security warning

www.leinenbock.com uses an invalid security certificate.
The certificate is not trusted because no issuer chain was provided.
(Error code: sec_error_unknown_issuer)

Hey this link could be of use: http://www.quantstart.com/articles/Basics-of-Statistical-Mean-Reversion-Testing
It comes from a book that is similar to Chan's but with everything written in Python.

Hi Tom,
Thanks for sharing your code.
I've tested your implementation of the variance ratio test using USDCAD rates from July 22, 2007 to March 28, 2012; however, my results diverge from the results in Chan's book.
According to Example 2.3 of Chan's book I should get following results:
h = 0 and pValue = 0.37
(h = 0 means it may be a random walk)

But with your code I find :
h (or vratio in your code) = 1.28 and pvalues = 0.99999
(According to Chan: h = 1 means rejection of the random walk hypothesis)

Is my assumption:
h (according to Chan's book) = vratio (according to your code)
correct?

Did I misunderstand how to use your vratio function? Or how else can I explain why my results diverge from Chan's book?
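One possible source of confusion (this is my reading of the Matlab vratiotest interface, so treat it as an assumption): Chan's h is a 0/1 reject-the-random-walk flag at the 5% level, not the variance ratio itself. From the zscore in the tuple that the vratio function returns, the comparable flag would be something like:

```python
def reject_random_walk(zscore, crit=1.96):
    # Two-sided z-test against the random-walk null at the 5% level
    return int(abs(zscore) > crit)

print(reject_random_walk(0.9))   # 0 -> cannot reject the random walk
print(reject_random_walk(2.5))   # 1 -> reject the random walk
```

So comparing vratio (the raw ratio, e.g. 1.28) directly against Chan's h (a 0/1 flag) would be mixing two different quantities.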

import pandas as pd
import numpy as np
from pandas import DataFrame
import datetime
import pandas.io.data
import matplotlib.pyplot as plt
import statsmodels.tsa.stattools as ts
from numpy import cumsum, log, polyfit, sqrt, std, subtract
from numpy.random import randn
from numpy import random

pi = 3.141592653589793

def normcdf(X):
    # Polynomial approximation to the standard normal CDF (Abramowitz & Stegun)
    (a1, a2, a3, a4, a5) = (0.31938153, -0.356563782, 1.781477937, -1.821255978, 1.330274429)
    L = abs(X)
    K = 1.0 / (1.0 + 0.2316419 * L)
    w = 1.0 - 1.0 / sqrt(2*pi) * np.exp(-L*L/2.) * (a1*K + a2*K*K + a3*pow(K,3) + a4*pow(K,4) + a5*pow(K,5))
    if X < 0:
        w = 1.0 - w
    return w

def vratio(a, lag=2, cor='hom'):
    """ the implementation found in the blog Leinenbock
    http://www.leinenbock.com/variance-ratio-test/
    """
    n = len(a)
    mu = sum(a[1:n] - a[:n-1]) / n
    m = (n - lag + 1) * (1 - lag / float(n))
    b = sum(np.square(a[1:n] - a[:n-1] - mu)) / (n - 1)
    t = sum(np.square(a[lag:n] - a[:n-lag] - lag * mu)) / m
    vratio = t / (lag * b)
    la = float(lag)
    if cor == 'hom':
        varvrt = 2 * (2*la - 1) * (la - 1) / (3*la*n)
    elif cor == 'het':
        varvrt = 0
        sum2 = sum(np.square(a[1:n] - a[:n-1] - mu))
        for j in range(lag - 1):
            sum1a = np.square(a[j+1:n] - a[j:n-1] - mu)
            sum1b = np.square(a[1:n-j] - a[0:n-j-1] - mu)
            sum1 = np.dot(sum1a, sum1b)
            delta = sum1 / (sum2**2)
            varvrt = varvrt + ((2*(la - j)/la)**2) * delta
    zscore = (vratio - 1) / sqrt(float(varvrt))
    pval = normcdf(zscore)
    return vratio, zscore, pval

if __name__=="__main__":
#Revert indexes, so that indexes are in a chronological order