Clustering example from yesteryear

Here's a clustering example posted years ago on Quantopian (I don't take credit for the original code). Perhaps it'd be of interest to folks, and it could be brought into the modern era with some refactoring.

[Backtest attached; metrics panel (Total Returns, Alpha, Beta, Sharpe, Sortino, Max Drawdown, Benchmark Returns, Volatility) not captured]
from sklearn import cluster, covariance
import numpy as np
from collections import defaultdict
from scipy import stats
import pandas as pd
# pd.set_option('mode.use_inf_as_null',True)

# based on the example at:
# http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html

update_period = 5 * 390  # update clusters at this interval in minutes (390 minutes per trading day)

def initialize(context):
    c = context

    # Nasdaq 100 from https://www.quantopian.com/posts/list-of-nasdaq-100-sids-to-use-in-your-algo
    c.sids = [sid(24),    sid(114),   sid(122),   sid(630),   sid(67),
              sid(20680), sid(328),   sid(14328), sid(368),   sid(16841),
              sid(9883),  sid(337),   sid(38650), sid(739),   sid(27533),
              sid(3806),  sid(18529), sid(1209),  sid(40207), sid(1419),
              sid(15101), sid(17632), sid(39095), sid(1637),  sid(1900),
              sid(32301), sid(18870), sid(14014), sid(25317), sid(36930),
              sid(12652), sid(26111), sid(24819), sid(24482), sid(2618),
              sid(2663),  sid(27543), sid(1787),  sid(2696),  sid(42950),
              sid(20208), sid(2853),  sid(8816),  sid(12213), sid(3212),
              sid(9736),  sid(23906), sid(26578), sid(22316), sid(13862),
              sid(3951),  sid(8655),  sid(25339), sid(4246),  sid(43405),
              sid(27357), sid(32046), sid(4485),  sid(43919), sid(4668),
              sid(8677),  sid(22802), sid(3450),  sid(5061),  sid(5121),
              sid(5149),  sid(5166),  sid(23709), sid(13905), sid(19926),
              sid(19725), sid(8857),  sid(5767),  sid(5787),  sid(19917),
              sid(6295),  sid(6413),  sid(6546),  sid(20281), sid(6683),
              sid(26169), sid(6872),  sid(11901), sid(13940), sid(7061),
              sid(15581), sid(24518), sid(7272),  sid(39840), sid(7671),
              sid(27872), sid(8017),  sid(38817), sid(8045),  sid(8132),
              sid(8158),  sid(24124), sid(8344),  sid(8352),  sid(14848)]

    context.elapsed_minutes = 0

def stock_cluster(attribute, context):
    c = context

    # tell it we're looking for a graph structure
    edge_model = covariance.GraphLassoCV()
    X = attribute.values.copy()
    X_zscore = stats.zscore(X, axis=0, ddof=1)

    if np.any(pd.isnull(X_zscore)):
        print 'null found in X_zscore'
        return None

    edge_model.fit(X_zscore)

    # now process into clusters based on co-fluctuation
    _, labels = cluster.affinity_propagation(edge_model.covariance_)

    log.debug("Found {0} groups from {1} complete histories".format(max(labels) + 1, len(attribute)))

    # filter the sids into groups, in the order they appear in c.sids
    groups = defaultdict(list)
    for i, grp_idx in enumerate(labels):
        # groups[grp_idx].append( c.sids[i] )
        groups[grp_idx].append(c.sids[i].symbol)

    return groups

def handle_data(context, data):

    if context.elapsed_minutes % update_period != 0:
        context.elapsed_minutes += 1
        return
    else:
        context.elapsed_minutes += 1

    prices_open = history(31, '1d', 'open_price', ffill=False).dropna()[:-1]
    prices_close = history(31, '1d', 'close_price', ffill=False).dropna()[:-1]
    prices_delta = prices_close - prices_open

    c = context

    groups = stock_cluster(prices_delta, c)

    result = '------------------\n'

    if groups is not None:
        # display stock sids that co-fluctuate:
        for i, g in groups.iteritems():
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))
    print result
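Outside Quantopian, the same pipeline (fit a sparse graphical model to z-scored series, then run affinity propagation on the estimated covariance) can be tried on synthetic data. This is just an illustrative sketch: the two-factor toy data is made up, and it uses `GraphicalLassoCV`, the name scikit-learn gave `GraphLassoCV` when it was renamed in version 0.22.

```python
import numpy as np
from scipy import stats
from sklearn import cluster, covariance

rng = np.random.default_rng(0)

# Two latent factors driving two groups of three synthetic "stocks" each.
factors = rng.normal(size=(250, 2))
X = np.hstack([
    factors[:, [0]] + 0.1 * rng.normal(size=(250, 3)),
    factors[:, [1]] + 0.1 * rng.normal(size=(250, 3)),
])

# Per-column z-score, as in the algo above.
X_z = stats.zscore(X, axis=0, ddof=1)

# Sparse inverse-covariance estimate with cross-validated regularization.
edge_model = covariance.GraphicalLassoCV()
edge_model.fit(X_z)

# Cluster on the estimated covariance; labels group co-fluctuating series.
_, labels = cluster.affinity_propagation(edge_model.covariance_, random_state=0)
print(labels)  # columns 0-2 should share a label, columns 3-5 another
```

The interesting property, and presumably why the original scikit-learn example uses it, is that affinity propagation picks the number of clusters itself rather than requiring a `k` up front.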
(The attached backtest ended with a runtime error.)
4 responses

Jonathan Larkin posted a notebook of the approach above here.

I just tried cloning and running the algo that I posted above and get an error:

Something went wrong. Sorry for the inconvenience. Try using the built-in debugger to analyze your code. If you would like help, send us an email.
TimeoutException: Too much time spent in handle_data and/or scheduled functions. 50 second limit exceeded.
There was a runtime error on line 82.

Anyone know what's going on? It was running just over a year ago, and now it fails...

The backtest from 2013-10-01 to 2013-10-29 (as originally configured) worked for me: pretty fast while running during 'Building', but quite slow when running the same date range as a 'Full Backtest'.

Ignore all the other code changes (mess) and comments (silly ideas; I have no idea how the clustering algorithm really works, but it looks interesting).

[Backtest attached; metrics panel (Total Returns, Alpha, Beta, Sharpe, Sortino, Max Drawdown, Benchmark Returns, Volatility) not captured]
from sklearn import cluster, covariance
import numpy as np
from collections import defaultdict
from scipy import stats
import pandas as pd

# Update interval in days
update_interval = 5

def initialize(context):
    # Nasdaq 100 from https://www.quantopian.com/posts/list-of-nasdaq-100-sids-to-use-in-your-algo
    nasdaq100 = [24, 67, 114, 122, 328, 337, 368, 630, 739, 1209,
                 1419, 1637, 1787, 1900, 2618, 2663, 2696, 2853, 3212, 3450,
                 3806, 3951, 4246, 4485, 4668, 5061, 5121, 5149, 5166, 5767,
                 5787, 6295, 6413, 6546, 6683, 6872, 7061, 7272, 7671, 8017,
                 8045, 8132, 8158, 8344, 8352, 8655, 8677, 8816, 8857, 9736,
                 9883, 11901, 12213, 12652, 13862, 13905, 13940, 14014, 14328, 14848,
                 15101, 15581, 16841, 17632, 18529, 18870, 19725, 19917, 19926, 20208,
                 20281, 20680, 22316, 22802, 23709, 23906, 24124, 24482, 24518, 24819,
                 25317, 25339, 26111, 26169, 26578, 27357, 27533, 27543, 27872, 32046,
                 32301, 36930, 38650, 38817, 39095, 39840, 40207, 42950, 43405, 43919]
    context.sids = map(sid, nasdaq100)
    context.days_since_update = update_interval

def handle_data(context, data):
    pass

def before_trading_start(context, data):
    ### Called every day before market open.

    if context.days_since_update < update_interval:
        context.days_since_update += 1
        return

    ### Perform update
    update_cluster(context, data, ndays=31)
    ### reset the countdown to the next update
    context.days_since_update = 1

def update_cluster(context, data, ndays):
    ### MODIFIED
    # prices_open = history(31, '1d', 'close_price', ffill=False).dropna()[:-1]
    # prices_close = history(ndays, '1d', 'close_price', ffill=False).dropna()[:-1]

    prices_open = data.history(context.sids, fields='open', bar_count=ndays, frequency='1d')
    prices_close = data.history(context.sids, fields='close', bar_count=ndays, frequency='1d')

    ### ??? maybe [:-1] before .dropna() ?
    prices_open = prices_open.dropna()[:-1]
    prices_close = prices_close.dropna()[:-1]

    ### ??? is (Close - Open) price difference the best thing to do clustering on ?
    prices_delta = prices_close - prices_open

    groups = stock_cluster(prices_delta, context)
    print_result(groups)

def stock_cluster(attribute, context):
    # tell it we're looking for a graph structure
    edge_model = covariance.GraphLassoCV()

    ### MODIFIED
    # X = attribute.values.copy()
    # X_zscore = stats.zscore(X, axis=0, ddof=1)
    X_zscore = stats.zscore(attribute.values, axis=0, ddof=1)

    if np.any(pd.isnull(X_zscore)):
        print 'null found in X_zscore'
        return None

    edge_model.fit(X_zscore)

    # now process into clusters based on co-fluctuation
    _, labels = cluster.affinity_propagation(edge_model.covariance_)

    log.debug("Found {0} groups from {1} complete histories".format(max(labels) + 1, len(attribute)))

    # filter the sids into groups, in the order they appear in context.sids
    groups = defaultdict(list)
    for i, grp_idx in enumerate(labels):
        # groups[grp_idx].append( context.sids[i] )
        groups[grp_idx].append(context.sids[i].symbol)

    return groups

def print_result(groups):
    ### factored out but unmodified code
    result = '------------------\n'
    if groups is not None:
        # display stock sids that co-fluctuate:
        for i, g in groups.iteritems():
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))
    print result
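As an aside, the null check in `stock_cluster` guards against a real failure mode: if any column of the price window has zero variance (say, a stock whose price never changed over the lookback), `stats.zscore` divides by a zero standard deviation and produces NaNs that would crash the covariance fit. A minimal standalone demonstration with toy numbers:

```python
import numpy as np
from scipy import stats

# Toy price window: the second "stock" never moves, so its std is zero.
X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])

Z = stats.zscore(X, axis=0, ddof=1)  # constant column -> 0/0 -> NaN
print(np.any(np.isnan(Z)))           # True: this is what the guard catches
```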

(The attached backtest ended with a runtime error.)

Thanks Jon -

Yes, more time is allowed in before_trading_start. It is kinda curious, though, that the code used to run and now it doesn't. I'd have thought any changes to the platform would have improved execution speed, not reduced it.