clustering example from yesteryear

Here's a clustering example, posted years ago on Quantopian (I don't take credit for the original code). Perhaps it'd be of interest to folks, and it could be brought into the modern era with some refactoring (a standalone sketch against current scikit-learn follows the code below).

from sklearn import cluster, covariance
import numpy as np
from collections import defaultdict
from scipy import stats
import pandas as pd
# pd.set_option('mode.use_inf_as_null',True)

# based on the example at:
# http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html

update_period = 5*390 # update clusters at this interval, in minutes (5 trading days x 390 minutes per session)

def initialize(context):
    c = context
    
    # Nasdaq 100 from https://www.quantopian.com/posts/list-of-nasdaq-100-sids-to-use-in-your-algo
    
    c.sids = [sid(24),    sid(114),   sid(122),   sid(630)  , sid(67),      
              sid(20680), sid(328),   sid(14328), sid(368),   sid(16841),   
              sid(9883),  sid(337),   sid(38650), sid(739),   sid(27533),   
              sid(3806),  sid(18529), sid(1209),  sid(40207), sid(1419),    
              sid(15101), sid(17632), sid(39095), sid(1637),  sid(1900),    
              sid(32301), sid(18870), sid(14014), sid(25317), sid(36930),   
              sid(12652), sid(26111), sid(24819), sid(24482), sid(2618),    
              sid(2663),  sid(27543), sid(1787) , sid(2696),  sid(42950),   
              sid(20208), sid(2853),  sid(8816),  sid(12213),  sid(3212),    
              sid(9736),  sid(23906), sid(26578), sid(22316), sid(13862),   
              sid(3951),  sid(8655),  sid(25339), sid(4246),  sid(43405),   
              sid(27357), sid(32046), sid(4485),  sid(43919), sid(4668),    
              sid(8677),  sid(22802), sid(3450),  sid(5061),  sid(5121),    
              sid(5149),  sid(5166),  sid(23709), sid(13905), sid(19926),   
              sid(19725), sid(8857),  sid(5767),  sid(5787),  sid(19917),   
              sid(6295),  sid(6413),  sid(6546),  sid(20281), sid(6683),    
              sid(26169), sid(6872),  sid(11901), sid(13940), sid(7061),    
              sid(15581), sid(24518), sid(7272),  sid(39840), sid(7671),    
              sid(27872), sid(8017),  sid(38817), sid(8045),  sid(8132),    
              sid(8158),  sid(24124), sid(8344),  sid(8352),  sid(14848)]  
    
    context.elapsed_minutes = 0
    
def stock_cluster(attribute,context):
    c = context
    
    # tell it we're looking for a graph structure
    edge_model = covariance.GraphLassoCV()
    X = attribute.values.copy()
    X_zscore = stats.zscore(X, axis=0, ddof=1)
    
    if np.any(pd.isnull(X_zscore)):
        print 'null found in X_zscore'
        return None
    
    edge_model.fit(X_zscore)
                
    # now process into clusters based on co-fluctuation
    _, labels = cluster.affinity_propagation(edge_model.covariance_)
    
    log.debug("Found {0} groups from {1} complete histories".format(max(labels)+1,len(attribute)))
    
    # filter the sids into groups, in the order they appear in c.sids
    groups = defaultdict(list)
    for i, grp_idx in enumerate(labels):
        # groups[grp_idx].append( c.sids[i] )
        groups[grp_idx].append( c.sids[i].symbol )
        
    return groups
                
def handle_data(context, data):
    
    if context.elapsed_minutes % update_period != 0.0:
        context.elapsed_minutes += 1
        return
    else:
        context.elapsed_minutes += 1    
    
    prices_open = history(31, '1d', 'open_price',ffill=False).dropna()[:-1]
    prices_close = history(31, '1d', 'close_price',ffill=False).dropna()[:-1]
    prices_delta = prices_close - prices_open
    
    c = context
    
    groups = stock_cluster(prices_delta,c) 
        
    result = '------------------\n'
    
    if groups is not None:
        # display stock sids that co-fluctuate:
        for i, g in groups.iteritems(): 
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))
        print result
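
For anyone who wants to tinker with the clustering step outside the backtester, here's a minimal standalone sketch of the same idea against current scikit-learn (where GraphLassoCV has since been renamed GraphicalLassoCV). It runs on synthetic return series rather than Quantopian data, so treat it purely as an illustration:

import numpy as np
from scipy import stats
from sklearn import cluster, covariance

# Fabricated data: two groups of series driven by two hidden factors plus noise,
# standing in for the daily (close - open) moves used in the algo above.
rng = np.random.RandomState(0)
n_days = 250
factor_a = rng.randn(n_days, 1)
factor_b = rng.randn(n_days, 1)
X = np.hstack([factor_a + 0.3 * rng.randn(n_days, 10),
               factor_b + 0.3 * rng.randn(n_days, 10)])
tickers = ['S%02d' % i for i in range(X.shape[1])]

# Same preprocessing as above: z-score each column
X_zscore = stats.zscore(X, axis=0, ddof=1)

# A sparse inverse-covariance estimate gives the graph structure...
edge_model = covariance.GraphicalLassoCV()
edge_model.fit(X_zscore)

# ...and affinity propagation on the covariance groups co-fluctuating series
_, labels = cluster.affinity_propagation(edge_model.covariance_, random_state=0)
for k in range(labels.max() + 1):
    members = [t for t, l in zip(tickers, labels) if l == k]
    print('Cluster %i: %s' % (k + 1, ', '.join(members)))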

Jonathan Larkin posted a notebook of the approach above here.

I just tried cloning and running the algo that I posted above and got an error:

Something went wrong. Sorry for the inconvenience. Try using the built-in debugger to analyze your code. If you would like help, send us an email.
TimeoutException: Too much time spent in handle_data and/or scheduled functions. 50 second limit exceeded.
There was a runtime error on line 82.

Anyone know what's going on? It was running just over a year ago, and now it fails...

I modified it to run in before_trading_start() instead of handle_data(). A backtest from 2013-10-01 to 2013-10-29 (as it was configured originally) worked for me: pretty fast while 'Building', quite slow when running the same date range as a 'Full Backtest'.

Ignore all the other code changes (a mess) and comments (silly ideas; I have no idea how the clustering algo really works, but it looks interesting).

from sklearn import cluster, covariance
import numpy as np
from collections import defaultdict
from scipy import stats
import pandas as pd

# Update interval in days
update_interval = 5

def initialize(context):
    # Nasdaq 100 from https://www.quantopian.com/posts/list-of-nasdaq-100-sids-to-use-in-your-algo
    nasdaq100 = [24, 67, 114, 122, 328, 337, 368, 630, 739, 1209,
                 1419, 1637, 1787, 1900, 2618, 2663, 2696, 2853, 3212, 3450,
                 3806, 3951, 4246, 4485, 4668, 5061, 5121, 5149, 5166, 5767,
                 5787, 6295, 6413, 6546, 6683, 6872, 7061, 7272, 7671, 8017,
                 8045, 8132, 8158, 8344, 8352, 8655, 8677, 8816, 8857, 9736,
                 9883, 11901, 12213, 12652, 13862, 13905, 13940, 14014, 14328, 14848,
                 15101, 15581, 16841, 17632, 18529, 18870, 19725, 19917, 19926, 20208,
                 20281, 20680, 22316, 22802, 23709, 23906, 24124, 24482, 24518, 24819,
                 25317, 25339, 26111, 26169, 26578, 27357, 27533, 27543, 27872, 32046,
                 32301, 36930, 38650, 38817, 39095, 39840, 40207, 42950, 43405, 43919]
    context.sids = map(sid, nasdaq100)
    context.days_since_update = update_interval

def handle_data(context, data):
    pass

def before_trading_start(context, data):
    ### Called every day before market open.
    
    if context.days_since_update < update_interval:
        context.days_since_update += 1
        return
    
    ### Perform update
    update_cluster(context, data, ndays=31)
    ### reset the countdown to the next update
    context.days_since_update = 1

def update_cluster(context, data, ndays):
    ### MODIFIED
    #prices_open = history(31, '1d', 'close_price', ffill=False).dropna()[:-1]
    #prices_close = history(ndays, '1d', 'close_price', ffill=False).dropna()[:-1]
    
    prices_open = data.history(context.sids, fields='open', bar_count=ndays, frequency='1d')
    prices_close = data.history(context.sids, fields='close', bar_count=ndays, frequency='1d')
    
    ### ??? maybe [:-1] before .dropna() ?
    prices_open = prices_open.dropna()[:-1]
    prices_close = prices_close.dropna()[:-1]
    
    ### ??? is (Close - Open) price difference the best thing to do clustering on ?
    ### ??? what about log(price_close) - log_price(open) instead maybe ?
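    ### Untested alternative sketch: intraday log returns, which put high- and
    ### low-priced stocks on a more comparable scale:
    # prices_delta = np.log(prices_close) - np.log(prices_open)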
    prices_delta = prices_close - prices_open
    
    groups = stock_cluster(prices_delta, context) 
    print_result(groups)

def stock_cluster(attribute, context):
    # tell it we're looking for a graph structure
    edge_model = covariance.GraphLassoCV()
    
    ### MODIFIED
    #X = attribute.values.copy()
    #X_zscore = stats.zscore(X, axis=0, ddof=1)
    X_zscore = stats.zscore(attribute.values, axis=0, ddof=1)
    
    if np.any(pd.isnull(X_zscore)):
        print 'null found in X_zscore'
        return None
    
    edge_model.fit(X_zscore)
                
    # now process into clusters based on co-fluctuation
    _, labels = cluster.affinity_propagation(edge_model.covariance_)
    
    log.debug("Found {0} groups from {1} complete histories".format(max(labels)+1,len(attribute)))
    
    # filter the sids into groups, in the order they appear in context.sids
    groups = defaultdict(list)
    for i, grp_idx in enumerate(labels):
        # groups[grp_idx].append( c.sids[i] )
        groups[grp_idx].append( context.sids[i].symbol )
        
    return groups

def print_result(groups):
    ### factored out but unmodified code
    result = '------------------\n'
    if groups is not None:
        # display stock sids that co-fluctuate:
        for i, g in groups.iteritems(): 
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))
        print result

Thanks Jon -

Yes, there's more time allowed in before_trading_start. It is kind of curious, though, that the code used to run and now it doesn't. I'd have thought any changes to the platform would have improved execution speed, not reduced it.
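
If the GraphLassoCV fit itself is what blows the time budget, one untested option would be to dial back its cross-validation work, at the cost of a less carefully tuned regularization strength:

# Untested sketch: fewer CV folds and alpha refinements make GraphLassoCV
# noticeably cheaper to fit, with some loss in how well alpha is chosen.
edge_model = covariance.GraphLassoCV(cv=3, n_refinements=2)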