
This replicates James Jack's algorithm, which uses scikit-learn to estimate clusters of covarying stocks.

This rewrite simplifies the code by using the new batch_transform. Moreover, the clusters are continually re-estimated on the newest data.
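For context, here's a minimal sketch of the clustering step, mirroring the scikit-learn stock market example the code is based on (cluster_stocks and variation are illustrative names; the full source is attached below):

from sklearn import cluster, covariance

def cluster_stocks(variation):
    # variation: (days x stocks) array of daily close-minus-open moves
    # fit a sparse graphical model of the covariance across stocks
    edge_model = covariance.GraphLassoCV()
    X = variation.copy()
    X /= X.std(axis=0)  # normalize each stock's series
    edge_model.fit(X)
    # group stocks whose returns co-fluctuate
    _, labels = cluster.affinity_propagation(edge_model.covariance_)
    return labels  # labels[i] is the cluster index of stock i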

[Backtest attached]

Hi Thomas,

My old code had a bug in it. I think lines 48-57 need to be changed to:

    # filter the sids into groups, in order they appear in c.sids  
    for i, grp_idx in enumerate(labels):  
        groups[grp_idx].append( int(c.sids[i]) )  

and pass in the context "c" to the batch transform. Otherwise, the order you iterate through SIDs in data.variation is different from the order they appear in c.sids. I think...

Any idea why data.variation is of length window_length+1?

Yeah, there is a bug in batch_transform where the window size is not always constant. However, this is fixed and will be updated shortly.

Feel free to clone this and fix the bug you found!

Hello Thomas,

This goes back a bit, but could this clustering be applied to the Nasdaq 100 list recently posted?

Grant

Hi Grant,

Most certainly. This is an interesting algorithm and I look forward to hearing how this does on the Nasdaq.

Thomas

Thomas,

Upon cloning and running as-is, I got this error:

67  Error   Nonexistent property: close  

Any ideas?

Grant

Just ran this (cloned from some time ago) and it seems to run just fine. Not sure if there's a difference?

[Backtest attached]

Thanks...I cloned the algo you posted immediately above and it runs. --Grant

Your new code has this:

data[s]['variation'] = (data[s].close_price - data[s].open_price)  

The code I cloned originally has:

data[s]['variation'] = (data[s].close - data[s].open)  

Here's the algo with the Nasdaq 100 stocks. Unfortunately, the log output gets clipped:

2013-01-29 batch_cluster:60 DEBUG Found 26 groups from 12 complete histories
2013-01-29 PRINT Cluster 1: 24, 114, 1787, 8677, 23709
2013-01-29 PRINT Cluster 2: 630, 2696, 3951
2013-01-29 PRINT Cluster 3: 20680, 8816, 4246, 6683
2013-01-29 PRINT Cluster 4: 9883, 12652
2013-01-29 PRINT Cluster 5: 3806, 40207, 22316
2013-01-29 PRINT Cluster 6: 1419, 15101
2013-01-29 PRINT Cluster 7: 17632, 9736, 19725
2013-01-29 PRINT Cluster 8: 122, 38650, 39095, 36930, 2618, 13862
2013-01-29 PRINT Cluster 9: 14014, 24819, 24482
2013-01-29 PRINT Cluster 10: 368, 25317, 23906, 25339, 13905, 6413
2013-01-29 PRINT Cluster 11: 2663, 20208, 7671
2013-01-29 PRINT Cluster 12: 42950, 4485
2013-01-29 PRINT Cluster 13: 328, 14328, 2853, 5061, 39840, 14848
2013-01-29 PRINT Cluster 14: 18870, 3212, 8655, 43919
2013-01-29 PRINT Cluster 15: 1900, 27543, 43405, 5121, 5767, 8017, 8132
2013-01-29 PRINT Cluster 16: 739, 27357, 13940
2013-01-29 PRINT Cluster 17: 4668, 3450, 7061, 8158
2013-01-29 PRINT Cluster 18: 22802, 8352
2013-01-29 PRINT Cluster 19: 32301, 5166, 24518
2013-01-29 PRINT Cluster 20: 27533, 1637, 12213, 5787, 26169
2013-01-29 PRINT Cluster 21: 8857, 19917, 6295, 11901
2013-01-29 undefined:undefined WARN Logging limit exceeded; some messages discarded

Have to figure out a work-around. Is the limit on the number of lines of text? Or total characters? Something else?

Grant

[Backtest attached]

Hello Grant,

Is this the 22-line limit?

Logging is rate-limited (throttled) for performance reasons. The basic limit is two log messages per call of initialize and handle_data. Each backtest has an additional buffer of 20 extra log messages. Once the limit is exceeded, messages are discarded until the buffer has been emptied. A message explaining that some messages were discarded is shown.
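A toy model of that policy (illustrative only; LogThrottle is my own sketch, not the actual implementation):

class LogThrottle(object):
    # 2 messages per initialize/handle_data call, plus a
    # backtest-wide buffer of 20 extra messages
    def __init__(self, per_call=2, extra=20):
        self.per_call = per_call
        self.extra = extra
        self.used = 0

    def start_call(self):
        # refresh the per-call allowance at the top of each call
        self.used = 0

    def allow(self):
        if self.used < self.per_call:
            self.used += 1
            return True
        if self.extra > 0:
            self.extra -= 1
            return True
        return False  # over the limit: message is discarded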

P.

Thanks Peter,

Yep...figured that out this morning, after looking at the help page. I found a way of printing out all of the results, but the formatting was kinda ugly. I'll give it another go when I get the chance.

Grant

Hello Thomas,

Would it be correct to assume that within a given cluster, there might be a trading pair (like GLD & GDX were)? If so, my thought is to ignore all of the clusters with only one stock, and then look for pairs within the remaining clusters. I came across this:

https://www.leinenbock.com/adf-test-in-python/

What do you think? Could it be applied here?
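Here's a rough sketch of what that ADF-based pair check might look like with statsmodels (just my sketch; looks_cointegrated is a made-up helper, and a small p-value only suggests the spread is stationary):

import numpy as np
from statsmodels.tsa.stattools import adfuller

def looks_cointegrated(prices_a, prices_b, max_pvalue=0.05):
    # hedge ratio from a least-squares fit of A on B
    beta = np.polyfit(prices_b, prices_a, 1)[0]
    spread = prices_a - beta * prices_b
    # ADF test on the spread: a small p-value is evidence of
    # stationarity, i.e. the pair may mean-revert and be tradeable
    pvalue = adfuller(spread)[1]
    return pvalue < max_pvalue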

Grant

I tweaked the output code so that all clusters are displayed in the log:

result = '------------------\n'
if groups is not None:
    # display stock sids that co-fluctuate:
    for i, g in groups.iteritems():
        result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))
    print result

I also made a change so that security symbols are displayed:

# filter the sids into groups, in order they appear in c.sids
groups = defaultdict(lambda: [])
for i, grp_idx in enumerate(labels):
    # groups[grp_idx].append( int(c.sids[i]) )
    groups[grp_idx].append( c.sids[i].symbol )

Grant

[Backtest attached]

Here's the output from the attached backtest:

2013-10-10 PRINT ------------------
Cluster 1: AAPL, CMCS, VRSK
Cluster 2: ADI, ADSK, AMGN, INTC
Cluster 3: ADP, CTRX, CTXS, MYL, NFLX
Cluster 4: BBBY
Cluster 5: BIDU, BRCM, CHTR, CSCO, DTV, MNST, NUAN
Cluster 6: AMAT, CTSH, FAST
Cluster 7: DELL, DLTR, GOLD, SRCL
Cluster 8: ALXN, ESRX
Cluster 9: EXPD, GRMN, LINT, LMCA, TSLA
Cluster 10: CHKP, FFIV, FISV, GILD
Cluster 11: CERN, CHRW, EQIX, FOXA, MDLZ, PCAR, SNDK
Cluster 12: GOOG
Cluster 13: AVGO, CA, DISC, HSIC, KLAC, MAT, MCHP, NTAP, QCOM, XRAY
Cluster 14: KRFT
Cluster 15: ADBE, ISRG, MSFT, ORLY
Cluster 16: ATVI, LBTY, MU
Cluster 17: AMZN, EBAY, LLTC, MXIM, SYMC
Cluster 18: ALTR, FB, GMCR, NVDA, SHLD
Cluster 19: AKAM, BIIB, PCLN, WFM
Cluster 20: CELG, SBAC, TXN, VOD, WYNN
Cluster 21: SBUX, SIAL
Cluster 22: PAYX, REGN, SIRI, STX, WDC, XLNX
Cluster 23: EXPE, FOSL, ROST, SPLS, VRTX
Cluster 24: COST, INTU, VIAB, YHOO

Any ideas on how to interpret the results? For example, when I dig into Cluster 5, we have:

BIDU (Baidu) - Chinese-language Internet search provider
BRCM (Broadcom Corporation) - global semiconductor solution for wired and wireless communications
CHTR (Charter Communications) - provides cable services in the United States, offering a range of entertainment, information and communications solutions to residential and commercial customers
CSCO (Cisco Systems) - designs, manufactures, and sells Internet protocol (IP)-based networking and other products related to the communications and information technology (IT) industry, and provides services associated with these products and their use
DTV (DIRECTV) - provides digital television entertainment in the United States and Latin America
MNST (Monster Beverage Corporation) - develops, markets, sells and distributes alternative beverages
NUAN (Nuance Communications) - a provider of voice and language solutions for businesses and consumers globally

The cluster sorta makes sense, except for MNST, which is a beverage company. Does the algorithm also provide a measure of the strength of the individual cluster members? For example, within Cluster 5, are there securities that are more confidently assigned to the cluster, with other securities having a weaker association?
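One thought (just a sketch on my part): affinity_propagation also returns an "exemplar" stock for each cluster, so each member's covariance with its cluster's exemplar might serve as a rough strength measure (edge_model and c.sids as in the algo):

from sklearn import cluster

# centers[k] is the index of cluster k's exemplar stock
centers, labels = cluster.affinity_propagation(edge_model.covariance_)
for i, grp_idx in enumerate(labels):
    # covariance with the exemplar: larger = tighter association
    strength = edge_model.covariance_[i, centers[grp_idx]]
    print 'sid %s -> cluster %i, strength %.3f' % (c.sids[i].symbol, grp_idx + 1, strength)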

Grant

[Backtest attached]

Hi Grant,

This looks pretty cool! The clusters might be good candidates for pair-trading (e.g. https://www.quantopian.com/posts/fixed-version-of-ernie-chans-gold-vs-gold-miners-stat-arb) so it'd be neat to extend the algo in that way.

Thomas

I think you want sid(1406) for CELG, not sid(40207). sid(40207) is actually CELGZ. We have a data problem where they are both reporting as CELG that I haven't yet resolved.

Thanks...Grant

Hello Thomas & all,

I'm considering updating this algo to use the history API, unless someone has already done it. Thoughts?

I figure I'll learn more Python, and get a better feel for this clustering magic.

Grant

Hi Grant,

I think that's an excellent idea, and it should be pretty straightforward. In principle you could just remove the batch_transform decorator and manually call the function, passing in the DataFrame returned by history().
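Roughly, something like this (a sketch, assuming batch_cluster keeps its current body and takes the price DataFrame directly):

def handle_data(context, data):
    # the history() window replaces the batch_transform window
    prices = history(12, '1d', 'close_price', ffill=True)
    groups = batch_cluster(prices, context)  # plain function call, no decorator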

Thomas

Thanks Thomas,

That's what I thought. And when (if ever) the history API supports minute-level data, the algo can be run at a higher frequency.

Grant

Hello Thomas,

Here's a rough cut. I get a build error "59 Error Runtime exception: ValueError: array must not contain infs or NaNs" triggered by the line:

edge_model.fit(X)  

Any idea what's going on? My understanding is that the code is actually run in some fashion as part of the build, right? However, it's challenging to debug, since the log output is suppressed.

Grant

from sklearn import cluster, covariance  
import numpy as np  
from collections import defaultdict

# based on the example at:  
# http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html

# use in quick backtester

update_period = 5*390 # update clusters at this period in minutes

def initialize(context):  
    c = context  
    # Nasdaq 100 from https://www.quantopian.com/posts/list-of-nasdaq-100-sids-to-use-in-your-algo  
    # c.sids = [sid(24),    sid(114),   sid(122),   sid(630)  , sid(67),  
              # sid(20680), sid(328),   sid(14328), sid(368),   sid(16841),  
              # sid(9883),  sid(337),   sid(38650), sid(739),   sid(27533),  
              # sid(3806),  sid(18529), sid(1209),  sid(40207), sid(1419),  
              # sid(15101), sid(17632), sid(39095), sid(1637),  sid(1900),  
              # sid(32301), sid(18870), sid(14014), sid(25317), sid(36930),  
              # sid(12652), sid(26111), sid(24819), sid(24482), sid(2618),  
              # sid(2663),  sid(27543), sid(1787) , sid(2696),  sid(42950),  
              # sid(20208), sid(2853),  sid(8816),  sid(12213),  sid(3212),  
              # sid(9736),  sid(23906), sid(26578), sid(22316), sid(13862),  
              # sid(3951),  sid(8655),  sid(25339), sid(4246),  sid(43405),  
              # sid(27357), sid(32046), sid(4485),  sid(43919), sid(4668),  
              # sid(8677),  sid(22802), sid(3450),  sid(5061),  sid(5121),  
              # sid(5149),  sid(5166),  sid(23709), sid(13905), sid(19926),  
              # sid(19725), sid(8857),  sid(5767),  sid(5787),  sid(19917),  
              # sid(6295),  sid(6413),  sid(6546),  sid(20281), sid(6683),  
              # sid(26169), sid(6872),  sid(11901), sid(13940), sid(7061),  
              # sid(15581), sid(24518), sid(7272),  sid(39840), sid(7671),  
              # sid(27872), sid(8017),  sid(38817), sid(8045),  sid(8132),  
              # sid(8158),  sid(24124), sid(8344),  sid(8352),  sid(14848)]  
    c.sids = []  
    # some sids to look at  
    c.sids.append(sid(24))  
    c.sids.append(sid(18522))  
    c.sids.append(sid(5061))  
    c.sids.append(sid(20486))  
    c.sids.append(sid(5885))  
    c.sids.append(sid(4707))  
    c.sids.append(sid(3149))  
    context.elapsed_minutes = 0

# @batch_transform(refresh_period=5, window_length=12)
def batch_cluster(attribute, context):
    c = context  
    # tell it we're looking for a graph structure  
    edge_model = covariance.GraphLassoCV()  
    X = attribute.values.copy()  
    X /= X.std(axis=0)  
    edge_model.fit(X)  
    # now process into clusters based on co-fluctuation  
    _, labels = cluster.affinity_propagation(edge_model.covariance_)  
    log.debug("Found {0} groups from {1} complete histories".format(max(labels)+1,len(attribute)))  
    # filter the sids into groups, in order they appear in c.sids  
    groups = defaultdict(lambda: [])  
    for i, grp_idx in enumerate(labels):  
        # groups[grp_idx].append( int(c.sids[i]) )  
        groups[grp_idx].append( c.sids[i].symbol )  
    return groups  

def handle_data(context, data):
    # run the clustering on the first bar and once every update_period
    # bars thereafter; the counter must advance on every call
    if context.elapsed_minutes % update_period != 0:
        context.elapsed_minutes += 1
        return
    context.elapsed_minutes += 1
    prices_open = history(13, '1d', 'open_price', ffill=True)[0:-1]
    prices_close = history(13, '1d', 'close_price', ffill=True)[0:-1]
    prices_delta = prices_close - prices_open  
    # print prices_delta  
    #  
    # return  
    c = context  
    # for s in c.sids:  
        # if s in data:  
            # # add the day's price range to the list for this sid  
            # data[s]['variation'] = (data[s].close_price - data[s].open_price)  
    # note that the model wont work if there are different number of  
    # entries in data.variation.  
    groups = batch_cluster(prices_delta, c)
    # if groups is not None:  
        # # display stock sids that co-fluctuate:  
        # for i, g in groups.iteritems():  
            # print 'Cluster %i: %s' % ((i + 1), ", ".join([str(s) for s in g]))  
        # # ...ADD ORDER CODE HERE...  
    result = '------------------\n'  
    if groups is not None:  
        # display stock sids that co-fluctuate:  
        for i, g in groups.iteritems():  
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))  
        print result  

Hi Grant,

You might have to call .dropna() on the DataFrame.

Thomas

Got it working; however, I still get the error when I comment out these lines:

    if np.any(pd.isnull(X_zscore)):  
        print 'null found in X_zscore'  
        return None  

But I don't see 'null found in X_zscore' in the log output. So, something ain't right with the build process, it seems.

By the way, I added:

import pandas as pd
pd.set_option('use_inf_as_null', True)

This was based on the guidance at http://pandas.pydata.org/pandas-docs/dev/missing_data.html that .isnull() will otherwise ignore inf and -inf.
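Alternatively, a cleanup along these lines (my sketch, following Thomas's .dropna() suggestion) should get rid of the infs and NaNs without the global option:

import numpy as np

# convert +/-inf to NaN, then drop incomplete rows before fitting
X = prices_delta.replace([np.inf, -np.inf], np.nan).dropna()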

Grant

Log output:

2013-09-03 batch_cluster:70 DEBUG Found 2 groups from 12 complete histories
2013-09-03 PRINT ------------------ Cluster 1: ARMH Cluster 2: AAPL, MSFT, FCS, PEP, MCD, GE
2013-09-10 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-09-10 PRINT ------------------ Cluster 1: ARMH Cluster 2: AAPL, MSFT, FCS, PEP, MCD Cluster 3: GE
2013-09-17 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-09-17 PRINT ------------------ Cluster 1: ARMH, PEP, MCD, GE Cluster 2: MSFT Cluster 3: AAPL, FCS
2013-09-24 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-09-24 PRINT ------------------ Cluster 1: AAPL, FCS, GE Cluster 2: PEP Cluster 3: ARMH, MSFT, MCD
2013-10-01 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-10-01 PRINT ------------------ Cluster 1: ARMH Cluster 2: AAPL, MSFT, FCS, MCD, GE Cluster 3: PEP
2013-10-08 batch_cluster:70 DEBUG Found 2 groups from 12 complete histories
2013-10-08 PRINT ------------------ Cluster 1: AAPL Cluster 2: ARMH, MSFT, FCS, PEP, MCD, GE
2013-10-15 batch_cluster:70 DEBUG Found 2 groups from 12 complete histories
2013-10-15 PRINT ------------------ Cluster 1: AAPL Cluster 2: ARMH, MSFT, FCS, PEP, MCD, GE
2013-10-22 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-10-22 PRINT ------------------ Cluster 1: ARMH, GE Cluster 2: MSFT Cluster 3: AAPL, FCS, PEP, MCD
2013-10-29 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-10-29 PRINT ------------------ Cluster 1: ARMH, PEP, MCD, GE Cluster 2: MSFT Cluster 3: AAPL, FCS
End of logs.  
[Backtest attached]