Back to Posts
Listen to Thread

I just got word that zlib is available on Quantopian.

I'd requested it, since it should be useful in compression-based pattern matching algorithms (e.g. see http://www.geneffects.com/simphile/).

Grant

I get a never-ending "Waiting for logs..." message with this code:

import zlib

def initialize(context):  
    context.stock = sid(8554)  
def handle_data(context, data):  
    s = 'test'  
    t = zlib.compress(s)  
    log.debug(t)  

Any idea what's going on?

Grant

Hi Grant,

Thanks for reporting this.

When you log a string in Quantopian, we first save it to disk, and then relay it to you. When we save it to disk, we require it to be utf-8 encoded. The output of zlib's compression appears to be raw bytes, which fail to encode as utf-8. This, in turn, is exposing a hole in our exception handling, and the unhandled exception is hanging the backtest.

If you avoid logging the output of zlib's compress, you should be able to proceed with your development while we look into the problem.

thanks,
fawce

Thanks Fawce,

Works now (see attached).

Grant

Clone Algorithm
3
Loading...
Backtest from to with initial capital ( data)
Cumulative performance:
Algorithm Benchmark
Custom data:
Week
Month
All
Total Returns
--
Alpha
--
Beta
--
Sharpe
--
Sortino
--
Information Ratio
--
Benchmark Returns
--
Volatility
--
Max Drawdown
--
Returns 1 Month 3 Month 6 Month 12 Month
Alpha 1 Month 3 Month 6 Month 12 Month
Beta 1 Month 3 Month 6 Month 12 Month
Sharpe 1 Month 3 Month 6 Month 12 Month
Sortino 1 Month 3 Month 6 Month 12 Month
Information Ratio 1 Month 3 Month 6 Month 12 Month
Volatility 1 Month 3 Month 6 Month 12 Month
Max Drawdown 1 Month 3 Month 6 Month 12 Month
This backtest was created using an older version of the backtester. Please re-run this backtest to see results using the latest backtester. Learn more about the recent changes.
There was a runtime error.

Here's a simple example that illustrates the basic idea. --Grant

Clone Algorithm
3
Loading...
Backtest from to with initial capital ( data)
Cumulative performance:
Algorithm Benchmark
Custom data:
Week
Month
All
Total Returns
--
Alpha
--
Beta
--
Sharpe
--
Sortino
--
Information Ratio
--
Benchmark Returns
--
Volatility
--
Max Drawdown
--
Returns 1 Month 3 Month 6 Month 12 Month
Alpha 1 Month 3 Month 6 Month 12 Month
Beta 1 Month 3 Month 6 Month 12 Month
Sharpe 1 Month 3 Month 6 Month 12 Month
Sortino 1 Month 3 Month 6 Month 12 Month
Information Ratio 1 Month 3 Month 6 Month 12 Month
Volatility 1 Month 3 Month 6 Month 12 Month
Max Drawdown 1 Month 3 Month 6 Month 12 Month
This backtest was created using an older version of the backtester. Please re-run this backtest to see results using the latest backtester. Learn more about the recent changes.
There was a runtime error.

The amount something can be compressed is basically down to how much unique data is inside of it, including the "uniqueness" of the order of that data. The maximum amount you can compress something is the amount of "real information" inherent in it.

I suppose if you coded the price action of a stock into a couple of categories (e.g. price[i] > price[i-1] 3 days in a row = category 1, price[i] < price[i-1] 3 days in a row = category 2, and otherwise category 3) then you could run compression to see how likely it is the pattern repeats for that stock? when new data is included and the compression doesn't go higher, then you know its repeating, otherwise you know something "new" is happening.

Just a thought.

Thanks James,

Yes, that's the idea...there is a coding step and then the comparison. One simple coding would be, for each tic of the market, code a 1 if the security price is above its N-tic moving average, and code a 0 if it is below. By applying the coding to multiple securities, their patterns could then be compared and ranked. A buy/sell indicator could then be developed.

Grant

I think I would have to study that for some time before really understanding it!

However it does say they are interested in relative distances, not absolutes. Does that mean you have to divide by the maximum compressed length seen between the two stocks?

Oh sorry, I think they give that at end of the paper:

CDM(x,y) = C(xy) / [ C(x) + C(y) ]

so that would be the normalised distance (?)

James,

Yes, kinda theoretical...just thought I'd include it for completeness.

At the end of the paper, there is the measure that you write above:

CDM(x,y) = C(xy) / [ C(x) + C(y) ]

Earlier in the paper, eq. VII.1 is given as an alternate:

NCD(x,y) = [ C(xy) - min{C(x),C(y)} ] / max{C(x),C(y)}

Both are straightforward to code, since all of the "heavy lifting" is done by the compression algorithm.

I've yet to find a Python implementation of this approach, but there must be some examples out there.

Grant

So the underlying string to compress must be some feature of the stocks to analyse? As you said it could be a simple coding 0 and 1 for up/down. I know the paper specifies only 0 and 1 as making up the string, but I'm not sure if that means you can't use other letters/numbers in the zlib string, since they are ultimately 0s and 1s anyhow.

James,

I don't know if it makes a difference...if my alphabet consists of A, B, C, & D, should I use 00, 01, 10, & 11, respectively, to construct my string to be compressed? Or does the algorithm work the same, regardless of the details of the coding (e.g. the same result would be obtained with A, B, C, & D or 00, 01, 10, & 11)?

My hunch is that the coding of the string does not need to be binary, but I could be wrong.

Grant

Hello James,

The code is still kinda ugly, but here's the idea I had for coding. The next step is to apply the compression, etc. It'll be nice when Quantopian enables plotting...then one could plot the similarity measure versus time.

Grant

Clone Algorithm
34
Loading...
Backtest from to with initial capital ( data)
Cumulative performance:
Algorithm Benchmark
Custom data:
Week
Month
All
Total Returns
--
Alpha
--
Beta
--
Sharpe
--
Sortino
--
Information Ratio
--
Benchmark Returns
--
Volatility
--
Max Drawdown
--
Returns 1 Month 3 Month 6 Month 12 Month
Alpha 1 Month 3 Month 6 Month 12 Month
Beta 1 Month 3 Month 6 Month 12 Month
Sharpe 1 Month 3 Month 6 Month 12 Month
Sortino 1 Month 3 Month 6 Month 12 Month
Information Ratio 1 Month 3 Month 6 Month 12 Month
Volatility 1 Month 3 Month 6 Month 12 Month
Max Drawdown 1 Month 3 Month 6 Month 12 Month
This backtest was created using an older version of the backtester. Please re-run this backtest to see results using the latest backtester. Learn more about the recent changes.
There was a runtime error.

Hi Grant,

I've coded the functions. that's the easy bit ;)

def NCD( X, Y ):  
    CX = len(zlib.compress(X))  
    CY = len(zlib.compress(Y))  
    return (len(zlib.compress(X+Y)) - min(CX,CY)) / float(max(CX,CY))

def CDM( X, Y ):  
    CX = len(zlib.compress(X))  
    CY = len(zlib.compress(Y))  
    return len(zlib.compress(X+Y)) / float( CX + CY )  

For my backtest I'm using 2012-03-1 to 2012-03-29 and here's what WMT and SPY did in that period.

I think Y is the coded version of WMT, right?

so on the first call its    000000000001111101111111111111  
the second                   000000000010111011111111111111    (showing the shift to indicate a new value was added)  

because a new price high of 62.06 meant the average went up, and the relative_prices changed, and one the prices fell below the average.

and by the end its 110000000000000011111111111111  

which would make sense by looking at the avgWMT on the chart.

(Just clarifying, I don't necessarily have a point here.)

EDIT: NCD now on graph as well.

Thanks James,

Yes, Y corresponds to WMT.

I don't understand this:

so on the first call its    000000000001111101111111111111  
the second                   000000000010111011111111111111    (showing the shift to indicate a new value was added)  

In the code that I posted, each market tic (day), I start with a null strings and then concatenate the binary price swing codes (relative to the average price for the period):

    X = ''  
    Y = ''  
    for i in range(W_L):  
        X = X + str(int(coded_prices[i,0]))  
        Y = Y + str(int(coded_prices[i,1]))  

I wouldn't expect blanks in the X & Y strings.

By the way, how did you make the plots? Did you download prices from the web (e.g. Yahoo) and run code on your local pc? Or did you run everything in Quantopian and then cut & paste results from the log output into a spreadsheet program?

Grant

Hi Grant,

There are no blanks in the string, I have shifted the string to the right by one place just to make it clear only 1 character had changed. Since you are computing with a moving window you have one new day's value added in to the start of the period and one value which is remove from the end of the period, so there is a constant shift going on, relating to the previous values.

The plotting I just copied + pasted from the PRINT statements into a text editor and then did a series of regular expression to turn into a suitable CSV format, and loaded into excel. We're really in need of a plotting facility in Quantopian.

James

Log in to reply to this thread.
Not a member? Sign up!