Simple ML demo to port to QSTK

This is a simple ML demo I want to port to QSTK.

Dan

# /a/ks/q10/spy255.py

# I should predict short term positions of SPY

from sklearn           import linear_model
from sklearn.neighbors import KNeighborsClassifier
from collections import deque
import numpy  as np
import pandas as pd
from pandas      import DataFrame as df
from pandas      import Series

def initialize(context):
    ct           = context
    ct.is_len    = 1400 # I should learn from this many observations
    ct.tkr       = symbol('SPY')
    # I need help storing recent prices from data[] sorted by recent dates
    ct.rdt    = {} # datetimes
    ct.rp     = {} # prices
    ct.ip     = {} # initial predictions
    ct.myeff  = {} # initial predictions effectiveness
    ct.myg4   = {} # 4 of my last gains
    ct.ipval  = 0.5 # bootstrap ip value
    ct.npval  = 0.5
    ct.kelly  = 1.0
    ct.kelly_base = 1.0
    ct.kelly_x    = 0.0

def handle_data(context, data):
    ct = context
    hc = 2 # I should use this to count every 2-hours
    for tkr in data:
        if tkr not in ct.rdt:
            # I should initialize trackers of tkrs:
            ct.rdt[tkr]  = deque(maxlen=ct.is_len)
            ct.rp[tkr]   = deque(maxlen=ct.is_len)
            ct.ip[tkr]   = deque(maxlen=ct.is_len)
            ct.myeff[tkr]= deque(maxlen=ct.is_len)
            ct.myg4[tkr] = deque(maxlen=4)
            # I should initialize position too.
            # Just go long on tkr until I learn more.
            order_target_percent(tkr, 1.0)
        # I should get dates,prices from data[tkr]
        mytimes = ct.rdt[tkr]
        myprices= ct.rp[tkr]
        myip    = ct.ip[tkr] # This should collect initial predictions.
        myg4    = ct.myg4[tkr] # Collect effectiveness of last 4 ip.
        myeff   = ct.myeff[tkr] # I should collect sum(myg4)
        if data[tkr].datetime.strftime('%M') == '58':
            # Once an hour I should:
            mytimes.append( data[tkr].datetime)
            myprices.append(data[tkr].price)
            # If I have enough is-data, I should learn/predict
            if len(myprices) == myprices.maxlen:
                # I should sort prices by date.
                datep_df = df([list(mytimes)]).T
                datep_df.columns = ['pdate']
                datep_df['cp'] = list(myprices)
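                # (DataFrame.sort was the pandas API at the time;
                #  newer pandas calls this sort_values)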
                datep_df_sorted = datep_df.sort('pdate')
                cp = list(datep_df_sorted['cp']) # current price
                lag1 = np.array(lagn(hc*1,cp)) # current price, hc-hours-ago
                lag2 = np.array(lagn(hc*2,cp))
                lag3 = np.array(lagn(hc*3,cp))
                lag4 = np.array(lagn(hc*4,cp))
                lag5 = np.array(lagn(hc*5,cp))
                lag6 = np.array(lagn(hc*6,cp))
                leadp = np.array(leadn(hc*1,cp)) # future price in hc-hours
                cp    = np.array(cp)
                n1g   = (leadp - cp) / cp # normalized gain
                # I should use a DataFrame to collect my X-values:
                bigx_df = df([list((cp - lag1) / lag1)]).T
                bigx_df.columns = ['x1']
                bigx_df['x2'] = (cp - lag2) / lag2
                bigx_df['x3'] = (cp - lag3) / lag3
                bigx_df['x4'] = (cp - lag4) / lag4
                bigx_df['x5'] = (cp - lag5) / lag5
                bigx_df['x6'] = (cp - lag6) / lag6
                yval      = n1g > 0
                yval_is   = yval[0:-1] # in-sample
                # I should predict now.
                # goog: In python pandas how I convert dataframe to numpy array?
                bigx     = bigx_df.reset_index().values[:,1:]
                bigx_is  = bigx[0:-1] # in-sample
                bigx_oos = bigx[-1]   # out-of-sample
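                # Note: n_neighbors is the whole in-sample set, so with
                # weights='distance' every training row votes, weighted by
                # its inverse distance to the out-of-sample row.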
                knn1 = KNeighborsClassifier(n_neighbors=len(yval_is), weights='distance')
                knn1.fit(bigx_is,yval_is)
                ct.ipval = knn1.predict_proba(bigx_oos)[0,1]
                # I should save this prediction so I can learn from it.
                myip.append(ct.ipval)
                # I should get and save effectiveness too.
                myg = np.sign(ct.ipval - 0.5) * n1g[-1]
                myg4.append(myg)
                myeff.append(sum(myg4))
                if len(myip) == len(bigx_df):
                    # I should have many ipval and myeff now. 
                    # I make them a feature and predict again.
                    bigx_df['myip']  = list(myip)
                    bigx_df['myeff'] = list(myeff)
                    bigx2_df = bigx_df[['x1','x2','x3','x4','myip','myeff']]
                    # I should convert df to np array.
                    bigx2     = bigx2_df.reset_index().values[:,1:]
                    bigx2_is  = bigx2[0:-1] # in-sample
                    bigx2_oos = bigx2[-1]   # out-of-sample
                    knn2 = KNeighborsClassifier(n_neighbors=len(yval_is), weights='distance')
                    knn2.fit(bigx2_is,yval_is)
                    ct.npval = knn2.predict_proba(bigx2_oos)[0,1]
                    # If ct.npval > 0.5 I should go long.
                    # Else I should go short.
                    # I should use kelly to magnify the prediction direction.
                    # A prediction of 0.55 should give me a kelly of:
                    # ct.kelly_base + (ct.kelly_x * 0.05)
                    ct.kelly = np.sign(ct.npval - 0.5)*(ct.kelly_base+ct.kelly_x*abs(ct.npval - 0.5))
                    order_target_percent(tkr, ct.kelly)
        record(kelly = ct.kelly)

def leadn(n, lst):
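    # Shift lst forward by n: drop the first n values and pad the end by
    # repeating the last value, so element i becomes lst[i+n] (the "future"
    # price n bars ahead) where available.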
    dq = deque(lst, maxlen=len(lst))
    for i in range(1,1+n):
        dq.append(lst[-1])
    return(list(dq))

def lagn(n, lst):
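    # Shift lst backward by n: pad the front by repeating the first value,
    # so element i becomes lst[i-n] (the price n bars ago) where available.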
    dq = deque(lst, maxlen=len(lst))
    for i in range(1,1+n):
        dq.appendleft(lst[0])
    return(list(dq))

So I'm assuming SPY is the S&P 500 ETF Trust. Is that right?
Why does everybody use it as a benchmark instead of the S&P 500 index itself?

Yes SPY is that trust.

I like SPY because it is easier to spell than ^GSPC

Also I think SPY is better because you can actually buy/sell it.

For real-world trading I trade the ES mini because commissions are low, it's liquid, and it makes my Schedule D tax form thinner.

Thank you for the answer, Dan!

Very interesting!

Why did you choose KNeighborsClassifier?

What is the Kelly bit doing, exactly?

@jj, my preference has been logistic regression.

But then I bumped into KNN while surfing through some links I found on the Quantopian GitHub page.

So I backtested KNN on some ^GSPC Yahoo prices going back to 1950 and found that KNN offers a tiny edge over LR.

Also I like KNN because it is easy to describe how it works.

LR internals are described well by Ng in his Coursera videos, but that is an investment of at least 30 minutes of your time.

But KNN can be understood just by surfing its Wikipedia page and thinking about it for 5 minutes.
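
If you want to poke at that comparison yourself, here is a rough standalone sketch of a KNN-vs-LR test on lagged returns. It is not the exact backtest I ran; the 'gspc.csv' file, its 'Close' column, and the 80/20 chronological split are just placeholders for whatever daily price history and validation scheme you prefer.

# Rough sketch only: compare KNN vs logistic regression on lagged-return
# features. 'gspc.csv' and 'Close' are placeholders, not my actual data file.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

px = pd.read_csv('gspc.csv')['Close'].values  # daily closes, oldest first

# Features: past 1..6-day returns.  Label: next-day return > 0.
feats = {'x%d' % k: (px[6:-1] - px[6 - k:-1 - k]) / px[6 - k:-1 - k]
         for k in range(1, 7)}
X = pd.DataFrame(feats).values
y = (px[7:] - px[6:-1]) > 0

split = int(len(y) * 0.8)            # simple chronological split
X_is, y_is = X[:split], y[:split]
X_oos, y_oos = X[split:], y[split:]

knn = KNeighborsClassifier(n_neighbors=len(y_is), weights='distance')
lr = LogisticRegression()
for name, model in [('knn', knn), ('lr', lr)]:
    model.fit(X_is, y_is)
    print(name, 'oos hit rate:', model.score(X_oos, y_oos))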

I offer 2 Kelly params.

ct.kelly_base is a general aggressiveness dial.

ct.kelly_base == 1 means act normal.

ct.kelly_base > 1 means act aggressive.

ct.kelly_x quantifies Kelly's idea that if you have an edge, you should bet more.

ct.kelly_x == 0 means don't try to exploit the edge.

ct.kelly_x > 0 means try to exploit the edge.

The attached backtest (spy256) will allow you to see the effect of both params where I set

ct.kelly_base = 2
ct.kelly_x = 2

The first backtest (spy255) had this:

ct.kelly_base = 1
ct.kelly_x = 0
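
To make the two dials concrete, here is the sizing expression from handle_data pulled out as a tiny standalone helper (the function name is just for illustration), with both parameter settings plugged in:

import numpy as np

def kelly_size(npval, kelly_base, kelly_x):
    # Same expression as in handle_data: direction from the sign of
    # (npval - 0.5), magnitude from base aggressiveness plus the scaled edge.
    return np.sign(npval - 0.5) * (kelly_base + kelly_x * abs(npval - 0.5))

print(kelly_size(0.55, 1.0, 0.0))   # spy255 settings: always +/-1.0
print(kelly_size(0.45, 1.0, 0.0))   # -1.0 (full short)
print(kelly_size(0.55, 2.0, 2.0))   # spy256 settings: 2 + 2*0.05 = 2.1x long
print(kelly_size(0.45, 2.0, 2.0))   # -2.1 (2.1x short)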

As far as actually using these algos goes, I'd advise against it.

The period between 2006 and 2009 would cause most followers to abandon ship.

Dan

# /a/ks/q10/spy256.py

# I should predict short term positions of SPY

from sklearn           import linear_model
from sklearn.neighbors import KNeighborsClassifier
from collections import deque
import numpy  as np
import pandas as pd
from pandas      import DataFrame as df
from pandas      import Series

def initialize(context):
    ct           = context
    ct.is_len    = 1400 # I should learn from this many observations
    ct.tkr       = symbol('SPY')
    # I need help storing recent prices from data[] sorted by recent dates
    ct.rdt    = {} # datetimes
    ct.rp     = {} # prices
    ct.ip     = {} # initial predictions
    ct.myeff  = {} # initial predictions effectiveness
    ct.myg4   = {} # 4 of my last gains
    ct.ipval  = 0.5 # bootstrap ip value
    ct.npval  = 0.5
    # Increase Kelly params here:
    ct.kelly  = 2.0
    ct.kelly_base = 2.0
    ct.kelly_x    = 2.0

# handle_data(), leadn(), and lagn() are identical to spy255.py above;
# only the Kelly parameters in initialize() differ.

The performance looks nice, but the small number of zero crossings and changes in the kelly value means that the sample size (kelly "events") is too small to build much trust in it.

Could it be tuned to make more kelly-based decisions, so we could see how those decisions fare?

ML is not only about applying ML methods, but also about knowing when they actually work and when we are only seeing a lucky shot.

Yes, that is why I want to move it to QSTK so I can study it more.

A simple way to get more crossings is to reduce the number of in-sample rows.

If I did that, the proper thing to do from an ML perspective would be to also reduce the number of features, to avoid over-fitting.

My rule of thumb is that N features need about 10^N in-sample rows.
You can see from the source code that I have 6 features, so I should have 10^6 rows.
Instead I feed it about 10^3, so I'm sliding down a slippery slope towards a garbage algo.
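
Here is that rule of thumb as a two-line helper, just to make the arithmetic explicit (the helper name is only for illustration):

import math

def max_features_for(n_rows):
    # Invert the 10^N rule of thumb: N features want about 10^N rows,
    # so cap the feature count at log10(rows).
    return int(math.log10(n_rows))

print(max_features_for(1400))    # ~3 features for the 1400-row window
print(max_features_for(10**6))   # all 6 features would want a million rows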

Another thing to try is to synthesize features from talib.
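
For example, something along these lines; this is only a sketch (I have not backtested these particular indicators), and the random-walk closes are a stand-in for the sorted price history (cp) in the algo:

import numpy as np
import pandas as pd
import talib

# Stand-in price series; in the algo this would be the sorted cp array.
closes = np.cumprod(1 + 0.001 * np.random.randn(500)) * 100.0

feat_df = pd.DataFrame({
    'rsi14':   talib.RSI(closes, timeperiod=14),                # momentum oscillator
    'sma_gap': closes / talib.SMA(closes, timeperiod=20) - 1.0, # stretch vs 20-bar mean
})
macd, macd_signal, macd_hist = talib.MACD(closes, fastperiod=12,
                                          slowperiod=26, signalperiod=9)
feat_df['macd_hist'] = macd_hist

# Drop the warm-up rows talib leaves as NaN before joining these columns
# onto the x1..x6 lag features.
feat_df = feat_df.dropna()
print(feat_df.tail())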

This is more like middle-school chem class than quantitative finance.

Dan

Hey Dan,

Thank you for sharing; this is great. Would you be kind enough to explain more about this? I am a little confused while reading the code... sorry for being dumb.

@Nyan Paing Tin

The code uses KNN to classify the price trend into two classes (up or down).
It collects 1400 hourly observations into the KNN classifier and then uses it to predict the probability that the price will rise. If, say, the predicted probability of a rise is 0.1, it will use the Kelly formula to take a short position of the corresponding size.

If you don't know what KNN is, you can look at this link: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm