Segaran's Non-Negative Matrix Factorization Implementation

Hi all,

Thought I'd share an implementation of non-negative matrix factorization (NNMF)... that's a mouthful, so perhaps an explanation is due.

First and foremost, credit for the original algorithm design in the source code goes to Toby Segaran, author of Programming Collective Intelligence, an excellent book on machine learning. My main contribution is really just the implementation on this platform (which, by the way, makes it much easier to test and experiment with Segaran's example).

Motivation

We are looking to identify dates on which some event appears to have driven up trading volume across a group of stocks. It is important to differentiate this from correlation. Whereas computing the correlation between two time series of volume would "average out" periods of high co-movement and periods of low co-movement, this algorithm searches specifically for the periods where the trading volumes moved together. While it is not a trading strategy in its own right, it would be a good base for research toward one (see the toy example below).

For instance, say you found a series of dates on which Apple's stock experienced high-volume trading, and on those same dates, so did Microsoft's. Yet Microsoft's and Apple's volumes don't always move together: often Apple will have a spike in volume that Microsoft doesn't experience, and vice versa. So what's happening on the days where they do move together? Perhaps Apple's trading volume spikes on days its board makes announcements, but only certain kinds of announcements affect Microsoft's trading. With some research (i.e., find out what announcements were made on the dates the algorithm spits out), this algorithm is one way to identify which sorts of announcements affect both stocks and which don't.
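To make the correlation-versus-co-movement distinction concrete, here's a tiny synthetic illustration (entirely my own toy numbers, not part of Segaran's code): two volume series that share spikes on a few specific dates can still show only moderate overall correlation, while the shared-spike dates themselves are easy to pick out once you ask for them directly.

import numpy as np

np.random.seed(1)
days = 250
a = np.random.rand(days)                 # baseline volume, stock A
b = np.random.rand(days)                 # baseline volume, stock B
shared_event_days = [20, 110, 200]       # dates where one event hits both
a[shared_event_days] += 5
b[shared_event_days] += 5
a[[50, 60]] += 5                         # A-only spikes
b[[90, 170]] += 5                        # B-only spikes

# Overall correlation is diluted by all the days the two don't move together...
print(np.corrcoef(a, b)[0, 1])
# ...but the shared-spike dates stand out when asked for directly:
# the days where *both* series are elevated at once.
print(np.argsort(np.minimum(a, b))[-3:])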

I think you are best served experimenting with this algorithm after reading Chapter 10 of Segaran's Programming Collective Intelligence - I believe the book is now available for free as a PDF, and it is an excellent primer on machine learning.
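For those who want the gist before getting to the chapter: NNMF approximates a non-negative data matrix V as a product of two smaller non-negative matrices, V ≈ W·H, where each row of H is a "feature" (a pattern of weights across stocks) and each row of W says how strongly a given day expresses each feature. Here's a minimal self-contained sketch on toy data of my own (the update rule is the same one used in the factorize function below):

import numpy as np

# Minimal NNMF via multiplicative updates (the same rule as factorize() below).
def nnmf(v, pc=2, iters=200):
    w = np.random.rand(v.shape[0], pc)
    h = np.random.rand(pc, v.shape[1])
    for _ in range(iters):
        h *= np.dot(w.T, v) / (np.dot(np.dot(w.T, w), h) + 1e-9)
        w *= np.dot(v, h.T) / (np.dot(np.dot(w, h), h.T) + 1e-9)
    return w, h

# Two hidden "events": one drives columns 0 and 1, the other columns 2 and 3.
np.random.seed(0)
basis = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
activations = np.random.rand(30, 2)   # 30 "days", 2 events
v = np.dot(activations, basis)        # observed 30x4 volume-like matrix

w, h = nnmf(v, pc=2)
print(np.round(h, 2))                  # rows of h should recover the two column groups
print(np.abs(v - np.dot(w, h)).max())  # reconstruction error should be near zero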

# Put any initialization logic here.  The context object will be passed to
# the other methods in your algorithm.
import numpy as np
import random
import datetime


def difcost(a,b):
  # Squared Euclidean (Frobenius) distance between matrices a and b.
  dif=0
  for i in range(np.shape(a)[0]):
    for j in range(np.shape(a)[1]):
      dif+=pow(a[i,j]-b[i,j],2)
  return dif

def factorize(v,pc=10,iter=50):
  ic=np.shape(v)[0]  # rows: observations (days)
  fc=np.shape(v)[1]  # columns: stocks

  # Initialize the weight and feature matrices with random values
  w=np.matrix([[random.random() for j in range(pc)] for i in range(ic)])
  h=np.matrix([[random.random() for j in range(fc)] for i in range(pc)])

  # Perform the operation a maximum of iter times
  for i in range(iter):
    wh=w*h

    # Calculate the current difference
    cost=difcost(v,wh)

    #if i%10==0: print cost

    # Terminate if the matrix has been fully factorized
    if cost==0: break

    # Update feature matrix (element-wise multiplicative update)
    hn=(np.transpose(w)*v)
    hd=(np.transpose(w)*w*h)

    h=np.matrix(np.array(h)*np.array(hn)/np.array(hd))

    # Update weights matrix
    wn=(v*np.transpose(h))
    wd=(w*h*np.transpose(h))

    w=np.matrix(np.array(w)*np.array(wn)/np.array(wd))

  return w,h


@batch_transform(refresh_period=5, window_length=30)
def get_volume_history(datapanel,sids):
    volumes = datapanel['volume']

    # 30 x len(sids) matrix of trading volumes; row 0 is the most recent bar.
    volume_matrix = [[volumes[sid][-i] for sid in sids] for i in range(1,31)]
    return volume_matrix

def initialize(context):
    context.stocks = [sid(26578),sid(14848),sid(8347),sid(23112),sid(16841),sid(24),sid(5061)]
    context.dates = []


# Will be called on every trade event for the securities you specify.
def handle_data(context, data):
    # Keep track of every date we run through (sid 24's timestamp is used).
    context.dates.append(data[24].datetime)

    # Get the volume matrix (None until the 30-bar window has filled).
    vols = get_volume_history(data,context.stocks)

    if vols is not None:
        w,h = factorize(np.matrix(vols),pc=5)

        # Most recent date first, matching the rows of the volume matrix.
        # Copy with [::-1] instead of reversing context.dates in place,
        # which would flip the list again on every pass through the loop.
        dates = context.dates[::-1]

        # Display the data
        for i in range(np.shape(h)[0]):
            print "Feature %d" %i

            # Get the top stocks for this feature
            ol=[(h[i,j],context.stocks[j]) for j in range(np.shape(h)[1])]
            ol.sort()
            ol.reverse()
            for j in range(len(context.stocks)):
                print ol[j]
            print "-"

            # Get the dates on which this feature is strongest
            porder=[(w[d,i],d) for d in range(30)]
            porder.sort()
            porder.reverse()
            print [(p[0],dates[p[1]]) for p in porder[0:3]]
            print "--"
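A quick note for anyone puzzling over the hn/hd and wn/wd lines: those are the standard Lee-Seung multiplicative update rules for NNMF under squared reconstruction error, applied element-wise:

H \leftarrow H \odot \frac{W^{\top} V}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{V H^{\top}}{W H H^{\top}}

where the multiplication and division are entry-by-entry. Each update is guaranteed not to increase the reconstruction cost, and because every entry stays non-negative by construction, no explicit projection step is needed.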
3 responses

That sounds fascinating. I definitely need to do the reading you suggest. It looks like the O'Reilly version is here, and the free download is here.


Very interesting. I took the liberty of pythonizing the code a little to improve readability.

I think matrix factorization methods have a lot of potential in algorithmic trading. One interesting blog post I found uses Robust PCA to do automatic anomaly detection.
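For anyone curious about that direction, here's a minimal sketch of the usual Robust PCA formulation (principal component pursuit: split a matrix into a low-rank part plus a sparse "anomaly" part, via an augmented-Lagrangian loop). This is my own illustration with the standard parameter heuristics, not necessarily what the linked post does:

import numpy as np

def shrink(x, tau):
    # Soft-thresholding: the proximal operator of the L1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def robust_pca(M, max_iter=500, tol=1e-7):
    """Split M into low-rank L plus sparse S (principal component pursuit)."""
    M = np.asarray(M, dtype=float)
    rows, cols = M.shape
    lam = 1.0 / np.sqrt(max(rows, cols))        # standard sparsity weight
    mu = rows * cols / (4.0 * np.abs(M).sum())  # common step-size heuristic
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = np.dot(U * np.maximum(sig - 1.0 / mu, 0.0), Vt)
        # Sparse update: entrywise soft-thresholding
        S = shrink(M - L + Y / mu, lam / mu)
        # Dual update on the constraint M = L + S
        Y = Y + mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * max(np.linalg.norm(M), 1.0):
            break
    return L, S

# On a days-by-stocks volume matrix, large entries of S would flag
# anomalous (day, stock) cells that the low-rank background can't explain.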

In regard to your algorithm, can you provide an intuition for what the factor loadings mean? Also, do you have more ideas on how to find the events that trigger co-movement in the volume?

# Put any initialization logic here.  The context object will be passed to
# the other methods in your algorithm.
import numpy as np
import random
import datetime

def difcost(a,b):
  # Squared Euclidean (Frobenius) distance between matrices a and b.
  dif = 0
  for i in range(a.shape[0]):
    for j in range(a.shape[1]):
      dif+=pow(a[i,j]-b[i,j],2)
  return dif


def factorize(v, pc=10, iter=50):
  ic = v.shape[0]
  fc = v.shape[1]

  # Initialize the weight and feature matrices with random values
  w = np.asmatrix(np.random.rand(ic, pc))
  h = np.asmatrix(np.random.rand(pc, fc))

  # Perform the operation a maximum of iter times
  for i in range(iter):
    wh = w * h

    # Calculate the current difference
    cost = difcost(v, wh)

    # Terminate if the matrix has been fully factorized
    if cost == 0: break

    # Update feature matrix (element-wise multiplicative update)
    hn = w.T * v
    hd = w.T * w * h

    h = np.divide(np.multiply(h, hn), hd)

    # Update weights matrix
    wn = v * h.T
    wd = w * h * h.T

    w = np.divide(np.multiply(w, wn), wd)

  return w,h


@batch_transform(refresh_period=5, window_length=30)
def get_volume_history(datapanel,sids):
    # Panel rows are in chronological order (oldest bar first),
    # unlike the most-recent-first matrix in the original post.
    volume_matrix = np.asmatrix(datapanel['volume'].values)
    w, h = factorize(volume_matrix, pc=5)
    return w, h

def initialize(context):
    context.stocks = [sid(26578),sid(14848),sid(8347),sid(23112),sid(16841),sid(24),sid(5061)]
    context.dates = []

# Will be called on every trade event for the securities you specify.
def handle_data(context, data):
    # Keep track of every date we run through (sid 24's timestamp is used).
    context.dates.append(data[24].datetime)

    # Get the factorization of the volume matrix (None until the window fills).
    factors = get_volume_history(data, context.stocks)

    if factors is not None:
        w, h = factors # unpack

        # The last 30 dates, oldest first, matching the rows of w.
        # (Sliced as a copy rather than reversing context.dates in place.)
        dates = context.dates[-30:]

        # Display the data
        for i in range(np.shape(h)[0]):
            print "Feature %d" %i

            # Get the top stocks for this feature
            ol = [(h[i,j], context.stocks[j]) for j in range(np.shape(h)[1])]
            ol.sort()
            ol.reverse()
            for j in range(len(context.stocks)):
                print ol[j]
            print "-"

            # Get the dates on which this feature is strongest
            # (all 30 rows of w, not 29).
            porder = [(w[d,i], d) for d in range(30)]
            porder.sort()
            porder.reverse()
            print [(p[0],dates[p[1]]) for p in porder[0:3]]
            print "--"

Interesting article and thanks for cleaning up the code! Much more readable.

To answer your questions... I think your first question is quite difficult. Because I'm choosing an arbitrary number of factors, and because some of the features appear to affect only one stock, it's not entirely clear what a "heavily weighted feature" actually means. I would compare the "features" to the output of a similar unsupervised method like k-means clustering.

I would venture to say that features that put heavy weight on only one stock are essentially meaningless. In a sense, these factors are like "residuals": any sudden anomaly in a stock's trading volume that bears no similarity to any other stock can be conveniently dumped into such a "feature". The problem is that there's no reason to believe these anomalies share the same underlying cause. Far more interesting are the features that appear to affect multiple stocks at once - particularly when a feature marks high trading volume across some (but not all) stocks in the same industry.
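One rough way to screen those one-stock "residual" features out automatically - a sketch under my own assumptions (the 0.6 cutoff is arbitrary, and h is the feature matrix returned by factorize above):

import numpy as np

def feature_concentration(h):
    """Fraction of each feature's total loading carried by its single
    heaviest stock; values near 1.0 suggest a one-stock 'residual' feature."""
    h = np.asarray(h)
    totals = h.sum(axis=1, keepdims=True)
    return (h.max(axis=1, keepdims=True) / (totals + 1e-12)).ravel()

# Example: keep only features whose weight is spread across several stocks.
# conc = feature_concentration(h)
# shared_features = [i for i, c in enumerate(conc) if c < 0.6]  # 0.6 is arbitrary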

In terms of finding events that trigger co-movement in volume, I think there are many ways to approach this. Personally, I've done some work with data scraping - news and social media data in particular. I find that approach to be a good starting point because it is simple and intuitively makes sense. This algorithm spits out the dates on which particular features are strongest and the stocks to which a particular feature applies heavily. Assuming we choose a feature that applies heavily to at least two stocks, I would suggest that the best way to determine what (if anything) that feature "means" (and hence how to find it in the future) is to scrape news releases from, say, the day before, the day of, and the day after. Next, we look for similarities between those news releases. We also look for differences between them and news releases from other high-volume dates (dates with high volume in one of the stocks the feature applies to, but that the algorithm presumably didn't flag).
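As a sketch of that first step (the date-window logic is the only real content here; fetch_news and flagged_stocks are hypothetical placeholders):

import datetime

def event_windows(top_dates, days_before=1, days_after=1):
    """Expand each flagged date into a [day-before, day-after] window of
    dates whose news coverage we'd want to pull and compare."""
    windows = []
    for d in top_dates:
        day = d.date() if isinstance(d, datetime.datetime) else d
        window = [day + datetime.timedelta(days=k)
                  for k in range(-days_before, days_after + 1)]
        windows.append(window)
    return windows

# top_dates would come from the porder printout above, e.g. the three
# strongest dates for one feature; fetch_news below is hypothetical.
# for window in event_windows(top_dates):
#     articles = [fetch_news(sym, day) for sym in flagged_stocks for day in window]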

Of course, it's very likely we won't be able to distinguish anything just by looking manually (or, even more likely, there will be far more information than can be manually processed) - this is where I've used natural language processing in the past.
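As one simple NLP starting point, a hedged sketch using scikit-learn's TF-IDF vectorizer, with the document lists standing in for scraped news text:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def news_similarity(event_texts, background_texts):
    """TF-IDF cosine similarity of news from flagged dates against news
    from ordinary high-volume dates. Content that event_texts share with
    each other but not with the background is a candidate for what the
    feature 'means'."""
    vec = TfidfVectorizer(stop_words='english')
    m = vec.fit_transform(event_texts + background_texts)
    n = len(event_texts)
    within = cosine_similarity(m[:n], m[:n])   # event vs. event
    across = cosine_similarity(m[:n], m[n:])   # event vs. background
    return within, across

# within.mean() being much larger than across.mean() would suggest the
# flagged dates really do share news content that ordinary dates lack.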