K-means Clustering Help

Hi guys,

I am completely new to Quantopian! Recently, I have been trying to see if I can group comparable companies together using k-means clustering. I decided to test it on the financial industry first. I used variables like enterprise_value, market_cap, sustainable growth rate, ROA, ROE and ROIC as factors to group different firms together. Then, within each cluster, I would long/short the top/bottom EV/EBITDA firms (i.e. the undervalued/overvalued firms).

The problem is that I had to convert the Pipeline DataFrame to an array in order to run k-means clustering from the sklearn library. Unfortunately, after I assigned a cluster label to each firm, I could not convert the array back into the original DataFrame with its security index and column titles.

I hope that you guys can help me figure out the next step.

Thank you very much,
Thanh Duong

7 responses

Hey Thanh,

If I understood your problem correctly, this can be solved very easily using the power of simple column assignment. I assigned a column titled 'Cluster' to your cluster array. See attached notebook for results.
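As a rough sketch of the column-assignment idea (not the contents of the attached notebook), the snippet below fits k-means on the feature columns and writes the labels straight back onto the DataFrame, so the index and column titles are never lost. The tickers and numbers are made-up stand-ins for the pipeline output, and `n_clusters=2` is chosen only to suit the toy data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the pipeline output: rows indexed by security.
result = pd.DataFrame(
    {
        "EV/EBITDA": [8.0, 12.5, 6.1, 15.2, 9.9, 7.3],
        "ROA": [0.010, 0.012, 0.090, 0.085, 0.011, 0.088],
        "ROE": [0.10, 0.11, 0.30, 0.28, 0.09, 0.31],
    },
    index=["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
)

# Cluster on every column except the valuation ratio.
features = result.drop(columns=["EV/EBITDA"])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features.values)

# labels_ is a 1-D array in the same row order as features.values,
# so plain column assignment lines it up with the original index.
result["Cluster"] = kmeans.labels_
print(result)
```

Because the labels come back in row order, no reshape or conversion back from the array is needed.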

I deleted some duplicated imports (without showing work), as well as put the old deprecated code into comments so you can see what I deleted in the main body of your notebook.

Hope this helps you.
Cheers.
Nick


Hi Nick, thank you so so much for your help!

Hi Karl, 50 is just a number that I chose arbitrarily. There are ways for Python to automatically find the optimal number of clusters, but I haven't looked into them yet!
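One common way to pick the number of clusters, offered here only as a sketch and not something used in this thread's notebooks, is to try several values of k and keep the one with the highest silhouette score. The data below is synthetic (three well-separated blobs), and the candidate range 2 to 6 is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs, purely for illustration.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])

# Score each candidate k; a higher silhouette means tighter,
# better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real fundamentals data the scores are likely to be much closer together than on this toy set, so the chosen k should be treated as a starting point.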

You could try hierarchical clustering, which doesn't require you to specify the number of clusters. For the purpose of your strategy, k-means probably works well though.

I see, Thanh. Yes, that is a good starting point to get it working, and Nick has shown in clear steps how the ['Cluster'] column is added to the result DataFrame.
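For reference, a minimal sketch of hierarchical clustering with scikit-learn's `AgglomerativeClustering`. Instead of a cluster count it takes a distance cut-off, so there is still one parameter to choose; the threshold of 5.0, the average linkage, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic data: three tight blobs, purely for illustration.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])

# n_clusters=None plus distance_threshold lets the algorithm stop merging
# once clusters are farther apart than the threshold.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
)
labels = model.fit_predict(X)
print(model.n_clusters_)
```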

I was trying to condense the steps into a single-line Python statement:

result['Cluster'] = np.array(KMeans(n_clusters=50).fit(result.values).labels_).reshape((-1, 1))  

Results are quite different from Nick's - I am sure I have missed/misplaced some parts - see last cell in attached Notebook.


Karl, thank you very much for your input. The result is different because k-means starts from random initial centroids, so each run can produce a different set of clusters even on the same data. I am not sure how to fix this, though; overall, I think the sets of clusters are pretty similar.

Luca, yes. I have heard of hierarchical clustering. Hopefully, I can use it one day. Thank you very much!

UPDATE: I have attached the completed algorithm. One concern I had is that whenever I use order_target_percent(stock, 1/len(context.groups)), where context.groups is the list of securities that I want to trade, the algorithm did not buy anything at all! So I had to use order_target_percent(stock, 0.02) instead, an estimate of what each security's weight should be. Nick, Luca, Karl, do you guys know what the problem is?
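One likely way to make the runs repeatable is to pass a fixed `random_state` to `KMeans`, which seeds the centroid initialization. The sketch below uses random stand-in data and arbitrary settings (10 clusters, seed 0). The one-liner earlier in the thread also appears to fit on `result.values` without removing the EV/EBITDA column, which may be a second source of difference.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the feature matrix: 200 firms, 5 factors.
rng = np.random.RandomState(42)
X = rng.rand(200, 5)

# The same seed gives the same initial centroids, hence the same labels.
labels_a = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).labels_
labels_b = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).labels_

print(np.array_equal(labels_a, labels_b))
```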

Thank you all very much!

from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage
from quantopian.pipeline.classifiers.fundamentals import Sector 
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.filters import Q1500US, Q500US
import pandas as pd
import numpy as np
import random as random
from itertools import combinations
from sklearn.cluster import KMeans

def initialize(context):
    context.long_leverage = 1
    # Rebalance on the first trading day of each month, 1.5 hours after the open (11AM ET).
    schedule_function(rebalance,
                      date_rules.month_start(days_offset=0),
                      time_rules.market_open(hours=1, minutes=30))

    # Create and attach our pipeline (dynamic stock selector), defined below.
    attach_pipeline(make_pipeline(context), 'kmeans')


def make_pipeline(context):
  sector_filter = Sector()
  financial_sector_filter = sector_filter.eq(103)

  #universe = Q500US()
    
  market_cap = Fundamentals.market_cap.latest
  
  enterprise_value = Fundamentals.enterprise_value.latest
  
  sustain_growth = Fundamentals.sustainable_growth_rate.latest
  
  ROA = Fundamentals.roa.latest
  
  ROE = Fundamentals.roe.latest

  ROIC = Fundamentals.roic.latest

  EV_EBITDA = Fundamentals.ev_to_ebitda.latest
  
  industry = Fundamentals.morningstar_industry_code.latest 
  
  result = Pipeline(
      columns={
          'EV/EBITDA': EV_EBITDA,
          'industry': industry,
          'enterprise value': enterprise_value,
          'market_cap': market_cap,
          'sustain growth': sustain_growth,
          'ROA' : ROA,
          'ROE' : ROE,
          'ROIC' : ROIC
    }, screen = financial_sector_filter #universe
      
  )
  return result

def before_trading_start(context, data):
  context.output = pipeline_output('kmeans')
  result = context.output.dropna(axis=0)
  result_array = result.values #switch Data Frame to array to use k-means library
  result_array = np.delete(result_array,0,1)#take EV/EBITDA out of the k-means clustering process
  kmeans = KMeans(n_clusters=50).fit(result_array) #fit into 50 clusters
  cluster_label = kmeans.labels_ #each cluster now has an ID, ranging from 0-49
  cluster = np.array(cluster_label)
  cluster = cluster.reshape((-1, 1))
  result['Cluster'] = cluster #attach cluster ID to result Pipeline
  result = result.sort_values(by=['EV/EBITDA'])

  context.groups = []  # build the list of desired stocks, cluster by cluster
  for x in range(50):
    group = result[result['Cluster']==x]
    if 3 < len(group) < 15: # skip clusters of three or fewer (the old `>3 & <15` test misparsed, since & binds tighter than > and <)
        group_top = group[group['EV/EBITDA'] < group['EV/EBITDA'].quantile(0.25)] # firms below the cluster's 25th-percentile EV/EBITDA
        a = group_top.index.tolist()
        context.groups.append(a)
    elif len(group)>=15: # for clusters of 15 or more, keep only firms below the 10th-percentile EV/EBITDA
        group_top = group[group['EV/EBITDA'] < group['EV/EBITDA'].quantile(0.1)]
        a = group_top.index.tolist()
        context.groups.append(a)
  context.groups = [val for sublist in context.groups for val in sublist] # flatten the per-cluster lists into one list of stocks

def rebalance(context,data):
    for stock in context.groups:
        if data.can_trade(stock):
           order_target_percent(stock, 0.02)
    for stock in context.portfolio.positions:
        if stock not in context.groups and data.can_trade(stock):
            order_target_percent(stock, 0)

Some of the comments in the source code were wrong: the code trades the bottom 25% and 10% of EV/EBITDA within each cluster, not the top.
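To make the top/bottom distinction concrete, here is a small sketch that picks both ends of the EV/EBITDA range within each cluster using `groupby`. The tickers, values, and the helper names `cheapest_in_cluster` and `richest_in_cluster` are hypothetical, and the 25%/75% cut-offs are only one possible choice.

```python
import pandas as pd

# Hypothetical clustered output: two clusters of four firms each.
result = pd.DataFrame(
    {
        "EV/EBITDA": [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0],
        "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    },
    index=["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH"],
)

def cheapest_in_cluster(group, q=0.25):
    """Rows whose EV/EBITDA is below the group's q-quantile (lowest multiples)."""
    return group[group["EV/EBITDA"] < group["EV/EBITDA"].quantile(q)]

def richest_in_cluster(group, q=0.75):
    """Rows whose EV/EBITDA is above the group's q-quantile (highest multiples)."""
    return group[group["EV/EBITDA"] > group["EV/EBITDA"].quantile(q)]

longs, shorts = [], []
for _, group in result.groupby("Cluster"):
    longs.extend(cheapest_in_cluster(group).index.tolist())
    shorts.extend(richest_in_cluster(group).index.tolist())

print(longs, shorts)
```

Using `<` with a low quantile selects the lowest multiples, which is what the algorithm above does; the highest multiples need `>` with a high quantile.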

whenever I use order_target_percent(stock, 1/len(context.groups)), ... the algorithm did not buy anything at all

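This is a Python 2 thing: dividing one integer by another returns an integer, so 1/len(context.groups) evaluates to 0 whenever the list holds more than one stock. This print will make it clear: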

           print 1/len(context.groups)  
           order_target_percent(stock, 1/len(context.groups)) #0.02)  

Use this instead and the ordering happens:

1.0/len(context.groups)

Sometimes you'll see that written without the 0, as just 1. with a trailing dot; either form makes the result a floating-point number.
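The difference can be shown in a few lines. This sketch is written to run on Python 3, where `//` reproduces what Python 2's `/` does with two integers; the 50-stock list is a made-up stand-in for context.groups.

```python
# Hypothetical stand-in for context.groups: 50 securities.
groups = ["stock_%d" % i for i in range(50)]

# Floor division: what Python 2 computes for 1/len(groups) with two ints.
floor_weight = 1 // len(groups)

# A float numerator forces true division on both Python 2 and Python 3.
true_weight = 1.0 / len(groups)

print(floor_weight, true_weight)
```

On Python 2, adding `from __future__ import division` at the top of the file is another way to get true division from `/`.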