Fastest k-means clustering method?

Hey all,
So I've got a data set of ~16,000 intervals of 180-minute prices and their price changes, on which I'd like to run k-means clustering. Performing this sequentially through a for loop times out before it even reaches 1,000 intervals. Any thoughts on how to make 16,000 work?

4 responses

Hey Chris,

K-means can indeed be slow. I think there are some clever things you can do with distance heuristics that avoid recomputing distances (https://arxiv.org/abs/1108.1351). K-modes may be faster because it avoids the extra distance computations needed whenever a new synthetic mean is picked; instead, you pick existing observations as centroids. https://shapeofdata.wordpress.com/2014/03/04/k-modes/
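For what it's worth, scikit-learn already ships two speedups along these lines: `KMeans(algorithm="elkan")` uses triangle-inequality bounds to skip redundant distance computations, and `MiniBatchKMeans` updates centroids from small random batches. A rough sketch on synthetic data (the 2,000 × 360 array below is just a stand-in for flattened price/price-change intervals):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Synthetic stand-in: 2,000 intervals, each flattened to 360 values
# (180 prices + 180 price changes). The real data set would be ~16,000 rows.
rng = np.random.RandomState(0)
X = rng.randn(2000, 360)

# Elkan's algorithm uses the triangle inequality to skip
# distance computations that cannot change an assignment.
km = KMeans(n_clusters=8, algorithm="elkan", n_init=5, random_state=0).fit(X)

# MiniBatchKMeans trades a little accuracy for a large speedup
# by updating centroids from small random subsamples.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3,
                      random_state=0).fit(X)

print(km.inertia_, mbk.inertia_)
```

Both return the usual `labels_` and `cluster_centers_`; on data this size, the mini-batch variant is usually the bigger win.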

When I worked in a computational genetics lab we had to deal with complexity a lot, as genetics datasets tend to have a high number of features and a lower number of observations. One approach that worked there was to filter down to a smaller number of features first, and then run a computationally expensive method like k-means clustering, or even a brute-force search. You might try the relief methods; they're quite fast. https://en.wikipedia.org/wiki/Relief_(feature_selection)
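Relief implementations live in third-party packages (e.g. the `skrebate` project), but the filter-then-cluster idea itself can be sketched with scikit-learn alone. Here a simple variance filter stands in for the feature-selection step (the data and threshold are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(1000, 360)
X[:, :300] *= 0.01  # make most features nearly constant (little signal)

# Step 1: drop low-variance features.
# Step 2: cluster in the reduced feature space.
pipe = make_pipeline(
    VarianceThreshold(threshold=0.5),
    MiniBatchKMeans(n_clusters=5, n_init=3, random_state=0),
)
labels = pipe.fit_predict(X)
print(labels.shape)  # (1000,)
```

Swapping the variance filter for a supervised selector (or a relief scorer) keeps the same pipeline shape while changing how "useful feature" is defined.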


It looks like feature selection is exactly what I was looking for; I'll see how I can implement it with sklearn. Thanks for the tips! I didn't know there were k-modes, k-medians, k-prototypes, and other k-* algorithms out there.

Using sklearn's SelectKBest, I'd like to pick the top 20 predictors in my data set, but I'm not quite sure how to determine what makes a predictor effective. I'm using a list of lists (in the form [[price1, price change1], [price2, price change2], ... up to [price180, price change180]]), with each 180-element interval being a predictor.

Should I use f_regression (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression) to select, or mutual_info_regression, or possibly even f_classif? And for whichever function I use, what other arrays do I need to create to fill in its arguments?

My apologies for the novice questions, I'm an undergrad college student who's trying to learn quant finance on his own, and a lot of this material takes some time for me to wrap my head around.

You want to select features by their ability to predict an outcome. Usually the outcome is forward returns, so you'd use forward returns as the target variable in any analysis. Take a look at how alphalens works:

http://quantopian.github.io/alphalens/
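Concretely, SelectKBest needs two arrays: `X`, one row per interval with the flattened features as columns, and `y`, the outcome for each interval. A hedged sketch on synthetic data (the 360 features and the forward-return target are made up to mirror the setup described above; f_regression fits a continuous target, while f_classif would be for class labels):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)

# X: one row per interval; columns are the flattened
# [price, price change] values (360 hypothetical features here).
X = rng.randn(500, 360)

# y: the outcome to predict, e.g. each interval's forward return.
# Made dependent on feature 0 so the selector has something to find.
y = 2.0 * X[:, 0] + 0.1 * rng.randn(500)

# Score every feature against y and keep the 20 highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=20)
X_top20 = selector.fit_transform(X, y)
print(X_top20.shape)  # (500, 20)
```

`selector.get_support()` then tells you which original columns survived, which is useful for mapping the winners back to specific prices or price changes.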

In general, be wary of trying to predict prices from historical prices. It's very easy to overfit, and you'll often get models that don't really mean anything. Most of the predictive power in prices can be explained by momentum or mean reversion. What's usually more interesting is to select features from alternative datasets like those found here.