Pairs Trading with Natural Language Processing

This is the second post in a series on using Machine Learning in pairs trading (the first post is here). This post was motivated by the question:

Is it possible to find valid eligible pairs without using any price data at all?

This seems like an impossible task. After all, isn't a pairs trading strategy a price-driven strategy? In this post I suggest that it is indeed possible. Using a public dataset of text descriptions of companies, I train a Machine Learning model to read about companies and find company pairs with similar descriptions. A visual analysis of the companies discovered indicates that their prices move together. This post uses basic concepts from Natural Language Processing and scikit-learn classes, all of which are available in Quantopian Research. In addition to being directly applicable to finding eligible pairs, this post is also a general example of a process for reducing unstructured data to structured form.
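To make the idea of "find company pairs with similar descriptions" concrete, here is a minimal sketch of one common way to score text similarity: TF-IDF weighting followed by cosine similarity. It is not the model from the notebook, which is not reproduced in this post. It is written in plain Python rather than with the scikit-learn classes mentioned above, and the tickers and descriptions are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical tickers and made-up descriptions, for illustration only.
descriptions = {
    "AAA": "regional bank offering consumer loans and deposit accounts",
    "BBB": "commercial bank providing loans mortgages and deposit accounts",
    "CCC": "oil and gas exploration and production company",
    "DDD": "independent oil and gas producer with offshore exploration assets",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (a dict) per document."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many descriptions does each word appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        # Smoothed inverse document frequency: rare words get more weight.
        vec = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[name] = {word: w / norm for word, w in vec.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(word, 0.0) for word, w in u.items())


def rank_pairs(docs):
    """Score every pair of companies, most similar descriptions first."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(scored, reverse=True)


ranked = rank_pairs(descriptions)
for score, a, b in ranked:
    print(f"{a}-{b}: {score:.3f}")
```

On this toy input the two bank descriptions and the two oil and gas descriptions score highest, which is the kind of candidate list one would then check against price data.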



2 responses

Thanks Jonathan -

You seem to be having fun, and presumably getting paid for it--a nice combination (I've worked for free for Quantopian since 2012; maybe I should set up a GoFundMe account?). Some comments:

  • At a high level, your statement "Human analysts are good at this kind of task, but can a machine do as well if not better?" puts this work in the context of a long history of automating work that traditionally has been performed by skilled human labor, including so-called "knowledge workers" with advanced degrees and years of experience. Within the hedge fund industry (and finance in general) there are lots of functions, y = f(x), where f is performed by a human, to produce the output, y. For example, one can imagine that when pairs trading was "discovered" back in the day (the mid 1980s?), teams of analysts were formed, with the task of spitting out candidate pairs (using pencil and paper, mainframes, and desktop PCs). Perhaps part of the workflow was to read company profiles and lots of other documents, as part of the assessment. Even today, given the amount of money sloshing around, I'm guessing it is worth paying analysts to digest text and speech. I'm not in the field, and so I don't know what hedge fund analysts do, so one thing that would help would be to understand what a typical workflow would be, and highlight generally where automation could be applied.
  • There is a lot of text available on where one can access company filings. Presumably, some poor souls at hedge funds have to read all of it; maybe Quantopian could relieve them of their misery, by automating the process in some fashion.
  • Perhaps a topic for another thread, but generally I don't understand very well how to handle the time factor (and have a concern that it is often left out because it makes problems easier). Take pairs trading, for example. I'm imagining that there is a lot of transience--profitable pairs don't persist for long. Would there be any way to use the Quantopian data to get a sense for the time scale of profitability across pairs, going back to 2002? For example, at any point in time, there should be N profitable pairs persisting on various time scales, tau. It would be interesting to see a plot of N(t), and the histogram of tau versus time (or summary statistics of said histogram, e.g. tau_mean(t)). Data such as these might provide guidance on the feasibility of pairs trading on Quantopian, before getting ahead of ourselves.
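As a rough illustration of the N(t) and tau quantities described in the last bullet, the sketch below assumes that each pair has already been labeled, period by period, as profitable or not by some criterion. That labeling step, the pair names, and the flags themselves are all hypothetical. N(t) is then a column count, and each tau is the length of a run of consecutive profitable periods. For simplicity the mean here is pooled over all runs rather than computed as a function of time.

```python
from itertools import groupby

# Hypothetical flags: True means the pair met some profitability
# criterion in that period. Pairs and values are made up.
profitable = {
    ("AAA", "BBB"): [True, True, True, False, False, True],
    ("CCC", "DDD"): [False, True, True, False, True, True],
    ("EEE", "FFF"): [False, False, False, False, True, False],
}


def active_count(flags_by_pair):
    """N(t): number of pairs flagged profitable in each period."""
    return [sum(column) for column in zip(*flags_by_pair.values())]


def run_lengths(flags):
    """tau values: lengths of each run of consecutive True flags."""
    return [sum(1 for _ in group) for value, group in groupby(flags) if value]


n_t = active_count(profitable)
taus = [tau for flags in profitable.values() for tau in run_lengths(flags)]
tau_mean = sum(taus) / len(taus)

print("N(t):", n_t)
print("tau values:", taus)
print("mean tau:", tau_mean)
```

Runs that are still in progress at the end of the sample are cut short, so the measured tau values understate true persistence for those pairs.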

Hi Jonathan. Thank you very much for sharing your work! I think natural language processing could also be a useful application for investment bankers!

In investment banking, analysts often determine the price of the company they are trying to sell by looking at how much, on average, similar companies sold for in the past. The process of picking similar companies is often very subjective (similar size, industry, revenue growth, etc.). With the help of NLP and other types of clustering, do you think it's feasible to identify groups of similar companies in an industry both quantitatively and qualitatively?