Scraping 10-Ks and 10-Qs for Alpha

This is the first post in a two-part series. For performance analysis of this alpha factor, see the Alphalens study.

Thesis

Companies generally do not make major changes to their 10-K and 10-Q filings. When they do, it is predictive of significant underperformance in the next quarter. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.

Background

Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.

When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.

These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; put in perspective, that's roughly one-fifth of the length of Moby-Dick. Beyond the sheer length, dense language and lots of boilerplate can further obfuscate true meaning for many investors.

The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy and Quoc Nguyen argue in their recent paper that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns. (For an overview of this paper from Lauren Cohen himself, see the Lazy Prices interview from QuantCon 2018.)

In this investigation, we attempt to replicate their results on Quantopian.

The notebook attached below details the construction of a textual changes dataset. We begin by scraping 10-K and 10-Q reports from the SEC EDGAR database; we then compute cosine and Jaccard similarity scores, and finally transform the data into a format suitable for Self-Serve Data.
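
To make the two similarity scores concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the `tokenize` helper and the choice of lowercase alphanumeric word tokens are assumptions made for illustration, and the notebook may preprocess filings differently. Cosine similarity compares word-count vectors, while Jaccard similarity compares the sets of distinct words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    # Cosine of the angle between the two word-count vectors.
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a, text_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    # Made-up snippets standing in for the same section of two consecutive filings.
    previous = "We face risks related to pending litigation."
    current = "We face significant risks related to new pending litigation."
    print(cosine_similarity(previous, current))
    print(jaccard_similarity(previous, current))
```

A score near 1 means the two filings are nearly identical in wording; lower scores indicate larger text changes, which is the signal the thesis above relies on.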

Notes

  1. This notebook is meant to be run locally (on your own machine). In order to run this notebook, you should clone it into your own Quantopian account, then download it as a .ipynb file. You'll need to install specific packages to run this notebook; detailed installation instructions can be found in step 0 of the notebook.

  2. You'll probably see some strange HTML tags appearing above the cells displaying dataframes while viewing this notebook on the Quantopian platform. This is because the version of pandas used on the Quantopian platform (0.18.1) differs from the version of pandas used in this notebook (0.23.0, a more recent version). The tags should disappear when you run the notebook locally with the correct pandas version.

Done constructing the dataset? See the Alphalens study for performance analysis of the Lazy Prices alpha factor.


3 responses

UpUp, inspirational article

Another AWESOME post and notebook! Thanks for putting this together!

Finally I'm scraping, wohoo!! :)

Just a quick note to anyone else struggling to get the correct path names to the 10Ks and 10Qs (I created these directories manually). The brackets < > should not be included (maybe that was obvious to everyone else but me? :)). So for me on my Mac (HackerOne, take note):

pathname_10k = '/Users/joakim/Quantopian/10Ks'  
pathname_10q = '/Users/joakim/Quantopian/10Qs'  

@Lucy,

I hope you'll consider creating a video tutorial for either this one, or the Political Contribution one (or both ideally :)), for Python-/Data-formatting-/Beautiful-soup- challenged people like myself who require additional hand-holding... :)

There is a master index for filings: https://www.sec.gov/Archives/edgar/full-index/
This will probably save you some scraping time.
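
To illustrate how the master index mentioned above might be used, here is a rough sketch. It assumes the quarterly `master.idx` files under that directory are pipe-delimited with the columns CIK, Company Name, Form Type, Date Filed and Filename. The function names are hypothetical and the sample rows are made up for illustration; downloading the file is left out.

```python
EDGAR_ARCHIVES = "https://www.sec.gov/Archives/"


def master_index_url(year, quarter):
    # Assumed layout: full-index/<year>/QTR<quarter>/master.idx
    return "{}edgar/full-index/{}/QTR{}/master.idx".format(EDGAR_ARCHIVES, year, quarter)


def parse_master_index(text, form_types=("10-K", "10-Q")):
    # Keep only rows with five pipe-separated fields, a numeric CIK,
    # and a form type we care about; header and separator lines are skipped.
    filings = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 5:
            continue
        cik, company, form_type, date_filed, filename = parts
        if not cik.isdigit() or form_type not in form_types:
            continue
        filings.append({
            "cik": cik,
            "company": company,
            "form_type": form_type,
            "date_filed": date_filed,
            "url": EDGAR_ARCHIVES + filename,
        })
    return filings


# Made-up sample in the assumed format (not real filings).
SAMPLE = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1234567|EXAMPLE CORP|10-K|2018-02-01|edgar/data/1234567/0001234567-18-000001.txt
1234567|EXAMPLE CORP|8-K|2018-02-15|edgar/data/1234567/0001234567-18-000002.txt
7654321|SAMPLE INC|10-Q|2018-03-10|edgar/data/7654321/0007654321-18-000003.txt
"""

if __name__ == "__main__":
    for filing in parse_master_index(SAMPLE):
        print(filing["form_type"], filing["date_filed"], filing["url"])
```

Filtering the index first means only the 10-K and 10-Q documents need to be requested, rather than crawling each company's filing pages.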