This is the first post in a two-part series. For performance analysis of this alpha factor, see the Alphalens study.
Companies generally do not make major changes to their 10-K and 10-Q filings. When they do, the changes are predictive of significant underperformance in the next quarter. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.
Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.
When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.
These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; put in perspective, that's roughly one-fifth of the length of Moby-Dick. Beyond the sheer length, dense language and lots of boilerplate can further obfuscate true meaning for many investors.
The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy and Quoc Nguyen argue in their recent paper that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns. (For an overview of this paper from Lauren Cohen himself, see the Lazy Prices interview from QuantCon 2018.)
In this investigation, we attempt to replicate their results on Quantopian.
The notebook attached below details the construction of a textual changes dataset. We begin by scraping 10-K and 10-Q reports from the SEC EDGAR database; we then compute cosine and Jaccard similarity scores, and finally transform the data into a format suitable for Self-Serve Data.
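To give a feel for the two similarity measures before you open the notebook, here is a minimal sketch in plain Python. It is not the notebook's code: the function names (tokenize, cosine_similarity, jaccard_similarity), the simple regex tokenizer, and the use of raw word counts for the cosine score are all assumptions made for illustration. A score near 1 means the two filings are nearly identical; a lower score means more text has changed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens.

    Assumption: a simple regex tokenizer; the notebook may preprocess differently.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity: size of the shared vocabulary over the combined vocabulary."""
    set_a = set(tokenize(text_a))
    set_b = set(tokenize(text_b))
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for empty input
    return len(set_a & set_b) / len(union)


# Hypothetical excerpts standing in for the same section of two consecutive filings.
previous = "We face risks from pending litigation"
current = "We face risks from pending litigation and new regulation"

print(cosine_similarity(previous, current))
print(jaccard_similarity(previous, current))
```

Cosine similarity is sensitive to how often each word appears, while Jaccard similarity only looks at which words appear at all, so the two scores can rank the same pair of filings differently.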
This notebook is meant to be run locally, on your own machine. To do so, clone it into your own Quantopian account, then download it as a .ipynb file. You'll need to install specific packages first; detailed installation instructions can be found in step 0 of the notebook.
You'll probably see some strange HTML tags appearing above the cells displaying dataframes while viewing this notebook on the Quantopian platform. This is because the version of pandas used on the Quantopian platform (0.18.1) differs from the version of pandas used in this notebook (0.23.0, a more recent version). The tags should disappear when you run the notebook locally with the correct pandas version.
Done constructing the dataset? See the Alphalens study for performance analysis of the Lazy Prices alpha factor.