Scraping 10-Ks and 10-Qs for Alpha

This is the first post in a two-part series. For performance analysis of this alpha factor, see the Alphalens study.

Thesis

Companies generally do not make major changes to their 10-K and 10-Q filings. When they do, it is predictive of significant underperformance in the next quarter. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.

Background

Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.

When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.

These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; put in perspective, that's roughly one-fifth of the length of Moby-Dick. Beyond the sheer length, dense language and lots of boilerplate can further obfuscate true meaning for many investors.

The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy and Quoc Nguyen argue in their recent paper that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns. (For an overview of this paper from Lauren Cohen himself, see the Lazy Prices interview from QuantCon 2018.)

In this investigation, we attempt to replicate their results on Quantopian.

The notebook attached below details the construction of a textual changes dataset. We begin by scraping 10-K and 10-Q reports from the SEC EDGAR database; we then compute cosine and Jaccard similarity scores, and finally transform the data into a format suitable for Self-Serve Data.
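The two similarity measures are simple to sketch. Assuming each filing has already been reduced to a list of words (the notebook itself uses scikit-learn's cosine_similarity; this is a dependency-free equivalent for illustration):

```python
from collections import Counter

def jaccard_similarity(words_a, words_b):
    """Jaccard similarity: |intersection| / |union| of the two word sets."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity_counts(words_a, words_b):
    """Cosine similarity between the word-count vectors of the two documents."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    vocab = set(counts_a) | set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in vocab)
    norm_a = sum(c * c for c in counts_a.values()) ** 0.5
    norm_b = sum(c * c for c in counts_b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A filing pair whose text barely changed scores near 1 on both measures; a heavily rewritten filing scores lower, which is the signal the Lazy Prices factor sorts on.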

Notes

  1. This notebook is meant to be run locally (on your own machine). In order to run this notebook, you should clone it into your own Quantopian account, then download it as a .ipynb file. You'll need to install specific packages to run this notebook; detailed installation instructions can be found in step 0 of the notebook.

  2. You'll probably see some strange HTML tags appearing above the cells displaying dataframes while viewing this notebook on the Quantopian platform. This is because the version of pandas used on the Quantopian platform (0.18.1) differs from the version of pandas used in this notebook (0.23.0, a more recent version). The tags should disappear when you run the notebook locally with the correct pandas version.

Done constructing the dataset? See the Alphalens study for performance analysis of the Lazy Prices alpha factor.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

10 responses

UpUp, inspirational article

Another AWESOME post and notebook! Thanks for putting this together!

Finally I'm scraping, woohoo!! :)

Just a quick note to anyone else struggling to get the correct path names to the 10Ks and 10Qs (I created these directories manually). The brackets < > should not be included (maybe that was obvious to everyone else but me? :)). So for me on my Mac (HackerOne, take note):

pathname_10k = '/Users/joakim/Quantopian/10Ks'  
pathname_10q = '/Users/joakim/Quantopian/10Qs'  

@Lucy,

I hope you'll consider creating a video tutorial for either this one, or the Political Contribution one (or both ideally :)), for Python-/Data-formatting-/Beautiful-soup- challenged people like myself who require additional hand-holding... :)

There is a master index for filings at https://www.sec.gov/Archives/edgar/full-index/
This could save some of your scraping time.
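As a sketch of how that index can be used: EDGAR publishes a pipe-delimited master.idx per quarter with rows of the form CIK|Company Name|Form Type|Date Filed|Filename. The helper names below are hypothetical, and the SEC asks for a descriptive User-Agent on requests:

```python
import urllib.request

def parse_master_index(text, form_type="10-K"):
    """Parse EDGAR's pipe-delimited master.idx text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        parts = line.split("|")
        # Header and separator lines don't have five fields, so they're skipped
        if len(parts) == 5 and parts[2] == form_type:
            cik, name, form, date, path = parts
            entries.append({"cik": cik, "name": name, "date": date, "path": path})
    return entries

def fetch_master_index(year, quarter):
    """Download one quarter's master index from EDGAR's full-index tree."""
    url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/master.idx"
    req = urllib.request.Request(url, headers={"User-Agent": "name your@email.example"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("latin-1")
```

Filtering the quarterly index down to 10-K/10-Q rows up front avoids crawling each company's browse page individually.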

Hi,

I got this error:

0%| | 0/4785 [00:00<?, ?it/s]

Already scraped CIK 0001090872
Scraping CIK 0001675149


---------------------------------------------------------------------------
FeatureNotFound                           Traceback (most recent call last)
<ipython-input-...> in <module>()
     22                   doc_url_base=doc_url_base_10k,
     23                   cik=cik,
---> 24                   log_file_name=log_file_name)

<ipython-input-...> in Scrape10K(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name)
     51
     52         # Parse the response HTML using BeautifulSoup
---> 53         soup = bs.BeautifulSoup(res.text, "lxml")
     54
     55         # Extract all tables from the response

/anaconda3/envs/py36/lib/python3.6/site-packages/bs4/__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
    196                     "Couldn't find a tree builder with the features you "
    197                     "requested: %s. Do you need to install a parser library?"
--> 198                     % ",".join(features))
    199         builder = builder_class()
    200         if not (original_features == builder.NAME or

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Does anyone know what I should do to fix it? I've reinstalled bs and lxml in the Python 3.6 environment but nothing's changed. I'm a beginner python coder.
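For what it's worth, this FeatureNotFound error usually means lxml is missing from (or not visible to) the environment the notebook kernel is actually running in, so a reinstall elsewhere doesn't help. One hedged fix is to install lxml with the same interpreter the kernel uses and then verify the import (adjust the environment name to yours):

```shell
# Install lxml into the environment the notebook kernel uses
# ("py36" here matches the path in the traceback; adjust to your setup)
conda activate py36
python -m pip install lxml

# Verify that BeautifulSoup can now find the lxml tree builder
python -c "import bs4; bs4.BeautifulSoup('<p>ok</p>', 'lxml'); print('lxml OK')"
```

If lxml still can't be found, passing "html.parser" (the stdlib parser) to BeautifulSoup is a slower but dependency-free fallback.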

Cool ideas.

In Quantopian's algorithms, can I make calls such as:
nasdaq_tickers = pd.read_csv('https:// ... ')

That way I can scrape the data, load it to a shared folder, and use it in before_trading_start.

Arun,

I think you might want to explore the capability we provide for loading your own data into Quantopian's Pipeline API, called Self-Serve Data.

Hope this helps,
Josh


Thanks. That's helpful.

I've tried running the notebook and I am getting the error below when I run the ComputeSimilarityScores10K(cik) function. It appears one CIK-ticker pair has an empty array, perhaps due to a scraped item with no data. Any suggestions for a workaround?

---------------------------------------------------------------------------  
ValueError                                Traceback (most recent call last)  
<ipython-input-39-d7bb75792ab5> in <module>()  
      4  
      5 for cik in tqdm(ticker_cik_df['cik']):  
----> 6     ComputeSimilarityScores10K(cik)

<ipython-input-37-f051d5332779> in ComputeSimilarityScores10K(cik)  
     71  
     72         # Calculate similarity scores  
---> 73         cosine_score = ComputeCosineSimilarity(words_A, words_B)  
     74         jaccard_score = ComputeJaccardSimilarity(words_A, words_B)  
     75 

<ipython-input-26-53fc3830c9b5> in ComputeCosineSimilarity(words_A, words_B)  
     32     array_A = np.array(vector_A).reshape(1, -1)  
     33     array_B = np.array(vector_B).reshape(1, -1)  
---> 34     cosine_score = cosine_similarity(array_A, array_B)[0,0]  
     35  
     36     return cosine_score

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)  
    915     # to avoid recursive import  
    916  
--> 917     X, Y = check_pairwise_arrays(X, Y)  
    918  
    919     X_normalized = normalize(X, copy=True)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)  
    108     else:  
    109         X = check_array(X, accept_sparse='csr', dtype=dtype,  
--> 110                         warn_on_dtype=warn_on_dtype, estimator=estimator)  
    111         Y = check_array(Y, accept_sparse='csr', dtype=dtype,  
    112                         warn_on_dtype=warn_on_dtype, estimator=estimator)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)  
    468                              " a minimum of %d is required%s."  
    469                              % (n_features, shape_repr, ensure_min_features,  
--> 470                                 context))  
    471  
    472     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(1, 0)) while a minimum of 1 is required by check_pairwise_arrays.  
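This ValueError comes from handing scikit-learn's cosine_similarity a zero-length vector (a filing whose cleaned text yielded no words). One hedged workaround is a small guard that skips such pairs instead of crashing; `compute` below stands in for whatever pairwise scorer the notebook already uses (e.g. its ComputeCosineSimilarity):

```python
def safe_similarity_score(words_a, words_b, compute):
    """Return None instead of raising when either filing yielded no words.

    `compute` is the existing pairwise scorer; this wrapper only adds the
    empty-input guard, so a later step can drop or impute the None rows.
    """
    if not words_a or not words_b:
        return None
    return compute(words_a, words_b)
```

Logging the offending CIK when None is returned also makes it easy to inspect whether the scrape for that filing actually failed.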

Some tickers have a "." in them to denote different share classes, like BRK.A and BRK.B. However, this format is not compatible with the SEC search function, so no CIK is returned. You need BRKA and BRKB to get the right CIKs from EDGAR.

You may want to clean up the tickers a little bit before matching CIKs.

Hope this helps.
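That clean-up can be as simple as stripping the share-class separator before the CIK lookup. A minimal sketch (the helper name is hypothetical; adjust the separator set to whatever your ticker source uses):

```python
import re

def normalize_ticker(ticker):
    """Strip share-class separators so EDGAR's search recognizes the symbol.

    e.g. 'BRK.A' -> 'BRKA', 'brk-b' -> 'BRKB'.
    """
    return re.sub(r"[.\-]", "", ticker.strip().upper())
```

Running the whole ticker column through this before matching CIKs should recover the multi-class symbols that currently come back empty.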

You may find the following resource useful: 10Ks and 10Qs in text format (cleaned in a way similar to what the notebook is doing).

https://sraf.nd.edu/data/stage-one-10-x-parse-data/