Short Interest Data pipeline for Quandl dataset

Hi there,

Thought I might share with this amazing community a project I developed to pull short interest data from Quandl using an Airflow DAG: https://github.com/jaycode/short_interest_effect

It pulls 7+ million short interest records from https://www.quandl.com/data/FINRA-Financial-Industry-Regulatory-Authority in a few hours, and then needs about one hour of run time each day to keep the data updated.
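For the curious, the Quandl dataset codes the pipeline requests follow a simple pattern. Here's a rough sketch (my own illustration, not the actual repo code):

```python
# Rough sketch (not the actual repo code): FINRA short-volume series on
# Quandl follow FINRA/FNSQ_<SYMBOL> (NASDAQ TRF) and FINRA/FNYX_<SYMBOL>
# (NYSE TRF).

def finra_codes(symbol):
    """Quandl dataset codes holding a ticker's daily short-volume data."""
    return [f"FINRA/FNSQ_{symbol}", f"FINRA/FNYX_{symbol}"]

# Fetching one series then looks roughly like this (needs the `quandl`
# package and an API key; not run here):
#   import quandl
#   quandl.ApiConfig.api_key = "YOUR_KEY"
#   df = quandl.get(finra_codes("TSLA")[0])
```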

To run:
1. Clone the repository.
2. Follow the "How to set up and run" section in the README.

The main problem for now is that it does not pull short interest data for inactive stocks, as the data from the old NASDAQ pages (https://old.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download and https://old.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download) do not include them. If any of you know where I can get a complete list of stock symbols, update the code and send a pull request (or otherwise let me know).
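For context, merging the two exchange CSVs into one symbol list boils down to something like this (a simplified sketch; `merge_symbol_lists` is a hypothetical helper, not the repo's actual code):

```python
import csv
import io

def merge_symbol_lists(*csv_texts):
    """Merge the 'Symbol' columns of several exchange CSV downloads
    (like the NASDAQ/NYSE screener files), de-duplicated and sorted."""
    symbols = set()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            symbols.add(row["Symbol"].strip())
    return sorted(symbols)
```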

System Requirements: 1 small Amazon EC2 server and 1 EMR cluster with 1 main node and 3 worker nodes.

The system takes care of dealing with the EMR cluster on its own, so the whole process is quite automatic.

Why didn't you post this last week? :-) I just built nearly the same thing, but not as fancy as it doesn't use cloud services. Thanks for the effort, and for sharing!

My pleasure! Last week it was still bug-ridden, and I didn't want to disappoint :) By the way, did you manage to get a complete list of stock symbols, including the inactive ones?

I am just using something I had lying around. I'm not going back that far, and I'm not too concerned about IPOs or dead companies just yet. The mapping between Quandl and Quantopian symbols is not perfect. I had to upload the data twice due to a difference in symbols related to stocks with subsequent issuance (e.g. GECC series L, M, and N, identified as GECCL on Quandl and GECC_L on Quantopian). I am also currently ignoring stocks with a period or dash. These are the nuances that make data science a computer science problem.

Great work Jay! Thank you for your efforts and especially sharing with the community.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Jay: This looks really good. I like the use of technology (Spark etc) and how clear the documentation is. Thanks for sharing!

A couple of questions:

• How easy is it to use these scripts for different data sources on Quandl and outside of Quandl?
• There is not too much information on what to do with the Quantopian file after it was created. Do you plan to add more docs around that? A NB that runs on Quantopian to read the data in and do some simple analysis? I think that would go a long way.
• Is it possible to close the loop from the automatic daily downloading of the data from Quandl to the automatic updating of the self-serve data on Quantopian?

I see, thanks for letting me know. I also found the same problem with GECCL; we're really doing exactly the same project :)

In my case, I did not want to mess up data for stocks like Berkshire Hathaway (BRK_A), so I ended up fixing the underscores issue. It did indeed become quite a fun computer science problem to work on.
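For anyone hitting the same issue, the separator part of the fix reduces to something like this (a simplified sketch, not the exact code in the repo):

```python
import re

def to_quantopian_symbol(symbol):
    """Replace class/series separators ('.' or '-') with the underscore
    Quantopian expects, e.g. 'BRK.A' -> 'BRK_A'.

    Note: fused suffixes like GECCL -> GECC_L can't be recovered this
    way; those need a reference list of base symbols to split against.
    """
    return re.sub(r"[.\-]", "_", symbol.upper())
```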

How easy is it to use these scripts for different data sources on Quandl and outside of Quandl?

If you look at the aws-latest or local-only branches, you'll see the tool was initially made to combine short interest and pricing data. I'd like to eventually turn this tool into an ultimate data collector for different data sources. Another idea that came to mind was scraping EDGAR data. Does this kind of data already exist somewhere, or should I attempt to build it?

To add a different data source, you'd just need to add another DAG with a structure similar to short_analysis_dag, then reuse and update the combine_dag found in the other branches I mentioned above.
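A bare-bones skeleton of such a DAG might look like this (task names and logic are placeholders, not the repo's actual tasks; the Airflow wiring is left commented so the callables stand on their own):

```python
# Hypothetical skeleton of an extra data-source DAG. The task logic lives
# in plain callables so it can be exercised without a running scheduler.

def pull_source(**context):
    """Download the raw data for this source (placeholder)."""
    return "raw-data"

def transform(**context):
    """Clean and reshape the raw data (placeholder)."""
    return "clean-data"

# The standard PythonOperator wiring (Airflow 1.10-era import path):
#
# from airflow import DAG
# from airflow.operators.python_operator import PythonOperator
# from datetime import datetime
#
# with DAG("my_source_dag", start_date=datetime(2020, 1, 1),
#          schedule_interval="@daily", catchup=False) as dag:
#     pull = PythonOperator(task_id="pull_source", python_callable=pull_source)
#     clean = PythonOperator(task_id="transform", python_callable=transform)
#     pull >> clean
```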

There is not too much information on what to do with the Quantopian file after it was created. Do you plan to add more docs around that? A NB that runs on Quantopian to read the data in and do some simple analysis? I think that would go a long way.

I created this tool as part of my upcoming online course on Python for Finance. That notebook has already been created, but currently only as part of the course. I'd like to build that kind of document at some point, but right now the deadline for the course is extremely tight (I have to push for beta in April, hence lots of caffeine and little sleep).

Is it possible to close the loop from the automatic daily downloading of the data from Quandl to the automatic updating of the self-serve data on Quantopian?

I hope I understand this question properly (please let me know if I don't), but did you mean: just leave the system humming, and the Self-Serve data on Quantopian should be updated automatically? The system does exactly that: it updates a CSV file on an S3 server every day at 00:00, and Quantopian pulls it later in the day through the Live Data feature (thank you for building this feature, by the way; it's a dream come true!). The reason the documentation talks about turning the EC2 server off and on is cost saving. I should probably have done it with AWS Lambda and Step Functions instead of Apache Airflow, but I am currently more comfortable with the latter. Personally, I only bring the system online whenever I need to use it in Quantopian (again, for cost saving).
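As a concrete illustration of that daily upload step (hypothetical key layout and bucket name, not the repo's actual config):

```python
from datetime import date

def daily_key(day, prefix="short-interest"):
    """S3 object key for a given day's CSV (hypothetical layout)."""
    return f"{prefix}/{day.isoformat()}.csv"

# The 00:00 upload is then roughly (needs boto3 and AWS credentials;
# bucket name is made up):
#   import boto3
#   boto3.client("s3").upload_file("output.csv", "my-bucket",
#                                  daily_key(date.today()))
```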

I am currently working on this project only part-time, but were it to become a full(er)-time project, I have a couple of interesting features in mind to make including new data sources more flexible.

By the way, I just found a bug in the produced dataset. Please do not use the system for now: the NASDAQ + NYSE data combination does not work properly, as it only uses one data source instead of combining them. I am adding a quality-control feature to ensure this problem won't recur. I'll let y'all know once it's done.

Alright, the code has been fixed! Feel free to fork and clone the project from my GitHub repository. In my case, I rebuild the server every working day just before 7-11 am UTC so the data is pulled into Quantopian's Self-Serve database, but if I miss the window (and boy, have I missed it a couple of times), I simply recreate the Short Interest database in the Self-Serve console.

First off, this is amazing. Thank you for doing this and sharing.

I was wondering, isn't it a bit redundant for each person to run all this fetching and processing code that takes hours every day? Wouldn't it make sense from an efficiency standpoint to share the CSV that it outputs? Or better yet, Quantopian could allow people to make their self-serve datasets public/shared.

@Jay, just wanted to add my appreciation as well - great work and thank you for sharing!!

@Viridian Hawk Yes, that makes sense, but unfortunately it is often against the agreements between vendors, Quandl, and users (I admit I haven't checked for this particular dataset). On top of that, hosting such a file incurs bandwidth costs; while there are inexpensive options, it is still a cost.

@Jay, great job. Thanks for sharing.

Thank you for the comments! Let me know if you have used it and about any problems that may arise.

Anyway, I just stumbled upon a potentially more complete list of stock tickers, provided by Quandl: 8000+ stocks, updated daily. See the "Available Tickers" section at https://www.quandl.com/data/EOD-End-of-Day-US-Stock-Prices/documentation. Direct link to the dataset: https://s3.amazonaws.com/quandl-production-static/end_of_day_us_stocks/ticker_list.csv

I'll update the code when I can, but that will probably be in a few weeks. If you are interested in helping, you may fork the project, update it, and send a pull request :)

Have a great weekend y'all!

@Jay, great work and thanks for sharing!

Hi Jay

This looks interesting, but when digging into the data, it seems at odds with other reported data. E.g., FINRA only requires that members report every two weeks (not daily). The last update on the NASDAQ website as I write this is for 31 Jan 2020; e.g., for TSLA, https://www.nasdaq.com/market-activity/stocks/tsla/short-interest shows:

SETTLEMENT DATE 01/31/2020
SHORT INTEREST 22752621
AVG. DAILY SHARE VOLUME 17334224
DAYS TO COVER 1.312584
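As a sanity check, "days to cover" is just short interest divided by average daily share volume, and the figures above are consistent:

```python
def days_to_cover(short_interest, avg_daily_volume):
    """Days to cover = short interest / average daily share volume."""
    return short_interest / avg_daily_volume

# NASDAQ's quoted figures for TSLA as of 01/31/2020:
print(round(days_to_cover(22_752_621, 17_334_224), 6))  # 1.312584
```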

Compare that to the data from Quandl for TSLA on 31 Jan 2020, which has two relevant links:
https://www.quandl.com/data/FINRA-Financial-Industry-Regulatory-Authority?keyword=tsla

https://www.quandl.com/data/FINRA/FNSQ_TSLA-FINRA-NASDAQ-TRF-Short-Interest-TSLA

Date 2020-01-31
ShortVolume 1,795,893
ShortExemptVolume 34,950
TotalVolume 3,532,151

Date 2020-01-31
ShortVolume 836,929
ShortExemptVolume 300
TotalVolume 2,939,794

Any ideas why the discrepancies? It seems the Quandl daily updates are incomplete relative to the official bi-weekly reporting requirements...
Maybe it's a stock-vs-flow issue, or an all-members vs. two-exchanges reporting issue?

@Stonks Tradar the difference I believe is that this FINRA data only includes the data from FINRA members that are required to report. The disadvantage is that the scope is limited. The advantage is that it is reported daily.

Hi @Stonks Tradar that is an interesting observation. @John Jones I see, what do you think would be the best way to consolidate these two data sources?

@Jay:

I'd like to eventually turn this tool into an ultimate data-collector for different data sources.

I think that's an excellent vision for this tool.

Another that came to mind was scraping EDGAR data. Is there already this kind of data somewhere or should I attempt to build it?

Not that I know of.

I created this tool as a part of my upcoming online course on Python for Finance.

Definitely make sure to advertise that course here on the forums, I'm sure many users would be interested in it.

The system does exactly that: it updates a CSV file on an S3 server every day at 00:00, and Quantopian pulls it later in the day through the Live Data feature (thank you for building this feature, by the way; it's a dream come true!).

OK that is what I was asking, just wanted to make sure it was actually piped all the way through -- that's awesome.

This company, S3 (not to be confused with Amazon S3), seems to update their short interest data at least daily and is more up-to-date than the Nasdaq site.
For example, here is the current Tesla short interest as of Friday:

20.80MM shs shorted

Why does the Quandl "short interest" endpoint only have fields for "short volume"? These are presumably two different things, no?

Hope this helps. From Zacks: Distinguishing Between Short Volume vs Short Interest
It is easy to misuse various terms related to this particular analytical method. That being said, taking the time to properly understand how these particular ideas are distinguished from one another can help reduce risk and increase the likelihood of profitable gain.

Whereas the term “short volume” measures the number of shares that have been shorted over a given period of time, “short interest” represents the number of shorted shares that have yet to be closed out or covered by investors. Taking this one step further, the short interest ratio is used to assess how long it would take for all shorted shares currently in play to be covered. The larger the ratio becomes, the more outstanding shorted shares have yet to be closed.

Why does the Quandl "short interest" endpoint only have fields for
"short volume"? These are presumably two different things, no?

Indeed!

Short volume = number of shares that were sold short during a period of time.
Short interest = short positions currently held/open (point in time).

Quandl also has a "short exempt volume" column which, as explained on Investopedia, covers short sales exempt from the uptick rule. It looks to me, then, that a proper ratio would combine short volume and short exempt volume, divided by the total volume?
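Something like this, using the FNSQ_TSLA numbers quoted earlier in the thread (just my speculation on the formula, not an official definition):

```python
def short_volume_ratio(short_vol, short_exempt_vol, total_vol):
    """Fraction of a day's reported volume executed as short sales
    (speculative ratio combining short and short-exempt volume)."""
    return (short_vol + short_exempt_vol) / total_vol

# FNSQ_TSLA figures for 2020-01-31:
print(round(short_volume_ratio(1_795_893, 34_950, 3_532_151), 4))  # 0.5183
```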

In addition to the up-tick rule, I believe in some markets (though I'm not sure about the US), market makers (MMs) are also 'exempt' from having to flag short sale orders. The reason: if a MM has no position in a certain stock and sends both a buy and a sell order resting at the best bid and offer respectively, the sell order would need to be flagged as a short sale (no current position). However, if the buy order gets filled first, the sell order is no longer a short sale but rather a 'long sale'; so rather than having to pull the sell order just to change the flag (and lose their place in the order queue), MMs are instead 'exempt' from having to flag short sale orders.

Again, I’m not entirely sure this is the case in the US. I believe MMs might be required to flag all sell orders based on their position at the time orders are sent, which would make the reported short volume number inflated.

Digging into the weeds of this: so yeah, one source (Nasdaq) is a 'stock' variable named "short interest", and the other (Quandl) is a 'flow' variable called "short volume".
Ref: https://en.wikipedia.org/wiki/Stock_and_flow

The best reference I've found for understanding short interest vs short sale volume is from FINRA itself:
https://www.finra.org/rules-guidance/notices/information-notice-051019

Note this paragraph...
"Finally, short sale volume data does not—and is not intended to—equate to reported bi-monthly short interest information. FINRA rules require firms to report, on a per security basis, the total quantity of shares held as short positions in all customer and proprietary firm accounts twice a month. FINRA publishes the short interest data for OTC equity securities on its website, while the data for listed stocks is published by the exchange on which the stock is listed. Although some websites redistribute the Daily File and refer to the data as “short interest,” it is not, in fact, the equivalent of reported short interest information."

http://regsho.finra.org/regsho-Index.html

@Thomas Wiecki maybe you can get S3 on FactSet proper (currently candidate only) and get them to sponsor a challenge on Q using their data :)
https://open.factset.com/products/short-interest-and-securities-finance-data/en-us

@Stonks Tradar whoa, that is valuable information!

This part is also interesting:

Some market participants mistakenly conclude that the bi-monthly short interest data is understated because the Daily File reflects short sale volume that is much larger than what is reported as short interest. However, short interest data reflects short positions held by market participants at a specific moment in time on two discrete days each month, while the Daily File reflects the aggregate volume of trades executed as short sales on each trade date. Therefore, while the two data sets are related (i.e., short sale volume may ultimately result in a reportable short interest position), they are not necessarily correlated.

Do you think the short interest ownership data and the Short Interest Daily File (our dataset) may both be useful to give a better perspective on how short interest is distributed, or do we need only one of them?

Yeah Jay that's the research question.

Cos im a stonks guy I prefer the stock variable ha3, but actually there cud very well be signal in the flow data too!

The problem with the short interest data is its low frequency (every 2 weeks) and then reported with a lag of like 10-12 days after those bi-weekly settlement dates eg nasdaq schedule https://www.nasdaqtrader.com/Trader.aspx?id=ShortIntPubSch.
Also to get it for free youll prob have to scrape from the exchange webpages but I hear your ip address can get banned quickly if yur not careful ;)

As for the short sale volume data its timely and free n easy to download, but clearly not the same as the real stuff. Having said that it may still have value in and of itself. Also maybe you cud even create an ML model to proxy/nowcast/impute the real short interest data - that cud also be v interesting too!

I wud suggest tho, that u clearly relabel this dataset in yur repo to 'short sale volume' to avoid further confusion, cos quandl have mixed it up on theirs...

HTH

@Stonks Tradar Yes, that is indeed a research problem, and an interesting one at that too. Agreed that it needs to be renamed; maybe later, when I turn the system into some kind of generalized scraper.

Cos im a stonks guy I prefer the stock variable ha3

Ah, I see you're a man of culture as well