Quantopian Partner Data - How is it Collected, Processed, and Surfaced?

Recently, there were some questions about the steps that Quantopian takes between receiving data from a partner data source and surfacing it in research and algorithms.

I thought it would be best to answer those questions by giving a walkthrough of the way we collect, clean, and surface fundamental and partner datasets.

Base Tables

At some regular interval, each of our partner vendors uploads files with their data to an FTP server. On another regular interval (hourly for most datasets), we download the file(s) from the FTP server. At this point, we do some processing on the data to fit it to a particular format, and write it to a database in what we call a 'base table'.

When the data is written, it includes an asof_date provided by the vendor as well as a timestamp set by Quantopian. The timestamp is set to the time at which the data is written to our database. In Pipeline, the timestamp is when the data is made available, while the asof_date tells us the date to which the data corresponds. We refer to this as 'point-in-time' data, and it is the same approach we use for split and dividend adjustments on pricing data.

For example, if an earnings announcement is reported on 03/15/2016 and we learn about it 04/01/2016, the asof_date would be 03/15/2016, and the timestamp would be 04/01/2016. This means that the event that was reported on March 15th was learned about on April 1st.
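To make the two dates concrete, here is a minimal sketch of the example above. The dictionary layout, the symbol "XYZ", and the `is_visible` helper are hypothetical and only illustrate the idea; they are not Quantopian's actual schema or API.

```python
from datetime import date

# Hypothetical record layout. The asof_date and timestamp field names come
# from the post; the symbol, value, and dict structure are assumptions.
record = {
    "symbol": "XYZ",
    "asof_date": date(2016, 3, 15),  # date the event corresponds to (from the vendor)
    "timestamp": date(2016, 4, 1),   # date the data was written to the database
    "value": 1.0,
}


def is_visible(record, simulation_date):
    """Return True once the record's timestamp is not later than the simulation date."""
    return record["timestamp"] <= simulation_date


# The March 15th event cannot be used on March 20th, because it was only
# learned about on April 1st.
visible_in_march = is_visible(record, date(2016, 3, 20))
visible_in_april = is_visible(record, date(2016, 4, 1))
```

In this sketch, a simulation running on March 20th would not see the record even though its asof_date is earlier, which is the point of keeping both dates.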

The exception to this rule is that for each new dataset, we start out by doing a historical load. In this case, we set the timestamp to the asof_date plus some degree of lag that we deem reasonable. Frequently, we already have loaders reading from the FTP server before we do the official historical load, so we can derive the lag from empirical data.
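As a rough illustration of the historical-load rule, the sketch below derives a lag from live observations and uses it to backfill timestamps. The post does not say how the lag is computed, so the use of the median here is an assumption, and `empirical_lag` and `backfill_timestamp` are hypothetical helper names.

```python
from datetime import date, timedelta
from statistics import median


def empirical_lag(live_records):
    """Median gap in days between asof_date and timestamp in live-collected data.

    The median is an assumption made for this sketch; the post only says the
    lag is derived from empirical data.
    """
    lags = [(timestamp - asof_date).days for asof_date, timestamp in live_records]
    return timedelta(days=median(lags))


def backfill_timestamp(asof_date, lag):
    """Timestamp assigned to a historical record: asof_date plus the chosen lag."""
    return asof_date + lag


# Hypothetical (asof_date, timestamp) pairs observed while loading live.
live = [
    (date(2016, 3, 1), date(2016, 3, 2)),
    (date(2016, 3, 2), date(2016, 3, 4)),
    (date(2016, 3, 3), date(2016, 3, 4)),
]

lag = empirical_lag(live)
historical_timestamp = backfill_timestamp(date(2010, 1, 5), lag)
```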

Deltas

In the event that we download a file with new data for a particular record** that already exists in the base table, we write the new data to a separate table that we call a 'deltas' table. For example, let's say one of our sentiment data providers uploaded a file that looks like this:

symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.5


And then the next day they gave us a file that looks like this:

symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.6
AAPL, 01/06/2015, -0.2


The base table in the database would look like this:

symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.5
AAPL, 01/06/2015, -0.2


and the deltas table would look like this:

symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.6


** A record is usually distinguished by a (security, asof_date) pair.
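The routing described above can be sketched in a few lines. This is only an illustration using in-memory Python structures, assuming a record is keyed by (symbol, asof_date); the `ingest` function is hypothetical, and the post does not say what happens when a vendor re-sends an identical value, so that case is not handled specially here.

```python
def ingest(rows, base, deltas):
    """Route each incoming row to the base table or the deltas table.

    base is a dict keyed by (symbol, asof_date); deltas is a list of rows.
    Both structures are assumptions made for this sketch.
    """
    for symbol, asof_date, sentiment_score in rows:
        key = (symbol, asof_date)
        if key in base:
            # The record already exists, so the new data goes to deltas.
            deltas.append((symbol, asof_date, sentiment_score))
        else:
            base[key] = sentiment_score


base = {}
deltas = []

# First file from the vendor.
ingest([("AAPL", "01/05/2015", 0.5)], base, deltas)

# File received the next day, restating 01/05 and adding 01/06.
ingest(
    [("AAPL", "01/05/2015", 0.6), ("AAPL", "01/06/2015", -0.2)],
    base,
    deltas,
)
```

After both files are ingested, `base` and `deltas` hold the same rows as the two tables shown above.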

What does this mean for your research and algorithms?

The base table and deltas tables are directly available in research.

In research, you can access the base table for a partner dataset in interactive mode with:

from quantopian.interactive.data.vendor import dataset


and you can access the deltas table with:

from quantopian.interactive.data.vendor import dataset_deltas


In Pipeline, you are always getting the base table with the deltas table applied to it. A pipeline/simulation always uses the most recent data for each record. However, it never uses data with a timestamp more recent than the current simulation time.*** This is true for both research and backtesting.
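A minimal sketch of that rule, continuing the AAPL example: among all base and delta rows for a record, take the one with the latest timestamp that is not later than the simulation date. The row layout, the `value_as_of` helper, and the specific timestamps (one file written on 01/05/2015, the next on 01/06/2015) are assumptions for illustration, not the actual Pipeline implementation.

```python
from datetime import date


def value_as_of(key, base_rows, delta_rows, simulation_date):
    """Most recent value for a record that was known by simulation_date.

    Returns None if nothing for that record had been written yet.
    """
    candidates = [
        row
        for row in base_rows + delta_rows
        if row["key"] == key and row["timestamp"] <= simulation_date
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["timestamp"])["value"]


aapl_jan5 = ("AAPL", date(2015, 1, 5))

# Assumed timestamps: the original value was written on 01/05/2015 and the
# restated value arrived with the next day's file.
base_rows = [{"key": aapl_jan5, "timestamp": date(2015, 1, 5), "value": 0.5}]
delta_rows = [{"key": aapl_jan5, "timestamp": date(2015, 1, 6), "value": 0.6}]

seen_on_jan5 = value_as_of(aapl_jan5, base_rows, delta_rows, date(2015, 1, 5))
seen_on_jan6 = value_as_of(aapl_jan5, base_rows, delta_rows, date(2015, 1, 6))
seen_on_jan4 = value_as_of(aapl_jan5, base_rows, delta_rows, date(2015, 1, 4))
```

Under these assumptions, a simulation on 01/05/2015 sees the original 0.5, and one on 01/06/2015 or later sees the restated 0.6.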

Deltas exist to handle restatements and corrections of errors, particularly within historical lookback windows, while still preventing look-ahead bias. For signals derived from machine learning techniques, we decided not to apply the deltas, to protect against retraining of the models.

Now, with all that said, we do not actually apply deltas to the Alpha Vertex dataset (sorry for making you read all that!). Why? We only want data that we can guarantee to be out-of-sample on the platform. That means that we only ever surface data that was collected live.

*** When using a pipeline in research, each day's computations are performed as if the simulation time is the current date in the pipeline.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

2 responses

This is a very interesting and useful explanation. Thanks!

Hi Jamie -

I'm confused on how you approach this for a given data set. For example, for the Alpha Vertex data sets, you provide a note (e.g. "Quantopian started collecting this dataset live on March 6, 2017" on https://www.quantopian.com/data/alpha_vertex/precog_top_500). However, for the Twitter data, you provide no such note indicating when you went live. Since both data sets are derived from algorithms, and so are effectively factor feeds, I would think that you would take the same approach for both. Yet, it appears you are treating the Twitter data differently. Quantopian wasn't even in business in 2009, but the Twitter data starts on 24 Oct 2009. So, I'm confused.

The reason I'm asking is that as we learned for the Alpha Vertex data, over-fitting can creep into data sets. If I'm gonna spend time on the Twitter data set, I need to know the in-sample and out-of-sample periods.

Looking forward, how will you approach the Factset data in this regard? How will users know which period to treat as in-sample, and which as out-of-sample? Will you manage things as you describe above for all data sets? If not, which ones will be treated as derived and have in-sample and out-of-sample periods, and which ones will be treated as originating from direct measurements (basically trusting the vendor that they didn't do any post-processing of the data and could have introduced bias, whether intentionally or unintentionally)?