Recently, there were some questions asked about the steps that Quantopian takes between receiving data from a partner data source and surfacing it to research/algorithms.
I thought it would be best to answer those questions by giving a walkthrough of the way we collect, clean, and surface fundamental and partner datasets.
Base Tables
At some regular interval, each of our partner vendors uploads files with their data to an FTP server. On another regular interval (hourly for most datasets), we download the file(s) from the FTP server. At this point, we do some processing to fit the data to a particular format and write it to a database, in what we call a 'base table'.
When the data is written, it includes an asof_date provided by the vendor as well as a timestamp set by Quantopian. The timestamp is set to the time at which the data is written to our database. In Pipeline, the timestamp is when the data is made available, while the asof_date tells us the date to which the data corresponds. We refer to this as 'point-in-time' data handling, and it is the same approach we take for split and dividend adjustments on pricing data.
For example, if an earnings announcement is reported on 03/15/2016 and we learn about it on 04/01/2016, the asof_date would be 03/15/2016 and the timestamp would be 04/01/2016: the event that occurred on March 15th became known to us on April 1st.
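To make the two dates concrete, here is a minimal pandas sketch of that rule. The data is made up; only asof_date and timestamp mirror the fields described above, and the filter is just an illustration of the point-in-time idea:

import pandas as pd

# Hypothetical earnings record: it happened on 03/15/2016, but we learned
# about it (and wrote it to our database) on 04/01/2016.
events = pd.DataFrame({
    'symbol': ['AAPL'],
    'asof_date': [pd.Timestamp('2016-03-15')],
    'timestamp': [pd.Timestamp('2016-04-01')],
})

# Point-in-time visibility: on a given simulation date, only rows whose
# timestamp has already passed can be seen.
visible_on_0320 = events[events['timestamp'] <= pd.Timestamp('2016-03-20')]  # empty: not known yet
visible_on_0405 = events[events['timestamp'] <= pd.Timestamp('2016-04-05')]  # one row: now visible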
The exception to this rule is the historical load we do when we first bring on a new dataset. In that case, we set the timestamp to the asof_date plus some degree of lag that we deem reasonable. Frequently, we already have loaders reading from the FTP before we do the official historical load, so we can derive that lag from empirical data.
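As a rough sketch of what that timestamping looks like for a historical load (the one-day lag here is purely illustrative; in practice the lag is derived from what the live loaders observed):

import pandas as pd

# Hypothetical rows from a historical load: no recorded write times exist,
# so each row is stamped with its asof_date plus an assumed lag.
historical = pd.DataFrame({
    'symbol': ['AAPL', 'MSFT'],
    'asof_date': pd.to_datetime(['2015-01-05', '2015-01-05']),
    'sentiment_score': [0.5, 0.1],
})
assumed_lag = pd.Timedelta(days=1)  # illustrative value only
historical['timestamp'] = historical['asof_date'] + assumed_lag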
Deltas
In the event that we download a file with new data for a particular record** that already exists in the base table, we write the new data to a separate table that we call a 'deltas' table. For example, let's say one of our sentiment data providers uploaded a file that looks like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.5
And then the next day they gave us a file that looks like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.6
AAPL, 01/06/2015, -0.2
The base table in the database would look like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.5
AAPL, 01/06/2015, -0.2
and the deltas table would look like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.6
** A record is usually distinguished by a (security, asof_date) pair.
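To make the mechanics concrete, here is a small pandas sketch (not our actual loader code) of how a deltas row supersedes the base row for the same (symbol, asof_date) record. The timestamps are made up for illustration:

import pandas as pd

base = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL'],
    'asof_date': pd.to_datetime(['2015-01-05', '2015-01-06']),
    'sentiment_score': [0.5, -0.2],
    'timestamp': pd.to_datetime(['2015-01-06', '2015-01-07']),
})
deltas = pd.DataFrame({
    'symbol': ['AAPL'],
    'asof_date': pd.to_datetime(['2015-01-05']),
    'sentiment_score': [0.6],
    'timestamp': pd.to_datetime(['2015-01-07']),
})

# Stack both tables and keep the most recently written row for each record.
combined = pd.concat([base, deltas]).sort_values('timestamp')
latest = combined.groupby(['symbol', 'asof_date'], as_index=False).last()
# latest now shows 0.6 for 01/05/2015 and -0.2 for 01/06/2015.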
What does this mean for your research and algorithms?
The base table and deltas tables are directly available in research.
In research, you can access the base table for a partner dataset in interactive mode with:
from quantopian.interactive.data.vendor import dataset
and you can access the deltas table with:
from quantopian.interactive.data.vendor import dataset_deltas
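Both imports give you blaze expressions. One common way to pull a table into a pandas DataFrame for inspection (assuming the usual blaze/odo workflow, and using the placeholder names from the imports above) is:

import pandas as pd
from odo import odo

base_df = odo(dataset, pd.DataFrame)            # the base table
deltas_df = odo(dataset_deltas, pd.DataFrame)   # the deltas table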
In Pipeline, you are always getting the base table with the deltas table applied to it. A pipeline/simulation always uses the most recent data for each record. However, it never uses data with a timestamp more recent than the current simulation time.*** This is true for both research and backtesting.
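In pandas terms, the effective rule could be approximated like this (an illustration only, not the actual Pipeline implementation; base and deltas are the frames from the deltas sketch above):

import pandas as pd

def point_in_time_view(base, deltas, sim_time):
    # Combine the base and deltas tables, drop anything written after the
    # current simulation time, and keep the latest remaining row per record.
    combined = pd.concat([base, deltas])
    visible = combined[combined['timestamp'] <= sim_time].sort_values('timestamp')
    return visible.groupby(['symbol', 'asof_date'], as_index=False).last()

# On 2015-01-06 this still shows 0.5 for the 01/05 record; from 2015-01-07
# onward it shows the corrected 0.6.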
Deltas exist so that restatements and corrections of errors can be applied, especially within historical lookback windows, while still enforcing a lack of look-ahead bias. For signals derived from machine learning techniques, we decided not to apply deltas, to protect against retraining of the models.
Now, with all that said, we do not actually apply deltas to the Alpha Vertex dataset (sorry for making you read all that!). Why? We only want data that we can guarantee to be out-of-sample on the platform. That means that we only ever surface data that was collected live.
*** When using a pipeline in research, each day's computations are performed as if the simulation time is the current date in the pipeline.