I thought it would be best to answer those questions by giving a walkthrough of the way we collect, clean, and surface fundamental and partner datasets.
At some regular interval, each of our partner vendors uploads files with their data to an FTP server. On another regular interval (hourly for most datasets), we download the file(s) from that server. At this point, we process the data to fit a particular format and write it to a database in what we call a 'base table'.
When the data is written, it includes an asof_date provided by the vendor as well as a timestamp set by Quantopian. The timestamp is set to the time at which the data is written to our database. In Pipeline, the timestamp is when the data is made available, while the asof_date tells us the date to which the data corresponds. We refer to this as 'point-in-time', which is the same as how we perform split and dividend adjustments on pricing data.
For example, if an earnings announcement is reported on 03/15/2016 and we learn about it 04/01/2016, the asof_date would be 03/15/2016, and the timestamp would be 04/01/2016. This means that the event that was reported on March 15th was learned about on April 1st.
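To make the asof_date/timestamp distinction concrete, here is a minimal sketch in plain Python. It is not Quantopian's actual implementation: the record layout (a list of dicts) and the helper name visible_records are made up for illustration, and the rule "visible once timestamp is on or before the simulation date" is my reading of the description above.

```python
from datetime import date

# Hypothetical record layout: one dict per row, using the dates from the example.
records = [
    {
        "event": "earnings announcement",
        "asof_date": date(2016, 3, 15),  # date the data corresponds to (from the vendor)
        "timestamp": date(2016, 4, 1),   # date we learned about it (set on write)
    },
]


def visible_records(records, simulation_date):
    """Return only the records we would have known about on simulation_date."""
    return [r for r in records if r["timestamp"] <= simulation_date]


# On March 20th the event has already happened, but we have not learned of it yet.
known_march_20 = visible_records(records, date(2016, 3, 20))

# On April 1st the record becomes visible, still labeled with its March 15th asof_date.
known_april_1 = visible_records(records, date(2016, 4, 1))
```

Filtering on timestamp rather than asof_date is what keeps a simulation from seeing the March 15th event before April 1st.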
The exception to this rule is that for each new dataset, we start out by doing a historical load. In this case, we set the timestamp to the asof_date + some degree of lag that we deem reasonable. Frequently, we already have loaders reading from the FTP before we do the official historical load, so we can derive the lag from empirical data.
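As a rough illustration of the historical-load rule, the sketch below derives a lag from records that were collected live and then uses it to assign timestamps to backfilled records. Everything here is an assumption for the sake of the example: the helper names, the sample dates, and especially the choice of the (high) median as the "reasonable" lag, since the actual rule for picking the lag is not specified above.

```python
from datetime import date, timedelta
from statistics import median_high

# Hypothetical live records: we observed both the asof_date and the real timestamp.
live_records = [
    {"asof_date": date(2016, 1, 4), "timestamp": date(2016, 1, 5)},  # 1 day lag
    {"asof_date": date(2016, 1, 5), "timestamp": date(2016, 1, 7)},  # 2 day lag
    {"asof_date": date(2016, 1, 6), "timestamp": date(2016, 1, 8)},  # 2 day lag
]


def empirical_lag_days(live_records):
    """Pick a lag (in days) from observed data. Using the high median is an assumption."""
    lags = [(r["timestamp"] - r["asof_date"]).days for r in live_records]
    return median_high(lags)


def backfill_timestamps(historical_records, lag_days):
    """Return copies of the records with timestamp = asof_date + lag_days."""
    return [
        dict(r, timestamp=r["asof_date"] + timedelta(days=lag_days))
        for r in historical_records
    ]


lag = empirical_lag_days(live_records)
backfilled = backfill_timestamps([{"asof_date": date(2015, 1, 5)}], lag)
```

With these sample values the derived lag is 2 days, so the historical record dated 2015-01-05 gets a timestamp of 2015-01-07.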
In the event that we download a file with new data for a particular record** that already exists in the base table, we write the new data to a separate table that we call a 'deltas' table. For example, let's say one of our sentiment data providers uploaded a file that looks like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.5
And then the next day they gave us a file that looks like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.6
AAPL, 01/06/2015, -0.2
The base table in the database would look like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.5
AAPL, 01/06/2015, -0.2
and the deltas table would look like this:
symbol, asof_date, sentiment_score
AAPL, 01/05/2015, 0.6
** A record is usually distinguished by a (security, asof_date) pair.
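The base/deltas split in the sentiment example can be sketched roughly as follows. This is only an illustration in plain Python, not our loader code: the ingest function and the in-memory dict/list standing in for the two tables are hypothetical, and I am assuming that a repeated row with an unchanged value is simply ignored.

```python
def ingest(rows, base, deltas):
    """Route each incoming row to the base table or the deltas table.

    rows:   list of (symbol, asof_date, sentiment_score) tuples from one file
    base:   dict keyed by (symbol, asof_date), standing in for the base table
    deltas: list of rows, standing in for the deltas table
    """
    for symbol, asof_date, score in rows:
        key = (symbol, asof_date)
        if key not in base:
            base[key] = score  # first time we see this record
        elif base[key] != score:
            deltas.append((symbol, asof_date, score))  # new data for an existing record
        # Assumption: an identical repeat of an existing record is dropped.


base, deltas = {}, []

# First day's file.
ingest([("AAPL", "01/05/2015", 0.5)], base, deltas)

# Next day's file: one changed record and one brand new record.
ingest([("AAPL", "01/05/2015", 0.6), ("AAPL", "01/06/2015", -0.2)], base, deltas)
```

After both files, base holds the original 0.5 for 01/05/2015 plus the new -0.2 for 01/06/2015, and deltas holds the single 0.6 row, which matches the two tables shown above.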
What does this mean for your research and algorithms?
The base and deltas tables are both directly available in research.
In research, you can access the base table for a partner dataset in interactive mode with:
from quantopian.interactive.data.vendor import dataset
and you can access the deltas table with:
from quantopian.interactive.data.vendor import dataset_deltas
In Pipeline, you are always getting the base table with the deltas table applied to it. A pipeline/simulation always uses the most recent data for each record. However, it never uses data with a timestamp more recent than the current simulation time.*** This is true for both research and backtesting.
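Here is a small sketch of what "base table with the deltas applied, but nothing newer than the simulation time" could look like. It is a simplified model rather than the actual Pipeline loader: the function name point_in_time_view is made up, and the timestamp values on each row are hypothetical additions to the sentiment example (the example files do not show them).

```python
from datetime import date

# Sentiment example rows, with hypothetical timestamps added for illustration.
base = [
    {"symbol": "AAPL", "asof_date": date(2015, 1, 5), "sentiment_score": 0.5,
     "timestamp": date(2015, 1, 6)},
    {"symbol": "AAPL", "asof_date": date(2015, 1, 6), "sentiment_score": -0.2,
     "timestamp": date(2015, 1, 7)},
]
deltas = [
    {"symbol": "AAPL", "asof_date": date(2015, 1, 5), "sentiment_score": 0.6,
     "timestamp": date(2015, 1, 7)},
]


def point_in_time_view(base, deltas, simulation_date):
    """Most recent known value per (symbol, asof_date) as of simulation_date."""
    latest = {}
    # Process rows oldest first so later knowledge overwrites earlier knowledge.
    for row in sorted(base + deltas, key=lambda r: r["timestamp"]):
        if row["timestamp"] > simulation_date:
            continue  # not known yet at simulation time: skip to avoid look-ahead
        latest[(row["symbol"], row["asof_date"])] = row["sentiment_score"]
    return latest


view_jan_6 = point_in_time_view(base, deltas, date(2015, 1, 6))
view_jan_7 = point_in_time_view(base, deltas, date(2015, 1, 7))
```

Under these assumed timestamps, a simulation on January 6th sees only the original 0.5 for the January 5th record, while a simulation on January 7th sees the corrected 0.6 along with the new -0.2 record.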
Deltas exist to capture restatements and corrections of errors. They matter most for historical lookback windows, where a value may have been revised after it was first reported, and applying them by timestamp still prevents look-ahead bias. For signals derived from machine learning techniques, we decided not to apply the deltas, to protect against retraining of the models.
Now, with all that said, we do not actually apply deltas to the Alpha Vertex dataset (sorry for making you read all that!). Why? We only want data that we can guarantee to be out-of-sample on the platform. That means that we only ever surface data that was collected live.
*** When using a pipeline in research, each day's computations are performed as if the simulation time is the current date in the pipeline.