Data Reference Overview¶
The Data Reference provides an overview of the data available on Quantopian as well as documentation for each dataset. The documentation includes descriptions, code examples, historical coverage, update frequencies, and more.
The sections below outline some of the concepts that are shared between all or most datasets on Quantopian.
This section describes the overall coverage of data available on Quantopian. Individual datasets may have more or less coverage.
The Quantopian platform provides data for global equities. Quantopian provides historically accurate equity data (including stocks, ETFs, ADRs, and more) going back as far as 2002 in the US and as far as 2004 in 25 other countries. This includes equities that are no longer trading today.
Providing data for equities that are no longer listed is important because it helps avoid survivorship bias in quantitative research. Databases that omit delisted assets ignore bankruptcies and other important events, and lead to false optimism about a factor or strategy. For example, LEH (Lehman Brothers) was a tradable asset in 2008, even though the company no longer exists today; Lehman's bankruptcy was a major event that affected many algorithms at the time.
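The effect can be illustrated with a toy calculation. This is only a sketch: the tickers and returns below are invented for illustration and are not real market data.

```python
# Hypothetical one-year returns; the values are invented for illustration only.
returns = {
    "AAAA": 0.08,
    "BBBB": 0.12,
    "CCCC": -1.00,  # went bankrupt and was delisted during the year
}
delisted = {"CCCC"}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Universe as it existed at the time (includes the delisted asset).
full_universe = mean(returns.values())

# Universe built only from assets that still trade today (the survivors).
survivors_only = mean(r for name, r in returns.items() if name not in delisted)

print(f"with delisted assets:    {full_universe:+.2%}")
print(f"survivors only (biased): {survivors_only:+.2%}")
```

The survivors-only average looks healthy only because the asset that lost everything was dropped from the sample.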
Below is a table of countries and exchanges for which Quantopian provides equity data.
|Country||Country Code||Pipeline Domain||Supported Exchanges|
||Vienna Stock Exchange|
||Australian Securities Exchange, National Stock Exchange of Australia|
||Sao Paulo Stock Exchange|
||Toronto Stock Exchange, TSX Venture Exchange, Canadian Securities Exchange|
||Shenzhen Stock Exchange, Shanghai Stock Exchange|
||NASDAQ OMX Copenhagen|
||NASDAQ OMX Helsinki|
||Berlin Stock Exchange, Dusseldorf Stock Exchange, XETRA, Frankfurt Stock Exchange, Hamburg Stock Exchange, Hannover Stock Exchange, Munich Stock Exchange, Stuttgart Stock Exchange, Xetra Indices|
||London Stock Exchange, ICAP Securities & Derivatives Exchange, Cboe Europe Equities CXE|
||Hong Kong Stock Exchange|
||Bombay Stock Exchange, National Stock Exchange of India|
||Irish Stock Exchange, Irish Stock Exchange Bonds & Funds|
||Milan Stock Exchange|
||Tokyo Stock Exchange, JASDAQ, Osaka Exchange, Nagoya Stock Exchange, Fukuoka Stock Exchange, Sapporo Securities Exchange|
||New Zealand Stock Exchange|
||Korea Exchange, Korea KONEX|
||Madrid Stock Exchange/Spanish Markets|
||NASDAQ OMX Stockholm, AktieTorget, Nordic Growth Market|
||SIX Swiss Exchange, BX Swiss AG, Swiss Fund Data|
||NYSE, NASDAQ, AMEX|
When researching and developing a quantitative investment strategy, it is critical to have a reliable way of identifying equities. However, across the finance industry, there is no one standard that everyone uses to identify equities. Depending on the source of data, equities can be identified using ticker symbols, CUSIPs, SEDOLs, and more. Consolidating data identified using different systems can be a difficult task. To solve this problem, Quantopian collects and surfaces data through uniform APIs by first mapping all datasets to a common set of identifiers called SIDs (security identifiers). SIDs are integer labels that maintain a consistent reference to a particular equity even over symbol changes and other events.
The way that SIDs are determined is different for US and non-US equities. The two methods are described below.
For US equities, SIDs are provided to Quantopian from a third-party vendor. The same vendor also provides mappings from CUSIP and ticker symbols to SIDs for US equities. When integrating a dataset, Quantopian uses these mappings along with proprietary algorithms to associate records from the new dataset with SIDs. For example, FactSet provides Quantopian with CUSIP labels for each of their datasets. These CUSIPs are used to label records from FactSet datasets with SIDs so that the datasets can be used alongside datasets from other vendors.
Global SIDs (Excluding US)¶
For non-US equities, SIDs are generated from FactSet's proprietary FSYM Regional Identifiers. Currently, all global (excl. US) data is sourced from FactSet, so aligning identifiers between vendors is not required.
In addition to SIDs, equities on Quantopian are labeled with a ticker symbol. Whenever an equity is displayed in the application (including pipeline output, backtest transactions, etc.), it is typically displayed with its SID and current ticker symbol. Functions like symbols() in research and the IDE support historically accurate ticker symbol lookups, but any time an equity is displayed, the accompanying ticker symbol is the current symbol, even in historical simulations.
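The relationship between a stable SID and a changing ticker can be sketched as follows. This is not Quantopian's implementation: the SID, the tickers, the date ranges, and the helper names lookup_sid and current_symbol are all hypothetical.

```python
from datetime import date

# Hypothetical symbol history: (sid, ticker, first_date, last_date).
# The SID and tickers are invented; they are not real Quantopian identifiers.
SYMBOL_HISTORY = [
    (1001, "OLDT", date(2010, 1, 4), date(2015, 6, 30)),
    (1001, "NEWT", date(2015, 7, 1), date.max),
]


def lookup_sid(ticker, as_of):
    """Return the SID that traded under `ticker` on `as_of`, or None."""
    for sid, symbol, start, end in SYMBOL_HISTORY:
        if symbol == ticker and start <= as_of <= end:
            return sid
    return None


def current_symbol(sid):
    """Return the most recent ticker for `sid` (the one shown in displays)."""
    rows = [row for row in SYMBOL_HISTORY if row[0] == sid]
    return max(rows, key=lambda row: row[3])[1]


# Both tickers resolve to the same SID when looked up on a date they were valid.
print(lookup_sid("OLDT", date(2012, 1, 3)), lookup_sid("NEWT", date(2016, 1, 4)))
# A display would show the current ticker, even for a 2012 simulation date.
print(current_symbol(1001))
```

Keying data on the integer SID rather than the ticker is what lets records from before and after a symbol change line up in one timeseries.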
To prevent lookahead bias, Quantopian data is stored in a point-in-time fashion. Each data point is stored with two special fields: an asof_date and a timestamp. The asof_date is typically provided by the data vendor and is used to inform Quantopian's simulation engines about where a data point should be slotted in a timeseries. The timestamp is created by Quantopian upon collecting the data from the vendor and is used to inform the pipeline simulation engine about when in the simulation that data point can be used.
The timestamp of each data point is used to control when pipeline uses that data point in a simulation. Each market has a 'cutoff time' set to 45 minutes before market open. The cutoff time is used to decide if a data point was known early enough to be used by pipeline that day. All data points with a timestamp prior to the cutoff time of day N can be used by a pipeline simulation on day N. Data points with a timestamp after the cutoff time on day N will not be used by a pipeline with a simulation date of N.
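The cutoff rule can be written down as a small check. This is a minimal sketch and makes some assumptions: datetimes are naive and already expressed in the market's local time, the 9:30am open is the US value, and the names cutoff_for and is_usable are hypothetical rather than part of any Quantopian API.

```python
from datetime import date, datetime, time, timedelta

MARKET_OPEN = time(9, 30)            # US market open, local (ET) time
CUTOFF_LEAD = timedelta(minutes=45)  # cutoff is 45 minutes before the open


def cutoff_for(sim_date, market_open=MARKET_OPEN):
    """Return the cutoff datetime for a simulation date."""
    return datetime.combine(sim_date, market_open) - CUTOFF_LEAD


def is_usable(timestamp, sim_date):
    """True if a data point with this timestamp is visible on `sim_date`."""
    return timestamp < cutoff_for(sim_date)


late_point = datetime(2019, 3, 7, 10, 0)  # collected at 10:00am ET on 03-07
print(cutoff_for(date(2019, 3, 7)))             # 2019-03-07 08:45:00
print(is_usable(late_point, date(2019, 3, 7)))  # False
print(is_usable(late_point, date(2019, 3, 8)))  # True
```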
Let's say Quantopian had the following data points for field X for company AAAA (trading in the US):
|asof_date||Value||timestamp|
|03-04-2019||2.56||03-04-2019 11:55pm (ET)|
|03-05-2019||1.73||03-05-2019 11:55pm (ET)|
|03-06-2019||-5.21||03-07-2019 10:00am (ET)|
|03-07-2019||0.53||03-07-2019 11:55pm (ET)|
And let's say a pipeline was defined with a 3-day simple moving average factor over field X. If the pipeline was executed on 03-07-2019 (e.g. run_pipeline(pipe, start_date='2019-03-07', end_date='2019-03-07')), the 3-day SMA computation would be performed on a timeseries for AAAA that looks like this: [2.56, 1.73, 1.73].
Why? The simulation was conducted on 03-07-2019 and company AAAA is trading in the US (markets open at 9:30am ET). Therefore, 03-07-2019 8:45am (ET) was used as the cutoff time. The third data point (value=-5.21) has a timestamp of 03-07-2019 10:00am (ET), which is after the cutoff time, so it was not yet accessible to pipeline. As a result, the value for 03-06-2019 had to be forward-filled from the previous known data point.
If the same pipeline is run on 03-08-2019 (e.g. run_pipeline(pipe, start_date='2019-03-08', end_date='2019-03-08')), the 3-day SMA computation would be performed on a timeseries for AAAA that looks like this: [1.73, -5.21, 0.53].
Why? This time, the second, third, and fourth data points were all known by the cutoff time (03-08-2019 8:45am (ET)). Even though the data point whose value is -5.21 came later than the 03-07 cutoff time, it was known by the 03-08 cutoff time. After determining which data points it can use, a pipeline uses the asof_date to slot each data point into the timeseries, which is why the -5.21 data point is properly slotted as the second newest data point in the timeseries.
By using both the asof_date and timestamp, the pipeline simulation engine is able to remove lookahead bias from its computations.
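The two windows in the AAAA example can be reproduced with a short sketch. It is a simplification, not the pipeline engine itself: it steps over calendar days instead of a trading calendar (harmless here because all dates fall within one trading week), treats times as naive ET datetimes, and uses a hypothetical helper named window.

```python
from datetime import date, datetime, time, timedelta

# (asof_date, value, timestamp) for field X of AAAA; times are ET.
ROWS = [
    (date(2019, 3, 4), 2.56, datetime(2019, 3, 4, 23, 55)),
    (date(2019, 3, 5), 1.73, datetime(2019, 3, 5, 23, 55)),
    (date(2019, 3, 6), -5.21, datetime(2019, 3, 7, 10, 0)),
    (date(2019, 3, 7), 0.53, datetime(2019, 3, 7, 23, 55)),
]


def window(sim_date, length=3):
    """Build the trailing window visible on `sim_date`."""
    cutoff = datetime.combine(sim_date, time(9, 30)) - timedelta(minutes=45)
    # Step 1: the timestamp decides which data points are known at all.
    known = {asof: value for asof, value, ts in ROWS if ts < cutoff}
    # Step 2: the asof_date decides where each known point is slotted.
    days = [sim_date - timedelta(days=n) for n in range(length, 0, -1)]
    series = []
    for day in days:
        # Forward-fill: use the latest known point dated on or before `day`.
        earlier = [asof for asof in known if asof <= day]
        series.append(known[max(earlier)] if earlier else None)
    return series


for sim_date in (date(2019, 3, 7), date(2019, 3, 8)):
    values = window(sim_date)
    print(sim_date, values, "SMA:", round(sum(values) / len(values), 4))
# 2019-03-07 -> [2.56, 1.73, 1.73]
# 2019-03-08 -> [1.73, -5.21, 0.53]
```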
For historical data which existed prior to Quantopian's integration, timestamps are approximated by adding a delay offset to historical dates provided by the vendor. Each dataset in the Data Reference documents its live collection start date and historical timestamp approximation method. Currently, the most complex timestamp approximation performed by Quantopian occurs in the FactSet Fundamentals dataset.
To learn more about how Quantopian stores and manages point-in-time data, check out the Three-Dimensional Time webinar.
Corporate Action Adjustments¶
This section describes how pricing data and other per-share data are adjusted for corporate actions like splits, mergers, and dividends. This concept is distinct from how corporate actions are applied to your portfolio holdings in a backtest.
When your pipeline or algorithm calls for historical data denominated in units per share (such as price per share), it is adjusted for splits, mergers, and dividends as of the current simulation date.
Adjustments depend on three pieces of information:
- The date of the data point.
- The date from which the data point is being considered (the current simulation date).
- Any events (splits, dividends, and mergers) that happened between those two dates.
For example, on June 9, 2014, AAPL had a 7:1 stock split event. If we held one share of AAPL before the split, we would have held 7 AAPL shares after the split. Let's walk through this case.
Let's say our simulation date is May 16, 2014. We want yesterday's close price for AAPL. The date of the price is May 15, and the date from which the price is being considered is May 16. Since no events occurred between May 15 and May 16, we can use the as-is close price ($588.82).
Now let's move forward in time. Our simulation date is July 2, 2014. We want yesterday's close price for AAPL. The date of the price is July 1, and the date from which the price is being considered is July 2. Since no events occurred between July 1 and July 2, we can use the as-is close price ($93.52).
But what if we wanted the close price from May 15, 2014 on this same July simulation date? (For example, if we wanted to run a trailing-window calculation with a two-month lookback period.) Using the as-is prices, we'd see a sudden 84% price drop (from $588.82 to $93.52) in the middle of our data window solely due to the 7:1 stock split -- even though there would be no change in the value of our portfolio if we held AAPL.
This will clearly result in misleading values for many graphs and trailing-window calculations (for example, the simple moving average over any window that includes June 9). Fortunately, Quantopian data is adjusted so you don't have to account for these sudden jumps.
To continue with the AAPL example: let's say our simulation date is July 2, 2014; we want the close price from May 15, 2014. Instead of showing a sudden 84% decrease in close price, Quantopian adjusts the pre-June 9 prices so that this sudden jump disappears. Since it's a 7:1 stock split, the May 15, 2014 price will be divided by 7 (adjusted price: $84.12).
In this example, any prices from before June 9 will be adjusted (divided by 7) for simulation dates after June 9, 2014. However, prices will not be adjusted for simulation dates before June 9; and prices from after June 9 will not be adjusted at all.
To summarize: On simulation dates after the split occurs, pricing data from before the split will be adjusted.
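The AAPL walkthrough can be expressed as a small function of the three inputs listed earlier: the date of the data point, the simulation date, and the events between them. This is a sketch that handles splits only. The function name adjusted_price is hypothetical, and the exact boundary treatment (a split applies when it takes effect after the price date and on or before the simulation date) is an assumption.

```python
from datetime import date

# Split events as (effective_date, ratio). The 7:1 AAPL split is from the text.
SPLITS = [(date(2014, 6, 9), 7.0)]


def adjusted_price(raw_price, price_date, sim_date, splits=SPLITS):
    """Adjust `raw_price` for splits between `price_date` and `sim_date`."""
    price = raw_price
    for effective, ratio in splits:
        # Assumed boundary rule: after the price date, on or before the sim date.
        if price_date < effective <= sim_date:
            price /= ratio
    return price


# May 15 close viewed from May 16: the split has not happened yet, so no change.
print(adjusted_price(588.82, date(2014, 5, 15), date(2014, 5, 16)))
# July 1 close viewed from July 2: the price is from after the split, so no change.
print(adjusted_price(93.52, date(2014, 7, 1), date(2014, 7, 2)))
# May 15 close viewed from July 2: divided by 7.
print(round(adjusted_price(588.82, date(2014, 5, 15), date(2014, 7, 2)), 2))
```

Because the simulation date is an input, the same May 15 data point yields different values in different simulations, which is what keeps the adjustment free of lookahead bias.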
Though our AAPL example dealt with a split event specifically, dividends and mergers are dealt with analogously.
If it seems like your data isn't being properly adjusted for a split, merger, or dividend, it's possible that we missed the event. Please reach out to firstname.lastname@example.org if you think this is the case.
Why aren't all prices over all time adjusted for splits/mergers/dividends, regardless of the simulation date? Adjusting for an event before it occurs introduces lookahead bias. While it's difficult to determine exactly how this bias would affect a strategy, it's best practice to avoid lookahead bias whenever possible.
Quantopian provides dozens of datasets including pricing, fundamental, and alternative data. Much of the data is available up to the present day, but some datasets have the last 1-2 years of data held out.
For global equity datasets and FactSet alternative datasets, the holdout only applies to you when researching the data. This means you can research an alpha factor on everything up to the holdout, submit your strategy to the contest, and have your contest entry get evaluated using the full dataset. Holding out recent data during the research phase makes it easier to avoid overfitting when building your strategy.
For alternative datasets in the Quantopian Store (labeled as "Premium"), the full dataset is available via subscription. Unlike the global equity datasets and FactSet alternative datasets, a "premium" dataset from the Quantopian Store can only be used in a contest submission if you are subscribed to it.
Quantopian is moving away from the subscription-based model for datasets. Quantopian is currently focused on integrating datasets that are free to use in the contest, like the FactSet datasets, which have a time-based holdout on historical data.