Beyond the datasets offered on Quantopian, you can also upload your own custom data via Self-Serve Data.
Self-Serve Data allows you to upload custom timeseries data and access it in Research and the IDE via the Pipeline API. Note that custom data is processed similarly to other datasets, meaning that it is captured in a point-in-time manner and asset mapped to SIDs.
Once on the platform, your datasets can only be accessed by you. Other community members won't be able to import your datasets, even if you share a notebook or algorithm that imports that dataset.
Your dataset will be downloaded and stored on Quantopian-maintained servers, where it is encrypted at rest. You can learn more about our security procedures here.
Self-Serve Data currently only supports uploading custom datasets that are mapped to US equities. Other markets are not supported at this time.
In order for a dataset to be uploaded via Self-Serve, it must meet certain requirements.
Your dataset should have a single row of data per asset per day, and it must be a comma-separated value (.csv) file.
The file needs a minimum of three columns: one primary date, one primary asset, and one or more value columns. The three column types are described in more detail below.
Column names must be unique and contain only alpha-numeric characters and underscores.
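The header rules above can be checked before uploading. The following is a minimal sketch, not part of Self-Serve itself; the function name check_header is hypothetical, and the pattern simply encodes "alpha-numeric characters and underscores".

```python
import re

# Letters, digits, and underscores only, as required for column names.
VALID_COLUMN = re.compile(r"[A-Za-z0-9_]+")

def check_header(columns):
    """Return a list of problems found in a CSV header row (empty if none)."""
    problems = []
    if len(columns) < 3:
        problems.append("need at least three columns (date, asset, one value)")
    if len(set(columns)) != len(columns):
        problems.append("column names must be unique")
    for name in columns:
        if not VALID_COLUMN.fullmatch(name):
            problems.append("invalid column name: %r" % name)
    return problems

print(check_header(["date", "symbol", "signal_1"]))   # no problems
print(check_header(["date", "symbol", "my signal"]))  # the space is rejected
```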
The primary date column stores the date to which each record's value(s) applies. This is generally the last complete trading day before you learned about the record. For instance, if you learn about a record at 8AM ET on 07/25/2019, the asof_date should be 07/24/2019 since that's when the most recent complete trading day occurred. The primary date is used to populate the asof_date in pipeline, which you may recognize from other pre-integrated datasets on Quantopian.
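As a rough illustration of choosing a primary date, the sketch below steps back to the previous weekday. It assumes the record was learned before that day's market close and that weekends are the only non-trading days; market holidays are not handled, so a real trading calendar should be used for production data. The function name previous_weekday is hypothetical.

```python
from datetime import date, timedelta

def previous_weekday(learned_on):
    """Most recent weekday strictly before `learned_on`.

    Assumption: weekends are the only non-trading days. Market holidays
    are NOT handled here.
    """
    day = learned_on - timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day

# Record learned at 8AM ET on Thursday 07/25/2019.
print(previous_weekday(date(2019, 7, 25)))
# Record learned on Monday 07/22/2019 steps back over the weekend.
print(previous_weekday(date(2019, 7, 22)))
```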
When you first upload historical data for a custom dataset, Quantopian will also generate a timestamp value for each record by adding a fixed delay to the primary date (default: one day delay). This is done to model the delay that is expected when data is uploaded to Quantopian.
To learn more about the difference between asof_date and timestamp values, see this section of the Data Reference.
- Dates can be formatted like any of the following examples:
- Primary date values must be date values. Datetime values are not supported at this time.
- Data before 2002 will be ignored, as other Quantopian datasets do not extend earlier than 2002.
- Blank or NaT primary date values are not supported. Uploading blank or NaT values in a primary date column will cause an error.
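The primary date rules above can be approximated with a small pre-upload check. This is only a sketch: the two date layouts are an assumption (they are the ones that appear in examples on this page), and the names parse_primary_date and is_in_range are hypothetical.

```python
from datetime import date, datetime

# Assumption: these two layouts stand in for the supported date formats.
ASSUMED_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")
EARLIEST = date(2002, 1, 1)  # data before 2002 is ignored

def parse_primary_date(text):
    """Parse a primary date string; raise ValueError for blank, NaT, or unknown input."""
    cleaned = text.strip()
    if cleaned == "" or cleaned == "NaT":
        raise ValueError("blank or NaT primary dates are not supported")
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    # Datetime strings such as "2019-07-25 01:23" also end up here.
    raise ValueError("unrecognized date format: %r" % text)

def is_in_range(primary_date):
    """True when the row would not be ignored for predating 2002."""
    return primary_date >= EARLIEST

print(parse_primary_date("07/25/2019"))
print(is_in_range(parse_primary_date("2001-12-31")))
```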
The primary asset column is used to map rows from your custom dataset to assets in the Quantopian database. Currently, assets must be identified with a ticker symbol in order for the record to be mapped to a US equity in Quantopian's database. More specifically, Quantopian will map each record by identifying the company that had the ticker symbol provided in the primary asset column on the primary date.
- If you have multiple share classes of an asset, those should be identified on the symbol with a delimiter (such as _ or /). Example asset formats: "BRK_A" or "BRK/A".
- Ticker symbols provided in the primary asset identifier should be historically accurate. Quantopian will attempt to map each record to an asset in the database by looking for the asset that traded with the given ticker symbol on the given primary date of that record. Example: if company A traded under symbol "AA" until 2017 and then changed its ticker to "AAA", you will need to provide "AA" in the primary asset field for records with a primary date value prior to 2017 and "AAA" for records with a primary date value after 2017 in order for all records to be properly mapped to company A.
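To make the "historically accurate ticker" requirement concrete, here is a minimal sketch that picks the ticker to write for a given primary date. The ticker history and the exact change date are hypothetical (the example above only says "2017"), as is the function name ticker_on.

```python
from datetime import date

# Hypothetical history for "company A": (first date in use, ticker).
# The 2017-01-01 change date is an assumption for illustration only.
TICKER_HISTORY = [
    (date(2002, 1, 1), "AA"),
    (date(2017, 1, 1), "AAA"),
]

def ticker_on(primary_date, history=TICKER_HISTORY):
    """Return the ticker in use on `primary_date` (history sorted by start date)."""
    current = None
    for start, ticker in history:
        if start <= primary_date:
            current = ticker
    if current is None:
        raise ValueError("no ticker known for %s" % primary_date)
    return current

print(ticker_on(date(2016, 6, 30)))  # before the change
print(ticker_on(date(2018, 1, 2)))   # after the change
```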
The remaining columns in a custom dataset should contain values that you want to access in pipeline. A custom dataset can have one or more columns of values, not counting the primary asset and primary date columns. Each column must contain data of the same type, but different columns can have different types. Self-Serve supports the following column types:
- Numeric: These will be converted to floats to maximize precision in future calculations.
- String: Textual values, often titles or categories. These will not be converted further. Any categorical values or labels, like sector codes, that might normally be represented by an integer should be added as a String column when using Self-Serve. This ensures that pipeline loads the column as a classifier later on.
- Date: You can pass columns of date values other than the primary date. Date type columns are not adjusted during Timezone adjustment. Date values can be supplied in any of the supported formats accepted by the primary date column.
- Datetime: Date with a timestamp (in UTC). These values will be adjusted when the incoming timezone is configured to a value other than UTC. Like dates, these will not be used as the primary date column, but can still be examined as data values. Example datetime formats: 07/25/2019 1:23 or 2019-01-01 20:23-05:00 (with timezone designator).
- Bool: True or False values. Example boolean formats:
When uploading your historical data to Quantopian, you will be required to declare each column as one of these types. We will then validate that all the historical data can be processed.
By default, null values will be interpreted as missing values, and -inf will be interpreted as an infinite value.
Below is a very simple example of a .csv file that can be uploaded to Quantopian. The first row of the file must be a header, and columns are separated by the comma character (,).
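As an illustration of the layout described above, the following sketch writes a small file with a header row, a date column, an asset column, and two value columns. Every column name, ticker, and value here is hypothetical.

```python
import csv
import io

# Hypothetical rows: column names, tickers, and values are made up.
rows = [
    {"date": "2019-07-24", "symbol": "AAPL", "signal": 0.25, "category": "tech"},
    {"date": "2019-07-24", "symbol": "BRK_A", "signal": -1.5, "category": "financial"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["date", "symbol", "signal", "category"],
    lineterminator="\n",
)
writer.writeheader()    # the first row must be the header
writer.writerows(rows)  # one row per asset per day
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file instead of an in-memory buffer only requires replacing the StringIO object with an open file handle.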
Once your dataset has been formatted properly, you can upload it to Quantopian in one or two steps. The first step is an upload of historical data, and the (optional) second step is to set up the continuous upload of live data.
Historical Data Upload
Uploading a dataset to Quantopian begins with uploading historical data. Below are the steps required to upload historical data via Self-Serve:
- Navigate to your Custom Dataset page and click Add Dataset in the top right corner. This will pop up a modal with a prompt to name your dataset.
- Name your dataset. Note that this name will be used as the namespace for importing your dataset. The name must start with a letter and must contain only lowercase alpha-numeric and underscore characters. The name can be at most 63 characters long.
- Select or drag a file to upload your .csv file containing historical data.
- Once a file has been successfully loaded in the preview pane, select your Primary Date column and your Primary Asset column, then map all of the value columns to specific types.
- Configure historical lag. Historical timestamps are generated by adding the historical lag to the primary date. The default historical lag is one day, which will prevent end-of-day date type values from appearing in pipeline a day early. For example, 2018-03-02 with a one hour lag would create a historical timestamp of 2018-03-02 01:00:00, which would incorrectly appear in pipeline on 2018-03-02. The default one day historical lag adjusts them to appear correctly in pipeline on 2018-03-03. If you need a reminder about how the primary date and historical lag work, see the Primary Date section earlier on this page.
- Configure timezone. By default, datetime values are expected in UTC or should include a timezone designator. If another timezone is selected (ex: US/Eastern), the incoming datetime type columns will be converted to UTC when a timezone designator is not provided.
Selecting Next will complete the upload and start the data validation process.
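The historical lag arithmetic from the steps above can be reproduced in a few lines. This is a sketch of the described behavior (primary date plus lag), not Quantopian's implementation; the function name historical_timestamp is hypothetical.

```python
from datetime import datetime, timedelta

def historical_timestamp(primary_date, lag=timedelta(days=1)):
    """Primary date (taken as midnight) plus the configured historical lag."""
    return primary_date + lag

primary = datetime(2018, 3, 2)
print(historical_timestamp(primary, timedelta(hours=1)))  # 2018-03-02 01:00:00
print(historical_timestamp(primary))                      # 2018-03-03 00:00:00
```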
Important Date Considerations
Before uploading historical data via Self-Serve, it is important to consider a few things when formatting and populating your .csv file:
- Avoid lookahead data: The primary date column should not include dates in the future; rows with future primary dates will be ignored. A similar requirement exists for live data: the primary date must not be greater than the current date. Primary dates in the future will cause issues later on in pipeline. If you need to include future dates in a custom dataset, you should store those dates in value columns, not in the primary date column.
- Trade Date Signals: If your date field represents the day you expect an algorithm to act on the signal, you should create a trade_date_minus_one column that can be used as the primary date column, since the minimum historical lag option is 1 day.
- Be careful when determining the timezone of a new datasource with datetime values that don't have a timezone designator. For example, GOOGLEFINANCE in Google Sheets returns dates like YYYY-MM-DD 16:00:00, which are end-of-day values for the market in East Coast time. Selecting US/Eastern will properly convert the datetime to YYYY-MM-DD 20:00:00 during daylight saving time and YYYY-MM-DD 21:00:00 otherwise.
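The Eastern-to-UTC conversion described in the last point can be checked locally. The sketch below uses Python's standard zoneinfo module (Python 3.9+, which relies on a timezone database being available on the system) and the America/New_York zone as a stand-in for US/Eastern; the function name eastern_to_utc is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs a tz database on the system

def eastern_to_utc(naive):
    """Interpret a naive datetime as US Eastern wall time and convert it to UTC."""
    localized = naive.replace(tzinfo=ZoneInfo("America/New_York"))
    return localized.astimezone(timezone.utc)

summer = eastern_to_utc(datetime(2019, 7, 25, 16, 0))  # daylight saving time
winter = eastern_to_utc(datetime(2019, 1, 15, 16, 0))  # standard time
print(summer)  # 20:00 UTC
print(winter)  # 21:00 UTC
```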
After uploading your data, you can inspect the progress of your upload in Research by loading your dataset's metrics.
Live Data Upload
Once your historical data has been successfully validated, the Set Up Live Data tab will appear. This step is optional; you can select "No Live Data" and then click Submit if you don't need to configure live data loading.
If you want to configure live data loading, you can configure your Connection Type settings to download a new file daily. In addition to the standard FTP option, live data can be downloaded from CSV files hosted on Google Sheets, Dropbox, or any API service that supports token-based urls (vs authentication).
Each trading day, between 7:00am and 10:00am UTC, this live file will be downloaded and all new records (or updates to old records) will be saved with a timestamp reflecting when Quantopian downloaded the data. The timestamp will be used by pipeline so that all live data is surfaced in a point-in-time fashion during simulations.
Column names must be identical between the historical and live files. The live data download will use the same column types as configured during the historical upload.
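Because the live file must use the same column names as the historical file, it can help to compare the two headers before configuring the live connection. This is a minimal sketch with hypothetical helper names; it compares names only and does not assume anything about whether column order matters.

```python
import csv
import io

def header_of(csv_text):
    """Return the header row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))

def column_mismatch(historical_csv, live_csv):
    """Return (missing_from_live, unexpected_in_live) as sorted lists."""
    historical = set(header_of(historical_csv))
    live = set(header_of(live_csv))
    return sorted(historical - live), sorted(live - historical)

print(column_mismatch("date,symbol,signal\n", "date,symbol,signal\n"))
print(column_mismatch("date,symbol,signal\n", "date,symbol,Signal\n"))
```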
Uploading from Google Sheets
Google Sheets has powerful QUERY functions that can be used to programmatically download API data, rename columns, and filter data by columns/dates/etc. (example: QUERY(Sample!A1:L,"select A,K,L,D,E,B,G,H WHERE A >= date '2018-03-01'")). If you're using Google Sheets to process your data, you can import a CSV from your public Google Sheets spreadsheet. To get the public URL for your file:
- Click on File > Publish to the web.
- Change the 'web page' option to 'comma-separated values (.csv)'.
- Click the Publish button.
- Copy and paste the URL that has a format similar to
Uploading from Dropbox
To use Dropbox, place your file in the Public folder and use the 'Public URL'. The Public URL has a format similar to
Finalizing Your Live Dataset
Once you've configured your live data loading, click Submit. Your dataset is now in the historical dataset ingestion queue. The typical delay from dataset submission to full availability is approximately 15 minutes. You can check the status of your upload in Research. The dataset can be imported before all the data is fully available.
Checking Upload Status
You can check the status of a Self-Serve custom data upload using load_metrics in Research. load_metrics allows you to check the status of both historical uploads and daily live uploads to all of your custom datasets. You can use load_metrics to check the status of all your custom dataset uploads like this:
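import pandas as pd
from odo import odo
# Replace user_<UserId> with your user ID.
from quantopian.interactive.data.user_<UserId> import load_metrics
# Convert the selected load_metrics columns into a pandas DataFrame.
lm = odo(load_metrics[['timestamp','dataset','status','rows_received','total_rows']], pd.DataFrame)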
In the above example, lm is a DataFrame containing the following columns:
filenames_downloaded: name(s) of the files that were downloaded by Quantopian.
rows_received: Total number of raw rows downloaded from historical or live endpoint (FTP, Google Sheets, Dropbox).
rows_added: Number of new records added to base dataset table (after symbol mapping and deduping by asset/per day).
total_rows: Total number of rows that represent originally uploaded data (excludes update records).
delta_rows_added: Number of new records added that represent an update to a previously uploaded record. A record is considered to be an update if a previous record has been updated with the same primary date and primary asset.
total_delta_rows: Total number of records that represent an update to a previously uploaded record after the load completed.
timestamp: Start time of the download.
time_elapsed: Number of seconds it took to process the data.
last_updated: For live loads, last_updated represents the maximum timestamp of records that represent originally reported data (excludes update records). For historical loads, last_updated represents the time that the historical load completed.
source_last_updated: Last modified timestamp on the source FTP file (live loads only).
status: Status of the load. Can be any of [running, empty (no data to process), failed, completed].
error: Error message for failed runs.
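As a sketch of how these columns might be used, the snippet below finds the most recent load for each dataset and reports any that failed. It runs on plain dictionaries so it works outside Research; the rows, dataset names, and error text are hypothetical, and only the column names come from the list above.

```python
# Hypothetical rows shaped like the load_metrics columns described above.
loads = [
    {"dataset": "my_signal", "timestamp": "2019-07-24 07:00:00",
     "status": "completed", "error": None},
    {"dataset": "my_signal", "timestamp": "2019-07-25 07:00:00",
     "status": "failed", "error": "example error text"},
    {"dataset": "other_data", "timestamp": "2019-07-25 08:00:00",
     "status": "completed", "error": None},
]

def latest_status(rows):
    """Map each dataset name to the row of its most recent load."""
    latest = {}
    for row in rows:
        seen = latest.get(row["dataset"])
        # ISO-style timestamp strings sort chronologically.
        if seen is None or row["timestamp"] > seen["timestamp"]:
            latest[row["dataset"]] = row
    return latest

for name, row in sorted(latest_status(loads).items()):
    if row["status"] == "failed":
        print(name, "failed:", row["error"])
```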
Accessing Your Custom Data
Once you have uploaded a custom dataset to Quantopian using Self-Serve, you can navigate to an autogenerated documentation page for your custom dataset by clicking on the dataset name on the Self-Serve Data page. This page is only viewable to you and it includes the first load date and code samples for using the dataset in pipeline to help get you started.
Like other pipeline datasets, a custom dataset is imported as a pipeline DataSet, with a BoundColumn attribute for each value column that you specified in your upload. Importantly, custom datasets have no special properties in pipeline, meaning you can use them like you would any pre-integrated dataset.
For custom datasets, the dtype of each BoundColumn is determined by the type that you selected when you first uploaded the custom dataset. For example, if you chose Numeric as a type for a column, it will be loaded as a BoundColumn with dtype float64 in pipeline. A column uploaded with type String will be loaded as a BoundColumn with dtype object, and so on.
Currently, community members are limited to 30 custom datasets. Each dataset has the following limits:
- Maximum of 20 columns.
- Maximum file size of 300MB.
- Maximum dataset name of 56 characters.
Additionally, live upload datasets will have their files downloaded and processed by Quantopian every day from 7:00am to 10:00am UTC (every hour on the hour). Currently, there is no way to customize the upload time of a custom dataset outside of this range.
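The limits above, together with the dataset naming rule from the upload steps, can be checked before uploading. The sketch below is hypothetical in several respects: the function name check_limits is made up, "300MB" is read conservatively as 300,000,000 bytes, and because this page quotes both 63 and 56 characters for the name length, the stricter value is used. Whether the 20-column limit counts the primary date and asset columns is not stated, so the count passed in is left to the caller.

```python
import re

MAX_COLUMNS = 20
MAX_FILE_BYTES = 300 * 1000 * 1000  # assumption: conservative reading of "300MB"
MAX_NAME_LENGTH = 56                # stricter of the two limits quoted on this page
# Starts with a letter; lowercase letters, digits, and underscores only.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")

def check_limits(name, column_count, file_bytes):
    """Return a list of limit violations (empty if none)."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("invalid dataset name")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("dataset name too long")
    if column_count > MAX_COLUMNS:
        problems.append("too many columns")
    if file_bytes > MAX_FILE_BYTES:
        problems.append("file too large")
    return problems

print(check_limits("my_signal", 5, 1024))
print(check_limits("My-Signal", 25, 1024))
```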
You'll need to keep the following considerations and limitations in mind while using Self-Serve to upload custom datasets:
Reserved column names. The column names "timestamp" and "sid" will be added by Quantopian during data processing, so you cannot use these column names in your source data. If you have a source column named "symbol", it must be set as the Primary Asset.
Lookahead data. The primary date column should not include future dated historical values; any rows with primary dates in the future will automatically be ignored. A similar requirement exists for live data (the asof_date must not be greater than the timestamp).
Trade date signals. If your date field represents the day you expect an algorithm to act on the signal, you should create a trade_date_minus_one column that can be used as the primary date column.
Reloading notebooks. If you are loading a new dataset into a notebook, you will need to restart the notebook kernel to be able to access the new import.
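Two of the considerations above (reserved column names and lookahead rows) are easy to check locally before uploading. This is a sketch with hypothetical function names; it matches reserved names exactly as written, since the page does not say whether the check is case-sensitive.

```python
from datetime import date

RESERVED = {"timestamp", "sid"}  # added by Quantopian during processing

def check_columns(columns, primary_asset):
    """Return problems with reserved names or a misplaced 'symbol' column."""
    problems = []
    for name in columns:
        if name in RESERVED:
            problems.append("reserved column name: %s" % name)
    if "symbol" in columns and primary_asset != "symbol":
        problems.append("a column named 'symbol' must be the Primary Asset")
    return problems

def drop_future_rows(rows, today):
    """Keep only rows whose primary date is not after `today`."""
    return [row for row in rows if row["date"] <= today]

print(check_columns(["date", "symbol", "timestamp"], primary_asset="symbol"))
rows = [{"date": date(2019, 7, 24)}, {"date": date(2030, 1, 1)}]
print(len(drop_future_rows(rows, today=date(2019, 7, 25))))
```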
Tips & Tricks
Below are tips and tricks that you can use to get around certain limitations of Self-Serve Data when uploading a custom dataset.
Uploading Macroeconomic Data
Currently, uploading macroeconomic data like GDP, VIX, or any other information that does not map directly to equities is not directly supported in Self-Serve. However, you can work around this limitation by mapping a macroeconomic timeseries to an established equity that you expect to keep trading for a long time, like SPY. Mapping your data to an equity like SPY will allow you to load it in pipeline like other custom datasets.
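The workaround above amounts to adding a constant asset column to the macro series. The following sketch shows one way to build such a file; the function name macro_to_csv, the column names, and the sample values are all hypothetical.

```python
import csv
import io

def macro_to_csv(series, value_name, proxy_symbol="SPY"):
    """Turn (date, value) pairs into CSV text with a constant asset column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "symbol", value_name])  # header row
    for day, value in series:
        writer.writerow([day, proxy_symbol, value])
    return buffer.getvalue()

# Hypothetical macro series with made-up values.
vix_like = [("2019-07-23", 12.61), ("2019-07-24", 12.07)]
macro_csv = macro_to_csv(vix_like, "vix_close")
print(macro_csv)
```

In pipeline, the value would then be read from the row mapped to the proxy equity rather than from each asset in the universe.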