Update: Custom Dataset Performance Improvements

Hey Everyone,

Today, we wanted to update you on some improvements we've made to the performance and scalability of our custom dataset integration. If you are not yet familiar with Self-Serve Data and you would like to upload your own time-series data to Quantopian for use with Pipeline, check out the documentation here.

Since the introduction of Self-Serve Data almost two years ago, thousands and thousands of custom datasets have been added to the Quantopian platform. But with increased usage came the growing pains of an overloaded system. How many of you added a self-serve dataset to your analysis, only to have your pipelines slow down to "coffee break" speed?

Faster, Bigger, and More

  1. We've greatly improved the pipeline runtime performance of self-serve datasets, with speedups of up to 10-20x on large datasets.
  2. We've increased the maximum file size from 300MB to 500MB (and from 500MB to 6GB for Enterprise clients).
  3. We've increased the maximum number of Self-Serve datasets from 30 to 50.

Auto-migration

To minimize the impact on existing analyses and pipelines, we have migrated all of your (non-error) datasets from the old system to the new one. You won't have to change any of your existing self-serve imports to use the improved functionality.

Note: old datasets will remain available for comparison for about a month under an old namespace while we work through migration options for anyone with active live datasets:

from quantopian.pipeline.data.old.<User|Org ID> import <DataSet>  

Improvements

By leveraging the same technology stack that our pre-integrated datasets use, we've been able to improve the coverage of the self-serve symbology mapping step and our NaN data handling. Now, when a user explicitly provides a NaN data value, we process it as a valid data point. In the past, we forward-filled previous data over all NaN values, including user-supplied ones.
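
For illustration, here's a minimal sketch of how the new behavior shows up in a pipeline. The namespace (user_12345678), dataset (MyDataset), and column (value) are placeholders for your own upload, not real names.

# Hypothetical example: MyDataset has a `value` column where some rows were
# uploaded with an explicit NaN.
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.user_12345678 import MyDataset  # placeholder namespace

pipe = Pipeline(columns={'value': MyDataset.value.latest})
result = run_pipeline(pipe, '2019-01-02', '2019-06-28')

# Days covered by an explicitly uploaded NaN now show up as NaN in `result`;
# previously the last non-NaN value would have been forward-filled over them.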

Note: you will need to re-import a dataset to trigger the new symbology mapping step. Migrated datasets carry over the exact symbology data from the original system.

Self-serve data validation has also been improved by adding explicit date, datetime and boolean type formats.
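
As a rough illustration (the column names and values below are made up, not a required schema), a file exercising these formats could be built like this:

# Illustrative only: a small file whose columns use the date, datetime, and
# boolean formats that self-serve validation now recognizes explicitly.
import pandas as pd

sample = pd.DataFrame({
    'asof_date': ['2019-01-02', '2019-01-03'],                        # date
    'collected_at': ['2019-01-02 16:30:00', '2019-01-03 16:30:00'],   # datetime
    'symbol': ['AAPL', 'MSFT'],                                       # primary asset
    'is_active': [True, False],                                       # boolean
    'value': [1.5, 2.25],
})
sample.to_csv('my_dataset.csv', index=False)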

Two new research APIs to analyze Self-Serve data

Note: you can use Shift-Tab in Research to see more details about each function's signature and docstring.

query_self_serve_data

Now you can query the underlying raw, symbol-mapped data for a new self-serve dataset. The following example returns all columns of the given DATASET for the US domain, from the starting asof_date (None) through 2002-10-01.

from quantopian.research.experimental import query_self_serve_data  
query_self_serve_data(DATASET.columns, 'US', None, '2002-10-01')

You can also pass a subset of columns:

query_self_serve_data([DATASET.asof_date, DATASET.sid], 'US')

query_self_serve_failed_symbology

Returns a list of primary asset identifiers that could not be successfully mapped to assets in the Quantopian system, along with the minimum and maximum asof_date for each identifier.

from quantopian.research.experimental import query_self_serve_failed_symbology

query_self_serve_failed_symbology(DATASET)[['identifier', 'min_asof', 'max_asof']]

If you have any questions or issues, please contact us at [email protected].

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

5 responses

This is so awesome. Excited to load some datasets. Thanks, Chris!

really exciting news!

Thanks Chris - I particularly like the second feature as some symbols change over time - will check it out.

Today we've released a couple of additional updates that should help you with your custom dataset uploads.

  • We try to set the Column Type and Format for each column based on inspection of the data in the first ~100 rows

Note: if we cannot find an exact match to known column formats (e.g. dates with 2-digit years, or Assets), we will default to the String type. You'll still need to identify the Primary Date and set the Primary Asset, and you may want to scan the remaining string columns just to verify their accuracy.

  • We will auto-expand the requirements section when you receive a validation error

If you have questions or need help, clicking on the (?) icon will expand the requirements for that specific section.

Today we've released some additional metadata to the custom dataset dashboard:

The new Last Updated column is helpful for monitoring the last time a dataset was successfully updated (mostly interesting for live updating datasets).

If you mouse over the Last Updated value, you will see the following details:

  1. Input Rows: the total number of rows to be processed (may include multiple files for live updating datasets)
  2. Failed symbology rows: the total number of rows that could not be successfully mapped to assets in the Quantopian system (see query_self_serve_failed_symbology above)
  3. Duplicate primary key rows: the total number of rows that were skipped because they contained the same primary key values (currently primary asset and date). Note: this check is performed after symbol mapping, so a raw file that appears to have no duplicate records can still result in skipped rows if multiple input symbology records map to the same asset (see the sketch after this list).
    For live updates, this will also include the number of rows that are identical (including data columns) across subsequent files; only new or updated values need to be included in later files. Please contact us via [email protected] if you have questions regarding how to optimize your live data.
  4. Output rows: the total number of output records in your dataset.
  5. Domains: the number of domains supported by your dataset. You can use query_self_serve_data to get the raw data for each domain.
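
For intuition (the symbols and sid below are made up), here's a rough sketch of how a file with no apparent duplicates can still produce duplicate primary keys once symbols are mapped to assets:

# Illustrative only: two raw symbols that map to the same Quantopian asset
# yield rows with the same (asset, date) primary key after symbology mapping,
# so all but one of those rows would be skipped as duplicates.
import pandas as pd

raw = pd.DataFrame({
    'symbol': ['ABC', 'ABC.A'],            # hypothetical share classes of one company
    'date':   ['2019-03-15', '2019-03-15'],
    'value':  [1.0, 2.0],
})

# Pretend both symbols mapped to the same sid during symbology mapping.
mapped = raw.assign(sid=100001)

dupes = mapped.duplicated(subset=['sid', 'date'], keep='first')
print(mapped[dupes])   # rows counted as "Duplicate primary key rows"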