Research API Improvements
Today we're announcing two quality-of-life improvements to the Quantopian Research API: simpler APIs for loading price/volume data, and new options for tuning memory usage of complex pipelines. We're also announcing a change to quantopian.research.experimental.history to bring it closer in line with the data.history method in the backtester.
Simpler APIs for Prices, Volumes, and Returns
We've added five new functions to the quantopian.research module that make it easier to perform common tasks involving price and volume data. The new functions have the following names and signatures:
prices(assets, start, end, frequency='daily', price_field='price', start_offset=0)
log_prices(assets, start, end, frequency='daily', price_field='price', start_offset=0)
returns(assets, start, end, periods=1, frequency='daily', price_field='price')
log_returns(assets, start, end, periods=1, frequency='daily', price_field='price')
volumes(assets, start, end, frequency='daily', start_offset=0)
In general, each of these functions fetches data for one or more assets over a specified period, returning a Series if one asset was provided, and a DataFrame if multiple assets were provided. You can find complete documentation for all of these functions (along with the rest of the Research API) in the updated Research API Documentation.
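To make the return-type convention concrete, here's a toy sketch of the one-asset-Series / many-assets-DataFrame behavior. This is not the real implementation: PRICE_TABLE is a hypothetical in-memory table standing in for Quantopian's data store, which the real functions query.

```python
import pandas as pd

# Hypothetical in-memory price table standing in for Quantopian's data store.
PRICE_TABLE = pd.DataFrame(
    {"AAPL": [100.0, 101.0, 102.0], "MSFT": [50.0, 51.0, 52.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)

def prices(assets, start, end):
    """Sketch of the return-type convention: Series for one asset,
    DataFrame for several."""
    window = PRICE_TABLE.loc[start:end]
    if isinstance(assets, str):
        return window[assets]      # single asset -> Series
    return window[list(assets)]    # multiple assets -> DataFrame
```

The convention mirrors plain pandas column selection: indexing with one label yields a Series, indexing with a list of labels yields a DataFrame.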
The new API functions offer two benefits over the existing get_pricing function: better ergonomics, and start offsetting.
One of the goals for the Research API is that it should be fast and simple to fetch whatever data you want to examine. The more time and energy you spend figuring out how to load the data you need, the less you have for doing your actual analysis.
The get_pricing function has been the primary interface for ad-hoc price and volume queries. Having a single-function interface to this data makes the API easier to remember (it's just the one function!), but it means that get_pricing needs a large number of parameters to support all the different ways that users need to fetch data.
We can't expect users to remember the name and order of all 8 arguments to get_pricing (I can't remember them, and I designed the function!), so get_pricing provides default values for almost all of its arguments. Having defaults for everything means that users only have to pass the parameters that they actually care about.
The problem with providing defaults for everything, however, is that it actually makes it harder to perform common tasks when doing so requires passing non-default values. I wrote the following description in the git commit message for this change in Quantopian's internal repository:
9 times out of 10 when I'm fetching pricing data I want close prices for one or more assets over some time period. The current API of get_pricing forces me to pass a bunch of additional parameters, and it has so many parameters that I can never remember the order, so I have to pass them all by keyword.
For example, to fetch recent daily close prices for AAPL with get_pricing, I would write something like:
data = get_pricing('AAPL', start_date='2016-01-02', end_date='2017-01-02', fields='price')
The above code is functional and clear, but it's a mouthful to type, and I had to pull up get_pricing's docstring to remember the names of the parameters and the correct string for the fields argument.
With the new API, I can get the same output with the following:
data = prices('AAPL', '2016-01-02', '2017-01-02')
One challenge that arises fairly regularly on Quantopian is what I call the "leading offset problem". This problem is exemplified as follows: I want to calculate rolling 5-day returns for AAPL for every trading day in January 2017. In order to compute what I want, I need to load daily close prices for the month of January and for the last four trading days of December 2016.
In general, whenever we want to compute a rolling N-period reduction from t[A] to t[B], we need to fetch data from t[A - N] to t[B].
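That arithmetic can be seen with plain pandas (a toy illustration, not the Research API itself): an N-period percent change is undefined for its first N rows, which is exactly the leading buffer described above.

```python
import pandas as pd

# Ten periods of toy prices; suppose the window of interest is the last five.
price_series = pd.Series([100.0, 101, 103, 102, 105, 104, 106, 108, 107, 110])

N = 5
rets = price_series.pct_change(periods=N, fill_method=None)

# An N-period return at time t needs a price at t - N, so the first N
# rows come back NaN: a window starting at t[A] needs data back to t[A - N].
assert rets.iloc[:N].isna().all()
assert rets.iloc[N:].notna().all()
```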
Correctly calculating the start date for a get_pricing call when we need a leading buffer of data is tricky even in simple cases (the right way to do this is usually to use the trading calendar arithmetic functions exported by the zipline.utils.calendars module), but it gets downright hairy when you want to support different asset classes and different data frequencies.
As of this update, all the Research API functions that fetch raw price/volume data now take an optional start_offset parameter that can be used to add a leading buffer of N periods to a query. For example, to load prices for January 2017 with an extra 5 days at the start, we can write:
prices('AAPL', '2017', '2017-02', start_offset=5)
start_offset is used internally by the new returns and log_returns functions. We could fetch our 5-day rolling returns directly with a call like:
returns('AAPL', '2017', '2017-02', periods=5)
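To illustrate how a leading offset feeds a rolling computation, here's a toy pandas sketch. prices_sketch is a hypothetical stand-in for the real API (which queries Quantopian's data store), and the integer periods stand in for trading days:

```python
import pandas as pd

# Toy price history indexed by integer period, standing in for real data.
full_history = pd.Series([100.0, 102, 101, 105, 107, 110, 108, 112],
                         index=pd.RangeIndex(8))

def prices_sketch(series, start, end, start_offset=0):
    """Hypothetical stand-in for the new API: extend the query window
    backwards by `start_offset` periods before slicing."""
    return series.loc[start - start_offset:end]

# 3-period returns for periods 3..7 need prices back to period 0,
# so a start_offset equal to the return horizon makes every row defined.
window = prices_sketch(full_history, start=3, end=7, start_offset=3)
rolling_returns = window.pct_change(periods=3, fill_method=None).loc[3:]
```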
chunksize Parameter for run_pipeline
Users running complicated Pipelines sometimes find that they run into memory limits when running their pipelines over very long periods of time. In another post I've written at length about where and why Pipeline uses memory. At the end of that post, I gave the following advice:
The best way for you to reduce your high-water memory usage is to chunk up your run_pipeline calls into smaller increments and then concatenate them together (e.g. with pandas.concat). For example, if you want to run a Pipeline with lots of graph terms over a 5-year period, you might break that up into five 1-year run_pipeline calls, or even ten 6-month calls. This reduces memory usage in two important ways:
- Running over a shorter window straightforwardly translates to allocating fewer rows in the input/output buffers of each Pipeline term.
- More subtly, running over a shorter window reduces the number of assets that are active during that window, which reduces the number of columns in the input/output buffers allocated to your Pipeline terms.
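As a concrete sketch of that manual pattern: run_in_chunks is illustrative (not a Quantopian API), and in a real notebook the `run` argument would be a closure over run_pipeline.

```python
import pandas as pd

def run_in_chunks(run, start, end, freq="180D"):
    """Sketch of the chunk-and-concatenate pattern: `run` stands in for
    run_pipeline and takes a (start, end) pair of timestamps."""
    edges = list(pd.date_range(start, end, freq=freq))
    if edges[-1] != pd.Timestamp(end):
        edges.append(pd.Timestamp(end))  # make sure the final chunk reaches `end`
    pieces = [run(chunk_start, chunk_end)
              for chunk_start, chunk_end in zip(edges[:-1], edges[1:])]
    return pd.concat(pieces, ignore_index=True)
```

Note that adjacent chunks share a boundary day here; a real implementation would shift each chunk's start forward by one trading day, or deduplicate rows after concatenating.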
The Pipeline API in research can now transparently perform this chunking for you. If a chunksize parameter is passed to run_pipeline, the range from start_date to end_date will automatically be broken up into blocks of chunksize days, and each block will be run separately. The default behavior is still to run a single chunk.
In general, pipelines run with large chunksizes will consume more memory but run faster, and pipelines run with small chunksizes will consume less memory but run more slowly. A good rule of thumb is that you probably want a chunksize of at least 126 (half a trading year for equities), but smaller chunksizes might be desirable for extremely complex pipelines.
The split between the Algorithm API and the Research API has been a historical source of confusion for many Quantopian users. Users often need to solve similar problems in research and in algorithms, but the two environments are different enough that a solution to a problem in an algorithm can end up looking very different to a solution to a similar problem in research. We've gotten better over time at building APIs that transfer cleanly between algorithms and research (the Pipeline and Optimize APIs, for example, were explicitly designed with this goal in mind), and we've also started revising older APIs to bring the two environments into closer alignment where possible.
When we released support for futures in the Research API, we added a new experimental history function. history() is essentially a wrapper around the older get_pricing function, renamed to reflect the fact that it serves the same essential purpose as the data.history() method in the Algorithm API. One important difference between the Research API history and the Algorithm API history, however, was that the research version provided defaults for many parameters that the algorithm version did not. This made it easier to call the research history interactively, but made the correspondence between the functions less obvious and made it harder to port code using history between the environments. With the addition of the new convenience methods for fetching price/volume/returns, we feel that making the two history functions as consistent as possible is more important than making the research version easier to type, so we've updated the research version to align more closely with the algorithm version.
The signatures of the two history functions are now:
# Algorithm API version
def history(assets, fields, bar_count, frequency):

# Research API version
def history(assets, fields, start, end, frequency, start_offset=0):
With this change, it's clearer that the only difference between the two history functions is in how they talk about date ranges: Algorithm API history fetches data for a trailing window that ends at the current algorithm time; Research API history fetches data for a period between a fixed start and end, optionally with an additional offset from the start.
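The difference in how the two versions specify date ranges can be sketched with a toy minute timeline. This is illustrative only: the timestamps and bar_count are made up, and the real functions query Quantopian's data rather than an in-memory index.

```python
import pandas as pd

# Toy minute-bar timeline standing in for one trading day's clock.
timeline = pd.date_range("2017-01-03 09:31", periods=390, freq="min")

# Algorithm-style: a trailing window of bar_count bars ending "now".
bar_count = 30
algo_window = timeline[-bar_count:]

# Research-style: the same window expressed as an explicit start/end pair.
start, end = algo_window[0], algo_window[-1]
research_window = timeline[(timeline >= start) & (timeline <= end)]

# The two specifications select identical bars.
assert (algo_window == research_window).all()
```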