Pandas Recipes

This page contains code snippets that focus on working with pandas, a popular Python library that is used throughout the Quantopian API. Most Quantopian API functions that output data or results do so in the form of a pandas data structure like a Series or a DataFrame. The recipes below illustrate some of the techniques you can use to manipulate these data structures.

Working With Pipeline Outputs in Research

The following code snippets provide tips for working with MultiIndex DataFrames, which are the output type of pipeline runs in research.

Looking at results for one asset in research

This code snippet illustrates how to index into a particular level of a pandas MultiIndex DataFrame. On Quantopian, this is most often needed when trying to look at the results of a pipeline run in research for one equity or a subset of equities. The snippet below defines a pipeline with one column, latest market cap, and then plots the results for just AAPL. Note the use of xs, which we can use to index into the DataFrame with a particular key at a particular level. For pipeline outputs in research, level 0 of the MultiIndex represents the simulation date and level 1 represents the equity. Run in Research.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.factset import Fundamentals

from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'mcap': Fundamentals.mkt_val.latest,
    }
)

df = run_pipeline(pipe, '2015-01-01', '2016-01-01')

# Select data that corresponds to AAPL and plot it.
df.xs(symbols('AAPL'), level=1).plot()
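
The same indexing pattern can be tried outside of Quantopian. The sketch below is only an illustration: it builds a small stand-in for a pipeline output by hand (the dates, string tickers, and mcap values are made-up placeholders, not real data), selects one asset with xs, and then selects a subset of assets with a boolean mask on level 1.

```python
import pandas as pd

# Hand-built stand-in for a pipeline output: level 0 is the date,
# level 1 is the asset. All values are made-up placeholders.
dates = pd.to_datetime(["2015-01-02", "2015-01-05"])
assets = ["AAPL", "MSFT", "SPY"]
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])
df = pd.DataFrame({"mcap": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

# One asset: xs drops level 1, leaving a DataFrame indexed by date only.
aapl = df.xs("AAPL", level=1)

# A subset of assets: keep the MultiIndex and filter on the level 1 values.
wanted = ["AAPL", "MSFT"]
subset = df[df.index.get_level_values(1).isin(wanted)]

print(aapl)
print(subset)
```

Because xs removes the level it selects on, the result for a single asset is indexed by date alone, which is why it can be plotted directly as a time series.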

Looking at results for a subset of dates in a pipeline output

This code snippet illustrates how to look at a subset of a MultiIndex DataFrame. On Quantopian, this is most often needed when trying to look at the results of a pipeline run in research for a subset of dates. This is useful if you don't want to re-run the pipeline over a different date range. The snippet below defines a pipeline with one column, latest close price, and runs the pipeline for a year. It then gets the subset of the pipeline output between Feb. 3, 2015 and Mar. 2, 2015 by using DataFrame.loc.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing

from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'price': EquityPricing.close.latest,
    }
)

df = run_pipeline(pipe, '2015-01-01', '2016-01-01')

# Select all rows with dates between 2015-02-03 and 2015-03-02.
df.loc['2015-02-03':'2015-03-02', :]

Note

When using loc to get a subset of dates from the pipeline output, make sure that both the start date and the end date of the range you query exist in the pipeline output index. Otherwise, loc raises an error of the form TypeError: Level type mismatch: <date>
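
If the endpoints of the range might not be present in the index, one alternative is to filter with a boolean mask on the level 0 values instead of slicing with loc. This is a minimal sketch on a hand-built stand-in for a pipeline output (the dates and prices are made-up placeholders). It uses timezone-naive dates; if the level 0 index is timezone-aware, the start and end Timestamps would need the same timezone.

```python
import pandas as pd

# Hand-built stand-in for a pipeline output with made-up prices.
dates = pd.to_datetime(["2015-02-02", "2015-02-04", "2015-03-02", "2015-03-03"])
index = pd.MultiIndex.from_product([dates, ["AAPL", "MSFT"]],
                                   names=["date", "asset"])
df = pd.DataFrame({"price": [float(i) for i in range(8)]}, index=index)

# Neither endpoint has to exist in the index: 2015-02-03 is not a row here.
start = pd.Timestamp("2015-02-03")
end = pd.Timestamp("2015-03-02")

# Compare the level 0 values against the endpoints and keep matching rows.
level0 = df.index.get_level_values(0)
subset = df[(level0 >= start) & (level0 <= end)]

print(subset)
```

The mask compares values rather than looking up labels, so it does not depend on the start or end date being a row in the output.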

Count the number of assets returned by pipeline each day

This code snippet plots the number of US equities trading on a supported exchange every day. Note the use of groupby(level=0) to group our MultiIndex DataFrame by the level 0 index (dates) so that we can count and plot the number of equities each day.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing
from quantopian.pipeline.domain import US_EQUITIES

from quantopian.research import run_pipeline

# Define an empty pipeline of US equities.
pipe = Pipeline(domain=US_EQUITIES)
df = run_pipeline(pipe, '2015-01-01', '2016-01-01')

# Group our pipeline result by our level 0 index (date) and plot the number
# of rows each day.
df.groupby(level=0).size().plot(title='US Equities Trading on a Supported Exchange');
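
To see what groupby(level=0).size() returns before plotting, the following sketch applies it to a small hand-built frame. The index and the placeholder column are made up for illustration; the first date has two assets and the second has three.

```python
import pandas as pd

# Hand-built stand-in for a pipeline output: two assets on the first date,
# three on the second. The 'flag' column is a made-up placeholder.
d1 = pd.Timestamp("2015-01-02")
d2 = pd.Timestamp("2015-01-05")
index = pd.MultiIndex.from_tuples(
    [(d1, "AAPL"), (d1, "MSFT"), (d2, "AAPL"), (d2, "MSFT"), (d2, "SPY")],
    names=["date", "asset"],
)
df = pd.DataFrame({"flag": [True] * 5}, index=index)

# Group rows by date (level 0) and count the rows in each group.
counts = df.groupby(level=0).size()

print(counts)
```

The result is a Series indexed by date, with one row count per day, which is the shape that plot() draws as a single line.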

Plotting results with DataFrame.plot

This code snippet gets the 1-day, 1-week, and 1-month trailing returns every day between 2014 and 2018 for all US equities. It then plots the results for AAPL using the pandas.DataFrame.plot() function. Note that DataFrame.plot() is a convenience function that uses matplotlib/pylab under the hood to do the actual plotting. Also note that the semicolon at the end of the aapl_data.plot line is not required, but it keeps a gibberish-looking line (the text representation of the returned plot object) from appearing at the start of the output, so it's recommended.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns

from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'returns_1d': Returns(window_length=2),
        'returns_1w': Returns(window_length=6),
        'returns_1m': Returns(window_length=22),
    },
)

data = run_pipeline(pipe, '2014-01-01', '2018-01-01')

# Get the results specific to AAPL.
aapl_data = data.xs(symbols('AAPL'), level=1)

# Plot the data for AAPL and set the title of the plot.
aapl_data.plot(title='AAPL Returns (2014-2018)');
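
The window lengths in the pipeline above are each one larger than the number of trading days the return spans, since a return over n days needs n + 1 prices. The sketch below is a rough pandas analogue, under the assumption (not stated on this page) that Returns computes the percent change between the first and last price in its window. The trailing_return helper and the prices are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up prices that grow 10% per row, so the expected returns are easy to check.
prices = pd.Series(
    [100.0, 110.0, 121.0, 133.1],
    index=pd.date_range("2014-01-02", periods=4, freq="D"),
)

def trailing_return(prices, window_length):
    """Hypothetical helper: percent change between the first and last
    price of a trailing window that contains window_length prices."""
    periods = window_length - 1
    return prices / prices.shift(periods) - 1.0

# A window of 2 prices spans 1 period; a window of 3 prices spans 2 periods.
returns_1p = trailing_return(prices, window_length=2)
returns_2p = trailing_return(prices, window_length=3)

print(returns_1p)
print(returns_2p)
```

The first window_length - 1 rows are NaN because there are not yet enough prices to fill the window.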

Using Pandas in the IDE

The following code snippets provide tips for using pandas in the Algorithm IDE.

Compute rolling transformations on minute level data

Rolling transform calculations such as moving averages (mavg) and standard deviations (stddev) can be computed with methods provided by the pandas library. This example uses data.history() to get the last 5 daily prices for each equity and then computes their mean (essentially a 5-day moving average) with the DataFrame.mean() function from pandas. The same pattern applies to minute-level data when the history is requested with frequency="1m". Run in the IDE.

def initialize(context):
    # AAPL, MSFT, and SPY
    context.equities = [sid(24), sid(5061), sid(8554)]

def before_trading_start(context, data):
    price_history = data.history(
        context.equities,
        fields="price",
        bar_count=5,
        frequency="1d",
    )
    log.info(price_history.mean())

Note

The output type of data.history() can change depending on the dimensions of the inputs. In the above example, we provided a list of equities and only a single field, so the output type was a pandas.DataFrame. If you change from passing a list of equities to just a single equity or if you change from querying a single field to multiple fields, the output type will change and calling the DataFrame class methods will trigger an error.
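
One way to guard against a changing output type is to normalize the result before applying DataFrame methods. The sketch below does not call data.history(); it uses hand-built pandas objects as stand-ins for a single-equity result (a Series) and a multi-equity result (a DataFrame), with made-up prices. The as_frame helper is hypothetical and not part of the Quantopian API.

```python
import pandas as pd

def as_frame(result, name="value"):
    """Hypothetical helper: return a DataFrame whether given a Series
    or a DataFrame, so downstream code can rely on one type."""
    if isinstance(result, pd.Series):
        column = result.name if result.name is not None else name
        return result.to_frame(name=column)
    return result

dates = pd.date_range("2015-01-05", periods=5, freq="D")

# Stand-in for a single-equity, single-field result (a Series).
single_asset = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0],
                         index=dates, name="AAPL")

# Stand-in for a multi-equity, single-field result (a DataFrame).
many_assets = pd.DataFrame(
    {"AAPL": [100.0, 101.0, 102.0, 103.0, 104.0],
     "MSFT": [40.0, 42.0, 44.0, 46.0, 48.0]},
    index=dates,
)

# The same rolling-mean code now works on either input.
series_roll = as_frame(single_asset).rolling(window=3).mean()
frame_roll = as_frame(many_assets).rolling(window=3).mean()

print(series_roll)
print(frame_roll)
```

Both results are DataFrames with one column per equity, so later steps such as selecting a column by name behave the same way in either case.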