Pipeline

Many algorithms depend on calculations that follow a specific pattern:

Every day, for some set of data sources, fetch the last N days’ worth of data for a large number of assets and apply a reduction function to produce a single value per asset.

This kind of calculation is called a cross-sectional trailing-window computation: "cross-sectional" because one value is computed for each asset, and "trailing-window" because data is fetched over a trailing window. (Alpha factors are cross-sectional trailing-window computations.)

A simple example of a cross-sectional trailing-window computation is “close-to-close daily returns”, which has the form:

Every day, fetch the last two days of close prices for all assets. For each asset, calculate the percent change between the asset’s previous close price and its current close price.

The purpose of the Pipeline API is to make it easy to define and execute cross-sectional trailing-window computations.

Basic Usage

Working with pipeline is generally done in two parts: defining a Pipeline object and running that pipeline over some period of time. Defining a pipeline is like defining a mathematical expression with variables, such as \(f = 2*x + y - 3*z\), except that the variables are domain-specific: \(f = 2*(\text{close_price}) + (\text{earnings_yield}) - 3*(\text{sentiment_score})\). Running a pipeline is the equivalent of plugging numbers into those variables and evaluating the result (but running a pipeline usually involves plugging in millions of values!).

Defining a pipeline is done in 3 steps:

  1. Importing data.
  2. Defining computations.
  3. Instantiating a pipeline.

While each step is explored in detail further in the docs, it is often easiest to start by walking through an example.

In the example below, pipeline is used to describe a computation of 10-day and 30-day simple moving averages of daily close prices for all US equities trading on a supported exchange. The output is then filtered down to just equities with a 10-day average close price above $5.

# Import pipeline built-ins.
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import SimpleMovingAverage

# Import datasets.
from quantopian.pipeline.data import EquityPricing


# Define factors.
sma_10 = SimpleMovingAverage(inputs=[EquityPricing.close], window_length=10)
sma_30 = SimpleMovingAverage(inputs=[EquityPricing.close], window_length=30)

# Define a filter.
prices_over_5 = (sma_10 > 5)

# Instantiate pipeline with two columns corresponding to our two factors, and a
# screen that filters the result down to assets where sma_10 > $5.
pipe = Pipeline(
    columns={
        'sma_10': sma_10,
        'sma_30': sma_30,
    },
    screen=prices_over_5,
)

The above example constructs a Pipeline representing the definition of the computations described earlier, not the results of those computations.

Note

Under the hood, the Pipeline object describes a directed acyclic graph (DAG) of data inputs and transformations that correspond to the computation definitions provided to the pipeline constructor. Pipeline knows how to structure the graph to maximize the efficiency and speed of performing the computations that it describes.

In order for the computations in a pipeline definition to be executed, the Pipeline object needs to be run.

Running a pipeline in Research requires explicit start and end dates to be passed. The example pipeline above can be run in Research from 01/01/2017 to 01/01/2018 with the following code:

from quantopian.research import run_pipeline

# Pipeline definition goes here.

my_pipeline_result = run_pipeline(pipe, '2017-01-01', '2018-01-01')

In the IDE, pipelines are attached to algorithms and automatically executed for each day of a backtest. The same example pipeline from above can be "attached" to an algorithm like this:

from quantopian.algorithm import attach_pipeline

# Pipeline definition goes here.

def initialize(context):
    # The name passed to attach_pipeline identifies this pipeline within
    # the algorithm; 'my_pipeline' here is an arbitrary choice.
    attach_pipeline(pipe, 'my_pipeline')
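
The output of an attached pipeline is typically retrieved later in the algorithm via pipeline_output(), using the same name that was passed to attach_pipeline(). A minimal sketch (the name 'my_pipeline' matches the attach_pipeline call above):

from quantopian.algorithm import pipeline_output

def before_trading_start(context, data):
    # Retrieve the current day's pipeline results as a pandas DataFrame
    # indexed by asset.
    context.pipeline_data = pipeline_output('my_pipeline')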

This separation between defining and running a pipeline allows the Pipeline API to be the same in both Research and the IDE. Having a common API for defining a pipeline in both environments means you can research and analyze alpha factors in the Research environment and then copy your pipeline code over to the IDE for backtesting.

Generally speaking, developing a pipeline in Research is more interactive and much easier to debug than developing one in the IDE. As such, it is best to build and refine a pipeline in Research until it works as expected, then copy it into an algorithm in the IDE.

Defining a Pipeline

As mentioned above, defining a pipeline is done in three steps:

  1. Importing data.
  2. Defining computations (where the vast majority of your time will be spent).
  3. Instantiating a pipeline.

The sections below explore each of these steps in more detail.

Importing Data

DataSets

Before defining a pipeline, you will need to import any data that you want to use. In pipeline, datasets are imported as DataSet objects. Pipeline DataSets are collections of objects that tell the Pipeline API where and how to find the inputs to computations. Importantly, a DataSet does not hold actual data. Since these objects generally correspond to database columns, the attributes of a DataSet are referred to as "columns".

DataSets can be imported using the usual Python import syntax; for example, the following code imports the EquityPricing DataSet.

from quantopian.pipeline.data import EquityPricing

The full list of importable DataSets can be found in the Data Reference.

BoundColumns

After importing the DataSet (or DataSets) you want to use, the next step is to reference the field(s) that you want to use from that DataSet. In pipeline, each "field" of data is represented as a BoundColumn. A BoundColumn is a column of data that is concretely bound to a DataSet. Instances of BoundColumn are created dynamically upon access to attributes of DataSets. Inputs to pipeline computations are most commonly of type BoundColumn, so it is important to understand how to access a BoundColumn in pipeline. The code snippet below imports the EquityPricing dataset and accesses one of its attributes to get a reference to a BoundColumn.

# Import the EquityPricing DataSet.
from quantopian.pipeline.data import EquityPricing

# Access the EquityPricing close attribute to instantiate a
# BoundColumn. Note that a BoundColumn DOES NOT store data; it
# instead tells the pipeline engine where to retrieve the data
# when performing computations. Printing daily_close will not
# display daily close prices.
daily_close = EquityPricing.close

dtypes

Each BoundColumn on a DataSet has a particular np.dtype (short for 'data type'). Valid BoundColumn dtypes include float64, int64, bool, datetime64, and object (representing string values). Since a BoundColumn does not actually contain data, it has a specified dtype so that pipeline knows the type of data that will populate the field when the pipeline is run. The dtype is vital to pipeline because it dictates what computations can be applied to the field. For example, you can take the sum of two float-type fields, but you cannot sum a float-type field and an object-type (string) field.

The dtype of a BoundColumn can also determine the type of a computation. In the case of the Latest computation, the dtype determines whether the computation is a Factor, a Filter, or a Classifier.
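
For example, using columns from datasets that appear elsewhere on this page (a sketch):

from quantopian.pipeline.data import EquityPricing, Fundamentals

# close is a float64 column, so .latest produces a Factor.
latest_close = EquityPricing.close.latest

# exchange_id is an object (string) column, so .latest produces a Classifier.
latest_exchange = Fundamentals.exchange_id.latest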

DataSetFamily

Some datasets are accessible as a DataSetFamily instead of as a DataSet. A DataSetFamily is like a collection of DataSets, where each member dataset has the same columns. Each member of a family is identified by a tuple of named attributes, called its coordinates. To select a member from a DataSetFamily, you call the family's slice() method, passing the coordinates of the desired member.

For example, geographic revenue exposure data is accessible via the GeoRev DataSetFamily. To select the DataSet containing data for estimated revenue exposure to the Western European Union, you would write:

# Import the GeoRev DataSetFamily.
from quantopian.pipeline.data.factset import GeoRev

# GeoRev is a DataSetFamily, gr_weu is a DataSet.
gr_weu = GeoRev.slice(region='WESTERN EUROPEAN UNION')

# You can also pass slice coordinates positionally.
gr_weu = GeoRev.slice('WESTERN EUROPEAN UNION')

# Once we have a DataSet, you can access a bound column like
# you would with any other DataSet (by accessing an attribute).
# est_exposure_weu is a BoundColumn.
est_exposure_weu = gr_weu.est_pct

Some data on Quantopian is accessible as a DataSet while other data is accessible as a DataSetFamily. Generally speaking, DataSetFamily integrations are reserved for data that is structured in such a way that certain variables (like a categorical label) need to be fixed in order to select a logical daily timeseries of data. To learn more about what this means and some of the background behind the design of the DataSetFamily object, see the explanation in this forum post.

Note

Each data integration on Quantopian has its own page in the Data Reference. To determine whether a particular integration is available as a DataSetFamily or a DataSet, navigate to the Pipeline Datasets and Columns section of the reference page for the relevant data source.

Custom Data

See also

Self Serve Data

In addition to built-in data, you can upload your own data to Quantopian using the Self Serve Data tool. Custom data uploaded via Self Serve is usable in pipeline. To learn more about using custom data in pipeline, see the Self Serve Data section of the documentation.

Defining Computations

Once you've imported the data that you want to use, you will need to define the computations that your pipeline should perform each day.

These transformations are referred to as pipeline terms. Pipeline terms are implemented via Factors, Filters, and Classifiers.

Factors

A Factor is a function from an asset and a moment in time to a number:

\[f(\text{asset}, \text{timestamp}) \rightarrow \text{numerical value}\]

An instance of the Factor class must be defined before it can be used in a pipeline. A set of built-in factors is available out of the box. If you want to perform a computation that doesn't exist as a built-in, you can define your own custom factor.

Every Factor stores four pieces of state:

  1. inputs: A list of BoundColumn objects and/or other pipeline terms (factor, filter, or classifier) describing the inputs to the factor.
  2. window_length : An integer describing how many rows of historical data the Factor needs to be provided each day.
  3. dtype: A np.dtype object representing the type of values computed by the Factor. Most factors are of dtype float64, indicating that they produce numerical values represented as 64-bit floats. Factors can also be of dtype datetime64[ns]. Factors default to dtype float64.
  4. A compute function that operates on the data described by inputs and window_length. When a factor is computed for a day on which there are \(N\) assets in the Quantopian database, the underlying pipeline engine provides that factor's compute function a two-dimensional array of shape \((\text{window_length} \times N)\) for each input in inputs. The job of the compute function is to produce a one-dimensional array of length \(N\) as an output.

The dtype and compute pieces of state should be provided when defining the Factor class, since they are "inherent" to the Factor. However, inputs and window_length can be provided either as defaults on the Factor class or when the factor is instantiated in your pipeline (since a factor might be applied over many different inputs or window lengths).

For example, consider the built-in SimpleMovingAverage factor:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage

mean_close_10 = SimpleMovingAverage(
    inputs=[EquityPricing.close],
    window_length=10,
)

In this example, the dtype is float64, and the compute function is the simple moving average (both of which are defined in the built-in SimpleMovingAverage object). In the code snippet above, the inputs argument is defined as the close column of the EquityPricing DataSet, and the window_length as 10 days.

Built-In Factors

Built-in factors are available for many common operations (like simple moving averages). They can be imported from the quantopian.pipeline.factors module.

For example, the built-in Returns factor computes close to close returns over a specified window length.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns

# The default inputs argument for Returns is EquityPricing.close.
# A window_length of 11 provides 11 rows of close prices, spanning
# 10 trading days (roughly two weeks) of returns.
returns_2w = Returns(
    window_length=11,
)

For a complete list of built-in factors available on Quantopian, see the API Reference.

CustomFactors

For operations not available as built-in factors, you can build your own CustomFactor. As outlined above, you'll need to provide at least the compute function when defining your CustomFactor.

For example, consider this standard deviation CustomFactor:

import numpy
from quantopian.pipeline import CustomFactor

class StdDev(CustomFactor):
    def compute(self, today, asset_ids, out, values):
        # Calculates the column-wise standard deviation, ignoring NaNs.
        out[:] = numpy.nanstd(values, axis=0)

Here, the compute function is defined using numpy.nanstd() and the dtype defaults to float64 (this is the default of any custom factor unless you choose to override it).

The inputs and window_length can be defined upon instantiation of the CustomFactor, but it's also possible to define one or both as defaults within the CustomFactor class:

import numpy
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import EquityPricing

class StdDev(CustomFactor):
    inputs = [EquityPricing.close]
    window_length = 5

    def compute(self, today, asset_ids, out, values):
        # Calculates the column-wise standard deviation, ignoring NaNs.
        out[:] = numpy.nanstd(values, axis=0)

The CustomFactor object can then be instantiated as follows:

std_dev_5 = StdDev()

Many other functions beyond np.nanstd can be used in compute; the compute function can be any function that translates a series of values to a numerical value.

In this example, dtype defaulted to float64. However, it might be necessary to set dtype to datetime64 if you expect the output of your factor to be a datetime (this is the only time you should override the default dtype). For example:

import numpy as np
from quantopian.pipeline import CustomFactor

class MyDateFactor(CustomFactor):
    dtype = np.dtype('datetime64[ns]')

    def compute(self, today, assets, out, inputs):
        ...

In most cases, CustomFactors are used to perform more complex operations on fields. If you need to combine fields using basic operations (addition, multiplication, etc.), see Combining Factors.

Note

Instances of built-in factors and custom factors are both instances of the pipeline Factor class. Both store all four pieces of state (inputs, window_length, dtype, compute) and both have access to the same set of factor methods.

Combining Factors

Factors can be combined, both with other factors and with scalar values, via any of the basic mathematical operators (+, -, *, etc). This makes it easy to write complex expressions that combine multiple factors. For example, constructing a factor that computes the average of two other factors is simply:

f1 = SomeFactor(...)
f2 = SomeOtherFactor(...)
average = (f1 + f2) / 2.0

It is generally preferred to combine factors by using the basic mathematical operators as opposed to defining CustomFactors to achieve the same result. Using the mathematical operators is generally simpler to read in the code.
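
For instance, the moving average factors defined earlier on this page can be combined directly (a sketch assuming sma_10 and sma_30 are in scope):

# Both of these expressions are new Factors.
sma_spread = sma_10 - sma_30
sma_ratio = sma_10 / sma_30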

Note

Any factors can be combined using basic mathematical operators, regardless of whether they are built-in or custom factors.

Using Factor Methods

Each instance of the Factor class has several methods that perform transformations commonly applied to a timeseries of numerical values. Some of the more popular Factor methods include zscore(), percentile_between(), and winsorize(). The full set of available factor methods is listed under Factor in the API Reference.

Some factor methods support transformations that result in a new factor (e.g. zscore() and winsorize()), while others support a transformation that returns a Filter (e.g. percentile_between()). Checking the return type of each transformation before using it is important so you know how to use the output properly in your pipeline.
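
As a sketch (assuming the sma_10 factor defined earlier), winsorize() and zscore() return new Factors, while percentile_between() returns a Filter:

# Clip extreme values, then standardize; both results are Factors.
sma_10_winsorized = sma_10.winsorize(min_percentile=0.05, max_percentile=0.95)
sma_10_zscore = sma_10_winsorized.zscore()

# percentile_between returns a Filter (True for assets in the top decile).
sma_10_top_decile = sma_10.percentile_between(90, 100)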

Slicing Factors

Note

In this section, we refer to slicing a factor, which is a different operation than slicing a DataSetFamily (referenced earlier on this page).

In certain situations, you might want to use the output from a factor for one asset as the input to another. Using a technique called "slicing", it is possible to extract the values of a Factor for a single asset. For example, you might want to regress a particular factor against the returns of SPY (an ETF that tracks the S&P500 index). Slices are created by indexing into a factor by asset; this action creates an object of the Slice class. These Slice objects are then used as an input to a CustomFactor.

Note

Only slices of certain factors can be used as inputs. These factors include Returns and any factors created from rank() or zscore(). The reason for this is that these factors produce normalized values, so they are safe for use as inputs to other factors.

When a Slice object is used as an input to a custom factor, it always returns an \(N \times 1\) column vector of values, where \(N\) is the window length. For example, a slice of a Returns factor would output a column vector of the \(N\) previous returns values for a given security.

Each day, a slice only computes a value for the single asset with which it is associated, whereas ordinary factors compute a value for every asset. As such, slices cannot be added as a column to a pipeline.
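
Below is a minimal sketch of using a slice as a CustomFactor input. It assumes spy is an Equity object for SPY (for example, obtained with symbols('SPY') in Research) and computes each asset's correlation with SPY returns:

import numpy as np

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.factors import Returns

returns = Returns(window_length=2)

# Indexing into a factor by asset creates a Slice object.
# Note: `spy` is assumed to be an Equity object and is not defined here.
spy_returns = returns[spy]

class CorrelationWithSPY(CustomFactor):
    # spy_returns arrives in compute as a (window_length x 1) column vector;
    # all_returns is a (window_length x N) array with one column per asset.
    inputs = [returns, spy_returns]
    window_length = 30

    def compute(self, today, assets, out, all_returns, spy_returns):
        for i in range(len(assets)):
            out[i] = np.corrcoef(all_returns[:, i], spy_returns[:, 0])[0, 1]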

See also

Cookbook recipe: slicing a factor.

Filters

Like a factor, a Filter is a transformation of input data. The difference between filters and factors is that filters are functions that produce boolean-valued outputs, whereas factors produce numerical or datetime-valued outputs:

\[f(\text{asset}, \text{timestamp}) \rightarrow \text{boolean}\]

In general, filters are used for narrowing down the set of assets included in a computation or in the final output of a pipeline.

There are two common ways to create a Filter: comparison operators and built-in Factor/Classifier methods.

Comparison Operators

Just like you can filter pandas DataFrames with comparison operators, you can filter pipelines with comparison operators (>, <, ==, etc.). For example, the following code would create a filter, close_price_filter, that returns True for all equities with close prices over $20 on a particular day:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing

# Define a factor representing the most recent close price (yesterday's close).
last_close_price = EquityPricing.close.latest

# Define a filter that returns True each time last_close_price returns a value
# greater than 20.
close_price_filter = (last_close_price > 20)

Factor/Classifier Methods

Various methods of the Factor and Classifier classes return a Filter. For example, the top() method produces a Filter that returns True for the top \(N\) securities of a given factor each day. The following example produces a filter that returns True for 200 assets every day, indicating that those assets were in the top 200 by last close price across all known assets:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing

last_close_price = EquityPricing.close.latest

# This is a filter.
top_close_price_filter = last_close_price.top(200)

The percentile_between() method is another example of a Factor method that produces a Filter. For a full list of Factor methods that return Filters, see the methods of Factor.

You can also create Filters from Classifiers (described further down on this page) using the eq() method, which acts as an equality comparison. For example, the following code would create a Filter, nyse_filter, that returns True for all stocks traded on the NYSE on a particular day:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals

# Since the underlying data of Fundamentals.exchange_id
# is of dtype 'object', .latest returns a Classifier
exchange = Fundamentals.exchange_id.latest

# The Classifier method `eq` returns a filter that outputs True
# each time our classifier outputs 'NYS'.
nyse_filter = exchange.eq('NYS')

Classifier methods like isnull() and startswith() also produce Filters. For a full list of Classifier methods that return filters, see the methods of Classifier.

Built-In Filters

There are several built-in Filters that filter assets based on liquidity, SID, and more.

One notable built-in Filter is the Quantopian Tradable Universe, which screens out illiquid stocks. The Quantopian Tradable Universe is the recommended tradable universe to use when researching strategies on Quantopian. You can access the Quantopian Tradable Universe filter as QTradableStocksUS(). For example,

from quantopian.pipeline.filters import QTradableStocksUS

base_universe = QTradableStocksUS()

In this example, base_universe would be a Filter that you could add to a Pipeline in order to narrow your trading universe to the Quantopian Tradable Universe.
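
For example, a sketch of passing the filter as a pipeline screen (passing it as a mask to other terms works the same way):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing
from quantopian.pipeline.filters import QTradableStocksUS

base_universe = QTradableStocksUS()

# Only assets in the Quantopian Tradable Universe appear in the output.
universe_pipe = Pipeline(
    columns={'close': EquityPricing.close.latest},
    screen=base_universe,
)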

Note

In order to enter the contest or be eligible for a capital allocation, your algorithm must trade within the Quantopian Tradable Universe.

For a full list of built-in Filters, see the API Reference.

Combining Filters

Like factors, filters can be combined. Combining filters is done using the & (and) and | (or) operators. For example, the following code will screen for securities that are in the top 10% of average dollar volume and have a latest close price above $20:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing
from quantopian.pipeline.factors import AverageDollarVolume

dollar_volume = AverageDollarVolume(window_length=63)
high_dollar_volume = dollar_volume.percentile_between(90, 100)

latest_close = EquityPricing.close.latest
above_20 = latest_close > 20

is_tradable = high_dollar_volume & above_20

This filter will evaluate to True for securities where both high_dollar_volume and above_20 are True. Otherwise, it will evaluate to False. A similar computation can be made with the | (or) operator.
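
For instance, swapping in the | operator produces a filter that evaluates to True when either condition holds:

# True for securities that are either in the top dollar volume decile
# or have a latest close price above $20 (or both).
high_volume_or_above_20 = high_dollar_volume | above_20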

Note

You must use & and | to combine filters. The keywords and and or are not supported when combining pipeline filters.

Masking

Sometimes, it is better to ignore certain assets when computing pipeline expressions. There are two common cases where ignoring assets is useful:

  1. An expression is computationally expensive, and results are only relevant for certain assets. A common example of such an expensive expression is a factor computing the coefficients of a regression (linear_regression()).
  2. An expression performs comparisons between assets, but comparisons should only be performed among a subset of all assets. For example, using the top() method to compute the top 200 assets by earnings yield, ignoring assets that don't meet some liquidity constraint.

To support these two use cases, all factors and many factor methods accept an optional mask argument, which must be a Filter indicating which assets to consider when computing.

For example, let's define a pipeline that computes the top 200 assets ranked by market cap, but let's restrict that computation to only consider assets that are in the top 50% of assets ranked by average dollar volume. To do this, begin by creating a high_dollar_volume filter. This filter can then be supplied to the mask argument of top.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.factset import Fundamentals
from quantopian.pipeline.factors import AverageDollarVolume

dollar_volume = AverageDollarVolume(window_length=63)
high_dollar_volume = dollar_volume.percentile_between(50, 100)

mcap = Fundamentals.mkt_val.latest

mcap_top_200 = mcap.top(200, mask=high_dollar_volume)

Applying the mask to mcap.top restricts the top() method to only return the top 200 assets within the ~4000 assets passing the high_dollar_volume filter, as opposed to considering all ~8000 assets without a mask. Since mcap_top_200 is itself a filter, you could then pass it as a mask to another computation if you wanted.

CustomFilters

See also

CustomFactors

For boolean-output operations that cannot be expressed using comparison operators or factor/classifier methods, you can build your own CustomFilter. Defining a CustomFilter is just like defining a CustomFactor, except that the dtype is bool and the compute function must produce a boolean output for each asset. Note that the need to define a CustomFilter is very uncommon. Whenever possible, you should define a filter using comparison operators or built-in methods to keep your code simple.
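
Below is a minimal sketch of a CustomFilter. The example is contrived (this particular condition could just as easily be written with a comparison operator) and is only meant to show the structure:

from quantopian.pipeline import CustomFilter
from quantopian.pipeline.data import EquityPricing

class CloseAboveOpen(CustomFilter):
    inputs = [EquityPricing.open, EquityPricing.close]
    window_length = 1

    def compute(self, today, asset_ids, out, opens, closes):
        # Output one boolean per asset: True when the most recent close
        # was above the open.
        out[:] = closes[0] > opens[0]

close_above_open = CloseAboveOpen()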

Classifiers

A Classifier is a function from an asset and a moment in time to a categorical output such as a string or integer label:

\[f(\text{asset}, \text{timestamp}) \rightarrow \text{category}\]

Note

Classifiers and filters are similar in that they both return non-numeric outputs. However, filters specifically return booleans. Additionally, filters are almost always used to filter data, while Classifiers are most often used to group data.

Classifiers are most commonly created by accessing the .latest attribute on a BoundColumn of dtype int64 or object (string). An example of a classifier producing a string output is the exchange of a security. To create this classifier, begin by importing the EquityMetadata dataset. Then, use the latest attribute to instantiate a classifier returning the latest exchange on which each asset trades:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityMetadata

# Since the underlying data of EquityMetadata.listing_exchange
# is of type object (string), .latest returns a classifier.
exchange = EquityMetadata.listing_exchange.latest

Another way to define a pipeline classifier is with a factor method. Factor methods like quantiles() result in a Classifier. For a full list of Factor methods that result in Classifiers, see the API Reference.
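
For example, a sketch of building a Classifier from a Factor method (assuming the last_close_price factor defined above):

# quantiles(10) labels each asset with an integer decile (0-9) based on
# its latest close price; the result is a Classifier.
close_price_deciles = last_close_price.quantiles(10)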

Note

If the underlying data of a BoundColumn is numeric, latest returns a Factor. If it is string-type or integer labels, latest returns a Classifier.

Note

At this time, CustomClassifiers are not available.

Grouping with Classifiers

An important application of classifiers in pipeline is grouping. For example, you might want to compute earnings yield across all known assets and then normalize the result by dividing each asset's earnings yield by the mean earnings yield for that asset's sector or industry.

In the same way that the optional mask parameter allows you to modify the behavior of demean() to ignore certain values, the groupby parameter allows you to specify that normalizations should be performed separately within groups of each row, rather than over the entire row at once.

In the same way that you pass a Filter to define which values to ignore, you can pass a Classifier to define how to partition the rows of the Factor being normalized. Note that this only applies to methods that take a groupby argument. See the API Reference to see which methods accept a groupby argument.
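
Below is a minimal sketch of group-wise normalization. The column and classifier names are illustrative assumptions (see the Data Reference for the fields actually available in each dataset):

from quantopian.pipeline.data import Fundamentals

# Hypothetical field choices for illustration.
earnings_yield = Fundamentals.earning_yield.latest            # a Factor
sector = Fundamentals.morningstar_sector_code.latest          # a Classifier

# Demean earnings yield within each sector, rather than across the entire
# cross-section of assets.
ey_vs_sector = earnings_yield.demean(groupby=sector)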

Instantiating a Pipeline

Once you've defined your computations using Factors, Filters, and Classifiers, you will need to instantiate your pipeline. Begin by importing the Pipeline class:

from quantopian.pipeline import Pipeline

After importing Pipeline, instantiate your pipeline. The Pipeline constructor accepts three optional arguments:

  1. columns: A dictionary; keys are column names, values are pipeline terms (factor, filter, or classifier).
  2. domain: A Domain that specifies the set of assets and a corresponding trading calendar over which the expressions of a pipeline should be computed.
  3. screen: A Filter that gets applied as a post-processing screen on the pipeline output.

Each of the three Pipeline constructor arguments is described below.

Columns

The columns argument tells the pipeline engine which pipeline terms should be included in the output pd.DataFrame when the pipeline is executed. When you run a pipeline, the engine figures out the most efficient way to compute the requested columns, along with any intermediate terms on which those columns depend (for example, pipeline might need to compute a filter before applying that filter as a mask on a factor).

The columns argument must be a dictionary where the keys are column names (string) and the values are pipeline terms (factor, filter, or classifier).

Domain

The domain argument informs the pipeline engine about the set of inputs that should be processed. Concretely, the domain of a pipeline controls two things:
  1. The calendar to which the pipeline's input rows are aligned.
  2. The set of assets to which the pipeline's input columns are aligned.

The domain argument must be a Domain object. The set of domains supported on Quantopian can be found in the Data Reference. Currently, each domain on Quantopian corresponds to a country's stock market. However, it is possible that domains corresponding to other sets of inputs might be added to Quantopian in the future.

Currently, all supported domains are importable from quantopian.pipeline.domain. Each country's domain is named using the country's two-letter code followed by _EQUITIES (for example, US_EQUITIES or JP_EQUITIES). For example, you can define a pipeline to be run over the Japanese equities domain like this:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.domain import JP_EQUITIES

pipe = Pipeline(columns={}, domain=JP_EQUITIES)

Note

If no domain is specified, a pipeline's domain will default to US_EQUITIES, representing the US equity market.

Note

For the mathematically-inclined, the name "domain" refers to the mathematical concept of the domain of a function, which is the set of potential inputs to a function. For more information about the design of domains, see the public design document on GitHub.

Screen

The screen argument is used to apply a post-processing filter to the output dataframe of a pipeline execution. Unlike the columns and domain arguments, the screen argument is a convenience that doesn't change how the pipeline's computations are executed; it only trims the output.

The screen argument must be a pipeline Filter. Once the pipeline has executed successfully, any assets for which the supplied filter yields False are dropped from the output dataframe. Since screening is purely a post-processing step, it is often more convenient to omit the screen and filter the pipeline result manually. For instance, if you have a computationally expensive pipeline and want to test multiple filters, it makes sense to run the pipeline once, store the result, and then apply different filters to the output (usually a fast operation) so that you don't have to run the pipeline multiple times.
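
A sketch of this approach, reusing the factors and filter from the example at the top of this page and running in Research (the filter is included as a column so it can be applied after the fact):

from quantopian.research import run_pipeline

expensive_pipe = Pipeline(
    columns={
        'sma_10': sma_10,
        'sma_30': sma_30,
        'prices_over_5': prices_over_5,
    },
)

# Run once and store the result.
result = run_pipeline(expensive_pipe, '2017-01-01', '2018-01-01')

# Post-filter the stored result; equivalent to passing screen=prices_over_5,
# but the pipeline doesn't need to be re-run to try a different filter.
filtered_result = result[result['prices_over_5']]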

Running Pipelines

Once you have defined a pipeline, the next step is to run it (also referred to as 'executing' a pipeline). The Pipeline object doesn't actually contain any data; instead, it's a "computational expression" that will be evaluated using a particular dataset or datasets. In order to access the data indicated in your pipeline, the pipeline needs to be run. Running a pipeline essentially plugs in real data to all of the computations that were defined in the pipeline.

Thanks to DataSets and BoundColumns, the pipeline engine knows where to get the data to plug in to a Pipeline definition. In addition, the pipeline knows about the trading calendar and the existence of any listed assets that correspond to its domain. As a result, pipeline knows how to load the required data to evaluate its computations for the dynamic set of all active equities in the specified domain each day. Under the hood, pipeline is able to perform these computations extremely efficiently by pre-fetching data and chunking its computations. By default, pipelines in the IDE are run in 6-month chunks while pipelines in Research are run in 12-month chunks, allowing pipeline to pre-fetch more data and perform the required computations more quickly. Importantly, the chunksize does not affect the output of pipeline, it only affects the speed and memory usage of the pipeline engine.

Pipeline also knows about timestamps in each dataset, so it can surface data in a point-in-time fashion and prevent lookahead bias. The result of a pipeline is evaluated one day at a time and pipeline computations are only allowed to access data that was available prior to the simulation date. For example, if a pipeline was run from 01/01/2017 to 01/01/2018, a data point that was learned on 05/01/2017 (timestamp at 05/01/2017 6:00pm ET) will only be available when computing the results for 05/02/2017 onward. The concept of point-in-time data on Quantopian is further explained here in the Data Reference.

The output of running a pipeline depends on the environment in which it is run. In Research, a pipeline is run using run_pipeline(), which requires explicit start and end dates to be provided. In the IDE, a pipeline must be 'attached' to an algorithm where it is evaluated on each day of a backtest. For more on pipeline outputs, refer to Running Pipelines in Research and Running Pipelines in the IDE.