It may be helpful to first understand the 'why' for using a pipeline.
Backtest speed and memory usage are a challenge, especially when accessing many fields of data (e.g. many different fundamentals). Consider a simple requirement to retrieve the last 5 days of close prices for a stock each day. The backtester would execute a database query each day of the backtest to fetch the close prices. If the backtest were run over a 5-year period, this roughly translates into about 252 x 5 = 1260 queries of the database. Moreover, 80% of the data fetched each day has already been fetched the day before (4 of the 5 close prices will be the same).
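The redundancy can be sketched with a toy comparison (plain Python, no Quantopian APIs — the query function and data are stand-ins): one "query" per backtest day versus a single up-front fetch that later days slice from.

```python
# Hypothetical sketch: count "database queries" for the naive per-day
# approach versus a single batched fetch. Not Quantopian code.

DAYS = 252 * 5            # trading days in a 5-year backtest
WINDOW = 5                # days of close prices needed each day

prices = list(range(DAYS))   # fake close-price series, one value per day
queries = 0

def query_closes(start, end):
    """Pretend database query returning closes for days [start, end)."""
    global queries
    queries += 1
    return prices[start:end]

# Naive approach: one query per backtest day.
for day in range(WINDOW, DAYS):
    window = query_closes(day - WINDOW, day)
naive_queries = queries

# Batched approach: one query up front, then slice from the cache.
queries = 0
cache = query_closes(0, DAYS)
for day in range(WINDOW, DAYS):
    window = cache[day - WINDOW:day]
batched_queries = queries

print(naive_queries, batched_queries)   # 1255 1
```

Same data reaches the backtest logic either way; only the number of round trips to the database changes.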
Could we do this more efficiently? Since we know what data we need (the close prices) and we know how much data to fetch (the full 5-year period), couldn't we get all the data at once? Couldn't we greatly reduce the database overhead and the total data fetched if we simply did one query at the beginning of our backtest? The answer is yes. However, the devil is in the details (as always). First, there's the pesky reality of stock splits and de-listings. The split-adjusted prices and volumes, and even the set of tradable securities, are typically not static throughout the backtest period. They vary depending upon the backtest dates. Second, the amount of data to optimally query and cache is not obvious. It will vary depending upon the type of data, the quantity, and the duration of the backtest.
Pipelines to the rescue... The 'why' for using a pipeline is to improve data-fetch and calculation performance, to handle stock splits, and to provide an (ostensibly) easy interface so a user doesn't need to be concerned with down-and-dirty implementation issues. The output of a pipeline is a nice, neat pandas dataframe with a column for each piece of data you need, all in one place.
So, it may help conceptually to re-order the pipeline implementation steps you listed above:
- Look over and find the base data that you want your algorithm to
use. Choose from any of the fields (called 'bound columns' in
pipeline parlance) in any of the data sets provided by Quantopian.
- Create the specific data fields your algorithm will use based on this
base data: either the raw data fields themselves or calculations
based upon them. These are the factors, filters, and classifiers in
pipeline parlance.
- Create a pipeline object and tell it which specific data fields to
use from the previous step. These become the columns in the output
dataframe. You can also specify a 'screen' to limit the output of
the pipe to specific securities, but this isn't required. You can
always sort and filter after you get the output. Knowing all the
required data BEFOREHAND allows the pipeline to optimize what data
to fetch.
- Attach the pipeline to the Algorithm. This effectively stores the
pipeline object reference so the algorithm can access it later.
- Run the function 'pipeline_output' each day before the market opens.
This does the actual fetch of the specified data fields and returns
them in the form of a single pandas dataframe. Note that in the
background the pipeline will retrieve more data than is needed for
that single day, for optimization reasons. The pipeline object stores
this pre-fetched data. The next time the 'pipeline_output' function
is called, if the pipeline object already has the output (from a
previous call), then the data is returned from cache without doing
an additional database query.
Now, you asked 'How come my print(output) is giving so many NaNs and not values?'
Whenever you use a mask for a factor, any security which doesn't pass the filter (i.e. it is 'masked out') will return NaN for the factor value. In your case you used 'spy' as a mask for your factors. Every security which doesn't pass the 'spy' filter will return NaN for that factor.
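You can see the same behavior with plain pandas (an analogy with made-up tickers and values, not the pipeline itself): wherever the mask is False, the factor value becomes NaN, and you can drop those rows afterwards.

```python
import pandas as pd

# Hypothetical factor values for four securities.
factor = pd.Series([1.2, 0.8, 1.5, 0.9],
                   index=['AAPL', 'XYZ', 'MSFT', 'ABC'])
# Hypothetical mask: True where the security passes the filter.
in_mask = pd.Series([True, False, True, False], index=factor.index)

masked = factor.where(in_mask)    # NaN wherever the mask is False
print(masked.dropna())            # only the securities passing the mask
```

This is also why a quick `output.dropna()` on the pipeline result often clears out the masked rows.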
You also asked 'How do I use Morningstar Fundamentals?'
You basically have all the steps in your post, but I'll re-order them and comment:
# the required imports to use pipeline
# this imports the Pipeline class definition
from quantopian.pipeline import Pipeline
# this imports the two pipeline functions needed by your algorithm
from quantopian.algorithm import attach_pipeline, pipeline_output
# import any datasets you want to use
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data import morningstar
# import any built in filters, factors or classifiers you want to use
from quantopian.pipeline.filters import Q500US
from quantopian.pipeline.factors import AverageDollarVolume, SimpleMovingAverage
# First create all the data fields (factors, filters, classifiers) that you will use
# Create any filters you may want to use as masks to speed up factor calculations
# (here the built-in Q500US filter stands in for the 'spy' filter from your post)
spy = Q500US()
# Create any factors and classifiers you want to use. Use the masks above if desired.
# Note the easiest way to create a factor from any dataset (including morningstar)
# is simply to use the '.latest' method as below
ev = morningstar.valuation.enterprise_value.latest
mean_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10, mask=spy)
mean_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30, mask=spy)
perc_diff = (mean_10 - mean_30) / mean_30
# Now create an instance of the pipeline and add the desired columns you want returned in the resulting output
# below actually creates a filter 'on the fly' resulting in True where the factor perc_diff is greater than 1
def initialize(context):
    pipe = Pipeline(columns={'ev': ev, 'perc_diff': perc_diff,
                             'perc_diff_gt_1': perc_diff > 1})
    # Attach the pipeline to the algorithm (basically store the pipe object reference and give it a name for easy use later)
    attach_pipeline(pipe, 'my_pipe')

def before_trading_start(context, data):
    # Run the pipeline we previously defined. It will return the columns of data we requested.
    output = pipeline_output('my_pipe')
    # Sort, filter, find, compute, or do whatever you wish on the pandas dataframe returned in 'output'
    # All the data is neatly accessible as columns with the rows indexed by the securities.
    # By default every security (the rows) in the Quantopian database will be returned. These can however be
    # filtered to a smaller subset by setting the 'screen' parameter of the pipeline
    my_stock_picks = output['ev'] > 100000
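Once you have the output dataframe, the post-processing is ordinary pandas. A quick sketch with hypothetical data (the tickers and numbers are invented, not real pipeline output) showing boolean filtering and sorting:

```python
import pandas as pd

# Stand-in for the dataframe pipeline_output would return.
output = pd.DataFrame(
    {'ev': [250000, 80000, 120000],
     'perc_diff': [0.02, -0.01, 0.05]},
    index=['AAPL', 'XYZ', 'MSFT'])

my_stock_picks = output['ev'] > 100000            # boolean Series
picks = output[my_stock_picks]                    # keep only True rows
picks = picks.sort_values('perc_diff', ascending=False)
print(list(picks.index))   # ['MSFT', 'AAPL']
```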
This isn't a complete algorithm. It doesn't schedule any functions and doesn't do any ordering. However, it does give the basic steps needed to define data needed by the algorithm and create a pipeline to fetch that data.
Hope that helps.