Need Help with Pipeline, Just a Beginner

Hi Guys and Gals,
I just started on Q2. I'm used to iterating over the data directly, but I can't do that anymore, so I have to use Pipeline.

This is how I believe it works.

  1. You load the pipeline with pipe = Pipeline().
  2. Then you attach the pipeline.
  3. After it has been attached, you add the factors, classifiers, and filters with pipe.add(ma, 'ma').
  4. I then use set_screen to apply my filter.
  5. Then I use pipeline_output to get the stocks to purchase.

My questions, in order of importance:

Why is my print(output) giving so many NaNs instead of values?

How do I use Morningstar Fundamentals?

from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.filters import Q500US  
from quantopian.pipeline.data.builtin import USEquityPricing  
from quantopian.pipeline.factors import AverageDollarVolume,SimpleMovingAverage  
# from quantopian.pipeline.data import morningstar  

def initialize(context):  

    pipe=Pipeline()  
    attach_pipeline(pipe,'my_pipeline')  
    spy=Q500US()  
    mean_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10, mask=spy)

    # 30 day close price average.  
    mean_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30, mask=spy)  

    perc_diff = (mean_10 - mean_30) / mean_30

    pipe.add(mean_10, 'mean_10')  
    pipe.add(mean_30, 'mean_30')  
    pipe.add(perc_diff>1, 'perc_diff')  



def before_trading_start(context,data):  
    output=pipeline_output('my_pipeline')  
    print(output)  
2 responses

It may be helpful to first understand the 'why' for using a pipeline.

Backtest speed and memory usage are challenges, especially when accessing many fields of data (e.g. many different fundamentals). Consider a simple requirement: retrieve the last 5 days of close prices for a stock each day. The backtester would execute a database query each day of the backtest to fetch the close prices. If the backtest were run over a 5-year period, that roughly translates into 252 x 5 = 1260 queries of the database. Moreover, 80% of the data fetched each day was already fetched the day before (4 of the 5 close prices will be the same).
Could we do this more efficiently? Since we know what data we need (the close prices) and how much of it to fetch (5 years' worth), couldn't we get all the data at once? Couldn't we greatly reduce the database overhead and the total data fetched if we simply did one query at the beginning of our backtest? The answer is yes. However, the devil is in the details (as always). First, there's the pesky reality of stock splits and de-listings: the split-adjusted prices and volumes, and even the set of tradable securities, are typically not static throughout the backtest period; they vary depending on the backtest dates. Second, the amount of data to optimally query and cache is not obvious. It will vary depending on the type of data, the quantity, and the duration of the backtest.
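To make the arithmetic above concrete, here is a small sketch (plain Python, not any real backtester code; the function names are made up) counting queries and rows fetched under the naive per-day approach versus a single up-front prefetch:

```python
# Hypothetical sketch: compare per-day database queries against a single
# up-front prefetch for a 5-day rolling window of close prices.

TRADING_DAYS = 252 * 5   # ~5 years of daily bars
WINDOW = 5               # days of close prices needed each day

def naive_strategy():
    """Query the database once per backtest day."""
    queries = 0
    rows_fetched = 0
    for _ in range(TRADING_DAYS):
        queries += 1            # one DB round trip per day...
        rows_fetched += WINDOW  # ...fetching 5 rows, 4 of which we already had
    return queries, rows_fetched

def prefetch_strategy():
    """Query once for the whole backtest period."""
    queries = 1
    rows_fetched = TRADING_DAYS + WINDOW - 1  # each close price fetched exactly once
    return queries, rows_fetched

naive_q, naive_rows = naive_strategy()
pre_q, pre_rows = prefetch_strategy()
print(naive_q, naive_rows)   # 1260 queries, 6300 rows
print(pre_q, pre_rows)       # 1 query, 1264 rows
```

Same data, three orders of magnitude fewer round trips; the redundantly re-fetched 80% disappears as well.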

Pipelines to the rescue... The 'why' for using a pipeline is to improve data fetch and calculation performance, to handle stock splits, and to provide an (ostensibly) easy interface so a user doesn't need to be concerned with down and dirty implementation issues. The output of a pipeline is a nice neat pandas dataframe with columns for each piece of data you need all in one place.

So, it may help conceptually to re-order the pipeline implementation steps you listed above:

  1. Look over and find the base data that you want your algorithm to
    use. Choose from any of the fields (called 'bound columns' in
    pipeline parlance) in any of the data sets provided by Quantopian
    https://www.quantopian.com/data
  2. Create specific data fields your algorithm will use based on these
    basic data. Either the raw data fields or calculations based upon
    these fields. These are the factors, filters, and classifiers in
    pipeline parlance.
  3. Create a pipeline object and tell it what specific data fields to
    use from step 2. These become the columns in the output dataframe. You can also
    specify a 'screen' to limit the output of the pipe to specific
    securities but this isn't required. You can always sort and filter
    after you get the output. Knowing all the required data BEFOREHAND
    allows the pipeline to optimize what data to fetch.
  4. Attach the pipeline to the Algorithm. This effectively stores the
    pipeline object reference so the algorithm can access it later.
  5. Run the function 'pipeline_output' each day before the market opens.
    This does the actual fetch of the specified data fields and returns
    them in the form of a single pandas dataframe. Note that in the
    background the pipeline will be retrieving more data than for that
    single day for optimization reasons. The pipeline object stores
    these pre-fetched data. The next time the 'pipeline_output' function
    is called, if the pipeline object already has the output (from a
    previous call), then the data is returned from cache without doing
    an additional database query.
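The caching behavior described in step 5 can be sketched in plain Python. This is a toy illustration, not the real pipeline internals; the class, the chunk size, and the fake "database" are all invented for the example:

```python
# Hypothetical sketch (not real pipeline internals): an object that
# prefetches a chunk of days on its first call and serves later calls
# from cache, as described in step 5 above.

class TinyPipeline:
    CHUNK = 30  # days prefetched per database query (an arbitrary choice)

    def __init__(self):
        self.cache = {}       # day -> precomputed output
        self.db_queries = 0   # how many times we hit the "database"

    def _fetch_chunk(self, start_day):
        self.db_queries += 1  # one query covers a whole chunk of days
        for day in range(start_day, start_day + self.CHUNK):
            self.cache[day] = "output for day %d" % day

    def pipeline_output(self, day):
        if day not in self.cache:      # cache miss: prefetch a chunk
            self._fetch_chunk(day)
        return self.cache[day]         # cache hit: no extra query

pipe = TinyPipeline()
for day in range(60):                  # simulate 60 daily calls
    pipe.pipeline_output(day)
print(pipe.db_queries)                 # 2 queries instead of 60
```

Sixty daily calls cost only two "database" queries here, which is the essence of why the pipeline wants to know all your required data fields up front.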

Now, you asked 'Why is my print(output) giving so many NaNs instead of values?'
Whenever you use a mask for a factor, any security which doesn't pass the filter (i.e. it is 'masked out') will return NaN for the factor value. In your case you used 'spy' as a mask for your factors, so every security which doesn't pass the 'spy' filter returns NaN for those factors.
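Here's a tiny illustration of that in plain Python (not the Quantopian API; the symbols and values are made up). The masked-out rows still appear in the output, just with NaN values, and dropping them afterwards is equivalent in spirit to passing the same filter as the pipeline's screen instead of a mask:

```python
import math

# Hypothetical data: close prices and membership in a Q500US-like filter
closes = {'AAPL': 150.0, 'MSFT': 250.0, 'TINY': 3.0, 'PENNY': 0.5}
in_q500 = {'AAPL': True, 'MSFT': True, 'TINY': False, 'PENNY': False}

# A masked factor: computed only where the mask is True, NaN elsewhere.
# Every security is still a row in the output, hence all the NaNs.
factor = {sym: (closes[sym] if in_q500[sym] else math.nan) for sym in closes}

# Dropping the NaN rows afterwards is analogous to setting the pipeline's
# screen to the same filter (e.g. screen=spy) rather than just masking
screened = {sym: v for sym, v in factor.items() if not math.isnan(v)}

print(sorted(screened))  # ['AAPL', 'MSFT']
```

If you don't want the NaN rows at all, set the filter as the pipeline's screen (or call dropna() on the output dataframe).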

You also asked 'How do I use Morningstar Fundamentals?'
You basically have all the steps in your post, but I'll re-order them and comment:


# the required imports to use pipeline  
# this imports the Pipeline class definition  
from quantopian.pipeline import Pipeline

# this imports the two pipeline functions needed by your algorithm  
from quantopian.algorithm import attach_pipeline, pipeline_output    

# import any datasets you want to use  
from quantopian.pipeline.data.builtin import USEquityPricing  
from quantopian.pipeline.data import morningstar

# import any built in filters, factors or classifiers you want to use  
from quantopian.pipeline.filters import Q500US  
from quantopian.pipeline.factors import AverageDollarVolume,SimpleMovingAverage

def initialize(context):  
    # First create all the data fields (factors, filters, classifiers) that you will use  
    # Create any filters you may want to use as masks to speed up factor calculations  
    spy=Q500US()  

    # Create any factors and classifiers you want to use. Use masks above if desired.  
    # Note the easiest way to create a factor from any dataset (including morningstar)  
    # is simply to use the '.latest' attribute as below  
    ev = morningstar.valuation.enterprise_value.latest  
    mean_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10, mask=spy)  
    mean_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30, mask=spy)  
    perc_diff = (mean_10 - mean_30) / mean_30

    # Now create an instance of the pipeline and add the desired columns you want returned in the resulting output  
    pipe=Pipeline() 

    pipe.add(ev, 'ev')  
    pipe.add(mean_10, 'mean_10')  
    pipe.add(mean_30, 'mean_30')  
    # below actually creates a filter 'on the fly' resulting in True where the factor perc_diff is greater than 1  
    pipe.add(perc_diff>1, 'perc_diff')  

    # Attach the pipeline to the algorithm (basically store the pipe object reference and give it a name for easy use later)  
    attach_pipeline(pipe,'my_pipeline')  


def before_trading_start(context,data):  
    # Run the pipeline we previously defined. It will return the columns of data we requested.  
    output=pipeline_output('my_pipeline')

    # Sort, filter, find, compute, or do whatever you wish on the pandas dataframe returned in 'output'  
    # All the data is neatly accessible as columns with the rows indexed by the securities.  
    # By default every security (the rows) in the Quantopian database will be returned. These can however be filtered  
    # to a smaller subset by setting the 'screen' parameter of the pipeline  
    print(output)  
    my_stock_picks = output['ev'] > 100000

This isn't a complete algorithm. It doesn't schedule any functions and doesn't do any ordering. However, it does give the basic steps needed to define data needed by the algorithm and create a pipeline to fetch that data.

Hope that helps.

Thanks for the help! I haven't been on Q in a while. But I think I can design an algorithm with this help!