Creating a Pipeline from two datasets?

Newbie here.

I'm looking to build a strategy from two datasets. As we cannot have two pipelines, I'd like to create a single pipeline built from the two datasets, with independent conditions for each dataset.

Is there a way to do this cleanly? Any examples to refer to?

3 responses

Maybe start with some basics.

A pipeline is an object whose primary function is to fetch data in an optimal way. You define all the data fields you want to get, and it then returns that data with a call to the 'pipeline_output' method. You can define 'raw' data fields (eg 'volume', 'market cap', etc) or you can define calculated fields based upon raw data (eg 'demean', 'rank', 'simple moving average'). The pipeline object's job is to optimize the database queries and any associated calculations and return the results as a nice, powerful pandas dataframe. The columns of the dataframe are the data fields you defined, and the rows are each security.
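For instance, here's a minimal, data-only pipeline definition (a sketch; it assumes Quantopian's built-in USEquityPricing dataset and the standard Pipeline import):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data.builtin import USEquityPricing

    def make_pipeline():
        # 'Raw' field fetched straight from the database
        volume = USEquityPricing.volume.latest
        # Calculated field built from the raw data
        volume_rank = volume.rank()
        return Pipeline(columns={'volume': volume, 'volume_rank': volume_rank})

Each day this returns a dataframe with one row per security and the two columns 'volume' and 'volume_rank'.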

Now, you can implement some simple boolean logic within a pipeline through the use of masks and a final screen. This will limit the securities (ie the rows in the returned dataframe) to a particular subset. However, just because you can doesn't mean you always should. It isn't required and at times isn't even desirable.
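For example, a screen built from a universe filter and a price test (a sketch; Q1500US and the $5 threshold are just illustrative):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data.builtin import USEquityPricing
    from quantopian.pipeline.filters import Q1500US

    def make_screened_pipeline():
        last_close = USEquityPricing.close.latest
        # The screen drops every row that fails the boolean test
        return Pipeline(
            columns={'last_close': last_close},
            screen=Q1500US() & (last_close > 5.0),
        )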

A dataset is, well, a set of data. One example is 'USEquityPricing'. Each dataset typically has multiple fields. In the case of USEquityPricing, those fields are 'open', 'high', 'low', 'close', and 'volume'. Quantopian provides a number of datasets (https://www.quantopian.com/data); you can choose any of their associated fields to add as columns in a pipeline definition or to use as inputs to a calculated pipeline column.
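For instance, a field can be used directly as a column or fed into a calculated factor (a sketch):

    from quantopian.pipeline.data.builtin import USEquityPricing
    from quantopian.pipeline.factors import SimpleMovingAverage

    # Directly as a pipeline column: the most recent close
    last_close = USEquityPricing.close.latest

    # As an input to a calculated column: the 30-day average close
    sma_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)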

These concepts are introduced in the tutorials https://www.quantopian.com/tutorials/pipeline.

@Martin
You stated "I'm looking to make a strategy from two datasets". That's no problem. Go ahead and pull data fields from as many datasets as you wish (though realistically there are memory and processor constraints) and put them into your pipeline definition.
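For instance, a single pipeline can mix columns from USEquityPricing and the Morningstar fundamentals dataset (a sketch; the particular fields are just examples):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data.builtin import USEquityPricing
    from quantopian.pipeline.data import morningstar

    def make_two_dataset_pipeline():
        return Pipeline(columns={
            # Field from the pricing dataset
            'last_close': USEquityPricing.close.latest,
            # Field from the fundamentals dataset
            'market_cap': morningstar.valuation.market_cap.latest,
        })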

" I am looking to create a single pipeline created from the two datasets basing on varying conditions, independent conditions for each dataset".

The approach I'd recommend is to separate your data from your logic. Create a pipeline object to return only the data. Create separate code to define your conditions. Don't do any filtering or logic in the pipeline definition. Any logic that can be done with the pipeline methods can just as easily be done using pandas methods on the returned dataframe. Start by defining the data you need. In this case: 'last_close', 'gain', and 'sector'.

import quantopian.pipeline.filters as Filters
import quantopian.pipeline.factors as Factors
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.classifiers.morningstar import Sector

def pipe_definition(context):
    # Limit calculations to a liquid universe
    universe = Filters.Q1500US()

    # Raw field: the most recent close
    last_close = USEquityPricing.close.latest

    # Calculated fields: 30-day return and Morningstar sector code
    gain = Factors.Returns(inputs=[USEquityPricing.close], window_length=30, mask=universe)
    sector = Sector(mask=universe)

    # No screen, no logic - just the data
    return Pipeline(
        columns={
            'last_close': last_close,
            'gain': gain,
            'sector': sector,
        },
    )
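To wire this up, attach the pipeline in 'initialize' and fetch the dataframe each day (a sketch using the standard attach_pipeline/pipeline_output calls; the name 'my_pipe' is arbitrary):

    from quantopian.algorithm import attach_pipeline, pipeline_output

    def initialize(context):
        attach_pipeline(pipe_definition(context), 'my_pipe')

    def before_trading_start(context, data):
        context.output = pipeline_output('my_pipe')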

Then, in your 'before_trading_start' method (or elsewhere in the code), use the pandas dataframe methods to manipulate, calculate, and select stocks according to your 'conditions'. You can have as many of these conditions as you wish.

    # Assume the pipeline output was assigned to 'context.output'
    # and that TARGET_STOCKS is a module-level constant, eg TARGET_STOCKS = 10

    # Condition A: top gainers in sector 206 with a close above $5
    condition_a = 'sector == 206 and gain > 0 and last_close > 5.0'
    context.a_stocks = (context.output.
                        query(condition_a).
                        nlargest(TARGET_STOCKS, 'gain').
                        index)

    # Condition B: biggest losers in sector 100
    condition_b = 'sector == 100 and gain < 0'
    context.b_stocks = (context.output.
                        query(condition_b).
                        nsmallest(TARGET_STOCKS, 'gain').
                        index)

I've attached an algorithm with something like this implemented.

Hope this gives you some ideas. The big takeaway... separate the data from the logic. This is good programming practice in general, but it will certainly make for a much clearer pipeline definition. Then get to know the pandas dataframe methods and use them. There are dataframe methods for just about any sorting, selecting, and calculating one would wish to do. They're your friends.
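To give a flavor of the equivalents (a sketch; it assumes the 'context.output' dataframe from above):

    df = context.output

    # Plain boolean indexing instead of query()
    positive = df[(df['gain'] > 0) & (df['last_close'] > 5.0)]

    # Top two gainers within each sector
    best_per_sector = (df.sort_values('gain', ascending=False).
                       groupby('sector').
                       head(2))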

[Attached backtest - ID: 5930320a742ea96e3d1f8809]

Thanks Dan, much clearer when you frame it that way.

Will give it a shot

Fantastic overview Dan!

It's actually pretty sound advice for a complete beginner to avoid using Pipeline to compute factors and to do the data processing manually using pandas - that's perfectly viable in Research, and it will let you get experience with both. It'll be tough at first, but becoming proficient in pandas is well worth the time.
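For example, in Research you can run a data-only pipeline like Dan's over a date range and practice the pandas side on the result (a sketch; 'run_pipeline' is the Research counterpart of 'pipeline_output', and the dates are arbitrary):

    from quantopian.research import run_pipeline

    # Same pipe_definition as above, just without the context argument
    results = run_pipeline(pipe_definition(), '2017-01-03', '2017-06-01')

    # results has a (date, security) multi-index; the conditions work the same
    top_gainers = results.query('sector == 206 and gain > 0').nlargest(10, 'gain')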