Using the Mean of Today's Values within the Pipeline

Hello everybody,

I am calculating the number of bear and bull signals. Within the pipeline, those values are calculated per stock. However, I want that value for the overall universe of stocks - the mean() of them. But I can't find any way to compute that number and use it WITHIN the pipeline for further calculations.

What I'm searching for is more or less something like: bear_mean = num_bear_now.groupby(timestamp).mean()

I'd appreciate any help!!! I really hope someone can help me.


There are a couple of ways to get the average, or mean, value of a factor within pipeline. The straightforward approach is to write a small custom factor. Something like this

import numpy as np
from quantopian.pipeline import CustomFactor

# Custom factor for calculating the mean of another factor
class Factor_Mean(CustomFactor):
    # Default window length of a single day
    window_length = 1

    def compute(self, today, assets, out, factor_data):
        # Broadcast the universe-wide mean to every asset
        out[:] = np.nanmean(factor_data)

# Instantiate with the factor to average as the input
bull_mean = Factor_Mean(inputs=[bull_scores_average],
                        mask=universe & bull_scores_average.isfinite())

That will broadcast the mean value of the factor to all assets. Make certain to include a mask to limit which assets to include in the average. If the window length is set to something other than 1 then all the factor values will be 'flattened' and the result will be the mean of all values over all the days in the window.
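For example, a hedged variant of the same instantiation (reusing the Factor_Mean class and names above) that averages over a 5-day window, so the values from all five days are pooled into a single mean:

    bull_mean_5d = Factor_Mean(inputs=[bull_scores_average],
                               window_length=5,
                               mask=universe & bull_scores_average.isfinite())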

A second approach, sort of a hack, is to use the demean method.

Demeaning data means subtracting the sample mean from each observation
X[demeaned] = X - X[mean]

Therefore, if one subtracts the demeaned value from each observation one is left with the mean
X - X[demeaned] = X - (X - X[mean]) = X[mean]

So, the following will do the same as our custom factor.

    bull_mean = bull_scores_average - bull_scores_average.demean(mask=universe & bull_scores_average.isfinite()) 

Both methods will broadcast the same value to all assets. One advantage of the demean method is that its groupby parameter can be used to get the mean value for each group (see the sketch below). This can of course also be done with a custom factor, but it requires a little more coding. Finally, it's generally not appropriate to average a number of averages. However, it may be appropriate here?
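As a rough sketch of that groupby idea (the Sector classifier here is just an illustrative choice; any pipeline classifier would work):

    from quantopian.pipeline.classifiers.fundamentals import Sector

    # Per-group mean, broadcast to every asset within its sector
    bull_mean_by_sector = bull_scores_average - bull_scores_average.demean(
        groupby=Sector(),
        mask=universe & bull_scores_average.isfinite(),
    )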

Attached is a notebook with the code.


Thank you so much Dan!! You are a lifesaver. That was exactly what I needed.

@Dan Whitnable

I've been trying to use your solution to Niklas's problem, to find the number of signals each day, and ultimately to use that value in some logic within pipeline.

Please see attached notebook, where I have stripped down the factor to the basics.

I have commented out rows 45 and 58, which when left in cause the following error message:

NonPipelineInputs: Unexpected input types in str. Inputs to Pipeline expressions must be Filters, Factors, Classifiers, or BoundColumns.  
Got the following type(s) instead: [<class 'numpy.ndarray'>]  

Any advice, pointers appreciated.


See attached notebook. Made a few changes. Also, there is now a built-in method to sum factors, .sum(), so no need for the custom factor anymore. See the docs for a complete list of summary methods which can be used with factors (https://www.quantopian.com/docs/api-reference/pipeline-api-reference#methods-that-create-summary-statistics). You may find some of these methods helpful.
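As a minimal sketch (signal_factor and universe are placeholders for whatever factor and mask the notebook actually defines):

    # One value broadcast to every asset: the sum of the factor across the masked universe
    signal_sum = signal_factor.sum(mask=universe)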

Best of luck.


Thanks Dan :-)

I mistyped and actually meant to "count" the signals, rather than "sum" them. So I followed your advice, looked at the list of methods (your link, below), and replaced ".sum" with ".notnull_count".

https://www.quantopian.com/docs/api-reference/pipeline-api-reference#methods-that-create-summary-statistics
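For anyone following along, the substitution looks roughly like this (signal_factor and universe are placeholders for the actual names in the notebook):

    # Count of non-null factor values across the masked universe, broadcast to every asset
    signal_count = signal_factor.notnull_count(mask=universe)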

Thanks again.

Hi @Dan Whitnable

Can I ask a follow-up question please?

In the custom factor I'm trying to add some logic into the compute section. My intent is to filter the output of the custom factor using the count of the signals coming from the custom factor.

The custom factor produces buy and sell signals. I want to be able to choke the custom factor in certain instances - say when there are just buys and no sells.

I have tried this by developing two tests (one for buy signals and one for sell signals) within the compute section of the custom factor.

I then wanted to use the np.where statement to modify the output of the custom factor based on some logic.

However, when I do this, I get an error message saying:

could not broadcast input array from shape (34,8464) into shape (8464)

The new code I have added to the compute section of the custom factor in the above notebook is:

        b1 = (low - lb) < 0  
        b2 = (close - lb) > 0  
        buy_signal = b1 & b2

        s1 = (high - ub) > 0  
        s2 = (close - ub) < 0  
        sell_signal = s1 & s2  
        frac_fac = (close[-1]-lb)/(ub-lb) 

        out.lb[:] = np.where( buy_signal, frac_fac, na)  
        out.ub[:] = np.where( sell_signal, frac_fac, na)  
        out.frac_fac[:] = frac_fac  

Tinkering some more, this does not throw an error:

        out.lb[:] = np.where(buy_signal[-1],frac_fac,None)  
        out.ub[:] = np.where(sell_signal[-1],frac_fac,None)  
        out.frac_fac[:] = frac_fac

I also need to rename the outputs, as they are no longer upper and lower "bands" but buy & sell signals now.

@Ben,

I wonder why you use highest 'high' in frac_fac (CustomFactor) to calculate lb and lowest 'low' to calculate ub?

            lb = np.nanmax(high[:self.window_length], axis=0)  
            ub = np.nanmin(low[:self.window_length], axis=0)  

@Vladimir

The lb and ub bands in this notebook are just placeholders for my actual indicators. The output looks really similar to Bollinger Bands, and I use them in a similar way.

I'm a noob to Python and have returned to Quantopian after a false start 2 years ago, determined this time to get a working version of my algo/indicator. I really appreciate all the support from the Quantopian forums and only hope I can return the favour at some point.

While I have you! I'm just trying to understand when to use "close[-1]" versus just "close". I understand "close[-1]" returns the most recent value in the BoundColumn. I'm just wondering if, in the following code:

frac_fac = (close[-1]-lb)/(ub-lb)  

I should perhaps also be referring to lb as "lb[-1]" etc...?

@Ben,

Your frac_fac (CustomFactor) is very similar to %K (FastK Stochastic Oscillator), but with the lb and ub definitions swapped.

FYI
Stochastic Oscillator was developed by George C. Lane in the late 1950s

%K = (Current Close - Lowest Low)/(Highest High - Lowest Low) * 100

Lowest Low = lowest low for the look-back period.
Highest High = highest high for the look-back period.
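For comparison, a minimal sketch of %K written as a pipeline CustomFactor (the class name and 14-day window are just illustrative choices, not from the notebook):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import USEquityPricing

class FastK(CustomFactor):
    # %K = (Current Close - Lowest Low) / (Highest High - Lowest Low) * 100
    inputs = [USEquityPricing.close, USEquityPricing.low, USEquityPricing.high]
    window_length = 14

    def compute(self, today, assets, out, close, low, high):
        lowest_low = np.nanmin(low, axis=0)
        highest_high = np.nanmax(high, axis=0)
        out[:] = (close[-1] - lowest_low) / (highest_high - lowest_low) * 100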

Thanks Vladimir - I will have a look at those, appreciate the info.

@Dan Whitnable

Made some progress on this. My goal is to get the count of signals produced each day from a custom factor and then use that in some logic to decide whether or not to use the signals that day.

Please see below a version in which I have replaced my custom factor with a simple dummy custom factor.

The problem I am trying to understand this weekend is why the count of signals is always higher than my universe of stocks.

Is it possible that the code is not in fact counting the daily signals as I intend?

Any advice, insights greatly appreciated.


Your code is indeed correctly counting the number of signals each day. The buy and sell signals are being generated by the custom factor frac_fac. However, there isn't any mask set, so it calculates signals for all assets in the Quantopian universe (~9000). It seems you want to calculate signals only for the universe you specified, the Q500US. If that's the case, then set a mask when creating the frac_fac factor, like this

    # Instantiate our frac_fac factor
    indicators = frac_fac(window_length=w_length+s_step, mask=universe)

The results will now probably be more in line with what you expected.

Good luck.

Thanks Dan!

@Dan Whitnable

Hi Dan - I'm now getting totals for signals each day, thanks to your advice on adding the mask.

One thing I don't fully understand is when it is necessary to use "[-1]" with a variable (to refer to the most recent value in the array).

In the following code snippet from the previously attached notebook, in some of the logic I do not use "[-1]" but in the logic immediately after it I do use "[-1]". I do not understand why.

# @Dan - why do the following 6 lines of code not require "[-1]" after each term on the right hand side of each assignment statement?  
        b1 = (low < lb)  
        b2 = (close > lb)  
        buy_signal = b1 & b2 

        s1 = (high > ub)  
        s2 = (close < ub)  
        sell_signal = s1 & s2  


# @Dan - And if I do not include the "[-1]" with each of the terms on #the RHS of the following assignment statement, I get a run time error  
#   "could not broadcast input array from shape (20,500) into shape (500)".  
# I understand that 500 is the unmasked universe of stocks, but what does "20" refer to?

        frac_fac = (close[-1]-lb[-1])/(ub[-1]-lb[-1])  

Thanks for all support.

@Dan Whitnable

I've been stuck on this for a week - thought I was closing in on it, but it is slipping from my grasp - please can you give me a pointer, Dan?

In line 26 of the run_pipeline cell of the attached notebook I have the line:

    sigs_today = min([sells_count,buys_count])

The problem is it does not behave as intended. It just returns the "buys_count" figure (the last term in the list, rather than the min).

Is it perhaps treating it as a string and sorting them alphabetically? If so, how can I avoid that and get it to return the element with the minimum value?

Thanks as always.


@Dan Whitnable

Hi Dan - Regarding the question above, it would be good to know if it is in fact possible to achieve what I'm trying to do. That is, to count the number of buy sigs and the number of sell sigs from a particular factor, determine the minimum of the two (either buys or sells), and use that as a pipeline output to make sure an equal number of buys and sells are generated each day (and for days when that does not occur, to set all signals to null).

Tks in advance.

Ben

The thing to remember when defining a pipeline is that one is working with factors and not the actual data. It's easy to forget because one can casually code factor_sum = factor_1 + factor_2 and it works as expected. However, that is only because factors have code in their base class to 'understand' addition (and a few other operators). The ONLY operators and functions which work with factors are the ones explicitly defined. Moreover, and this is unfortunate, if an operator or function is used which isn't defined for factors, an error is often not generated. Things just carry on and produce results probably much different from what was intended. This is what's happening here.

The problem is indeed the following line. The min function is not defined for factors.

 sigs_today = min([sells_count, buys_count])

Up until recently, creating a factor which returned the min of two other factors would have required a CustomFactor. Not hard, but a little extra work. However, now there is a clever if_else method which can be used for this purpose. Check out the docs. So, something like this will return a 1D factor which is the min of sells_count and buys_count.

sells_less_than_buys = sells_count < buys_count  
sigs_today = sells_less_than_buys.if_else(sells_count, buys_count)

How does this work? First, start with a filter, in this case sells_less_than_buys. Typically, filters have a value of either True or False for each asset; some assets may be True while others are False. In this case, however, our filter is a 1D filter with the same value for every asset (either all True or all False). This is because the factors from which it is derived are 1D factors having the same value for each asset. The if_else method will return the first value (i.e. sells_count) when sells_count < buys_count and will return buys_count when it's not. This effectively returns the minimum of the two counts.

Try that. It should work as intended.

Good luck.

Thanks Dan! I think I get it, albeit tenuously! Tks again.

Hi @Dan Whitnable

The method you described above is working great - thanks again.

This AM I've been trying to modify it to see the effect of placing a floor higher than zero.

I've tried adding the following after your code.

    n=100 #the floor value  
    np.where(sigs_today<n, 0 , sigs_today)

Although no error is thrown, the output does not change, making me think it's exactly the same challenge as you described and solved.

So I then attempted to use your technique to achieve the desired effect, with the following:

    n=100

    sigs_filter = sigs_today < n  
    sigs_today = sigs_filter.if_else(0, sigs_filter)

This generates the following error message:

TypeError: Filter.if_else() expected a value of type ComputableTerm for argument 'if_true', but got int instead.

Any advice appreciated!

@Ben. The if_else method expects two factors as arguments. It doesn't expect an integer (in this case 0). That is what's causing the error "TypeError: Filter.if_else() expected a value of type ComputableTerm for argument 'if_true', but got int instead." The numpy where method you tried fails for the reason covered previously: factors are not actual data, or arrays, and therefore cannot be used as inputs to numpy functions.

One could use the if_else method again to handle the case where the total count is less than a minimum number. A 'trick' is that the top method will return zero assets if its N parameter is less than or equal to 0. Something like this

    # get the minimum of either buys or sells  
    min_of_buys_sells_signals = (buys_count < sells_count).if_else(buys_count, sells_count)

    # require a minimum number of buys and sells  
    # if our count is less than this minimum then set to a negative number  
    # setting to a negative number will return zero assets when using the `top` method  
    min_required_buys_and_sells = 60  
    negative_count = min_of_buys_sells_signals - min_required_buys_and_sells

    count_less_than_min_required = min_of_buys_sells_signals < min_required_buys_and_sells  
    sigs_today = count_less_than_min_required.if_else(negative_count, min_of_buys_sells_signals)

    # will not return any assets if `sigs_today` is negative  
    buys = indicators.buy_sigs.top(sigs_today)  
    sells = indicators.sell_sigs.top(sigs_today)  

This should do what you want. A word of caution: pipeline will cause an error in the notebook environment if no rows are returned for any date. It's OK if it doesn't return anything for some dates; however, nothing for all dates gets it confused. This can happen if the minimum value is set too high and there are never any buys or sells. (Setting the minimum value to 100 will cause this.)

All that said, this could probably be done in a clearer fashion after the pipeline is run. One can then use standard Python, numpy, and pandas. Pipeline is really limited to factor and filter methods, which, as we are seeing, are a bit limited at times. In fact, I often use pipeline solely to fetch data, with no logic: completely separate getting the data from manipulating the data. Then, once pipeline is run, the data can be manipulated, sliced, and diced as needed. The separation of data and code may be more intuitive to some.
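As a rough illustration of that workflow in a research notebook (make_pipeline and the buy_sigs/sell_sigs column names are placeholders standing in for the actual pipeline in the attached notebook):

    from quantopian.research import run_pipeline

    result = run_pipeline(make_pipeline(), start_date, end_date)  # dates as in the notebook

    # Count buys and sells per date with plain pandas on the multi-indexed dataframe
    daily_counts = result.groupby(level=0)[['buy_sigs', 'sell_sigs']].count()
    daily_counts['min_of_both'] = daily_counts.min(axis=1)

    # Apply the 'floor' in pandas: dates with too few signals get no trades
    min_required = 60
    valid_dates = daily_counts['min_of_both'] >= min_required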

There are really only two drawbacks to doing all the logic after pipeline is run. First, a pipeline definition can typically be copied and pasted from a notebook to an algo in the IDE and it will run identically in both environments. However, because a pipeline returns a multi-indexed dataframe in a notebook and a single-indexed dataframe in an algo, the code for manipulating the dataframes will be different. Having the same code for both makes for an efficient development workflow. Second, if the logic is encapsulated within a pipeline it can more easily be analyzed using Alphalens. Anyway, something to consider. It's personal preference.

Check out the attached notebook. It should be working as you expected and setting a 'floor' for the number of buys and sells.


Thank you!