Help with CustomFactor: Counting Number of 'Up' Days

Hi - I'm new to Quantopian and trying to write a pipeline to filter companies in QTradableStocksUS universe where the last close > 50 day SMA for X number of days consecutively.

I tried this for Apple stock and based it off this post which was useful

It builds but it gives unexpected results. For example for Apple specifically, the factor 'bad_days_count' (which counts the number of days since the close last < the 50 day SMA) should either be 0 or increasing.

However in my results for some dates it doesn't change
eg. (20 -> 20)
2020-01-14 -> 2020-01-15

For some dates it jumps in the opposite direction
eg.(23 -> 19)
2020-01-21 -> 2020-01-22

For most of the other days though, it works fine so not sure what is causing discrepancies around these dates. I think it should work in principle, does anyone have any ideas on where this may be going wrong?

5 responses

I haven't worked out what it is, but I have a feeling it's something to do with this line:
days_since_last_bad_day = np.argmax(bad_day_flipped, axis=0)

I've done a few tests to check the outputs on each iteration and they seem to be as expected. However, I think there is something funky with the numpy operations on the above line. I have attached the same notebook, except this time I replaced np.argmax() with np.sum() and narrowed the time range to:
period_start = '2020-02-04'
period_end = '2020-02-07'

In the output I can see that from 2020-02-04 -> 2020-02-05 the sum goes up by 1. However, the only possible addition to the array is the new date, i.e. 2020-02-05, and if that were the case then bad_days should be true for that date and argmax() should drop to 0.

Not sure if the numpy operations are working as expected for this?


I believe your SMA calculation isn't doing what you wish it to do. You are looking ahead in time.

       # This takes the mean of the price 50 days ago through the current price (so far so good)  
       sma50cf = np.nanmean(last_close_price[0:-1], axis=0)

       # However, then it checks for any prices less than the mean
       # As an example, when looking at the close 50 days ago  
       # it compares the close to the mean of the future 50 days prices  
       # Probably not what we want  
       bad_day = last_close_price < sma50cf

One needs to use a rolling mean and not the total mean. When doing this, one would need a window length of 100 days and then start checking 50 days ago.
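For reference, a rolling mean can be sketched in plain numpy with a cumulative sum. This is only an illustration under stated assumptions: `trailing_sma` is a hypothetical helper (not part of Quantopian or numpy), `n = 3` stands in for 50, the prices are made up, and NaN prices are not handled. It shows why roughly twice the SMA length is needed: each of the last `n` rows needs `n` rows of history behind it.

```python
import numpy as np

def trailing_sma(prices, n):
    """Trailing n-row mean for every row that has a full n-row history.

    prices: 2D array, rows are days (earliest first), columns are assets.
    Returns an array with prices.shape[0] - n + 1 rows.
    """
    # Prepend a row of zeros so differences of the cumulative sum give window sums
    padded = np.vstack([np.zeros((1, prices.shape[1])), prices])
    csum = np.cumsum(padded, axis=0)
    return (csum[n:] - csum[:-n]) / n

n = 3  # stand-in for 50
prices = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # 2*n - 1 rows, one asset
sma = trailing_sma(prices, n)     # one SMA for each of the last n days
above = prices[n - 1:] > sma      # each close against its own trailing SMA
```

With this layout every close is compared only with prices on or before its own date.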

This could be done. However, I'd separate the problem in two. First, create a factor which calculates how far the current price is above (or below) its 50 day SMA. Second, create a factor which counts the number of consecutive positive values. This avoids messing around with a rolling function. Something like this for the first factor.

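 import numpy as np
 from quantopian.pipeline import CustomFactor
 from quantopian.pipeline.data.builtin import USEquityPricing

 class Amount_Above_SMA(CustomFactor):
     inputs = [USEquityPricing.close]
     window_length = 50
     # Set window_safe to True so it can be used as an input to another factor
     window_safe = True

     def compute(self, today, assets, out, close_prices):
         # Mean of the 50 closes in the window, one value per asset
         sma_close_prices = np.nanmean(close_prices, axis=0)
         # Latest close minus its 50 day SMA
         out[:] = close_prices[-1] - sma_close_prices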

Then, the second factor to count consecutive positive values.

 class Consecutive_Days_Above_Zero(CustomFactor):
     """Returns the number of previous days where the input is above zero."""
     # No defaults. Inputs and window_length must be set when instantiated.
     # Takes a single input
     def compute(self, today, assets, out, values):
         # The first row (0) in values is the earliest day
         # The last row (-1) is the latest
         # We want to use the numpy 'argmax' method to count from the latest day.
         # So, flip the values so they are in reverse order
         values_latest_to_earliest = np.flipud(values)

         # Now, a little trick: if 'argmax' doesn't find a True value it returns 0
         # However, we want it to return the window length if all are above zero
         # So, append a row of zeros to the end so there is always a match
         asset_count = values_latest_to_earliest.shape[1]
         zeros = np.zeros(asset_count)
         zero_array_row = zeros.reshape(1, -1)
         values_latest_to_earliest = np.append(values_latest_to_earliest, zero_array_row, axis=0)

         # Finally, use 'argmax' to find the first value less than or equal to zero
         out[:] = np.argmax(values_latest_to_earliest <= 0, axis=0)
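
To see the flip, sentinel row and 'argmax' steps in isolation, here is a small numpy-only sketch. The array is made-up data for three hypothetical assets with a window of four days. It is not Quantopian code, just the same array operations run outside a pipeline.

```python
import numpy as np

# Rows are days (earliest first), columns are three hypothetical assets
values = np.array([
    [ 1.0, -1.0, 1.0],
    [-2.0,  3.0, 2.0],
    [ 0.5,  1.0, 3.0],
    [ 1.5, -0.5, 4.0],   # latest day
])

latest_first = np.flipud(values)
# Sentinel row of zeros guarantees a non-positive value in every column
padded = np.append(latest_first, np.zeros((1, values.shape[1])), axis=0)
streak = np.argmax(padded <= 0, axis=0)
print(streak)  # [2 0 4]
```

The third asset is positive on every day, so its count is capped at the number of rows in the window (4 here). One caveat worth checking in real data: a NaN compares as False against `<= 0`, so with this logic a missing value would be counted as a day above zero.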

Attached is a notebook showing this. I believe this is what you were looking for?



Thanks Dan, just have 2 questions:

1) Why did you choose window_length 200 in the second custom factor 'Consecutive_Days_Above_Zero' - is that just an arbitrary max length if it's all above 0?
2) I see how your solution works, however part of me still doesn't totally understand why the original method failed.

A Pipeline is an object that represents computation we would like to perform every day.
window_length : An integer describing how many rows of historical data the Factor needs to be provided each day.

Therefore, I was under the impression that

sma50cf = np.nanmean(last_close_price[0:-1], axis=0)  

would get run every day, and I would get an SMA_50 for each day between the period start and end specified? (As each day the custom factor was run, it would take a window of window_length=50 ending at the evaluation date?)

However, if I understand correctly, your comment seems to suggest sma50cf is a static mean from (end period - 50 days)?

Therefore, when I do
bad_day = last_close_price < sma50cf
it compares the close price of each day within (end date - 50 days) to an SMA calculated from future prices?

In your solution, you seem to specify a 50 day window in 'Amount_Above_SMA' but then only take the last row and then put it into a custom factor with a much longer window_length (200 days?)

Please let me know if I am misunderstanding anything?

@ James Hoang I'll try to answer your questions.

First you asked "Why did you choose window_length 200 in the second custom factor 'Consecutive_Days_Above_Zero' - is that just an arbitrary max length if its all above 0?" Yes, that was just an arbitrary look back length. If the max days above zero you ever wanted to consider is less, then you could shorten that.

The second question "why the original method failed" is a little more involved.

It is correct that the compute function in a custom factor is executed each day. Let's take an example of how that works.

# Assume the pipeline is run from 2020-02-04 thru 2020-02-07  
# The following code would be executed 4 times with new data each day  
sma50cf = np.nanmean(last_close_price[0:-1], axis=0)  

# sma50cf would have the following values (using pseudocode) each day pipeline is run  
2020-02-04  mean(2020-2-3 - 50 days : 2020-2-3)  
2020-02-05  mean(2020-2-4 - 50 days : 2020-2-4)  
2020-02-06  mean(2020-2-5 - 50 days : 2020-2-5)  
2020-02-07  mean(2020-2-6 - 50 days : 2020-2-6)

Every day the pipeline is run, it fetches a new window of data. Also note the pipeline runs before the market opens each day, so it fetches the previous day's data. That is why the dates above are shifted by 1 day. For each day the value of 'sma50cf' is static. It's the mean of the last 50 days of close prices, counted back from the pipeline date. Then, the next day, it will be the mean of the last 50 days from the next pipeline date.
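
The daily windowing can be mimicked with plain numpy slicing. This is only a sketch, not how Pipeline is implemented: `window_for` is a hypothetical helper, days are plain integer indices, the closes are made up, and a window of 3 stands in for 50.

```python
import numpy as np

closes = np.arange(1.0, 11.0)   # closes for day 0 .. day 9 (made-up data)
window_length = 3               # stand-in for 50

def window_for(run_day):
    # The run on `run_day` sees closes up to and including run_day - 1
    return closes[run_day - window_length:run_day]

for run_day in (7, 8, 9):
    w = window_for(run_day)
    # One window and therefore one mean per run
    print(run_day, w, w.mean())
```

Each run gets its own window ending the day before, and within a single run the mean is a single fixed number.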

Now, let's look at the code for comparing if a day is above the 50 day SMA.

# Again, use the dates above. Let's look at the results of the following code  
bad_day = last_close_price < sma50cf

# The 'last_close_price' is a 2D numpy array.  
# The number of rows is specified by window_length (50) in this case  
# The number of columns is equal to the number of assets  
# 'sma50cf' is 1D numpy array.  
# It has one row with the mean value of the 50 days of prices.  
# There is one value for each asset  
# 'bad_day' will also be a 2D numpy array like 'last_close_price'  
# What do the rows of 'bad_day' look like?  
bad_day[2020-2-3] = last_close_price[2020-2-3] < mean(2020-2-3 - 50 days : 2020-2-3)  
bad_day[2020-1-31] = last_close_price[2020-1-31] < mean(2020-2-3 - 50 days : 2020-2-3)  
bad_day[2020-1-30] = last_close_price[2020-1-30] < mean(2020-2-3 - 50 days : 2020-2-3)  
bad_day[2020-1-29] = last_close_price[2020-1-29] < mean(2020-2-3 - 50 days : 2020-2-3)  

Note the above dates represent the pipeline being executed on 2020-2-4; they are what the compute function sees on that day. The first comparison makes sense: it compares the close on 2020-2-3 to mean(2020-2-3 - 50 days : 2020-2-3). However, the other comparisons do not. They compare an earlier close with a mean that includes later prices. Now, this can be done, but I am assuming it's not what is intended. I am assuming what is intended is the following:

#  'bad_day' should look like this  
bad_day[2020-2-3] = last_close_price[2020-2-3] < mean(2020-2-3 - 50 days : 2020-2-3)  
bad_day[2020-1-31] = last_close_price[2020-1-31] < mean(2020-1-31 - 50 days : 2020-1-31)  
bad_day[2020-1-30] = last_close_price[2020-1-30] < mean(2020-1-30 - 50 days : 2020-1-30)  
bad_day[2020-1-29] = last_close_price[2020-1-29] < mean(2020-1-29 - 50 days : 2020-1-29)  

Notice the mean values 'roll' depending upon which row in 'bad_day' is being evaluated. Each iteration of the pipeline does pass a new window of data to the compute function. However, as that function is being executed, the data is fixed for that day.
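
A tiny made-up example shows how the two comparisons can disagree. The assumptions here are a single asset, steadily rising closes, a window of 3 standing in for 50, and (for simplicity) a static mean taken over the whole window rather than the `[0:-1]` slice used in the original code.

```python
import numpy as np

n = 3                                            # stand-in for 50
closes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # steadily rising, made-up data
window = closes[-n:]                             # what compute() sees: [3, 4, 5]

# Original approach: one mean for the whole window, compared with every row
static_bad = window < window.mean()              # the mean is 4.0

# Intended approach: each close against the mean of the n closes ending that day
rolling_bad = np.array([
    closes[i] < closes[i - n + 1:i + 1].mean()
    for i in range(len(closes) - n, len(closes))
])

print(static_bad)   # [ True False False]
print(rolling_bad)  # [False False False]
```

The earliest row is flagged as a bad day by the static mean only because that mean includes later, higher prices. Against its own trailing mean it was never below.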

There isn't a good 'roll' function in numpy, which is why I didn't pursue an approach trying to implement the above 'roll' inside of the compute method. However, there is no need to 'roll' the data explicitly. The approach I suggested gets around that. It uses the fact that the pipeline effectively 'rolls' a new window of data and presents it to the compute function each day. The approach creates a factor which outputs the last (most recent) value each day. Then, by passing that factor to a second 'Consecutive_Days_Above_Zero' factor, the pipeline mechanism does the 'roll' work for us. Pipeline calculates the value of our factor for each day and presents these 'rolled' means to the 'Consecutive_Days_Above_Zero' factor.
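
The chaining can be imitated outside of Pipeline with two small functions. This is a numpy-only sketch for a single asset with made-up prices. The function names mirror the two factors but are plain hypothetical helpers, the loop plays the part of Pipeline calling compute once per day, and windows of 3 and 5 stand in for 50 and 200.

```python
import numpy as np

def amount_above_sma(closes, sma_len):
    """Stage 1: one value per day, that day's close minus its trailing SMA."""
    out = np.full(len(closes), np.nan)
    for day in range(sma_len - 1, len(closes)):
        window = closes[day - sma_len + 1:day + 1]
        out[day] = window[-1] - np.nanmean(window)
    return out

def consecutive_days_above_zero(values):
    """Stage 2: length of the run of positive values ending at the latest entry."""
    latest_first = np.append(values[::-1], 0.0)   # sentinel caps the count
    return int(np.argmax(latest_first <= 0))

closes = np.array([5.0, 4.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # made-up prices
stage1 = amount_above_sma(closes, sma_len=3)
streak = consecutive_days_above_zero(stage1[-5:])         # second window of 5 days
print(streak)  # 4
```

Stage 2 never computes a mean itself. It only receives the per-day results of stage 1, each of which used its own trailing window.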

Hope that makes sense?

Ah, when you put the dates like that it makes sense. Thank you!