Forward filling nans in pipeline

An effort to forward fill nans in pipeline, adapted from stackoverflow.
Is there an easier way?
If so, would this route still offer more flexibility for improvements? For instance, with a large window_length, nans could be left in place so that the stock is excluded later with pipeline_output('p').dropna() ...

  • if there is a solid run of nans at the beginning or end of the window longer than a certain number.
  • if the count of interspersed nans exceeds a certain ratio.
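A rough sketch of how those two checks might look, assuming the window is a (window_length, n_assets) array with one column per stock. The names nan_profile and fillable, the default thresholds, and the choice to measure the ratio against the full window length are all hypothetical, just for illustration:

```python
import numpy as np

def nan_profile(arr):
    """Summarise nans per column of a (window_length, n_assets) array."""
    mask = np.isnan(arr)
    n_rows = mask.shape[0]
    valid = ~mask
    has_valid = valid.any(axis=0)
    # argmax finds the first True; an all-nan column counts as one full-length run
    leading = np.where(has_valid, valid.argmax(axis=0), n_rows)
    trailing = np.where(has_valid, valid[::-1].argmax(axis=0), n_rows)
    # nans lying between the first and last valid value of the column
    interior = np.where(has_valid, mask.sum(axis=0) - leading - trailing, 0)
    return leading, trailing, interior

def fillable(arr, max_edge_run=3, max_interior_ratio=0.25):
    """Boolean mask of columns that pass both checks (thresholds are arbitrary)."""
    leading, trailing, interior = nan_profile(arr)
    n_rows = arr.shape[0]
    return ((leading <= max_edge_run)
            & (trailing <= max_edge_run)
            & (interior / float(n_rows) <= max_interior_ratio))
```

Columns where fillable is False could be set to nan in out so that dropna() removes them later.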

Example use:

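import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class Quality(CustomFactor):
    #inputs = [mstar.income_statement.gross_profit, mstar.balance_sheet.total_assets]
    inputs = [Fundamentals.gross_profit, Fundamentals.total_assets]
    window_length = 24
    def compute(self, today, assets, out, gross_profit, total_assets):
        # inputs arrive as (window_length, n_assets) arrays: rows are days, columns are stocks
        norm = gross_profit / total_assets
        norm = nanfill(norm)
        # z-score of the latest value against the window, per stock
        out[:] = (norm[-1] - np.mean(norm, axis=0)) / np.std(norm, axis=0)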

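def nanfill(arr):
    # arr is (window_length, n_assets): fill down each column, i.e. forward in time
    mask = np.isnan(arr)
    # row index of each valid cell, 0 where nan
    idx = np.where(~mask, np.arange(mask.shape[0])[:, None], 0)
    # running max gives the row of the most recent valid value
    np.maximum.accumulate(idx, axis=0, out=idx)
    arr[mask] = arr[idx[mask], np.nonzero(mask)[1]]
    return arr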

Test to be certain.
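One way to test might be to compare against a slow but obvious loop on a small array. This is only a sketch: ffill_reference and same_fill are hypothetical helper names, and it assumes the pipeline layout of one row per day and one column per stock.

```python
import numpy as np

def ffill_reference(arr):
    """Slow, obvious forward fill down each column, for checking a faster version."""
    out = arr.copy()
    n_rows, n_cols = out.shape
    for col in range(n_cols):
        last = np.nan                      # nothing to carry forward yet
        for row in range(n_rows):
            if np.isnan(out[row, col]):
                out[row, col] = last       # stays nan until a first value is seen
            else:
                last = out[row, col]
    return out

def same_fill(fill_func, arr):
    """True if fill_func gives the same result as the reference loop."""
    return np.allclose(fill_func(arr.copy()), ffill_reference(arr), equal_nan=True)

nan = np.nan
window = np.array([
    [1.0, nan, 10.0],
    [nan, 5.0, nan],
    [3.0, nan, nan],
    [nan, 7.0, 40.0],
])
print(ffill_reference(window))
```

Calling same_fill(nanfill, window) on a few arrays with nans in different positions shows whether the vectorized version fills down each stock's column. Leading nans are expected to remain, since nothing is backfilled.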

2 responses

Why would Fundamentals.total_revenue require forward filling? Are there companies for which total_revenue is not reported by the company? Or are these errors in the Fundamentals database?

The sample factor was an arbitrary choice; something simple seemed like a good idea. I have since changed the example to Quality, which involves mean() and std().

To count nans and log the array before and after they are replaced:

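def nanfill(arr):
    # arr is (window_length, n_assets): fill down each column, i.e. forward in time
    nan_num = np.count_nonzero(np.isnan(arr))
    if nan_num:
        log.info(nan_num)
        log.info(str(arr))      # before
    mask = np.isnan(arr)
    # row index of each valid cell, 0 where nan
    idx = np.where(~mask, np.arange(mask.shape[0])[:, None], 0)
    # running max gives the row of the most recent valid value
    np.maximum.accumulate(idx, axis=0, out=idx)
    arr[mask] = arr[idx[mask], np.nonzero(mask)[1]]
    if nan_num:
        log.info(str(arr))      # after
    return arr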

In my experience, backtests with nans forward filled this way have shown some improved performance.
Try for example with factors in the Notebook at https://www.quantopian.com/posts/faster-fundamental-data

For instance, on 2016-09-07 I saw a count of 2411 nans with LongTermDebtRatioChange() using Q500US. Presumably those are then forward filled if this is behaving itself (except for any at the beginning of the window, since this doesn't do backfill). However, the code is complex, so I'd like to have it checked, modified, quashed and/or verified by someone.

Is it reasonable to fill them, or are there times when filling could produce false signals, such as when the percentage of nans is high? Imagine a percentage threshold built in. How could that be implemented? It would have to count and treat each stock individually inside the ndarray. It may also be beneficial to have a switch for turning the overall replacement of nans on and off for comparison, perhaps simply if not do_nanfill: return arr. Note that out is assigned; I applied the change mentioned at that stackoverflow link, and even though the results seem ok at a glance, I'm not 100% certain.
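A possible shape for the threshold and the switch, as a sketch only. guarded_fill, simple_ffill, max_nan_ratio and the 0.5 default are hypothetical, and it assumes one column per stock so that the nan share can be counted per column:

```python
import numpy as np

def simple_ffill(arr):
    """Plain forward fill down each column, one row at a time."""
    out = arr.copy()
    for row in range(1, out.shape[0]):
        gap = np.isnan(out[row])
        out[row, gap] = out[row - 1, gap]   # carry the previous (already filled) row
    return out

def guarded_fill(arr, fill_func=simple_ffill, max_nan_ratio=0.5, do_nanfill=True):
    """Fill only the columns whose share of nans is within max_nan_ratio."""
    if not do_nanfill:
        return arr                           # switch off for comparison runs
    nan_ratio = np.isnan(arr).mean(axis=0)   # per-stock share of nans in the window
    ok = nan_ratio <= max_nan_ratio
    out = arr.copy()
    if ok.any():
        # boolean column indexing copies, so fill_func may modify its input freely
        out[:, ok] = fill_func(arr[:, ok])
    return out
```

Stocks above the threshold keep their nans, so mean() and std() return nan for them and they drop out at dropna(). Running a backtest once with do_nanfill=True and once with False would give the comparison.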