Pipeline API Feature Request: retrieve single data points instead of a full array of data (window_length)

The Pipeline API is a great concept, but for the moment it is quite difficult to implement algorithms that use historical fundamentals data without running into an out-of-memory or timeout error.
I already described the problem in detail in this post:

Given that all the fundamentals in the Pipeline API are quarterly, you currently need to load about 260, 456 or even more days of data to compute annual TTM (trailing twelve months) figures, for example:

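import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import morningstar

quarter_lenght = 65  # approximate number of trading days in a quarter
# Row offsets of the four most recent quarters (trailing twelve months)
ttm    = [               -1,   -quarter_lenght, -2*quarter_lenght, -3*quarter_lenght]
# Row offsets of the four quarters before that (prior-year TTM)
ttm_py = [-4*quarter_lenght, -5*quarter_lenght, -6*quarter_lenght, -7*quarter_lenght]

class NetIncomeChange(CustomFactor):
    # The full window is loaded even though only 8 of its rows are used
    window_length = 7*quarter_lenght + 1
    inputs = [morningstar.income_statement.net_income]
    def compute(self, today, assets, out, net_income):
        net_income_ttm = np.sum(net_income[ttm], axis=0)
        net_income_ttm_py = np.sum(net_income[ttm_py], axis=0)
        # Year-over-year change in TTM net income, one value per asset
        out[:] = (net_income_ttm - net_income_ttm_py) / net_income_ttm_py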

My suggestion to avoid excessive memory consumption is to also allow specifying only the indexes of the required data points instead of the full window length, for example something like this:

window_datapoints = [-1, -quarter_lenght, -2*quarter_lenght, -3*quarter_lenght, -4*quarter_lenght, -5*quarter_lenght, -6*quarter_lenght, -7*quarter_lenght]

would surely consume far fewer resources than the current:
window_length = 7*quarter_lenght + 1
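
To make the difference concrete, here is a minimal sketch in plain Python. It is not an existing Pipeline feature: window_datapoints is only the attribute proposed above, I am assuming its first index is meant to be -1 (matching ttm), and net_income_change, full_window and the sample numbers are hypothetical stand-ins for what compute() would receive.

```python
# Sketch only: plain lists stand in for the (days x assets) array that
# Pipeline passes to compute(). Everything except quarter_lenght,
# window_length and window_datapoints is a hypothetical name.
quarter_lenght = 65  # spelled as in the post
window_length = 7 * quarter_lenght + 1  # rows loaded today
# Proposed data points, assuming the first index is -1 (most recent day)
window_datapoints = [-1] + [-k * quarter_lenght for k in range(1, 8)]


def net_income_change(points):
    """Year-over-year TTM change from 8 rows ordered most recent first.

    Each row is a list with one net income value per asset.
    """
    n_assets = len(points[0])
    ttm = [sum(row[a] for row in points[:4]) for a in range(n_assets)]
    ttm_py = [sum(row[a] for row in points[4:]) for a in range(n_assets)]
    return [(t - p) / p for t, p in zip(ttm, ttm_py)]


# Made-up full window: 456 days x 2 assets
full_window = [[100.0 + day, 50.0] for day in range(window_length)]

# Only these 8 rows are actually needed by the calculation
points = [full_window[i] for i in window_datapoints]
change = net_income_change(points)

# Rows that are loaded today but never read
rows_saved = window_length - len(window_datapoints)
```

In this toy setup the factor reads 8 of the 456 rows, so 448 rows per asset are loaded only to be skipped.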

What do you think? Is it doable?


1 response

I would like something like this too! Maybe an alternative input syntax could be similar to normal Python slicing, with the addition of a total data point count? E.g.:
window =
    window_length / number of data points (e.g. 4 data points),
    time between steps (e.g. 21 days, 250 days, etc.),
    start point relative to today (i.e. today minus x days)

This would allow flexibly fetching the data points you want over a longer time frame, at the resolution you want, while avoiding a lot of redundant data points (and thus memory errors).
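
A rough sketch of how such a (count, step, start) specification could be turned into row indices is below. The helper names window_indices and required_window_length are hypothetical, not part of the Pipeline API, and I am assuming that a start point of 1 means the most recent row.

```python
def window_indices(n_points, step, start_offset=1):
    """Return negative row indices, most recent first.

    n_points:     how many data points to fetch
    step:         days between consecutive data points
    start_offset: 1 means the most recent row, 2 the row before it, ...
    """
    if n_points < 1 or step < 1 or start_offset < 1:
        raise ValueError("n_points, step and start_offset must be >= 1")
    return [-(start_offset + k * step) for k in range(n_points)]


def required_window_length(indices):
    """Smallest full window that would contain all the given indices."""
    return max(-i for i in indices)


# Four monthly points, as in the example above
monthly = window_indices(4, 21)

# Eight quarterly points, as in the original post
quarterly = window_indices(8, 65)
full_length = required_window_length(quarterly)
```

On an array that is already loaded, the same rows can be picked with ordinary slicing, window[-start_offset::-step][:n_points], but that still requires the full window in memory, which is exactly what the request is trying to avoid.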