Custom Pipeline factor - how to access close ndarray with each asset?

Hi, I'm trying something out with the pipeline code: an Exponential Regression factor that calculates the slope of each price series so I can rank them. However, I'm finding it very difficult to access each row of the close ndarray that is passed in for the calculation.

Sorry, I'm a bit new to numpy/scipy, so perhaps there is a simple way to run a linear regression over the whole array in one function call, but I couldn't see how to do it. The problem is that I can't access each row of this array using the assets list the factor is passed, because I can't find a way to look each asset up by label.
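On the question of regressing over the whole array at once: a minimal numpy-only sketch, assuming the data is laid out as one column per asset and one row per day. The function name `slopes_and_r2` and the sample values are made up for illustration; it uses the closed-form least-squares formulas rather than `stats.linregress`.

```python
import numpy as np

def slopes_and_r2(y):
    """Regress every column of y (shape: days x assets) on x = 1..days.

    Returns (slope, r_squared), each with one entry per column.
    """
    days = y.shape[0]
    x = np.arange(1, days + 1, dtype=np.float64)
    xc = x - x.mean()            # centered x, shape (days,)
    yc = y - y.mean(axis=0)      # each column centered, shape (days, assets)
    sxy = xc @ yc                # sum of x*y deviations, one per column
    sxx = xc @ xc                # sum of squared x deviations (scalar)
    syy = (yc ** 2).sum(axis=0)  # sum of squared y deviations, one per column
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared

# Two made-up columns: an exact line with slope 2, and a noisier series.
y = np.column_stack([
    2.0 * np.arange(1, 6) + 1.0,
    np.array([1.0, 3.0, 2.0, 5.0, 4.0]),
])
slope, r2 = slopes_and_r2(y)
print(slope, r2)
```

A column that never changes makes `syy` zero, so its `r_squared` comes out as NaN (with a runtime warning); that case would need handling separately.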

my code snippet:

    def compute(self, today, assets, out, close):  
        x_index = pd.Series(range(self.window_length))+1  
        close_log_returns = np.diff(np.log(close))  
        scores=[]  
        for asset in assets:  
            asset_returns = close_log_returns[:,asset] # this is breaking when I get to a larger number as asset is a sid like 46549 and this is not in the index  
            slope, intercept, r_value, p_value, std_err = stats.linregress(x_index, asset_returns)  
            score = slope * np.sqrt(252) * r_value**2  
            if score == score:  
                scores.append([asset, score])  
        out[:] = scores  

I get an error like this:

IndexError: index 8388 is out of bounds for axis 1 with size 8388  
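A small sketch of what is probably happening, assuming (as in Quantopian's pipeline) that `assets` holds sids and that column j of `close` belongs to `assets[j]`. The arrays below are made-up stand-ins, not real data.

```python
import numpy as np

# Hypothetical stand-ins for what compute() receives:
# 5 days of closes for 3 assets, plus the matching sids.
close = np.array([
    [10.0, 20.0, 30.0],
    [11.0, 21.0, 29.0],
    [12.0, 22.0, 28.0],
    [13.0, 23.0, 27.0],
    [14.0, 24.0, 26.0],
])
assets = np.array([24, 8554, 46549])  # labels, not column positions

# Using a sid as a column index fails: the array only has columns 0..2.
try:
    close[:, 46549]
except IndexError as exc:
    error = str(exc)

# If a lookup by sid is really needed, find the column position first.
col_pos = int(np.flatnonzero(assets == 46549)[0])
column = close[:, col_pos]
print(error)
print(col_pos, column)
```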

Sorry, it's late in the evening/morning here and any help would be appreciated!

Thanks

Michael

12 responses

I sorted it finally.

Michael,

I'm very glad that you managed to figure this out! Would you be willing to post your solution? There may be community members in the future who come across a similar roadblock and could really learn from you.


    class ExponentialRegression(CustomFactor):
        [setup stuff]

        def compute(self, today, assets, out, close):
            scores = np.empty(len(close.T)+1, dtype=np.float64)
            for col in close.T:
                [do something on col and add it to scores]
            out[:] = scores

That's it. numpy also has a function for this, np.apply_along_axis, but in my tests it was marginally slower for some reason.
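For comparison, a rough sketch of the two approaches side by side. `some_calc` and the sample prices are hypothetical placeholders; the scores array is sized to exactly one slot per column, since `out` has one slot per asset.

```python
import numpy as np

def some_calc(col):
    # Hypothetical per-asset calculation: total log return over the window.
    return np.log(col[-1] / col[0])

# Made-up closes: 3 days (rows) for 3 assets (columns).
close = np.array([
    [10.0, 20.0, 40.0],
    [11.0, 20.0, 30.0],
    [12.1, 20.0, 20.0],
])

# Explicit loop: iterating over close.T yields one column (asset) at a time.
loop_scores = np.empty(close.shape[1], dtype=np.float64)
for i, col in enumerate(close.T):
    loop_scores[i] = some_calc(col)

# Same result with apply_along_axis: axis=0 hands some_calc each column.
apply_scores = np.apply_along_axis(some_calc, 0, close)
print(loop_scores, apply_scores)
```

Both produce one score per asset, in the same order as the columns of `close`.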

Hope that helps others.

Thanks for this post; I had a hard time figuring out the same issue. I wish the compute API were passed one symbol's data at a time instead of an array covering all symbols.

Hi Michael,

How did you put data into scores?

Thanks Michael, that helped me a lot.

@Yingzhong: I managed to add data into scores with something like the code below. I hope it makes sense (note that I had to remove the +1 from the length of the empty np array declaration). There's probably a better way of getting the index of the current column?

    class CF1(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 250
        def compute(self, today, assets, out, close):
            scores = np.empty(len(close.T), dtype=np.float64)
            i = 0
            for col in close.T:
                scores[i] = Somecalc(col)
                i += 1
            out[:] = scores

@yingzhong - exactly as Charles illustrated above. Just use the i variable as an index. It's the fastest way. You can match each entry back to the assets variable later (entry i goes with assets[i]), but in reality this is probably all you need to do in the function to pass back the calculated data.
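One possible way to get the column index without a manual counter is `enumerate`, and the scores can be matched back to `assets` by position afterwards. A sketch with made-up sids and prices; the calculation inside the loop is only a placeholder.

```python
import numpy as np

assets = np.array([24, 8554, 46549])   # made-up sids
close = np.array([
    [10.0, 20.0, 30.0],
    [12.0, 22.0, 28.0],
    [14.0, 24.0, 26.0],
])

scores = np.empty(close.shape[1], dtype=np.float64)
for i, col in enumerate(close.T):       # i is the column position
    scores[i] = col[-1] / col[0] - 1.0  # placeholder: simple return over the window

# scores[i] lines up with assets[i], so pairing them is a plain zip.
score_by_sid = dict(zip(assets.tolist(), scores.tolist()))
best_sid = int(assets[np.argmax(scores)])
print(score_by_sid, best_sid)
```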

Hi Michael, would you mind sharing the sorted exponential regression custom factor also?

@Kenneth P. it's just the block of code at the top wrapped in the new loop style. There were no changes to that part. Instead of scores.append I used the i index to do scores[i] = score, although you might be able to get away with adding them all without the NaN check.

Try it out in a custom factor and a simple algo that runs for 1 or 2 days, and just print the first few rows of the close array or the pipeline output to see how it works. Or use the debugger.

I ended up prototyping the factor in an IPython notebook first, using a simple numpy array with values I knew would give me known answers, to validate that the factor logic worked before I spent the time wiring it into the algo on Quantopian. Think xUnit-style testing: my background is financial systems development, and you need to test things as you write them. (trust but validate...) :-)
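A sketch of what such a known-answer check might look like. The function `exp_regression_slope` is a hypothetical stand-in that fits a line to log prices with `np.polyfit` (one common reading of "exponential regression"); swap in whatever the factor actually computes.

```python
import numpy as np

def exp_regression_slope(prices):
    """Slope of a straight-line fit to log(prices) against x = 1..n."""
    x = np.arange(1, len(prices) + 1)
    slope, _intercept = np.polyfit(x, np.log(prices), 1)
    return slope

# Known-answer inputs: the expected slope is obvious by construction.
t = np.arange(1, 51)
growing = 100.0 * np.exp(0.01 * t)  # log price rises 0.01 per day
flat = np.full(50, 100.0)           # price never moves

growing_slope = exp_regression_slope(growing)
flat_slope = exp_regression_slope(flat)

assert np.isclose(growing_slope, 0.01)
assert np.isclose(flat_slope, 0.0)
```

Once checks like these pass in a notebook, the same function body can be moved into the factor's compute method.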

@Michael, thanks. Unfortunately my implementation is still breaking at the definition of asset_returns with the same error as above, and I can't seem to figure out why.

    def compute(self, today, assets, out, close):  
        scores = np.empty(len(close.T)+1, dtype=np.float64)  
        x_index = pd.Series(range(self.window_length))+1  
        close_log_returns = np.diff(np.log(close))  
        i = 0  
        for col in close.T:  
            asset_returns = close_log_returns[:, assets] #breaking  
            slope, intercept, r_value, p_value, std_err = stats.linregress(x_index, asset_returns)  
            score = slope * np.sqrt(252) * r_value**2  
            scores[i] = score  
            i += 1  
        out[:] = scores  

IndexError: index 8414 is out of bounds for axis 1 with size 8408  

You're looping over the transposed close.T but indexing into close_log_returns.

How is that going to work? They are different objects with different shapes. Better to loop over the object you are interested in...

        close_log_returns = np.diff(np.log(close))  
        i = 0  
        for col in close_log_returns.T:  
            asset_returns = do_something_funky_to_my(col) # assets is a wild goose chase, forget about it....  
            slope, intercept, r_value, p_value, std_err = stats.linregress(x_index, col) # is this not what you want?  
            ... then the rest ...  

You probably don't even need the asset_returns variable unless you are transforming the column again after taking the log returns; just operate on the col you get back from the loop over the transposed array.
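Putting the pieces of this thread together, a hedged sketch of the scoring logic as a standalone function, so it can be tested outside pipeline. The function name `regression_scores` is made up. Two things here are assumptions rather than anything confirmed in the thread: `axis=0` is passed to `np.diff` so that returns are taken along time (the default, `axis=-1`, differences across assets), and `x_index` is sized to the number of returns, which is one fewer than `window_length`.

```python
import numpy as np
from scipy import stats

def regression_scores(close):
    """Score each column of close (shape: window_length x n_assets)."""
    # Assumption: axis=0 so the differences run along time, not across assets.
    log_returns = np.diff(np.log(close), axis=0)
    # One x value per return (window_length - 1 of them).
    x_index = np.arange(1, log_returns.shape[0] + 1)

    scores = np.full(log_returns.shape[1], np.nan)
    for i, col in enumerate(log_returns.T):
        if np.isnan(col).any():
            continue  # leave NaN for assets with missing prices
        slope, intercept, r_value, p_value, std_err = stats.linregress(x_index, col)
        scores[i] = slope * np.sqrt(252) * r_value ** 2
    return scores
```

Inside the factor, compute would then reduce to `out[:] = regression_scores(close)`.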

@Kenneth is that a solution for you?

We should put a Stack Overflow-style voting system in here so people can say whether or not a reply is the correct answer. It would be nice to know people value the time and effort you put into giving free online tech support....