Change in memory usage/availability?

I've been troubleshooting an algo I have in the contest. To enter the contest, I ran a 2-year backtest, which ran fine. However, now when I try to run the same code, it fails due to a memory error after ~ 6 months of elapsed backtest time.

One of my pipeline factors is:

# Assumes the usual Quantopian imports (CustomFactor, morningstar, numpy as np)
# plus a user-defined preprocess() helper, which is not shown here.
class Quality(CustomFactor):
    inputs = [morningstar.income_statement.gross_profit, morningstar.balance_sheet.total_assets]
    # window_length = 3*252
    window_length = 252

    def compute(self, today, assets, out, gross_profit, total_assets):
        # Gross profit scaled by total assets, z-scored over the trailing window.
        norm = gross_profit / total_assets
        out[:] = preprocess((norm[-1] - np.mean(norm, axis=0)) / np.std(norm, axis=0))

If I run the code with only this factor and window_length = 3*252, I still get the memory error. However, if I reduce the window length to window_length = 252, the backtest completes.
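
To get a feel for the scale involved, here is a rough back-of-the-envelope estimate. The ~8,000-asset universe size is my assumption, and it assumes Pipeline materializes the full (window_length x n_assets) array for each input term, which is how CustomFactor inputs are presented to compute():

# Rough memory estimate for one CustomFactor's input windows (float64 data).
WINDOW = 3 * 252        # trailing window in trading days
N_ASSETS = 8000         # assumed size of the full Quantopian equity universe
N_INPUTS = 2            # gross_profit and total_assets
BYTES_PER_VALUE = 8     # float64

total_bytes = WINDOW * N_ASSETS * N_INPUTS * BYTES_PER_VALUE
print("approx. %.1f MB per compute call" % (total_bytes / 1e6))
# ~97 MB with the 3*252-day window vs ~32 MB with a 252-day window;
# Pipeline's internal chunking over multiple simulation dates can add to this.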

Has anyone seen similar problems? Have there been system changes that would explain this behavior? I thought I saw murmurings of some pending improvements in how fundamental data are handled--were the changes implemented? If so, might this be the explanation?


I had timeouts repeatedly when I tried getting fundamental data for windows longer than 252*2 from inside Pipeline. One year is fine, though expensive... very slow for two years.

Posted in an EDIT as a postscript - see the attached Notebook for the very expensive 31s timing. I had to use get_fundamentals() outside Pipeline, which took 0.49s for the same data.

Karl
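
(For context, a research-notebook query along the lines Karl describes might look roughly like the sketch below. The get_fundamentals() argument order, the init_fundamentals() setup, and the range-specifier format are written from memory of the old Quantopian research API and should be treated as assumptions; the dates are placeholders.)

# Sketch: pulling the same fields outside Pipeline in a research notebook.
fundamentals = init_fundamentals()   # exposes the fundamentals namespace in research

fund_data = get_fundamentals(
    query(
        fundamentals.income_statement.gross_profit,
        fundamentals.balance_sheet.total_assets,
    ),
    '2017-06-01',   # base date (placeholder)
    '504d'          # ~2 years of trading days; range-specifier format assumed
)
# Returns a panel-like structure indexed by field, date, and security.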

Thanks Karl -

I'll keep playing around with it. Something clearly changed under the Quantopian hood, since I simply copied the contest algo back into the backtester, and could no longer run a 2-year backtest. It would be nice to hear from Q support on this one, since something was changed in the direction of badness.

I have run into similar issues.

Pipeline seems inherently unpredictable when it comes to how it batches requests together, so it is easy to run out of memory or time if you are doing something a little more elaborate in it. I've posted about a potentially very easy fix here, but haven't heard back from Q yet.

We're working on a revamp of the way that fundamental data is accessed in Pipeline. The change will bring faster queries and should add consistency to the behavior. More specifically, we are moving from a system that loads fundamental data from a database to a file-based system. One advantage of the file-based system is that data load times will be independent of the number of people requesting the data (the files will be distributed across the backtest servers). This should make query times more predictable. We're expecting to launch the new version soon.

I'm not yet sure if the new system will help with memory issues, but we can investigate.


I've had similar issues with algorithms that use fundamental histories; good to hear news on the update.

Thanks Jamie -

My main concern at this point is the unpredictability. For one of my contest entries, I thought it would be interesting to run a Pyfolio tear sheet and include the out-of-sample data. The link is: https://www.quantopian.com/live_algorithms/592eb4761760420010d11150.

I can no longer run even a 2-year backtest (which had been run automatically for me to enter the contest). In fact, I just tested the code, and it will only run for a few days of backtest time before encountering a memory error. This is speculation, but was something changed in preparation for your revamp of the way fundamental data is accessed in Pipeline that could have impacted memory availability?

Perhaps you could provide a little tutorial on how Pipeline, and the backtester in general, handles memory. For example, for the code below, when I call make_factors() and create combined_alpha, does the memory required to pull in the trailing windows of data and do the computations stay tied up, such that if I scale to a very large number of factors I'd have a memory problem? Or does the Pipeline API free the memory once the call to each custom factor's compute() is complete? (A rough estimate of the windows involved is sketched after the code below.)

Also, when you revamp the way that fundamental data is accessed in Pipeline, will you still chunk the Pipeline computations? Or will the file system be fast enough that chunking is no longer required (since you won't need to make slow calls to a remote database)? If you are reading from a local SSD, it should be pretty zippy. Or maybe you are loading up on RAM and can hold the entire file system there? How big is the fundamental database? Presumably chunking requires more memory, leaving less for the algo. It can also consume the 5-minute before_trading_start computational window in an unrealistic way: there is plenty of time available during live trading (I think), but due to backtest chunking, in practice the 5-minute window is not fully available (and in some cases is largely consumed).

# Assumes the standard Quantopian imports (CustomFactor, USEquityPricing,
# morningstar, stocktwits, numpy as np, pandas as pd) plus the user-defined
# preprocess() and get_weights() helpers, which are not shown here.
def make_factors():

    class OptRev5d(CustomFactor):
        inputs = [USEquityPricing.open, USEquityPricing.high, USEquityPricing.low, USEquityPricing.close]
        window_length = 5
        def compute(self, today, assets, out, open, high, low, close):
            p = (open + high + low + close) / 4
            m = len(p)
            # Accumulate the weighted estimates across lookbacks of increasing length.
            a_sum = np.zeros(close.shape[1])
            w_sum = np.zeros(close.shape[1])
            for k in range(1, m + 1):
                (a, w) = get_weights(p[-k:, :], close[-1, :])
                a_sum += w * a
                w_sum += w
            out[:] = preprocess(a_sum / w_sum)

    class OptRev30d(CustomFactor):
        inputs = [USEquityPricing.open, USEquityPricing.high, USEquityPricing.low, USEquityPricing.close]
        window_length = 30
        def compute(self, today, assets, out, open, high, low, close):
            p = (open + high + low + close) / 4
            m = len(p)
            a_sum = np.zeros(close.shape[1])
            w_sum = np.zeros(close.shape[1])
            for k in range(3, m + 1):
                (a, w) = get_weights(p[-k:, :], close[-1, :])
                a_sum += w * a
                w_sum += w
            out[:] = preprocess(a_sum / w_sum)

    class MessageSum(CustomFactor):
        inputs = [stocktwits.bull_scored_messages, stocktwits.bear_scored_messages, stocktwits.total_scanned_messages]
        window_length = 21
        def compute(self, today, assets, out, bull, bear, total):
            out[:] = preprocess(-(np.nansum(bull, axis=0) + np.nansum(bear, axis=0)))

    class Volatility(CustomFactor):
        inputs = [USEquityPricing.open, USEquityPricing.high, USEquityPricing.low, USEquityPricing.close]
        window_length = 3*252
        def compute(self, today, assets, out, open, high, low, close):
            p = (open + high + low + close) / 4
            price = pd.DataFrame(data=p, columns=assets)
            # Since we are going to rank with largest as best, invert the sdev.
            out[:] = preprocess(1 / np.log(price).diff().std())

    class Yield(CustomFactor):
        inputs = [morningstar.valuation_ratios.total_yield]
        window_length = 1
        def compute(self, today, assets, out, syield):
            out[:] = preprocess(syield[-1])

    class Momentum(CustomFactor):
        inputs = [USEquityPricing.open, USEquityPricing.high, USEquityPricing.low, USEquityPricing.close]
        window_length = 252
        def compute(self, today, assets, out, open, high, low, close):
            p = (open + high + low + close) / 4
            # 12-month return (skipping the most recent month) minus the 1-month return.
            out[:] = preprocess((p[-21] - p[-252]) / p[-252] -
                                (p[-1] - p[-21]) / p[-21])

    class Quality(CustomFactor):
        inputs = [morningstar.income_statement.gross_profit, morningstar.balance_sheet.total_assets]
        window_length = 3*252
        def compute(self, today, assets, out, gross_profit, total_assets):
            norm = gross_profit / total_assets
            out[:] = preprocess((norm[-1] - np.mean(norm, axis=0)) / np.std(norm, axis=0))

    return {
        'OptRev5d':   OptRev5d,
        'OptRev30d':  OptRev30d,
        'MessageSum': MessageSum,
        'Volatility': Volatility,
        'Yield':      Yield,
        'Momentum':   Momentum,
        'Quality':    Quality,
        }

# Elsewhere (e.g. in make_pipeline()), the factors are combined into a single
# alpha, masked by a previously defined `universe` filter:
factors = make_factors()
combined_alpha = None
for name, f in factors.iteritems():  # Python 2, as on Quantopian
    if combined_alpha is None:
        combined_alpha = f(mask=universe)
    else:
        combined_alpha = combined_alpha + f(mask=universe)
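
To put the scaling question above in concrete terms, here is an upper-bound style estimate of the raw input windows implied by this factor set, under the same assumptions as before (~8,000 assets, float64, and every factor's input windows held in memory independently). Actual Pipeline behavior, including term sharing, views, and chunking, may differ substantially:

# Rough estimate of input-window memory for the factors in make_factors().
N_ASSETS = 8000
BYTES_PER_VALUE = 8
factor_windows = {            # (window_length, number of inputs)
    'OptRev5d':   (5,     4),
    'OptRev30d':  (30,    4),
    'MessageSum': (21,    3),
    'Volatility': (3*252, 4),
    'Yield':      (1,     1),
    'Momentum':   (252,   4),
    'Quality':    (3*252, 2),
}
total = sum(w * n * N_ASSETS * BYTES_PER_VALUE for w, n in factor_windows.values())
print("approx. %.1f GB of raw input windows" % (total / 1e9))
# A bit under 0.4 GB under these assumptions; actual usage depends on how
# Pipeline shares terms across factors and how many dates it chunks together.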

"should add consistency to the behavior"

Hi Jamie -

In some respects, it would be preferable to have a completely deterministic, real-time system. If I've inferred your architecture correctly, you have N algos deployed on a given hardware server, so there will still be some variability based on the value of N and what each algo is doing at any given time (although for live trading, presumably, the algos all run in parallel, simultaneously, since you are triggering on the whole wall-clock minute). But I guess you can put an upper bound on things. The problem from an algo development standpoint is that you don't specify the headroom required in terms of execution time, and as my example above illustrates, memory is a wildcard, too. I suppose the paradigm is that if a user can run an M-year backtest, then live trading will fly. It would be preferable to know a priori, before even writing code, that it will work, versus having to circle back in the development process once the code fails the backtest verification step.

Jamie -

I'm curious - wouldn't you just chuck the fundamentals data into the same bin as all of your other data? What's special about the fundamental data? Or is it just a much larger data set than the rest, so that you need a special system?