Backtest Speed Comparison Of Data.History And Pipeline

I've diligently searched for information about whether and when a pipelined approach to algorithm construction has superior backtest speed compared to the more conventional approach of using data.history. One of the few nuggets of wisdom I found on the subject was posted by Dan Whitnable in March of 2019:

"The main raison d'être of pipeline is to speed up the data fetches. Using the data.history method to fetch the past 20 days prices every day for 100 days results in 100 separate database calls. However, much of the historical data each day will be the same each day. One keeps fetching the same data over and over again. Pipeline improves this situation by 'chunking' the total timeframe (ie 100 days) into smaller chunks and may therefore only call the database a couple of times to get the first 50 and then the next 50 days of data. So, the guideline is if speed is important then use pipeline."

Everything I've read in this community tacitly assumes that a pipelined algorithm will always be faster than one that doesn't employ a pipeline -- for every workload. Consequently, I never would have guessed that a non-pipelined algorithm could be almost twice as fast for some workloads.

But this is what I found after performing my own tests. I implemented two versions of an algorithm in the IDE (one using data.history and one using pipeline) that trades a single security once daily based on three moving-average crossovers -- fast (nom. 7-day sma), medium (nom. 50-day sma), and slow (nom. 200-day sma).

This essentially emulates the Golden Cross strategy with one enhancement: the algorithm seeks to enter soon after the 50-day trough following a Death Cross in order to capture the otherwise "lost" portion of recovery gains until the 50-day crosses above the 200-day.

The Python code executed once daily for the pipeline version is highly efficient, consisting solely of 3 assignment statements and 1 if/elif statement pair with an embedded dataframe query. The data.history version is similarly spartan, consisting solely of 3 data.history requests, 1 assignment statement, and 1 if/elif statement pair. Sketches of both versions appear below.
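For concreteness, here is a minimal sketch of the two versions. The choice of SPY, the scheduling details, and the simplified entry/exit rules are illustrative placeholders -- the actual trough-entry refinement is omitted. First the data.history version:

    def initialize(context):
        context.asset = sid(8554)  # SPY, assumed here for illustration
        schedule_function(rebalance, date_rules.every_day(),
                          time_rules.market_open(minutes=30))

    def rebalance(context, data):
        # Three data.history fetches, one mean per moving average.
        fast = data.history(context.asset, 'price', 7, '1d').mean()
        medium = data.history(context.asset, 'price', 50, '1d').mean()
        slow = data.history(context.asset, 'price', 200, '1d').mean()
        if fast > medium > slow:
            order_target_percent(context.asset, 1.0)
        elif fast < medium:
            order_target_percent(context.asset, 0.0)

And the pipeline version, where the sma computation moves into the pipeline and the daily code reduces to a dataframe lookup:

    from quantopian.algorithm import attach_pipeline, pipeline_output
    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data import USEquityPricing
    from quantopian.pipeline.factors import SimpleMovingAverage
    from quantopian.pipeline.filters import StaticAssets

    def initialize(context):
        context.asset = sid(8554)  # SPY, assumed here for illustration
        def sma(n):
            return SimpleMovingAverage(inputs=[USEquityPricing.close],
                                       window_length=n)
        pipe = Pipeline(
            columns={'fast': sma(7), 'medium': sma(50), 'slow': sma(200)},
            screen=StaticAssets([context.asset]),
        )
        attach_pipeline(pipe, 'smas')
        schedule_function(rebalance, date_rules.every_day(),
                          time_rules.market_open(minutes=30))

    def before_trading_start(context, data):
        context.smas = pipeline_output('smas')

    def rebalance(context, data):
        row = context.smas.loc[context.asset]  # the embedded dataframe query
        if row['fast'] > row['medium'] > row['slow']:
            order_target_percent(context.asset, 1.0)
        elif row['fast'] < row['medium']:
            order_target_percent(context.asset, 0.0)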

I performed multiple runs using both a 4-year and an 11-year backtest time-frame, all run in the overnight hours so as to minimize the effects of system loading on the results.  The elapsed times reported are the best run times obtained:

    method          4 Years    11 Years
    data.history     52 sec     151 sec
    pipeline         93 sec     254 sec
    difference        -43%        -40%

These elapsed times include initialization (and the initial pipeline fetch), which accounted for 12-14 sec in every figure above.

Both versions of the algorithm experienced nearly identical gains and drawdowns during backtests; the slight deviations are attributable to minor differences between the sma outputs of data.history(...).mean() and the pipeline's SimpleMovingAverage factor.

It seems reasonable to conclude that, for algorithms trading a limited set of securities based largely on a limited set of price-derived signals, the data.history approach provides superior backtest performance.

To establish the crossover point where it may become advantageous to use a pipeline approach, I incrementally increased the number of securities (each with 3 sma calculations) in each version of the algorithm, backtesting over a 4-year time-frame. I found that the data.history version's performance remained superior until 30 total sma calculations (10 securities) were reached -- far more than most algorithms are likely to require. A sketch of how the data.history version scales appears below.
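Scaling the data.history version was straightforward, along these lines (the asset list and equal weighting here are illustrative):

    def rebalance(context, data):
        # Passing a list of assets returns a DataFrame (dates x assets),
        # so three fetches still cover every security; .mean() then
        # yields one sma per asset.
        fast = data.history(context.assets, 'price', 7, '1d').mean()
        medium = data.history(context.assets, 'price', 50, '1d').mean()
        slow = data.history(context.assets, 'price', 200, '1d').mean()
        for asset in context.assets:
            if fast[asset] > medium[asset] > slow[asset]:
                order_target_percent(asset, 1.0 / len(context.assets))
            elif fast[asset] < medium[asset]:
                order_target_percent(asset, 0.0)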

The pipeline offers unparalleled functionality in the area of security selection, but for the market timing of trades, data.history still has a lot to recommend it. It appears that a decision to use a pipeline approach to algorithm construction would more rationally be motivated by a requirement for extensive security-selection functionality than by a "need for speed".