batch transform testing - trailing window updated minutely

I'm trying to write a custom moving average function that will update every trading minute and use a trailing window of minutely ticks. Under the minutely backtest, it appears that I am able to extract a trailing window of 390 ticks with this code:

@batch_transform(refresh_period=0, window_length=1)  
def get_prices(datapanel, sids):  
    return datapanel['price'][sids].values  

The window is updated every minute (due to the setting refresh_period=0).

Note that the code won't execute under the daily backtester, due to this line:

log.debug(' prices: ' + str(prices[388]))  

But, if I ignore the error, I can still run under the minutely backtest.

The problem is that the minutely backtester runs this simple algorithm very slowly. Any idea why? I tried commenting out the logging, but it didn't help.

20 responses

A minor update...I added logging of the time duration to return a result from the batch transform:

    t = time.time()  
    prices = get_prices(data, context.stocks)  
    elapsed = time.time() - t  

It takes about 0.2 seconds to get the prices...this largely accounts for the sluggishness (since the call is every minute of the backtest...390 times per typical trading day). It's not a big deal for me at this point, but perhaps someone can suggest a more efficient way of extracting the data. Quantopian folks, since this is all open source and you like to be transparent, what is going on "under the hood" when the batch transform is called? Is there some "cloud computing" operation that requires communication between remotely located servers (e.g. the backtester & the database)? If so, a ~200 ms latency might make sense.

@Grant, there should not be any network operations under the hood of your batch_transform. Eddie is in the middle of a performance evaluation he started last week, so your timing for this report is wonderful. He should have some insight into this by the end of next week, I think.

@Grant: Yes, this will be very slow. The reason is that we are recreating the data panel that gets passed to your batch_transform every minute -- huge overhead. That's the reason there is a refresh_period parameter in the first place. You might want to consider writing an iterative implementation of your transform, which will be much faster. What transform do you want to implement with this?

The previous post was from me, but I wasn't logged in.

Hi Thomas,

Thanks for the guidance. I figured I'd start by seeing if I could write a custom weighted moving average that updates every minute, based on a trailing window of minutely prices. The idea is to use the moving average in an algorithm that makes a trading decision every minute (e.g. the OLMAR one we implemented). The batch transform seems like the way to go, but I am open to alternatives. I suppose that I could store the trailing tic data in a numpy array, but that sorta goes against the idea of developing the batch transform in the first place, right? Any way to speed it up?

Regarding the slowness of the batch transform for this implementation, it is not a fundamental problem, so long as the overall backtest execution time scales linearly. I can just estimate how long it will take to run and return to it when it is done.

Hi Grant,

The iterative transform would be perfect for this (and our existing mavg transform is written in an iterative fashion and quite fast). The batch transform was really designed to give you multiple SIDs in an easy way. Unfortunately the interface to the iterative transform is still a little rough. We will eventually refactor that but it's not too high up our list currently. If you want, you can look at the example for the moving average here: The action is happening inside of MovingAverageEventWindow. Let me know if you have further questions.

Thanks Thomas,

I'm not too keen on the batch transform, anyway, since missing tics are patched up for thinly traded securities (as I understand, this will be fixed at some point). I'm not sure what you mean by an "iterative transform"--I'm guessing that for each tic you drop the oldest price from the moving average and add in the latest one. I'll take a look at the link you provided above to see if it provides guidance.



An iterative transform essentially uses the previously computed values instead of recomputing everything on each iteration.

I have a pull request with quite some enhancements to the batch_transform (including access to the underlying event list if you need speed, but also more general speed-ups). Here is the new doc string for what will be possible (not included in quantopian yet).

If I'm following, it sounds as if you propose a pass-by-reference switch in the batch transform:

create_panel : bool
If False, will create a pandas panel every refresh
period and pass it to the user-defined function.
If True, will pass the underlying deque reference
directly to the function which will be significantly

If so, would this present a risk of the tic data being modified by the algorithm?

Yeah, one could do that. I don't really see a problem with that however. If someone wants to shoot himself in the foot he's free to do so :).

Just thought I'd mention it. Supposing someone made a mistake, would the data get corrupted only when the algorithm is running, or would it somehow persist? Might be worth thinking this through, since a real confusing mess could be created.

That's true, thanks for pointing that out!

A similar question arose on the zipline mailing list, where I gave an example of why an iterative implementation of, e.g., a moving average will be faster:

Well the main reason is that for each computation, the data (which is stored in a list) has to get copied to a pandas data frame. Then you compute the mean over the whole pandas data frame. While that's OK there is a much faster alternative:

One simply computes a running sum. Each new price that enters the window gets added to that sum, and each price that drops out of the window gets subtracted. If you then want the moving average, you simply divide that sum by the window length. Note that we didn't have to copy the data into a dataframe, and we only need one addition and one subtraction for each new event instead of summing the complete data frame for each new event. This is also called dynamic programming.
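A minimal sketch of the running-sum idea in plain Python. The class name `RunningMean` and its `update` method are hypothetical names chosen for illustration; this is not zipline's MovingAverageEventWindow. Until the window fills, the sketch divides by the number of prices seen so far rather than by the full window length.

```python
from collections import deque


class RunningMean:
    """Trailing-window mean maintained with a running sum (hypothetical helper)."""

    def __init__(self, window_length):
        self.window_length = window_length
        self.window = deque()
        self.total = 0.0

    def update(self, price):
        # Add the newest price to the window and to the running sum.
        self.window.append(price)
        self.total += price
        # Subtract the price that drops out of the window, if any.
        if len(self.window) > self.window_length:
            self.total -= self.window.popleft()
        # Before the window is full, average over the prices seen so far.
        return self.total / len(self.window)


mavg = RunningMean(window_length=3)
results = [mavg.update(p) for p in [10.0, 11.0, 12.0, 13.0]]
```

Each call to `update` does a constant amount of work regardless of the window length, which is where the speed-up over re-summing the whole window comes from. Over very long runs, floating-point error can accumulate in the running sum, so an occasional full recomputation may be worth considering.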


We designed the data flow so that the simulation for order fulfillment is performed prior to the events reaching the algorithm itself, so modifying the event data stream will have no effect on the simulation results. The data sent to a user algorithm is discarded immediately after the algorithm uses it, and is never persisted. This way we don't need to copy the data in memory as we run the simulation.


Oops, I misunderstood the question. Ticks held for the batch transform are mutable, and are held in memory between invocations. However, changes to those events would only affect the calculations done by your algorithm, not the simulation itself.
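To illustrate the point about mutability, here is a small sketch in plain Python. The `window` deque of dicts is only a stand-in for the events held between invocations (an assumption, not the backtester's actual data structure), and `careless` and `careful` are hypothetical user functions.

```python
import copy
from collections import deque

# Hypothetical stand-in for the events held in memory between invocations.
window = deque([{"price": 10.0}, {"price": 11.0}])


def careless(events):
    # Mutates the shared event in place; the change persists in `window`.
    events[0]["price"] = 0.0


def careful(events):
    # Works on a deep copy, so the held window is left untouched.
    local = copy.deepcopy(events)
    local[0]["price"] = 0.0
    return local


careful(window)
after_careful = window[0]["price"]

careless(window)
after_careless = window[0]["price"]
```

After `careful` runs, the held price is unchanged; after `careless` runs, every later calculation that reads the window sees the modified price. Copying costs time, which is the trade-off behind passing the reference directly.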

Fawce & Thomas,

Not sure I understand all the details here, but in a proposed change to the batch transform, Thomas has a switch that will "pass the underlying deque reference." So, I figured that within an algorithm, it implies that the deque (presumably a reference to the tic data) could be inadvertently modified. If the modified deque is then used by another algorithm, the corrupted data would propagate. Allowing bugs in one algorithm to propagate to other algorithms should be avoided, right?

However, I think what Fawce is saying is that the deque corruption would be limited to the algorithm that caused it, and would not propagate, correct?

Grant, All data is isolated per algorithm and per backtest.


Regarding the moving average, I'll tinker around to see if I can get something acceptable to work on a minutely basis. Seems like I should get the tic data into a numpy array and then use the numpy vectorized functions to do the computations. I can compare the speed to your running sum and the batch transform approach. In the end, it'd be nice to have the data in a format compatible with number crunching functions (in MATLAB, this is the approach...feed functions vectors/matrices and get back vectors/matrices).
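A rough sketch of that numpy approach, assuming the trailing prices are collected in a fixed-length deque and converted to an array on each update. The function `weighted_mavg` and the linear weights are hypothetical choices for illustration only.

```python
from collections import deque

import numpy as np


def weighted_mavg(prices, weights):
    """Weighted mean of a trailing window; weights[-1] applies to the newest price."""
    values = np.array(list(prices), dtype=float)
    # While the window is still filling, use only the most recent weights.
    w = np.array(weights, dtype=float)[-len(values):]
    return float(np.average(values, weights=w))


window = deque(maxlen=3)           # the oldest price is dropped automatically
linear_weights = [1.0, 2.0, 3.0]   # assumed weighting: newest price counts most

wma = []
for price in [10.0, 11.0, 12.0, 13.0]:
    window.append(price)
    wma.append(weighted_mavg(window, linear_weights))
```

This recomputes the average over the whole window on every update, so it does more work per tick than a running sum, but it makes it easy to swap in any vectorized numpy function.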




I'm revisiting the idea of using a trailing window of sid data, updated minutely. I'd like all of the data to be from within the current trading day. Is there any way to do it, other than the batch transform (which I recall works, but slowly)?

Another approach that perhaps someone could illustrate would be, in the second-to-last minute of the trading day, to perform an analysis on all of the tic data for a sid (or multiple sids), up to that point in the day. Then, an order would be submitted and fulfilled during the last tic of the day. I think I can already do this efficiently with the batch transform, but perhaps there is a better approach. Again, note that I don't want data from the previous trading day.



The only way I can think to do that right now is with a carefully-sized batch transform trailing window. Have the window be 325 minutes long, and examine the window at 4:25 every day, and you have all of the data for the day. Unfortunately, that will be inefficient because the batch transform is updating every minute. But it might work.

It sounds like you're going down a path I've been thinking a lot about lately. As a product manager, I'm always trying to think about intermediate milestones that you can use to learn from on the way to the big milestone. I've been thinking about daily, end-of-day trading as an intermediate step for full live trading. I haven't come up with a product design I like yet, though.

Hello Dan,

It should be pretty straightforward to write the function I'm thinking of for the end-of-day trading (or any time of day, for that matter). An if-then statement is used to restrict the call to the batch transform to only a specified time near the market closing time. Then, only the data from within the current day is extracted and analyzed. One might be able to use Pandas for this, directly within the batch transform...perhaps someone knows how to do this and could jot down an example?
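As a starting point, here is a sketch of the two pieces using pandas outside the backtester. It assumes the prices arrive as a DataFrame with a DatetimeIndex in exchange local time; `should_analyze`, `current_day_prices`, and the 15:58 analysis time are hypothetical, and the data is synthetic.

```python
import datetime

import pandas as pd


def should_analyze(now, analysis_time=datetime.time(15, 58)):
    """True only at the chosen minute (15:58 is an assumed analysis time)."""
    return now.time() == analysis_time


def current_day_prices(prices, now):
    """Keep only the rows stamped on the same calendar date as `now`."""
    return prices[prices.index.normalize() == now.normalize()]


# Synthetic minutely prices: two minutes from one day, three from the next.
index = pd.to_datetime([
    "2013-01-02 15:58", "2013-01-02 15:59",
    "2013-01-03 15:56", "2013-01-03 15:57", "2013-01-03 15:58",
])
prices = pd.DataFrame({"price": [10.0, 10.1, 10.2, 10.3, 10.4]}, index=index)

now = index[-1]
if should_analyze(now):
    today = current_day_prices(prices, now)   # drops the previous day's rows
    day_mean = today["price"].mean()
```

If the timestamps are in UTC instead, they would need converting to the exchange time zone before comparing dates and times.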