
Recently, I noted some odd behavior by the batch transform. I found that it apparently re-orders the sids if context.stocks is a list. The order is maintained if context.stocks is a set. The outputs for the respective cases are below:

context.stocks = [sid(351), sid(1419), sid(1787), sid(25317), sid(3321), sid(3951), sid(4922)]

2013-02-26 handle_data:25 DEBUG <type 'list'>  
2013-02-26 handle_data:26 DEBUG [ 2.4601 87.18 99.43 13.89 31.8 20.58 102.35 ]  
2013-02-26 handle_data:27 DEBUG [[ 13.89 87.18 20.58 31.8 102.35 99.43 2.4601]]  

context.stocks = {sid(351), sid(1419), sid(1787), sid(25317), sid(3321), sid(3951), sid(4922)}

2013-02-26 handle_data:25 DEBUG <type 'set'>  
2013-02-26 handle_data:26 DEBUG [ 13.89 87.18 20.58 31.8 102.35 99.43 2.4601]  
2013-02-26 handle_data:27 DEBUG [[ 13.89 87.18 20.58 31.8 102.35 99.43 2.4601]]  

Is this the expected behavior, or is it a bug? If it is not a bug, I suggest making this clear in your documentation and training, since it is a subtle difference that can result in erroneous output (as I learned).

The "Add Backtest" button is not available, so I am posting the code in-line:

import numpy as np

# globals for get_avg batch transform decorator  
R_P = 1  # refresh period  
W_L = 1  # window length

def initialize(context):  
    #context.stocks = [sid(351), sid(1419), sid(1787), sid(25317), sid(3321), sid(3951), sid(4922)]  
    context.stocks = {sid(351), sid(1419), sid(1787), sid(25317), sid(3321), sid(3951), sid(4922)}  

def handle_data(context, data):  
    prices = np.zeros(len(context.stocks))

    # get prices  
    for i, stock in enumerate(context.stocks):  
        prices[i] = data[stock].price  
    # get batch transform prices  
    # call the batch transform once and reuse the result  
    prices_bt = get_prices(data, context)  
    if prices_bt is None:  
        log.debug("get_prices(data, context) returned None")  
        return  
    log.debug(type(context.stocks))  
    log.debug(prices)  
    log.debug(prices_bt)

@batch_transform(refresh_period=R_P, window_length=W_L) # set globals R_P & W_L above  
def get_prices(datapanel,sids):  
    return datapanel['price'].values  

Attached, please find the algorithm that I could not attach to the post above (no "Add Backtest" button). --Grant


Hello all,

Any insights on this? I'd like to close it out.

Thanks,

Grant

Hi Grant,

So sorry for the delay - Dan forwarded this to me when you first submitted it and I didn't get back to look at it until now. I've been working on a really fun new feature, and I let this slip - sorry!

The issue you have uncovered is by design. We don't guarantee any ordering of the data parameter passed to handle_data, or of the dataframe columns in the datapanel passed to a batch transform. Instead, we decided to go the route of using keys/column labels.

Thanks for pointing this out.

thanks,
fawce

Thanks Fawce,

No problem...just didn't want it to slip by. I'm still not quite clear on what's going on. My goal is to have a sure-fire way of capturing a trailing window of security data in a numpy ndarray. For the code above, the batch transform re-orders the columns when the sids are defined in a Python list. But if I define the sids with a Python set, the batch transform maintains the sid order.

Some questions:

  1. I am not clear from your response that the behavior highlighted by my code above will be consistent. If I call the batch transform with the sids defined by a Python set, will I always get back an ndarray with the columns ordered according to the sid order in def initialize(context)?
  2. When I call the batch transform with a Python list of sids, the ndarray columns are re-ordered. What criteria are used by the batch transform to order the columns? Or is it random?
  3. What would be the best practice? From your response, it sounds like I should be using keys explicitly when extracting data from the dataframe columns. Is there a way to re-write the code above so that I obtain the same result, regardless of the Python data type of context.stocks (list or set)?
  4. When using set_universe, are the sids in a Python list or set?

Grant

Hi Grant,

Thanks for detailing your thoughts. I'll jump right into your questions:
1. My read on the ordering is that the set is dumping the sids in order of their hashes, and that datapanel['price'].values is dumping the prices out in the order of the sid hashes. In other words, the match is a coincidence of how the dataframe.values property and set happen to be implemented. Neither of those two objects should be treated as an ordered list. The list, on the other hand, is ordered as you coded it. It should retain that order until you modify the list.
2. It is deterministic, but based on the need to index into the columns, and so appears to be in order of the hash of the column key.
3. It really just depends on how you want to code your function. The key access is available on the dataframe, and if you want an ordered ndarray, pandas has a built-in function that produces an ndarray with the columns in the order you specify. See the attached backtest for an example using the dataframe.as_matrix function.
4. The sids in the universe are available via the data parameter to handle_data - which is not sorted. You can iterate through the keys with:

for stock in data:  
   price = data[stock].price  
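
To illustrate the point about keys versus position, here is a minimal standalone sketch in plain Python (no Quantopian API). The `prices_by_sid` dict is a hypothetical stand-in for `data`, and `ordered_prices` is a hypothetical helper; the prices are the ones from the list-ordered log line at the top of the thread, paired with the sids in list order.

```python
# Hypothetical stand-in for `data`: price keyed by integer sid.
prices_by_sid = {351: 2.4601, 1419: 87.18, 1787: 99.43, 25317: 13.89,
                 3321: 31.8, 3951: 20.58, 4922: 102.35}


def ordered_prices(prices, sids):
    """Look prices up by key, in ascending sid order, for any container type."""
    return [prices[s] for s in sorted(sids)]


as_list = [351, 1419, 1787, 25317, 3321, 3951, 4922]
as_set = set(as_list)

from_list = ordered_prices(prices_by_sid, as_list)
from_set = ordered_prices(prices_by_sid, as_set)

# Both containers give the same result because the order comes from the keys,
# not from how the container happens to iterate.
print(from_list == from_set)
```

Sorting by sid is only one possible convention; any fixed, explicit key order would work equally well.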

thanks,
fawce


Thanks Fawce,

This does the trick:

@batch_transform(refresh_period=R_P, window_length=W_L) # set globals R_P & W_L above  
def get_prices(datapanel,sids):  
    return datapanel['price'].as_matrix(sids)[0]  

Alternatively, for a window length greater than 1 (e.g. W_L = 5), I can use:

@batch_transform(refresh_period=R_P, window_length=W_L) # set globals R_P & W_L above  
def get_prices(datapanel,sids):  
    return datapanel['price'].as_matrix(sids)  

Time increases with increasing row number (with the last row corresponding to the current tick).

The code you provided would seem to maintain the ordering explicitly.
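
For anyone reading this later: `DataFrame.as_matrix` was deprecated in pandas 0.23 and removed in 1.0. A rough equivalent outside Quantopian, assuming a plain pandas DataFrame of prices with sids as column labels, is to select the columns by label and then take `.values`. The `window_as_array` helper and the `frame` below are hypothetical; the single row reuses the column order the batch transform logged in the first post.

```python
import pandas as pd

# Desired column order, as written in initialize().
sids = [351, 1419, 1787, 25317, 3321, 3951, 4922]

# Hypothetical one-row price frame whose columns arrive in a different order.
arrival_order = [25317, 1419, 3951, 3321, 4922, 1787, 351]
frame = pd.DataFrame([[13.89, 87.18, 20.58, 31.8, 102.35, 99.43, 2.4601]],
                     columns=arrival_order)


def window_as_array(price_frame, sids):
    """Return a 2-D ndarray with columns in the order given by sids."""
    # Selecting by label fixes the column order before converting to ndarray.
    return price_frame[list(sids)].values


result = window_as_array(frame, sids)
print(result)
```

Passing an ordered sequence (a list or tuple) for `sids` matters here: converting a set with `list(...)` would still give a consistent pairing of labels and columns, but not a guaranteed order.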

Grant
