Minutely History frame interval slicing

I would like to make a DataFrame with just the first 30 minutes of data from each day, backfilled for several days from the start of a backtest. Can anybody show me a good way to pull the first 30 minutes from each of the 10 trading days in:

today = history(3900, '1m', 'price', ffill=False)

?

Any help appreciated!


Hi Robby,

The right way to do this is using Pandas groupby functionality. Essentially, you want to write a grouping function that breaks your big frame into smaller chunks, and then another function that takes those smaller chunks and returns the useful subset you care about.

I put together a short IPython Notebook in our Research Environment (currently in Alpha) that shows how to do this with the kind of data returned by history().

NBViewer Link

Source on GitHub
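In outline, the grouping approach looks something like this (a rough sketch inside handle_data, using the helper names that show up later in this thread; the notebook has the full, tested version):

from pandas.tseries.tools import normalize_date  # pandas 0.12-era import path

def my_grouper(ts):
    "Group key: midnight of the bar's date."
    return normalize_date(ts)

def first_thirty_minutes(frame):
    "Keep the first 30 rows (minutes) of a day's group."
    return frame.head(30)

# pricing is the minutely frame from history(3900, '1m', 'price', ffill=False)
first_30 = pricing.groupby(my_grouper).apply(first_thirty_minutes)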


Hello Robby,

Scott definitely has more experience, but here's my angle on your question. Using pandas' pivot functionality, I reformat the data into a DataFrame with ascending date along the vertical axis (index) and ascending trading minute for each date along the horizontal axis (columns). Rather than using a pandas Panel, I use a Python dictionary keyed by sid to store each DataFrame.

I did not do any spot checks for accuracy, but I think the code is correct. The only problem I see is that it appears to be bogging down the backtester (I suspect due to the pivot table computation every simulated minute). So, you could try just calling the routine at market open, or 30 minutes into the trading day, for example.
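As a hedged sketch of that gating idea (get_datetime() is Quantopian's simulation clock; the helper name and the 10:00 AM Eastern check are illustrative choices, not part of the algorithm below):

from pytz import timezone

def maybe_rebuild_pivots(context, data):
    # Hypothetical helper: skip the expensive pivot work except once per day,
    # 30 minutes after the 9:31 AM open.
    now_et = get_datetime().tz_convert(timezone('US/Eastern'))
    if (now_et.hour, now_et.minute) != (10, 0):
        return
    # ... build prices_pivot here, as in the code below ...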

As a note to Scott, it'd be nice if history had a switch to natively return the data in this format, with trade date as an index and trade minute as columns (in a Panel, I suppose, when there are multiple sids).

Grant

from pytz import timezone

def initialize(context):
    context.stocks = [sid(8554), sid(19920)]  # SPY & QQQ

def handle_data(context, data):
    prices = history(3900, '1m', 'price', ffill=False)  # 10 days x 390 minute bars
    prices['date'] = prices.index.date  # calendar date of each bar
    prices['time'] = prices.index.tz_convert(timezone('US/Eastern')).time  # Eastern trading minute
    prices_pivot = {}  # dict of pivot tables, keyed to sid
    for stock in context.stocks:
        prices_pivot[stock] = prices.pivot(index='date', columns='time', values=stock)

    for stock in context.stocks:
        print stock.symbol
        print prices_pivot[stock].tail(3).iloc[:, 0:3]

Seriously comprehensive answers, guys. The groupby functionality seems like the way forward; it will open up a few doors for me.
I am pretty excited about the research environment. Is there an ETA for the beta?

Thanks again.

@Scott,

I can't get your research environment code to work in the backtester. I am getting "Runtime exception: IndexError: out-of-bounds on slice (end)" from the first_thirty_minutes function. Can you check the code (below) and see what I am doing wrong? Thanks for any help.

from pandas.tseries.tools import normalize_date

def initialize(context):  
    context.xxx = [sid(24), sid(5061)]


def handle_data(context, data):  
    pricing = history(1950,'1m','price',ffill=False)  
    grouped = pricing.groupby(my_grouper).apply(first_thirty_minutes)  
    log.info(grouped.head(40))  

def my_grouper(ts):  
    "Function to apply to the index of the DataFrame to break it into groups."  
    # Returns midnight of the supplied date.  
    return normalize_date(ts)

def first_thirty_minutes(frame):  
    "Function to apply to the resulting groups."  
    return frame.iloc[:30]  

It looks like this is a discrepancy between pandas 0.12.0 (which is what's currently available in the backtester) and pandas 0.14.1, which is what's currently in research. Specifically, pandas has changed the behavior of the .iloc indexer when a slice runs off the end of a frame.

If you have a DataFrame with 20 rows, then in the latest pandas, doing

df.iloc[:30]  

returns the entire DataFrame. In pandas 0.12.0, this raises an IndexError, as you're seeing.
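A quick self-contained check of the newer behavior (run against a recent pandas; the 0.12 error can't be reproduced without that version installed):

import pandas as pd

df = pd.DataFrame({'price': range(20)})  # a frame with only 20 rows
print(len(df.iloc[:30]))                 # 20 -- recent pandas silently truncates the slice
# Under pandas 0.12.0 the same df.iloc[:30] raises IndexError instead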

I think what you're seeing is the fact that the end of your history frame contains fewer than 30 minutes for the current day until 10:00 AM, which means that the frame passed to first_thirty_minutes has fewer than 30 rows, which causes the .iloc to barf.

The expedient solution is probably to switch to something like:

def first_thirty_minutes(frame):  
    if len(frame) <= 30:  
        return frame  
    else:  
        return frame.iloc[:30]  
This might be something you're looking for:

df.groupby(pd.TimeGrouper('D')).apply(lambda df: df.head(30))

TimeGrouper is essentially what does the work behind the scenes in resample. It'll be quicker.

Ah, I always forget that you can use head() non-interactively. That's the correct way to do this in pandas 0.12 for sure, and probably just in general.

I'm not 100% sure that TimeGrouper will play nicely with our data because it doesn't have a uniform frequency. If Dale's expression works, though, I'd say it's cleaner than my solution.

Thanks Dale!

Not sure what you mean about the uniform frequency. TimeGrouper does use a uniform-frequency DatetimeIndex to get the integer edge values, i.e. the integer value of every midnight minute. At that point, it does what you'd expect and runs through once, marking when the frequency rolls over. So all it needs is for the index to be ordered.

I don't actually know if it does shortcut binning between uniform frequencies, i.e. create a bin range with step=60 when converting from minutes to hours. I think resample still checks the integer values. It's been a while since I looked deep into the bowels of it.

Btw, how does one do an inline code snippet? Like the single backtick on GitHub?

What I mean by non-uniform frequency is that the DatetimeIndexes (DatetimeIndices?) of our minutely frames don't actually have "minutely frequency" as pandas understands it, because they only include minutes during market hours. Our daily frames have a similar issue: pandas doesn't consider them "daily frequency" because they skip weekends and non-trading days. I'm not sure if TimeGrouper depends on having frequency information on the input it's trying to group, or if it just knows how to map values to bins. This is also the sort of subtle behavior that often changes between releases due to performance tweaks.
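A small illustration of the point (a toy index, not actual Quantopian data):

import pandas as pd

# Two "trading days" of minute bars with an overnight gap between them
day1 = pd.date_range('2014-08-04 09:31', periods=390, freq='T')
day2 = pd.date_range('2014-08-05 09:31', periods=390, freq='T')
idx = day1.append(day2)

print(idx.freq)           # None -- pandas doesn't attach a minutely frequency
print(idx.inferred_freq)  # also None, because of the gap between sessions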

Triple backticks let you do an entire block. Tragically, there's no functionality currently for inlining à la GitHub's single backticks.

Ah gotcha. Yeah TimeGrouper is fine for that. It doesn't actually check the frequency but the index integer values. Even the uniform indexes are just a store of integers that represent nanoseconds.

pd.date_range(start="2000", freq="D", periods=10).asi8  

When you're doing the resample, the target frequency is only used to compute the edges. The above would give you the right edge of the midnight rollover in nanoseconds, so 12:00:00.000000001 AM or however you represent that. The source frequency could be used, but it isn't required; you'd just run a simple loop in Cython and build up the locations that mark the rollover in the original index.

# pseudocode: bin an ordered (possibly irregular) index against uniform edges
labels = pd.date_range(start=index[0], end=index[-1], freq=freq)
edges = list(labels.asi8)        # bin edges as nanosecond integers
values = index.asi8              # the original index, also as integers
next_edge = edges.pop(0)
bins = []
for i in range(len(values)):
    if values[i] < next_edge:
        continue
    next_edge = edges.pop(0)     # advance to the next rollover
    bins.append(i)               # mark where a new bin starts

This is an approximation of what you get from:

grouped = df.groupby(pd.TimeGrouper('D'))
grouper = grouped.grouper
bins = grouper.bins
labels = grouper.labels

All the time series machinery was built to handle ordered, non-uniform DatetimeIndexes. Any shortcutting that would make use of the frequency metadata would be added on top of that. At its base, it's just a series of Timestamps.

Scott & all,

Any idea how to speed up my pivot table code posted above (while still generating the table)? I like this format, rather than having the data in a time series, but it really seems to bog things down.

Grant

It depends on the data. If the prices frame is uniform in some manner, meaning that every day has the same range of minute bars, then you can just reshape instead of pivot.

Original:  
A1 A2 A3 B1 B2 B3

Reshaped:  
A1 A2 A3  
B1 B2 B3  

Imagine that the letters are days and the numbers are minutes. You could reshape and then provide the labels yourself. If your data is irregular, it's still probably faster to just reindex against a uniform index and then reshape.
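For instance, something along these lines (a rough sketch: it assumes the raw history() frame from the algorithm above, before the date/time columns are added, and exactly 390 bars every session):

import pandas as pd

minutes_per_day = 390                         # assumption: every session has 390 minute bars
spy = prices[context.stocks[0]]               # one sid's minutely price series
n_days = len(spy) // minutes_per_day
values = spy.values.reshape(n_days, minutes_per_day)  # rows = days, columns = minutes
dates = sorted(set(spy.index.date))
times = [ts.time() for ts in spy.index[:minutes_per_day]]  # still UTC here; convert afterwards if desired
table = pd.DataFrame(values, index=dates, columns=times)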

Thanks Dale,

Good point. Sorta doable, since SPY can be used as a proxy for a market clock. I have to think about how to handle early market closes, though.

Grant

Yar. I don't know how the Quantopian data comes in. If the early closes result in NaNs with the same number of points per day, you can still reshape. Either way, I'd hold off on the time zone conversions until you create your table, then just convert the time columns afterwards.

Dale,

It's an event-driven backtester, so no events, no data. So, when the market closes early, the time series just jumps to the next market day (i.e. no NaNs filled to the normal closing time).

Grant

Hm, yeah, I get that it's event-driven; I wasn't sure where history was getting its data from.

http://nbviewer.ipython.org/gist/dalejung/584aab06d76ea3de1361

Like I said in my previous post, reindexing is going to be fast even if it feels like it's doing more steps.
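Roughly, the reindex-then-reshape step looks like this (illustrative names, continuing the single-sid series spy from the sketch above; not necessarily what the linked notebook does):

import pandas as pd
from datetime import datetime

dates = sorted(set(spy.index.date))
times = sorted(set(spy.index.time))
# the full (date x minute) grid you'd expect if every session were complete
full_index = pd.DatetimeIndex([datetime.combine(d, t) for d in dates for t in times], tz='UTC')
uniform = spy.reindex(full_index)             # missing bars (early closes, halts) become NaN
table = pd.DataFrame(uniform.values.reshape(len(dates), len(times)), index=dates, columns=times)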

I am not able to figure out why, in research, when using the minute data frequency, the data starts at 14:31. I was expecting to get data from the beginning of the trading day; the reason I want this is to take the mean of the first 30 minutes of each day. Has anyone else seen this problem?