function like "history", but for fetch_csv data?

Hi all,

This seems like a basic question, but is there a way to safely (i.e., bias-free) get the historical data for a time series that I have retrieved using fetch_csv?

I could do it unsafely, of course, but I am loath to introduce errors like that...

Simon.


Hello Simon,

I'm not really familiar with fetcher, so I may be misunderstanding your question. Are you referring to the mechanics of using the imported time series in an algo, or to the risk that the time series, due to its origin, is offset in time with respect to Quantopian's time stamps?

Grant

As far as I can tell, when using fetcher to get time series data, the only thing you have access to is the most recent point, by time stamp, as a column in your data. What I want is the previous 250 points as well, but I want to do it safely, so that my algo still cannot have access to future data.
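Concretely, all you ever see in handle_data is the latest fetched row, something like this (illustrative, assuming the CSV exposes a 'Close' column):

def handle_data(context, data):
    if 'VIX' in data:
        vix_now = data['VIX']['Close']  # the most recent value only; no trailing window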

I am working on a solution similar to the one here: https://www.quantopian.com/posts/method-to-get-historic-values-from-fetcher-data but a bit more elegant, still using transforms to pack the trailing data as CSV strings into a new column of the fetched DataFrame in post_func.

Before the algo starts, can you stick 250 rows in context.fetched_data, and then update context.fetched_data as the algo runs? You'd drop the oldest row, and add a new one every time handle_data() is called (add SPY to ensure it is called every minute). As long as the initial loading of context.fetched_data is done correctly, it seems like this should work.

Otherwise, if you don't need the trailing window at the algo start, just set up context.fetched_data as an empty DataFrame and build it up to 250 rows, and then maintain it in the same fashion.
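A rough, untested sketch of the second variant (assumes the fetched column is named 'Close'; get_datetime() is the built-in that returns the current bar's timestamp, and WINDOW is just a name I picked for the 250-row cap):

import pandas as pd

WINDOW = 250

def initialize(context):
    # fetch_csv(...) call for 'VIX' omitted for brevity
    context.fetched_data = pd.DataFrame()  # starts empty, builds up to WINDOW rows

def handle_data(context, data):
    if 'VIX' not in data:
        return
    # append the latest fetched bar, keyed by the current timestamp
    row = pd.DataFrame({'Close': [data['VIX']['Close']]}, index=[get_datetime()])
    context.fetched_data = context.fetched_data.append(row)
    # once past the window, drop the oldest rows
    if len(context.fetched_data) > WINDOW:
        context.fetched_data = context.fetched_data.iloc[-WINDOW:]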

I definitely do need the build-up, as I plan to live-trade this one. The first method you describe is the 'unsafe' one I was trying to avoid; it's super easy to get off-by-one errors when deciding how much data to start with. I'll post my solution here once it is done. It's basically the same as that other guy's, but with as few loops as I could manage.

For what it's worth, there's a clever trick using globals posted by Peter Bakker here:

https://www.quantopian.com/posts/load-data-from-external-source-and-place-it-on-context-as-is

Perhaps useful to you in this case.
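The gist of the trick, as I understand it (a rough sketch, not Peter's exact code): stash the entire fetched DataFrame in a module-level global from post_func, then read whatever slice of it you want later.

full_history = {}  # module-level stash, survives across calls

def store_history(df):
    # post_func: squirrel away the whole fetched frame, pass it through unchanged
    full_history['df'] = df.copy()
    return df

def initialize(context):
    fetch_csv(vixUrl,
              symbol='VIX',
              skiprows=1,
              date_column='Date',
              post_func=store_history)

The catch is that full_history contains the entire file, future dates included, so nothing stops you from accidentally reading past the current bar.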

Thanks - that's the technique I am trying to avoid, because it's too easy to accidentally look at future data.

This is still a work-in-progress, but if anyone else is facing this problem, perhaps this will help.

# NB: fetch_csv and log are provided by the Quantopian environment
import datetime
import re

import pandas as pd
from pandas import Series
from zipline.utils import tradingcalendar

vixUrl = 'http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vixcurrent.csv'  
AdaptationWindow = 250

def initialize(context):  
    fetch_csv(vixUrl,  
              symbol='VIX',  
              skiprows=1,  
              date_column='Date',  
              pre_func=addFieldsVIX,  
              post_func=shift_data)

def handle_data(context, data):
    # rebuild the trailing history of VIX closes each bar
    context.vix_vals = unpack_from_data(data, 'VIX')

def fix_close(df, closeField):
    df = df.rename(columns={closeField: 'Close'})
    # remove spurious asterisks from the date strings
    df['Date'] = df['Date'].apply(lambda dt: re.sub(r'\*', '', dt))
    # convert the date column to timestamps
    df['Date'] = df['Date'].apply(lambda dt: pd.Timestamp(datetime.datetime.strptime(dt, '%m/%d/%Y')))
    # DataFrame.sort, per the pandas version Quantopian runs
    df = df.sort(columns='Date', ascending=True)
    return df

def subsequent_trading_date(date):  
    tdays = tradingcalendar.trading_days  
    last_date = pd.to_datetime(date)  
    last_dt = tradingcalendar.canonicalize_datetime(last_date)  
    next_dt = tdays[tdays.searchsorted(last_dt) + 1]  
    return next_dt

def add_last_bar(df):  
    last_date = df.index[-1]  
    subsequent_date = subsequent_trading_date(last_date)  
    blank_row = Series({}, index=df.columns, name=subsequent_date)  
    # add today, and shift all previous data up to today. This  
    # should result in the same data frames as in backtest  
    df = df.append(blank_row).shift(1).dropna(how='all')  
    return df

def shift_data(df):
    log.info("Pre-Shift")
    df = add_last_bar(df)
    # fillna returns a new frame; the result must be assigned
    df = df.fillna(method='ffill')
    df['PrevCloses'] = my_rolling_apply_series(df['Close'], to_csv_str, AdaptationWindow)
    dates = Series(df.index)
    dates.index = df.index
    df['PrevDates'] = my_rolling_apply_series(dates, to_csv_str, AdaptationWindow)
    return df

def unpack_from_data(data, sym):
    if (sym in data and
        'PrevCloses' in data[sym] and
        'PrevDates' in data[sym]):
        v = data[sym]['PrevCloses']
        i = data[sym]['PrevDates']
        return from_csv_strs(i, v, True).apply(float)
    else:
        # not enough history yet (or the fetch failed); callers receive None
        log.warn("Unable to unpack historical {s} data.".format(s=sym))

def addFieldsVIX(df):  
    log.info("VIX: Pre-Massage")  
    df = fix_close(df,'VIX Close')  
    log.info("VIX: Post-Massage")  
    return df

# convert a series of values to a comma-separated string of said values
def to_csv_str(s):
    return ','.join(Series(s).apply(str))

# a specific instance of rolling apply, for Series of any type (not just numeric,
# a la pandas.rolling_apply), where the index of the result is set to the indices
# of the last elements of each window
def my_rolling_apply_series(s_in, f, n):
    s_out = Series([f(s_in[i:i+n]) for i in range(0, len(s_in) - (n - 1))])
    s_out.index = s_in.index[n-1:]
    return s_out

# reconstitute a Series from two csv-encoded strings: one for the index, one for the values
def from_csv_strs(x, y, idx_is_date):
    s = Series(y.split(','), index=x.split(','))
    if idx_is_date:
        s.index = s.index.map(pd.Timestamp)
    return s

There is regrettably one loop left, in my_rolling_apply_series, but I don't think it can be avoided: pandas' built-in rolling_apply only handles numeric data, and these windows are strings. If anyone knows how, please let me know! It still seems pretty quick.

I updated this with my latest code to fetch the data, fix it up, encode the history, and reliably shift it to avoid look-ahead bias in both backtest and live trading.

Comments welcome.
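For reference, this is roughly how the unpacked history gets used downstream (illustrative only; record() is the built-in charting call):

def handle_data(context, data):
    context.vix_vals = unpack_from_data(data, 'VIX')
    if context.vix_vals is None:
        return  # not enough history yet
    # e.g. compare the latest (shifted) close against its trailing mean
    vix_last = context.vix_vals.iloc[-1]
    vix_mean = context.vix_vals.mean()
    record(vix_last=vix_last, vix_mean=vix_mean)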

Hey Simon,
I hear the issue you're having here, and I'm not sure of the best way to solve it. I think batch transform does what you want, but that is being deprecated. It's possible the modeling/factor/name-TBD API solves this eventually, but it's not there yet.

I don't have an answer right now. I just wanted to acknowledge it's on my radar.

KR


Any update on this? This seems like it should be a pretty simple fix, but the workaround is incredibly annoying.

Bump, any update on this?

Hi Gabriel,

The best way to upload a time-series CSV with accurate historical access is the Self-Serve Data feature. To represent your data accurately in Pipeline and avoid lookahead bias, it is collected, stored, and surfaced point-in-time, in the same fashion as our existing Quantopian partner data.
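Consuming a Self-Serve dataset in Pipeline looks roughly like this (a sketch; the user_XXXXXXXX module path and the dataset/column names below are placeholders generated when you create the upload):

from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
# the actual module name is generated per account when the dataset is created
from quantopian.pipeline.data.user_XXXXXXXX import my_vix_data

def initialize(context):
    pipe = Pipeline(columns={'vix_close': my_vix_data.close.latest})
    attach_pipeline(pipe, 'my_data')

def before_trading_start(context, data):
    # point-in-time view of the uploaded series, no lookahead
    context.my_data = pipeline_output('my_data')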
