New Feature: Fetcher!

This feature is a big one! I can't tell you how many times someone has asked me: "I have a set of prices on my computer, can I use it in Quantopian?" or "Do you have this specific other data source in Quantopian?" Until today, the answer was no!

Fetcher permits you to import any time-series data in CSV format into Quantopian. The data can be additional information about a stock already in Quantopian, such as short interest for a list of stocks, or it can be completely stand-alone data, like the price of palladium.

Fetcher imports the data into a pandas DataFrame. Once in the DataFrame, the data is sorted by datetime. Fetcher lets you manipulate and format the imported DataFrame both before and after the time sort. Once the formatting is done, the DataFrame is used as a data source just like the existing price data in Quantopian.
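To make that concrete: the "before" and "after" hooks are plain functions that take a DataFrame and return a DataFrame. Here's a minimal stand-alone sketch of that shape (the CSV content and column names are made up for illustration; inside Quantopian these functions would be passed to fetch_csv as pre_func and post_func):

```python
import io
import pandas as pd

# A "pre" hook runs on the raw CSV rows before the time sort;
# a "post" hook runs on the sorted DataFrame. Both take and return a DataFrame.
def lowercase_columns(df):
    df.columns = [c.lower() for c in df.columns]
    return df

def drop_missing(df):
    return df.dropna()

csv_text = "Date,Signal\n2013-01-02,1.5\n2013-01-03,\n2013-01-04,2.1\n"
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = drop_missing(lowercase_columns(raw))
print(list(cleaned.columns))  # ['date', 'signal']
print(len(cleaned))           # 2 (the row with the missing value is dropped)
```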

This is incredibly powerful. Until today, you've only been able to code your algorithms against signals found in the price and volume stock data. Now you can code your algorithm against any data source you can get your hands on! I particularly want to point out http://www.quandl.com/. They've put together 4 million datasets, free to use and easily accessed through Quantopian. As an example, the sample algo below loads the prices of palladium and gold from Quandl.

For the full explanation of the feature, you can read the help documentation.

Here is a quick primer. The generic code that you use for Fetcher:

fetch_csv(url, pre_func=None, post_func=None, date_column='date',  
           date_format='%m/%d/%y', timezone='UTC', symbol=None, **kwargs)  

Here is a simple implementation for a file shared on Dropbox:
fetch_csv('https://dl.dropbox.com/u/1/history.csv', date_column='date', date_format='%m/%d/%Y')

Here are the first few lines of that CSV, so you can see how the formatting works:

date,symbol,days_to_cover
2/28/2013,MTH,3.43719
2/15/2013,MTH,2.841933
1/31/2013,MTH,3.197315
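Under the hood this is ordinary CSV parsing. Here's a stand-alone pandas sketch of how those rows and the '%m/%d/%Y' format string come together (fetch_csv does this for you inside Quantopian, including the ascending sort by date):

```python
import io
import pandas as pd

csv_text = """date,symbol,days_to_cover
2/28/2013,MTH,3.43719
2/15/2013,MTH,2.841933
1/31/2013,MTH,3.197315
"""

df = pd.read_csv(io.StringIO(csv_text))
# Parse the dates with the same format string passed to fetch_csv,
# then sort ascending by date the way Fetcher does.
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df = df.sort_values('date').reset_index(drop=True)
print(df['date'].iloc[0].date())    # 2013-01-31
print(df['days_to_cover'].iloc[0])  # 3.197315
```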


This is a truly excellent addition in functionality, well done guys.

Excellent, this is a wonderful addition, I will use it to pull VIX and SKEW prices, at least!

Just a reminder to everyone who is fetching economic data series -- as Jessica Stauth pointed out in NYC a week or two ago, these data are often time-stamped such that the historical time stamps are earlier than when the data was actually available.

In her case, the NYSE short interest data time stamps were 8 days prior to when those figures were released.

In order to prevent look-ahead bias, you must adjust those time stamps!

Simon, that is a very good point. We work very hard to keep look-ahead bias out of the system, but Fetcher does permit it to creep back in if you're not careful.

One example: say you import unemployment as your signal, and you enter November's unemployment as of Nov 1, 2012 - that's look-ahead bias, because you didn't know the unemployment figure then! You have to import it with the release date of the data as the key date - that's the first date you could trade on the information.
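One practical way to do this is a pre-processing function that shifts every timestamp forward by the publication lag. This is a hypothetical sketch (the 8-day lag echoes the NYSE short-interest example above; the right lag depends entirely on your data source):

```python
import io
import pandas as pd

# Hypothetical pre_func: shift each timestamp forward by the publication
# lag so the algo only sees a figure on (or after) its release date.
RELEASE_LAG_DAYS = 8  # assumption for illustration; check your source

def shift_to_release_date(df):
    df['date'] = pd.to_datetime(df['date']) + pd.Timedelta(days=RELEASE_LAG_DAYS)
    return df

csv_text = "date,symbol,short_interest\n2013-02-15,MTH,2.84\n"
df = shift_to_release_date(pd.read_csv(io.StringIO(csv_text)))
print(df['date'].iloc[0].date())  # 2013-02-23
```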

I'm going to make a point to cover this new feature in some detail in today's webinar. If you're looking for a deeper run-through, sign up here.

Yeah, it can be more subtle than that, since most economic release series have their own dates, which do not correspond to the dates of the data's release, but rather with the dates of the data themselves. Worse, these data are often revised, where data from the past is updated in the future, so if you are back-testing over the series, your algo might work great with all the revised figures, but those revisions were future data.

EDIT: still an awesome feature!

This is one of the best and most necessary updates I've seen. Last I checked, there were only certain approved Python libraries that we could reference. Is there a list somewhere?

Hi guys,

I completely agree about this great new feature--thank you.

In testing, I cloned this algorithm and get the following when I build:

Something went wrong on our end and we've been notified about it. Sorry for the inconvenience.  
LookupError: unknown encoding: idna  
File test_algorithm_sycheck.py:17, in initialize  
File algoproxy.py:882, in fetch_csv  
File requests_csv.py:168, in __init__  
File requests_csv.py:203, in load_df  
File requests_csv.py:194, in fetch_data  
File api.py:55, in get  
File api.py:44, in request  
File sessions.py:276, in request  
File models.py:221, in prepare  
File models.py:290, in prepare_url  

I know you know about it (via that message) so I should probably not be redundant here. Is this a short or long fix? No pressure.

@gary, you can see the whitelist here: https://www.quantopian.com/help#ide-module-import

Oh wow perfect. statsmodels was the main one that I was looking for.

@ryan, thanks for the heads up, checking...

@ryan, the problem is fixed and Fetcher works again. Sorry for the inconvenience!

This looks very interesting... at a glance it looks like you're sticking with US equities, but if you let us fetch OHLC data from someone like Dukascopy we could run algorithms on FX too :)

Great new feature! Any possibility of adding the Quandl python package to the whitelist?

Hey Fawce and Dan--thanks for the fix. It's working for me and I am exploring the capabilities (like pre_func, post_func and how to use multiple columns in a remote DataFrame). I'll update my other post/thread with my working examples once I get to a good point.

@Ryan, thanks. Please let me know if you have questions as you kick the tires.

@George, quandl did a lot of great work on their API, and it fits in perfectly with Fetcher. We were just talking about adding it, and have a question for you. To guard against lookahead bias in the simulation, we'd restrict quandl to the initialize method, and provide a way for you to push the dataframe you receive into the algo's data sources. That way you'll get all the quandl goodness plus our data alignment, batching, transforms, and so on for the data. Would you be happy with that?

thanks,
fawce

Not only is this awesome, but I've been looking for something like Quandl for ages!

fawce,

Glad to hear you are thinking about it - I think what you are proposing would work. To be honest, I just thought it was nice and simple to use the quandl get function:

import Quandl as ql
vxv = ql.get('CBOEFE/INDEX_VXV')

But now that I think about it more, I wonder if Quandl is going to be happy with all the Quantopians adjusting parameters of their algos all day, reloading the data for every backtest. Maybe Quantopian can get a mega-authorization key or something like that. (Hope I'm not opening a can of worms...)

I love what you guys are doing!

thanks,
george

I've tried a lot but will not post my mess of non-working code.

Instead, would it be possible to see an example of using Fetcher without the 'symbol' keyword, so that it can get a .csv file containing derived signals, something like:

date,GOOG,IBM  
2008-05-14, 0.012, 0.3  
2008-05-15, 0.011, 0.21  
2008-05-16, 0.017, 0.4  
2008-05-19, 0.011, 0.1  
...

using a call like

fetch_csv('http://someurl.org/sig.csv', date_column='date', date_format='%Y-%m-%d')  

All the examples I have seen use the 'symbol' keyword in fetch_csv().

Then, in handle_data(), I want to use a stock's price together with its associated signal: data[cur_sid].price (the normal thing with price from sid()) and data[cur_sid].sig (from my uploaded .csv file). I realize that I need to introduce the 'sig' column name somewhere, a la the 'post_func=rename_col' in the Fetcher example, but I just cannot seem to figure it out.

A separate issue: it would be ideal to have the ability to use the column names in my .csv file (GOOG, IBM in my example above) as a way to populate my portfolio in initialize() with all the appropriate sids.

This way, my single uploaded sig.csv file carries all the ticker info so I do not have to actually type sid() or their numbers. I can set up the code on Quantopian and control which stocks are tested simply by changing the .csv file on my server (or specifying a different .csv file).

Are such things possible?

Thanks,
Ryan

George,

Quandl actually gave us just such a mega auth key! Whenever you run Fetcher with a quandl.com URL, we auth it with Quantopian's key. The Q+Q combo has struck a chord, so maybe it would be cool to add a new builtin 'fetch_quandl' so you could just do:

fetch_quandl('CBOEFE/INDEX_VXV')  

instead of

fetch_csv('http://www.quandl.com/api/v1/datasets/CBOEFE/INDEX_VXV.csv')  

and in the background, we'll use the quandl api.
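The mapping between the two calls is just URL construction. A hypothetical helper showing the idea (the endpoint URL is the one quoted above; make_quandl_url is a made-up name, not a Quantopian API):

```python
# Hypothetical helper: build the Quandl CSV endpoint URL that a
# fetch_quandl builtin would hand to fetch_csv for a dataset code
# like 'CBOEFE/INDEX_VXV'.
def make_quandl_url(code):
    return 'http://www.quandl.com/api/v1/datasets/{}.csv'.format(code)

print(make_quandl_url('CBOEFE/INDEX_VXV'))
# http://www.quandl.com/api/v1/datasets/CBOEFE/INDEX_VXV.csv
```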

thanks,
fawce

Hi Ryan,

Here you are: https://www.quantopian.com/posts/ranking-and-trading-on-days-to-cover
You should be able to grab the csv I'm using to see the structure of the file and so forth.

thanks,
fawce

Thanks for the big example, Fawce. That is very helpful, and I'm looking forward to adapting it. I see the big difference: instead of one N+1-column DataFrame-like .csv file, it is a stack of N three-column DataFrame-like blocks for however many dates you want. Got it.

Do I understand correctly that there is a limit of 10 securities in a portfolio?

Thanks again,
Ryan

Ryan,

My algo limits the portfolio to 20% of the universe (top and bottom deciles). That works out to be about 100 securities for the universe I chose.

thanks,
fawce

Great, thanks for your patience and explanations.

Ryan

Is it possible to import a time series of minute data using fetch_csv? The examples above all appear to be daily data. If so, what format should the dates and times be given in?

Thanks,
David

Yes David, you can. Within your fetch_csv you'd need to specify it. Something like this:

        date_format='%Y-%m-%d %H:%M:%S'  
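You can check a format string like that directly with pandas before wiring it into fetch_csv. A quick sketch with a made-up minute-bar timestamp:

```python
import pandas as pd

# The same format string passed to fetch_csv, applied directly with
# pandas to show how a combined date-and-time column would be parsed.
ts = pd.to_datetime('2013-03-04 09:31:00', format='%Y-%m-%d %H:%M:%S')
print(ts.hour, ts.minute)  # 9 31
```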

Hi Dan,

Thanks for building out this new feature and it was a pleasure meeting you in person at the last Quantopian meetup (you may remember me as the annoying person who asked a few questions about how you guys secure your site against malicious Python attacks). I was wondering if it's on the project timeline for Quantopian to open source Fetcher?

The reason being that a lot of people (meaning at least me) are trying to use Zipline as a way to backtest with external data sources (in SQL/NoSQL), e.g., historical options/forex data. Fetcher would be an excellent tutorial for folks like me who are trying to build a customized data source. I promise that if there's a good example of how a custom source is built, I'll upload my own custom data source to Github. Thanks for your team's excellent effort on building a great backtester both online (Quantopian) and offline (Zipline). Until then, I'll try to figure it out in PyDev via trial and error.


Hi,
I found that this snippet can be used to set headers in case a file has none. However, how do I define a different delimiter, like ';' instead of ','?

def set_headers(df):  
    header_row = ['Date','Open','High','Low', 'Settle', 'Volume', 'Prev. Day Open Interest']  
    df.columns = header_row  
    return df  

It is passed as pre_func=set_headers, with usage similar to post_func=rename_col in the Fetcher example.
J.

Hi J,

fetch_csv uses pandas' read_csv under the hood. You should be able to pass in any of the optional arguments accepted by read_csv (listed below) to fetch_csv, including 'delimiter'.

Let me know if this solves your problem! I've made a note to add this level of detail to our Fetcher help docs.

Best wishes,
Jess

# The following optional arguments are supported for pandas' read_csv  
# function, and may be passed as kwargs to the datasource below.  
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html  
ALLOWED_READ_CSV_KWARGS = {  
    'sep',  
    'dialect',  
    'doublequote',  
    'escapechar',  
    'quotechar',  
    'quoting',  
    'skipinitialspace',  
    'lineterminator',  
    'header',  
    'index_col',  
    'names',  
    'prefix',  
    'skiprows',  
    'skipfooter',  
    'skip_footer',  
    'na_values',  
    'true_values',  
    'false_values',  
    'delimiter',  
    'converters',  
    'dtype',  
    'delim_whitespace',  
    'as_recarray',  
    'na_filter',  
    'compact_ints',  
    'use_unsigned',  
    'low_memory',  
    'buffer_lines',  
    'warn_bad_lines',  
    'error_bad_lines',  
    'keep_default_na',  
    'thousands',  
    'comment',  
    'decimal',  
    'keep_date_col',  
    'nrows',  
    'chunksize',  
    'encoding',  
    'usecols'  
}
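So for J's semicolon-delimited, headerless file, the delimiter kwarg plus the set_headers pre_func would combine roughly like this. This is a stand-alone pandas sketch of what fetch_csv would do with those arguments; the single data row is invented for illustration:

```python
import io
import pandas as pd

# Same pre_func as in the post above: assign headers to a headerless file.
def set_headers(df):
    df.columns = ['Date', 'Open', 'High', 'Low', 'Settle',
                  'Volume', 'Prev. Day Open Interest']
    return df

# Simulate a headerless, semicolon-delimited file.
csv_text = "2013-01-02;10;12;9;11;1000;500\n"
raw = pd.read_csv(io.StringIO(csv_text), delimiter=';', header=None)
df = set_headers(raw)
print(list(df.columns)[:4])  # ['Date', 'Open', 'High', 'Low']
```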

Hey guys. Definitely agree that this is a great feature. A few questions: 1) How are symbols mapped to sids? In many instances, a symbol maps to many sids, and sometimes the first choice is not the most sensible one. How is this handled in fetch_csv? 2) How should we use fetch_csv in production? Is initialize() called at the beginning of each day so we can put the fetch_csv call in there, or do we need to call it in handle_data()?

Hey John,

We use (ticker, date) pairs to match symbols in the external file to Quantopian's sid universe, so your file needs to have the correct ticker as of the date it's associated with in the file. I believe we only support exact matches currently, with no fuzzy rules, so if there isn't an exact match we drop the row and print a warning.

Fetcher gets called in initialize, and in live trading the file gets re-parsed each morning when the algo is launched in production (currently, I think, around 2-3am). This means new data is available to your production algo at most once per day with the current design (we are looking at options to increase this frequency and allow intraday updates).

One other subtlety is that the source file should maintain an unchanging historical record of past data with updates limited to appending each new day's data. In the long run we may develop a solution that gets around this historical dependency, but in our current design it is critical that the file persist prior content.

Check out the attached code for a simple example of how to use a single time snapshot of a stock index via Fetcher to define the algorithm's universe. As always more detail and examples can be found in the help docs.

Best wishes, Jess


Hi,

I read that there must be a match with the Sids defined in Quantopian.
1) What if I have a symbol like DAX or Future not present in QT?
2) Is the 'Universe' defined for that, or for the signal data? I have no idea what the difference between signal and security data is.
3) How do I define my own backtest start and end dates programmatically? I have the idea it is related to the SID. How do I specify that I want to run a test over a certain period?

J.

For Security Info, your CSV file must have a column with the header 'symbol', which represents the symbol of that security on the date of that row. Internally, Fetcher maps the symbol to the Quantopian security id (sid). You can have many symbols in a single CSV file; however, any symbol that is not also initialized (either using sid() or set_universe()) is discarded.

    #df = read_csv(page,delimiter = ';', names=['Date','Open','High','Low', 'Close', 'Volume','OI'], index_col=None);df.head();print df['Date']  
    fetch_csv(  
        'http://MyUrl.com//FDAX-DAILY-19971119-20140301.txt',  
        #names=['Date','Open','High','Low', 'Close', 'Volume','OI'],  
        delim_whitespace=True,  
        index_col=None,  
        date_column='Date',  
        symbol='FDAX',  
        pre_func=set_headers,  
        post_func=rename_col,  
        date_format='%m-%d-%Y')  
        #universe_func=(my_universe))  
        #parse_dates=[0],  
        #date_parser=parse) #%Y%m%d %H%M%S') #20140103 ,  
    # DAX match?  
    context.stock = sid(7447)  
==> if not provided, no data; now how do I benchmark?

Hello,

Here are the answers to your questions:

1) You can only backtest and live trade the SIDs that are in our database. We have data for over 8,000 US stocks and ETFs! This includes all stocks that have traded since 2002, even ones that are no longer traded.

2) The universe is all the stocks initialized in your algorithm and present in your fetched file. Security data is a price, volume, or another attribute of the stock you're trading. Signals can be anything else: economic indicators, international stocks, prices of precious metals, and so on. You can use Fetcher to import security data or signal information to drive your trading logic.

3) You can't set the start and end dates in your code, you have to choose the backtest period from the calendar in the IDE. But it would be nice!

Hi Everyone,

It appears that while you can import a universe from a CSV, you can't get the history() of that universe; the following very simple program does not work.

Any idea on a fix?

-Russell

def handle_data(context, data):

    # Un/comment here - RWH  
    price_history = history(21,'1d','price')  
    volume_history = history(10,'1d','volume')  
    for stock in data:  
        print stock.symbol 

def preview(df):  
    log.info(' \n %s ' % df.head())  
    return df

# Function for returning a set of SIDs from fetcher_data  
def my_universe(context, fetcher_data):  
    sids = set(fetcher_data['sid'])  
    symbols = [s.symbol for s in sids]  
    context.count = len(symbols)  
    print "total universe size: {c}".format(c=context.count)  
    return sids

def initialize(context):  
    # set_universe(universe.DollarVolumeUniverse(floor_percentile=74.0,ceiling_percentile=76.0))  
    # set_universe(universe.DollarVolumeUniverse(floor_percentile=55.0,ceiling_percentile=57.0))  
    # set_universe(universe.DollarVolumeUniverse(floor_percentile=90.0,ceiling_percentile=92.0))  
    # Using Fetcher universe 

    fetch_csv(  
       "https://dl.dropboxusercontent.com/u/929/myCSV.csv",  
        pre_func=preview,  
        date_column='date',  
        universe_func=(my_universe))

# myCSV.CSV  
#date   symbol  GICS Sector  
#10/1/13    BIDU    Health Care  
#10/1/13    TPLM    Materials  
#10/1/13    FDML    Information Technology  
#10/1/13    SFY Health Care  

Russell,

Thanks for reporting this bug; it's an important find. Your repro case was very helpful, and we are working on a fix now.

thanks,
fawce

Russell (and everyone),

Good news: we fixed this issue. The attached backtest is the repro example from above. You have to clone it and run it to see the logs. N.B. I changed the csv file to be comma-delimited.

have fun,
fawce
