Proposed changes to Fetcher and universe selection

Hi everyone,

Over the past several months since we released fetcher, we've seen heavy use and received a number of suggestions for improvements. As we did with the proposal for history management, we started the design process for our revisions with a spec/help file of the new feature.

You can read the proposal here: https://gist.github.com/fawce/7154053

The thrust of the proposal is three changes:

• allow fetcher data sets to define universes on a daily basis
• more clearly separate fetching signal/indicator data and fetching stock data
• higher fidelity between fetcher-powered backtests and fetcher-powered live trading
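As a rough illustration of the first bullet (the names here are made up for the sketch, not the proposed API), a daily universe could be derived from a fetched CSV by grouping rows by date, so a symbol's presence on a date means it is in that day's universe:

```python
import io
import pandas as pd

# Hypothetical fetched CSV: one row per (date, symbol) pair.
CSV = """date,symbol,signal
2013-10-01,IBM,1.2
2013-10-01,MSFT,0.7
2013-10-02,IBM,1.3
"""

def daily_universe(csv_text):
    """Map each date to the set of symbols appearing on that date."""
    frame = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
    return {date.date(): set(group['symbol'])
            for date, group in frame.groupby('date')}

universe = daily_universe(CSV)
# MSFT appears only on 2013-10-01, so it drops out of the
# universe on 2013-10-02.
```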

Thank you to everyone who has given us direction, feedback, and advice on the forums, in feedback submissions, and in person at meetups.

We look forward to finalizing the design with your help.

thanks,
fawce


17 responses

Hello Fawce,

As you know I've been looking at 'fetcher' quite a bit in the last few days. The proposal looks great but I will have to read it a few times.

My first question follows from an algo I am working on now: what happens when a symbol in 'fetcher' corresponds to more than one SID, e.g. ADM maps to both SID(128) and SID(36355)? These have overlapping date ranges, i.e.:

'sid': Security(128, symbol=u'ADM', security_name=u'ARCHER DANIELS MIDLAND CO', start_date=datetime.datetime(1993, 1, 4, 5, 0, tzinfo=&lt;UTC&gt;), end_date=datetime.datetime(2013, 10, 25, 5, 0, tzinfo=&lt;UTC&gt;), first_traded=None)

and

'security': Security(36355, symbol=u'ADM', security_name=u'ARCHER-DANIELS MIDLAND COMPANY CORPORATE UNIT', start_date=datetime.datetime(2008, 5, 21, 5, 0, tzinfo=&lt;UTC&gt;), end_date=datetime.datetime(2011, 6, 1, 5, 0, tzinfo=&lt;UTC&gt;), first_traded=datetime.datetime(2008, 6, 3, 14, 32, tzinfo=&lt;UTC&gt;))

In my algo I have enumerated both in context.stocks. You can see that the first was matched with the 'fetcher' data, i.e. 'sid': Security(128, symbol=u'ADM'). When the algo traded ADM on 2013-02-15 the order was successful.

P.

Thanks Fawce,

I did a very quick skim over the draft...lots of details to digest! Under the section "Fetcher is limited to daily data" it is not clear why fetcher couldn't be used on minute data. For example, what if I fed in a date followed by 390 data values separated by commas? Then each value could be applied to its respective minute of the day (I realize that there is a potential for look-ahead bias, but the user would need to manage it).

Grant

@Peter - I think you have discovered a bug in our source data for the security timeseries. The design is that there should never be overlap in the date ranges for two uses of the same symbol. Fetcher assumes this, and looks up the sid using the symbol and date of the row in the csv. I need to look into the data in detail to figure out what happened with ADM.
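The symbol-plus-date lookup described above can be sketched as follows; the security-master structure and function names are illustrative, not the platform's internals. The design assumption is that date ranges for one symbol never overlap, so the row date selects exactly one sid:

```python
import datetime

# Simplified security master: symbol -> list of (sid, start, end) ranges.
# By design, ranges for a given symbol should never overlap.
SECURITY_MASTER = {
    'ADM': [
        (128, datetime.date(1993, 1, 4), datetime.date(2013, 10, 25)),
    ],
}

def lookup_sid(symbol, as_of):
    """Resolve a symbol to a sid using the date of the CSV row."""
    for sid, start, end in SECURITY_MASTER.get(symbol, []):
        if start <= as_of <= end:
            return sid
    return None  # no match: the row is dropped (or could be warned about)

# A row dated 2013-02-15 for ADM resolves to sid 128; a row dated
# before 1993 finds no range and would be ignored.
```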

@Grant - thanks for reading the spec. I wanted to be clear in the spec that the fetcher calls would only happen once per day in the live environment, and therefore, new data would be fed to your algorithm on a daily frequency. I agree that you could pass data at the market open, which your algo code interprets as a list of values. Fetcher won't do anything to stop this kind of use, but I discourage it, because I think it would end up being unwieldy when you are trading or testing with live data.

Hello Fawce,

Thanks. ADM is a random selection as far as I am concerned - it's actually the lowest numbered SID in the symbol data provided by Christian here.

P.

Hello Fawce,

More feedback:

• Your documentation and examples suggest that the data need to be individual float values:

date,symbol,short_interest
2/28/2013,IBM,2.841933

Is the parser configured to handle more complex inputs? For example, what if I want to feed in a list or numpy ndarray:

date,symbol,external_data
2/28/2013,IBM,[27.2,13.9,False]

• "If a row's date/symbol do not match with our stock history, the row is ignored." -- Either a warning or an error should result, instead of ignoring the row.
• For consistency with fetcher, I suggest increasing the number of individual sids that can be listed in an algo from 100 to 150.
• What is the total number of sids that can be in the universe? Is it 250 (100 listed explicitly within the algo and 150 additional listed with fetcher)?
• "Universe membership is projected onto trailing windows from batch_transform and history. set_universe has quarterly membership changes, and also manages a batch_transform backfilling process for new universe members at quarter changes. In order to enable the higher frequency updates to the universe with fetcher, backfilling is not supported at this time." -- What does this mean? Are you saying not to use the batch transform and history with fetcher? Will an error result?
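On the list-valued column question above: the literal `[27.2,13.9,False]` would collide with the CSV delimiter, since a plain CSV parser splits on the embedded commas. One workaround, sketched here with made-up column names (not a documented fetcher feature), is to pack the values with a different separator and split them in a post-processing step after loading:

```python
import io
import pandas as pd

# Pack the vector with ';' so it survives CSV parsing as one field.
CSV = """date,symbol,external_data
2/28/2013,IBM,27.2;13.9;0
"""

def unpack(frame):
    """Split the packed column into a Python list of floats per row."""
    frame = frame.copy()
    frame['external_data'] = frame['external_data'].apply(
        lambda cell: [float(part) for part in cell.split(';')])
    return frame

frame = unpack(pd.read_csv(io.StringIO(CSV)))
# frame['external_data'] now holds one list per row instead of a string.
```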

I'll keep plugging through the document, as time allows.

By the way, do you know if I could put CSV files for fetcher on github?

Grant

Hello Grant,

You can put CSV files for 'fetcher' on github, i.e.

fetch_csv('https://raw.github.com/pcawthron/StockData/master/One%20Symbol%20Per%20Row%20-%20Forward%20Fill%20Test%202.csv',
          date_column='date',
          date_format='%m/%d/%Y')


View the raw file in github and copy that URL.

P.

Under "Fetching signal information" it is kinda clunky to upload data one variable at a time, as you illustrate:

date,cpi,ijc
2007-12-01,211.445,123456

What if larger data sets are to be loaded with fetcher? Say I have a vector of 1000 floats: I'd need a header row with 1000 variables...inelegant.

Did you consider a more compact file format for fetcher (e.g. in MATLAB, typically data is stored in the native compact .mat format)?

Is it feasible to load in minute-level data with fetcher, even though the loading happens prior to market open? Or is this precluded to avoid look-ahead bias? If it is possible, I suggest describing the mechanics explicitly, along with some examples.

--Grant

Hello Fawce,

Is it correct that fetcher will only run (i.e. grab offline CSV data) when the algorithm is launched? If so, this will be pretty clunky under paper/live trading, since to update the fetched data, the user will need to stop the algorithm and then re-launch it, right?

Grant

Hi Grant,

Each morning, we run the algorithms in live trading through a warmup process that includes fetching the data. That way the algo will have fresh fetcher data each [edit: trading] day.

thanks,
fawce

Ah...sorry, you may have mentioned that before. What else happens during the warm-up process? I'm wondering if other elements of the code in the algorithm are executed (e.g. def initialize(context):)? --Grant

Hi Grant,

Warm up is designed to restore the state of your algorithm to wherever it was at the end of the previous trading day. That includes your orders, positions, fetcher data, batch_transforms (applies to the upcoming history method too) and the algo's context. The rest of the algorithm should be considered stateless and should instead depend on these other components. Among other things, warm up will re-run fetcher calls. Just beware, you shouldn't depend on specific behaviors of the implementation, because this is an area where we are doing quite a bit of optimization work to speed things up and the implementation will change. The end result should be the same though: your algorithm state will be restored from the prior close.
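The "keep durable state on context" pattern described above can be sketched as follows. The initialize/handle_data convention follows Quantopian-style algorithms, but the stand-in context class and the loop are only a simulation of the idea; the actual restore mechanics are platform-internal:

```python
def initialize(context):
    # Anything stored on context is part of the restorable state.
    context.days_seen = 0
    context.signal_history = []

def handle_data(context, data):
    # Module-level globals would NOT survive a warm-up restore,
    # so accumulate everything on context instead.
    context.days_seen += 1
    context.signal_history.append(data.get('signal', 0.0))

class _Ctx:
    """Minimal stand-in for the platform's context object."""
    pass

context = _Ctx()
initialize(context)
for bar in ({'signal': 1.0}, {'signal': 2.0}):
    handle_data(context, bar)
# After a restore, context would carry days_seen and signal_history
# forward; nothing outside context should be relied upon.
```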

There is a similar process we call "catch up" that can happen any time there are delays to the market data or algo processing. There are cases where the algorithm experiences a delay in the market data. When we detect that your algorithm is "behind the tape", we run all the backlogged data through the algorithm, but orders that are placed during this phase are canceled before filling or before being sent to the broker in real trading. As a result, the algorithm needs to consider the positions and the open orders when considering rebalancing and other calculations related to current exposure. Orders placed after catchup will be processed as normal. Cases of delayed market data are logged to your algo's output window.
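The defensive-rebalancing idea in the last paragraph can be reduced to a one-line rule, sketched here with illustrative names (not the platform API): the quantity to order is the target exposure minus what is already held minus what is already pending, so that orders surviving a catch-up phase are not double-counted:

```python
def shares_to_order(target, position, open_order_shares):
    """Order only the gap not covered by the position plus open orders."""
    return target - position - open_order_shares

# Holding 100 shares with 50 more pending against a 200-share target:
# re-ordering the full 200 would overshoot, so order just the gap of 50.
```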

thanks,
fawce

Thanks Fawce,

Sounds right, "your algorithm state will be restored from the prior close." If I understand correctly, fetcher will be the only way to influence the algorithm on-the-fly (without stopping it and re-launching). In general, fetcher could be used to set switches and parameter values for the upcoming trading day. An offline optimization routine could be run overnight, with the result passed to the algorithm via fetcher. Sounds pretty powerful.
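The overnight-optimizer idea above could look something like this: each night a job appends a row of parameters to the CSV, and the algorithm uses the most recent row dated at or before the current trading day. The column names and helper are made up for the sketch:

```python
import io
import datetime
import pandas as pd

# Hypothetical parameter feed written by an overnight job.
CSV = """date,leverage,threshold
2013-10-01,1.0,0.02
2013-10-02,1.5,0.03
"""

def params_for(day, csv_text):
    """Return the latest parameter row dated at or before `day`."""
    frame = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
    eligible = frame[frame['date'] <= pd.Timestamp(day)]
    return eligible.iloc[-1] if not eligible.empty else None

row = params_for(datetime.date(2013, 10, 2), CSV)
# The algo would then read row['leverage'] and row['threshold']
# as that day's switches, without being stopped and re-launched.
```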

Regarding the "catch up" you describe, I'll need to digest it. My first impression is that getting "behind the tape" shouldn't happen in a system that is event-driven; if the algorithm triggers off of the data, then there will be no way to fall behind, right?

Grant

Grant,

I agree that one could consider the bar as the determinant of time from the algo's perspective; that's essentially our model in backtesting. However, in live trading the market is actually moving while your algo is processing. Our design considers it dangerous to be trading with stale market data in a live context; hence catch-up.

thanks,
fawce

Fawce,

I'll have to think about that one...I guess I don't understand how things could get out of sync/stale, but I have a naive understanding of how electronic markets work. I suppose that it is because your data vendor's stream has variable latency (jitter) that is a significant fraction of a minute (or larger).

Grant

Grant,

It doesn't in the normal course of events. I'm talking about a network partition, or another outage from the vendor. Imagine the markets trading for 10 minutes during a network outage - when the network comes back, we have to catch up.

thanks,
fawce

Thanks Fawce,

You might also consider some form of configurable user notifications. If there were an outage, it'd be nice to get a text/e-mail/phone call. Also, the "health" of the trading system should be available to the algorithm itself, so that code can be added to make automated decisions. Some users may not want to apply your catch-up routine, and instead write their own.

Regarding the topic of this thread, if there are specific elements of your fetcher implementation that you are debating, just let me know, and I'll have another look at the document.

Grant

Hello Fawce,

Perhaps not the intended usage, but with this update to fetcher, will it be possible to import strings of code text and then execute the code in the algorithm?

Grant