data sources - backtesting vs. live trading?

On https://www.quantopian.com/posts/march-prize-algorithm-restart, Jonathan Kamens revealed:

history() data comes from our historical data, not from Nanex data, even in live trading

So, when backtesting, the data come from the historical data source exclusively (not Nanex). And in live trading, presumably prior day's data come from the historical source, but for the current day, the data come from Nanex? So, history() returns data from both the historical data source and the accumulated Nanex feed for the day? Or is the Nanex feed only supplying the current price (i.e. from the last trading minute), and the historical data vendor is supplying a real-time feed for all other trailing minutes?

Reading https://www.quantopian.com/faq#data, it is not clear (at least to me) that the data used for live trading are derived from two different sources. It would be easy to assume that it all comes from Nanex:

For paper trading and real-money trading, we get a realtime feed of trades from Nanex's NxCore product. Those trades are bundled into one-minute bars and fed to the trading algorithms.

One take-away here is that if you want to guarantee that your trailing data match the data used in your algo at any point in time, you'll need to accumulate your own data, rather than using history() (unless Q is performing the check and fixing the historical data, assuming that Nanex is the authoritative source). The problem is that there is no way to pull up the Nanex feed data, as one can do for the historical data, which are available in the research platform.
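For what it's worth, here is a minimal sketch of the "accumulate your own data" idea in plain Python. The class and method names (BarRecorder, record, trailing_closes) are made up for illustration and are not part of the Quantopian API; the assumption is simply that you call record() once per minute with whatever values your algo actually saw.

```python
from collections import deque


class BarRecorder(object):
    """Hypothetical helper: keeps the last `maxlen` bars exactly as the algo saw them."""

    def __init__(self, maxlen=390):
        # 390 is the number of one-minute bars in a regular US equity session
        self.bars = deque(maxlen=maxlen)

    def record(self, timestamp, close_price, volume):
        # Call once per minute with the values the algorithm traded on
        self.bars.append((timestamp, close_price, volume))

    def trailing_closes(self, n):
        # Last n recorded closes, oldest first
        if n <= 0:
            return []
        return [bar[1] for bar in list(self.bars)[-n:]]


# Small demonstration with made-up prices
recorder = BarRecorder(maxlen=3)
for minute, price in enumerate([10.0, 10.1, 10.2, 10.3]):
    recorder.record(minute, price, 100)
```

In a live algo, the record() call would presumably go wherever the current bar is read (e.g. next to data[context.Sid].close_price), so the trailing window is built only from the bars that were actually delivered.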

9 responses

Just for those of us who are feeling particularly pedantic today (me included), even after reading all those posts I don't know what the situation is.

Let us abandon all hope of using the English language, due to its limited specificity, and resort to the lowly use case:

    # Called at 14:32 UTC

    A = history(3, '1m', 'close_price')[context.Sid]
    B = history(2, '1d', 'close_price')[context.Sid]
    C = data[context.Sid].close_price

    print A[-1]            # ????          (14:32 UTC, i.e. 2nd bar of the day)
    print A[-2]            # historical?   (14:31 UTC)
    print A[-3]            # historical?   (yesterday)

    print B[-1]            # mixed?        (up to 14:32 UTC)
    print B[-2]            # historical.   (yesterday)

    print C                # NxCore        (14:32 UTC)


And volume; I forgot that one. Is volume the sum of mixed sources?

Thank you

What if you use IB paper trading?

Angelo,

The OHLCV minute bar feed still comes from Quantopian, but the fill prices, etc. are determined by IB.

Grant

Quantopian support,

Any feedback on this? This business of using mixed data sources (and only giving users access to the so-called backtest data set) seems like a murky approach to me. At a minimum, I recommend documenting exactly how you are melding the two sources.

I see that Nanex offers historical data back to 2004 (see http://www.nanex.net/historical.html). Looking over their description, it seems like you should run their historical data through your "ingester" bar calculator, and then replace the historical data set with one that matches one-to-one with the live feed. Then, you'd have 100% correspondence between backtesting and live trading, with respect to the security data.

Also, assuming that the vendor for your historical data is not Nanex, then you'd have only one vendor to deal with in addressing data integrity issues.
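To make the "bar calculator" idea concrete, here is a rough sketch of bundling trades into one-minute OHLCV bars. This is only an illustration and not Quantopian's actual ingester: the function name, the (epoch_seconds, price, size) trade format, and labeling each bar by the start of its minute are all assumptions, and a real feed would also need to filter on trade conditions.

```python
from collections import OrderedDict


def trades_to_minute_bars(trades):
    """Hypothetical bar builder.

    trades: iterable of (epoch_seconds, price, size), sorted by time.
    Returns an OrderedDict mapping minute start (epoch seconds) to a
    dict with open, high, low, close and volume.
    """
    bars = OrderedDict()
    for ts, price, size in trades:
        minute = ts - (ts % 60)  # assumption: label bars by minute start
        bar = bars.get(minute)
        if bar is None:
            # First trade of the minute sets every price field
            bars[minute] = {"open": price, "high": price, "low": price,
                            "close": price, "volume": size}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last trade seen so far in this minute
            bar["volume"] += size
    return bars


# Made-up trades: three in the first minute, one in the second
sample_trades = [(0, 10.0, 100), (30, 10.5, 50), (59, 9.8, 25), (60, 9.9, 10)]
minute_bars = trades_to_minute_bars(sample_trades)
```

Running the same function over both the historical trades and the live feed is what would give the one-to-one correspondence, since any bar-building quirks would then apply equally to both.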

Anything new on this, Quantopian support? So, when backtesting, the data come from the historical data source exclusively (not Nanex). And in live trading, presumably prior days' data come from the historical source, but for the current day, the data come from Nanex. Is that correct?

In my opinion, this situation needs to be cleaned up. We have data set #1 & data set #2. #1 is used for backtesting and the research platform and some combination of #1 & #2 are used for live trading. Shouldn't there just be one data set, with a 1:1 correspondence between backtesting/research and live trading?

Just my opinion, Grant. When dealing with any dataset, you find that two datasets covering the exact same metrics do not always give the same result, especially for intraday data. Sounds like a real problem, right?

A robust algorithm will work fine despite slight or systematic differences between datasets. If your algorithm is so sensitive to slight variations, or is betting on systematic differences between equally valid data definitions, then your algorithm is unstable.

Of course, if the differences are because one dataset is valid and the other has mistakes that introduce a bias, well then all bets are off. What differences do you see between Nanex and the backtester? Are they random or systematic? When you find a difference, what does the "truth" reveal? By the truth, in this context, I'd go with exchange records. I only see some anecdotal evidence in the OP's description. Are there systematic differences?
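As a rough way to answer the random-vs-systematic question, something like the following sketch could be run on closes logged during live trading against the historical closes for the same minutes. The function name and the dict-of-closes input format are assumptions for illustration, and the numbers are made up.

```python
from math import sqrt


def compare_closes(source_a, source_b):
    """Hypothetical check. Both arguments: dict mapping timestamp -> close price."""
    # Only compare timestamps present in both sources
    common = sorted(set(source_a) & set(source_b))
    diffs = [source_a[t] - source_b[t] for t in common]
    n = len(diffs)
    if n == 0:
        return {"n": 0, "mean_diff": None, "std_diff": None}
    mean = sum(diffs) / float(n)
    # Population standard deviation of the differences
    var = sum((d - mean) ** 2 for d in diffs) / float(n)
    return {"n": n, "mean_diff": mean, "std_diff": sqrt(var)}


# Made-up example: the "historical" closes sit a constant cent above the "live" ones
live = {1: 10.00, 2: 10.10, 3: 10.20}
hist = {1: 10.01, 2: 10.11, 3: 10.21, 4: 10.30}
report = compare_closes(live, hist)
```

A mean difference that is large relative to its standard deviation would point to a systematic offset, while a mean near zero with a nonzero spread would look more like random noise. Deciding which source is right would still need something like exchange records.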

Sally,

Yes, I get the point, but it seems that if any kind of discrepancy/glitch/quality control issue comes up, you're kinda stuck. It would be nice to be able to run backtest simulations and research platform analyses on the same data as used for live trading. As it stands now, it isn't even clear in live trading which data are from Nanex and which come from the historical source. So, a starting point would be to add a few lines to the help page with the details.

Grant

Quantopian support / Dan Dunn,
Please do look at Grant Kiehne's concern about the validity of even the 1-minute bars. I do see your comment on the 'minimal' time to investigate. Please advise on the latency you see when doing a pass-through comparison of IB 1-minute bars and the Nanex feed piped through Quantopian. This would give newer folks trying to assess your platform some measure of the latency-induced slippage that is OK and reasonable.

Great job in conceiving and implementing your platform.

Regards