Quantopian historical daily volume data seriously off

Stock: NVO
Date: Feb 11, 2013

NASDAQ, MarketWatch, and Yahoo all give almost the same volume for this day: around 11,340,000.
While Quantopian matches stock prices (Open, High, Low, Close) with the above sources, it is off by almost 2 million (!!!) on the volume: 9,441,175

And not only for this date: the volume is also off on the days before and after it.

I came across this purely by accident while testing an algo. It creates serious doubt about the accuracy of the data.
I would understand if - at this scale of volume, a few tens of millions - the volume was off by a few thousand, or even a few tens of thousands. But by almost 2 million???

30 responses

Same thing for Apple (AAPL) on the same date:

Volume as per NASDAQ, YAHOO: 128,206,578
Volume as per Quantopian: 118,627,131

Discrepancy in volume data: close to 10 million (!!!) shares
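
For scale, the two gaps quoted so far can be expressed as a share of the reference volume. This is only a minimal Python sketch: the helper name `volume_gap` is made up for illustration, and the NVO reference figure is the approximate one quoted above.

```python
def volume_gap(reference, quantopian):
    """Return (absolute gap, gap as a fraction of the reference volume)."""
    gap = reference - quantopian
    return gap, gap / reference


# Figures quoted in the posts above for Feb 11, 2013:
# (volume per NASDAQ/Yahoo, volume per Quantopian).
samples = {
    "NVO": (11_340_000, 9_441_175),  # reference is the approximate figure quoted
    "AAPL": (128_206_578, 118_627_131),
}

for symbol, (reference, quantopian) in samples.items():
    gap, fraction = volume_gap(reference, quantopian)
    print(f"{symbol}: {gap:,} shares lower ({fraction:.1%} of the reference volume)")
```

On these figures the gap works out to roughly 17% of the reference volume for NVO and about 7.5% for AAPL.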

The help page does not specifically say, but they could be getting data from a source that does not have totals for off-exchange venues like ECNs and dark pools....

Add it in yourself via fetcher.

We go into this question a bit in our FAQ.

There are two key differences between the data source used by Yahoo et al and by Quantopian.

1. Quantopian is only reporting on trades that happen during market hours - no pre-market or after-hours crossing session data. That causes volume differences.
2. Quantopian gets data from all exchanges, whereas Yahoo et al are using the "exchange of record" - that means prices, particularly opens and closes, differ in any given bar.

I think the more you get into it, the more you'll find that trade data has no "right" answer. What you have to work with is "this is the answer that is predictably generated by this particular data process." In Quantopian's case, the process is to get the full firehose of trades on the major exchanges and then generate minute bars from that data.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

It's important not to think of Quantopian's "daily bars" as daily bars in the usual sense, which are calculated by a specific, known methodology and are comparable between vendors.

The "daily bars" on Quantopian are really just bars which are generated from a tick feed using some subset of the ticks that occur between 9:30am and 4:00pm EST. As such, they may or may not include cancellations, pre- and post- market data, late-reported trades ... I actually have no idea which tick flags are used to include or exclude ticks from their NxCore feed.
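
To illustrate how the choice of ticks changes the volume of a daily bar without touching its prices, here is a rough Python sketch. It is not Quantopian's or any vendor's actual process: the function `daily_bar`, its flag, and the trades are all made up, and it simply assumes that prices come from regular-session trades while extended-hours trades may or may not be counted in the volume.

```python
from datetime import datetime, time

SESSION_OPEN = time(9, 30)
SESSION_CLOSE = time(16, 0)


def daily_bar(trades, regular_session_only):
    """Build one daily bar from (timestamp, price, size) trades sorted by time.

    Assumption of this sketch: prices always come from regular-session trades;
    the flag only controls whether extended-hours volume is counted.
    """
    # Whether a trade stamped exactly at the close (e.g. a closing cross) is
    # included is a convention; this sketch excludes it.
    session = [t for t in trades if SESSION_OPEN <= t[0].time() < SESSION_CLOSE]
    counted = session if regular_session_only else trades
    prices = [price for _, price, _ in session]
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
        "volume": sum(size for _, _, size in counted),
    }


# Hypothetical trades for one day, not real market data.
trades = [
    (datetime(2013, 2, 11, 8, 15), 100.0, 500),    # pre-market
    (datetime(2013, 2, 11, 9, 30), 100.5, 1000),
    (datetime(2013, 2, 11, 12, 0), 101.2, 2000),
    (datetime(2013, 2, 11, 15, 59), 100.9, 1500),
    (datetime(2013, 2, 11, 16, 5), 100.8, 700),    # after-hours
]

in_session = daily_bar(trades, regular_session_only=True)
with_extended = daily_bar(trades, regular_session_only=False)
print(in_session)
print(with_extended)
```

Under that assumption the two bars share the same open, high, low and close and differ only in volume, which is the pattern reported at the top of this thread.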

The short story is, you can't compare Quantopian data with anything else.

From my recent reading of various posts it is my understanding that the historic data provider is not Nanex. Nanex provides only the real time feed. But your point is well taken @Simon.

Based on this story and this historical quote table, Yahoo Finance historical quotes do not include pre- or after-market prices (and, by extension, not the volumes either).
That leaves the example I pointed to still unexplained.

Unless "just bars … occurring b/w 9:30 am and 4:00 pm" means Quantopian is focusing only on minute (intra-day) trading algos.
Otherwise, how could one build a solid and validated trading algo without being able to rely on a close enough reflection (within a statistically insignificant error) of the daily opening or closing price? That would mean Quantopian is not reliable for algos that make decisions based on daily (as opposed to minute) data.

Yahoo Finance historical quotes do not include pre- or after-market prices (and, by extension, not the volumes either).

Be careful making that leap, usually most pre-market trades count towards daily volume, even if they cannot set the last price.

In any case, I don't want to speak for Quantopian here, this was just my understanding of how they calculated the bars for their daily backtests. I just wanted to convey that I don't believe it is possible to do any reconciliation of Quantopian's daily bars with anyone else's. There have been several posts about this I believe, but I am not sure quite how to find them.

Let's not sidetrack the discussion with irrelevant generalizations: I pointed out a discrepancy only in the volume. The rest of Quantopian's daily data - open, close, high, low - seems to be pretty much in line with the other major data sources.

But discrepancies of millions of shares - close to 10% (!!!) of the whole volume for the day - are not explained for me by such arguments.
Also, to answer another argument above: AAPL is traded on NASDAQ only, period. There are no "other" exchanges where it is traded, so in that case that cannot be the reason for the huge discrepancy.

I strongly suspect the discrepancy is the inclusion or exclusion of the opening and closing crosses.

@Simon Thornington: if that were the case, it'd be great for Quantopian to have a special note on that in the FAQ. That way, anyone testing their own algos would know in advance that there is an intentional effort to present a different set of data, and what the rationale behind it is.

The way it looks right now is pretty puzzling: daily prices match the other sources, but daily volumes are way off.

Well, I wish I could find the threads from last year or the year before, but yeah I agree!

Kris, I need to respectfully disagree with the statement that "AAPL is traded on NASDAQ only." It is correct to say that AAPL is listed on NASDAQ, but it is traded on each of the dozens of exchanges. For example purposes, check out this image from Nanex. That image is representing the price and volume of the hundreds of trades per second in AAPL, and it's also using color to represent on which exchange the actual trade occurred. Yahoo et al do not aggregate the data themselves; they purchase their price/volume data for AAPL from NASDAQ, the listing exchange of record.

For live trading we purchase our data from Nanex, the NxCore product. We buy the full firehose from them - every trade - and we aggregate it ourselves. For backtesting, we purchase our data from a different data provider who gives us pre-aggregated minute bars. (And yes, I agree that we need to make it so that you can backtest with the NxCore data. We haven't done it yet partly because we don't have the data as far back as 2002, and partly because it's a lot of work. But we will do it!) In the end, the different aggregation methods mean that you end up with different final numbers.

Eventually we'll want to support the after-market crossing, and when we do we will certainly add the data to support that. For now, we only support in-market hours trading, and we provide the in-market trading data. Until we add the after-market data, you'd need to import that data using Fetcher if your trading strategy uses it. Obviously, that data won't be useful in Quantopian until the next morning when trading reopens.

As for the specific date for NVO that you've pulled out, unfortunately I don't have a good way to dig deeper into that. I haven't been able to find a data source that tells me the after-market volume for a specific stock for a specific day. The NASDAQ site isn't super helpful.

Thanks for the comprehensive answer, Dan! It definitely clarifies a lot of things. And I admire your efforts!
I highly recommend adding these additional clarifications to the FAQ. It would be of great help to understand these things in advance.

On the particular case: I gave an example with AAPL too. After-hours data might not be easily available for that either, but the article and the historical quote I pointed to at least prove that, as far as for example "Daily Low Price" is concerned, the after-market value is not included in historical daily quotes.
Otherwise we would see $110 as the Low (commented on in the article), as opposed to $124 (both mentioned in the article and seen in the historical quotes).
Considering that NASDAQ specifically clarifies its after-hours trading terms - here - I'd assume that after-hours volume is not included in NASDAQ's historical daily quotes.

We go into this question a bit in our FAQ.

There are two key differences between the data source used by Yahoo et al and by Quantopian.

1. Quantopian is only reporting on trades that happen during market hours - no pre-market or after-hours crossing session data. That causes volume differences.
2. Quantopian gets data from all exchanges, whereas Yahoo et al are using the "exchange of record" - that means prices, particularly opens and closes, differ in any given bar.

I think the more you get into it, the more you'll find that trade data has no "right" answer. What you have to work with is "this is the answer that is predictably generated by this particular data process." In Quantopian's case, the process is to get the full firehose of trades on the major exchanges and then generate minute bars from that data.

This is interesting. What data do most folks who write backtests use: pre-market or non-pre-market data? Data from all exchanges or from the "exchange of record"? I suppose ultimately it doesn't matter, since you are trying to identify patterns for a given dataset. However, it does mean that if we have existing studies/backtests that seem to show success based on other (e.g. Yahoo) datasets, we cannot use them here.

Hi Dan,

Why not contact your data vendor(s) and ask them to explain the difference to us? I have to agree with Kris that such a large difference should have a definite reason behind it. Reading through this thread, I don't think his question has been answered, but I'd expect that your vendor would know the answer, given that it is their business to have a handle on such discrepancies. Who knows, maybe there is a bug/flaw somewhere (yours or theirs), and you'll catch it by digging deeper.

Grant

@mbs mbs: It matters a lot, since patterns might appear in one dataset and not in another if the data is too different, which is the case for Quantopian's daily volume. As I have shown above, Yahoo's publicly available dataset at finance.yahoo.com uses only intra-day values, at least for daily prices (open, close, low, high). Otherwise, we would have seen different lows and highs. Same for NASDAQ.
But again, the discrepancy I have noticed concerns daily volume only - daily prices are the same as in the major data sources.

@Kris - to clarify, what I meant is that given a set of data, even if that data is 'wrong', as long as it's consistently wrong both historically and going forward... if I can identify a pattern, I can trade that pattern. Now of course that means that a system I write against that data likely will not be usable if the data source changes...

I also noticed different highs in SPY compared to yahoo. Is there a way to download the data that is used here ? or should i just make an algorithm that prints out the data ? :-D

thx

Michal.

Is there a way to download the data that is used here ?

The terms of Quantopian's license with our data vendor allow the data to be used only on our site within the Quantopian application.

or should i just make an algorithm that prints out the data ? :-D

Michal, I know you were joking, but I want to make sure it is clear to everyone that any attempt to extract the pricing and volume data from our site would be a violation of our terms of use.

In other words, don't do that.

hm i am really asking myself how other people are doing their research without the data...

Well anyway i solved the problem for myself.

hm i am really asking myself how other people are doing their research without the data...

I'm going to let someone else address that particular question, because there are others who will do a much better job of it than I can. I was just jumping in about the data question because keeping our data secure is my job, so it's a hot-button issue for me. ;-)

Some folks here use Yahoo data in Python; I quite liked MATLAB, and others like R.

It is mid-2018 and the question is unanswered; the issue is still unresolved.
This is a huge problem in my mind. I am just getting started on Quantopian, and the first thing I noticed was a major discrepancy in volume data. Everywhere else - Google, Yahoo, Morningstar, etc. - the volume data is in agreement, but Quantopian is off by quite a bit!
I wonder if everyone is aware of this problem. I wouldn't have even started evaluating Quantopian had I known, because the strategy I am trying to code depends heavily on volume data.

5/31/2018: MMC volume reported on Quantopian is 1.6 mm; everywhere else it is 3.1 mm, so about 50% lower. Quantopian volume data is consistently lower, by anywhere from 10% to 50% based on a small manual sample.

Does this matter, if the relative volumes are still correct? For example, if one compares the ratio of AAPL to MSFT versus time on Quantopian compared to Yahoo finance or whatever, if the ratio is correct, isn't that good enough for algo development?

In the context of algo development for the contest/fund, one can find the total volume of the QTradableStocksUS (QTU) point-in-time and then compute the relative volume of each stock in the QTU, for example. Or do a ranking by volume. Etc. It is not clear why the absolute volume level is required, but perhaps there are use cases I am not considering.
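
As a rough sketch of that idea in plain Python (not using the Quantopian API): the tickers and volumes below are made up, and `volume_shares` and `rank_by_volume` are hypothetical helper names. It compares each stock's share of total universe volume, and the volume ranking, across a reference source and two alternative sources.

```python
def volume_shares(volumes):
    """Each symbol's volume as a fraction of the total across the universe."""
    total = sum(volumes.values())
    return {symbol: volume / total for symbol, volume in volumes.items()}


def rank_by_volume(volumes):
    """Symbols ordered from highest to lowest volume."""
    return sorted(volumes, key=volumes.get, reverse=True)


# Hypothetical daily volumes for a three-stock universe.
reference = {"AAA": 3_000_000, "BBB": 1_000_000, "CCC": 2_000_000}

# A source that reports a uniform 80% of the reference volume.
uniform = {symbol: volume * 4 // 5 for symbol, volume in reference.items()}

# A source whose shortfall varies by stock (90%, 50%, 90% of the reference).
uneven = {"AAA": 2_700_000, "BBB": 500_000, "CCC": 1_800_000}

for name, source in (("reference", reference), ("uniform", uniform), ("uneven", uneven)):
    print(name, rank_by_volume(source), volume_shares(source))
```

With a uniform shortfall, the shares and the ranking are unchanged. With an uneven shortfall, the shares shift even though the ranking in this particular example stays the same, so whether relative volume is "good enough" depends on how uniform the gap between sources actually is.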

Hi everyone. I use data from Yahoo to set my order specs, bring it in using Fetcher, and then rely on the Q data for the intraday. The volume filter I use to determine the order size is only to minimize slippage (now consistent with the default slippage model in Q, thanks to Dan, Grant and Leo). Given that volume is set so low in the default model, the discrepancy between Q's volume and everyone else's may not matter at what Dan calls (us mere) mortal levels. Hope this helps - if you need specifics let me know - I will post my code after a couple more tweaks. Happy weekend, Savio

@Grant Kiehne, ultimately I want to run the algo in real trading on my IB (Interactive Brokers) account, and I am afraid it will yield different results based on real volume data. It looks like we can't trade with the Quantopian platform unless we win the competition. I am new here, so please let me know if I am wrong about not being able to trade.

@Savio Cardozo, would love to see and understand the solution you have built, please do share.

@Ha Nan, as far as I know Q does not offer live trading (other than through their allocation or via IB), but you can test your strategy on Q first. I should be able to post an update to my code by the end of the week. The following link to a post on Q explains the concept of what I am doing, implemented in Python (which I have yet to figure out, so I do the optimization in VB Excel and then fetch the orders into Q): https://www.quantopian.com/posts/the-efficient-frontier-markowitz-portfolio-optimization-using-cvxopt-repost-cloning-of-nb-now-enabled

"NCLH", 2019-02-27, volume in Quantopian is 1.58M, but the data in other sources are 3.12M.

"PDD", 2019-02-28, volume in Q is 7.17M, but in the other sources it's 27.78M.

I definitely can't rely on Quantopian's volume data for daily research.

Any update or improvement on this, now that several years have passed?
@Dan Dunn

@Jonathan Kamens
Does nobody care about this issue?