

I have two questions regarding the data that Quantopian uses:

  • Where does Quantopian get its data from?
  • Is the daily data reconstructed from the minute data, or is it from a
    separate database?

Thanks for answering


We purchased the minute-bar data. We haven't disclosed who the vendor is simply because we don't have their permission yet! We're working on getting that in a future negotiation.

The daily bars are constructed from the minute bars.
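For anyone curious what that roll-up looks like, here is a minimal sketch in plain Python. It is not Quantopian's actual pipeline: the tuple layout, the `minute_to_daily` name, and the rule that the calendar date of the timestamp defines the trading day are all assumptions made for illustration.

```python
from datetime import datetime


def minute_to_daily(minute_bars):
    """Roll (timestamp, open, high, low, close, volume) minute bars up into daily bars.

    Assumes the bars are sorted by timestamp and that the calendar date of
    each timestamp identifies the trading day (a hypothetical convention).
    """
    daily = {}
    for ts, o, h, l, c, v in minute_bars:
        day = ts.date()
        if day not in daily:
            # First minute bar of the day sets the open and seeds the rest.
            daily[day] = {"open": o, "high": h, "low": l, "close": c, "volume": v}
        else:
            bar = daily[day]
            bar["high"] = max(bar["high"], h)   # highest high seen so far
            bar["low"] = min(bar["low"], l)     # lowest low seen so far
            bar["close"] = c                    # last minute bar sets the close
            bar["volume"] += v                  # volume is summed over the day
    return daily


# Made-up sample data: two minute bars on one day, one on the next.
minute_bars = [
    (datetime(2013, 1, 2, 9, 31), 10.0, 10.5, 9.9, 10.2, 1000),
    (datetime(2013, 1, 2, 9, 32), 10.2, 10.8, 10.1, 10.7, 1500),
    (datetime(2013, 1, 3, 9, 31), 10.6, 10.9, 10.4, 10.5, 800),
]
daily_bars = minute_to_daily(minute_bars)
```

The open comes from the first minute bar, the close from the last, the high and low are the extremes, and the volume is the sum. A real feed would also have to decide how to treat missing minutes, half days, and corporate actions, none of which this sketch handles.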


Hello William & Dan,

Good question. It is a bit mysterious how OHLCV data for thousands of securities trading on multiple markets could be generated at a minute-level rate. It seems like no small feat. Is the Quantopian vendor actually capturing every trade (even the high-frequency events)? I'd be interested in the details. It seems like way too much data to capture in non-volatile memory, and then post-process, right? Is it a real-time system that captures OHLCV data on-the-fly and streams to disk?

It is also a mystery why the data are expensive. I've heard various explanations, but none have been convincing. Is it a monopoly situation?

Also, I'd think that your vendor would have no problem with Quantopian advertising their service. What's there to negotiate? Or do you consider the name of the vendor to be a Quantopian competitive secret (which would be a reasonable stance)?


Data being expensive is nothing new. It's because of licensing fees (paid as an ongoing service, not a one-off cost, since you want the data updated daily). At the top of the chain is the exchange, which can charge very large sums for complete data access to the relatively small number of firms who can pay it. That's the top of the pyramid, and the cost filters down from there.

I assume the data is held "in the cloud". Servers with 128 GB of RAM are not out of the question, but the load is of course spread across multiple machines. It's all about virtual servers these days, where memory is pooled across any number of servers or clusters, so the amount of memory available is not limited to one physical machine and can vary dynamically.

This topic raises a question: how does the data provider's "brand name" relate to the quality of the data?

Would people have a different opinion of the data if it were linked to a certain data provider? This could bias someone either positively or negatively. A high-profile "trusted" provider might cause traders to be overconfident. Similarly, one might be negatively biased by a "low quality" provider.

Generally speaking, data is not to be trusted. If your strategy depends on clean data, you will be in trouble. In the real world, data is very messy; there are misquotes, errors, etc. that can cause problems. It would be a worthwhile thought experiment to look at what happens to your strategies if some small random noise is added to a real price series.
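That thought experiment is easy to try. Below is a rough sketch under several assumptions: the prices are a synthetic random walk rather than real data, the "strategy" is a toy moving-average crossover chosen only as an example, and the noise is multiplicative Gaussian with a 0.1% standard deviation. All function names are made up for this post.

```python
import random


def sma(values, window):
    """Simple moving average; element k covers values[k : k + window]."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def crossover_positions(prices, fast=5, slow=20):
    """1 when the fast SMA is above the slow SMA, else 0, aligned to the slow SMA."""
    fast_sma = sma(prices, fast)[slow - fast:]  # drop the head so both series line up
    slow_sma = sma(prices, slow)
    return [1 if f > s else 0 for f, s in zip(fast_sma, slow_sma)]


def add_noise(prices, rel_sigma, rng):
    """Perturb each price by independent Gaussian noise of relative size rel_sigma."""
    return [p * (1.0 + rng.gauss(0.0, rel_sigma)) for p in prices]


def strategy_return(prices, positions, slow=20):
    """Compound return from holding each position over the following price step."""
    total = 1.0
    offset = slow - 1  # positions[i] is decided at prices[offset + i]
    for i, pos in enumerate(positions[:-1]):
        t = offset + i
        if pos:
            total *= prices[t + 1] / prices[t]
    return total - 1.0


rng = random.Random(42)
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] * (1.0 + rng.gauss(0.0002, 0.01)))  # synthetic random walk

clean_pos = crossover_positions(prices)
noisy_pos = crossover_positions(add_noise(prices, 0.001, rng))

# Signals come from clean vs. noisy data; returns are measured on the clean prices.
flipped = sum(a != b for a, b in zip(clean_pos, noisy_pos)) / len(clean_pos)
clean_ret = strategy_return(prices, clean_pos)
noisy_ret = strategy_return(prices, noisy_pos)
print(f"positions changed by noise: {flipped:.1%}")
print(f"return with clean signals: {clean_ret:.2%}, with noisy signals: {noisy_ret:.2%}")
```

Repeating this over many noise seeds and noise sizes gives a feel for how fragile a strategy is. If a tiny perturbation flips a large share of positions or swings the return wildly, the strategy is probably leaning on data that is cleaner than anything you will see live.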
