Historical data issues

SLMB_P: on 2009-10-27 open_price and high = 199000.00
EEMS_P: on 2005-07-26 open_price and high = 2000.00

19 responses

Yahoo Finance reports $36.51 for SLM Corporation (SLMBP) on Oct. 27, 2009, but 0 volume: https://finance.yahoo.com/q/hp?s=SLMBP&a=09&b=11&c=2009&d=10&e=21&f=2009&g=d.

TDS_S: on 2008-10-14 open_price and low = 0.01
BCH: on 2008-06-12 open_price and high = 227.994 while low and close_price = 42.09
CNP: on 2010-05-06 low = 0.077 while open/high/close around 14 (flash-crash day though)
GGB: on 2007-08-16 low = 0.005 while open/high/close around 10
NTL: on 2010-04-21 low = 1.96 while open/high/close around 22

These bad data points are easy to spot in the research notebook.
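For anyone who wants to reproduce this kind of screen, here is a minimal sketch, assuming daily bars in a pandas DataFrame with the column names used above (open_price, high, low, close_price). The function name and the 3x high/low threshold are my own choices for illustration, not anything Quantopian applies. The middle row uses the BCH values listed above; the other two rows are made up.

```python
import pandas as pd

def flag_suspect_bars(bars: pd.DataFrame, max_ratio: float = 3.0) -> pd.DataFrame:
    """Return the bars that look internally inconsistent or implausibly wide."""
    ohlc = bars[["open_price", "high", "low", "close_price"]]
    # Consistency: high must be the largest field, low the smallest, all positive.
    inconsistent = (
        (bars["high"] < ohlc.max(axis=1))
        | (bars["low"] > ohlc.min(axis=1))
        | (ohlc <= 0).any(axis=1)
    )
    # Width: a high more than max_ratio times the low in one bar (assumed threshold).
    wide = bars["high"] > max_ratio * bars["low"]
    return bars[inconsistent | wide]

bars = pd.DataFrame(
    {
        "open_price": [42.00, 227.994, 42.30],
        "high": [42.50, 227.994, 42.60],
        "low": [41.80, 42.09, 0.01],
        "close_price": [42.10, 42.09, 42.40],
    },
    index=["normal day", "BCH 2008-06-12", "one-cent low"],
)
suspects = flag_suspect_bars(bars)
print(suspects)
```

A flagged bar is only a candidate for review. A genuine event such as the 2010-05-06 flash crash would trip the same check.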

Please explain what kind of data sanity checks you apply to the database. Thanks.

Alexis, thanks for pointing these out. We get our historical data from a provider and they, like all data vendors, are not perfect. We actively work to clean the data and maintain its integrity. The latest round of effort was shared here. I've added these prices to the list for the next round of fixes.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Alisa, could you expand on the data sanity checks that Quantopian applies to the database? It's quite an important topic and we need to trust the data on which we run backtests. Thanks.

If you dedicate a programmer to this job for a day, you'll probably find hundreds. I think these are fresh:

EOX (26560)  
split 1:7 on 22 Oct 2012  
get_pricing(26560, fields=['price'], start_date='2012-06-02', end_date='2013-06-02').plot()


MEIP (25776)  
split 1:6 on 19 Dec 2012  
get_pricing([25776], fields=['price'], start_date='2012-11-15', end_date='2013-01-02').plot()


CALI (34643)  
split 1:6 on 10 Oct 2012  
get_pricing([34643], fields=['price'], start_date='2012-10-01', end_date='2013-01-02').plot()


OPXA (30504)  
split 1:4 on 17 Dec 2012  
get_pricing([30504], fields=['price'], start_date='2012-12-01', end_date='2013-01-01').plot()  

In a D.E. Knuth approach, Quantopian should consider offering $1 per correct data error. Offering a bounty would certainly spruce up the database :)

21437 (PRCS) : PRAECIS PHARMACEUTICALS INC  
split 1-5 on or after 1st Nov 2005  
get_pricing([21437], fields=['price'], start_date='2005-06-01', end_date='2006-01-01').plot()


29324 (CDII) : CD INTERNATIONAL ENTERPRISES I  
1-26 followed 26-1 forward split (?)  
get_pricing([29324], fields=['price'], start_date='2008-06-01', end_date='2009-06-01').plot()


35020 (RMS) : RYDEX INVERSE 2X S&P MIDCAP 400 ETF  
bad print for April 2010?  
get_pricing([35020], fields=['price'], start_date='2010-01-01', end_date='2010-06-01').plot()


40408 (CVOL) : CITIGROUP FUNDING INC  
split 1-10 on 2 Jan 2013  
get_pricing([40408], fields=['price'], start_date='2012-06-01', end_date='2013-06-01').plot()


There seem to be a lot of stocks that hit exactly 0.01 in price for one day. Are they all bad prints?


9621 (OXGN) : OXIGENE INC  
split 1-12 on 28 Dec 2012  
get_pricing([9621], fields=['price'], start_date='2012-06-01', end_date='2013-06-01').plot()


24888 (BOSC) : BOS BETTER ON-LINE SOLUTIONS  
split 1-4 on 14 Dec 2012  
get_pricing([24888], fields=['price'], start_date='2012-01-01', end_date='2013-06-01').plot()


27925 (MGT) : MGT CAPITAL INVESTMENTS INC  
split 15-1 on 21 Mar 2012  (not sure?)  
get_pricing([27925], fields=['price'], start_date='2012-02-01', end_date='2012-04-01').plot()  

28112 (SBLK) : STAR BULK CARRIERS CORP  
split 1:15 on 15 Oct 2012  
get_pricing([28112], fields=['price'], start_date='2012-02-01', end_date='2013-04-01').plot()


9189 (KGC) : KINROSS GOLD CORP NEW  
no split? but something's up  
get_pricing([9189], fields=['price'], start_date='2004-06-01', end_date='2005-06-01').plot()  

There seem to be a lot of stocks that hit exactly 0.01 in price for one day. Are they all bad prints?

I encountered an example of this last night during a backtest! For one day only (I think it was January 24, 2008), AAB_WS drops to $0.01 and then returns to roughly $7. That convinces my algorithm to buy obscene amounts of that stock (millions of shares) and then sell it. It had a modest return up until that point, but the algorithm completely loses it all at that moment.
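An isolated print like this can be screened for offline. The sketch below is only an illustration: the function name and the 50% threshold are assumptions, and the prices are made up to mimic the pattern (roughly $7, one bar at $0.01, then back). It looks one bar ahead, so it is suitable for checking historical data in research, not for use inside a running algorithm.

```python
import pandas as pd

def one_bar_spikes(prices: pd.Series, max_move: float = 0.5) -> pd.Series:
    """Return prices that jump away from the previous bar and snap straight back."""
    previous = prices.shift(1)
    following = prices.shift(-1)  # look-ahead: offline screening only
    # The bars on either side agree with each other...
    neighbours_agree = (following / previous - 1).abs() < max_move
    # ...but the bar in between is far away from the previous one.
    far_from_previous = (prices / previous - 1).abs() > max_move
    return prices[neighbours_agree & far_from_previous]

# Made-up series shaped like the case described above.
prices = pd.Series([7.10, 7.00, 0.01, 7.05, 7.20])
spikes = one_bar_spikes(prices)
print(spikes)
```

The first and last bars can never be flagged because one of their neighbours is missing.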

I love this platform, but after reading these posts it makes me wonder whether my algorithms' successes/failures in backtests can be trusted. Like many others, I'm sure, I'm enjoying this free testing service while considering using what I've learned here in live trading in the future. Even when the historical data is accurate, you run the risk of encountering patterns during live trading that your algorithm never encountered during back tests. Developing algorithms by testing them on bad historical data and then live trading with them only increases that risk.

It's up to Quantopian to decide how to convince us that our results are meaningful. One option would be to allow for simulated live trading using data from IB as an alternative to back testing. Another option would be to allow users to report concerns about the accuracy of data and to see warnings about it as the data is accessed during a back test. I'm sure others will think of even better suggestions.

Edit: Correction by Grant, in a comment below:

Note that Quantopian does not use data from IB for live trading, but rather a real-time minute bar feed based on Nanex Nxcore (see https://www.quantopian.com/faq#data).

Thanks Grant!

In the research notebook, I found for 2011 alone 23 instances where abs(return) > 50% in 1-minute bars (using open-to-open, high-to-high, low-to-low, close-to-close returns). I don't have the time to investigate them all though:

Equity(26487 [HK]) 1  
Equity(21612 [DNDN_Q]) 1  
Equity(5452 [NSM]) 1  
Equity(389 [AAMRQ]) 2  
Equity(20037 [VSEA]) 1  
Equity(35114 [SFSF]) 1  
Equity(16945 [RMBS]) 2  
Equity(26462 [NETL]) 1  
Equity(23686 [GSIC]) 1  
Equity(33752 [VRUS]) 1  
Equity(2202 [DFG]) 1  
Equity(3547 [HGIC]) 1  
Equity(10287 [TLB]) 1  
Equity(25865 [GLBC]) 1  
Equity(33894 [RRR]) 1  
Equity(40385 [SUNH]) 1  
Equity(31879 [SYUT]) 1  
Equity(8489 [GLBL]) 1  
Equity(27067 [TRGT]) 1  
Equity(21036 [AMRN]) 1  
Equity(16257 [CLDA]) 1  
Equity(31638 [HRBN]) 1  
Equity(21809 [SFN]) 1  
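Roughly, the count can be produced along these lines. This is a sketch rather than my exact notebook code: it assumes a DataFrame with one column per security and one row per minute for a single field (for example all the closes), and the column names and prices are invented.

```python
import pandas as pd

def count_extreme_returns(bars: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Count bar-to-bar returns whose absolute value exceeds threshold, per column."""
    returns = bars / bars.shift(1) - 1
    counts = (returns.abs() > threshold).sum()
    # Keep only the securities with at least one extreme return.
    return counts[counts > 0]

# Invented minute closes for two hypothetical securities.
closes = pd.DataFrame(
    {
        "AAA": [10.0, 10.1, 4.0, 10.0],   # drops about 60%, then rebounds 150%
        "BBB": [20.0, 20.2, 20.1, 20.3],  # ordinary moves only
    }
)
extreme_counts = count_extreme_returns(closes)
print(extreme_counts)
```

Running the same function on the open, high, low and close frames separately gives the open-to-open, high-to-high, low-to-low and close-to-close counts.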

One option would be to allow for simulated live trading using data from IB as an alternative to back testing.

Note that Quantopian does not use data from IB for live trading, but rather a real-time minute bar feed based on Nanex Nxcore (see https://www.quantopian.com/faq#data).

An open question, though, is whether Quantopian live trading still uses the backtest data set? For example, does history() pull from the backtest data set, or is there a separate historical data set that exactly matches the Nanex Nxcore minute bar live feed?

Note that Quantopian does not use data from IB for live trading, but rather a real-time minute bar feed based on Nanex Nxcore (see https://www.quantopian.com/faq#data).

Thanks for correcting that, Grant. I'll try to edit my original comment to correct it there as well.

One approach would be to write a script in the Quantopian research platform to screen for bad data, using the same universe as you plan to implement for your backtest. Apparently, bad data are a problem in live trading, as well (see https://www.quantopian.com/posts/tips-for-writing-robust-algorithms-for-the-hedge-fund, Tip #13).

+1 for James Jack's comment, "If you dedicate a programmer to this job for a day, you'll probably find hundreds." The way to tackle the problem, I'd think, would be for Q to have a script running 24/7/365 churning over the data looking for badness, versus putting the burden on Q users. Or maybe this is being done already, and badness still leaks out?

Anybody know the common causes of bad data/prints? It is mysterious that, in this world of electronic markets, such errors would be common. And wouldn't any data vendor that supplies inaccurate data end up going bankrupt? Or is it just a matter of getting what you pay for?

Another angle is that maybe bad data result in money-making opportunities if the misinformation propagates into the market, and participants start making bad decisions? Any cases of this? Maybe a strategy could be developed?

The reason I said that is because I dedicated a couple of hours and found quite a lot ;) So hopefully someone at Quantopian can be more productive with it:

Think of any unusual situation, write an algo to check for that unusual situation, and when it happens, print sid + date and flag for review. E.g. beta > 4 or beta < -2 is a good starting point. (I will post a few more I found a bit later if I have time...)
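As a rough sketch of the beta check, assuming you already have per-sid return series and a market return series of the same length (the sids and numbers below are made up, and the function names are mine):

```python
import numpy as np

def beta(asset_returns, market_returns) -> float:
    """Sample beta: cov(asset, market) / var(market)."""
    cov = np.cov(asset_returns, market_returns)  # 2x2 sample covariance matrix
    return float(cov[0, 1] / cov[1, 1])

def flag_unusual_betas(returns_by_sid, market_returns, low=-2.0, high=4.0):
    """Return {sid: beta} for every sid whose beta is above high or below low."""
    flagged = {}
    for sid, asset_returns in returns_by_sid.items():
        b = beta(asset_returns, market_returns)
        if b > high or b < low:
            flagged[sid] = b
    return flagged

# Made-up returns: sid 111 tracks the market, 222 moves 5x, 333 moves -3x.
market = [0.010, -0.020, 0.015, -0.005, 0.020]
returns_by_sid = {
    111: list(market),
    222: [5 * m for m in market],
    333: [-3 * m for m in market],
}
flagged = flag_unusual_betas(returns_by_sid, market)
print(flagged)
```

Anything flagged just goes on a list for a human to look at. A leveraged ETF will show up here legitimately.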

Thing is, at the end of day, you're always going to get some amount of bad data. But my opinion is:

Splits+dividends not accounted for are not 'bad data'; they must be correctly handled, period. Good luck trying to assess the many hundreds of splits like "1082 for 1000" from close prices alone: that kind of processing can only be done at Q's HQ. Even "1 for 2" can be tricky for volatile / thinly traded stocks.
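To show how far close prices alone can get you, here is a sketch that flags days where the close-to-close ratio is near a whole number N (or 1/N). The function name, the 15% tolerance and the prices are assumptions for illustration. It can only suggest simple ratios; something like "1082 for 1000" is indistinguishable from an ordinary daily move.

```python
def candidate_splits(closes, max_n=30, tol=0.15):
    """Return (index, direction, n) where the price moved by roughly a factor of n."""
    candidates = []
    for i in range(1, len(closes)):
        ratio = closes[i] / closes[i - 1]
        factor = max(ratio, 1 / ratio)  # always >= 1, whichever way the price moved
        n = round(factor)
        # Only whole-number factors of 2 or more, within the assumed tolerance.
        if 2 <= n <= max_n and abs(factor - n) / n <= tol:
            direction = "up" if ratio > 1 else "down"
            candidates.append((i, direction, n))
    return candidates

# Made-up closes: the price jumps about 7x between index 1 and index 2.
closes = [1.00, 1.02, 7.10, 7.00, 6.95]
found = candidate_splits(closes)
print(found)
```

An "up" candidate of n points at a possible unadjusted reverse split, a "down" candidate at a possible unadjusted forward split. Either way it needs confirming against a corporate-actions source.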

Bad prints/data should be eliminated, but I suspect this is almost impossible. I'm guessing any data vendor may accidentally spew out 0.01 from time to time. What would be nice is to control the level of badness with e.g. an inject_random_data() function. If bad prints occur in real-life trading, the algo must account for that possibility as much as possible. Robustness can be measured by just how often bad prints might affect it. Otherwise, how can you possibly hope to know?
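No such function exists on the platform as far as I know, so the following is purely a sketch of the idea, with a made-up name and parameters: replace a random fraction of prices with a bad value, using a fixed seed so a test run is repeatable.

```python
import random

def inject_bad_prints(prices, rate, bad_price=0.01, seed=0):
    """Return (corrupted copy of prices, indices that were replaced by bad_price)."""
    rng = random.Random(seed)  # seeded so the same corruption can be replayed
    corrupted = list(prices)
    replaced = []
    for i in range(len(corrupted)):
        if rng.random() < rate:
            corrupted[i] = bad_price
            replaced.append(i)
    return corrupted, replaced

# Made-up clean series; about 10% of bars get replaced by a one-cent print.
clean = [7.0 + 0.01 * i for i in range(200)]
dirty, replaced = inject_bad_prints(clean, rate=0.10)
print(len(replaced), "bars corrupted")
```

Running the same algo on the clean and dirty series at a few different rates would give a crude measure of how sensitive it is to bad prints.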

@Alisa,

Just curious...do you use your historical data vendor for live trading (e.g. to produce the data provided by history())? Or do all of the data come from Nanex Nxcore? The reason I ask is I'm wondering if it is at all feasible to pre-screen for bad data in live trading, or if screening has to be done as the algo runs live (assuming that Nanex Nxcore derived data are used exclusively for live trading)?

It is mysterious that, in this world of electronic markets, such errors would be common.

My 2 cents. I want to say: welcome to the real world. Bad data is a reality that you need to learn to deal with. It comes in many and sometimes very creative ways (wrong prices, missing prices, wrong data type, wrong decimal, etc.). Sometimes bad data comes directly from the exchange. Sometimes it's obvious, sometimes it's not (e.g., a trade price that is off compared to other trades around that time - how can you tell it is bad data?). When filtering prices, you make a trade-off between type I and type II errors (i.e., the risk of filtering out a trade that was valid or keeping a trade that was invalid). When you leave the filtering process to your data provider's judgement, you need to trust its process. The deeper you go into details, the more you realize that managing data is hard. So it's not a big surprise to see slight data differences between data providers.
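To make the type I / type II trade-off concrete, here is a toy sketch. Everything in it is invented: each trade is a pair of (relative deviation from a reference price, whether the trade was actually bad), and the filter simply drops trades whose deviation exceeds a threshold.

```python
def filter_errors(trades, threshold):
    """Return (type_1, type_2) error counts for a simple deviation filter.

    type_1: valid trades that the filter throws away.
    type_2: invalid trades that the filter keeps.
    """
    type_1 = sum(1 for dev, is_bad in trades if dev > threshold and not is_bad)
    type_2 = sum(1 for dev, is_bad in trades if dev <= threshold and is_bad)
    return type_1, type_2

# Invented sample: a genuine 3% gap and a subtle 2% bad print sit close together.
trades = [
    (0.001, False),
    (0.004, False),
    (0.030, False),  # real move, large deviation
    (0.020, True),   # bad print, small deviation
    (0.990, True),   # obvious bad print
]
strict = filter_errors(trades, threshold=0.01)
loose = filter_errors(trades, threshold=0.05)
print("strict:", strict, "loose:", loose)
```

The strict filter removes the genuine 3% move, the loose one keeps the 2% bad print, and no single threshold on this sample avoids both.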

The good news is that most of us don't need to go that far. We just need reasonably reliable data. Now, as James said, not accounting for dividends and splits is not an acceptable data issue. It must be fixed.

And wouldn't any data vendor that supplies inaccurate data end up going bankrupt? Or is it just a matter of getting what you pay for?

Assuming people would care about data quality, possibly. PremiumData, for example, has been providing historical stock data without dividends for years and seems to be doing quite well. More expensive data providers also have data issues (e.g., Bloomberg). Or just look at Quantopian. How many backtests have been run over the last 3 years? Hundreds of thousands if not millions. How many people complained about data issues? Just a few.

Hello, hello?

Issues haven't been fixed.

Hi Alexis - sorry we haven't gotten to these yet. Data issues like this get prioritized in with our other development work. How big the bug is and whether it affects real money trading, things like that drive the prioritization. The data bugs get lumped together so every new one we find makes it a "bigger" bug. At some point that moves it to the front of the line and we fix them all in a big chunk, like we did for the splits data correction last month. It will get done! Just not today, and not tomorrow. . . .


To add to Dan's post:

We are working on catching these errors internally. Over the last two weeks, we've identified many problematic datapoints. However, the full set of datapoints under investigation is enormous -- around a hundred billion datapoints. So there's lots of fine combing to do. Moreover, once we have identified a bad datapoint, not only do we need to change it, but we also need to make sure that the new datapoint is actually correct (in the context of the new dataset with changed datapoints). There are lots of edge cases, so this is in part a manual process.

So you can imagine that this is not something we can do instantaneously. Beyond that, we aim to publish corrected data as one or several big batches. Doing surgical hotfixes for individual datapoints would not be the right thing to do at this stage, for reasons related both to operations and to data sanity.

There are a lot of little flash crashes every day nowadays. It's just part of the market structure. Many "bad prints" are preventable, however, by filtering the ticks as they come in, since many of them have trade conditions that indicate that they are late-reported trades, derivatively-priced trades and so forth. It's much harder to identify which prices are "real" once they are already aggregated into minute bars.
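As a sketch of what filtering before aggregation means, assume each tick is a (minute, price, condition) tuple. The condition labels below are placeholders I made up, not the codes of any real feed, and the prices are invented.

```python
# Placeholder condition labels; a real feed has its own condition codes.
EXCLUDED_CONDITIONS = {"late", "derivatively_priced"}

def minute_bars(ticks, excluded=EXCLUDED_CONDITIONS):
    """Aggregate (minute, price, condition) ticks into OHLC bars, skipping excluded ticks."""
    bars = {}
    for minute, price, condition in ticks:
        if condition in excluded:
            continue  # dropped before it can contaminate the bar
        bar = bars.get(minute)
        if bar is None:
            bars[minute] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
    return bars

# Invented ticks: one late-reported trade at a stale price inside the 09:31 bar.
ticks = [
    ("09:31", 14.00, "regular"),
    ("09:31", 0.01, "late"),
    ("09:31", 14.05, "regular"),
    ("09:32", 14.02, "regular"),
]
filtered = minute_bars(ticks)
unfiltered = minute_bars(ticks, excluded=set())
print(filtered["09:31"], unfiltered["09:31"])
```

Once the bar is built without the filter, the 0.01 low is all that remains of the late trade and the condition that would have identified it is gone.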

Understood - thanks for the clarification!