VIX/VXV Pipeline Data Critical Issue (closing prices incorrect) (help?)

For the past several weeks, I have been building strategies using the Pipeline method of pulling in VIX and VXV data. Just recently, I was going through trades and verifying data and came across an alarming issue. I also came across a secondary issue, but that may be cleared up by correcting the first.

Primary Issue: VIX and VXV Pipeline data is frequently wrong. http://screencast.com/t/unp8uM9lMj1

  • The attached code shows this via output from log.info(). Basically, I build a Pipeline and attach VXV data, then simply output the VXV close price via log.info() for review. A manual comparison against actual VXV closing prices from another service finds many cases of closing-price duplication (i.e., the same closing price appearing on two consecutive days), with the real closing price for one of those days lost.
  • Additionally, to help prove the point, I import VXV data via the fetch_csv() function and output it via log.info() alongside the Pipeline VXV data. The fetch_csv data is spot on when compared against other services (see the sketch below).
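
To make the setup concrete, here is a minimal sketch of the comparison, not the attached code itself; the dataset import path, column names, CSV URL, and parsing options are my best guesses and may need adjusting:

```python
from quantopian.pipeline import Pipeline
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline.data.quandl import cboe_vxv  # assumed dataset name


def initialize(context):
    # Pipeline-sourced VXV close (column name assumed to be 'close').
    attach_pipeline(Pipeline(columns={'vxv_close': cboe_vxv.close.latest}),
                    'vol_pipe')
    # Independent reference series pulled straight from CBOE via fetch_csv
    # (URL, header rows, and date format are illustrative).
    fetch_csv('http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vxvdailyprices.csv',
              symbol='vxv_csv', skiprows=2,
              date_column='Date', date_format='%m/%d/%Y')


def before_trading_start(context, data):
    vxv_pipe = pipeline_output('vol_pipe')['vxv_close'].iloc[0]
    log.info('Pipeline VXV close: %s' % vxv_pipe)


def handle_data(context, data):
    # fetch_csv fields are read back with data.current(symbol, column).
    log.info('fetch_csv VXV close: %s' % data.current('vxv_csv', 'Close'))
```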

Secondary Issue: Pipeline data is offset by an extra day. http://screencast.com/t/unp8uM9lMj1

  • Using [-1] to pull the previous closing price, the results for VXV are actually from 2 days before. For example, if the closing price for 12/1/10 is 23.99, then it should first appear as available to the strategy on 12/2/10. In actuality, 23.99 shows up in the output as available on 12/3/10, a full day late.

This same issue can be found by using Pipeline data for VIX. In the code, simply uncomment the log.info() line that outputs the closing values for VIX to see it.

[Attached backtest (Clone Algorithm), Backtest ID: 57ce2a720e6868102e3f77e1]

Ouch! That's pretty bad...

I've verified this issue a few different ways, so it's not something that seems to be a one-off event. @Simon, your confirmation also gives confidence that it's not some anomaly in the code.

So the question becomes: are the forums the best place to alert Quantopian staff to this, or is there a better channel to go through to at least get an acknowledgement that they are aware?

Hi, thanks for the heads up. The forums typically work and also have the benefit of getting help from the rest of the community. A help ticket has a higher probability of being seen by a staff member.

I dug into this a little bit yesterday but haven't done a full investigation yet. The strange thing is that this occurs in timeframes prior to our daily loading of the data from Quandl, where the difference between the asof_date of each record (the datetime to which the data applies) and the timestamp (the datetime at which the data becomes available to a strategy) is uniform and set to a specific duration. We'll have to dig a little deeper here.

Thanks
Josh

That makes some sense as I've only confirmed this issue prior to 2016 for VIX and possibly earlier than that for VXV.

OK, I have a handle on it now.

Here's what we did originally with this data from Quandl.

For any data set, we have two segments of data: the data we load initially to reflect history, and the data we process on an ongoing basis.

As I mentioned above, the asof_date field is the datetime that we get from the data itself. It is typically the "date to which the record applies". So if a record has an asof_date of October 1, then for VIX those are the volatility metrics (open, close, etc.) for October 1.

We also have the timestamp field. The timestamp field indicates the time at which this data is actually available to an algorithm. This field is meant to help us prevent look-ahead bias. So if you have data about October 1, but it takes 10 days for Quandl (or the CBOE, or whoever) to publish the data and for Quantopian to process it, then the timestamp field will be October 11.
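
As a tiny illustration of how those two fields interact (pure pandas, made-up values):

```python
import pandas as pd

# One hypothetical record: the VIX metrics for October 1, but (in this made-up
# scenario) not published and processed until October 11.
record = pd.Series({'asof_date': pd.Timestamp('2010-10-01'),
                    'timestamp': pd.Timestamp('2010-10-11'),
                    'close': 23.70})

sim_dt = pd.Timestamp('2010-10-05')
# Point-in-time rule: a backtest running on Oct 5 must not see this record yet.
print(record['timestamp'] <= sim_dt)  # False; it becomes visible only on Oct 11
```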

For new data points that we load each day, the timestamp field is set to the time at which the data is actually stored and becomes available through the Quantopian API. We have actual values.

But for historical data that we loaded initially, we don't know what that value would have been, back in 2002 or 2010 or 2013. For that historical data, we need to assign a lag between the asof_date and the timestamp. In the case of these volatility data sets, we ran the processing for some time and then calculated the mean. Over that time period, we observed a 44-hour lag on average, and therefore set the difference between the provided asof_date and the timestamp to 44 hours.
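
A rough sketch of how a mean lag like that gets measured from the processed records (pure pandas; the records and timestamps below are made up purely for illustration):

```python
import pandas as pd

# Made-up sample of processed records: asof_date is midnight of the session the
# record describes, timestamp is when the record was actually processed.
df = pd.DataFrame({
    'asof_date': pd.to_datetime(['2013-09-09', '2013-09-10', '2013-09-11']),
    'timestamp': pd.to_datetime(['2013-09-10 20:00', '2013-09-12 08:00',
                                 '2013-09-12 08:00']),
})

lag = df['timestamp'] - df['asof_date']
print(lag.mean())  # 1 days 20:00:00, i.e. 44 hours for this made-up sample
```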

We did this calculation for each dataset from Quandl, so the actual lag varies from set to set. Some of the macroeconomic datasets have a 7-day lag (like the ADP employment data). Others are shorter (like Yahoo's VIX).

The conclusion is that the sample we took initially indicated the CBOE data from Quandl wasn't available prior to the open; it was available afterwards, and that is the cause of the lag in data availability in your backtests in 2010.

That said, we can look back at the data now that we've processed more data and reconsider the lag we provide on that historical data. You can check out the latest real lag by examining asof_date and timestamp for these data sets. My quick examination this morning led me to think that the lag in more recent data records has gone down.

Hope this helps explain the behavior.

Thanks
Josh

Wait, so you measured how long Quandl takes to get the close price of VIX into its data sets, and on average it took 44 hours?

That essentially confirms that Quandl is useless for anything.

To add some perspective from someone who live-trades this stuff: this only reconfirms my opinion that it's far more effective for me to fetch_csv these data straight from CBOE myself than to use any prefab versions from Pipeline/Data. Since it's my money, I care far more about getting the right numbers to trade on every day and in backtests, and about not being unnecessarily exposed to bugs in Quandl/Pipeline/Data.

Quandl VIX being lagged by 44 hours doesn't help me (by reducing surprises because backtests are as crappy as live trading); rather, it tells me that using Quandl VIX is simply out of the question, forever. I do understand why you do it this way, but please understand that a lot of people will view it as insane and useless.

Sorry "insane and useless" may be unfair. I am just surprised (shocked?) that the 44 hour lag was signed off on as acceptable when clearly, there's no reasonable explanation for not having the closing price of a common index by the following morning. This is not a systemic lag like those found in economic indices for prior weeks/quarters, this is an operational lag caused by software somewhere.

44 hours is a bit misleading because it's measured from the asof_date, which is stamped as midnight of the prior day. But your underlying point stands.

I think examining the more recent records coming in from CBOE via Quandl showed that the lag has settled back into a more usable state, and we can re-evaluate the historical lag that we impart on the data.

Also, the Yahoo VIX source from Quandl seems to come in more promptly.

I'd love to see the lag measured directly from the CBOE website. We can also pursue adding those sets to our data program.

A close examination of the data revealed the problem. It looks like the data from Fridays is not typically loaded onto the site until late Sunday or early Monday, so our simple mean of the lag was a poor model. The data is always ready for the next session; it's just that those Friday lags are causing problems in these cases. I'll submit a bug.
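
To illustrate with made-up values: breaking the observed lag out by weekday of the asof_date makes the Friday effect obvious, and shows why a single mean is misleading:

```python
import pandas as pd

# Made-up week of records: Monday through Friday sessions, each processed at
# 7am the next calendar day, except Friday's record which shows up Monday.
df = pd.DataFrame({
    'asof_date': pd.to_datetime(['2013-09-16', '2013-09-17', '2013-09-18',
                                 '2013-09-19', '2013-09-20']),
    'timestamp': pd.to_datetime(['2013-09-17 07:00', '2013-09-18 07:00',
                                 '2013-09-19 07:00', '2013-09-20 07:00',
                                 '2013-09-23 07:00']),
})

lag = df['timestamp'] - df['asof_date']
print(lag.groupby(df['asof_date'].dt.dayofweek).mean())  # Mon-Thu ~31h, Fri ~79h
print(lag[df['asof_date'].dt.dayofweek != 4].mean())     # ~31h with Fridays dropped
```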

I've been using VIX (and related) data from CBOE directly for years. The files are reliably updated every evening EST, so any delay from Quandl is on their side. When I've used CBOE's data directly myself (within the past several years), all VIX (and related) quotes have been available by 9pm EST. Truthfully, I was using VIX quotes as far back as 2009 and never experienced anything even close to a 44-hour lag, or even 24 hours. That's the real-world expectation, and I would argue that's what Quantopian should try to deliver.

Although I do understand Quantopian's desire to replicate the real availability of datasets brought into their system, I think Simon has a very valid point. I'll add that at some point you have to disregard the past, because going forward you only care about what is available at present. If I am basing a trading strategy on the VIX, I don't care that the VIX quote took 44 hours to get to Quantopian. I just care that the closing quote is accurate for the day it corresponds to, given that the VIX is a "real-time" calculation.

I think this is a big difference from economic data, or other lagging reports, that may be "AS OF" a specific date but don't get published until a week or month later. When dealing with quotes of most securities or indices, the biggest lag you will see is a few hours after the market close (in the vast majority of cases). If Quantopian has a delay getting reliable data into their system, that should be disclosed loudly and a faster method of bringing it in researched. At minimum, two feeds should be available: one that replicates the "real world" historical availability of the data, and one that "acts as if" all historical data had been available with the delay expected currently.
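
As a rough sketch of that second, "act as if" feed (pure pandas on a hypothetical frame with asof_date/timestamp columns; the 21-hour constant is just the "available by 9pm EST" observation above, not an official figure):

```python
import pandas as pd

def act_as_if(df, expected_lag=pd.Timedelta(hours=21)):
    """Re-stamp historical records as if they had always arrived with the
    delay actually expected going forward (e.g. the close being on the CBOE
    site by ~9pm EST, i.e. ~21 hours after the midnight asof_date)."""
    out = df.copy()
    out['timestamp'] = out['asof_date'] + expected_lag
    return out
```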

After all, if I'm backtesting a VIX system where I can expect to have closing daily quotes available every night, I want to backtest that system's robustness "as if" those quotes were always available to me because that's what I'm going to get going forward. If I can't get that going forward, Simon is right, the data is useless.

We're in the process of fixing this, as discussed in the previous comment. It's a relatively easy fix. Thanks for bringing it to our attention. In the meantime, the yahoo_index_vix dataset has a more realistic lag.
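
For anyone who wants to try it, swapping the dataset in a pipeline would look roughly like this (the import path and column name are from memory and may need adjusting):

```python
from quantopian.pipeline import Pipeline
from quantopian.algorithm import attach_pipeline
from quantopian.pipeline.data.quandl import yahoo_index_vix  # assumed name


def initialize(context):
    # VIX close sourced from the Yahoo dataset instead of the CBOE-sourced one.
    attach_pipeline(Pipeline(columns={'vix_close': yahoo_index_vix.close.latest}),
                    'vix_pipe')
```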

Looks like you posted while I was typing mine. I wasn't trying to belabor the point.

We will switch to yahoo_index_vix for VIX and cross-check against the CBOE .csv import. When you have an update on VXV, please let me know. We want to use and support Pipeline whenever possible, as that is the direction you guys are heading.

FYI, best to double-check that the Yahoo VIX values match CBOE's. Last time I checked, they had fairly regular data errors.

CBOE VIX CSV is updated with today's data. I forgot to check earlier. Quandl apparently pulled it an hour ago.

At the risk of belaboring the point further :) . . .

The updates almost always arrive promptly from Quandl in time for market open. The problem here is strictly related to the data used for backtests prior to when we started loading new data on a daily basis.

I'm also very interested in using the VIX data and I obviously would expect it to be both accurate and as current as possible.

I would like to learn more about this yahoo_index_vix dataset. Could someone please explain, or show some sample code, how I might get started with it? Is it like a library or sid that is available and would allow for the use and manipulation of current and historical VIX data?

Winston, I've learned and gained a ton from these forums, starting as a new Python programmer trying to incorporate strategies into Quantopian. Most beginner questions can be answered by reading the Help and Tutorial pages and searching the forums, comments, and algorithm source code.

I wrote an algorithm to help visualize which dates have discrepancies between CBOE's CSV files and Quantopian's Pipeline. No guarantees as to the complete accuracy of this, but it seems to match up with what I've seen (historically terrible, with recent improvements since mid-2015).

Basic premise: assume CBOE's published CSV files are 100% accurate. Flag the Pipeline values that do not match and display that visually on the backtest in the Custom Data section: 1 or -1 for mismatches, 0 for matches.

I found it very interesting that the same pattern of errors existed for VXV and VIX; they are almost mirror images. It definitely supports Josh's explanation of a historical import issue. This will make better sense if you examine the visual output.
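
The core of the flagging logic is something like the following simplified sketch (not the attached algorithm itself; it reuses the pipeline/fetch_csv setup from the sketch near the top of the thread, and the field names and tolerance are placeholders):

```python
from quantopian.algorithm import pipeline_output


def before_trading_start(context, data):
    # Previous session's VXV close as Pipeline sees it.
    context.pipe_vxv = pipeline_output('vol_pipe')['vxv_close'].iloc[0]


def handle_data(context, data):
    csv_vxv = data.current('vxv_csv', 'Close')          # reference value from the CBOE CSV
    if abs(context.pipe_vxv - csv_vxv) < 0.005:
        flag = 0                                         # values agree
    else:
        flag = 1 if context.pipe_vxv > csv_vxv else -1   # direction of the mismatch
    record(vxv_mismatch=flag)                            # plots under Custom Data
```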

[Attached backtest (Clone Algorithm), Backtest ID: 57d2232390d6421035cebde6]

Hi all,

We've since re-calculated the lag for the historical load of CBOE data from Quandl, excluding Friday values from the lag calculation. We now set the timestamp for CBOE data (via Quandl) to approximately 7am ET the following day, which makes it available for use in before_trading_start(). The primary issue highlighted by Chris at the outset of this thread should be resolved.

I've run Chris' helpful algorithm above. It shows that the only remaining deviations from the "perfect" post-facto CBOE file fall within our day-forward processing timeframe. Based on visual inspection of some of these instances, they reflect real cases where the newly processed data was not available in time.

Again, thanks for your help and I hope you make greater use of this data moving forward.

All the best,
Josh

Is there a pipeline which gets the data directly from CBOE, rather than through Quandl or Yahoo, which, as you point out, are sometimes delayed in updating their data?

In my experience (last week notwithstanding), the CBOE CSV files are quite reliably updated in time.

Thanks Josh, I inspected your fix by running the code for both the CBOE and Yahoo pipeline sources. CBOE is very accurate, except for 4 dates since 2015. Yahoo is much less so; upon inspection, it appears the closing values frequently differ by 2-10 cents on average (see attached backtest). Yahoo must be getting the closing value differently. Since that doesn't match any data source I've used in the past, I'll be eliminating Yahoo as an option and sticking with CBOE.

VXV also looks well fixed. I am curious: when you refer to the processing of new data not being available, do you specifically mean that the Quandl source did not have the data available by 7am the following day?

Lastly, would this be the proper place to request Pipeline inclusion of VXMT data from CBOE? Now that the VXV/VIX import issue is fixed, the same can be done for VXMT, and I may be able to enter some strategies into the "contest" since fetch_csv() won't be required.

[Attached backtest (Clone Algorithm), Backtest ID: 57d86ea80377cf102c69f8ff]

@Simon,
We do have a project in our backlog to get the data directly from CBOE, but it is lower in our priority queue (to be completely honest). The advantage of the (free) data sets through Quandl is that we have a generic integration with Quandl, so we can very easily add any time series-based data set. Integrating with a new source (like CBOE) is a bit more effort, and we have other partners signed up that are higher in the queue right now.

@Chris,
You asked "when you refer to processing of new data not being available, you are specifically meaning that the Quandl source did not have the data available by 7am the following day"

I do not know the root cause of these instances in which the data only became available after before_trading_start() runs. before_trading_start() runs at 8:45am. The data is typically available on Quantopian at around 7am (but sometimes it is later). There are a variety of reasons why the data might not be available on Quantopian in time for 8:45am:

1) CBOE has had problems and not made the data available to Quandl in a timely fashion
2) Quandl has had problems with their processing or their API and not made the data available in a timely fashion
3) Quantopian has had problems processing the data and not made the data available on the platform in a timely fashion

The underlying root causes for these processing problems could be wide ranging.

Any of those three organizations might have had a problem that caused the data to not be available in time for before_trading_start() at 8:45. I haven't dug into these scenarios in detail. This is one of the reasons I'd prefer (in the long term) to get the CBOE data directly from CBOE -- it would reduce the number of potential points of failure in this process. (For the record, I think Quandl is a great company and their API has proven to be very reliable.)

Regarding VXMT, I'm happy to throw that into our backlog. I thought I had found all the volatility indices on Quandl when we added the CBOE sets originally. I guess I missed this one (or it was subsequently added).

Thanks,
Josh

@Josh A few years ago, there was a suggestion that Quantopian would be adding actual proper index data to the minute-level feed. Intraday volatility indices would open up a whole new class of algorithms, though perhaps not ones that Quantopian would be interested in funding. Pipeline and Data is great, but to be honest, if the VIX data is not available in the morning (17 hours later!), it is essentially useless. Perhaps worse-than-useless, because many people will assume that the data is good. Not all will be so diligent as Chris in validating that the pre-packaged data from Quantopian meets basic integrity standards.

Thanks Josh, those sources of failure make sense and it's good to point them out. I'll keep an eye out for VXMT being added to the Pipeline data sources. That will be helpful when playing with some VIX concepts.

Not sure if this is the correct thread to ask this specific question, but here goes. Is there any way that I can enter a manual closing price? Normally with Python code I would just use input(). That doesn't seem to work in Quantopian. Is it me? Have I missed something?