Can we have more meaningful datasets, please?

Since Quantopian advocated using their datasets to generate alpha for allocations, I began to study the available datasets. Only a handful of them are meaningful. Of the 50 datasets mentioned, most are irrelevant ones from Quandl, etc. Even among the relevant ones, many use Blaze and are not supported in algorithms or Pipeline. Can you please add more datasets so that we can succeed in our quest for alpha?
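
To illustrate the distinction, here is a minimal sketch of what a Pipeline-compatible factor looks like on the platform, using the built-in pricing data as a stand-in (many of the add-on sets only expose a Blaze expression in research, not Pipeline columns):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.data.builtin import USEquityPricing
    from quantopian.pipeline.factors import SimpleMovingAverage
    from quantopian.algorithm import attach_pipeline, pipeline_output

    def initialize(context):
        # A 20-day moving average of close prices as a stand-in factor.
        sma = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=20)
        attach_pipeline(Pipeline(columns={'sma_20': sma}), 'my_pipe')

    def before_trading_start(context, data):
        # DataFrame indexed by asset, one row per security.
        context.output = pipeline_output('my_pipe')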

9 responses

Yes. We agree with the philosophy that datasets are the key to success in the quest for alpha. Adding datasets for you to work with is a top priority for us. The reason you haven't seen many new datasets added recently is that we are improving our data system so that we can add new datasets at a much faster rate. Our goal is to have thousands of datasets for you to choose from, and the changes we are making will make that possible.

All that being said, we're always open to requests for particular types of datasets. Are there any particular datasets that you'd like to see on the platform? If you don't feel comfortable sharing them in a public setting, feel free to email us through our support channel ([email protected]).

Real-time indices and commodities spot prices :)

Hi Jamie,

Thanks for your email. I'll drop a note to feedback with a list of the data sources I'm interested in.

Best regards,
Pravin

Hi Jamie -

Thousands of data sets? I'd be interested in your motivation, and in how you'd expect individual users to use that many data sets productively. It ends up being roughly one data set per listed company. Say each data set carries a teeny-tiny, transient, uncorrelated alpha. What would be the path to writing an algo that could get a $50M allocation? I can see how this might fit with the framework described at https://blog.quantopian.com/a-professional-quant-equity-workflow/ and Pipeline, assuming there is enough alpha in daily data. Do you think a naive equal-weight alpha combination will work, or will something more sophisticated be required to do the combination?
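
For what it's worth, the naive equal-weight combination I have in mind is just a cross-sectional z-score and average; a rough sketch (the factor_df frame of per-dataset factor values is hypothetical):

    def combine_equal_weight(factor_df):
        """factor_df: hypothetical DataFrame of raw factor values,
        indexed by asset, one column per dataset/factor."""
        # Cross-sectional z-score of each factor, then a simple average.
        zscored = (factor_df - factor_df.mean()) / factor_df.std()
        return zscored.mean(axis=1)  # one combined alpha value per asset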

On a related note, it sounds pretty daunting for a single Q user to sift through thousands of data sets, combining the good ones into a comprehensive, scalable algo. But say I picked one, and showed that there was a little bit of alpha there. How could I get paid, so that you could license my little gold nugget for the fund, and I could buy a sandwich?

One suggestion for a data set would be Internet health (e.g. https://www.akamai.com/us/en/solutions/intelligent-platform/visualizing-akamai/real-time-web-monitor.jsp). Daily data should be pretty easy to come by. And deriving a real-time minutely feed, I'd think, would be fairly straightforward. Even down to individual companies, it should be possible to get the data by writing a script to query site availability (e.g. Amazon, Facebook, etc.). At some level, there must be alpha in such data sets, but if someone is already doing it (almost certainly), then minutely data may not be fast enough. You might run this by Fawce though, given his do-good tone on https://www.quantopian.com/posts/phasing-out-brokerage-integrations. In all likelihood, you could be profiting off of criminal activity (but then, you are hooked up with Point 72, which has a very sketchy history, as portrayed in the book, Black Edge). By the way, if you do end up using my Internet data idea, my one-time licensing fee is $500 cash ($20 bills would be nice).
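
To be concrete, the per-company availability script could be as simple as this (the symbols and URLs here are only illustrative):

    import csv
    import time
    import requests

    # Illustrative stand-ins; in practice the list would cover the tradable universe.
    SITES = {'AMZN': 'https://www.amazon.com', 'FB': 'https://www.facebook.com'}

    with open('availability.csv', 'a') as f:
        writer = csv.writer(f)
        for symbol, url in SITES.items():
            start = time.time()
            try:
                ok = requests.get(url, timeout=5).status_code == 200
            except requests.RequestException:
                ok = False
            # timestamp, symbol, reachable?, response time in seconds
            writer.writerow([start, symbol, ok, time.time() - start])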

For the Q fund, would there be any way to publish data that would allow users to do the kind of algo viability analyses that you presumably can do? For example, say I'm working on writing a new algo. I'd like to know the degree to which it might be accretive, so that I know I'm not wasting my time (and your platform resources). Presently, it is a total open-loop time suck. Building a crowd-sourced fund without giving the crowd access to the fund as it is built would seem to be counterproductive. But alas, ironically, the whole crowd-sourced, collective, "we are all in this together" concept seems to be totally lost on you guys, in my opinion. What is your sense from the inside (we can start another thread, if you want)?

There are a lot of great comments here, but I'll focus on the area that I have expertise in!

Our allocation process attaches high value to algorithms that use alternative datasets. We evaluate all algorithms that use alternative data, including strategies that use either free datasets or premium datasets.

Along with that, we're working to add new and meaningful datasets; as Jamie mentioned, there is some product work being done to make that possible.

Seong

@ Seong -

Thanks. I knew all of that. If one writes a multi-factor algo, how will you assess the contribution of each factor after the alpha-combination and optimization steps? If the algo uses a point-in-time mix of traditional and alternative factors, how will you rank the algo for its use of alternative data sets? It is easy enough to mix in lots of factors, but the alpha could still be dominated by the traditional ones. It almost seems like you are interested in single-factor algos, which would make the whole assessment problem easier. Even then, how will you know the actual source of the alpha? I guess I don't quite follow (without unraveling the strategy in detail, which I think goes beyond your terms of use). And if the algo is accretive to the Q fund, the source of the alpha is arguably irrelevant anyway.
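
As a concrete example of the assessment problem, about the best I could imagine doing on my end is a per-factor information coefficient against forward returns before the combination step; a rough sketch (the factors and fwd_returns inputs are hypothetical):

    from scipy.stats import spearmanr

    def factor_ics(factors, fwd_returns):
        """factors: hypothetical DataFrame (asset x factor) of raw values;
        fwd_returns: Series of forward returns aligned on the same assets."""
        # Spearman rank IC of each factor against forward returns.
        return {name: spearmanr(col, fwd_returns).correlation
                for name, col in factors.items()}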

Hi Jamie -

Point-in-time data sets germane to pairs trading might be interesting. For example, a reference database of all pairs back to 2002, found by brute force, might be useful.
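
A rough sketch of the brute-force search I have in mind, using a cointegration test over a hypothetical frame of aligned close prices:

    from itertools import combinations
    from statsmodels.tsa.stattools import coint

    def find_pairs(prices, p_cut=0.05):
        """prices: hypothetical DataFrame of aligned close prices,
        one column per symbol, back to 2002."""
        pairs = []
        for a, b in combinations(prices.columns, 2):
            _, pvalue, _ = coint(prices[a], prices[b])  # Engle-Granger test
            if pvalue < p_cut:
                pairs.append((a, b, pvalue))
        return sorted(pairs, key=lambda x: x[2])  # best candidates first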

Also, you could grab a paper such as the one posted here and create a database to overcome platform limitations (see Pravin's Jan 23, 2016 comment, "If you increase number of stocks to 40+ either the kernel dies or it runs out of memory or it takes hours to complete.").

Generally, any point-in-time look-up tables to overcome platform limitations would be helpful.

Daily VWAP (see the sketch after this list)
Bid-ask
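
The daily VWAP, for instance, could presumably be derived from the minute bars you already have; a rough sketch, assuming data.history inside before_trading_start and 390 regular-session minute bars per day:

    def daily_vwap(data, asset):
        # Volume-weighted average price over the prior session's minute bars.
        bars = data.history(asset, ['price', 'volume'], 390, '1m')
        return (bars['price'] * bars['volume']).sum() / bars['volume'].sum()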

There's no doubt a long list of industry-standard historical data sets and feeds that you could deploy. For example, I see that you are working with SS&C (see http://investor.ssctech.com/releasedetail.cfm?releaseid=1019053), and they provide data (see http://www.ssctech.com/institutional-and-investment-management/Solutions/DataManagement/MarketDataServices.aspx). So it might be a simple matter of giving them a call and asking for a list of non-proprietary feeds that they could potentially provide to you and Q users. You could ask Nanex to do the same. In the end, you'd have a list that is common to Quantopian and the rest of the hedge fund world, but it is not clear whether that would put you at an advantage.

A more innovative approach would be to put point-in-time data set generation directly in the hands of the crowd, wherever possible. For example, my understanding is that your real-time OHLCV minute-bar generation function is implemented in hardware/software and termed the "ingestor." It takes in a live feed of trade data from Nanex NxCore and generates minute bars. So, one thought would be to sort out how individual users could deploy real-time code on the ingestor, and also generate a matching historical database. You could impose a constraint that users still aggregate at a minutely sampling rate (synchronous with the wall clock in some fashion...how do you do this, by the way?). This would seem to be a drop-in for your existing minutely infrastructure.
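
For reference, the kind of aggregation I assume the ingestor performs is roughly this (a pandas sketch over a hypothetical frame of trade ticks; the real thing is presumably far more involved):

    def ticks_to_minute_bars(ticks):
        """ticks: hypothetical DataFrame of trades with a DatetimeIndex
        and 'price'/'size' columns."""
        # Wall-clock-aligned minute OHLCV bars via resample.
        bars = ticks['price'].resample('1min').ohlc()
        bars['volume'] = ticks['size'].resample('1min').sum()
        return bars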

It would also be interesting to know, in light of your decision to discontinue broker integrations for individual users (see https://www.quantopian.com/posts/phasing-out-brokerage-integrations), how this will affect the licensing arrangements with your data vendors, if at all. You are eliminating the possibility of fully leveraging the data sets with "one-click" deployment to a broker, but users could still trade manually (and run backtests on historical data). Having thousands of data sets available sounds great, but will they be licensed such that your average Joe user will be able to access them (I'd have to imagine that some/all designated "managers" would be able to)? What sort of a system do you envision?

The other consideration is that any data set that is truly unique, a mother lode of alpha, can't be put into the public domain, since one would expect an immediate alpha decay (and you'd have to trust that any "managers" with knowledge of it wouldn't reveal its existence). There are also legal/regulatory considerations, since anything unique and super-profitable could be interpreted as falling into the "gray edge" category (using the white/gray/black edge categorization in the book Black Edge). Anything that makes money hand over fist will draw scrutiny. So putting it out even to "managers" under contract could create a legal headache for Quantopian (although I expect you'll be a "small fish" for a while, and would fly under the regulatory radar).

Another thought would be to provide an API for individual Q users to munge data feeds that could then be made available to other Q users for use in their algos. You already have the general infrastructure and marketplace set up; it would seem to be a logical next step to sort out how the crowd could participate. Maybe the model would be that the crowd-supplied feeds would be free until used in an allocated hedge fund algo; then Q would simply roll the cost of licensing the feed into the profit shared with the Q fund algo author. Of course, you'd need to get this started pronto, since it would be nice to have several years of out-of-sample performance for said Q-user data feeds before having much confidence in them (the same would be true of any derived "signal" feeds, where over-fitting/bias is going to be a real problem, as is possible for the Alpha Vertex feed).
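
As a stopgap, I assume the closest thing today is fetch_csv, so a crowd-supplied feed might be consumed roughly like this (the URL and the 'signal' column are hypothetical):

    # fetch_csv and data.current are part of the Quantopian IDE namespace.
    def initialize(context):
        fetch_csv('https://example.com/my_signal_feed.csv',
                  date_column='date',
                  symbol_column='symbol',
                  date_format='%Y-%m-%d')

    def handle_data(context, data):
        # Latest fetched 'signal' value for each asset covered by the feed.
        signals = {stock: data.current(stock, 'signal')
                   for stock in data.fetcher_assets}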