kitchen sink data set ML algo?

There are a lot of free data sets online now; has anyone tried just shoving them all into a deep learning network and seeing what it comes up with? Is that even possible with Pipeline in Research?

It would be interesting to see what it came up with on all the free data sets, how much it would cost to run live, and how much capital it would take to be profitable after data feed costs. I'm guessing you could make something Sharpe 1.5-ish, but the feeds would cost $2k/mo (unless the $350 premium bundle includes ALL of them, including LogitBot etc. and all the new ones?)...
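To put rough numbers on the "profitable after data feed costs" question: here is a back-of-envelope sketch. The $2k/mo feed cost is the figure from above; the annual return assumptions are purely illustrative.

```python
# Back-of-envelope: capital needed for strategy returns to cover data feeds.
# The $2k/mo feed cost comes from the post; the return figures are assumptions.
annual_data_cost = 2_000 * 12  # $24k/yr in feeds

# Break-even capital = the point where annual returns just cover the feeds.
for annual_return in (0.05, 0.10, 0.20):
    breakeven = annual_data_cost / annual_return
    print(f"{annual_return:.0%} return -> ${breakeven:,.0f} to break even")
# 5% return -> $480,000 to break even
# 10% return -> $240,000 to break even
# 20% return -> $120,000 to break even
```

So even before thinking about risk, a strategy at plausible return levels needs low-to-mid six figures of capital just to pay for the data.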

9 responses

Hi Simon,

One thought for the Q team is to consider how they might decouple the data from the limitations of their platform for the masses. My sense is that it'll be challenging to take a stab at questions like yours until they sort this out. I'd asked several times about the possibility of Q employees (not users) accessing data in an open fashion (e.g. piping it to a PC at their desk), but there was no response. On https://www.quantopian.com/data, I see:

With all third party data provided on this site, Quantopian restricts its use to this website. These restrictions are part of the deal we've struck with the data providers. We suggest trying out the research environment as an alternative.

Perhaps this extends to employees, even if the data were kept within the Q network?

I guess if I were a Q employee interested in trying to answer forward-looking research questions, I would quickly run out of patience with the limitations of the platform (don't get me wrong... the fact that it is out there for 90,000+ users for free at a basic level is a real feat!). To me, the question isn't "Is that even possible with Pipeline in Research?" but rather: are the data available to run on an open platform? Or does the licensing preclude it? If the data are accessible, then what would it cost hardware-wise to start playing around? My guess is ~$5K? ~$10K? A bit pricey, but then Point72 has $250M burning a hole in their pocket (at least that is the marketing spin).

The point would not be for Q to start competing with its users in developing strategies, but rather to show what might be possible, and to provide a path for eventual deployment to users. As things stand now, it seems like they're kinda hamstrung by the platform to look out more than a few months.

First I have to say that IMHO the current version is unsuitable for large-scale machine learning, especially deep learning, which requires a lot of CPU/GPU processing power and memory.

IMHO the best way would be to allow people to rent [virtual/virtualized] servers in some isolated part of the server room, where these machines would have access to the data in bulk. I'm quite sure this would still be permissible by contract, since the data wouldn't leave the Q premises. I'm aware it might be hard to restrict people from downloading the data if full access is available, but with proper contracts (with sanctions) that developers would have to sign, it should be possible to restrict downloading the data via legal terms if not by technical means.

I would love to play with different kinds of machine learning methods, and would be willing to pay for a server with access to decent processing power and memory, the ability to edit code via the usual tools (emacs etc.), run normal version control, report to files, run longer jobs (at least days), and so on.

One issue is that their business model is weird. If you read the statement on https://www.quantopian.com/about, they want to take money from folks like Point72, pair it with users' algos, and take a cut. So why would anyone pay for the privilege of using Q's resources to help them make money? But then, they are getting folks to pay for data, conferences, training, etc.

Simon, not deep learning, but the approach outlined here: https://www.quantopian.com/posts/machine-learning-on-quantopian can easily be extended to include more data sources as factors. These would then automatically be picked up by the ML classifier.
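The "more data sources as factors" idea can be sketched in plain scikit-learn (this is not the actual Quantopian notebook code, and the factor names and data here are synthetic placeholders): each data source contributes one feature column, and the ensemble classifier picks up whatever turns out to be predictive.

```python
# Sketch only: each data source is one factor column in the feature matrix;
# adding a data set is just adding a column. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 5000

# Hypothetical factor columns, one per data source.
momentum = rng.normal(size=n)
sentiment = rng.normal(size=n)
earnings = rng.normal(size=n)
X = np.column_stack([momentum, sentiment, earnings])

# Synthetic label: sign of next-period return, weakly driven by momentum.
y = (0.1 * momentum + rng.normal(size=n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("in-sample accuracy:", clf.score(X, y))
```

In the real workflow the columns would come from Pipeline factors over the premium data sets, and accuracy would have to be measured out of sample, but the mechanics of "add a data source, retrain, see if the classifier uses it" are the same.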

In general, I think we should not get tripped up by deep learning. There is no good reason to assume it will out-compete ensemble-based methods on this type of data. Deep learning starts to become powerful when:
1. There is structure to be exploited: think convolutional nets for image recognition. Currently we don't have an analog of that for alpha signals.
2. You have tons of data: the amount of data in the ML notebook is not nearly there, even if we were to add many more factors.

I'm not saying it won't work, but it's not a magic bullet where, instead of 53% accuracy with ensemble methods, deep learning gets you 60%. At least from what I've seen on the forums, we are not even at 53% yet (figuratively speaking), so let's get there first and then push for fancier techniques that might squeeze it to 54%.

@Mikko: At which exact point in your workflow do you run into platform limitations?

@Grant:

I guess if I were a Q employee interested in trying to answer forward-looking research questions, I would quickly run out of patience with the limitations of the platform.

I'm currently pushing the ML workflow forward, and so far every limitation has been easy to lift.


I didn't want to get stuck on deep learning; I was more curious what falls out of simple generic ML on all the premium datasets, whether it would cover data costs, and at what capitalisation.

@Simon: OK, cool. Yeah, that would be a great research project. Just add all premium datasets to the ML workflow notebook and see how well it does. If I can get to it I'll give that a shot.

@Mikko: At which exact point in your workflow do you run into platform limitations?

.. Where do I start ..

• There is no version control (which I think is essential to any development); as it is, I have to dig old code out of backtests and guess which backtest is which
• I can't use the libraries I want; libraries are restricted to the versions Q has chosen
• I don't like the editor; I would like to use emacs, which I'm used to
• I can't use any kind of directory structure to organize my development environment / different algos
• I can't reuse code: there is no way to create libraries shared across several programs. I would very much like to test genetic algos with your data, but I don't want to cut and paste code from one algo to another. I would also like my own backtester, optimized for performance, so I could run thousands of backtests much faster than the current backtester can
• Performance is very limited compared to a personal server (I would very much like to have my own personal workhorse)
• I can't run large jobs (that take days) with different parameters/algos and log the results in some structured format that I can analyze later
• There is no way to structure the results gathered from running, say, 1000 backtests
• Running large-scale calculations (I mean days) is unreliable and slow in both the "research" and "backtesting" environments

These are just some of the issues that come to mind. I would really like to experiment with your data using some more complicated methods.
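The "thousands of fast backtests with structured output" workflow described above can be sketched in a few lines of NumPy. This is a toy, not a substitute for a real backtester: the prices are synthetic, the moving-average signal is illustrative, and the CSV stands in for whatever structured log format one would actually use.

```python
# Minimal vectorized backtester: sweep a parameter, log results to CSV.
# Synthetic prices and a toy moving-average signal; illustration only.
import csv
import numpy as np

rng = np.random.default_rng(42)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, 2000)))
returns = np.diff(prices) / prices[:-1]

def backtest(window):
    """Annualized Sharpe of 'long when price is above its trailing MA'."""
    ma = np.convolve(prices, np.ones(window) / window, mode="valid")
    # Signal at time t uses the MA ending at t, applied to the t->t+1 return,
    # so there is no lookahead.
    signal = (prices[window - 1:-1] > ma[:-1]).astype(float)
    strat = signal * returns[window - 1:]
    return np.sqrt(252) * strat.mean() / strat.std()

with open("results.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["window", "sharpe"])
    for window in range(10, 200, 10):  # thousands of runs scale the same way
        w.writerow([window, round(backtest(window), 3)])
```

Because each run is a handful of vectorized array operations, a sweep like this finishes in milliseconds per backtest on commodity hardware, which is exactly the kind of iteration speed the platform's event-driven backtester can't offer.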

http://www.tylervigen.com/

Hmmm... kitchen sink sounds ominously Tyler Vigen!

@ Thomas -

The point is that you may not even be asking the more researchy questions, as Simon posed, due to your computational restrictions. It is a psychological/cultural chicken-and-egg thing. It is tough these days to buy even a cheapo desktop PC with less than 8 GB of RAM, yet the research platform only supports 4 GB. GPUs are a commodity. Etc.

There is a lot of merit in starting simple and not adding cost and complexity. But the flip side is that over the years I have seen first-hand the magical interaction that occurs when state-of-the-art technology intersects with exploration and invention. Good tools have a way of paying for themselves very quickly, but it is not always clear to the bean-counters how this works at the fuzzy front end (production/operations is a different story, since there you just have to show that a penny will be saved on the bottom line, and that everything amortizes properly).