Solution for loading ML models?

Hi folks,

I found Quantopian while trying to find a way to create automatic trades based on a GBM (actually a series of GBMs) I have built. Are there any solutions yet to putting machine learning models into Quantopian? I searched and found a few threads with some suggestions, but they were either labor intensive or impractical, and the last item I found was from about six months ago.

Ideally, I'd pickle the model locally, then upload it and open it during the initialization of the algorithm. Quantopian has the sklearn library, which is what I'd need.

I have too much data to upload and train the model within Quantopian (about 1GB of data against a 25MB upload limit). Plus, training there would take far too long.

Converting the model to Base64 and including it in-line in the code was another suggestion... has anyone done this successfully? That still requires the model to be pickled/unpickled, which we don't have access to?
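To be concrete about what I mean, here is a minimal sketch of the round trip as it would run locally. It uses only the standard library; the parameter dict is a made-up stand-in for a fitted sklearn model, and it assumes `pickle` and `base64` can be imported, which may not hold inside Quantopian.

```python
import base64
import pickle

# Made-up parameters standing in for a fitted model (assumption: a real
# sklearn estimator would be pickled the same way on the local machine).
params = {"weights": [0.5, -0.25], "bias": 0.1}


def model_to_text(model):
    """Local step: pickle the object and encode it as ASCII text."""
    return base64.b64encode(pickle.dumps(model)).decode("ascii")


def text_to_model(text):
    """Algorithm step: decode the pasted text and unpickle it."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


def predict(model, row):
    """Linear score computed from the restored parameters."""
    return sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]


# MODEL_TEXT is what would be pasted in-line in the algorithm source.
MODEL_TEXT = model_to_text(params)
restored = text_to_model(MODEL_TEXT)
prediction = predict(restored, [2.0, 4.0])
```

The encoded string is plain ASCII, so it can sit in the source as an ordinary string literal; the open question is whether the decoding half can run on Quantopian's side.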

11 responses

Hi John -

The base64 module is not supported:

import base64

Importing base64 raised an ImportError. Did you mean to import bisect instead?  

You could request Quantopian to review it for white-listing.

As far as I know, here's what you have to work with:

  1. Retrieve data using fetcher.
  2. Copy & paste as in-line code.
  3. Run your ML code within the algo and store results in the context object.
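For option 2, one way to sidestep pickle might be to paste the fitted parameters themselves as literals. A minimal sketch, assuming a single regression tree exported as parallel arrays. The layout mirrors sklearn's `tree_` attributes (`children_left`, `children_right`, `feature`, `threshold`, `value`), but the numbers and the `predict_tree` helper are made up for illustration. A GBM would need one such set of arrays per tree, plus the learning rate and initial prediction.

```python
# Hypothetical arrays for one small regression tree, pasted in as literals.
# A child index of -1 marks a leaf, following sklearn's convention.
CHILDREN_LEFT = [1, -1, 3, -1, -1]
CHILDREN_RIGHT = [2, -1, 4, -1, -1]
FEATURE = [0, -2, 1, -2, -2]           # feature index tested at each node
THRESHOLD = [0.5, 0.0, 30.0, 0.0, 0.0]  # split threshold at each node
VALUE = [0.0, -0.02, 0.0, 0.01, 0.03]   # prediction stored at each node


def predict_tree(row):
    """Walk the tree from the root until a leaf is reached."""
    node = 0
    while CHILDREN_LEFT[node] != -1:
        # Go left when the feature value is <= the threshold, else right.
        if row[FEATURE[node]] <= THRESHOLD[node]:
            node = CHILDREN_LEFT[node]
        else:
            node = CHILDREN_RIGHT[node]
    return VALUE[node]


# Example rows: [feature_0, feature_1]
low_signal = predict_tree([0.2, 50.0])
mid_signal = predict_tree([0.9, 10.0])
high_signal = predict_tree([0.9, 45.0])
```

Nothing here needs an import, so it should not run into the white-list, though a large ensemble would make for a very long source file.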

Note that Quantopian was hot on the path of supporting more intensive ML (see ), but the effort seems to have lost steam. From various comments, though, I've gotten the sense that it is definitely on their radar screen.

Thanks Grant - I actually hadn't got as far as trying base64, because before you use that, you have to pickle it, which is not supported either. :)

I had seen some of the ML posts and thought they'd gotten further along. Any classifier or predictor worth its salt is going to take more than 25MB of data to train and exceed the time limit. Increasing the memory available seems like an expensive solution compared to incorporating "half" of the pickle module, which would allow you to import a model you trained locally without allowing the "export" function.

I had thought about uploading data in 25MB chunks and using all the computing time for the first few weeks to train the model, but that also seems like a long workaround. Maybe it's worth a try, at least for a backtest.
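The chunking itself would be the easy part. A rough sketch, assuming the data is a CSV with a header row; `split_csv_lines` is a hypothetical helper of my own, and I haven't checked how the upload side would treat multiple files.

```python
def split_csv_lines(lines, max_bytes):
    """Group CSV lines into chunks whose encoded size stays within max_bytes.

    The header (first line) is repeated at the top of every chunk. A single
    row larger than the limit still gets a chunk of its own.
    """
    header, rows = lines[0], lines[1:]
    header_size = len(header.encode("utf-8"))
    chunks, current, size = [], [header], header_size
    for row in rows:
        row_size = len(row.encode("utf-8"))
        # Start a new chunk when adding this row would exceed the limit.
        if len(current) > 1 and size + row_size > max_bytes:
            chunks.append(current)
            current, size = [header], header_size
        current.append(row)
        size += row_size
    if len(current) > 1:
        chunks.append(current)
    return chunks


# Tiny demonstration with a 12-byte limit instead of 25MB.
demo_lines = ["a,b\n", "1,2\n", "3,4\n", "5,6\n"]
demo_chunks = split_csv_lines(demo_lines, max_bytes=12)
```

For the real files the limit would be something like `25 * 1024 * 1024`, with each chunk written out to its own file.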

Thanks for your help!

The time limit for before_trading_start is 5 minutes (per trading day). I don't know how much memory is available to the backtester, assuming you would use Quantopian data, but it should be more than 25MB. My understanding is that if you can work within pipeline, there can be some benefits. If you can outline what you need to do, maybe other users or Quantopian can assist.

I also posted a request for an update to regarding memory limitations, so you might want to "listen" to that thread.

Out of interest how many stocks and how many features per stock are you talking about?

Also what period. Q data goes back to around 2002.

@Grant, thanks - the 25MB limit was the file size limit on uploads, rather than the memory limit. I'm not sure what you could do with 25MB of memory these days. Although I'm pretty sure I upgraded my first PC from 4MB to 8MB and felt like it was state of the art. :)

@Anthony - the data I have is about 25-30 features fed into a GBM trained through sklearn (which is supported). The data I was using to train was 2013 to 2015, and test was 2016. I've pared it back to about 10 features without a significant loss in accuracy, though, so that's promising. I trained on about 3k stocks via data downloaded from Quandl/Yahoo Finance. I have a ton of fundamental data I haven't used yet - most of the features are the talib technical indicators.


So at the moment your features are talib indicators and based on prices only? Out of interest where did you obtain the fundamental data?

Working away at my end on ML with great enjoyment but no particular expectations.

Can't be bothered with stocks so sticking with indices, ETFs, futures etc. Just using price at the moment but not at all sure ML can do much with price alone.

There are a few sticking points here, I suspect, for Quantopian:

  1. They allow an unlimited number of backtests to be run by an unlimited number of users. So, if each backtest carries with it gobs of data, things could get expensive from a storage standpoint alone.
  2. They "mine" backtests to find good ones. Starting at 6 months out-of-sample, my understanding is that they re-run certain (surely not all?) backtests to see which ones are holding up. To maintain this approach and scale to 1M users will require lightweight algos; each algo can't require effectively a high-end workstation and 12 hours to re-run it.
  3. It is speculation on my part, but I gather that their requirement/preference for the Q fund is to have users not dump outside data into algos, but rather have algos that are self-contained, using internal data and processing power. Perhaps over-fitting becomes a problem, if users are allowed to upload data out-of-sample?

Something will eventually give, I expect, since they aspire to become the world's first $10B crowd-sourced hedge fund. A little more horsepower will eventually make sense from a business perspective.

@Anthony - the fundamental data I screen-scraped from websites mostly. I have a couple of crawlers that accumulated it and cleaned it up for use, then I store it all in a database I keep locally. I've had a little cross-validation success with ML models based on price alone - that's what I'm working with - but it takes a lot of data to get them to work.

@Grant - you're absolutely right about the sticking points, but the computationally expensive part of machine learning is training the model. Once you have a trained model, making predictions from it is relatively fast - a model that takes a few hours to train can make predictions in a few seconds - easily within the current 5 minute time limit and I think easily within the memory limits too (at least, I think if you were smart you could keep it within the memory limits).

As for the business perspective, I completely understand they don't want to put massive money into computing power for what - even on Quantopian - is a niche. They could solve their problems by keeping the current memory limitations but allowing us to import models trained elsewhere.

@John Cook - I totally agree. Without an option to load a trained model, there'll be no ML on Q :(. Models with optimisation may require hours, if not days, to train.

I recall someone at Quantopian mentioning a new API concept for backtesting using user-uploaded signals (presumably daily as for Pipeline). This would be a path to do ML offline with external data (and I suppose compete in contests and for an allocation). Not really an elegant solution, though.