Machine Learning from Streaming Data

I found this blog post from the guys at BigML: http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/ which I think is very relevant for the Quantopian community.

Among other things, it makes the observation that there are two paths when trying to apply machine learning methods to streaming data (c&p from the post):

  1. Incremental Algorithms: These are machine learning algorithms that learn incrementally over the data. That is, the classifier is updated each time it sees a new training instance. There are incremental versions of Support Vector Machines and Neural networks. Bayesian Networks can be made to learn incrementally.
  2. Periodic Re-training with a batch algorithm: Perhaps the more straightforward solution. Here, we simply buffer the relevant data and retrain our predictor “every so often”.
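For concreteness, the first path can be sketched in a few lines: an online logistic regression that takes a single stochastic-gradient step per incoming instance. This is a toy stand-in for the incremental SVM/NN variants mentioned above, not any particular library's implementation:

```python
import math
import random

class OnlineLogit:
    """Minimal incremental classifier: one SGD step per new training instance."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Update the model in place from a single (x, y) pair, y in {0, 1}."""
        err = self.predict_proba(x) - y  # gradient of the log-loss
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

random.seed(0)
model = OnlineLogit(n_features=2)
for _ in range(2000):                       # simulated data stream
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = 1 if x[0] + x[1] > 0 else 0
    model.update(x, y)                      # learn from each instance as it arrives
```

The key property is that no history is buffered: each observation is folded into the weights and then discarded.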

On Quantopian, most shared algorithms take the second approach, and that is really what batch_transform was built for: you can specify how often you want to retrain your model (refresh_period).
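The buffer-and-refit pattern itself is easy to sketch outside of batch_transform. The window_length/refresh_period names below mirror the batch_transform parameters, but the class is purely illustrative:

```python
from collections import deque

class PeriodicRetrainer:
    """Buffer recent observations; refit a batch model every refresh_period events."""
    def __init__(self, fit_fn, window_length=100, refresh_period=20):
        self.buffer = deque(maxlen=window_length)  # rolling training window
        self.fit_fn = fit_fn                       # any batch training routine
        self.refresh_period = refresh_period
        self.n_seen = 0
        self.model = None

    def handle(self, observation):
        self.buffer.append(observation)
        self.n_seen += 1
        if self.n_seen % self.refresh_period == 0:
            self.model = self.fit_fn(list(self.buffer))  # full batch refit
        return self.model

# toy "model": the mean of the buffered window
retrainer = PeriodicRetrainer(fit_fn=lambda xs: sum(xs) / len(xs),
                              window_length=5, refresh_period=5)
for price in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    est = retrainer.handle(price)
```

Between refits the model is simply frozen, which is the main trade-off versus the incremental approach.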

I think the first approach has a lot of potential as well. I used a mixture of the two approaches in the HMM algorithm, which takes the previously learned model parameters as a prior for the update. Generally, Bayesian methods, where you can take the posterior of your trained model as the prior when you retrain ("yesterday's posteriors are today's priors"), seem very amenable to this.
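As a minimal illustration of the posterior-as-prior idea (a toy stand-in for the HMM case), here is a conjugate Beta-Bernoulli update: the posterior after each batch of up/down days becomes the prior for the next batch:

```python
def update_beta(alpha, beta, observations):
    """Conjugate update: each 1 (up day) increments alpha, each 0 increments beta."""
    ups = sum(observations)
    return alpha + ups, beta + len(observations) - ups

alpha, beta = 1.0, 1.0                 # flat Beta(1, 1) prior
batches = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
for batch in batches:
    # yesterday's posterior becomes today's prior
    alpha, beta = update_beta(alpha, beta, batch)

p_up = alpha / (alpha + beta)          # posterior mean of P(up day)
```

Running the three batches through one at a time gives exactly the same posterior as fitting all twelve observations in one batch, which is what makes conjugate models so convenient for streaming data.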


10 responses

Very interesting... the problem I see with incremental algorithms (i.e. learning on all available data and continually adding to it) is whether the training data from x years ago is still relevant. For Bayesian networks that attempt to model investor behaviour, for instance, are the patterns observed in 2002 still relevant in 2012? Do investors still behave the same way? Or has some market "mode" changed? I'm not sure; that's one reason I've been focusing on batch_transform. Of course, the length of the batch_transform window can be set arbitrarily, and it is difficult to determine what's "reasonable".
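One middle ground between keeping all history and a hard window cutoff is to down-weight old observations exponentially, so 2002 data still contributes but much less than 2012 data. A sketch (the half-life value is just an illustrative choice):

```python
import math

def decay_weights(ages_in_days, half_life_days=252):
    """Exponentially down-weight samples by age; the half-life controls how
    fast relevance fades (252 trading days is roughly one year)."""
    lam = math.log(2) / half_life_days
    return [math.exp(-lam * age) for age in ages_in_days]

# today, one year ago, ten years ago
w = decay_weights([0, 252, 2520])
```

These weights could then be passed as sample weights to any trainer that supports them, turning the "is old data relevant?" question into a single tunable half-life parameter.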

Hello Thomas and Alex,

Here's a reference to an algorithm that could be implemented in Quantopian:

http://cdn.intechopen.com/pdfs/29827/InTech-An_innovative_systematic_approach_to_financial_portfolio_management_via_pid_control.pdf

In skimming through the paper, I can't sort out how they re-balance the portfolio based on the PID controller...please let me know if you figure it out.

Additional references:

http://web.mit.edu/eecsgsa/www/events/2008/ppst/session1.pdf
http://web.mit.edu/eecsgsa/www/events/2008/ppst/session2.pdf

Grant

Hi Thomas,

How do you optimize the "learning rate" for the online approach? I would imagine that you would choose different learning rates for different parts of the day, or perhaps as a function of trading volume/volatility?

Moreover, are you always using the same training data? Or are both your training data and learning rate regime-specific?

Hello Thomas,

This topic seems pretty muddled to me. To take a familiar example, the OLMAR algorithm (with daily re-balancing) has an epsilon parameter. It could be globally re-optimized every night (prior to market open the next day). Thus, epsilon would vary with time, with a daily update period coinciding with the portfolio re-balancing period. Alternatively, one could re-optimize epsilon less frequently (e.g. weekly/monthly/quarterly). Would the daily optimization ("re-training") be termed an incremental algorithm, with less frequent optimization classified as periodic? In the end, the result is the same: epsilon is regularly tweaked to optimize performance of the algorithm. So the distinction between incremental and periodic seems arbitrary.
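The nightly re-optimization described here can be sketched as a simple grid search; backtest_score is a hypothetical placeholder for whatever objective would be evaluated on recent data, and the epsilon grid is purely illustrative:

```python
def reoptimize_epsilon(backtest_score, grid=(1.1, 1.5, 2.0, 3.0, 5.0, 10.0)):
    """Return the epsilon in the grid that maximizes the scoring function.
    Intended to run overnight, before the next day's re-balancing."""
    return max(grid, key=backtest_score)

# toy objective: pretend recent data favors epsilon near 3
best = reoptimize_epsilon(lambda eps: -(eps - 3.0) ** 2)
```

Whether this runs nightly or quarterly only changes how often it is called, which is the point being made above: the machinery is the same either way.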

If model parameter adjustment is required (i.e. optimization/training), why not do it in an "online" fashion, every buy/sell/hold cycle? What's the downside (other than the fact that coding a robust online optimization routine may not be worth the effort)?

In the context of Quantopian, it would be helpful to understand your vision. From my perspective, members should be able to take full advantage of time when the market is closed (e.g. overnight/weekends/holidays) to run optimization routines. Additionally, for minute-by-minute trading, the tools should be highly efficient at crunching numbers so that model parameters can be adjusted every market tick (at least on a local scale).

Perhaps a better way to frame the problem would be in terms of global versus local model parameter optimization? Generally, I agree with your thinking that local optimization should be done as you describe here: http://blog.quantopian.com/parameter-optimization/. However, there's also the global optimization problem, which sometimes is better done "offline."

Grant

Hi Thomas,

Another consideration is the definition of "streaming" data. It implies that data will be presented to the algorithm at random points in time. Presently, although described as "event-driven," Quantopian samples the market every minute (or day), so there is an underlying regular clock (market tick). Under live trading, is this how Quantopian will work? Or will the time between events vary statistically (down to some minimum)? This seems like an important consideration in devising an optimization scheme. If the average interval is a minute but the minimum is 10 ms, it could be tricky to do a sophisticated walk-forward optimization.

Grant

Here is an online support vector regression that does not require retraining from the beginning: http://onlinesvr.altervista.org/

Thanks Yacoov. While that looks interesting, it's pretty outdated and in C++, so probably not the right tool. It is surprising that there is so little code on online learning.
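For what it's worth, the incremental-update idea itself doesn't need C++. Here is recursive least squares (RLS) in plain Python, which folds each new sample into an exact linear fit without ever refitting from the beginning. This illustrates online updating in general, not the OnlineSVR algorithm itself:

```python
class RLS:
    """Recursive least squares: exact online linear regression."""
    def __init__(self, n, delta=1000.0):
        self.w = [0.0] * n
        # P approximates the inverse covariance; a large delta means a weak prior
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]

    def update(self, x, y):
        n = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(xi * pi for xi, pi in zip(x, Px))
        k = [pi / denom for pi in Px]                      # gain vector
        err = y - sum(wi * xi for wi, xi in zip(self.w, x))
        self.w = [wi + ki * err for wi, ki in zip(self.w, k)]
        # rank-one downdate of P (P stays symmetric)
        self.P = [[self.P[i][j] - k[i] * Px[j] for j in range(n)] for i in range(n)]

model = RLS(2)
for x0, y in [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]:
    model.update([x0, 1.0], y)        # feature plus intercept term; data is y = 2x + 1
```

Each update is O(n^2) in the number of features, independent of how many samples have been seen, which is exactly the property the OnlineSVR library is advertising.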

Hi,
I have read the SVM material. Learning from scratch can only be done with a small support set; otherwise it takes considerable time. If you look at the tests, you see that performance is an issue: it takes 40 seconds for 1000-2000 samples. The advantage is that you can add and remove bad data from the SVM, so a bad trade can be added to the error set and a good trade to the support set.

I suggest using training data on a daily basis; minute data may be useful if you first apply some filtering or oversampling to reduce the number of samples. Libsvm is mentioned for offline learning. A nice feature is stabilisation, but how do you control how much over- or underfitting there is? Another option is to train offline and then, during trading, add and remove new samples.

Fifteen years ago, I used principal component analysis to extract the interesting structure from a data set and reduce the number of data points; that might be useful here. However, I believe you can also train offline, load the results into the algo, and then retrain from there, since otherwise you would need to run your algo for a couple of months before it is fully trained.
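The "train offline, then load the results into the algo" step can be as simple as pickling the fitted state and restoring it inside the live algorithm. This is a sketch; the state dict here is a stand-in for real fitted parameters:

```python
import pickle

# offline: fit on historical data, then serialize the result
state = {"w": [0.2, -0.1], "b": 0.05}   # stand-in for fitted model parameters
blob = pickle.dumps(state)              # in practice, written to a file

# inside the live algorithm: restore the model and keep updating it
model = pickle.loads(blob)
model["b"] += 0.01                      # e.g. one online correction step
```

This way the months of warm-up happen offline, and the live algorithm only has to apply incremental corrections.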

I will search the web, or maybe try Libsvm first, to see if it works. What do we want to apply this algo to? The 2rsi algo? Or indicating whether volatility will increase or decrease profits?

J.

Hi Thomas,
You are right, there is little code on online learning. The Extreme Learning Machines algorithm has an online sequential version: http://www.ntu.edu.sg/home/egbhuang/elm_codes.html It is claimed that Extreme Learning Machines train very fast compared to most NN and SVM algorithms: http://www.ntu.edu.sg/home/egbhuang/index.html

Hi Quant Trader,

Thanks for the explanation. My idea is to project the X-bar future high, median, and low levels of the possible range (upwards, downwards, flat) with an online SVM or self-training NN. If this range is no longer valid, new bands will be predicted. Very similar to Mogalef bands: http://mogalef.com/index.php?p=1_10_The-Mogalef-Bands