Data Structures for Financial Machine Learning


This is the second post in our series exploring Lopez de Prado's book "Advances in Financial Machine Learning". If you haven't yet, read the Introduction to "Advances in Financial Machine Learning" by Lopez de Prado.

Also note that these are really just explorations of these methods and how they can be implemented on Quantopian. They do not constitute prescriptions of how to develop trading algorithms on the platform, but might some day lead to that.

1: Introduction

Our goal, at a basic level, is to predict the future. This is a task of immense complexity, and is essentially impossible in the classical sense. A much easier endeavor is predicting what is. An example of this is facial recognition; based on a model of humans and individuals, we can detect and label faces. This model is constructed with a series of parameters that likely have little intuitive meaning on their own, but combined can recreate any human face.

Now imagine that we live in a science-fiction world, millions or billions of years in the future. The human race has evolved in... interesting ways, and we now have distinctly different faces (4 eyes, trunks for noses, take your pick). A facial recognition algorithm from the 21st century would be useless, as the true parameters for humans have changed over time. In financial data, this process is rapid and neverending. This is why traditional out-of-the-box machine learning algorithms struggle on financial data; they learn what is, not what will be.

There are two key solutions to this problem: training on nearly stationary data, and rapid iteration of the validation, implementation, and decommissioning process. Facial recognition algorithms pick up on factors that are relatively consistent for individuals over time, such as bone structure, rather than those that change over time, like facial hair or complexion. In financial data, these variables are less obvious, but Lopez de Prado has presented strong arguments that they can be found. As for the rapid iteration, why would you keep running an algorithm whose alpha has decayed to nearly zero over time? Back to the facial recognition example, it might be time to re-train your model if the accuracy has dropped after a few million years of human development. The situation is similar in finance, though with dozens of temporarily-profitable models running in parallel.

For the rest of this post, please see the attached notebook!


The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

7 responses

Super helpful NB - thank you!!

Would you consider doing a video lecture on this as well?

Thanks for all the work AND the post...very helpful!

So my initial take-away is that I should be using dollar-bars instead of price-bars for any prediction models that work better with stationary time series (e.g. mean-reversion factors). Is that right?


Thank you Joakim! We're working on that in the near future; thanks for the suggestion.

@Alan, That's definitely reasonable, but constructing bars in this manner does not translate that well into live trading. One would have to change the value target to some multiple of an exponentially-weighted moving average of daily/minutely value, and add behavior for the high-activity minutes that surpass that target. In general, though, dollar bars should allow you to draw conclusions with greater confidence.
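To make the idea concrete, here is a minimal sketch of constructing dollar bars from minute bars, plus a dynamic value target based on an exponentially-weighted moving average of daily dollar value, as the reply above suggests. The function names, column schema (`open`/`high`/`low`/`close`/`volume`), and parameter choices are all illustrative assumptions, not the notebook's actual code or any Quantopian API:

```python
import numpy as np
import pandas as pd

def dollar_bars(minute_df, target):
    """Aggregate minute bars into dollar bars: a bar closes once the
    cumulative traded dollar value crosses a multiple of `target`.
    Assumes columns: open, high, low, close, volume (hypothetical schema)."""
    value = minute_df["close"] * minute_df["volume"]   # dollar value per minute
    bar_id = (value.cumsum() // target).astype(int)    # which dollar bar each minute belongs to
    grouped = minute_df.groupby(bar_id)
    return pd.DataFrame({
        "open":   grouped["open"].first(),
        "high":   grouped["high"].max(),
        "low":    grouped["low"].min(),
        "close":  grouped["close"].last(),
        "volume": grouped["volume"].sum(),
    })

def dynamic_target(daily_dollar_value, bars_per_day=50, span=20):
    """One possible value target for live trading: an EWMA of recent
    daily dollar value, divided by the desired number of bars per day."""
    return daily_dollar_value.ewm(span=span).mean().iloc[-1] / bars_per_day
```

Note this still leaves open the high-activity case mentioned above, where a single minute's value exceeds the target; here such a minute simply becomes (most of) a bar on its own.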

Thanks for sharing this :-)
Here is my version of the volume and dollar bar computation. (It would be great to think about a way to build this inside an algorithm, dynamically :-).)
I think we both have exactly the same bars (lol, thank god!).

I am trying to compute the volume (dollar) imbalance bars, but it looks like I don't quite understand the formulas in the book :-(. OK, maybe I should spend a bit more time on it! It would be great if someone could make a notebook that builds them, with some explanation of how the construction works :-).
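For anyone else stuck on those formulas, here is a deliberately simplified sketch of dollar-imbalance bars in the spirit of Chapter 2 of the book: the tick rule assigns each observation a sign from its price change, and a bar closes when the absolute cumulative signed dollar flow exceeds an expected-imbalance threshold. The book estimates the expectations with EWMAs of prior bars; this sketch uses a warm-up window plus simple EWMA updates, and all names and parameters are illustrative assumptions, not the book's exact procedure:

```python
import numpy as np

def dollar_imbalance_bars(prices, dollar_value, expected_ticks=100, alpha=0.05):
    """Simplified dollar-imbalance bars: returns (first_tick, last_tick)
    index pairs. Threshold = E[T] * |E[b*d]|, with both expectations
    updated by EWMAs of realized bar statistics."""
    diff = np.diff(prices, prepend=prices[0])
    b = np.sign(diff)
    # tick rule: carry the previous sign forward when the price is unchanged
    for i in range(1, len(b)):
        if b[i] == 0:
            b[i] = b[i - 1]
    if b[0] == 0:
        b[0] = 1

    signed_flow = b * dollar_value
    exp_T = float(expected_ticks)                          # expected ticks per bar
    exp_imb = abs(signed_flow[:expected_ticks].mean())     # warm-up estimate of |E[b*d]|

    bars, theta, count, start = [], 0.0, 0, 0
    for i, flow in enumerate(signed_flow):
        theta += flow
        count += 1
        if abs(theta) >= exp_T * exp_imb:
            bars.append((start, i))
            # update the expectations from the bar just closed
            exp_T = (1 - alpha) * exp_T + alpha * count
            exp_imb = (1 - alpha) * exp_imb + alpha * abs(theta / count)
            theta, count, start = 0.0, 0, i + 1
    return bars
```

In a persistent trend (all ticks the same sign), this degenerates to roughly fixed-size dollar bars; the interesting behavior is that two-sided flow cancels inside theta, so balanced periods produce fewer, longer bars.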


I had a bit of time this morning so I tried to implement the volume and dollar bars computation in an algorithm.

My first (inelegant) try surfaced an issue with execution time: it seems clear that computing the bars for a "large" universe (bigger than 10-ish equities) is impractical.
I compute the bars in before_trading_start; it might be more efficient to update the bars each minute in handle_data.
I build one large dataframe with a double index (equity, time); maybe there is a more efficient way to store the data.

So it's really a first try! But it might be useful to others as a starting point.
I really think that this kind of construction should be done in the background and made accessible through a data.history-style method (though not called history).
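One way to avoid recomputing the full history in before_trading_start is to keep per-equity incremental state and fold in one minute bar at a time from handle_data. Here is a hypothetical sketch of such an accumulator; the class and its interface are invented for illustration and are not part of the Quantopian API:

```python
class DollarBarBuilder:
    """Incrementally accumulates minute bars into dollar bars.
    Call update(price, volume) once per minute per equity; completed
    bars collect in self.completed."""

    def __init__(self, target):
        self.target = target       # dollar value per bar
        self.value = 0.0           # dollar value accumulated in the open bar
        self.bar = None            # the bar currently being built
        self.completed = []        # finished dollar bars (oldest first)

    def update(self, price, volume):
        if self.bar is None:
            self.bar = {"open": price, "high": price, "low": price,
                        "close": price, "volume": 0}
        self.bar["high"] = max(self.bar["high"], price)
        self.bar["low"] = min(self.bar["low"], price)
        self.bar["close"] = price
        self.bar["volume"] += volume
        self.value += price * volume
        if self.value >= self.target:           # bar is full: close it
            self.completed.append(self.bar)
            self.bar, self.value = None, 0.0
```

Keeping one builder per equity (e.g. in a dict keyed by sid) turns the daily recomputation into O(1) work per minute per equity, which is the main cost driver mentioned above.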

Happy hacking :-).

I did not manage to figure out how to share an algorithm, so here is a copy-paste into a notebook...


Looking through the code, we find the maximum single-period trade value and then construct dollar bars from that bin size, modifying our input distribution. On an intuitive level, this makes the impact of each data point proportional to the dollar value of the trade.

Instead of that, what if we just introduced the dollar value as a weight in our loss function?
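The suggested alternative can be sketched directly: instead of resampling into dollar bars, scale each observation's contribution to the loss by its share of total dollar value. Below is a minimal weighted log-loss in NumPy; the function name and inputs are illustrative assumptions. Most libraries expose the same idea as a `sample_weight` argument (e.g. scikit-learn's `fit(X, y, sample_weight=w)`):

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, dollar_value):
    """Binary log loss where each observation is weighted by its
    share of the total traded dollar value."""
    w = dollar_value / dollar_value.sum()      # weights sum to 1
    eps = 1e-12                                # guard against log(0)
    return -np.sum(w * (y_true * np.log(p_pred + eps)
                        + (1 - y_true) * np.log(1 - p_pred + eps)))
```

Whether this is equivalent to dollar-bar sampling is not obvious: dollar bars also change the *timing* of observations (and hence their statistical properties), while loss weighting only changes their relative importance.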

Thanks for sharing your knowledge. Great work!