Introduction to "Advances in Financial Machine Learning" by Lopez de Prado

Machine learning is a buzzword often thrown about when discussing the future of finance and the world. You may have heard of neural networks solving problems in facial recognition, language processing, and even financial markets, yet without much explanation. It is easy to view this field as a black box, a magic machine that somehow produces solutions but that nobody quite understands. It is true that machine learning techniques (neural networks in particular) pick up on obscure and hard-to-explain features; however, there is more room for research, customization, and analysis than may first appear.

Today we'll be discussing, at a high level, the various factors to consider when researching investing through the lens of machine learning. The contents of this notebook and further discussions on this topic are heavily inspired by Marcos Lopez de Prado's book Advances in Financial Machine Learning. If you would like to explore his research further, his website, http://www.quantresearch.info/, is a good place to start.

1. Data Structures

Garbage in -> garbage out. This is the mantra of computer science, and it applies doubly to modeling. A model is only as good as the data it accepts, so it is vital that researchers understand the nature of their data. Data is the foundation of an algorithm, and the algorithm will succeed or fail on its merits.

In general, unstructured and unique data are more useful than pre-packaged data from a vendor, as they haven't been picked clean of alpha by other asset managers. This is not offered on the Quantopian platform, but you can upload data for your own use with the Self-Serve feature. If you do not have unique data, the breadth of offerings on the Quantopian platform gives us plenty to work with. Data can vary in what it describes (fundamentals, price) and in frequency (monthly, minutely, tick, etc). Listed below are the chief types of data, in order of increasing diversity:

  1. Fundamental Data - Company financials, typically published quarterly.
  2. Market Data - All trading activity in a trading exchange or venue.
  3. Analytics - Derivative data: analysis of various factors (including all other data types), typically purchased from a vendor.
  4. Alternative Data - Primary information, not made from other sources. Satellite imagery, oil tanker movements, weather, etc.

The data structures used to contain trading information are often referred to as bars. These can vary greatly in how they are constructed, though there are shared characteristics. Common variables are open, high, low, and close prices, the date/time of the trade, and an indexing variable. A common index is time; daily bars each represent one trading day, minute bars represent one minute of trading, etc. Time is held constant. Trading volume is another option, where each bar is indexed by a consistent number of shares traded (say, 200K-share volume bars). A third option is value traded, where the index is dollars (shares traded * price per share).

  • Time Bars:
    Bars indexed by time intervals, minutely, daily, etc. OHLCV (Open, High, Low, Close, Volume) is standard.

  • Tick Bars:
    Bars indexed by orders, with each set # of orders (usually just 1) creating a distinct bar. Order price, size, and the exchange the order was executed on are common fields. Unavailable on the Q platform.

  • Volume Bars:
    Bars indexed by total volume, with each set # of shares traded creating a distinct bar. We can transform minute bars into an approximation of volume bars, but ideally we would build them from tick data to preserve accurate information for all fields across bars.

  • Dollar Bars:
    Similar to volume bars, except measuring the total value (in $) traded hands. An example would be $100,000 bars, with each bar containing as precisely as possible that dollar value.

These alternative bar structures exhibit desirable statistical properties to different degrees, with volume and dollar bars typically expressing greater stationarity of returns than time and tick bars. These properties play a big role when considering which bars to use in a machine learning framework, as we will discuss next.
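To make the idea concrete, here is a minimal sketch of aggregating OHLCV minute bars into approximate dollar bars; the column layout and the $10M target are assumptions, and with real tick data you would accumulate per trade rather than per minute:

```python
import pandas as pd

def dollar_bars(minute_bars, bar_size=1e7):
    """Aggregate OHLCV minute bars into approximate dollar bars.

    minute_bars: DataFrame indexed by timestamp with columns
                 ['open', 'high', 'low', 'close', 'volume'] (assumed layout).
    bar_size:    target dollar value traded per bar, e.g. $10M.
    """
    bars, accumulated, start = [], 0.0, 0
    for i in range(len(minute_bars)):
        # Approximate dollars traded this minute as close price * volume.
        accumulated += minute_bars['close'].iloc[i] * minute_bars['volume'].iloc[i]
        if accumulated >= bar_size:
            chunk = minute_bars.iloc[start:i + 1]
            bars.append({
                'open': chunk['open'].iloc[0],
                'high': chunk['high'].max(),
                'low': chunk['low'].min(),
                'close': chunk['close'].iloc[-1],
                'volume': chunk['volume'].sum(),
                'end_time': chunk.index[-1],
            })
            accumulated, start = 0.0, i + 1
    return pd.DataFrame(bars)
```

Swapping the dollar value for share volume gives volume bars instead.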

2. Statistical Properties and Stationarity Transformations

Much of the literature on machine learning, and statistical inference in general, makes the assumption that observations are independent and identically distributed (iid). Independent means that the occurrence of one observation has no effect on any other, and identically distributed means that all observations are drawn from the same probability distribution (e.g. have the same variance, mean, skew, etc).

Unfortunately, these properties are rarely found in financial time series data. Consider pricing; today's price is highly dependent on yesterday's, the mean price over some time interval is constantly changing, and the volatility of prices can change rapidly when important information is released. Returns, on the other hand, remove most of these relationships. However, the variance (i.e. volatility) of returns still changes over time as the market goes through different volatility regimes, so returns are not identically distributed either.

The different bar types (and additional data structures) exhibit varying statistical properties. This is important to consider when applying machine learning or other statistical inference techniques, as they assume that inputs are iid samples (or stationary, in the time series case). Using dollar bars in lieu of time bars can make the difference between a weak, overfit algorithm and a consistently profitable one. This is just one step in the search for stationarity, however, and we must have other tools in our arsenal.

The note above about independence of price series vs return series illuminates one concept: the tradeoff between memory and stationarity. The latter is a necessary attribute for inference, but provides no value without the former. In the extreme, consider transforming any series into strictly 1's; you've successfully attained stationarity, but at the cost of all information contained in the original series. A useful intuition is to consider degrees of differentiation from an original series, where greater degrees increase stationarity and reduce memory. Returns are differentiated once, the example of all 1's is fully differentiated, and the price series has zero differentiation.

Lopez de Prado proposes an alternative method, called fractional differentiation, that aims to find the optimal balance between these opposing factors: the minimum (generally non-integer) degree of differentiation necessary to achieve stationarity. This retains the maximum amount of information in our data. For a thorough description, read chapter 5 of de Prado's Advances in Financial Machine Learning. With this implemented, and our data sufficiently prepared, we are almost ready to whip out the machine learning algorithms. First, though, we have to label our data.
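A minimal sketch of the fixed-width-window version of this idea is below; the weight-truncation threshold, the use of log prices, and the crude grid search for d are all assumptions, and the book's implementation is more careful:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def frac_diff_weights(d, threshold=1e-4):
    """Binomial-expansion weights for fractional differencing of order d,
    truncated once they shrink below `threshold` (fixed-width window)."""
    w, k = [1.0], 1
    while abs(w[-1]) > threshold:
        w.append(-w[-1] * (d - k + 1) / k)
        k += 1
    return np.array(w[::-1])  # oldest observation's weight first

def frac_diff(series, d, threshold=1e-4):
    """Fractionally differentiate a (log-)price series."""
    w = frac_diff_weights(d, threshold)
    width = len(w)
    values = series.values
    out = [np.dot(w, values[i - width + 1:i + 1]) for i in range(width - 1, len(values))]
    return pd.Series(out, index=series.index[width - 1:])

# Toy usage: find the smallest d whose output passes an ADF stationarity test.
prices = pd.Series(np.exp(np.cumsum(np.random.default_rng(0).normal(0, 0.01, 2000))))
for d in np.arange(0.0, 1.01, 0.1):
    p_value = adfuller(frac_diff(np.log(prices), d))[1]
    print(round(d, 1), round(p_value, 4))
```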

3. Labeling for Learning

3.1 - Triple-Barrier Method

Most machine learning classifiers require labeled data (those that don't are powerful but difficult to engineer, and come with a high risk of overfitting). We intend to predict the future performance of a security, so it seems fair to label each observation based on its ensuing price performance. It is tempting to just use whether the returns were positive or negative over a fixed time window. This method, however, leads to many labels referring to non-meaningful, small price changes. Moreover, real trading is often done with limit orders to take profits or stop losses.

Marcos Lopez de Prado proposes a labeling strategy that he calls the Triple-Barrier Method, which combines our labeling desires with real market behavior. When a trade is made, investors may choose to pre-emptively set orders to execute at certain prices. If the current price of security \(s\) is $5.00, and we want to control our risk, we might set a stop-loss order at $4.50. If we want to take profits before they vanish, we may set a profit-taking order at $5.50. These orders automatically close the position when the price reaches either limit. The stop-loss and profit-taking orders represent the two horizontal barriers of the Triple-Barrier Method, while the third, vertical, barrier is simply time-based: if a trade is stalling, you may want to close it out within \(t\) days, regardless of performance.

The labeling method outputs a value of either -1 or 1 for each security and purchase date, depending on which barrier is hit first. If the top barrier is reached first, the label is 1 because a profit was made. If instead the bottom barrier is hit, losses were locked in and the label is -1. If the trade times out before either limit is broken and the vertical barrier is hit, the label is set in the range (-1, 1), scaled by how close the final price was to a barrier (alternatively, if you want to label only sufficiently large price changes, 0 can be output here).
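A heavily simplified, per-trade sketch of the idea follows; the fixed fractional barriers and the scaling at the vertical barrier are assumptions (the book sets the horizontal barriers as multiples of a rolling volatility estimate and also accounts for the side of the bet):

```python
import pandas as pd

def triple_barrier_label(close, entry, pt=0.10, sl=0.10, max_hold=10):
    """Label one long trade with a simplified Triple-Barrier rule.

    close:    Series of prices (indexed by bar).
    entry:    integer position of the entry bar.
    pt, sl:   profit-taking / stop-loss barriers as fractional returns.
    max_hold: vertical barrier, in number of bars.
    """
    path = close.iloc[entry:entry + max_hold + 1]
    rets = path / path.iloc[0] - 1.0

    hit_top = rets.index[rets >= pt]
    hit_bottom = rets.index[rets <= -sl]
    first_top = hit_top[0] if len(hit_top) else None
    first_bottom = hit_bottom[0] if len(hit_bottom) else None

    if first_top is not None and (first_bottom is None or first_top <= first_bottom):
        return 1                      # profit-taking barrier touched first
    if first_bottom is not None:
        return -1                     # stop-loss barrier touched first
    return float(rets.iloc[-1] / pt)  # vertical barrier: scale into (-1, 1) (assumes pt == sl)
```

Applied over a series of entry dates for a single security, this produces the label vector a classifier would train on.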

3.2 - Meta-Labeling

Once you have a model trained to set the side of a trade (labeled by the Triple-Barrier Method), you can train a secondary model to set the size of the trade. This secondary model takes the primary model's predictions as input. Learning the direction and size of a trade simultaneously is much more difficult than learning each separately, and this approach also allows modularity (the same sizing model may work for the long and short versions of a trade). We must again label our data, via a method de Prado calls Meta-Labeling. This strategy labels each candidate trade as either 0 or 1 (1 if the secondary model takes the trade, 0 if not), with a probability attributed to it. This probability is used to calculate the size of the trade.

Useful considerations for binary classification tasks (like the Triple-Barrier Method and Meta-Labeling) are sensitivity and specificity. There exists a trade-off between Type 1 (false positive) and Type 2 (false negative) errors, as well as true positives and true negatives. The F1-score measures the efficiency of a classifier as the harmonic mean of precision (TP / (TP + FP)) and recall (TP / (TP + FN)). Meta-labeling helps maximize F1-scores: we first build a model with high recall, regardless of precision (it learns direction, but generates many superfluous trades). Then we correct for the low precision by applying meta-labeling to the predictions of the primary model. This filters out false positives and scales our true positives by their estimated probability of being correct.
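The workflow might look roughly like the toy sketch below, where the synthetic data, the 0.5 threshold, and the probability-as-size rule are all assumptions (and in practice the probabilities would come from out-of-sample predictions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy data: X are features, primary_side is the primary model's predicted side (+1/-1),
# realized is the sign of the outcome of acting on that side (from triple-barrier labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
primary_side = np.sign(X[:, 0] + rng.normal(scale=0.5, size=1000))
realized = np.sign(X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=1000))

# Meta-label: 1 if the primary model's side agreed with the realized outcome, else 0.
meta_y = (primary_side == realized).astype(int)

# Secondary model decides whether (and how much) to act on the primary signal.
meta_X = np.column_stack([X, primary_side])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(meta_X, meta_y)

prob = clf.predict_proba(meta_X)[:, 1]
take = (prob > 0.5).astype(int)
bet_size = prob * take  # crude sizing: scale the bet by the predicted probability

print(precision_score(meta_y, take), recall_score(meta_y, take), f1_score(meta_y, take))
```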

4. Learning Algorithms for Direction and Bet Size

Now that we've discussed the considerations when structuring financial data, we are finally ready to discuss how programs actually learn to trade!

A machine learning archetype known as ensemble learning has been shown time and again to be robust and efficient. These algorithms combine many weak learners (e.g. decision trees) to create a stronger signal. Examples include random forests, other bagged (bootstrap-aggregated) classifiers, and boosted classifiers. They also provide feature-importance measures, which let us prune the feature space to decrease the prevalence of overfitting. This discussion assumes you are at some level familiar with machine learning methods (particularly ensemble learners). If you are not, scikit-learn's tutorials are a fabulous starting point.

4.1 - Bootstrap Aggregation (Bagged Classifiers)

This is a popular ensemble learning method that aggregates many individual learners that are prone to overfitting when used in isolation (decision trees are common) into a lower-variance 'bag' of learners. The rough recipe is as follows:

  1. Generate N training datasets through random sampling with replacement.
  2. Fit N estimators, one on each training set, independently of each other (so they can be trained in parallel).
  3. Take the simple average of the forecasts of the N models, and voilà! You have your ensemble forecast. (For classification with discrete labels, use majority-rule voting rather than a simple average; if predicted probabilities are involved, the ensemble forecast averages the probabilities.)

The chief advantage of bagging is reducing variance to address overfitting. The variance is a function of the number \(N\) of bagged classifiers, the average variance of a single estimator's prediction, and the average correlation among their forecasts.
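In scikit-learn this recipe is a few lines; the dataset, estimator count, and other hyperparameters below are illustrative assumptions only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# N decision trees, each fit on a bootstrap sample drawn with replacement;
# predictions are combined by majority vote / averaged probabilities.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,   # N in the recipe above
    max_samples=1.0,    # each bootstrap sample is as large as the original set
    bootstrap=True,     # sample with replacement
    n_jobs=-1,          # the estimators are independent, so fit them in parallel
    random_state=0,
)
print(cross_val_score(bag, X, y, cv=5).mean())
```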

Lopez de Prado also presents sequential bootstrapping, a new bagging method that produces samples with higher degrees of independence from each other (in the hope of approaching an IID dataset). This further reduces the variance of the bagged classifiers.

4.2 - Random Forests

Random forests are designed to reduce the variance and overfitting potential of decision trees. They are an implementation of bagging (the 'forest' being the aggregation of many trees) with an extra layer of randomness: when optimizing each node split, only a random subsample (without replacement) of the attributes is evaluated, which further decorrelates the estimators.
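In scikit-learn the attribute subsampling is exposed through max_features; the specific settings here are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',   # evaluate only sqrt(n_features) candidate attributes per split
    min_samples_leaf=50,   # a blunt guard against overfitting noisy financial data
    n_jobs=-1,
    random_state=0,
)
```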



4.3 - Feature Importance

Feature importance analysis allows us to prune the features in our noisy financial time-series dataset that do not contribute to performance. Once important features are discovered, we can experiment on them. Are they always important, or only in some specific environments? What triggers a change in importance over time? Can those regime switches be predicted? Are those important features also relevant to other financial instruments? Are they relevant to other asset classes? What are the most relevant features across all financial instruments? What is the subset of features with the highest rank correlation across the entire investment universe? Pruning our feature space is an important part of optimizing our models for performance and guarding against overfitting, like every other consideration above. In general, there is a lot to explore here, and it is outside the scope of this already lengthy post.
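As a starting point, the impurity-based importances that come with a tree ensemble are one line to pull out; the synthetic dataset below is a stand-in, and the book discusses more robust variants (mean-decrease accuracy, single-feature importance) that this sketch omits:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=3, random_state=0)
feature_names = ['f%d' % i for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean-decrease-in-impurity importances, one value per feature; low-ranked
# features are candidates for pruning.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```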

4.4 - Cross-Validation

Once our model has picked up on some features, we want to assess how it performs. Cross-validation is the standard technique for this analysis, but it does require some healthy finagling for application in finance, as is the theme today. Cross-validation (CV) splits observations drawn from an IID process into a training set and a testing set, the latter of which is never used to train the algorithm (for fairly obvious reasons), only to evaluate it. One of the most popular methods is \(k\)-fold, where the data is split into \(k\) equally-sized bins (or folds), one of which is used to test the results of training on the remaining \(k-1\) bins. This process is repeated \(k\) times, so that each bin is used as the testing bin exactly once.

This method, however, has problems in finance, as data are not IID. Errors can also result from multiple testing and selection bias when multiple sets are used for both training and testing. Information leaks between bins because our observations are correlated (if \(X_{t+1}\) depends on \(X_t\) because they are serially correlated, and they are binned separately, the bin containing the latter value will contain information from the former). This inflates the perceived performance of a feature that describes \(X_t\) and \(X_{t+1}\), whether it is valuable or irrelevant, leading to false discoveries and overstated returns. Lopez de Prado presents solutions to these problems that also allow more learning to be done on the same amount of data. One in particular, which he calls Purged \(K\)-Fold CV, simply purges from the training set any observations whose labels overlap in time with the testing set. Another deletion strategy, which he calls embargo, eliminates from the training data any signals that were produced during the testing set (essentially just deleting some # of bars directly following the testing data). He also provides a framework to find the optimal \(k\) value assuming little or no leakage between training and testing data.
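A stripped-down sketch of the purging and embargo logic is below; it assumes every label at position i depends only on the next label_horizon bars, which is a simplification of the book's treatment:

```python
import numpy as np

def purged_kfold_indices(n_samples, n_splits=5, label_horizon=10, embargo=5):
    """Yield (train_idx, test_idx) pairs with purging and a simple embargo.

    Any training sample whose label window (i .. i + label_horizon) overlaps the
    test fold is purged, and `embargo` bars immediately after the test fold are
    dropped as well. A simplified sketch of the idea, not the book's implementation.
    """
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        test_start, test_end = test_idx[0], test_idx[-1]
        train_mask = np.ones(n_samples, dtype=bool)
        # Purge: drop the test fold and the samples whose labels reach into it.
        train_mask[max(0, test_start - label_horizon):test_end + 1] = False
        # Embargo: drop a buffer of bars right after the test fold.
        train_mask[test_end + 1:test_end + 1 + embargo] = False
        yield indices[train_mask], test_idx

# Usage: for train, test in purged_kfold_indices(len(X)): fit on X[train], score on X[test].
```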

So, finally, we have everything we need. Prune our data, label it for direction, train an ensemble learning algorithm, label for size, train another algorithm on that, and combine. Integrate the results of this with the Quantopian IDE and backtester and send it to the contest! This may seem like a lot, and it is, but much of the programming is modular. With the groundwork laid, you just might find yourself churning out more impressive strategies than ever before.



5. Further Learning

This discussion drew heavily from Advances in Financial Machine Learning by Marcos Lopez de Prado. This is an excellent resource if you are already familiar at a high level with investment management, machine learning, and data science. If you are not, the Quantopian lecture series is a great place to start, especially combined with Kaggle competitions and the scikit-learn machine learning tutorials. Don't be afraid to just play around, it can be fun!

Once you're comfortable with all of that, go through Lopez de Prado's book and work on implementing these methods with a data structure you created, and plugging the result into a few different specially-calibrated machine learning algorithms. If you have a predictive model, test it out-of-sample for a good long while (though your methodology should have prevented overfitting) and see if it sticks. If it does, congratulations! Pit your strategy against others in the contest and we might give you a call.

Well, that was a lot! If you are new to machine learning in general, new to its use in finance, or just want to learn more, please take advantage of the resources discussed here. We hope to provide significantly more in-depth resources for these topics in the future. This represents a good start to producing a truly thought-out and well-researched strategy, and we hope you make some amazing things with it. Good luck!

55 responses

Nice, but unless you whitelist higher-level open-sourced AI/ML libraries like Keras, PyTorch, TensorFlow or Theano, it's very limiting. One can do more sophisticated AI/ML algos offline and upload results via Self-Serve Data, but they're most likely limited to OHLC data since Q does not allow exporting non-OHLC data such as Fundamentals. The other issue is compute time limitations. Otherwise, this is a good intro.

Great post @Anthony, thank you!

Would you consider doing a webinar on (Intro to) ML for Finance?

For some context: Anthony is an intern on the Quantopian research team and is doing amazing work exploring this workflow. We read Lopez de Prado's book at Quantopian with great interest. The workflow he proposes in the book is a bit different from the cross-sectional ML workflow I outlined in ML on Quantopian Part I, Part II, and Part III, where features are computed at the same time for all stocks and then ranked relative to each other. Here, LdP proposes to look at each stock individually, starting with the suggestion to not use time bars and to not have the labels be computed over the same time horizon every time. I think this does make sense, as opportunities could arise on different time-scales for different stocks at different times.

In this work, we are curious how you can do this on Q. Anthony has a couple of other posts lined up that actually explore the ideas highlighted here in more detail with code, so stay tuned for that.

Has anyone else read the book? What have been your thoughts?

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

I have only just started the book. I have thought about the activity based bars concept. Unfortunately, I haven't figured out a vectorized implementation for building the bars. A Quantcon attendee wrote a nice blog post on the concept. He used Cython to speed up his loops in building the bars from tick data on the S&P futures.

While Quantopian doesn't have tick data for building the bars, it may be possible to get a decent approximation for larger volume bar aggregations using the 1-min data. When I get a chance, I'll play with the concept and get an idea for how long it would take to build volume-based bars for a universe of stocks, and then I'll post my work in a notebook.

I came up with a couple of functions to compute volume bars in Quantopian based off of minute data. (They can easily be modified to create dollar-volume bars as well).

The first function uses a loop and is quite slow. However, while I was coding this, I thought of an idea for a more efficient version. I'm not sure "vectorized" is the correct description (although that is what I used in the notebook to describe this second function). I basically use the groupby function in pandas to speed up the computation. The two versions will create slightly different bars as is described in the attached notebook.

The notebook contains the functions I wrote. In addition, I included a section where I compare the average run time of each function on roughly 15-20 days of data for 1 symbol. Lastly, I included a few charts along with descriptive statistics showing the distributions of the volume vs. time bars for a couple of months.
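In rough outline, the groupby version does something like the following (column names and the volume target are placeholders):

```python
import pandas as pd

def volume_bars_groupby(minute_bars, target_volume):
    """Approximate volume bars from minute OHLCV data using a pandas groupby.

    Each minute is assigned to a bar by integer-dividing the running cumulative
    volume by target_volume, so bar boundaries land as close as possible to
    multiples of the target (at the cost of a mild look-ahead in the cut points).
    """
    bar_id = (minute_bars['volume'].cumsum() // target_volume).astype(int)
    return minute_bars.groupby(bar_id).agg(
        {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'})
```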

Feel free to recommend any improvements to the functions (or anything else for that matter).


Hello Thomas, we have been exploring machine learning by generating features over different timescales for each individual stock outside of Quantopian (using Quandl subscription data), translating these into buy or sell signals, and then uploading the signals into Quantopian to run the backtester and see what the results are. This work is not completed yet; we need to further optimize it by exploring different features (we used simple things like MAs, etcetera). It does give positive results, but it has not yet beaten our best rule-based algorithms on the same topic.

For our test, we manually selected a list of 6 ETFs for which we know, over a long time period, that at least one of them gives a positive result in any given week (rebalancing every week). Based on historic data, we can calculate the absolute maximum return we would have received if, every week, we had allocated 100% of capital to the best-performing ETF for the next week. This was our maximum. We have already built a rule-based algorithm and know its results as well. What we are now trying to do is build out this simple concept with machine learning to see whether we can beat our best rule-based algorithm. In theory, if we find the right input features, the machine learning algorithm should always be able to outperform the rule-based algorithm, but we are not there yet.

When training a RandomTree classifier on data that was labeled by the Triple-Barrier Method, are you assuming that all distributions have the same data length? That is, if you have 5 securities that you split into volume bars (thanks @Michael Matthews), there is no guarantee that all will have the same length. One security may have 450 volume bars, while another may have 462 in the same time period. Should I force the data to be uniform in size?

I guess a better question is, what kind of information should I be trying to feed into the classifier?

@Michael,

Wow! Quick turnaround on that notebook; it looks fabulous. You didn't mention this in the nb, but your vectorized solution actually has the excellent property of minimizing the difference from your target volume, rather than using a plain-old threshold. This does incorporate mild look-ahead bias, like you mentioned, but for an analysis of statistical properties it's not so troublesome. That runtime is excellent, too.

Aiming for 6 bars per day is interesting, because we're trying to separate our bars from the time domain. The tradeoff in choosing a target volume is between higher-frequency data (generally better for machine learning purposes) at small bar sizes, and smaller error at larger sizes. It's up to the researcher how that is implemented, however.

There is an upcoming forum post on this very topic, so keep on the lookout for that.


Thanks, Anthony. Nice Post by the way. I'll be on the lookout for the next one.

I haven't really thought through the selection of the target volume all that much. My thought was that it should probably vary by stock depending on the stock's average volume although questions arise, such as: When do you calculate the average volume (periodically, after a significant change in avg. volume)? Will the volume target change over time if the average volume changes?

I used a targeted number of bars per day just to give some frame of reference. I went with 6 bars per day in the illustration because it is roughly comparable to an hourly chart (for stocks), which can be useful for a 1 to 5 day holding period. Obviously, heavier (lighter) volume will create more (fewer) bars, which is what we want. We are somewhat limited by how granular we can go b/c as you approach 1 bar per minute, you will not get much improvement by using volume bars since we do not have the tick-by-tick data. As you said, I think considerable discretion can be used by the researcher depending on his/her use case.

Hi,

I just started to look at the book, look very interesting :-).
I decided to go through all exercices, starting at chapter 2. Would be nice if someone would review my solution... What do you think should I post it here or should I start a new topic for each chapter?

Did Mr de Prado use these techniques as a fund manager in his bond fund? If he did, then what were the results like? Did his use of machine learning techniques enable him to outperform the market and his competitors, over a significant period? Because, of course, chance must be ruled out.

It has been shown beyond doubt that machine learning and artificial intelligence have proved of great use in telling us what IS. There is little if any evidence that AI is of any significant use in telling us what WILL BE.

AI can recognize a face as a face, a dog as a dog. It can take the rules of any bounded problem and better a human's efforts in dealing with that problem; Go and Chess being examples of this.

Does AI have the ability to predict the future price of a financial instrument at any date in the future with a success rate significantly better than 50% (random)?

If it does not, then the endless hours spent reading Mr De Prado's work may be better spent in some more fruitful manner. The alchemists were eventually proved right: theoretically we can transform bread into gold, or any matter into any other configuration of matter.

Will we ever be able to report the same success in our efforts to predict the future? And if not then the financial market analyst might wish to find a better use of his time.

Anthony, looking forward to the code samples. I have been going through the book recently. Some thoughts:

  • Labeling: Useful topic in general; it would definitely be more useful in standalone research using scikit-learn. Eager to know what the code samples provide.
  • Stationarity: Will be curious to see how you are achieving this in code.
  • Triple Barrier: Not sure how it fits with the contest's minimum end-of-day leverage requirements, the Quantopian pipeline, and order_optimal_portfolio. Some clarification would help.
  • Direction and Bet Size: Curious how this study is transformed into Quantopian factor rankings and order_optimal_portfolio. I see the most benefit here.
  • Feature selection/CV: These are all research-related; a lot of chapters in the book are devoted to this. Will be eager to see code examples demonstrating the book's concepts.

I'd like to see how ML fits into quantopian workflow. Time is limited in the platform for any substantial real time model building.

@Kyle, the triple-barrier method will result in an uneven # of observations anyway, so the input data can be mismatched in size. The vertical barrier is of interest, however, as you can implement it as a set amount of time or a set # of alternative bars. If you prefer the latter, then issues of differing bar counts arise. In general, the 3B method works best when treating each security individually, rather than in a basket. The labels produced are unimportant outside of the context of an individual security, as they are used to validate whether an individual trade was 'good' or 'bad' (in terms of profit). If that's not clear, let me know.

@Michael, thank you! There are many options for selecting your target volume, though de Prado seems fond of the exponentially weighted moving average for these purposes (he uses its standard deviation for the volatility calculation in the triple barrier method). Your points are sound, particularly in regards to tick data unlocking a whole new world of exciting research. For Quantopian's longer timeframe perspective, however, we can derive value from these lower frequency bars.

@David, it's always insightful to see how others attempted to solve problems that you struggled with, so feel free to post in the forums with any code/questions you have. I do encourage you to perform sufficient due-diligence beforehand, however (asking for help is great, but ideally after exhausting most of what you can do on your own).

@Zeno, I cannot speak to de Prado's performance as a fund manager; however, I can say that the work he has shown is extremely compelling. I liken the problem of ML in finance to facial recognition, but over the course of millions of years of human evolution (hypothetically). An algorithm might be excellent at picking up modern human faces, but if our future borders on sci-fi and our genetics change, it will be useless. One of de Prado's focuses is adapting financial datasets to achieve stationarity, a maintenance of variables' properties over time. Profiting in this environment is extremely difficult, but his arguments are mathematically convincing and revolutionary. Note, this isn't sponsored in any way; I am just quite fond of the research he puts out (and was even before my time at Quantopian).

@Leo, it is interesting how/if this brand of machine learning can or will fit into the Quantopian workflow, and frankly it's somewhat up to you to decide. De Prado is less interested in risk constraints and market neutrality than Quantopian, so his methods would need to be massaged a bit to be suitable. For some clarification, the triple barrier method is also part of the research workflow. It is a means to label significant price observations for training a classifier. Ideally, the result of this research would be a model that simply outputs buy/sell orders based on live pricing data. Add some additional constraints to fit the Quantopian contest, or build them into de Prado's methods, and you should be golden. This is no easy task, and requires a strong foundation. Hopefully we can provide this over the next few months, but I encourage you to work at it! The community can provide some amazing code, you all deserve tons of credit.

Stationarity in markets. Hmm, a nice dream. In terms of volatility perhaps, hence the attractiveness of the VIX concept. I agree that the prospect of a technological singularity is most exciting. I look forward to bio-hacking my mind and body. I do not expect, however, to be able to use technology or AI to shoot the lights out in the stock markets.

Call me cynical if you will and I hope you prove me wrong. But I wouldn't count on it.

Going to cross-post a question I had about the fractional differentiation concept from the thread Lopez de Prado posted here a while back (but abandoned). Perhaps someone here is able to answer it, it is kind of crucial IMO, the whole chapter might fall semi-apart just by asking it.

"In the chapter about fractional differentiation you argue we can often obtain a stationary price series by differentiating using a d < 1 instead of d = 1 (i.e. good old returns). There's a table in the chapter that illustrates this fact for a list of different instruments and a code sample is provided to find the smallest d for which the series is stationary.

This code sample uses statsmodels' adfuller function to test for stationarity, but it uses maxlag=1, which as far as I can tell results in quite an optimistic test, i.e. series are found to be stationary when just looking at them tells you they're clearly not. I found that when I use the default maxlag parameter, the obtained d values are generally on the order of ~0.3 higher.

So I guess my question is: why did you choose maxlag=1 and is it still possible to use these weakly/hardly stationary series in a predictive manner? The CUSUM filter mentioned in the book might be a solution, but if I think about it, it seems like a way to introduce stationarity implicitly and doesn't really need a stationary series to begin with."
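Concretely, the comparison looks something like the sketch below (the synthetic series is a stand-in, and fixing autolag=None alongside maxlag=1 is my assumption about the book's snippet):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic series that is only weakly mean-reverting (close to a random walk).
rng = np.random.default_rng(42)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.995 * x[t - 1] + rng.normal()
series = pd.Series(x)

# Single fixed lag vs. statsmodels' default data-driven lag selection (autolag='AIC').
stat_fixed, p_fixed = adfuller(series, maxlag=1, autolag=None)[:2]
stat_auto, p_auto = adfuller(series)[:2]
print('maxlag=1 :', round(stat_fixed, 3), 'p =', round(p_fixed, 4))
print('auto lag :', round(stat_auto, 3), 'p =', round(p_auto, 4))
```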

@Ivory,

Could you please upload a notebook that demonstrates this? You're right that this is an integral part of the information presented, and any conflicting research is vital to consider. In my research, the maxlag parameter has (surprisingly) little impact on the resulting test statistic, so I would be eager to see a contradiction. Interesting point, thank you for bringing it up.

Excellent introductory post!

I should add that the method of fractionally differentiating time series has been in existence since the 80s - it came from hydrology, as did the idea of the Hurst exponent (popularized in finance by Edgar Peters). If one is curious about clustering methods, for instance, you'd want to look at gene sequencing. Tons of cross-pollination leads to better and novel results...

Picking up on James Villa's comment above, it seems that to get started with ML, one would need to be able to output a trailing window of alpha factor values from Pipeline (which I gather it is not designed to do... it only outputs the current alpha vector). One would end up with columns corresponding to SIDs and rows corresponding to daily alpha values, plus additional columns corresponding to training data (e.g. returns, etc.). And then these data would be fed into an ML routine that goes "kerchunk, kerchunk,..." and spits out a single alpha vector for the day.

Presumably Thomas W. does this in his ML algo, but it is not so obvious to me how it is done.

This sort of trailing window of training data would seem to be common to any ML approach. Could a standard set of Pipeline-based tools be put up in an importable Q library (perhaps 'quantopian.research.experimental' would be appropriate)?

I implemented walk-forward and combinatorial cross-validation with purging and embargoing here:
https://github.com/sam31415/timeseriescv
If you're working on your own machine, you can install the package using pip. You probably can't import it on Quantopian, but you can always copy-paste the source code. Here is a medium post with more info:
https://medium.com/@samuel.monnier/cross-validation-tools-for-time-series-ffa1a5a09bf9

@Grant: Yes, I agree that it would be great to have easier access to historical pipeline values in the backtester. In the ML algo post I got around that problem by implementing the ML part as a pipeline factor as well because pipeline factors can get a window of pipeline values. One could probably also somehow copy them out with a dummy history factor that writes to a global variable, although that would be quite hacky.

@Samuel: That looks amazing, thanks so much for sharing. I love how you kept close to the sklearn API. In the medium term we should definitely include this on the platform, until then, maybe you would like to post a sample NB that demonstrates it but just copy & pastes the code?

@Al: Thanks for the insight. I completely agree that a lot of progress in quant finance can come from applying tools from other domains (as has been proven historically). Actually I think this is a core strength of Quantopian where we try to make it as easy as possible for people from various other domains to try their ideas without having to commit to a new career path.

@Thomas: I don't really have shareable data ready to include in a notebook, and the pandas.util.testing module that I used to generate fake data in the tests won't import in the research environment.

In principle, all you have to do is to paste the content of cross-validation.py into a cell of the notebook. You can then use the classes PurgedWalkForwardCV and CombPurgedKFoldCV. That's in principle. In practice, I get a lot of errors due to the restrictions of the research environment:
- It doesn't like f-strings apparently.
- Same with type hinting.
- abc and typing can't be imported.
I stopped trying when I got:

InputRejected: Insecure built-in function 'super'. Last warning! One more import error and your account will be suspended for security reasons until a human can talk to you.

@Samuel: Bummer, but thanks for trying!

@ Thomas -

I was thinking of built-in functionality within Pipeline for alpha combination that uses a trailing window of alpha factor values and their associated returns, etc. It is not clear, though, whether it might be better to do it outside of Pipeline, in the dedicated 5-minute before_trading_start window.

Anyway, is there something keeping you guys from working on this stuff? I kinda get the impression that you are either all in some kind of holding pattern, or busy with other stuff. I see the little "Q" by everyone's picture, so I guess you are all still employed.

@Grant: Our focus right now is on expanding our coverage of markets to global equities.

Thanks Thomas -

In my opinion, there is a bit of a gap in the Q support for the workflow in the area of Alpha Combination, which would include ML, as your former CIO, Jonathan Larkin, discusses:

The weighting scheme can be quite simple: sometimes just adding ranks or averaging your alphas can be an effective solution. In fact, one popular model does just that to combine two alphas. For increased complexity, classic portfolio theory can help you; for example, try solving for the weights such that the final combined alpha has the lowest possible variance. Lastly, modern machine learning techniques can capture complex relationships between alphas. Translating your alphas into features and feeding these into a machine learning classifier is a popular vein of research
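For example, the "just adding ranks" baseline is only a few lines of pandas (the factor values and tickers below are hypothetical):

```python
import pandas as pd

# Hypothetical per-day alpha vectors, indexed by security.
alpha_value = pd.Series({'AAPL': 0.8, 'MSFT': 0.2, 'XOM': -0.5, 'TSLA': 0.1})
alpha_momentum = pd.Series({'AAPL': -0.1, 'MSFT': 0.6, 'XOM': 0.3, 'TSLA': -0.4})

# Rank each factor cross-sectionally, average the ranks, then demean and scale
# the result into long/short weights.
combined = pd.concat([alpha_value.rank(), alpha_momentum.rank()], axis=1).mean(axis=1)
weights = combined - combined.mean()
weights /= weights.abs().sum()
print(weights.sort_values(ascending=False))
```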

I'm thinking that to approach this ML stuff incrementally, I need some generic code structure within the Alpha Combination step, so that I can move easily from the simple to the more complex ways of combining factors (which will allow for benchmarking both computational efficiency and algo financial performance, besides just being good coding practice).

I posted a request for guidance here: https://www.quantopian.com/posts/alpha-factor-combination-in-pipeline-how-to-fancify-it.

I understand you are busy going after the next big thing, but maybe someone could scratch something on a piece of scrap paper over a coffee break and then ask one of your customer support folk to post it?

@Grant: I think the ML part 3 has all the parts that are needed. There definitely could be many improvements made but it should be workable.

O.K. Well, I guess I'll have to try to unravel it myself. At this point, I'm not interested in the ML, but it seems to be kinda wrapped up in the code. Kinda confusing that you'd have an intern spend a lot of time on an ML project, but not want to support it more formally...

@Zenothestoic

Marcos de Prado is now the head of machine learning at AQR Capital Management. We will be able to tell if what he preaches actually works based on their funds' performance, which is publicly available.

@ Zenothestoic

Uhhhhh. Yeah, you are probably correct. He still has so much to prove and for sure will never accomplish anything with this AI/ML crap......
Just in case you did not pick up on the sarcasm that I was trying to lay on pretty thick there, see below: this is from his website, which I have followed for quite some time. And his papers are incredible; I have not read all of them, but I would blindly recommend any of those I have not read simply because of the ones I have read. My comments on your post are below the website info. If you have any questions or comments, please feel free to ask, argue, tell me how dumb I am, whatever. I'm always open to even nonconstructive criticism.

from http://www.quantresearch.info/

Machine learning (ML) is changing virtually every aspect of our lives. Today ML algorithms accomplish tasks that until recently only expert humans could perform. As it relates to finance, this is the most exciting time to adopt a disruptive technology that will transform how everyone invests for generations. This website explains scientifically sound ML tools that have worked for me over the course of two decades, helping me to manage large pools of funds for some of the most demanding institutional investors.

same website, different section:
Marcos López de Prado is a principal at AQR Capital Management, and its head of machine learning. Before AQR, he founded and led Guggenheim Partners’ Quantitative Investment Strategies (QIS) business, where he developed high-capacity machine learning strategies that consistently delivered superior risk-adjusted returns, receiving up to $13 billion in assets.

Concurrently with the management of investments, between 2011 and 2018 Marcos was also a research fellow at Lawrence Berkeley National Laboratory (U.S. Department of Energy, Office of Science). He has published dozens of scientific articles on machine learning and supercomputing in the leading academic journals, and SSRN ranks him as one of the most-read authors in economics. Among several monographs, he is the author of the graduate textbook Advances in Financial Machine Learning (Wiley, 2018).

Marcos earned a PhD in financial economics (2003), a second PhD in mathematical finance (2011) from Universidad Complutense de Madrid, and is a recipient of Spain's National Award for Academic Excellence (1999). He completed his post-doctoral research at Harvard University and Cornell University, where he teaches a financial machine learning course at the School of Engineering. Marcos has an Erdős #2 and an Einstein #4 according to the American Mathematical Society.

RECENT ACADEMIC CONTRIBUTIONS

Author, Advances in Financial Machine Learning (Wiley, 2018).
Co-editor, Cambridge Elements in Quantitative Finance.
Member of the advisory board, Journal of Portfolio Management.
Co-editor, Journal of Financial Data Science.
Member of the board of directors, International Association for Quantitative Finance.
Adjunct professor, Cornell University, Special Topics in Financial Engineering V (ORIE 5256).
Over 50 peer-reviewed publications in scientific journals, including:

Notices of the American Mathematical Society
Journal of Financial Economics (JCR 5Y IF: 7.513)
Review of Financial Studies (JCR 5Y IF: 5.864)
IEEE Journal of Selected Topics in Signal Processing (JCR IF: 4.361)
Mathematical Finance (JCR IF: 2.714)
Journal of Financial Markets (JCR 5Y IF: 2.234)
Quantitative Finance (JCR IF: 1.170)
Journal of Computational Finance (JCR 5Y IF: 0.831)
Journal of Portfolio Management (JCR IF: 0.812)
Journal of Risk (JCR IF: 0.627)

My personal response to your original post about ML/AI methods in finance...

As for the actual role of AI/ML/etc techniques as applied to Finance, I do not think we have to use these as much to "predict" specific points, levels, values. Although I will say when it comes to that, these modern techniques go far beyond econometric methods that very intelligent people have worked on for years (albeit they have pretty much spent decades just trying to get multiple regression to produce accurate future values, instead of maybe looking into other possibilities...). I guess what I am trying to say is that I feel like it can be just as much about what can be interpreted from the analysis in general, opposed to precise measures. The qualitative info gleaned from quantitative techniques, if you will.

Just because people are not coming out daily with examples of some machine learning approach accurately predicting the exact close price in the S&P three days from now does not mean there is no value to these methods. Of course, everyone's opinions will differ with risk tolerance, horizon, etc, but I think some of the valuable insights are about what can be obtained from, say, maybe the relationships between certain data and how they interact or the patterns that can be found among markets or instruments which the human eye would never be able to spot.

While some, very secretive, brilliant people are possibly using ML/AI/NLP/whatever to predict exact prices or future price distributions (cough cough Renaissance and their coveted Medallion Fund), I think people can get plenty out of this technology if it is used as just another tool (that is superior to all other current tools by a long shot). However, it is very important to realize just how much Marcos López de Prado understands about these methods. If someone has no previous background in the mathematics, statistics, and computer science, these methods can be dangerous. Even worse, if someone thinks they know all about these topics but is not actually on the higher level subject matter that deals with them, then they can be even more dangerous. "A little knowledge can be a very scary thing", I think that is how it goes. Hopefully you see where I am coming from.

Basically, please try not to dismiss these techniques because financial markets are impossible to predict (from a deterministic standpoint). They can help with a wide variety of issues that the previous methods in finance have handled incorrectly for a very long time. That being said, be sure to know your role, for lack of a better term. Learn the technical aspects behind these applications, so you can understand when, how, and why certain methods are applicable while others are not; along with what changes may need to be made in order to apply textbook examples to real world data. I guess I am trying to say the equivalent of the age old "correlation does not mean causation" (once again, hope you get the idea but probably wrong quote), except these subjects deal with topics on a much higher level than correlation and problems could arise that are much more difficult to see and understand than the simple difference of correlation vs causation.

Coming back to information bars, has anyone noticed an inherent instability in the formulas? Specifically, when we look at the formula for the tick imbalance bars, it is clear that as the probabilities of up-ticks and down-ticks approach 0.5 (which they will with very high likelihood) the right-hand side of the inequality under argmin{} approaches zero, which in turn makes T* tend to 1. This is because when that part is zero the rule emits a bar after only one tick (as any value of theta is higher than 0, making T=1 the lowest value to satisfy the inequality). If we then approximate E[T] by exponentially smoothing it, as suggested in the book, the ones will eventually dominate. And once T* hits 1 it stays there. This is actually confirmed when calculating the bars on actual tick data (I used e-mini futures ticks for Sep 2013) - depending on the initial value of E[T], T* drops to 1 more often than not, and fairly quickly. The same problem persists if we use volume/dollar imbalance bars.
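For reference, the tick imbalance bar rule in question is, up to notation,

\[
\theta_T = \sum_{t=1}^{T} b_t, \qquad
T^{*} = \arg\min_{T} \Bigl\{ \, |\theta_T| \;\ge\; \mathrm{E}_0[T]\,\bigl|\,2\,\mathrm{P}[b_t = 1] - 1\,\bigr| \Bigr\},
\]

so as \(\mathrm{P}[b_t = 1]\) drifts toward 0.5, the threshold on the right collapses toward zero, which is exactly the degeneracy described above.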

More surprisingly, the problem of vanishing T* appears in tick runs bars (and their volume/dollar brethren) as well. It’s surprising because no part of the inequality can be zero in this case. However, the min value of max{P[b_t=1], 1-P[b_t=1]} tends to 0.5 as probabilities of up/down_ticks approach this number (max of 0.5 and 0.5 is of course 0.5). This means that with every iteration we basically halve the value of previous E[T] and once this process establishes itself it drives the values of T* to 1, too.

Of course, this may be intended behaviour, although in that case I can’t really see any informational value of this exercise. I was thinking of maybe putting a floor to values of T* but this kind of defies the whole purpose. Then again, probably I’m just missing something and I would very much appreciate any comments/suggestions.

Another very nice book, probably known by most of you: The Elements of Statistical Learning.
One needs a bit of math background (even if a reader without a strong math background can read it and skip the equations...). I found this book more useful than the Lopez de Prado one, as it presents concepts from scratch; kind of a bible for a quant who wants to be rigorous ;-).

Then...
@Zenothestoic

First, one has to understand what ML is and what it is not! AI is a very vast domain; you can have an AI bot which does not rely on ML, and the first chess bots had not a single part of ML in them! In the strict definition of AI, any trading algorithm is an AI...

A lot is said about ML, most of it complete crap written by people who see ML as a black box and have no clue what ML really is! But ML is well defined mathematically; in very rough wording: ML is fitting! Linear regression is ML! All, but really all, ML machinery has a single purpose: finding the right parameters of a model... this is true from clustering to deep learning!

So if one wants to build strategies where every parameter is defined by some fundamental understanding of the market, one does not need ML. But as soon as you tune your parameters to try to maximize the profit of your algorithm, you are using ML techniques! So it might be interesting to understand what ML is, how to use it, what the limitations are, what the risks are... Except if you never ever, but really never ever, tune a single parameter of your algos...

@ Anthony C: Nice review, thank you. Don't let any of the cynics discourage you!

@ Zeno the Cynic ;-)) who indeed, as @ David D points out, does not make a clear distinction between ML & AI, writes: " ... machine learning and artificial intelligence have proved of great use in telling us what IS. There is little if any evidence that AI is of any significant use in telling us what WILL BE". Well, I probably might have agreed, until I read that ML (specifically RL) has recently been used for winning a significantly wider range of games. Not just Go & Chess, but other more adaptive modern games where the rules are not necessarily clearly defined up-front.

I note the word "bounded" in the comment: "It [i.e. ML, in some form] can take the rules of any bounded problem and better a human's efforts in dealing with that problem". Now that's quite an interesting thought. Are the "rules" of the markets bounded or not? and how important is that?

As for the techno-singularity (I presume referring to Ray Kurzweil's concept), I always wondered WHO would actually be the beneficiaries? Probably not the broad masses of humanity, but rather just the ultra-wealthy, who will initially be the only ones who can afford the bio-chip implants, personal AI interfaces, etc, and then after that THEY will be the ones who get to run the game... so let's just hope some of them at least have a modicum of altruism & compassion.

Anyway, coming back from the realms of sci-fi, @ David D makes some good comments about current-day ML & AI.
If ML is to be used in the best possible ways, then it is essential to have a clear and thoughtful understanding of what ML is, does and can potentially do for us.

My personal view is that, as Ernie Chan implies, efforts to accurately predict tomorrows stock-market prices are probably not actually the best use of ML anyway, and there are certainly lots of good alternatives for applications of ML that are potentially very relevant to Fundamental Analysis.

No annual report will ever explicitly tell us who the bad managers are, the ones who are just frittering away shareholder wealth while indulging in their own ego-projects, but ML has already proven to be very good at helping us read between the lines on issues like credit-worthiness, fraud, and those sorts of things. Extrapolating to detecting "bad management implies bad investment" from available subtle clues really IS more about "what IS" than about "predicting tomorrow's prices".

I am indeed very cynical of Quantopian's approach. See my comment on https://www.quantopian.com/posts/free-cash-flow-to-enterprise-value-with-factset-data-template-fundamental-algo

How come an algo solely concerned with free cash flow is defined as pure volatility? I fear that while all this is a good attempt, they are aiming at a moving and not very precise target.

Let me clarify a bit what "ML is fitting" means. The question is how a computer can "learn". A computer cannot learn! Keep that in mind! A computer is not a brain; it is not adaptive and plastic. It is a simple, stupid machine which executes an instruction pipeline on given addresses. But "machine learning" sounds much nicer for funding than "fitting some parameters"!

ML is not a tool which can by some magic predict the future; it can't even understand the past on its own! One has to understand what modeling is to understand a bit of what we can do with a computer. The goal (for playing a game, analyzing data, building a new physics theory, creating a beautiful algo which will make you rich...) is always the same. You have some raw data and you want to build "output data" from them. Those output data can be anything (from a list of Go moves to simple statistical quantities describing your raw data). So you have some data and you want to process them to get another set of data. But you need a link between those two sets! This link is a model. So you can think about a pipeline:

raw data -> model -> output data.
The model is a mathematical function which takes the input data and a set of parameters and returns the output data.

Then the big question is what this model is, and how to build the best model to get the best output data (the best move in Go, the most relevant statistical properties, the best investment...). Neither ML, nor AI, nor the magic holy Pastafarian God can give you this model! This model is not something your computer can learn! So you build a model; for example, you decide that the movement of objects follows the rules of Newtonian dynamics and gravity. You have a model, but this model can depend on some parameters, for example the gravitational constant G. But what is the value of G??? To know this (or to learn it), you need data... You get some data, and then you compute the value of G that fits this data. The more data you have, the more accurately you can "learn" the value of G... ML is the statistical machinery that allows you to get this value of G! It does not learn how Newtonian dynamics works... it just allows you to find the parameters of your model which make the best link between your output data and your raw data, KNOWING THE MODEL!
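As a toy version of that point in code (the free-fall "measurements" here are made up):

```python
import numpy as np

# The model we BELIEVE in: free fall, s(t) = 0.5 * g * t^2. The machine does not
# learn this model; it only fits its single parameter, g, from noisy observations.
rng = np.random.default_rng(1)
t = np.linspace(0.5, 3.0, 50)
s_observed = 0.5 * 9.81 * t**2 + rng.normal(scale=0.2, size=t.size)

# Least-squares estimate of g given the model structure.
g_hat = 2 * np.sum(s_observed * t**2) / np.sum(t**4)
print(g_hat)  # close to 9.81; more data gives a more accurate estimate
```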

This is also true for neural nets. A neural network is a nice "pictorial" description of a very complicated non-linear function. This function depends on parameters, but the structure of the function is DEFINED. The learning phase is then once again fitting the parameters of this function!
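The same point in code: a minimal NumPy sketch (again a toy of my own) in which the network's structure, a one-hidden-layer tanh regression, is fixed up front, and "training" is nothing more than gradient descent on its parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# The DEFINED function: y = tanh(x @ W1 + b1) @ W2 + b2.
# Its structure is fixed; only the parameters W1, b1, W2, b2 get fitted.
def net(x, W1, b1, W2, b2):
    return np.tanh(x @ W1 + b1) @ W2 + b2

# Raw data: a noisy sine curve the fixed function should approximate.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x) + rng.normal(0, 0.1, x.shape)

# Parameters to fit.
W1, b1 = rng.normal(0, 1, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 1, (16, 1)), np.zeros(1)

# "Learning" = gradient descent on squared error, i.e. parameter fitting.
lr = 0.05
for _ in range(2000):
    h = np.tanh(x @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2
    err = pred - y                    # d(loss)/d(pred), up to a constant
    grad_W2 = h.T @ err / len(x)
    grad_b2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h**2)    # backprop through tanh
    grad_W1 = x.T @ dh / len(x)
    grad_b1 = dh.mean(0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print("final MSE:", float(((net(x, W1, b1, W2, b2) - y) ** 2).mean()))
```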

So: indeed ML is not a tool which predicts the future, it is not a tool which predicts the past, it is not a tool which predicts the present! It is a tool which allows you to find the best parameters of a GIVEN model to link 2 datasets. The question in algorithmic trading is not whether ML is useful! Saying that it is not means believing that one has the holy power of building models without any data... Or, said differently, that completely random investment is the best investment solution.

The ultimate aim of AI is precisely what you (partially correctly) say it currently is not: to arrive at plasticity and the capability to learn. To become a brain, you might like to say.

As to predictability, that is precisely what a brain spends most of its time doing. If I cross the road now, will that car avoid me if I use my customary speed? Is this business idea a good one? Is there any evidence that this product will become popular?

And what exactly is 'learning'? Is it something that only brains or the human brain can do? Does it require great insight and plasticity? Certainly not in most cases. It merely requires looking at cause and effect in the past and projecting that onto the future. And no, not all of these predictions are based on classical physics. So in many ways current algorithms can be said to learn. And some are even self-adaptive.

Intuition? Consciousness? According to Giulio Tononi those are merely a matter of information and complexity, and therefore most matter has consciousness to some degree. Including current computers. The guy used to work with Crick of DNA fame and is very well regarded in consciousness research.

And what of prediction? Is it even possible anyway, or merely an illusion, as many eminent scientists believe free will to be? Or the other way around. And even if we live in a purely deterministic universe (and hence have no free will in the absence of a multiverse), are stock markets complex chaotic systems? If so, the chaos theorists tell us that even these purely deterministic systems are in practical terms unpredictable, for well-rehearsed reasons.

"Completely random investment is the best investment solution." Well that certainly is not a belief I subscribe to.

Myopia in investment strategies. You need to look from a long long way out to avoid it.

I am not aware of a single piece of software which has managed to reach consciousness yet. I am not aware of a single piece of software which has managed to learn something without a model. It is nice to "dream" about sci-fi, but currently the state of AI is fitting parameters. There are attempts at self-generated models, but we are far away from something which could work. Yes, that is the state of research now, and if you can code a general-purpose AI: don't waste your time on anything else, as I am sure you would find a buyer for a couple of trillion dollars!

By the way: a chaotic system is a deterministic one... Stochastic systems are not deterministic, but in nature we have not yet found a single fundamentally stochastic system ;-).
Then the questions you ask are not about ML, they are about the financial system. If you believe that the markets are not predictable, then why try to invest? A casino works fine too.

No, David you are missing my point entirely. You are not grasping what neuroscientists are positing as to the nature of consciousness. Which, if they are right, may only be some hyped up form of information processing. Take a look at Tononi and Integrated Information Theory.

This is not as you (somewhat contemptuously?) put it "science fiction" but current neuroscience. We are indeed some way away from modeling the human brain (despite the Geneva project ) and hence it seems likely that AI and brain research will move forward slowly in tandem.

It is, in my (possibly misinformed) view, a mistake to be too narrowly focused. This is a common problem now that the Renaissance Man is dead. But my suggestion is that you might find the latest inroads into consciousness research (both philosophical (Dennett) and practical (Tononi)) of interest when thinking about machine learning and AI in relation to investment.

The takeaway from modern neuroscience is that consciousness may merely be highly complex information processing, and that to some degree, therefore, every part of the universe has some form of consciousness, albeit mostly limited. If he is right, even my humble laptop has (very limited) consciousness by his definition.

I find great pleasure in thinking in the round and not getting too caught up in myopia. In general (with some notable exceptions in the past) I try not to give way to sarcasm or unhelpful thoughts.

By the Geneva project you probably mean "Blue Brain"? For Blue Brain to replicate the human brain (which will be done at some point), we would need a factor of 1000 in computational power, which means a computer that needs around 100 nuclear power plants. Let's say we could drop the current consumption per flop by a factor of 10 (which might be achievable); it's still 10 nuclear power plants ;-).

The question is not to replicate the brain, the question is to find another way to AI. Otherwise, let's build biological computers...

The ideas of Dennett are great, but then how to implement them in a computer in a useful manner??? I have spent quite a lot of time trying ;-). Sci-fi is for me anything which is not yet implemented; if you have an idea, you usually sit down for 4 days and implement it ;-).

One big issue in current-day science is the big split between what is said in the media/conferences/papers/books and what is real.

I am unable to not be sarcastic ;-).

"Sci-fi is for me anything which is not yet implemented"...well that casts a rather odd light of much of fundamental physics! Perhaps we should have ignored Einstein in the days before his theories were tested?

As to the impossibility of computation that takes me right back to the point I was making earlier.

If markets are complex adaptive systems, then even if they turn out to be deterministic, there are too many variables to compute, and thus they may well remain beyond the scope of any sort of AI, past, present or future. We may as well try to predict long-term weather; our chances of success are probably about equal.

But back to more practical matters.

I make the point once again about myopia. The object of so many funds seems to be short-term gain. This quarter's results. My back testing of various fundamental factors and ratios relating to free cash flow (the touchstone of valuation in the investment banking world) ran into some problems. Even without ML it should be obvious that a stock will only survive if it is healthy.

And yet according to Quantopian the results of back testing an FCF ratio reveal that it has no specific return. Bollox, surely? To coin a phrase.

I think the problem with ML and Q's non ML (but getting there) approach is that some absurdities are thrown up which need to be investigated and ironed out. With people combining so many factors and so many constraints I believe errors get made and important factors overlooked.

And yes, of course, we humans are at present the only form of intelligence which can correct those errors. My comments on the FCF factor algo can be found here: https://www.quantopian.com/posts/free-cash-flow-to-enterprise-value-with-factset-data-template-fundamental-algo

Indeed one should either ignore or try to test a theory which is not yet tested. In fact, if you propose a new theory, you will never pass peer review if you do not provide strong tests. Einstein's theories were tested before they were formulated! Special relativity was known to be the correct transformation for electrodynamics (that is why they are called Lorentz transformations, respecting the Lorentz symmetry group, and not the Einstein one). What Einstein did with his special relativity was to find a framework to explain electrodynamic behavior in a covariant way (here covariant means independent of the coordinate system). Then the half-life of particles was known to depend on their energy; special relativity managed to explain why.

His "natural" second step was to try to understand what happen when reference frames are accelerated, this leaded to general relativity and a new description of gravity, which has been immediately tested versus Newtonian gravity using the mercury perihelion precession. It was due to the fact that General Relativity was able to explain this precession (which was one of the big issue of Newtonian gravity/dynamics) which make the discovery publishable.

So yes, a theory which cannot be tested should be put in the bin... If you are interested in theory construction, Hasan Ibn al-Haytham is a thinker who is worth some attention; he is the one who formulated the principle that a theory needs to be tested to be validated, which helped drive humanity out of the dark ages of the medieval period.

It is certain that markets are complex systems, and it is known that one cannot predict the market exactly. Anyone who tried to do so would be a fool! But one can try to predict the market statistically, with the goal of finding a small edge. It is just like cosmology: we cannot have the exact initial conditions of our universe, but we can describe it statistically and infer a lot of information about the system from those statistics. In market analysis one is also bound to study only statistical properties... Then tools such as Bayesian statistics, MCMC, etc. are damn useful to infer whether a model is good, and such methods are statistical learning approaches which are therefore part of ML...
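As a toy illustration of that kind of statistical inference (my own sketch, not from any book): assume a deliberately crude model in which daily returns are Gaussian with unknown drift mu and known volatility. A few lines of Metropolis-Hastings then give a posterior distribution for mu, i.e. a statistical statement with uncertainty attached, rather than a point forecast.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "raw data": synthetic daily returns with a small positive drift.
true_mu, sigma = 0.0004, 0.01
returns = rng.normal(true_mu, sigma, 1000)

# Log-posterior for mu: Gaussian likelihood with known sigma, wide Gaussian prior.
def log_post(mu):
    log_prior = -0.5 * (mu / 0.01) ** 2
    log_like = -0.5 * np.sum((returns - mu) ** 2) / sigma**2
    return log_prior + log_like

# Metropolis-Hastings: random-walk proposals, accept/reject on the posterior ratio.
samples, mu, lp = [], 0.0, log_post(0.0)
for _ in range(20000):
    prop = mu + rng.normal(0, 0.0005)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        mu, lp = prop, lp_prop
    samples.append(mu)

post = np.array(samples[5000:])           # drop burn-in
print(f"posterior mean of mu: {post.mean():.5f}")
print(f"95% credible interval: [{np.quantile(post, 0.025):.5f}, {np.quantile(post, 0.975):.5f}]")
```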

"I think the problem with ML and Q's non ML (but getting there) approach is that some absurdities are thrown up which need to be investigated and ironed out. With people combining so many factors and so many constraints I believe errors get made and important factors overlooked."

This can only be very, very true! But using as many factors as possible is a good approach, especially if the factors are not "too much" correlated. The question is how to combine them in a controlled environment.

Then it is not so obvious to me that a stock survives only if it is healthy; I think it is more true that a stock survives if investors believe that it is healthy... And beliefs are not always correlated with reality :-).

@Zeno vs @David, i am enjoying this debate, and i think that (both of) your discussions in the more philosophical aspects are useful in helping us all to think in broader terms. Picking up on a number of specific points, in no particular order:

Zeno's " ... a stock will only survive if it is healthy" is true in the long-term, although a lot of dogsiht stocks with no CF do survive for long enough in the short-term to make plenty of money for many investors. DavidD's comment: "... it is more true that a stock survive if investors believe that it is healthy... And believes [beliefs] are not always correlated with reality" is of course the well-known old "market as a beauty contest" metaphor, and both of your perspectives are consistent with Warren Buffett's comment that: in the short-term the market is like a voting machine (i.e. technicals dominate) vs in the long term like a weighing machine (i.e. fundamentals dominate).

As we will always have a lot of unknowns in investment, it certainly makes sense to me to try to stack up as many things in our favor as we can, including both technicals & fundamentals (and i agree especially CF), as well as both thoughtful human logic and also ML where appropriate. The investment game is hard enough and IMHO we need all the help we can get from ALL sources without dismissing any potentially useful ones.

Soon enough (if not already) our competitors will have (or maybe are already using) neuromorphic chipsets and/or quantum computers. This is not SF stuff. D-Wave Technologies of Canada already markets commercial quantum computers of up to 2,000 qubits (Ref. https://www.dwavesys.com/quantum-computing) and even if they are too expensive for most users, they are now available in "as a service on the cloud" form, to which Microsoft is also already catering with their Q# quantum computing language (no relation to Q as in Quantopian, to the best of my knowledge :-)).

To me, a logical approach (for those who can afford it) is or will probably soon be to compile an encyclopedic database of all price patterns on all time scales for all securities ever recorded and then use quantum computing to assign the highest probability to "what price response is likely to come next" based on whatever has just happened and in whatever context it occurred. Meanwhile .... i just plod on trying to write algos based on logic & "common sense" and the best that i can do with data analysis, but i am under no illusions that the days of the "little" investor / trader / algo writer (like most of us here) are probably numbered.

As to more mundane issues such as Zeno's comments about results of back testing an FCF ratio, i believe there are several different issues involved here with regard to FCF, which i also believe is a good determinant of stock quality. The following comments are based purely on the results of my own investigations, starting from reasonably trivial ones and becoming more significant.
1) There appears to be some difference between results based on FCF from Morningstar and from the newer fundamental data set. OK, no big surprise.
2) I am always very concerned when i run what appears to be the same problem twice and get different results. Sometimes this really seems to have happened. Maybe in some cases i just made mistakes that i didn't realize, but there definitely have been times when results were different from runs on different occasions. I think Q does occasionally make "minor adjustments" to various parts of the system and occasionally the impact is not-trivial in some specific situations.
3) The results we obtain from our algos are sometimes robust and sometimes very sensitive to even extremely minor input changes. In particular, on many occasions, i have seen major differences between results when a variable is omitted vs when it is included but i believed i had "commented it out" by multiplying its contribution by zero. For example, try this with dividend yields. The problem is largely related to missing data and the use of NaNs in the database and their impact on subsequent calculations, because of effective changes in which securities actually get used or not (see the sketch after this list).
4) In many of my algos i use a combination of different factors and their combined effects are sometimes non-linear, i.e. output is not just = a1*inputA + b1*inputB, but more like output = if NOT input C then a1*inputA + b1*inputB else if input C then a2*inputA + b2*inputB*inputC, etc. Sometimes in retrospect it is difficult to disentangle exactly what caused what. Whether the final results get called "specific", "common", "uncommon" or anything else seems to me to be partly semantics and partly an artifact of whatever other inputs were used, but i can definitely say without any doubt that i do observe CF, FCF and EPS all have clear impacts on results.
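To illustrate the NaN issue in point 3 with a minimal, self-contained pandas sketch (synthetic numbers, hypothetical column names; this is not Q's pipeline code): zero-weighting a factor does not neutralize its missing values, so any later dropna or ranking step sees a smaller universe than if the factor had been removed entirely.

```python
import numpy as np
import pandas as pd

# Synthetic factor table; 'div_yield' is missing for some securities.
df = pd.DataFrame({
    "momentum":  [0.5, -0.2, 1.1, 0.3],
    "div_yield": [0.03, np.nan, 0.02, np.nan],
}, index=["AAA", "BBB", "CCC", "DDD"])

# Attempt to "comment out" dividend yield by zero-weighting it:
combined_zeroed = df["momentum"] + 0.0 * df["div_yield"]
# NaN * 0 is still NaN, so BBB and DDD now have NaN combined scores.

# Truly omitting the factor keeps the full universe:
combined_omitted = df["momentum"]

print(combined_zeroed.dropna().index.tolist())   # ['AAA', 'CCC']
print(combined_omitted.dropna().index.tolist())  # ['AAA', 'BBB', 'CCC', 'DDD']
```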

@Tony

Quantum computers are currently not really useful yet, for plenty of reasons, but the largest one is the inability to store qubits long term (I think the state of the art is on the order of 1 s). This memory issue even prevents running complex algorithms, as the qubits are lost before the algorithm manages to finish the computation... Error management is definitely a second problem; I think the best working solutions reach something like 25% correct answers ;-). So quantum computers are the future for sure, but we still have a bit of development to do before they become really useful, especially at the current price of those machines... With 15 Mio you can build a quite decent computer :-) reaching something like 5 Pflops (including power and cooling costs). So a big cluster would be my investment choice ;-).
Then I think that the case for hardware designed for "extremely" fast response is debatable. For an execution pipeline or ultra-high-frequency trading you indeed need to react at the microsecond level, but this can be done very well using an FPGA at a very competitive cost (around 20K), and FPGAs are easy to "code" as you have "translators" for C or Fortran. This is especially true as usual execution algos are very simple. But what would be the reason to run, at ultra-high speed, a complex algo which depends on data that are produced very slowly (like fundamentals)? You will gain an edge only during the first few milliseconds after data production...

Hi @David,
My understanding is that D-Wave have already considered & addressed the first problem that you mention, and now they are not using Q-computing alone but actually linking it in combination with "conventional" high-power computers for running a range of "solutions" including optimization, ML etc, with specific practical applications already in use in various domains such as cybersecurity, bioinformatics & financial analysis.

Of course you are right about cost as a problem, but presumably there is some quantum-computing equivalent to Moore's Law so no doubt costs will come down.

Your ideas about "extremely fast response" are fair enough, but my understanding is that the main benefit of quantum computing is not only speed but also the generation of a wide spectrum of probabilistic answers instead of just one. I can't comment on your statement about "only" 25% of correct answers, but maybe 25% of a whole lot is actually not bad!

If you think about using quantum or other very sophisticated machines not so much for their speed advantage (as you say, only to gain an edge for the first few milliseconds) but rather to address a whole class of very complex problems, then my guess is that's probably where an even larger potential advantage exists. Personally, if i had access to all the data that i needed and to one of these machines now, i would be using it to gain deeper insights into understanding things that we don't understand well now. And that would probably very much include a better understanding of some "slow" things including fundamentals, where the issue is not speed but complexity. Cheers, best regards.

But using as many factors as possible is a good approach, especially if the factors are not "too much" correlated.

From personal trading and writing experience I can say that in my view this is a minefield. I once wrote an article called "Tuning up the Turtle". You can get it free now, it seems through SCRIBD (thanks for the royalties guys! )

Actually no, I think you have to subscribe: I have made it available here: Turtle

It was back in the days when I was still trading commodities in some size using a trend following methodology and thought the method would last for ever and always come back even if it went through a bad period.

I was relying on non-correlation between 100 or so futures. Which is great, until it is not. Even where there seems to be some fundamental and convincing reason for non-correlation (the ten-year US Treasury and the stock market) it will let you down badly sometimes. Positive correlation between bonds and stocks has historically reached 90%.
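A quick way to see how fragile such "non-correlation" can be is to look at a rolling correlation rather than a single full-sample number. This is a toy sketch on synthetic return series with a crude built-in regime shift, not real bond/equity data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1500

# Synthetic daily returns for "stocks" and "bonds": independent in the first
# regime, strongly co-moving in the second.
stocks = rng.normal(0, 0.01, n)
bonds = np.where(np.arange(n) < 1000,
                 rng.normal(0, 0.005, n),                 # regime 1: independent
                 0.8 * stocks + rng.normal(0, 0.003, n))  # regime 2: co-moving

rets = pd.DataFrame({"stocks": stocks, "bonds": bonds})

full_sample = rets["stocks"].corr(rets["bonds"])
rolling = rets["stocks"].rolling(120).corr(rets["bonds"])

print(f"full-sample correlation: {full_sample:.2f}")            # modest-looking average
print(f"max 120-day rolling correlation: {rolling.max():.2f}")  # near 1 inside regime 2
```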

It is amusing to re-run that Tuned up Turtle on subsequent data.

In theory non-correlated assets work a treat. In practice you need to be very sure of the reason for that lack of correlation. This is very, very far from clear when you use multiple alphas in the whole Quantopian shebang. And even if you do manage to isolate the non-correlations and the reason for the relationship, you will occasionally be in for a big shock.

Perhaps sophisticated machine learning or some distant progeny of today's attempts at embryonic AI will manage the problem? Who knows. I can only relate my past experience and relate it to the current attempts on Quantopian. I do not actually mean to be negative (although I am aware it comes across that way).

I just have many, many years of practical experience in seeing these things unravel.

Hi Tony and David,

I'm enjoying your discussion on quantum computing. As a regular at The New York Trading Show, I have seen a few quantum computer hardware vendors, such as D-Wave, in displays and presentations. While speed is the main draw, capacity is the breakthrough. In short, we now have a major leap in computing power in terms of processing speed, with the capacity to handle more complex problems simultaneously, mostly in optimization schemes and routines. HFT firms are currently the main users in the financial trading field, mainly for the speed aspect, as David is alluding to. I very much agree with Tony that the larger potential advantage and opportunity exists in its ability to tackle more complex problems with more information at breathtaking speeds. The capacity to hold, process, analyze and then predict with more data as information availability increases is where the advantage lies. We will now have more access to a digital footprint of historical facts and data that describes the evolution of the capital markets as it relates to financial trading. Fundamentals, prices at different frequencies, news and sentiment data, and other alternative data form the input variables to algorithmic trading algos that process them with the goal of achieving their desired objective; that is the basic framework. There are also, concurrently, advances in proposed solutions to these complex problems in traditional statistics and, recently, in AI/ML.

I ask myself: with all this data, how does one separate the signal from all that noise with a significant amount of consistency? Personally, I have taken the theoretical view of Chaos Theory to form the framework of my hypothesis. Underneath a seemingly random phenomenon is an underlying, somewhat deterministic system that is predictable. Emphasis on somewhat deterministic, because over time it is not always the case: randomness occurs with increased uncertainty, the inflection point of regime changes. So there will be, over time, packets of predictability and packets of randomness, both of which can be exploited for profits. And this is good enough for me because there are a million ways to skin this cat, the endgame being to generate consistent profits at a desired risk level. Cheers!

Hi @James, great comments.
Thanks for your practical insights about what is happening at trading shows in NY. (Here in SE Asia, the closest we get to that is in Singapore, but nothing lately inspired me enough to leave my own little island and go there since the last QuantCon in 2017 :-)) I'm not surprised that currently the main users of quantum computing are HFT firms chasing speed, but i think that will change.

This is a new tool that promises to help with gaining fundamental insights that were impossible or very difficult in the past. For example, even though Altman's Z-score may be old-hat now, it was ground-breaking stuff back in the late 1960's. Now fast-forward and imagine the modern-day equivalent: turning quantum computing power onto, for example, a very fundamental topic of corporate financial strength & quality, as the best guides to survivability and likely growth and profitability of companies as investments.

@Zeno implies a good point that practical street-smarts in trading are a whole different skill set, and certainly it is clear that not everyone posting in to Q actually has those sort of practical skills. The Turtles concept worked, and then didn't work so well, and then people wrote "Turtle Soup" systems to fade the Turtles, and now often neither work, although SOMETIMES still ... And of course correlations do come & go and nowadays "everyone knows" that when the going really gets tough all the supposedly "uncorrelateds" go to 1 as the participants all race for the exits at the same time.

I like Zeno's comment: "I just have many, many years of practical experience in seeing these things unravel", and i believe that sort of practical experience is not to be dismissed lightly. For my own part, EVERY system that i design & test, for any financial asset whatever, even now more than 30 years later, still always gets tested on how well it would have stood up in the Australian market on 19th October 1987 .... when i happened to be working as an engineer in the building next to the Adelaide Stock Exchange and could hear all the shouting coming from next door!! :-)) Of course it's not that i expect that event to be repeated; that is not the point at all, but rather that, as someone (was it Churchill or Santayana or Comrade Lenin) said: "Those who do not learn from history are doomed to repeat it". So much for war stories about 1987, but now imagine ALL of that sort of cumulative experience from participants in all markets over all timeframes being reviewed continually as part of ongoing trading decisions. The quantum computing / ML combination is already available (for those few who can afford it) and is presumably already being applied to do exactly this. Or if it isn't, then no doubt it very soon will be. Again, in partial answer to Zeno: Are some of the ideas written about here in Q a long way behind that sharp leading edge? Oh yes, I'm sure they are. The point i want to make is that, whether anyone likes it or not, the sort of high-powered quantum computing / ML / AI that will give users the equivalent of decades, no not decades but lifetimes, of "experience" is coming, and may already be here in the hands of a few.

That is just conjecture on my part, but what i do know, because i have researched it very thoroughly, is that trading profits become continually more elusive over time, at least for anyone who has to pay "normal" transaction costs. Systems (like Turtles) that used to be profitable are no longer. If our experience, knowledge and understanding grow over time, why then does trading become more difficult & not easier? Presumably because "someone" is already using all the advanced tools that some people here just regard as "SF type fictions".

I like James' comment that "underneath a seemingly random phenomenon is an underlying somewhat deterministic system that is predictable" [at least somewhat, and some of the time anyway], but those "packets of [semi-]predictability" are probably enough, and in fact may be all that we will get in future. Nothing forces us to trade all the time, and i think that being very selective in the "battles" that we choose is probably the best way to continue winning.
Best wishes for happy & successful trading to those who do, and happy & successful theorizing to those who don't ;-))

@Zenothestoic

Taking a Bayesian approach:
I think it is a common mistake to put too much information into the priors. While informative priors are often necessary to reduce the volume of the parameter space, it is always dangerous to put too much information into them. And I think that your comment about your algo shows exactly this. You built a model based on an assumption which was no longer valid... Maybe if the model had been allowed to relax this assumption, the issue would not have appeared. I think that one of the big difficulties with finance (compared to physics) is that the correctness of an assumption about the market depends on time, meaning that a model has to be quite adaptive and assumptions tested very regularly.

I agree that if one "blindly" mixes hundreds of factors, one is doing something very wrong :-). The question is how to extract features on the fly, then construct a combination of those "input factors" which is dynamic and allows the model to best fit the current "state of the market".
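One very simple (and purely hypothetical) way to make such a combination dynamic is to re-weight standardized factors by their trailing cross-sectional rank correlation with forward returns. The sketch below uses synthetic data and made-up factor names; it is only an illustration of letting the weights drift with the "state of the market", not a recommended weighting scheme.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_days, n_assets = 500, 50
dates = pd.date_range("2015-01-01", periods=n_days, freq="B")

# Synthetic standardized factor scores and forward returns (hypothetical names).
value = pd.DataFrame(rng.normal(size=(n_days, n_assets)), index=dates)
momentum = pd.DataFrame(rng.normal(size=(n_days, n_assets)), index=dates)
fwd_ret = 0.1 * momentum + rng.normal(0, 1, (n_days, n_assets))  # only momentum "works" here

def daily_ic(factor, returns):
    # Cross-sectional rank correlation between the factor and forward returns, per day.
    return factor.rank(axis=1).corrwith(returns.rank(axis=1), axis=1)

# Trailing (exponentially weighted) information coefficient per factor.
ic = pd.DataFrame({"value": daily_ic(value, fwd_ret),
                   "momentum": daily_ic(momentum, fwd_ret)})
weights = ic.ewm(span=60).mean().clip(lower=0)          # ignore factors with negative trailing IC
weights = weights.div(weights.sum(axis=1), axis=0).fillna(0.5)

# Dynamic combined score: the weights change through time.
combined = value.mul(weights["value"], axis=0) + momentum.mul(weights["momentum"], axis=0)
print(weights.tail())
```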

Then a question about "computational power" and big data. I strictly don't believe that quantum computers are at a stage to compete with a decent cluster (OK, by decent I mean one of the top 20, but what is 50 Mio every 3 years for very large companies?). Big data is a very relative notion: if you ask some people they will speak about datasets on the order of GB, others TB, others PB... What is a "decent" workflow of data production? 1 MB/s, 1 GB/s, 1 TB/s? I think the current limit with a "decent cluster" is on the order of TB/s. My simulations of cosmological structure formation produce on the order of 30 TB of data every 15 seconds... A good cluster should be able to ingest on the order of 1 TB per second from the web... So yes, definitely, Tony: "The point i want to make is that, whether anyone likes it or not, the sort of high-powered quantum computing / ML / AI that will give users the equivalent of decades, no not decades but lifetimes, of "experience" is coming, and may already be here in the hands of a few." I don't just think it may already be here, it is for sure something already used... I think NNs are not yet much used, as they are too unpredictable (look at the well-known wolf/snow issue in image classification, for example).

It is indeed frustrating not to have access to large computational power, but I think that quite a lot can be done with the Q systems! We still miss some essential features (the possibility to transfer files from research to the algo is one of the real burdens!). Having a benchmark that needs 1 day of computation to be run on 2 months...

Hi @David,
Actually Neural Networks were widely used in financial market trading efforts about 10 or 15 years ago, but those older generation NNs failed to live up to expectations mainly because they proved to be, as many people said: "very good at remembering but not very good at generalizing". No doubt there is a resurgence of interest now with the modern generation of NNs & other so-called "Deep Learning" methods.

Although i am personally very interested in keeping an eye on progress in ML & high-powered computing in its various forms for all sorts of problem-solving, honestly i think if anyone's aim is mainly to do well in the Quantopian competition and to strive for an allocation, then rather than worrying too much about ML & computer power, it is probably far more important to give careful consideration to Fundamental & other data, to think about what they really mean, and to consider the limitations of each of the individual financial statement data items. The implications hidden in there do require careful thought & logic, but do not necessarily require any high-powered computing at all, beyond what Q provides us, and maybe also dusting off some old textbooks on corporate accounting ;-))

@Narinaga,
Just to be clear, are you talking about the content of Section 2.3 /Section 2.3.2(Bars/Information-Driven Bars, pgs 25-32) of LopezDePrado's book?
alan

@Alan C
Yes, exactly, the section on Information-driven bars.

@thomas
What he's proposing is not cross-sectional quant investing? Which leads to the question: is cross-sectional quant investing outdated and no longer alpha-producing? I mean, he says in his introduction that econometrics has been around for hundreds of years and is not really worthwhile. Firms like Ren Tech, do you really think they are doing cross-sectional quant investing?

There are a large number of firms called Ren Tech, one of which is a technology consulting company, another is in rapid product development, another is a computer service company, another an automotive panel beater, another in equipment rental, another a boilermaker, etc, etc and yes i certainly do agree, definitely none of those seem to have much interest at all in cross sectional quant investing.

Oh, NOW i get it, you mean Renaissance Technologies!!! ;-)))
Please don't mind me, its probably just an after-effect of reading Guy's post(s) for too long ;-)))

@Ting: Yes, that's not cross-sectional but more of an individual stock approach. I wouldn't say this suggests anything about factor-based investing being dead. I think of them more as implementations. At the end of the day, you need predictive signals. These can be implemented in a factor-based way or like LdP does. Specific signals might be easier to trade in a factor-based vs LdP way. I would say event-based things are probably easier with LdP's approach.