Machine learning is a buzzword often thrown about when discussing the future of finance and the world. You may have heard of neural networks solving problems in facial recognition, language processing, and even financial markets, yet without much explanation. It is easy to view this field as a black box, a magic machine that somehow produces solutions, but nobody knows why it works. It is true that machine learning techniques (neural networks in particular) pick up on obscure and hard-to-explain features; however, there is more room for research, customization, and analysis than may first appear.

Today we'll be discussing at a high level the various factors to be considered when researching investing through the lens of machine learning. The contents of this notebook and further discussions on this topic are heavily inspired by Marcos Lopez de Prado's book Advances in Financial Machine Learning. If you would like to explore his research further, his website is available here.

## 1. Data Structures

Garbage in -> garbage out. This is the mantra of computer science, and modeling doubly so. A model is only as good as the data it accepts, so it is vital that researchers understand the nature of their data. This is the foundation of an algorithm, and it will succeed or fail on the merits of its data.

In general, unstructured and unique data are more useful than pre-packaged data from a vendor, as they haven't been picked clean of alpha by other asset managers. This is not offered on the Quantopian platform, but you can upload data for your own use with the Self-Serve feature. If you do not have unique data, the breadth of offerings on the Quantopian platform gives us plenty to work with. Data can vary in what it describes (fundamentals, price) and in frequency (monthly, minutely, tick, etc). Listed below are the chief types of data, in order of increasing diversity:

1. Fundamental Data - Company financials, typically published quarterly.
2. Market Data - All trading activity in a trading exchange or venue.
3. Analytics - Derivative data, analysis of various factors (including all other data types). Purchased from vendor.
4. Alternative Data - Primary information, not made from other sources. Satellite imagery, oil tanker movements, weather, etc.

The data structures used to contain trading information are often referred to as bars. These can vary greatly in how they are constructed, though there are shared characteristics. Common variables are open, high, low, and close prices, the date/time of the trade, and an indexing variable. A common index is time; daily bars are a structure that each represent one trading day, minute bars represent one minute of trading, etc. Time is held constant. Trading volume is another option, where each bar is indexed with a consistent number of shares traded (say, 200K volume bars). A third option is value traded, where the index is dollars (shares traded * price per share).

• Time Bars:
Bars indexed by time intervals, minutely, daily, etc. OHLCV (Open, High, Low, Close, Volume) is standard.

• Tick Bars:
Bars indexed by orders, with each set # of orders (usually just 1) creating a distinct bar. Order price, size, and the exchange the order was executed on are common. Unavailable on Q platform.

• Volume Bars:
Bars indexed by total volume, with each set # of shares traded creating a distinct bar. We can transform minute bars into an approximation for volume bars, but ideally we would use tick bars to maintain information for all parameters across bars.

• Dollar Bars:
Similar to volume bars, except measuring the total value (in $) that changed hands. An example would be $100,000 bars, with each bar containing that dollar value as precisely as possible.
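
To make the volume and dollar bar definitions above concrete, here is a minimal sketch that approximates them from minute data. The `build_bars` helper, the `(close_price, shares_traded)` row layout, and the sample numbers are all assumptions made for illustration; they are not a Quantopian API. Each bar closes at the first minute where the running total reaches the threshold, so bars overshoot the target slightly.

```python
from typing import Dict, List, Tuple

# Each input row is a hypothetical minute bar: (close_price, shares_traded).
MinuteBar = Tuple[float, float]


def build_bars(minute_bars: List[MinuteBar], threshold: float,
               by_dollars: bool = False) -> List[Dict[str, float]]:
    """Group consecutive minute bars until the traded amount reaches `threshold`.

    With by_dollars=False the amount is shares (volume bars); with
    by_dollars=True it is price * shares (dollar bars).
    """
    bars = []
    current: List[MinuteBar] = []
    amount = 0.0
    for price, volume in minute_bars:
        current.append((price, volume))
        amount += price * volume if by_dollars else volume
        if amount >= threshold:
            prices = [p for p, _ in current]
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(v for _, v in current),
            })
            current, amount = [], 0.0
    # A trailing partial bar is dropped because it never reached the threshold.
    return bars


minutes = [(10.0, 300), (10.2, 500), (10.1, 400), (9.9, 900), (10.0, 100)]
volume_bars = build_bars(minutes, threshold=1000)
dollar_bars = build_bars(minutes, threshold=10_000, by_dollars=True)
```

With tick data the same loop could close each bar much nearer the target; with minute data the overshoot grows as the threshold shrinks toward a single minute's volume.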

Alternative data structures exhibit statistical properties to different degrees, with volume and dollar bars typically expressing greater stationarity of returns than time and tick bars. These properties play a big role when considering which bars to use in a machine learning framework, as we will discuss next.

## 2. Statistical Properties and Stationarity Transformations

Much of the literature on machine learning, and statistical inference in general, makes the assumption that observations are independent and identically distributed (iid). Independent means that the occurrence of one observation has no effect on any other, and identical means that our variables are derived from the same probability distribution (e.g. have the same variance, mean, skew, etc).

Unfortunately, these properties are rarely found in financial time series data. Consider pricing; today's price is highly dependent on yesterday's, the mean price over some time interval is constantly changing, and the volatility of prices can change rapidly when important information is released. Returns, on the other hand, remove most of these relationships. However, the variance (i.e. volatility) of returns still changes over time as the market goes through different volatility regimes, so returns are not identically distributed either.

The different bar types (and additional data structures) exhibit varying statistical properties. This is important to consider when applying machine learning or other statistical inference techniques, as they assume that inputs are iid sampled (or stationary, in time series). Using dollar bars in lieu of time bars can make the difference between a weak and overfit algorithm versus a consistently profitable one. This is just one step in the search for stationarity, however, and we must have other tools in our arsenal.

The note above about independence of price series vs return series illuminates one concept: the tradeoff between memory and stationarity. The latter is a necessary attribute for inference, but provides no value without the former. In the extreme, consider transforming any series into strictly 1's; you've successfully attained stationarity, but at the cost of all information contained in the original series. A useful intuition is to consider degrees of differentiation from an original series, where greater degrees increase stationarity and lower memory. Returns are 1-step differentiated, the example of all 1's is fully differentiated, the price series has zero differentiation. Lopez de Prado proposes an alternative method, named fractional differentiation, that aims to find the optimal balance between our opposing factors; the minimum non-integer differentiation necessary to achieve stationarity. This retains the maximum amount of information in our data. For a thorough description read chapter 5 of de Prado's Advances in Financial Machine Learning. With this implemented, and our data sufficiently prepared, we are almost ready to whip out the machine learning algorithms. Finally, we have to label our data.
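
As a rough illustration of the idea above, the sketch below computes the weights of a fractional difference of order `d` and applies them over a fixed window. The function names and the sample prices are assumptions for illustration, not code from the book. Choosing the minimum `d` that achieves stationarity would additionally need a stationarity test (for example an augmented Dickey-Fuller test), which is not shown here.

```python
from typing import List


def fracdiff_weights(d: float, size: int) -> List[float]:
    """Weights of (1 - B)^d, where B is the backshift (lag) operator.

    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k.
    """
    weights = [1.0]
    for k in range(1, size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def fracdiff(series: List[float], d: float, window: int) -> List[float]:
    """Apply a fixed-width window of weights; the first window-1 points are skipped."""
    w = fracdiff_weights(d, window)
    out = []
    for t in range(window - 1, len(series)):
        # w[0] multiplies the current value, w[k] the value k steps back.
        out.append(sum(w[k] * series[t - k] for k in range(window)))
    return out


prices = [100.0, 101.0, 103.0, 102.0, 105.0]
first_diff = fracdiff(prices, d=1.0, window=3)  # ordinary one-step differences
half_diff = fracdiff(prices, d=0.5, window=3)   # keeps some memory of the level
```

With `d=1.0` the weights collapse to `[1, -1, 0]`, which is the usual one-step difference. With `d=0.5` they are `[1, -0.5, -0.125]`, so older prices still contribute, which is where the retained memory comes from.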

## 3. Labeling for Learning

### 3.1 - Triple-Barrier Method

Most machine learning classifiers require labeled data (those that don't are powerful but difficult to engineer, and come with a high risk of overfitting). We intend to predict the future performance of a security, so it seems fair to label each observation based on its ensuing price performance. It is tempting to just use whether the returns were positive or negative over a fixed time window. This method, however, leads to many labels referring to non-meaningful, small price changes. Moreover, real trading is often done with limit orders to take profits or stop losses. Marcos Lopez de Prado proposed a labeling strategy that he calls the Triple-Barrier Method, which combines our labeling desires with real market behaviour. When a trade is made, investors may choose to pre-emptively set orders to execute at certain prices. If the current price of security $$s$$ is \$5.00, and we want to control our risk, we might set a stop-loss order at \$4.50. If we want to take profits before they vanish, we may set a profit-taking order at \$5.50. These orders are set to automatically close the position when the price reaches either limit. The stop-loss and profit-taking orders represent the two horizontal barriers of the Triple-Barrier Method, while the third, vertical, barrier is simply time-based: if a trade is stalling, you may want to close it out within $$t$$ days, regardless of performance.

The method assigns a label to each purchase date and security, depending on which barrier is hit first. If the top barrier is reached first, the label is 1 because a profit was made. If instead the bottom barrier is hit, losses were locked in and the label is -1. If the trade times out before either horizontal barrier is touched, so that the vertical barrier is hit first, the label falls in the range (-1, 1), scaled by how close the final price was to a barrier (alternatively, if you only want to label sufficiently large price changes, 0 can be assigned here).
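
The labeling rule described above can be sketched for a single long trade as follows. This is a simplified illustration under assumed names (`triple_barrier_label` and its parameters are hypothetical); it uses fixed fractional barriers, whereas a fuller treatment would typically scale the barriers by a volatility estimate.

```python
from typing import List


def triple_barrier_label(prices: List[float], entry: int, profit_take: float,
                         stop_loss: float, max_holding: int) -> float:
    """Label one long trade opened at prices[entry].

    profit_take and stop_loss are fractional moves (0.10 means 10%).
    Returns 1.0 or -1.0 if a horizontal barrier is hit first; otherwise a
    value in (-1, 1) scaled by how close the final price came to a barrier.
    """
    entry_price = prices[entry]
    last = min(entry + max_holding, len(prices) - 1)
    ret = 0.0
    for t in range(entry + 1, last + 1):
        ret = prices[t] / entry_price - 1.0
        if ret >= profit_take:
            return 1.0   # upper barrier touched first
        if ret <= -stop_loss:
            return -1.0  # lower barrier touched first
    # Vertical barrier: the trade timed out between the two horizontal barriers.
    return ret / profit_take if ret >= 0 else ret / stop_loss


# Barriers at +/-10% of a 5.00 entry, i.e. 5.50 and 4.50 as in the example above.
winner = triple_barrier_label([5.00, 5.10, 5.30, 5.55, 5.20], 0, 0.10, 0.10, 4)
loser = triple_barrier_label([5.00, 4.90, 4.40, 5.60], 0, 0.10, 0.10, 3)
stalled = triple_barrier_label([5.00, 5.05, 5.10, 5.25], 0, 0.10, 0.10, 3)
```

Here `winner` is 1.0, `loser` is -1.0 even though the price later recovers, and `stalled` comes out near 0.5 because the trade timed out halfway to the profit-taking barrier.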

### 3.2 - Meta-Labeling

Once you have a model trained for setting the side of a trade (labeled by the Triple-Barrier Method), you can train a secondary model to set the size of a trade. This accepts the primary model as input. Learning the direction and size of a trade simultaneously is much more difficult than learning each separately, plus this approach allows modularity (the same sizing model may work for the long/short versions of a trade). We must again label our data, via a method de Prado calls Meta-Labeling. This strategy assigns labels to trades of either 0 or 1 (1 if it takes the trade, 0 if not) with a probability attributed to them. This probability is used to calculate the size of the trade.
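
The post does not spell out how the probability becomes a size, so the sketch below uses one mapping as an assumption: convert the probability into a z-statistic against the coin-flip value of 0.5, then pass it through the standard normal CDF. The `bet_size` function and its arguments are hypothetical names chosen for illustration.

```python
import math


def bet_size(prob: float, side: int) -> float:
    """Map a meta-model probability and a primary-model side to a signed size in [-1, 1].

    prob: estimated probability that taking the trade is correct (0 < prob < 1).
    side: +1 for long, -1 for short, as chosen by the primary model.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must be strictly between 0 and 1")
    # z-statistic of prob against the no-information value 0.5.
    z = (prob - 0.5) / math.sqrt(prob * (1.0 - prob))
    # Standard normal CDF written with the error function.
    normal_cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return side * (2.0 * normal_cdf - 1.0)


confident_long = bet_size(0.9, side=1)
hesitant_long = bet_size(0.6, side=1)
confident_short = bet_size(0.9, side=-1)
```

A probability of exactly 0.5 maps to a size of zero, and the size grows toward 1 as the probability approaches 1. Probabilities below 0.5 flip the sign under this mapping, so a researcher may prefer to clip them to zero and simply skip the trade.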

Useful considerations for binary classification tests (like the Triple-Barrier Method and Meta-Labeling) are sensitivity and specificity. There exists a trade-off between Type 1 (false positive) and Type 2 (false negative) errors, as well as between true positives and true negatives. The F1-score measures the efficiency of a classifier as the harmonic mean of precision (the ratio of TP to TP+FP) and recall (the ratio of TP to TP+FN). Meta-labeling helps maximize F1-scores. We first build a model with high recall, regardless of precision (it learns direction, but with many superfluous transactions). Then we correct for low precision by applying meta-labeling to the predictions of the primary model. This filters out false positives and scales our true positives by their calculated accuracy.
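
A small worked example may help with the precision and recall trade-off. The helper below and the two prediction lists are made up for illustration: the primary model flags every observation as a trade (perfect recall, poor precision), and a hypothetical meta-model then filters some of those out.

```python
from typing import List, Tuple


def precision_recall_f1(actual: List[int], predicted: List[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary labels where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


actual = [1, 0, 1, 0, 0, 1, 0, 1]        # 1 = the trade was worth taking
primary_only = [1, 1, 1, 1, 1, 1, 1, 1]  # high recall, low precision
after_meta = [1, 0, 1, 0, 1, 1, 0, 0]    # hypothetical meta-model filter applied

before = precision_recall_f1(actual, primary_only)
after = precision_recall_f1(actual, after_meta)
```

In this toy case the filter gives up one true positive but removes three false positives, so precision rises from 0.5 to 0.75, recall falls from 1.0 to 0.75, and F1 improves from about 0.67 to 0.75.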

## 4. Learning Algorithms for Direction and Bet Size

Now that we've discussed the considerations when structuring financial data, we are finally ready to discuss how programs actually learn to trade!

A machine learning archetype known as ensemble learning has been shown time and again to be robust and efficient. These algorithms make use of many weak learners (e.g. decision trees) combined to create a stronger signal. Examples of this include random forests, other bagged (bootstrap-aggregated) classifiers, and boosted classifiers. These produce a feature space that can be pruned to decrease the prevalence of overfitting. This discussion assumes you are at some level familiar with machine learning methods (particularly ensemble learners). If you are not, scikit-learn's tutorials are a fabulous starting point.

### 4.1 - Bootstrap Aggregation (Bagged Classifiers)

This is a popular ensemble learning method that aggregates many individual learners that are prone to overfitting if used in isolation (decision trees are common), into a lower variance 'bag' of learners. The rough recipe is as follows:

1. Generate N training datasets through random sampling with replacement.
2. Fit N estimators, one on each training set. Fit independently from each other, trained in parallel.
3. Take the simple average of the forecasts of each of the N models, and voilà! You have your ensemble forecast. (For a classification problem with discrete labels, use majority-rule voting rather than a simple average. If the estimators output prediction probabilities, the ensemble forecast is the mean of those probabilities.)
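
The three steps above can be sketched in plain Python. The weak learner here is a one-feature decision stump, chosen only to keep the example short; the function names and the toy dataset are assumptions for illustration. In practice one would more likely reach for a library implementation.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[float, int]      # (feature value, class label in {-1, 1})
Model = Callable[[float], int]


def fit_stump(data: List[Sample]) -> Model:
    """Weak learner: pick the threshold and orientation that classify the most samples."""
    best_threshold, best_sign, best_correct = 0.0, 1, -1
    for threshold, _ in data:
        for sign in (1, -1):
            correct = sum(1 for x, y in data
                          if (sign if x >= threshold else -sign) == y)
            if correct > best_correct:
                best_threshold, best_sign, best_correct = threshold, sign, correct
    return lambda x: best_sign if x >= best_threshold else -best_sign


def fit_bagged(data: List[Sample], n_estimators: int, seed: int = 0) -> List[Model]:
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        # Step 1: bootstrap sample, same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        # Step 2: fit one estimator per sample, independently of the others.
        models.append(fit_stump(sample))
    return models


def predict_bagged(models: List[Model], x: float) -> int:
    # Step 3: majority vote across the estimators (discrete labels).
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


data = [(0.1, -1), (0.2, -1), (0.3, -1), (0.4, -1),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = fit_bagged(data, n_estimators=25)
high = predict_bagged(models, 0.95)
low = predict_bagged(models, 0.05)
```

An odd number of estimators avoids tied votes for a two-class problem. Because each stump is fit independently, the loop in `fit_bagged` could be run in parallel.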

The chief advantage of bagging is reducing variance to address overfitting. The variance is a function of the number $$N$$ of bagged classifiers, the average variance of a single estimator's prediction, and the average correlation among their forecasts.
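
To make that dependence explicit, write $$\varphi_i$$ for the forecast of estimator $$i$$, $$\bar{\sigma}^2$$ for the average variance of a single estimator's forecast, and $$\bar{\rho}$$ for the average correlation between the forecasts of two different estimators (these symbols are introduced here for illustration). Under those definitions the variance of the bagged forecast is:

$$
\mathrm{V}\left[\frac{1}{N}\sum_{i=1}^{N}\varphi_i\right]
= \frac{1}{N^2}\left(N\bar{\sigma}^2 + N(N-1)\,\bar{\sigma}^2\bar{\rho}\right)
= \bar{\sigma}^2\left(\bar{\rho} + \frac{1-\bar{\rho}}{N}\right)
$$

As $$N$$ grows the second term vanishes, so the variance approaches $$\bar{\sigma}^2\bar{\rho}$$. Adding estimators therefore helps only to the extent that their forecasts are not already highly correlated; when $$\bar{\rho}$$ is close to 1, bagging reduces variance very little.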

Lopez de Prado also presents sequential bootstrapping, a new bagging method that produces samples with higher degrees of independence from each other (in the hope of approaching an IID dataset). This further reduces the variance of the bagged classifiers.

### 4.2 - Random Forests

Random forests are designed to reduce the variance and overfitting potential of decision trees. They are an implementation of bagging (the 'forest' being the aggregation of many trees), with an extra layer of randomness: when optimizing each node split, only a random subsample (without replacement) of the attributes will be evaluated, to further decorrelate the estimators.

### 4.3 Feature Importance

Feature importance analysis allows us to prune the features in our noisy financial time-series dataset that do not contribute to performance. Once features are discovered, we can experiment on them. Are they always important, or only in some specific environments? What triggers a change in importance over time? Can those regime switches be predicted? Are those important features also relevant to other financial instruments? Are they relevant to other asset classes? What are the most relevant features across all financial instruments? What is the subset of features with the highest rank correlation across the entire investment universe? Pruning our feature space is an important part of optimizing our models for performance and risk of overfitting, like every other consideration above. In general, there is a lot to explore here, and it is outside of the scope of this already lengthy post.

### 4.4 Cross-Validation

Once our model has picked up on some features, we want to assess how it performs. Cross-validation is the standard technique for performing this analysis, but it does require some healthy finagling for application in finance, as is the theme today. Cross-validation (CV) is a technique that splits observations drawn from an IID process into a training set and a testing set, the latter of which is never used to train the algorithm (for fairly obvious reasons), only to evaluate it. One of the most popular methods is $$k$$-fold, where the data is split into $$k$$ equally-sized bins (or folds), one of which is used to test the results of training on the remaining $$k-1$$ bins. This process is repeated $$k$$ times, such that each bin is used as the testing bin exactly once.

This method, however, has problems in finance, as data are not IID. Errors can also result from multi-testing and selection bias due to multiple sets being used for both training and testing. Information is leaked between bins because our observations are correlated (if $$X_{t+1}$$ depends on $$X_t$$ because they are serially correlated, and they are binned separately, the bin containing the latter value will contain information from the former). This enhances the perceived performance of a feature that describes $$X_t$$ and $$X_{t+1}$$, whether it is valuable or irrelevant. This leads to false discoveries and overstating returns. Lopez de Prado presents solutions to these problems, that also allow more learning to be done on the same amount of data. One in particular, that he calls Purged $$K$$-Fold CV, simply purges any values in the training set whose labels overlap in time with the testing set. Another deletion strategy he calls embargo is used to eliminate from the training data any signals that were produced during the testing set (essentially just deleting some # of bars directly following the testing data). He also provides a framework to find the optimal $$k$$ value assuming little/no leakage between training and testing data.
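
The purging and embargo ideas above can be sketched as an index generator. This is a simplified illustration, not the book's implementation: `purged_kfold` and `label_spans` are hypothetical names, each observation is assumed to carry the span of bar positions over which its label was determined (for example, from trade entry to the first barrier touch), and the embargo is expressed as a number of bars after the test set.

```python
from typing import Iterator, List, Tuple


def purged_kfold(label_spans: List[Tuple[int, int]], k: int,
                 embargo: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous test folds.

    label_spans[i] = (start, end) bar positions over which observation i's
    label is determined. Observations are assumed sorted by start.
    """
    n = len(label_spans)
    fold_size, remainder = divmod(n, k)
    begin = 0
    for fold in range(k):
        stop = begin + fold_size + (1 if fold < remainder else 0)
        test = list(range(begin, stop))
        test_start = min(label_spans[i][0] for i in test)
        test_end = max(label_spans[i][1] for i in test)
        train = []
        for i in range(n):
            if begin <= i < stop:
                continue
            start, end = label_spans[i]
            # Purge: drop training labels that overlap the test period in time.
            overlaps = start <= test_end and end >= test_start
            # Embargo: drop training labels that start just after the test period.
            embargoed = test_end < start <= test_end + embargo
            if not (overlaps or embargoed):
                train.append(i)
        yield train, test
        begin = stop


# Ten observations whose labels each look two bars ahead.
spans = [(t, t + 2) for t in range(10)]
folds = list(purged_kfold(spans, k=5, embargo=1))
```

For the second fold the test set is observations 2 and 3, which span bars 2 to 5. Observations 0, 1, 4 and 5 are purged because their label spans overlap that period, observation 6 is removed by the one-bar embargo, and only 7, 8 and 9 remain for training.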

So, finally, we have everything we need. Prune our data, label it for direction, train an ensemble learning algorithm, label for size, train another algorithm on that, and combine. Integrate the results of this with the Quantopian IDE and backtester and send it to the contest! This may seem like a lot, and it is, but much of the programming is modular. With the groundwork laid, you just might find yourself churning out more impressive strategies than ever before.

## 5. Further Learning

This discussion drew heavily from Advances in Financial Machine Learning by Marcos Lopez de Prado. This is an excellent resource if you are already familiar at a high level with investment management, machine learning, and data science. If you are not, the Quantopian lecture series is a great place to start, especially combined with Kaggle competitions and the scikit-learn machine learning tutorials. Don't be afraid to just play around, it can be fun!

Once you're comfortable with all of that, go through Lopez de Prado's book and work on implementing these methods with a data structure you created, and plugging the result into a few different specially-calibrated machine learning algorithms. If you have a predictive model, test it out-of-sample for a good long while (though your methodology should have prevented overfitting) and see if it sticks. If it does, congratulations! Pit your strategy against others in the contest and we might give you a call.

Well, that was a lot! If you are new to machine learning, whether in general or in finance, or just want to learn more, please take advantage of the resources discussed here. We hope to provide significantly more in-depth resources for these topics in the future. This represents a good start to producing a truly thought-out and well-researched strategy, and we hope you make some amazing things with it. Good luck!

28 responses

Nice but unless you whitelist higher level open sourced AI/ML libraries like Keras, PyTorch, Tensorflow or Theano then it's very limiting. One can do more sophisticated AI/ML algos offline and upload results to Self Serve Data but they'll most likely be limited to OHLC data since Q does not allow exporting non-OHLC data such as Fundamentals. The other issue is compute time limitations. Otherwise, this is a good intro.

Great post @Anthony, thank you!

Would you consider doing a webinar on (Intro to) ML for Finance?

For some context: Anthony is an intern in the Quantopian research team and is doing amazing work exploring this workflow. We read Lopez de Prado's book at Quantopian with great interest. The workflow he proposes in the book is a bit different than the cross-sectional ML workflow I outlined in ML on Quantopian Part I, Part II, and Part III, where features are computed at the same time for all stocks and then ranked relative to each other. Here, LdP proposes to look at each stock individually, starting with the suggestion to not use time-bars and not have the labels be computed over the same time-horizon every time. I think this does make sense as opportunities could arise on different time-scales for different stocks at different times.

In this work, we are curious how you can do this on Q. Anthony has a couple of other posts lined up that actually explore the ideas highlighted here in more detail with code, so stay tuned for that.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

I have only just started the book. I have thought about the activity based bars concept. Unfortunately, I haven't figured out a vectorized implementation for building the bars. A Quantcon attendee wrote a nice blog post on the concept. He used Cython to speed up his loops in building the bars from tick data on the S&P futures.

While Quantopian doesn't have tick data for building the bars, it may be possible to get a decent approximation for larger volume bar aggregations using the 1-min data. When I get a chance, I'll play with the concept and get an idea for how long it would take to build volume-based bars for a universe of stocks, and then I'll post my work in a notebook.

I came up with a couple of functions to compute volume bars in Quantopian based off of minute data. (They can easily be modified to create dollar-volume bars as well).

The first function uses a loop and is quite slow. However, while I was coding this, I thought of an idea for a more efficient version. I'm not sure "vectorized" is the correct description (although that is what I used in the notebook to describe this second function). I basically use the groupby function in pandas to speed up the computation. The two versions will create slightly different bars as is described in the attached notebook.

The notebook contains the functions I wrote. In addition, I included a section where I compare the average run time of each function on roughly 15-20 days of data for 1 symbol. Lastly, I included a few charts along with descriptive statistics showing the distributions of the volume vs. time bars for a couple of months.

Feel free to recommend any improvements to the functions (or anything else for that matter).

Hello Thomas, we have been exploring machine learning with generating features over different timescales for each individual stock outside of Quantopian (using Quandl subscription data), then translating this outside of Quantopian into buy or sell signals and then uploading it into Quantopian to run the backtester to see what the results are. This work is not completed yet, we need to further optimize this by exploring different features (we used simple things like MA etcetera). It does give positive results but did not beat our best rule-based algorithms on the same topic yet.

The test we used is that we manually selected a list of 6 ETF's which we know over a long time period always one of them would give a positive result within a time period (rebalance every week). Based on historic data, we can calculate the absolute maximum return we would have received when every week we would allocate 100% of capital to the best performing ETF for the next week. So this was our maximum. We already build a rule-based algorithm and we know the results from this as well. What we are now trying to do is build out this simple concept with Machine Learning to see whether we can overachieve on our best rule-based algorithm. In theory, if we find the right input features, the Machine Learning algorithm should always be able to outperform the rule-based algorithm but we are not there yet.

When training a RandomTree classifier on data that was labeled by the Triple-Barrier method, are you assuming that all series have the same length? That is, if you have 5 securities that you split into volume bars (thanks @Michael Matthews), there is no guarantee that all will have the same length. One security may have 450 volume bars, while another may have 462 in the same time period. Should I force the data to be uniform in size?

I guess a better question is, what kind of information should I be trying to feed into the classifier?

@Michael,

Wow! Quick turnover on that notebook, it looks fabulous. You didn't mention this in the nb, but your vectorized solution actually has the excellent property of minimizing the difference from your target volume, rather than a plain-old threshold. This does incorporate mild look-ahead bias like you mentioned, but for an analysis of statistical properties it's not so troublesome. That runtime is excellent, too.

Aiming for 6 bars per day is interesting, because we're aiming to separate our bars from the time domain. The tradeoff in our target volume is between having higher frequency data (generally better for machine learning purposes) at low sizes, and a smaller error at larger sizes. It's up to the researcher how that is implemented, however.

There is an upcoming forum post on this very topic, so keep on the lookout for that.

Thanks, Anthony. Nice Post by the way. I'll be on the lookout for the next one.

I haven't really thought through the selection of the target volume all that much. My thought was that it should probably vary by stock depending on the stock's average volume although questions arise, such as: When do you calculate the average volume (periodically, after a significant change in avg. volume)? Will the volume target change over time if the average volume changes?

I used a targeted number of bars per day just to give some frame of reference. I went with 6 bars per day in the illustration because it is roughly comparable to an hourly chart (for stocks), which can be useful for a 1 to 5 day holding period. Obviously, heavier (lighter) volume will create more (fewer) bars, which is what we want. We are somewhat limited by how granular we can go b/c as you approach 1 bar per minute, you will not get much improvement by using volume bars since we do not have the tick-by-tick data. As you said, I think considerable discretion can be used by the researcher depending on his/her use case.

Hi,

I just started to look at the book; it looks very interesting :-).
I decided to go through all the exercises, starting at chapter 2. It would be nice if someone would review my solutions... What do you think: should I post them here, or should I start a new topic for each chapter?

Did Mr de Prado use these techniques as a fund manager in his bond fund? If he did, then what were the results like? Did his use of machine learning techniques enable him to outperform the market and his competitors? For a significant period. Because of course chance must be ruled out.

It has been proved beyond doubt that machine learning and artificial intelligence has proved of great use in telling us what IS. There is little if any evidence that AI is of any significant use in telling us what WILL BE.

AI can recognize a face as a face, a dog as a dog. It can take the rules of any bounded problem and better a human's efforts in dealing with that problem; Go and Chess being examples of this.

Does AI have the ability to predict the future price of a financial instrument at any date in the future with a success rate significantly better than 50% (random)?

If it does not then the endless hours spent reading Mr De Prado's work may be better spent in some more fruitful manner. The Alchemist was eventually proved right: theoretically we can transform bread into gold, or any matter into any other configuration of matter.

Will we ever be able to report the same success in our efforts to predict the future? And if not then the financial market analyst might wish to find a better use of his time.

Anthony, looking forward to the code samples. I have been going through the book recently. Some thoughts:

Labeling: Useful topic in general, would definitely be more useful in standalone research using scikit-learn. Eager to know what the code samples provide.
Stationarity: Will be curious to see how you are achieving this in code.
Triple Barrier: Not sure how it fits into contest requirements with minimum end of day leverage requirements and Quantopian pipeline and order_optimal_portfolio. Some clarification would help
Direction and Bet Size: Curious how this study is transformed into Quantopian factor rankings and order_optimal_portfolio. I see the most benefit here.
Feature selection/CV: These are all research work related, a lot of chapters in the book are devoted to this. Will be eager to see code examples demonstrating the book's concepts.

I'd like to see how ML fits into quantopian workflow. Time is limited in the platform for any substantial real time model building.

@Kyle, the triple-barrier method will result in an uneven # of observations anyways, so the input data can be mismatched in size. The vertical barrier is of interest, however, as you can implement it as a set amount of time or set # of alternative bars. If you prefer the latter, then issues of different bar #s arise. In general, implementing the 3B method works best when treating each security individually, rather than in a basket. The labels produced are unimportant outside of the context of an individual security, as they are used to validate whether an individual trade was 'good' or 'bad' (in terms of profit.) If that's not clear, let me know.

@Michael, thank you! There are many options for selecting your target volume, though de Prado seems fond of the exponentially weighted moving average for these purposes (he uses its standard deviation for the volatility calculation in the triple barrier method). Your points are sound, particularly in regards to tick data unlocking a whole new world of exciting research. For Quantopian's longer timeframe perspective, however, we can derive value from these lower frequency bars.

@David, it's always insightful to see how others attempted to solve problems that you struggled with, so feel free to post in the forums with any code/questions you have. I do encourage you to perform sufficient due-diligence beforehand, however (asking for help is great, but ideally after exhausting most of what you can do on your own).

@Zeno, I cannot speak to de Prado's performance as a fund manager, however I can say that the work he has shown is extremely compelling. I liken the problem of ML in finance to facial recognition; but over the course of millions of years of human evolution (theoretically). An algorithm might be excellent at picking up modern human faces, but if our future borders on sci-fi and our genetics change, it will be useless. One of de Prado's focuses is adapting financial datasets to achieve stationarity, a maintenance of variables' properties over time. Profiting in this environment is extremely difficult, but his arguments are mathematically convincing and revolutionary. Note, this isn't sponsored in any way, I am just quite fond of the research he puts out (and was even before my time at Quantopian).

@Leo, it is interesting how/if this brand of machine learning can or will fit into the Quantopian workflow, and frankly it's somewhat up to you to decide. De Prado is less interested in risk constraints and market neutrality than Quantopian, so his methods would need to be massaged a bit to be suitable. For some clarification, the triple barrier method is also part of the research workflow. It is a means to label significant price observations for training a classifier. Ideally, the result of this research would be a model that simply outputs buy/sell orders based on live pricing data. Add some additional constraints to fit the Quantopian contest, or build them into de Prado's methods, and you should be golden. This is no easy task, and it requires a strong foundation. Hopefully we can provide this over the next few months, but I encourage you to work at it! The community can provide some amazing code, and you all deserve tons of credit.

Stationarity in markets. Hmm, a nice dream. In terms of volatility perhaps, hence the attractiveness of the VIX concept. I agree that the prospect of a technological singularity is most exciting. I look forward to bio hacking my mind and body. I do not expect, however, to be able to use technology or AI to shoot the lights out in the stock markets.

Call me cynical if you will and I hope you prove me wrong. But I wouldn't count on it.

Going to cross-post a question I had about the fractional differentiation concept from the thread Lopez de Prado posted here a while back (but abandoned). Perhaps someone here is able to answer it. It is kind of crucial IMO; the whole chapter might fall semi-apart just by asking it.

"In the chapter about fractional differentiation you argue we can often obtain a stationary price series by differentiating using a d < 1 instead of d = 1 (i.e. good old returns). There's a table in the chapter that illustrates this fact for a list of different instruments and a code sample is provided to find the smallest d for which the series is stationary.

This code sample uses statsmodels' adfuller function to test for stationarity, but it uses maxlag=1, which as far as I can tell results in quite an optimistic test, i.e. series are found to be stationary when just looking at them tells you they're clearly not. I found that when I'm using the default maxlag parameter, the obtained d values are generally on the order of ~0.3 higher.

So I guess my question is: why did you choose maxlag=1 and is it still possible to use these weakly/hardly stationary series in a predictive manner? The CUSUM filter mentioned in the book might be a solution, but if I think about it, it seems like a way to introduce stationarity implicitly and doesn't really need a stationary series to begin with."
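
For readers following along, the operation being tested can be sketched as below. This is a simplified, fixed-window version written for illustration; the function names `fracdiff_weights` and `fracdiff` and the window length are assumptions, not the book's code. The weights follow the standard binomial-series recursion w_0 = 1, w_k = -w_{k-1} (d - k + 1) / k, so d = 1 reduces to ordinary first differences.

```python
def fracdiff_weights(d, size):
    """First `size` weights of the fractional differencing operator (1 - B)^d."""
    weights = [1.0]
    for k in range(1, size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def fracdiff(series, d, window):
    """Fixed-window fractional difference of a series.

    Each output value is a weighted sum of the current value and the
    previous window - 1 values; the first window - 1 points are dropped.
    """
    w = fracdiff_weights(d, window)
    return [sum(w[k] * series[t - k] for k in range(window))
            for t in range(window - 1, len(series))]


# Usage: d = 1 recovers plain differences; d < 1 keeps more memory.
log_prices = [1.0, 3.0, 6.0, 10.0]
plain_diff = fracdiff(log_prices, d=1, window=3)
partial_memory = fracdiff(log_prices, d=0.5, window=3)
```

To reproduce the comparison raised in the question, one would run the output for each candidate d through `statsmodels.tsa.stattools.adfuller` twice, once with `maxlag=1` and once with the default `maxlag`, and compare the smallest d that passes in each case.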

@Ivory,

Could you please upload a notebook that demonstrates this? You're right that this is an integral part of the information presented, and any conflicting research is vital to consider. In my research, the maxlag parameter has (surprisingly) little impact on the resulting test statistic, so I would be eager to see a contradiction. Interesting point, thank you for bringing it up.

Excellent introductory post!

I should add that the method of fractional differentiation of time series has been in existence since the 80s. It came from hydrology, as did the idea of the Hurst exponent (popularized in finance by Edgar Peters). If one is curious about clustering methods, for instance, you'd want to look at gene sequencing. Tons of cross-pollination leads to better and novel results.

Picking up on James Villa's comment above, it seems that to get started with ML, one would need to be able to output a trailing window of alpha factor values from Pipeline (which I gather it is not designed to do...it only outputs the current alpha vector). One would end up with columns corresponding to SIDS, and rows corresponding to daily alpha values. And additional columns corresponding to training data (e.g. returns, etc.). And then these data would be fed into a ML routine that goes "kerchunk, kerchunk,..." and spits out a single alpha vector for the day.
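
The data flow described above could be sketched, outside of Pipeline, roughly as follows. Everything here is hypothetical: the function names, the dict-per-day layout keyed by SID, and the use of a one-feature least-squares fit as a stand-in for the "kerchunk" ML step. It is not how the Quantopian ML algo does it.

```python
def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def daily_alpha(factor_history, return_history, today_factor, window):
    """Train on a trailing window and emit one alpha vector for today.

    factor_history[i] maps SID -> factor value on past day i.
    return_history[i] maps SID -> the return that followed day i; to
    avoid lookahead it must only contain returns already realized.
    """
    xs, ys = [], []
    for factors, rets in zip(factor_history[-window:], return_history[-window:]):
        for sid, value in factors.items():
            if sid in rets:
                xs.append(value)
                ys.append(rets[sid])
    slope, intercept = fit_line(xs, ys)
    return {sid: intercept + slope * value for sid, value in today_factor.items()}


# Usage with two past days and two SIDs.
factor_history = [{1: 1.0, 2: 2.0}, {1: 3.0, 2: 4.0}]
return_history = [{1: 2.0, 2: 4.0}, {1: 6.0, 2: 8.0}]
alpha_today = daily_alpha(factor_history, return_history, {1: 5.0, 2: 0.5}, window=2)
```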

Presumably Thomas W. does this in his ML algo, but it is not so obvious to me how it is done.

This sort of trailing window of training data would seem to be common to any ML approach. Could a standard set of Pipeline-based tools be put up in an importable Q library (perhaps 'quantopian.research.experimental' would be appropriate)?

I implemented walk-forward and combinatorial cross-validation with purging and embargoing here:
https://github.com/sam31415/timeseriescv
If you're working on your own machine, you can install the package using pip. You probably can't import it on Quantopian, but you can always copy-paste the source code. Here is a Medium post with more info:
https://medium.com/@samuel.monnier/cross-validation-tools-for-time-series-ffa1a5a09bf9
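
For readers who have not met purging and embargoing before, here is a toy sketch of the idea on integer sample indices. It is not the API of the package linked above; the function name `purged_kfold` and its parameters are hypothetical, it covers only a plain k-fold layout (not the walk-forward or combinatorial variants), and it assumes the label of sample i is determined by data from index i through i + horizon.

```python
def purged_kfold(n_samples, n_splits, horizon, embargo):
    """Yield (train, test) index lists with purging and an embargo.

    Purging drops training samples whose label window [i, i + horizon]
    overlaps the label windows of the test fold. The embargo drops a
    further `embargo` samples after the end of that overlap region.
    """
    fold_size = n_samples // n_splits
    for fold in range(n_splits):
        start = fold * fold_size
        # The last fold absorbs any remainder samples.
        stop = n_samples if fold == n_splits - 1 else start + fold_size
        test = list(range(start, stop))
        last_test_label_end = stop - 1 + horizon
        train = [i for i in range(n_samples)
                 if i + horizon < start                 # ends before the test fold
                 or i > last_test_label_end + embargo]  # starts after overlap + embargo
        yield train, test


# Usage: 10 samples, 5 folds, labels look 1 step ahead, 1-sample embargo.
splits = list(purged_kfold(10, 5, horizon=1, embargo=1))
```

The point of the exercise is that no training sample shares information with the test fold through overlapping label windows, which ordinary shuffled k-fold does not guarantee for time series.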

@Grant: Yes, I agree that it would be great to have easier access to historical pipeline values in the backtester. In the ML algo post I got around that problem by implementing the ML part as a pipeline factor as well because pipeline factors can get a window of pipeline values. One could probably also somehow copy them out with a dummy history factor that writes to a global variable, although that would be quite hacky.

@Samuel: That looks amazing, thanks so much for sharing. I love how you kept close to the sklearn API. In the medium term we should definitely include this on the platform. Until then, maybe you would like to post a sample NB that demonstrates it by simply copying and pasting the code?

@Al: Thanks for the insight. I completely agree that a lot of progress in quant finance can come from applying tools from other domains (as has been proven historically). Actually I think this is a core strength of Quantopian where we try to make it as easy as possible for people from various other domains to try their ideas without having to commit to a new career path.

@Thomas: I don't really have shareable data ready to include in a notebook, and the pandas.util.testing module that I used to generate fake data in the tests won't import in the research environment.

In principle, all you have to do is to paste the content of cross-validation.py into a cell of the notebook. You can then use the classes PurgedWalkForwardCV and CombPurgedKFoldCV. That's in principle. In practice, I get a lot of errors due to the restrictions of the research environment:
- It doesn't like f-strings, apparently.
- Same with type hinting.
- abc and typing can't be imported.
I stopped trying when I got:

InputRejected: Insecure built-in function 'super' Last warning! One
more import error and your account will be suspended for security
reasons until a human can talk to you.

@Samuel: Bummer, but thanks for trying!

@Thomas -

I was thinking of built-in functionality within Pipeline for alpha combination that uses a trailing window of alpha factor values and their associated returns, etc. It is not clear, though, whether it might be better to do it outside of Pipeline, in the dedicated 5-minute before_trading_start window.

Anyway, is there something keeping you guys from working on this stuff? I kinda get the impression that you are either all in some kind of holding pattern, or busy with other stuff. I see the little "Q" by everyone's picture, so I guess you are all still employed.

@Grant: Our focus right now is on expanding our coverage of markets to global equities.

Thanks Thomas -

In my opinion, there is a bit of a gap in the Q support for the workflow in the area of Alpha Combination, which would include ML, as your former CIO, Jonathan Larkin, discusses:

The weighting scheme can be quite simple: sometimes just adding ranks or averaging your alphas can be an effective solution. In fact, one popular model does just that to combine two alphas. For increased complexity, classic portfolio theory can help you; for example, try solving for the weights such that the final combined alpha has the lowest possible variance. Lastly, modern machine learning techniques can capture complex relationships between alphas. Translating your alphas into features and feeding these into a machine learning classifier is a popular vein of research.

I'm thinking that to approach this ML stuff incrementally, I need some generic code structure within the Alpha Combination step, so that I can move easily from the simple to the more complex ways of combining factors (which will allow for benchmarking both computational efficiency and algo financial performance, besides just being good coding practice).
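
As a rough idea of what such a generic structure might look like, here is a sketch in plain Python rather than Pipeline. All names are hypothetical, each alpha is assumed to be a dict mapping an asset to its score, and only the two simplest schemes from the quote (summing ranks and averaging) are shown. A variance-minimizing or ML-based combiner would plug in as another function with the same signature.

```python
def rank(alpha):
    """Map each asset to its ordinal rank (1 = lowest score).

    Ties are broken arbitrarily in this sketch.
    """
    ordered = sorted(alpha, key=alpha.get)
    return {asset: position + 1 for position, asset in enumerate(ordered)}


def common_assets(alphas):
    """Assets that have a score in every alpha."""
    return set.intersection(*(set(a) for a in alphas))


def combine_by_rank_sum(alphas):
    """Sum each asset's rank across alphas."""
    assets = common_assets(alphas)
    ranks = [rank({k: a[k] for k in assets}) for a in alphas]
    return {k: sum(r[k] for r in ranks) for k in assets}


def combine_by_mean(alphas):
    """Average raw scores; only sensible if alphas share a common scale."""
    assets = common_assets(alphas)
    return {k: sum(a[k] for a in alphas) / len(alphas) for k in assets}


def combine(alphas, method=combine_by_mean):
    """Single entry point so combination schemes can be swapped and benchmarked."""
    return method(alphas)


# Usage: two toy alphas over three assets.
value = {"A": 0.1, "B": 0.3, "C": 0.2}
momentum = {"A": 5.0, "B": 7.0, "C": 9.0}
by_rank = combine([value, momentum], method=combine_by_rank_sum)
by_mean = combine([value, momentum])
```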

I posted a request for guidance here: https://www.quantopian.com/posts/alpha-factor-combination-in-pipeline-how-to-fancify-it.

I understand you are busy going after the next big thing, but maybe someone could scratch something on a piece of scrap paper over a coffee break and then ask one of your customer support folk to post it?

@Grant: I think the ML part 3 has all the parts that are needed. There definitely could be many improvements made but it should be workable.

O.K. Well, I guess I'll have to try to unravel it myself. At this point, I'm not interested in the ML, but it seems to be kinda wrapped up in the code. Kinda confusing that you'd have an intern spend a lot of time on a ML project, but not want to support it more formally...