Algo with Support Vector Machine in Pipeline

I wrote a base algo that incorporates machine learning in the pipeline, i.e. as the pipeline runs, it trains an ML model PER STOCK and comes up with a prediction of the stock's movement. The algo can then use the output of the pipeline to long the predicted-up stocks and short the predicted-down stocks.

As it stands, this algo does not perform well, but it can serve as a basis for someone else.

/Luc Prieur

[Attached backtest: performance summary not rendered in this archive]
# Backtest ID: 59cc0d4955c6cd54186072ed

Thanks!

Thank you for this

Hi Luc,
As I am still basically "non-pythonic" (especially with pandas) and need to get there first before I can do much, I had not planned on diving into either ML or alternative data from social media for quite a while yet, but this thread and your algo look very interesting.

Please excuse me if some of my questions seem naive or simply wrong, due to my limited knowledge of Python.

From what I can infer from the little I understand of the code in your algo, I think you are taking two different sets of inputs from social media (aggregated_twitter_withretweets_stocktwits), one being bullish tweets and the other bearish tweets, taking the average number of each type over a rolling window, calculating the slope of those two averages, and then using those two items as input to your SVM. Am I more-or-less correct so far?

It also looks like you are taking the SMA5 and SMA10 of price, but I can't quite figure out whether these are also inputs to the SVM or simply used in the filter. [I'm sorry if this sounds stupid on my part but, as I said, I'm new to Python.] Anyway, at least as I understand it, you are using the SLOPES of the average numbers of bullish and bearish tweets, and possibly also the VALUES of the two SMAs, as ML inputs. Please set me straight if I have this wrong so far.

Do you have any descriptive (i.e. other than Python code) documentation of what you are doing that you could share? Perhaps then we could talk some more.

Despite my weakness in Python, I have been trading for more than 30 years, I have at least some experience with SVMs, and quite a lot of experience with the problems of using ML in trading systems, at least in the context of old-fashioned neural networks. Although there are obvious differences, there are also similarities in the practical problems one has with regard to good choices of indicators to use as input for any type of ML, and also with the issue of pre-processing.

I look forward to understanding more about what you are doing and sharing ideas.
Best regards, Tony

Tony,

The SMA5 and SMA10 are useless, as is filtering the stocks on SMA5 being larger than SMA10. I should have removed that bit of code; it is not within the scope of using ML in the pipeline. I had picked up that small bit of code from another algo posted in the communities.

As far as your understanding of the ML bit goes, you have it correct. I am using the slopes of tweets (bearish and bullish) as input to the ML, and the target of the ML is the next day's price movement for said equity.

I copied some of the ML code below ("scaler" is a scikit-learn scaler instantiated earlier in the algo):

    from sklearn.naive_bayes import GaussianNB

    # Select the two feature columns.
    features = ['bull_rs', 'bear_rs']

    # The last row of feature data is used to predict tomorrow's movement,
    # hence I call it "live".
    X_live = df[features][-1:]

    # Shift the remainder of the data one day forward so that the previous
    # day's features are aligned with the current day's price movement,
    # i.e. "result" (+1 if there was a gain, -1 for a loss).
    df[features] = df[features].shift(1)

    # Remove NaN rows (should be only the first row of data).
    df.dropna(inplace=True)

    # Scale the data.
    X_train = scaler.fit_transform(df[features])

    # Set the target.
    y_train = df['result']

    # Specify the model. Funny, I mentioned SVM but use Naive Bayes here;
    # changing the model is just a drop-in.
    model = GaussianNB()

    # Train the model and run the prediction for tomorrow in one line.
    prediction = model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]
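For readers outside the Quantopian environment, here is a stdlib-only sketch of the same shift-and-align scheme. The toy data and the nearest-centroid rule (standing in for GaussianNB) are illustrative assumptions, not Luc's actual code:

```python
# Stdlib-only sketch: feature row t is paired with the NEXT day's +1/-1
# result, mirroring df[features].shift(1); the last row is "live".
# A nearest-centroid rule stands in for sklearn's GaussianNB.

def train_and_predict(bull_rs, bear_rs, results):
    """bull_rs/bear_rs: daily feature values; results: daily +1/-1 moves."""
    rows = list(zip(bull_rs, bear_rs))
    x_live = rows[-1]              # today's features predict tomorrow
    x_train = rows[:-1]            # features at day t ...
    y_train = results[1:]          # ... paired with the move at day t+1

    def centroid(label):
        # Mean feature vector of the training rows with this label.
        pts = [x for x, y in zip(x_train, y_train) if y == label]
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    up, down = centroid(+1), centroid(-1)
    return +1 if dist2(x_live, up) <= dist2(x_live, down) else -1

print(train_and_predict([1, 2, 3, 4, 5, 3],
                        [5, 4, 3, 2, 1, 2],
                        [-1, -1, -1, +1, +1, +1]))  # → 1
```

Swapping the centroid rule back for a real classifier only changes the two lines that build and query the model; the alignment logic is the part Luc's shift does.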

The whole algo is just presented as a template for others to build on. I did not try to make it perform well or anything.

BR/

Luc

You can contact me directly on linkedin if you wish.

Hi Luc,

Many thanks. I will contact you directly over the next few days.

One of the things I found with ML in general (of whatever type) is that the quality and success of the output is usually very strongly dependent on exactly what you do as pre-processing before the inputs actually go into the ML / AI. The more you can "help" the ML get started in the right direction, the better, so that it can focus its efforts on the "important stuff" and doesn't waste its time trying to figure out (perhaps unsuccessfully) something we could have just told it beforehand.

Specifically, in this case, you have 2 "raw" inputs, namely the number of bullish tweets and the number of bearish tweets. Although at first glance these might seem like logical choices for input to ML, the problem with using these 2 items as they are is that both of them contain a mixture of 2 different types of info, namely 1) bearishness vs. bullishness, and 2) changing levels of enthusiasm for tweeting. My suggestion is to do a little bit of pre-processing to separate these two different aspects BEFORE inputting the data to the ML, as follows:

a) Count Total tweets = Bullish tweets + Bearish tweets.
b) Count NET Bullish tweets = Bullish tweets - Bearish tweets (will be +ve if predominantly Bullish, -ve if predominantly Bearish).
c) Proportion NET Bullish tweets = b) / a). This is now normalized relative to the total number of tweets and becomes more purely a measure of + or - sentiment itself.
d) Take the average number of NET Bullish tweets, and then take the ratio of current NET Bullish tweets to average NET Bullish tweets. This gives a short-term measure of day-to-day variability in bullishness or bearishness.
e) Slope = trend of NET Bullish tweets; gives a longer-term measure of changes in bullishness or bearishness.

Items c), d) and e) are now all normalized with respect to the total number of tweets and would potentially be useful as inputs to ML. The confounding factor of how many people are currently tweeting (either way) has been removed. We now have 3 inputs rather than the original 2, and they contain the info in a slightly different form that should be easier for any ML to work with. Could you please try this and see if it helps?
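A stdlib-only sketch of features c), d) and e); the daily counts and the helper name are hypothetical, and the slope in e) is computed with an ordinary least-squares fit:

```python
# Sketch of Tony's normalized tweet features, stdlib only.

def tweet_features(bullish, bearish):
    """bullish/bearish: hypothetical daily tweet counts (same length)."""
    # (a) total tweets and (b) NET bullish tweets, per day
    total = [bu + be for bu, be in zip(bullish, bearish)]
    net = [bu - be for bu, be in zip(bullish, bearish)]
    # (c) proportion NET bullish = b) / a), normalized by tweet volume
    prop = [nt / tt for nt, tt in zip(net, total)]
    # (d) ratio of current NET bullish to its average over the window
    #     (assumes the average is non-zero)
    avg_net = sum(net) / len(net)
    ratio = net[-1] / avg_net
    # (e) slope (trend) of the NET bullish series, least-squares fit
    n = len(net)
    x_mean = (n - 1) / 2
    num = sum((x - x_mean) * (y - avg_net) for x, y in enumerate(net))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return prop[-1], ratio, num / den

# Example: steadily rising bullish counts over a flat bearish baseline.
p, r, s = tweet_features([10, 12, 14, 16], [6, 6, 6, 6])  # slope s is 2.0
```

The three returned values are the per-stock inputs Tony proposes feeding to the ML in place of the raw bullish/bearish counts.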

The other thought I have is that irrespective of whether you are using Naive Bayes or SVM, with Gaussian or linear or RBF kernels, these are all classifiers which, at least as I understand it, are designed to give a binary output. Now, is this really what you want? If you only want to decide Long or Short, then OK, but we can probably do much better than that. In the context of an equity long-short strategy with a large universe of possible equities to choose from, what we would actually like is a ranking of all the equities on a continuous (rather than a binary) scale. Then we go Long on the N top-ranking (most bullish) equities, for however many N we want, and go Short on the N bottom-ranking (most bearish) equities. So, to do that, we need something that gives a continuous-valued output rather than a binary one.
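The ranking idea fits in a few lines; the tickers, scores, and helper name below are made up for illustration (a real score could be, say, a classifier's probability of the "up" class minus 0.5):

```python
# Rank equities on a continuous bullishness score, then take the
# N most bullish as longs and the N most bearish as shorts.

def long_short_picks(scores, n):
    """scores: dict ticker -> continuous bullishness score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], ranked[-n:]

longs, shorts = long_short_picks(
    {'AAA': 0.9, 'BBB': 0.4, 'CCC': -0.2, 'DDD': -0.7, 'EEE': 0.1}, 2)
# longs → ['AAA', 'BBB'], shorts → ['CCC', 'DDD']
```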

Cheers, best wishes, Tony
(also on Linked-in, see Tony Morland)

Tony,

As far as feature engineering goes, yes, your proposal sounds good. My post was not meant to propose the best features to use, but rather a template algo for others to build on.

For the use of classifiers instead of regressors: of course one could use a regressor and use the output amplitude to run a long-short strategy. In my experience, however, ML has fewer problems trying to guess a direction rather than an amplitude (i.e. tomorrow's stock price). One could use the classifier's confidence level in a long-short strategy.

I encourage anyone with an improved algo based on mine to publish it in this thread.

/Luc

I have played a little bit more with this base algo and used PsychSignal twitter sentiment as a feature to the SVM classifier. As well, I am using past returns as features.

So, as it stands now, on each day, for each stock in the pipeline, the algo trains an SVM classifier on past performance and spits out a prediction. That prediction is used as a weight for the optimizer. The whole thing works fine unless you turn on slippage and commissions; then not so good, which is normal as it trades every day.
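The prediction-to-weight step can be sketched stdlib-only; the helper name and the equal-magnitude scheme are illustrative assumptions, not the algo's actual optimizer call (on Quantopian, a mapping like this would typically be handed to TargetWeights):

```python
# Turn per-stock +1/-1 predictions into equal-magnitude target weights.
# Longs and shorts offset, so net exposure sums to zero.

def predictions_to_weights(preds):
    """preds: dict ticker -> +1 (long) or -1 (short)."""
    n = len(preds) or 1
    return {ticker: p / n for ticker, p in preds.items()}

w = predictions_to_weights({'AAA': 1, 'BBB': -1, 'CCC': 1, 'DDD': -1})
# each weight has magnitude 0.25; the weights sum to zero
```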

If anyone has any idea as to how to improve this, please post your modified algo.

Thanks.

[Attached backtest: performance summary not rendered in this archive]
# Backtest ID: 5b16ee06c38baf441fefefac

Tried to see if this is viable for a long/short algo with 250 positions on both sides, so I applied your 'scanned twitter messages' filter to the QTradableStocksUS universe, limited to 500 stocks total:

    my_screen = QTradableStocksUS() & SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=60).top(500)


This seems to yield about 300 stocks per day (probably there are no more in the twitter data feed?), but it times out, so this is currently a dead end for real long/short algos.

[Attached backtest: performance summary not rendered in this archive]
# Backtest ID: 5b17e6079bc4b94446e7afc3

I've been toying around with your code and noticed that it held up well until approximately Oct 31 to Nov 1, 2018, when something funky went on. Maybe another set of eyes can help me navigate that downturn without introducing too much overfitting.

[Attached backtest: performance summary not rendered in this archive]
# Backtest ID: 5c76edab8001394a63010da4

Daniel,

I am surprised to see such a big jump in the P&L curve. Something weird is happening; I'll check. I like the changes you have made to the original code.

/Luc

Hi Luc, Daniel,

I'm glad to see this thread bumped, because it reminded me to revisit a variant of Luc's base algo I did back in June 2018 that passed all contest constraints. I made some changes to the inputs, training window, and constraints, but kept the standard SVC routine and limited it to trading approximately 95-100 stocks to avoid timeout issues. The live portion of the tearsheet below is just the period since I last ran the backtest, kinda OOS. What do you guys think, overfitted?

Notebook previews are currently unavailable.

James,

It is difficult to say whether it is overfitting. There are not many tuning parameters aside from the training window length, the SVC parameter C (if used; in my original code it was set to the default, i.e. 1.0), and the choice of features. The universe selection is just there to lower the computation time. It is possible that the current market regime is still being learned by the SVC. As such, I don't see how it could be "manually tuned" to overfit.

Make sure that the features you use (or engineer) are stationary in time and that the training window is not too short. Non-stationary features are not handled well by classical ML (e.g. if you were to use stock prices instead of returns), and the training window must be long enough that the SVC trains on a sufficient amount of data.
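As a concrete instance of the stationarity point, converting a price series to simple returns is the usual first step (toy numbers, stdlib only):

```python
# Feed the classifier returns, not raw prices: returns fluctuate around a
# stable level, while price levels drift (non-stationary).

def to_returns(prices):
    """Convert a price series to simple daily returns."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

rets = to_returns([100.0, 102.0, 101.0, 104.03])
# three returns: +2%, about -0.98%, +3%
```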

Thanks for posting your live results. If you could post an older version of your code, that would be nice as well.

Hey Luc,

Thanks for your feedback. I pretty much agree with you that it might be too early to judge overfitting based on the OOS period which was a very tough period including the major correction of Dec. 24 and subsequent slow recovery. If you look at the tail end of OOS performance it looks like it's bouncing back!

My major problem here is limited computation resources, which results in timeout issues. The training window is 252 days, and I limit the number of positions to 90-100 so it doesn't time out. Given this configuration, I get portfolio volatility of ~10%, something I know I could lower if, say, I could extend the training window to 756 days and trade 500 positions without timeout issues.

The key here is the inputs, the secret sauce. As to the code revision, it's pretty much the same as what Daniel did with TargetWeights and its added constraints, that's it!