Algo with Support Vector Machine in Pipeline

I wrote a base algo that incorporates machine learning in the pipeline, i.e. as the pipeline runs, it trains an ML model PER STOCK and comes up with a prediction of the stock's movement. The algo can then use the output of the pipeline to go long the stocks predicted to rise and short the stocks predicted to fall.

As it stands, this algo does not perform well, but it can serve as a basis for someone else.

/Luc Prieur



Thank you for this

Hi Luc,
As I am still basically "non-pythonic" (especially with pandas) and I need to get there first before I can do much, I had not planned on diving into either ML or alternative data from social media for quite a long while yet, but this thread and your algo look very interesting.

Please excuse me if some of my questions seem naive or simply wrong, due to my limited knowledge of Python.

From what I can infer from the little that I understand of the code in your algo, I think that you are taking 2 different sets of inputs from social media (aggregated_twitter_withretweets_stocktwits), one being bullish tweets and the other bearish tweets, then taking the average number of each type over a rolling window and calculating the slope of those two averages, and then using those 2 slopes as input to your SVM. Am I more-or-less correct so far?

It also looks like you are taking SMA5 & SMA10 of price, but I can't quite figure out if these are also inputs to the SVM or simply used in the filter. [I'm sorry if this sounds stupid on my part but, as I said, I'm new to Python]. Anyway, at least as I understand it, you are using the SLOPES of the average numbers of bullish & bearish tweets, and possibly also the VALUES of the two SMAs, as ML inputs. Please set me straight if I have this wrong so far.

Do you have any descriptive (i.e. other than python code) documentation of what you are doing that you could share? Perhaps then we could talk some more.

Despite my weakness in Python, I have been trading for more than 30 years, I have at least some experience with SVM, and quite a lot of experience with the problems of using ML in trading systems, at least in the context of old-fashioned Neural Networks. Although there are obvious differences, there are also some similarities in the practical problems that one has with regard to good choices of indicators to use as input for any type of ML, and also with the issue of pre-processing.

I look forward to understanding more about what you are doing and sharing ideas.
Best regards, Tony


The SMA5 and SMA10 serve no purpose here, and neither does filtering the stocks on SMA5 being larger than SMA10. I should have removed that bit of code. It is outside the scope of using ML in the pipeline. I had picked up that small bit of code from another algo posted in the community.

As for the ML bit, you understand it correctly. I am using the slope of tweets (bearish and bullish) as input to the ML, and the target of the ML is the price movement of the following day for said equity.

I copied some of the ML code below.

Here I select data for the two features:
features = ['bull_rs', 'bear_rs']

This is the last row of feature data. It must be used to predict tomorrow's movement. Hence I call it live.
X_live = df[features][-1:]

This is the remainder of the data, which I shift one day forward so that the previous day's data is aligned with the current day's price movement, i.e. "result". result is +1 if there was a gain, -1 for a loss.

df[features] = df[features].shift(1)

I remove the NaN row (it should only be the first row of data).

Scale data
X_train = scaler.fit_transform(df[features])
Set target.
y_train = df['result']

Specify the model. Funny, I mentioned SVM, but I use Naive Bayes here. Changing the model is just a drop-in replacement.
model = GaussianNB()

Here I train the model and run the prediction for tomorrow in one line.
prediction = model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]
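
For anyone who wants to try these steps outside the pipeline, here is a rough, self-contained sketch of them as one function. It is not the exact code from the algo: the helper name predict_next_move, the use of StandardScaler as the scaler, the dropna call, and the synthetic demo data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

FEATURES = ['bull_rs', 'bear_rs']


def predict_next_move(df, model=None):
    """Train on one stock's history and predict tomorrow's direction.

    Assumes df holds the two feature columns plus a 'result' column with
    +1 (gain) or -1 (loss) for each day's price movement.
    """
    df = df.copy()  # leave the caller's frame untouched
    # The latest feature row has no known outcome yet: keep it for prediction.
    X_live = df[FEATURES].iloc[-1:]
    # Shift features one day forward so yesterday's features line up
    # with today's result.
    df[FEATURES] = df[FEATURES].shift(1)
    # The shift leaves NaN in the first row; drop it.
    df = df.dropna(subset=FEATURES + ['result'])
    scaler = StandardScaler()  # assumption: the original scaler is not shown
    X_train = scaler.fit_transform(df[FEATURES])
    y_train = df['result']
    model = model if model is not None else GaussianNB()
    return model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'bull_rs': rng.normal(size=60),
    'bear_rs': rng.normal(size=60),
    'result': rng.choice([-1, 1], size=60),
})
prediction = predict_next_move(demo)
```

Passing a different estimator, for example model=SVC() after from sklearn.svm import SVC, swaps the model without touching the rest of the function.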

The whole algo is just presented as a template for others to build on. I did not try to make it perform well or anything.



You can contact me directly on linkedin if you wish.

Hi Luc,

Many thanks. I will contact you directly over the next few days.

One of the things I found with ML in general (of whatever type) is that usually the quality and success of the output is very strongly dependent on exactly what you do as pre-processing before the inputs actually go into the ML / AI. The more you can "help" the ML to get started in the right direction, the better, so that it can focus its efforts on the "important stuff" and doesn't have to waste its time trying to figure out (perhaps unsuccessfully) how to do something that we could have just told it beforehand. Specifically, in this case, you have 2 "raw" inputs, namely the number of bullish tweets and the number of bearish tweets. Although at first glance these might seem like logical choices for input to ML, the problem with using these 2 items as they are is that both of them contain a mixture of 2 different types of info, namely 1) Bearishness vs Bullishness and 2) Changing levels of Enthusiasm for tweeting. My suggestion is to do a little bit of pre-processing to separate these two different aspects BEFORE inputting the data to the ML, as follows:

a) Count Total tweets = Bullish tweets + Bearish tweets.
b) Count NET Bullish tweets = Bullish tweets - Bearish tweets (will be +ve if predominantly Bullish, -ve if predominantly Bearish)
c) Proportion NET Bullish Tweets = b) / a). This is now normalized relative to the total number of tweets and becomes more purely a measure of + or - sentiment itself.
d) Take the Average number of NET Bullish tweets, and then from this take the ratio of Current NET Bullish Tweets to Average NET Bullish tweets. This gives a Short-term measure of day-to-day variability in bullishness or bearishness.
e) Slope = trend of NET bullish tweets, gives a Longer-term measure of changes in bullishness or bearishness.

Items c) d) and e) are now all normalized with respect to the total number of tweets and would potentially be useful as inputs to ML. The confusing factor of how many people are currently tweeting (either way) has been removed. We now have 3 inputs rather than the original 2, and they now contain info in a slightly different form that should be easier for any ML to work with. Please could you try this and see if it helps?
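
Items a) to e) could be sketched with pandas roughly as follows. This is only an illustration under stated assumptions: the function name sentiment_features, the column names, the window length, and the use of a least-squares line for the slope are all made up here, and c) is taken as net over total, which is what normalizing by the total number of tweets implies.

```python
import numpy as np
import pandas as pd


def sentiment_features(bull, bear, window=5):
    """Build items c), d) and e) from daily bullish/bearish tweet counts."""
    total = bull + bear   # a) total tweets
    net = bull - bear     # b) net bullish tweets
    # c) net bullish proportion (net / total); NaN on days with no tweets.
    net_prop = net / total.replace(0, np.nan)
    # d) today's net bullish count relative to its rolling average.
    avg_net = net.rolling(window).mean()
    net_vs_avg = net / avg_net.replace(0, np.nan)
    # e) slope of a least-squares line through net bullish tweets
    #    over the same rolling window.
    x = np.arange(window)
    net_slope = net.rolling(window).apply(
        lambda y: np.polyfit(x, y, 1)[0], raw=True)
    return pd.DataFrame({
        'net_prop': net_prop,
        'net_vs_avg': net_vs_avg,
        'net_slope': net_slope,
    })


# Made-up counts, purely for illustration.
bull = pd.Series([10, 12, 15, 11, 20, 25], dtype=float)
bear = pd.Series([5, 6, 5, 11, 10, 5], dtype=float)
feats = sentiment_features(bull, bear, window=3)
```

One caveat with d): the ratio becomes unstable when the rolling average of net bullish tweets is close to zero or changes sign, so in practice it may need clipping or a different normalization.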

The other thought I have is that irrespective of whether you are using Naive Bayes or SVM, and Gaussian or Linear or RBF models, these are all Classifiers which, at least as I understand it, are designed to give a binary 1/0 output. Now is this really what you want? If you only want to decide Long or Short, then OK, but in fact we can probably do much better than that. In the context of an Equity Long-Short strategy with a large universe of possible equities to choose from, what we would actually like is a Ranking of all the equities on a continuous (rather than a binary) scale. Then we go Long on the N top-ranking (most bullish) equities, for however many N we want, and go Short on the N bottom-ranking (most bearish) equities. So, to do that, what we would need is something that gives a continuous-valued output rather than a 1/0 output.
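
The ranking step described above can be sketched in a few lines of plain Python. The function name long_short_from_scores, the ticker symbols, and the score values are hypothetical; where the continuous scores come from is left open.

```python
def long_short_from_scores(scores, n):
    """Return (longs, shorts): the n highest- and n lowest-scored symbols."""
    if n < 1 or 2 * n > len(scores):
        raise ValueError("need n >= 1 and at least 2 * n scored symbols")
    # Sort symbols from most bullish to most bearish.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], ranked[-n:]


# Hypothetical continuous scores, one per equity.
scores = {'AAA': 0.8, 'BBB': -0.3, 'CCC': 0.1, 'DDD': -0.9, 'EEE': 0.5}
longs, shorts = long_short_from_scores(scores, n=2)
```

The check on n keeps the long and short baskets from overlapping when the universe is small.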

Cheers, best wishes, Tony
(also on Linked-in, see Tony Morland)


As for feature engineering, yes, your proposal sounds good. My post was not meant to propose the best features to use, but rather to offer a template algo for others to build on.

As for using classifiers instead of regressors, of course one could use a regressor and use the output amplitude to run a long-short strategy. In my experience, however, ML has fewer problems trying to guess a direction than an amplitude (i.e. tomorrow's stock price). One could use the classifier's confidence level in a long-short strategy.
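
As a rough sketch of that last idea, scikit-learn classifiers such as GaussianNB expose predict_proba, which gives a per-class probability that can serve as a continuous score. The helper name up_confidence, the synthetic data, and the mapping of the probability to a score in [-1, 1] are assumptions for illustration, not something taken from the algo.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def up_confidence(model, X_train, y_train, X_live):
    """Fit the classifier and return P(result == +1) for the live row(s)."""
    model.fit(X_train, y_train)
    # Find which probability column corresponds to the +1 class.
    up_col = list(model.classes_).index(1)
    return model.predict_proba(X_live)[:, up_col]


# Synthetic features and +1/-1 labels, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=80) > 0, 1, -1)

# Train on all but the last row, then score the last row.
p_up = up_confidence(GaussianNB(), X[:-1], y[:-1], X[-1:])
score = 2 * p_up[0] - 1  # maps a probability in [0, 1] to a score in [-1, 1]
```

A score like this, computed per stock, could then be ranked across the universe. Naive Bayes probabilities are often poorly calibrated, so the ordering is likely more trustworthy than the absolute values. With SVC, predict_proba is only available when the model is created with probability=True; decision_function is an alternative source of a continuous score.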

I encourage anyone with an improved algo based on mine to publish it in this thread.