Multicollinearity And Multiple Linear Regression Cheat Sheet

Hey everybody, I'm pretty new to Quantopian as well as Python and have spent the past two months immersing myself in both worlds. I've been coding for more than 25 years, but my math skills are rusty and I urgently had to refresh my knowledge of linear regression and related topics. So I decided to extract some code shown in various LR-related lectures and turn it into my own little linear regression cheat sheet, presented here. It also covers multiple independent series and demonstrates multicollinearity. Perhaps it'll help clear up all those mystical variables you encountered in linear regression but never dared to ask about ;-)

P.S.: If you see any errors or have any suggestions, don't be shy and point them out in the comment section. Thanks in advance.

10 responses

Thanks M8

So alpha and beta in the stock market (and backtester) are linear regression (or OLS, ordinary least squares) alpha and beta, fed particular inputs?

Blue, I don't understand your question - can you elaborate?

The background is a little complex ...

1. The terms alpha and beta along with other greek letters are used all over the place in engineering and science. For example, most have heard of alpha and beta radiation, and gamma.
2. Linear regression (with its alpha and beta) is essentially a trend line and yet more than that. Regression? Good question. It's no secret that statisticians like to come up with new terms; check out this glossary of statistics terms if you don't believe it. "Regression" was a new term back in the Victorian era, from Darwin's cousin (Francis Galton). They made an exception with the generic, oft-used terms alpha and beta. I had grown accustomed to understanding beta as the slope of the trend line, since I was seeing it used that way everywhere (one exception). But now I'm starting to get that it isn't always the slope of a trend over time and can be other things depending on the inputs. It is a trend-line slope only if the set being compared to is a constant such as time, I think, and not with two varying sets, as in collinearity perhaps.
3. One often hears about stock market alpha, there's AlphaLens developed by Quantopian, Seeking Alpha, alpha factors etc. Everywhere alpha is always considered a good thing, if not the main thing. And yet the explanations may put you to sleep if you weren't born on Wall Street or have enough reference points of knowledge already. For example:

"Alpha measures the difference between a fund's actual returns and its expected performance, given its level of risk (as measured by beta). A positive alpha figure indicates the fund has performed better than its beta would predict. In contrast, a negative alpha indicates a fund has underperformed, given the expectations established by the " ...

.... blah blah yadda etc. Tough without experience already. Don't get me wrong folks, I'm not looking for anyone to give me their understanding of alpha. I'm asking very specifically whether the two alphas are the same, as it would now be astounding to me if they are not.

It finally dawned on me that the terms alpha and beta in the stock market were not coined out of nerdy lack of vocabulary and instead are precisely the same as alpha and beta in linear regression for particular inputs. In some cases beta is not slope at all (or is it still?) like in returns vs SPY for beta. For alpha, returns vs tbills or something like that, possibly the alpha straight from linear regression. Those alpha and beta might be the same, which is nice if so, as that would make sense.

Meanwhile, ran across this which I suppose sort of tends to confirm it (or is at least in the same ball park) though any extra clarification welcomed:

"the beta you get from Sharpe's derivation of equilibrium prices is essentially the same beta you get from doing a least-squares regression against the data. (Also note that alpha and beta are standard symbols that statisticians use all the time for this type of regression; Sharpe and his followers weren't trying to be obscure, as some people like to believe.) "
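For anyone who wants to check the quoted claim numerically, here is a minimal sketch. It assumes made-up return series (the numbers and the helper name `ols_alpha_beta` are hypothetical, not from the notebook) and uses plain NumPy rather than statsmodels. It compares the slope from a least-squares fit with the covariance-over-variance formula usually quoted for market beta.

```python
import numpy as np

def ols_alpha_beta(y, x):
    """Least-squares intercept (alpha) and slope (beta) of y regressed on x."""
    X = np.column_stack([np.ones(len(x)), x])      # intercept column + regressor
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

# Hypothetical daily returns: a "market" series and a stock that partly follows it.
rng = np.random.default_rng(0)
market = rng.normal(0.0005, 0.01, 500)
stock = 0.0002 + 1.3 * market + rng.normal(0, 0.005, 500)

alpha, beta = ols_alpha_beta(stock, market)

# Textbook formula for market beta: Cov(stock, market) / Var(market).
beta_cov = np.cov(stock, market)[0, 1] / np.var(market, ddof=1)

print(alpha, beta, beta_cov)
```

The two beta values agree up to floating-point error, so in this setup beta is still a slope, just a slope of stock returns against market returns rather than of price against time.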

If anyone is interested in understanding linear regression, the notebook above is helpful and the following is not to be missed, both visual and interactive as well: http://setosa.io/ev/ordinary-least-squares-regression/

@Blue - the terms alpha and beta may arguably be overused in engineering (disclaimer - I used to be an engineer) but in finance they are very commonly used terms. I have beefed up the multicollinearity section in my sheet a bit and also added a hedging example I have been playing with.

The basic idea is that as a trader you are looking for an 'edge' and it is preferably one that is independent of the market (in most cases the S&P). So let's say you decide that FSLR somehow has meaning to your system, but it can also be some sort of indicator or fundamental measure. The first thing you probably want to do is to figure out if it's really FSLR (or your indicator's values) that's giving you that edge (a.k.a. alpha) or if it is in part or even perhaps mostly the market, which as you may have guessed is referred to as beta. In other words:

net alpha = raw alpha - beta * market
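As a rough sketch of that formula (not the hedging example from the notebook), the snippet below assumes synthetic returns and a hypothetical helper `hedge_out_market`. It estimates beta against the market, subtracts beta * market from the raw returns, and checks how much market exposure is left.

```python
import numpy as np

def hedge_out_market(raw, market):
    """Return (beta, hedged), where hedged = raw - beta * market per period."""
    beta = np.cov(raw, market)[0, 1] / np.var(market, ddof=1)
    return beta, raw - beta * market

# Hypothetical data: one year of daily market returns and a strategy
# whose returns are partly market-driven.
rng = np.random.default_rng(1)
market = rng.normal(0.0005, 0.01, 250)
raw = 0.001 + 0.8 * market + rng.normal(0, 0.004, 250)

beta, hedged = hedge_out_market(raw, market)

corr_before = np.corrcoef(raw, market)[0, 1]     # strongly tied to the market
corr_after = np.corrcoef(hedged, market)[0, 1]   # essentially zero in-sample

print(beta, corr_before, corr_after)
```

The in-sample correlation of the hedged series with the market drops to zero by construction. In live trading beta has to be estimated from past data, so the hedge will only be approximate.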


You can, for example, obtain the slope (beta) of a list of prices using linear regression like this, where the data points are regressed against an evenly spaced index plus a constant term for the intercept. I think params has two elements, so params[-1] is beta, the second of the two, and alpha is its zero-index counterpart, which in this case is maybe not interesting.

    import statsmodels.api as sm

    def slope(in_):
        # x axis: evenly spaced steps -n+1 .. 0, plus a constant column for the intercept
        cnstnt = sm.add_constant(range(-len(in_) + 1, 1))
        return sm.OLS(in_, cnstnt).fit().params[-1]  # slope, regression beta

    # Minutes, note dropna(), important
    slp = slope(data.history(stock, 'close', 60, '1m').dropna())
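To see what the two params elements are, here is a small self-contained sketch. It assumes a made-up straight-line price series and swaps in NumPy's `lstsq` for `sm.OLS` so it runs outside Quantopian. The helper name `intercept_and_slope` is hypothetical.

```python
import numpy as np

def intercept_and_slope(prices):
    """Fit prices against the x axis -n+1 .. 0 and return (alpha, beta)."""
    prices = np.asarray(prices, dtype=float)
    x = np.arange(-len(prices) + 1, 1)             # same x axis as the snippet above
    X = np.column_stack([np.ones(len(x)), x])      # constant column first, then x
    params, *_ = np.linalg.lstsq(X, prices, rcond=None)
    return params[0], params[-1]                   # intercept, slope

# Hypothetical series: 60 bars rising by exactly 0.5 per bar from 100.
prices = 100.0 + 0.5 * np.arange(60)
alpha, beta = intercept_and_slope(prices)

print(alpha, beta)
```

Because the x axis ends at 0, the intercept is the fitted value at the most recent bar (129.5 here), and the slope is the change per bar (0.5). So alpha is not entirely uninteresting in this setup: it is the trend line's estimate of the latest price.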


And you can even obtain slope as a pipeline factor for multiple stocks simultaneously, that is, for all stocks you are screening, because statsmodels likes the ndarray that pipeline provides. You would still need to screen the result for NaNs at some point, for example in before_trading_start:

    def make_pipeline(context):
        pipe = Pipeline()

        [...]

    class Slope(CustomFactor):
        inputs = [USEquityPricing.close]
        def compute(self, today, assets, out, closes):
            out[:] = slope(closes)

    def slope(in_):     # Return slope of regression line.
        return sm.OLS(in_, sm.add_constant(range(-len(in_) + 1, 1))).fit().params[-1]  # slope
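The multi-stock case can be sketched outside of pipeline as well. This is only an illustration under assumptions: the `closes` array is made up (rows are bars, columns are stocks), NumPy's `lstsq` stands in for `sm.OLS`, and the helper `slopes_by_column` is hypothetical. It also shows one way to deal with NaNs up front by skipping any column that has a missing bar.

```python
import numpy as np

def slopes_by_column(closes):
    """Regression slope of each column against the x axis -n+1 .. 0.

    Columns containing any NaN get a NaN slope instead of poisoning the fit.
    """
    closes = np.asarray(closes, dtype=float)
    n = closes.shape[0]
    X = np.column_stack([np.ones(n), np.arange(-n + 1, 1)])
    out = np.full(closes.shape[1], np.nan)
    ok = ~np.isnan(closes).any(axis=0)             # columns with no missing bars
    if ok.any():
        # One least-squares solve handles all clean columns at once.
        params, *_ = np.linalg.lstsq(X, closes[:, ok], rcond=None)
        out[ok] = params[-1]                       # last row holds the slopes
    return out

# Hypothetical window: 10 bars for 3 stocks, the third with a missing bar.
t = np.arange(10, dtype=float)
closes = np.column_stack([10 + 1.0 * t, 50 - 2.0 * t, 5 + 0.0 * t])
closes[3, 2] = np.nan

slopes = slopes_by_column(closes)
print(slopes)
```

The first two slopes come out as 1 and -2, and the third is NaN, which can then be filtered out with a screen or in before_trading_start.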


^^ I fixed a few errors just now and have re-attached the latest version here.
