Quantopian Lecture Series: Linear Regression

This notebook is a simple introduction to linear regression. It is a companion notebook to our beta-hedging notebook, which is available here.

This is part of the Quantopian Lecture Series. We are currently developing a quant finance curriculum and will be releasing clone-able notebooks and algorithms to go along with this lecture. This notebook will be presented at our meetup.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

16 responses

Thanks Delaney,

You might also want to consider weighted least squares. Seems like the traded volume should be incorporated into the fit, so that more weight in the objective function is given to data at higher volumes (e.g. if only 100 shares of TSLA changed hands over a given day, the data point should carry a lot less weight in the fit than if 1,000,000 shares changed hands).

Grant

Nice tutorial, Delaney. You should probably mention spurious regression so that we all understand where to apply it.

That's a great example of how false assumptions about underlying conditions can cause tests to fail, Pravin. It appears that to avoid spurious false positives in regression tests, it is important to test for the stationarity of your estimated regression coefficients. We will release a full lecture discussing these kinds of issues in a few weeks.

Hello!

We will be hosting a live webinar for this first lecture in our summer series, "The Art of Not Following the Market" on July 9th at 12pm ET. You can register to attend here: http://bit.ly/DontFollowTheMarket. We will also share the recording with the community afterwards.

@Delaney Bloomberg also has a beta page (http://sbufaculty.tcu.edu/mann/__INV%20II%20F2012/Colgate%20beta%20August%202012.gif) in which it estimates the beta, the R-squared and the correlation. It would be amazing if the linear regression here at Quantopian yielded the same results as Bloomberg (the same beta, the same R-squared, the same standard error) ;)

A brief tutorial I found helpful on how to read the OLS results summary from statsmodels

http://www.datarobot.com/blog/ordinary-least-squares-in-python/

Hello! Out of curiosity, where is alpha 0.3 and 0.9 coming from?

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels import regression

def linreg(X, Y):
    # Running the linear regression
    X = sm.add_constant(X)
    model = regression.linear_model.OLS(Y, X).fit()
    a = model.params[0]  # Intercept
    b = model.params[1]  # Slope
    X = X[:, 1]  # Drop the constant column again for plotting

    # Return summary of the regression and plot results
    X2 = np.linspace(X.min(), X.max(), 100)
    Y_hat = X2 * b + a
    plt.scatter(X, Y, alpha=0.3)  # Plot the raw data
    plt.plot(X2, Y_hat, 'r', alpha=0.9)  # Add the regression line, colored in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model.summary()

Thanks!

Ah, those are just parameters to the plotting functions that set the transparency. It's admittedly confusing as we're discussing linear regression, which would traditionally use parameter names alpha and beta.

Makes perfect sense now. Thanks Delaney!

• Linear regression gives us a specific linear model, but is limited to cases of linear dependence.
• Correlation is general to linear and non-linear dependencies, but doesn't give us an actual model.
• Both are measures of covariance.

Is that to say that correlation can capture non-linear relationships? I thought it couldn't (or that you have to use the Spearman rank calculation)?

Both can capture linear components of a non-linear relationship; as a rule of thumb, Spearman rank will be more robust to oddities in the data. For instance, if $Y = X^2$, then even though the relationship is quadratic, there will be a linear effect for X > 0. Namely, as X increases so does Y, and as such the series will covary. This is also known as monotonicity (https://en.wikipedia.org/wiki/Monotonic_function). However, if we do not restrict to X > 0, then a correlation will be confused, as there's a whole new section of the curve which effectively cancels out the linear relationship.

In practice it is difficult to detect non-linear relationships, because they are more complex, and if you throw a ton of different functions at a data set, you'll just overfit. The reason people come up with non-linear models is because they follow from some understanding of the mechanics of the system or some mathematical derivation. Once you have the model you can check the fit, but I wouldn't expect to look at the data and be able to determine that it's actually a $Y = X^3 + Z^{0.5}$ relationship. Does this make sense?
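The $Y = X^2$ example can be checked numerically. The sketch below compares Pearson and Spearman correlation on a positive-only grid and on a grid symmetric around zero; the grids and the helper name `correlations` are chosen purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def correlations(x):
    """Return (Pearson, Spearman) correlation between x and x**2."""
    y = x ** 2
    pearson, _ = pearsonr(x, y)
    spearman, _ = spearmanr(x, y)
    return pearson, spearman


# X restricted to positive values: Y = X^2 is monotonic here.
x_pos = np.arange(1, 101) / 20.0

# X symmetric around zero: the left half of the parabola mirrors the right.
x_sym = np.arange(-100, 101) / 20.0

pearson_pos, spearman_pos = correlations(x_pos)
pearson_sym, spearman_sym = correlations(x_sym)

print("X > 0:       Pearson =", pearson_pos, " Spearman =", spearman_pos)
print("symmetric X: Pearson =", pearson_sym, " Spearman =", spearman_sym)
```

On the positive grid the rank correlation is perfect because the relationship is monotonic, while Pearson is high but below one. On the symmetric grid both measures collapse towards zero even though Y is fully determined by X.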

In practice it is difficult to detect non-linear relationships

Not so sure about that one, as a general statement. If you know what you are looking for, then weak signals buried in noise can be extracted. The lock-in amplifier is a classic example, where "signals up to 1 million times smaller than noise components, potentially fairly close by in frequency, can still be reliably detected." There's probably some theorem out there, but finding a needle in a haystack is possible, under certain assumptions. It would seem that if you know exactly what non-linear relationship you are looking for, and can distinguish the needle from the hay (e.g. pick out all of the hay, and you'll be left with a needle), you'll be able to extract the signal from the noise.
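As a rough software analogue of the lock-in idea, the sketch below recovers the amplitude of a sinusoid of known frequency from noise that is much larger than the signal. All numbers (frequency, sample rate, noise level) and the function name `lock_in_amplitude` are assumptions made for this example, not a model of any particular instrument.

```python
import numpy as np


def lock_in_amplitude(signal, freq_hz, sample_rate_hz):
    """Estimate the amplitude of a sinusoid at a known frequency.

    Multiplies the signal by in-phase and quadrature references and
    averages, so components at other frequencies (including broadband
    noise) tend to cancel out.
    """
    t = np.arange(signal.size) / sample_rate_hz
    in_phase = np.mean(signal * np.cos(2.0 * np.pi * freq_hz * t))
    quadrature = np.mean(signal * np.sin(2.0 * np.pi * freq_hz * t))
    return 2.0 * np.hypot(in_phase, quadrature)


rng = np.random.default_rng(1)
sample_rate_hz = 1000.0
freq_hz = 50.0
n = 400_000  # a whole number of periods at 50 Hz

t = np.arange(n) / sample_rate_hz
true_amplitude = 0.2
noise_std = 5.0  # noise standard deviation is 25x the signal amplitude

signal = true_amplitude * np.sin(2.0 * np.pi * freq_hz * t + 0.7)
noisy = signal + rng.normal(0.0, noise_std, n)

estimate = lock_in_amplitude(noisy, freq_hz, sample_rate_hz)
print("true amplitude:", true_amplitude, " estimated:", estimate)
```

The extraction works only because the functional form and the frequency are known in advance, which is the condition stated in the post above.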

Hi all. Newbie here and trying to use this notebook to teach myself. I have external data that I would like to run through the model. It is data from the South African Index (J200) and the South African Rand (ZAR). I would like to import the data via a CSV, but I have no idea how to use the data once imported, or whether I should synchronize the dates and symbols in the file or whether the code will automatically plot each symbol associated with the date. My main issue was what to do in this section of the code:

SAS = local_csv('ZARvsJ200.csv')
start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('J200', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('ZAR', fields='price', start_date=start, end_date=end)

# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NaN

r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]

linreg(r_b.values, r_a.values)

Thank you.

David, I suggest you start a new thread, and attach the notebook that is causing the problem.

Hi, trying out this notebook and have a question about .summary(). It is not showing all the values filled out when I run it in the notebook. Another thread mentioned something about .summary() not being whitelisted on the IDE - not sure if it is related. Tried .summary2() but still not seeing values like R2 and Adj. R2 etc. Should I be using another function?

I just cloned the notebook and re-ran all the cells and the summaries seemed to print fine for me. Would you be able to reply and post the notebook that's giving you issues? Also in general you can manually access all the stuff you need as attributes on the model object.