Linear Regression for Fundamental Factors


I'm quite new to Quantopian and I'm exploring the possibility of using machine learning for my algorithms. I started with an admittedly naive notebook to see if I can find any linear correlation between some fundamental factors and the quarterly returns.

As I developed and tested this notebook I ran into both financial and coding doubts, and I'm sure you can point me in the right direction:

  • To normalize values and avoid outliers I used the equity's relative rank on the factor instead of the absolute value. How reasonable do you think this approach is?
  • For all factors, the R^2 of my linear regression is very low (always below 0.1). Is my approach incorrect, or is there simply no obvious linear correlation between these common factors and the returns?
  • Any way to improve the pipeline performance? Is it possible to run it monthly instead of daily?
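
For readers following along, here is a minimal sketch of the two ideas in the first two questions: ranking a factor across equities on each date, and measuring the R^2 of a one-variable linear fit. It is not the notebook's code (the preview is unavailable); the function names and the sample data are hypothetical, and it assumes a wide DataFrame with one row per date and one column per equity.

```python
import numpy as np
import pandas as pd


def cross_sectional_rank(factor: pd.DataFrame) -> pd.DataFrame:
    """Rank each row (one date) across columns (equities), scaled to (0, 1]."""
    return factor.rank(axis=1, pct=True)


def r_squared(x, y) -> float:
    """R^2 of a least-squares straight-line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical factor values: 2 dates x 4 equities, with one extreme outlier.
factor = pd.DataFrame(
    [[1.0, 2.0, 3.0, 1000.0], [4.0, 3.0, 2.0, 1.0]],
    columns=["A", "B", "C", "D"],
)
ranks = cross_sectional_rank(factor)

# A perfectly linear relationship, used only to sanity-check r_squared.
r2 = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
```

Ranking keeps the ordering but discards the distance between values, so the outlier in equity D ends up with the same rank it would have had at 4.0. That is the trade-off being asked about in the first question.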

I also hope this could be a useful structure for any other person playing around with fundamental factors.

Thank you in advance for your help!

10 responses

I had better luck with computing a z score for all factors to normalize instead of a rank. I think that makes more intuitive sense too.

One trick I quickly learned with the z score is to remove or at least subdue any outliers. For most of my factors I explicitly remove stocks whose z score has an absolute value above 2 when scoring the factor. Another approach is to set all z scores above 2 equal to 2 and all z scores below -2 equal to -2. For a normal distribution, though, anything outside of 2 should represent only about 5% of the population.
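
A minimal sketch of both outlier treatments described above, assuming the factor values for one date sit in a pandas Series indexed by equity. The function names and the sample values are hypothetical; the z score here is computed with the outlier still in the sample, which is one of several reasonable choices.

```python
import math

import pandas as pd


def zscore(values: pd.Series) -> pd.Series:
    """Standardize to zero mean and unit (population) standard deviation."""
    return (values - values.mean()) / values.std(ddof=0)


def clip_outliers(z: pd.Series, limit: float = 2.0) -> pd.Series:
    """Cap z scores at +/- limit instead of discarding them."""
    return z.clip(lower=-limit, upper=limit)


def drop_outliers(z: pd.Series, limit: float = 2.0) -> pd.Series:
    """Remove entries whose absolute z score exceeds limit."""
    return z[z.abs() <= limit]


# Hypothetical factor values for ten equities; "J" is an extreme outlier.
values = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],
    index=list("ABCDEFGHIJ"),
)
z = zscore(values)
clipped = clip_outliers(z)   # "J" is capped at 2.0
kept = drop_outliers(z)      # "J" is removed, nine equities remain

# Share of a standard normal distribution lying outside +/- 2.
two_sided_tail = math.erfc(2.0 / math.sqrt(2.0))
```

Clipping keeps the equity in the universe with a capped score, while dropping shrinks the universe. Which one is preferable depends on whether the extreme value is considered informative or a data error.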

Hi Francesco,

Have you had a look at the Alphalens tool available here on Quantopian? It is meant to do exactly the kind of analysis you did, that is, examine the relationship between a factor and future returns. The best explanation of the tool I found is contained in this video.

Also, here is a NB that shows you what information Alphalens can give you about the factors you are interested in. I ran the tool only for 'EBITDA/REVENUE', but you can simply uncomment the factors you'd like to analyze and run the NB again.


Stephen, Luca,

Thanks for your responses. Both the z-score and Alphalens are very interesting approaches for the analysis of factors.
I was also wondering how this kind of factor analysis could be integrated into the investment algorithm so that it adapts dynamically to changing market conditions and to diminishing returns for specific factors.

Thanks again for the help!

You could have a look at the 3 posts "Machine learning on Quantopian" by Thomas Wiecki. The NBs focus on alpha combination, that is, how to combine multiple alpha factors together, but they are also related to your question because the ML model proposed by Thomas is trained on a rolling window of factor data, so it evolves with time and reflects recent market conditions. The ML approach therefore combines factors together and also "weights" them according to recent performance. I believe that is what you are looking for.

I personally found Thomas' approach very useful, even though I use my own alpha combination algorithm. The ML model proposed didn't give me extremely good results, so I developed my own algorithm, but still the model is a good base on which you can make your own customization.
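
To make the rolling-window idea concrete, here is a minimal sketch that refits factor weights on the most recent observations at each step. It is not the model from the "Machine learning on Quantopian" posts: ordinary least squares is swapped in purely for illustration, and the function name, window length and synthetic data are all assumptions.

```python
import numpy as np


def rolling_factor_weights(factors, forward_returns, window):
    """Refit least-squares factor weights on a trailing window of rows.

    Row i of the result holds the weights fitted on rows i-window+1 .. i;
    earlier rows are NaN because the window is not yet full.
    In live use the training rows must be lagged so that every forward
    return in the window is already known at fitting time.
    """
    factors = np.asarray(factors, dtype=float)
    forward_returns = np.asarray(forward_returns, dtype=float)
    n_obs, n_factors = factors.shape
    weights = np.full((n_obs, n_factors), np.nan)
    for end in range(window, n_obs + 1):
        X = factors[end - window:end]
        y = forward_returns[end - window:end]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        weights[end - 1] = coef
    return weights


# Synthetic data: the "true" weights change halfway through the sample.
rng = np.random.default_rng(0)
factors = rng.normal(size=(60, 2))
true_early = np.array([0.5, -0.2])
true_late = np.array([-0.1, 0.4])
forward_returns = np.concatenate(
    [factors[:30] @ true_early, factors[30:] @ true_late]
)

weights = rolling_factor_weights(factors, forward_returns, window=20)
```

Because each fit only sees the last 20 rows, the weights at row 29 match the early regime and the weights at row 59 match the late one, which is the "adapts to recent conditions" behaviour described above. Real factor data is noisy, so the fitted weights would be far less clean than in this synthetic case.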

I will, thanks Luca!


Just a suggestion, but you should go on and read some papers. At least from what I've read/studied/practically applied, the number of fundamental factors that really affect returns is pretty small. And the ones that are statistically significant have either been arb'd away or are being shunned by the market (Value vs Growth right now). Cliff Asness of AQR Capital Management has a lot of papers on SSRN. Read them, you'll learn a ton, and even if it doesn't answer this question, it might help you narrow down your search for factors.

That being said, factor analysis is a great way to see what the market prefers at points in time (Example - high dividend stocks during the summer of 2016).

How did you calculate returns for your graph? Are they forward returns? Both axes have extremely large numbers. I haven't looked at your code, and I don't know anything about the equity rank factor, but I don't think you need to use it. I think a simple % revenue growth regressed against forward % returns would do the trick.

Test out a factor that is well known: Book-to-Market (Book Value Per Share / Current Price). Write a line of code for forward returns that can easily be manipulated so you can change the length.

Something like this would suffice:
Close.shift(-12)/Close - 1, which would be 1yr forward returns at T = 0 assuming monthly data. If you go over 1yr, annualize your returns.
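
The one-liner above can be wrapped so the horizon is easy to change. This is a minimal sketch assuming a pandas Series of monthly closes; the function name, the `periods_per_year` parameter and the synthetic price series are hypothetical, and horizons longer than one year are annualized geometrically.

```python
import pandas as pd


def forward_returns(close: pd.Series, periods: int,
                    periods_per_year: int = 12) -> pd.Series:
    """Forward return over `periods` rows, annualized if longer than 1 year."""
    raw = close.shift(-periods) / close - 1.0
    years = periods / periods_per_year
    if years > 1:
        # Geometric annualization of a multi-year return.
        return (1.0 + raw) ** (1.0 / years) - 1.0
    return raw


# Synthetic monthly closes growing 1% per month for 3 years.
close = pd.Series([100.0 * 1.01 ** i for i in range(36)])

fwd_1y = forward_returns(close, periods=12)   # plain 1yr forward return
fwd_2y = forward_returns(close, periods=24)   # 2yr return, annualized
```

The last `periods` rows are NaN because their forward prices are not in the sample yet; they need to be dropped before running a regression.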

You should get a much clearer scatter than the one above.

Hi Tom,

Thanks for the suggestion. I'll look through the papers and see how to integrate them in my strategy.

Regarding your questions, my returns (ranked in the graph) are calculated through the Returns factor over a 91 calendar day window (about a quarter). I then shift this factor backward by the time window so that the same DataFrame row holds the latest Factors and the future Returns.

I did this operation on the dataframe because I didn't know you could pick forward returns, so I had to "wait" enough days to match the Factors at day 0 with the Returns at day 91. I'll try your strategy which should definitely be simpler.
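
The alignment described here can be sketched as follows, assuming a DataFrame indexed by (date, asset), which is the shape a pipeline output usually has. The column names, the function name and the sample data are hypothetical. Note that the shift counts rows per asset rather than calendar days, so the window has to be expressed in rows.

```python
import pandas as pd


def add_future_returns(df: pd.DataFrame, window: int,
                       returns_col: str = "trailing_return") -> pd.DataFrame:
    """Pull each asset's trailing return `window` rows back in time.

    After the shift, the row for day t holds the factor known at t and the
    return realized between t and t + window rows.
    """
    out = df.copy()
    out["future_return"] = (
        out.groupby(level="asset")[returns_col].shift(-window)
    )
    # The last `window` rows of each asset have no future return yet.
    return out.dropna(subset=["future_return"])


# Hypothetical data: 5 dates x 2 assets, in date-major order.
dates = pd.date_range("2017-01-02", periods=5, freq="D")
index = pd.MultiIndex.from_product(
    [dates, ["AAA", "BBB"]], names=["date", "asset"]
)
df = pd.DataFrame(
    {
        "factor": range(10),
        "trailing_return": [0.00, 0.10, 0.01, 0.11, 0.02,
                            0.12, 0.03, 0.13, 0.04, 0.14],
    },
    index=index,
)

aligned = add_future_returns(df, window=2)
```

Grouping by asset before shifting matters: a plain `shift` on the whole frame would mix one asset's returns into another asset's rows.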


I strongly agree that you should be using Alphalens. Please check out this lecture; I think it will help set up a framework for you. Also see this library of canonical factors.

In practice you probably won't find a ton of signal where fundamental values naively correlate with returns. They're very well studied. What's more interesting is building a more sophisticated model based on a real economic hypothesis and testing that using Alphalens. Think about ways in which fundamental factors could assist in making a decision, as opposed to them being the only data used.


The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Thanks Delaney,

I will look into Alphalens and the Github library. My goal so far was to set up a simple but correct framework to analyze factors and then build more sophisticated hypotheses on top of it. Given my lack of experience, I was afraid that my notebook implementation was not correct, and I did not want to complicate something I was not sure about.

I'll thus assume that the implementation is, in principle, correct, and that it's entirely expected to find no signal with these naive factors.

Thanks again for the support!

@Francesco, I believe your implementation is correct and the questions you asked at the beginning of this thread are good questions. I started my factor analysis with an approach similar to yours. Then the Alphalens project started and I saw it gave me exactly the information I needed from a factor (and more statistical information too). If you take the time to learn it and explore the source code, you'll find clever ideas and solutions that will save you lots of time compared with starting from scratch.