Winsorize and zscore on a per sector basis


I have an algorithm with a couple of factors, and I winsorize and normalize (using zscore) each factor separately. I would like to keep processing each factor separately, but do it on a per-sector basis: for each stock and factor, winsorize and normalize using only the stocks in the sector that stock belongs to. What would be a clean way to do this?

Thanks in advance.

6 responses

Hi Marc,

You can get the revenue-based sector using the RBICSFocus dataset, and then use the groupby argument in both winsorize and zscore to segment the set of assets by sector. Something like this:

from quantopian.pipeline.data.factset import RBICSFocus

sector = RBICSFocus.l1_name.latest  
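# winsorize also needs min_percentile and max_percentile; 0.05 and 0.95 are example bounds only.
my_factor.winsorize(min_percentile=0.05, max_percentile=0.95, groupby=sector).zscore(groupby=sector)  # You can provide your other arguments in here, too.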
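
To make the effect of groupby concrete, here is a minimal sketch of the same idea in plain pandas, outside of pipeline. It is not Quantopian's implementation: the helper name winsorize_zscore_by_group, the sample data, the 5%/95% bounds, the use of interpolated quantiles, and the population standard deviation (ddof=0) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def winsorize_zscore_by_group(values, groups, lower=0.05, upper=0.95):
    """Clip each group's values to that group's own quantiles, then z-score within the group.

    Hypothetical helper for illustration; not part of the pipeline API.
    """
    def _one_group(x):
        # Bounds come from this group's values only, not from the whole universe.
        lo, hi = x.quantile(lower), x.quantile(upper)
        clipped = x.clip(lower=lo, upper=hi)
        # ddof=0 (population standard deviation) is an assumption of this sketch.
        return (clipped - clipped.mean()) / clipped.std(ddof=0)

    return values.groupby(groups).transform(_one_group)


# Made-up factor values for eight stocks in two made-up sectors.
factor = pd.Series([1.0, 2.0, 3.0, 100.0, 10.0, 20.0, 30.0, -500.0],
                   index=list("ABCDEFGH"))
sector = pd.Series(["tech"] * 4 + ["energy"] * 4, index=factor.index)

result = winsorize_zscore_by_group(factor, sector)
print(result)
```

Because every statistic is computed inside a sector, the outlier in one sector (100.0 or -500.0 here) has no influence on the scores of stocks in the other sector.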

Relevant documentation pages:
- RBICSFocus
- winsorize
- zscore

I hope this helps!



Thanks Jamie,

That is what I was looking for. The same works using Morningstar data:

from quantopian.pipeline.classifiers.morningstar import Sector

sector = Sector()  
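# winsorize also needs min_percentile and max_percentile; 0.05 and 0.95 are example bounds only.
my_factor.winsorize(min_percentile=0.05, max_percentile=0.95, groupby=sector).zscore(groupby=sector)  # You can provide your other arguments in here, too.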

@Jamie, leaving the winsorization part aside, would that be equivalent to what I show next?

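from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import QTradableStocksUS

# Morningstar sector codes, hard-coded.
sectors = [101,102,103,104,205,206,207,308,309,310,311]
pipeline_columns = {}
for s in sectors:
    # my_factor is used here as a factor class, so each instance gets its own per-sector mask.
    pipeline_columns['factor_sector_'+str(s)] = my_factor(mask=QTradableStocksUS() & Sector().eq(s)).zscore()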

For each stock, I would get NaN in every column except the one for the sector that stock belongs to.
Then, after getting the pipeline output, you could sum along axis 1. Something like:

alpha = pipeline_output('factor_pipeline').sum(axis=1).dropna()  
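
As a rough check of the question, the two approaches can be compared in plain pandas, outside of pipeline. This is only a sketch: the zscore helper, the sample values, and the two sector codes are made up for illustration, and it says nothing about how pipeline computes these terms internally.

```python
import numpy as np
import pandas as pd


def zscore(x):
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    return (x - x.mean()) / x.std(ddof=0)


# Made-up factor values and sector codes for six stocks.
factor = pd.Series([1.0, 2.0, 4.0, 10.0, 20.0, 60.0], index=list("ABCDEF"))
sector = pd.Series([101, 101, 101, 102, 102, 102], index=factor.index)

# Approach 1: a single grouped transform (the analogue of groupby=sector).
grouped = factor.groupby(sector).transform(zscore)

# Approach 2: one column per sector, NaN elsewhere, then a row-wise sum.
columns = {
    "factor_sector_" + str(s): zscore(factor[sector == s])
    for s in sector.unique()
}
per_sector = pd.DataFrame(columns, index=factor.index)
combined = per_sector.sum(axis=1)

print(np.allclose(grouped, combined))
```

In this toy case every stock has exactly one non-NaN column, so the row sum simply picks out that value and the two results agree.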

I got the idea from another post by @Grant, but I'm not sure whether the behaviour would be the same (https://www.quantopian.com/posts/limit-qtu-to-specfic-industry-sector-and-market-cap)

Thanks in advance

Any feedback on this?


@Marc - That appears to do the same thing, though be careful with the last step. I'm not sure how the sum behavior works on the pandas dataframe when NaNs are involved. In some cases, summing a numerical value with NaN might yield a NaN result, so you may need to use something like numpy.nansum() in the combination step.
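
For reference, a small sketch of how the row sum treats NaN in pandas. It assumes a reasonably recent pandas (the min_count argument and the all-NaN result of 0.0 are the behaviour of pandas 0.22 and later; older versions may differ), and the column names and values are made up.

```python
import numpy as np
import pandas as pd

# Made-up pipeline-style output: stock C has no value in any sector column.
frame = pd.DataFrame(
    {
        "factor_sector_101": [0.5, np.nan, np.nan],
        "factor_sector_102": [np.nan, -1.2, np.nan],
    },
    index=["A", "B", "C"],
)

default_sum = frame.sum(axis=1)                # skipna=True: NaN is ignored, all-NaN row gives 0.0
strict_sum = frame.sum(axis=1, skipna=False)   # any NaN in a row makes the row NaN
guarded_sum = frame.sum(axis=1, min_count=1)   # all-NaN row stays NaN, so dropna() can remove it
nan_sum = np.nansum(frame.values, axis=1)      # same values as default_sum, as a plain array

print(default_sum, strict_sum, guarded_sum, nan_sum, sep="\n")
```

With the default sum, a stock that has no sector value ends up with 0.0 rather than NaN, so a following dropna() would not drop it. Passing min_count=1 keeps such rows as NaN.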

Correctness aside, is there a reason to prefer the version you shared over the version that uses the groupby argument? I always get nervous when I see a statically defined list of values being used. One of the features of pipeline is that you can define terms in a high-level expression language, without having to know about the assets, sectors, or whatever else exists at any one time. So if Morningstar adds a new sector at some point, your code would start using it if you use the groupby argument. However, the version that uses the statically defined list of sectors that exist today wouldn't pick it up.
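
The difference can be illustrated with a toy pandas example. Everything here is hypothetical: the sector code 999 stands in for a newly added sector, and the zscore helper and sample values are made up.

```python
import pandas as pd


def zscore(x):
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    return (x - x.mean()) / x.std(ddof=0)


factor = pd.Series([1.0, 3.0, 2.0, 6.0, 5.0, 9.0], index=list("ABCDEF"))
# 999 is a hypothetical sector code that did not exist when the list below was written.
sector = pd.Series([101, 101, 102, 102, 999, 999], index=factor.index)

# Static list: only the sectors known in advance get a column.
known_sectors = [101, 102]
static = pd.DataFrame(
    {s: zscore(factor[sector == s]) for s in known_sectors},
    index=factor.index,
).sum(axis=1, min_count=1)

# Grouping by the sector values themselves covers whatever codes are present.
dynamic = factor.groupby(sector).transform(zscore)

print(static)
print(dynamic)
```

Stocks E and F get no score from the static list, while the grouped version scores them within their own sector without any code change.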

@Jamie, thanks a lot for your answer and feedback. If they do the same thing, I actually prefer the method you described: it's simpler and, as you mention, if the external data changes you don't have to manually update anything. My doubt was whether the two methods were supposed to do the same thing or whether I was missing something. It's clear now.