Pipeline Classifiers are Here!

Community members who followed the original announcement posts for the Pipeline API may recall hints about a third expression type in the original system design. Classifiers have been on our roadmap from the very beginning, since they enable a number of important operations that involve grouping Factor outputs. Classifiers were cut from the original launch to make the rest of Pipeline available sooner, but we'd always planned on adding them eventually.

Today, I'm excited to announce that Pipeline's third major expression type is finally available. Attached to this post is a notebook with several detailed examples of working with Classifiers.

Some highlights from the notebook:

• There's now a new base expression type: Classifier. In the same way that Factors are expressions producing numerical-valued results, and Filters are expressions producing boolean-valued results, Classifiers are pipeline expressions producing categorical-valued results. Another way of thinking about classifiers is that they're computations that produce labels for assets. Canonical examples of classifiers are sector codes, and quartiles/quintiles/deciles of another factor (e.g. deciles of stocks by market cap).
• There are two new Factor methods, demean() and zscore(), that take an optional groupby argument, which can be passed a classifier. These methods produce new Factors that apply normalizations to the daily output of the original Factor. A detailed example of how this process works can be found in the new Normalizing Results section of the help docs.
• There are two new builtin classifiers, and several more in the works.
• There are several new Factor methods that produce Classifiers by computing quantile labels. The most general of these is Factor.quantiles, which accepts a bin count as an argument. Convenience aliases are available for quartiles (quantiles(4)), quintiles (quantiles(5)), and deciles (quantiles(10)).
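To make the quantile-labeling and grouped-normalization ideas concrete outside the Pipeline API, here is a rough pandas sketch of what these operations compute on a single day's data (the asset names, values, and group labels are made up for illustration):

```python
import pandas as pd

# Illustrative data: one day's factor values for five assets.
factor = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0],
                   index=["A", "B", "C", "D", "E"])

# quantiles(q): label each asset with its 0-based quantile bin,
# mimicking what a quantile Classifier produces.
quintile_labels = pd.qcut(factor, 5, labels=False)
print(quintile_labels.tolist())  # [0, 1, 2, 3, 4]

# demean(groupby=...): subtract the group mean within each label,
# mimicking Factor.demean with a Classifier passed as groupby.
sectors = pd.Series(["tech", "tech", "energy", "energy", "tech"],
                    index=factor.index)
demeaned = factor - factor.groupby(sectors).transform("mean")
print(demeaned.round(2).tolist())  # [-33.33, -32.33, -0.5, 0.5, 65.67]
```

The actual Pipeline machinery applies these computations row by row over a 2D (dates x assets) output, but the per-day arithmetic is the same.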

I think having the ability to perform grouped aggregations and normalizations opens the door to many more sophisticated quant workflows, so I'm excited to see what the community builds with this new functionality. As always, I'm also interested to hear feedback on how these features could be made more useful. Are there other natural candidates for built-in classifiers (e.g. exchange id or country code) that could enable better algorithms? Are there other normalizations like demean and zscore that could be made Factor methods (one that I have my eye on right now is the existing rank() method)? Are there other interesting possible additions to the Filter/Factor/Classifier algebra? Feedback from users on these (or other) topics would be greatly appreciated.

Happy coding,
-Scott

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

18 responses

w00p! It's exciting to see so much progress in the Pipeline, as it's a steep learning curve to use it, but stuff like this makes it worthwhile.

Does this help solve my issue? I am adding the following to my Pipeline, and getting an error:

sector          = Latest(inputs=[morningstar.asset_classification.morningstar_sector_code]) #e.g. 101 = Basic Materials

TypeError: Latest expected an input of dtype float64, but got int64 instead.


I assume there's no type conversion on Pipeline, even int64->float64. Is the sector identifier best encoded into a Classifier?

EDIT: I see there's a built-in classifier called Sector, which I am experimenting with. One gotcha seems to be that you can't use the "==" to test for equality in the pipeline, and you must use "eq", as in:

from quantopian.pipeline.classifiers.morningstar import Sector
sector = Sector()
sector_filter = sector.eq(Sector.BASIC_MATERIALS)


Hi Scott,

Does this change eliminate the need to apply post-pipeline filters using get_fundamentals (e.g. see Simon's post, https://www.quantopian.com/posts/equity-long-short)?

Grant

@Dan for the specific use-case of sector codes, you probably want to use the Sector builtin. I responded to a user running into a similar issue in this post:

For columns that don't have special built-ins, you're generally better off doing column.latest rather than manually constructing an instance of Latest. The reason for this is that there are separate Latest types for Filter, Factor, and Classifier, and the .latest property intelligently chooses which one to return based on the dtype of the column. (The logic for this lives here, if you're curious.)

Is the sector identifier best encoded into a Classifier?

Yes. Encoding sector code as a Factor is essentially saying that you want to treat it like a numerical quantity. In particular, Factor gives you arithmetic operations like + and *, which aren't really meaningful for something like a sector code which happens to be represented with integers, but semantically just contains labels. The reply I linked above has more details on this.
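A toy example of why arithmetic on label-like integers produces nonsense (the TECHNOLOGY code here is illustrative; 101 for Basic Materials appears earlier in this thread):

```python
# Morningstar-style sector codes are labels, not quantities.
BASIC_MATERIALS = 101
TECHNOLOGY = 311  # illustrative code

# Treating them as numbers lets you compute nonsense: the "average"
# of two sectors is not itself a sector.
average = (BASIC_MATERIALS + TECHNOLOGY) / 2
print(average)  # 206.0 -- not a valid sector code

# What you actually want is a label comparison, which is why
# Classifier exposes eq() rather than arithmetic operators.
print(BASIC_MATERIALS == 101)  # True
```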

One gotcha seems to be that you can't use the "==" to test for equality in the pipeline, and you must use "eq", as in:

Yeah, this is a bit of a wart on the API.

In my first draft I had actually overridden ==, but I removed it because it causes some confusing behavior when expressions are included in collections. For example, if we override == to return a Filter, then any equal-length tuples/lists of pipeline terms will compare equal:

In [1]: class Foo(object):
...:     def __eq__(self, other):
...:         return Foo()
...:

In [2]: f = Foo(); g = Foo()

In [3]: (f, f) == (f, g)
Out[3]: True


This happens because the equality check for tuples is essentially equivalent to something like:

def tuple_eq(left, right):
    for left_elem, right_elem in zip(left, right):
        if not bool(left_elem == right_elem):
            return False
    return True


If left and right are tuples of pipeline terms, then the == operator here produces a Filter, which is "truthy", so the loop never breaks and we end up deciding that the tuples are equal even though they contain different terms.

numpy and pandas work around this problem by explicitly raising an error if you ask if they're truthy:

In [31]: (np.arange(5), np.arange(5)) == (np.arange(5), np.arange(5))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-35b13e95dd25> in <module>()
----> 1 (np.arange(5), np.arange(5)) == (np.arange(5), np.arange(5))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()


which at least gives you an (albeit somewhat cryptic) error instead of a nonsensical return. However, this doesn't quite work in all cases. In particular, tuples that actually contain the same objects will still happily compare equal:

In [32]: arr = np.arange(5)

In [33]: (arr, arr) == (arr, arr)
Out[33]: True


The numpy/pandas solution might end up being the lesser of the two evils in the long run, but it's easier to add in the == override later than it is to remove it if it causes too much confusion, so I haven't bitten the bullet on that yet. At the very least, we could probably do a better job of calling out the oddity of the eq() method.

@Grant

Does this change eliminate the need to apply post-pipeline filters using get_fundamentals (e.g. see Simon's post, https://www.quantopian.com/posts/equity-long-short)?

The new functionality definitely makes it easier to filter by sector/industry or by quantiles of some other calculation. The major piece that's still missing in Pipeline is support for string-based columns. In particular, there isn't yet a good way in Pipeline to filter by exchange, which is often a good way to remove hard-to-trade assets from your universe.

The good news is that I'm actively working on a design for supporting strings, so keep an eye out for an announcement Soon™!

Scott, this looks great. I forget if it was part of this work; does Pipeline still calculate factor data for stocks which will ultimately be screened out? In my StocksOnTheMove algo, I had to calculate the market cap screen within my custom factor for the linear regression calculations, or else it would time out, but that seems like it shouldn't be necessary - if the screen doesn't depend on the factor(s) being calculated, there's no need to calculate them for stocks that don't pass the screen, and Pipeline should have enough info to figure that out?

Though, I suppose you probably want to maintain coherent axes between invocations of compute. Could you pass the mask in as an argument to compute?

Hi Simon, that feature is not yet supported, so as of now factor data is still computed over all stocks, even those which will ultimately be screened out. However we are working on letting you pass a mask to a CustomFactor, which will allow you to filter out stocks before your factor is computed. The work for that can be tracked here.

@scott could you allow equality tests to simple immutable objects like Sector.BASIC_MATERIALS but throw an exception for anything else?

@Dan we could but then you'd get the same odd behavior if you did something like:

(1, 2) == (some_factor, some_other_factor)


Since the element-wise comparisons there would each return Filters, the result would end up being that the tuple comparison would return True.

The more numpy/pandas-esque solution would be to override __eq__ to have semantics analogous to the other comparisons, and then also override __nonzero__ (__bool__ in Python 3) to make them fail with an error indicating that you shouldn't try to compare collections of pipeline expressions.
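A minimal sketch of that approach (class names here are illustrative, not the actual Zipline implementation):

```python
class Filter:
    """Result of comparing pipeline terms: an expression, not a bool."""
    def __bool__(self):  # would be __nonzero__ in Python 2
        raise TypeError(
            "The truth value of a pipeline Filter is ambiguous."
        )
    __nonzero__ = __bool__

class Term:
    def __eq__(self, other):
        return Filter()

f, g = Term(), Term()
print(type(f == g).__name__)  # Filter

try:
    (f, f) == (f, g)  # tuple comparison truth-tests the Filter...
except TypeError:
    print("raised TypeError")  # ...which now fails loudly

# The caveat from above still applies: identity short-circuits, so
# tuples of the *same* object compare equal without calling __eq__.
print((f, f) == (f, f))  # True
```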

Hi Scott,

I noticed on the help page:

zscore() is only supported on Factors of dtype float64

This seems overly restrictive. How does one deal with volumes, for example, which will be integers?

Picking up on Simon's comment above, I've been a bit confused about the overall computational flow, and how pipeline interacts with before_trading_start, which allows a 5-minute computational window, prior to the open. It doesn't seem feasible to winnow down to a relatively small collection of stocks using pipeline and daily data, and then do computations using minutely data on that universe, within before_trading_start. The problem seems to be that update_universe doesn't actually take effect within before_trading_start and history is not supported within before_trading_start. I've been working to cobble something together (see https://www.quantopian.com/posts/code-for-getting-minute-data-into-before-trading-start) but it is pretty ugly. Was this by design, or is it something that will be fixed? Or maybe pipeline could allow pulling of minutely data, with a restriction on the number of securities (e.g. 500 max)?

I would like to rank within sectors. So, it would be

ranked = Returns(window_length=21).rank(groupby=Sector())


This will mean multiple stocks will get the same rank (one per sector). You would then need to choose the top and bottom N per sector, which presumably is something like this:

top = ranked.top(N, groupby=Sector())
bottom = ranked.bottom(N, groupby=Sector())


I am currently doing this via:

ranked = Returns(window_length=21).zscore(groupby=Sector()).rank()
top = ranked.top(N)
bottom = ranked.bottom(N)


Which is an approximation at best.

@Dan I agree that rank/top/bottom all have natural interpretations as grouped operations. I've opened an issue on Zipline about it.
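In the meantime, the intended grouped-rank semantics can be illustrated with pandas on a single day's output (the data here is made up):

```python
import pandas as pd

# One day's returns and sector labels for six assets.
df = pd.DataFrame({
    "returns": [0.05, 0.01, -0.02, 0.03, 0.07, -0.01],
    "sector":  ["tech", "tech", "tech", "energy", "energy", "energy"],
}, index=["A", "B", "C", "D", "E", "F"])

# rank(groupby=Sector()) would rank within each sector independently:
df["sector_rank"] = df.groupby("sector")["returns"].rank(ascending=False)

# top(N, groupby=Sector()) would then take the top N per sector:
N = 1
top_per_sector = df[df["sector_rank"] <= N]
print(sorted(top_per_sector.index))  # ['A', 'E']
```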

@Grant

zscore() is only supported on Factors of dtype float64

This seems overly restrictive. How does one deal with volumes, for example, which will be integers?

All values that are semantically numerical quantities are represented as floats in Pipeline. This includes columns like volume that would seem more naturally represented as integers. The reason for this is that there's no natural value we can use to represent missing data in integer arrays, whereas with floating point data, we can use NaN.
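A quick numpy demonstration of the underlying constraint:

```python
import numpy as np

# Float arrays have a natural missing-data sentinel, NaN:
volume = np.array([1.5e6, np.nan, 3.2e6])
print(np.isnan(volume))  # [False  True False]

# Integer arrays have no such sentinel; coercing NaN fails outright:
try:
    np.array([100, np.nan, 300], dtype=np.int64)
except ValueError:
    print("cannot store NaN in an int64 array")
```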

The issue outlined in your second question is a large part of why history() is now supported in before_trading_start in Q2 (announced at QuantCon). In general I'd expect the workflow to be something like:

• Use Pipeline to query for low-frequency data from a variety of sources to produce a small-ish (on the order of hundreds) set of stocks that you want to trade on any given day.

• Use history, either in before_trading_start or in a scheduled function to further winnow down stocks if necessary, and to make decisions about portfolio construction and order execution. This is now feasible in Q2 because we've removed the need to explicitly pre-declare your universe in before_trading_start, so you can just query freely with the set of stocks produced by your pipeline.

-Scott

Hi Simon, that feature is not yet supported, so as of now factor data is still computed over all stocks, even those which will ultimately be screened out. However we are working on letting you pass a mask to a CustomFactor, which will allow you to filter out stocks before your factor is computed. The work for that can be tracked here.

Is there some reason why you don't immediately implement the cleaner model, where only those equities that need to be calculated are calculated?

The alternative you are talking about sounds quite a lot like something I would call a "kludge", i.e. a "solution" that is only a workaround for one specific problem and wouldn't be needed if the actual problem were solved. Please think this over again, as Pipeline is already so complex that you don't need additional complexities.

I would also assume that implementing this right from the beginning would have a dramatic effect on your server load, as the unneeded calculations would not be executed.

This sounds really wonderful! Up until now, to 'classify' I have been making a bunch of indexed lists of securities that met some condition or another (very cumbersome) and then when I need to do different things to different securities on the basis of their classification I do something like:

for stock in context.output.index:
    if list1[stock]:
        (do something)
    elif list2[stock]:
        (do something different)
    else:
        (do another something)


I can see how I no longer need the separate lists, but what's the best way to then do something to only those securities of a particular class?

The Pythonic way would be to filter the DataFrame down to one class and apply a function to the result:

output[output.sector == 101].apply(function)
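For instance, with a pipeline-output-shaped DataFrame (column names and data are illustrative; 101 is the Basic Materials code mentioned earlier in the thread):

```python
import pandas as pd

# Stand-in for a pipeline output DataFrame with a sector column.
output = pd.DataFrame({
    "sector": [101, 101, 309, 206],
    "signal": [0.4, -0.1, 0.8, 0.2],
}, index=["A", "B", "C", "D"])

# Select only the securities in one class...
basic_materials = output[output["sector"] == 101]
print(list(basic_materials.index))  # ['A', 'B']

# ...or handle every class at once with groupby.
per_sector_mean = output.groupby("sector")["signal"].mean()
print(per_sector_mean.round(2).to_dict())  # {101: 0.15, 206: 0.2, 309: 0.8}
```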

Why aren't Classifiers allowed for the following data?