String Columns Now Available in Pipeline

Hi All,

One of the biggest remaining holes in Pipeline since the launch of Classifiers has been the lack of support for string-typed data. Support for strings was merged into Zipline about a week ago, and as of today Quantopian supports loading string data in Pipelines.

There are two major use-cases for strings:

  1. Converting them into booleans via string-matching predicates (e.g. "startswith").
  2. Using them as grouping keys to transform numerical expressions (e.g. Z-Score asset returns by country code).

The groupby use-case works for strings exactly the way it does for integer columns like SectorCode. The Classifier announcement post provides an overview of grouping operations, and there's a new Working with Strings section in the Pipeline docs that provides another example with a string column.
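To make the grouping idea concrete outside of Pipeline, here is a minimal plain-Python sketch of z-scoring returns within each country code. It is not the Pipeline API: the function name `zscore_by_group`, the tickers, the country codes, and the return values are all hypothetical, and the use of the population standard deviation is an assumption of the sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_group(values, groups):
    """Z-score each value against the values that share its group label.

    `values` maps asset -> number and `groups` maps asset -> string label.
    Both names are illustrative, not part of Pipeline.
    """
    members = defaultdict(list)
    for asset, label in groups.items():
        members[label].append(values[asset])

    # Per-group mean and population standard deviation (an assumption here).
    stats = {label: (mean(vals), pstdev(vals)) for label, vals in members.items()}

    result = {}
    for asset, label in groups.items():
        mu, sigma = stats[label]
        # A group with no spread (e.g. a single member) has no meaningful z-score.
        result[asset] = (values[asset] - mu) / sigma if sigma > 0 else float("nan")
    return result


# Hypothetical sample data: daily returns keyed by ticker, grouped by country.
returns = {"AAA": 0.02, "BBB": 0.04, "CCC": -0.01, "DDD": 0.03}
country = {"AAA": "USA", "BBB": "USA", "CCC": "CAN", "DDD": "CAN"}

zscores = zscore_by_group(returns, country)
```

The string label only decides which assets are compared with each other, which is why a string column can stand in for an integer one like SectorCode.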

The use-case of implementing filters based on string data is supported by a suite of new string-matching methods on Classifier, such as the "startswith" predicate mentioned above.

More information on each of these methods is available in the Classifier API Reference.
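As a rough illustration of what a string-matching predicate does, the plain-Python sketch below turns a column of strings into booleans and combines the results into a universe. It does not use the Classifier methods themselves: `string_filter`, the tickers, and the company names are hypothetical, and treating a missing string as False is an assumption of the sketch rather than documented Pipeline behavior.

```python
import re


def string_filter(labels, predicate):
    """Apply a string predicate to each label, yielding asset -> bool.

    Missing labels (None) are mapped to False; that choice is an
    assumption of this sketch.
    """
    return {
        asset: (label is not None and predicate(label))
        for asset, label in labels.items()
    }


# Hypothetical sample data: company names keyed by ticker.
names = {
    "AAA": "ACME CORP",
    "BBB": "ACME CORP LP",
    "CCC": None,
    "DDD": "BETA HOLDINGS",
}

starts_with_acme = string_filter(names, lambda s: s.startswith("ACME"))
ends_with_lp = string_filter(names, lambda s: s.endswith(" LP"))
is_holdings = string_filter(
    names, lambda s: re.fullmatch(r"[A-Z ]+HOLDINGS", s) is not None
)

# Boolean results compose like any other filter: keep ACME names that are not LPs.
universe = [a for a in names if starts_with_acme[a] and not ends_with_lp[a]]
```

Each predicate produces one boolean per asset, so the outputs can be combined with "and", "or", and "not" in the same way as other Pipeline filters.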

To demonstrate the kinds of operations one might want to do with string-based filters, I've attached a notebook that implements 9 common universe selection criteria in Pipeline and analyzes their outputs.

This analysis is a step toward eventually providing recommended synthetic trading universes (e.g. a "Quanto 500" or "Quanto 3000") as efficient Pipeline built-ins, so I'm interested to hear if there are other interesting filtering criteria that could be included in the analysis.

- Scott

5 responses

Nice. So can this eliminate the need to apply get_fundamentals after running pipeline? With the addition of strings, is everything now covered?

So can this eliminate the need to apply get_fundamentals after running pipeline?

Pretty much. To my knowledge, pipeline now supports every field in the fundamentals database.

This analysis is a step toward eventually providing recommended synthetic trading universes (e.g. a "Quanto 500" or "Quanto 3000") as efficient Pipeline built-ins, so I'm interested to hear if there are other interesting filtering criteria that could be included in the analysis.

Bad data is a problem, and will continue to be a problem. It'd be great if, as soon as bad data is encountered, it could be added to lists that would be used for filtering via Pipeline. Possible? Presumably you are already thinking along these lines, since any recommended synthetic trading universe would need clean point-in-time minutely OHLCV bar data, along with any associated auxiliary data provided by Quantopian. Another approach would be to provide canned routines, kept up to date by Quantopian, that screen for bad data using Pipeline. Basically, give the tool a list of securities and have it return the good and bad ones, perhaps along with information about the problems. Possible?

Thank you. This is a huge help.

@Scott - I just read your notebook and laughed out loud at the reference to the "substantial challenge" :)