Am I doing something wrong? Pipeline row count *not consistent*

When I run my pipeline and check the shape of its output, like this ...

pipeline_output.shape  

I see it has a shape of (37, 5).

Some assets repeat. So, I try to get the unique assets, like this ...

unique_assets = pipeline_output.index.levels[1].unique()  

and I see it has a shape of (10041,).

How did the number of rows get so high? If anything, since I am removing repeated assets, shouldn't I see a number less than 37?
I've attached my Notebook.

3 responses

Hi Nikhil,

The issue here is buried in the way pandas represents a MultiIndex at a lower level. When you reference pipeline_output.index.levels[1], you are not getting the labels of the rows currently in the dataframe. You are getting a lookup table that holds each distinct label of level 1 (the second level of pipeline_output's multi index, since levels are numbered from 0) exactly once. Pandas then stores the index itself as integers, each one the position of a label in pipeline_output.index.levels[1], instead of storing the more expensive labels over and over (I might be getting the details here a bit wrong, but the concept should be right). You can think of it almost like a lookup key: pandas keeps a cheap representation of the index labels in the actual data structure and looks up the real label only when it needs to display it. This is especially helpful when the index has lots of duplicate labels (which is very much the case in pipeline outputs!).
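To make the lookup-table idea concrete, here is a minimal sketch on a toy frame. The dates, tickers and the `factor` column are made up for illustration and are not from the notebook. It also assumes a reasonably recent pandas, where the integer positions are exposed as `codes` (older releases called the same attribute `labels`).

```python
import pandas as pd

# Hypothetical stand-in for a pipeline output: a (date, asset) MultiIndex.
dates = pd.to_datetime(["2017-01-03", "2017-01-04"])
assets = ["AAPL", "MSFT", "XOM"]
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])
frame = pd.DataFrame({"factor": range(len(index))}, index=index)

# levels[1] is the lookup table: each distinct asset label is stored once.
asset_lookup = frame.index.levels[1]

# codes[1] holds one integer per row, each a position in the lookup table.
asset_codes = frame.index.codes[1]

# Decoding the integers through the lookup table recovers the per-row labels.
decoded = asset_lookup.take(asset_codes)

print(list(asset_lookup))
print(list(asset_codes))
print(list(decoded))
```

The frame has six rows, but the lookup table has only three entries, one per distinct asset.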

Importantly, I believe this lookup table is built when the index is created and is unaffected by any filtering operations. Under the hood, the screen argument on the Pipeline constructor just runs a filter on your final pipeline output, so the index's label mappings remain untouched and still list every asset that was there before the screen. That is where the 10041 comes from.
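A rough way to see this outside of Pipeline is to filter a toy frame by hand. The boolean mask below is only a stand-in for what a screen does (I am not reproducing the actual Pipeline internals), and the tickers are invented.

```python
import pandas as pd

# Hypothetical unscreened output: 3 dates x 4 assets = 12 rows.
dates = pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05"])
assets = ["AAPL", "GE", "MSFT", "XOM"]
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])
unscreened = pd.DataFrame({"factor": range(len(index))}, index=index)

# Stand-in for a screen: keep only the rows for two of the four assets.
keep = unscreened.index.get_level_values("asset").isin(["AAPL", "XOM"])
screened = unscreened[keep]

rows_before = len(unscreened)
rows_after = len(screened)

# The lookup table was built with the original index and is not pruned
# by the filter, so it still lists all four assets.
lookup_size_after = len(screened.index.levels[1])

print(rows_before, rows_after, lookup_size_after)
```

The row count drops, but the size of levels[1] does not, which is the same mismatch as in the question.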

Instead, if you want the unique values in the current, filtered-down version of your dataframe, you should start from pipeline_output.index.get_level_values(1), which returns the actual level 1 label of each row (repeats included), as opposed to the lookup table.
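Here is a small sketch of the difference between the two calls, again on a made-up frame rather than the real pipeline output. The `remove_unused_levels()` call at the end is not something discussed in this thread; it is a method on newer pandas versions that I am including only as a cross-check.

```python
import pandas as pd

# Hypothetical filtered output: 3 dates x 4 assets, screened down to 2 assets.
dates = pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05"])
assets = ["AAPL", "GE", "MSFT", "XOM"]
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])
unscreened = pd.DataFrame({"factor": range(len(index))}, index=index)
keep = unscreened.index.get_level_values("asset").isin(["AAPL", "XOM"])
screened = unscreened[keep]

# One label per surviving row, duplicates included.
row_labels = screened.index.get_level_values(1)

# Distinct assets that are actually present after filtering.
unique_assets = row_labels.unique()

# The stale lookup table, for comparison.
stale_lookup = screened.index.levels[1]

# Cross-check (newer pandas only): rebuild the index without unused labels.
pruned_lookup = screened.index.remove_unused_levels().levels[1]

print(len(row_labels), list(unique_assets), len(stale_lookup))
```

get_level_values(1) on its own still contains repeats, so a .unique() on top of it is what gives the distinct assets.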

Does this help?

Thanks, Jamie. If I understand correctly, it seems like the underlying data structure will always have the same number of indices, but the screen filter just alters what is displayed.

So, if I apply a unique on your call like

pipeline_output.index.get_level_values(1).unique()  

I get 5 rows, which is what I expected.

Thanks again.

That's correct. pipeline_output.index.levels[1] will keep the same number of elements no matter how many rows the screen filters out. Glad I could help!