Help: Pipeline results dataframe has extra (hidden) entries in index

I'm obviously doing something wrong here (or expecting the wrong thing) but I can't see what it is (yes I've done the searching!)

The code below (condensed into a single cell for brevity) creates a pipeline which filters to the bottom 10 by AverageDollarVolume. I call run_pipeline() with a date range that spans 17 calendar days, 11 trading days. The resulting dataframe has 11 values in index.levels[0], as expected. Each day has 10 values in index.levels[1], as expected. results.info() reports that the multi index has 110 entries, as expected.

I want to extract the list of assets, so I use results.index.levels[1].unique(), a technique that is used in the documentation. However, this returns an array of 8930 assets (in my case), and this number is the same however wide or narrow I make the filter. I expected the list of assets actually referenced in the index: at least 10 and at most 110 (but probably more like 15) for the run shown here.

On one hand this feels like a pandas problem, because run_pipeline() returns a pandas DataFrame and I am then only using pandas methods on it. On the other hand it feels like a Quantopian problem, because I have never seen this behaviour in a DataFrame produced by any other means. Help!

def make_pipeline(filterWidth=10):

    # Dollar volume factor  
    dollar_volume = AverageDollarVolume(inputs=[USEquityPricing.close, USEquityPricing.volume],  
                                        window_length=30)  
    # 10-day close price average  
    mean_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)

    filter_dollar_volume = dollar_volume.bottom(filterWidth)  
    return Pipeline(  
        columns={  
            'meanclose': mean_10,  
            'dolvol': dollar_volume  
        },  
        screen=filter_dollar_volume  
    )

filterWidth = 10  
results = run_pipeline(make_pipeline(filterWidth), '2020-05-15', '2020-06-01')  
dateCount = len(results.index.levels[0])  
asset_list = list(results.index.levels[1].unique())  
print('''Number of dates in index.levels[0]: {0}  
Number of rows in dataframe: {1}  
Product of dates and filter width: {2}  
Length of results.index.levels[1].unique(): {3}  
'''.format(dateCount,
           len(results),  
           dateCount * filterWidth,  
           len(asset_list)))  
print('First 5 of asset list:\n', asset_list[:5])  
print('')  
results.info()  

If you want to run it, my imports are these:

from quantopian.pipeline import Pipeline  
from quantopian.research import run_pipeline  
from quantopian.pipeline.data.builtin import USEquityPricing  
from quantopian.pipeline.factors import SimpleMovingAverage, AverageDollarVolume  
import pandas as pd  

Thanks in anticipation...

4 responses

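Yes, you are correct that this behavior comes from pandas. The levels attribute holds every value a level can take, including values that no remaining row uses, whereas get_level_values returns only the values actually present in the index. The simple answer is to use the get_level_values method rather than levels. This will get you what you expected: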

# Don't use 'levels'
asset_list = list(results.index.levels[1].unique())

# Use 'get_level_values' instead
asset_list = list(results.index.get_level_values(level=1).unique())

The less simple answer, along with a notebook, can be found in this post.
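
For anyone who wants to see the effect without Quantopian, here is a minimal sketch in plain pandas. The dates, tickers, and the 'dolvol' values are made up for illustration, and it assumes the pipeline result behaves like any other DataFrame whose MultiIndex was filtered after construction. It also shows remove_unused_levels(), a pandas method that trims levels down to the values still in use.

```python
import pandas as pd

# Hypothetical stand-in for a pipeline result: 3 dates x 4 assets.
dates = pd.to_datetime(['2020-05-15', '2020-05-18', '2020-05-19'])
assets = ['AAA', 'BBB', 'CCC', 'DDD']
full_index = pd.MultiIndex.from_product([dates, assets],
                                        names=['date', 'asset'])
full = pd.DataFrame({'dolvol': range(len(full_index))}, index=full_index)

# Keep only two assets, mimicking what a screen does to the rows.
mask = full.index.get_level_values('asset').isin(['AAA', 'CCC'])
results = full[mask]

# 'levels' still remembers all four assets from the unfiltered index.
all_levels = list(results.index.levels[1])

# 'get_level_values' reports only the assets that rows actually use.
used = list(results.index.get_level_values(1).unique())

# Alternatively, drop the unused entries from the levels themselves.
trimmed = list(results.index.remove_unused_levels().levels[1])

print(len(results), all_levels, used, trimmed)
```

Here all_levels has four entries while used and trimmed have two, which matches the mismatch described in the question on a much smaller scale.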

Good luck.

Dan, thanks very much for your prompt and helpful reply. The GitHub discussion seems to have been going on for a long time and across many versions of pandas. It's a pity they chose to spend so much time infighting - they could have fixed the issue by now ;-)

Anyway, you've given me an insight which will be useful beyond Quantopian, and a practical workaround. Cheers!

@Peter Cahill Always glad to help when I can.

(and I've checked it out and it works. Thanks again!)