Back to Community
Why does date_range give a result different from indexing [] for DataFrame Pandas dates?

Here is a simple code with date_range and indexing [ ] I used in the Starter guide example

period_start = '2013-01-01'  
period_end = '2019-12-24'

print(pd.DataFrame ({'close':aapl_close,  

print(pd.DataFrame ({'close':aapl_close,  

gives Nan results

            close  returns  
2013-01-01    NaN      NaN  
2013-01-02    NaN      NaN  
2013-01-03    NaN      NaN  
2013-01-04    NaN      NaN  

Indexing gives correct results

                            close   returns  
2013-01-02 00:00:00+00:00  68.732  0.028322  
2013-01-03 00:00:00+00:00  68.032 -0.010184  
2013-01-04 00:00:00+00:00  66.091 -0.028531  

Based on how the dates are shown by date_range - I suppose the date format of date_range does not match the date format in the Pandas DataFrame.

1) Can you explain please why it gives NaN?

2) What would you suggest to get a specific time range from the Panda DataFrame?

2 responses

Welcome to the wonderful wacky world of dates and times in python. Specifically, how timezone aware and naive dates don't always play nice. Generally, all the times returned by methods on Quantopian are timezone aware with a timezone of 'UTC'. By default, the date_range method returns timezone naive dates. Consider when pandas tries to make a dataframe out of the following:

pd.DataFrame ({  
        index=pd.date_range(start=period_start, periods=6)))

The 'aapl_close' and 'aapl_returns' specified as columns of the dataframe are both pandas series. They have an index of UTC timezone aware dates. They aren't just dates but datetimes which include hours. The date_range index being specified is, by default, timezone naive.

Now, the problem arrises because pandas is smart, sometimes too smart. The first thing pandas does when creating the dataframe above is to create the index. Next it assigns values from the two data series to the appropriate index. Since an index was supplied for the two data columns, pandas doesn't simply assume the first value goes to the first index date. Pandas takes the first index and then 'looks up' the value having the same index in each series. The problem is the two datetimes don't match. Pandas doesn't find any data with the same datetimes as the index so it puts nans in for the data.

What's the fix? The best thing to do is not supply an index. Pandas will infer it from the two series and make things work. Like this

pd.DataFrame ({'close':aapl_close,  'returns':aapl_returns,})

If one really wanted to set the index however, simply supply the timezone information to date_range method. Like this

pd.DataFrame ({  
    index=pd.date_range(start=period_start, periods=6, tz='UTC'))

Note the added parameter of `tz='UTC". One detail however, the dates in that dataframe will now contain all the calendar days of the year. So the result will look like this

                            close   returns  
2013-01-01 00:00:00+00:00     NaN       NaN  
2013-01-02 00:00:00+00:00  68.732  0.028322  
2013-01-03 00:00:00+00:00  68.032 -0.010184  
2013-01-04 00:00:00+00:00  66.091 -0.028531  
2013-01-05 00:00:00+00:00     NaN       NaN  
2013-01-06 00:00:00+00:00     NaN       NaN  

The reason is, as explained above, pandas first creates the index (all calendar days by default) then goes in and assigns the correct data to each day. The new dataframe includes weekends so values for those days are nan (ie there is no pricing or returns on weekends). Generally, we don't want all those nans so generally, don't supply the index and use the index inferred from the price and returns series. That will only include trading days.

To get a specific date or time range in pandas perhaps the easiest is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive) simply enter this:

df = pd.DataFrame ({'close':aapl_close,  'returns':aapl_returns,})  
my_selected_dates = df['2013-01-04':'2013-01-08']

This bracket notation is really shorthand for using the loc method

my_selected_dates = df.loc['2013-01-04':'2013-01-08']

Both work the same but the loc method has a bit more flexibility. This notation also works with datetimes if desired.

Hope that makes things a bit more clear. Typically always use timezone aware dates on Quantopian. Your life will be easier.


The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

wow, Dan, thank you so much :) You are genius!