The first bug is that run_pipeline sometimes returns values for dates after the requested ones.
The behavior here is that when you call run_pipeline(pipe, start_date, end_date), if start_date or end_date are not trading days, they're rolled forward to the next trading day in the US Equity calendar. This is intended, since it's the only behavior that makes it easy to run single-day pipelines (a common task when debugging or testing a new filter/factor/classifier) without the user having to know the historical trading calendar by heart.
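To make the rolling rule concrete, here is a minimal sketch of rolling a date forward to the next trading day. It is not how run_pipeline is implemented: TRADING_DAYS is a small hand-written slice of the calendar and roll_forward is a hypothetical helper, both assumed for illustration only.

```python
from bisect import bisect_left
from datetime import date

# Hand-written slice of the US equity calendar around Thanksgiving 2015.
# The market was closed on Thursday 2015-11-26 and over the weekend.
TRADING_DAYS = [
    date(2015, 11, 24),
    date(2015, 11, 25),
    date(2015, 11, 27),
    date(2015, 11, 30),
    date(2015, 12, 1),
]


def roll_forward(day, trading_days=TRADING_DAYS):
    """Return `day` if it is a trading day, else the next trading day."""
    # bisect_left finds the first calendar entry that is >= day.
    i = bisect_left(trading_days, day)
    if i == len(trading_days):
        raise ValueError(f"no trading day on or after {day}")
    return trading_days[i]


print(roll_forward(date(2015, 11, 25)))  # already a trading day, unchanged
print(roll_forward(date(2015, 11, 26)))  # holiday, rolls to 2015-11-27
print(roll_forward(date(2015, 11, 28)))  # Saturday, rolls to 2015-11-30
```

A trading day maps to itself, so rolling only changes requests that land on a weekend or holiday.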
Consider the case where a user wants to run a pipeline for the first day of 2016. The obvious thing to write here is something like run_pipeline(pipe, '2016-01-01', '2016-01-01'). As it turns out, however, the first trading day of 2016 was January 4th, so we have a few options for what we can do:
1. Return an empty dataframe.
2. Raise an exception informing the user that there are no trading days between the requested dates.
3. Roll the start and end dates to the next or previous trading day and then compute.
Option 1 is likely to just confuse users, and Option 2 forces users to know the historical trading calendar by heart in order to use run_pipeline without errors, so Option 3 seems like the most friendly behavior for a function that will be invoked interactively. Within Option 3, we have a few possible choices:
1. Roll both start_date and end_date forward.
2. Roll both start_date and end_date backward.
3. Roll start_date backward and end_date forward. (This gives the largest possible window.)
4. Roll start_date forward and end_date backward. (This gives the shortest possible window.)
Option (4) seems like the natural choice, but that doesn't solve the problem in the case that there are no trading days between start_date and end_date, so we'd still end up blowing up or returning an empty result in many cases.
Option (3) has the confusing behavior that run_pipeline(pipe, '2014', '2014') would return two days of data (the last trading day of 2013 and the first trading day of 2014) rather than just one.
That leaves Option (1) and Option (2). The choice here is more or less arbitrary. I think rolling forward (Option (1)) is slightly more intuitive because it has the nice property that run_pipeline(pipe, '2014', '2014') still gives you data from 2014, rather than rolling back to a previous year.
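The four rolling choices can be compared side by side on the 2016-01-01 request. This is only a sketch under stated assumptions: TRADING_DAYS is a hand-written slice of the calendar, and roll_forward, roll_backward, and window are hypothetical helpers, not part of the pipeline API.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hand-written slice of the US equity calendar around New Year 2016.
# 2016-01-01 was a holiday and 2016-01-02/03 was a weekend.
TRADING_DAYS = [
    date(2015, 12, 30),
    date(2015, 12, 31),
    date(2016, 1, 4),
    date(2016, 1, 5),
]


def roll_forward(day, days=TRADING_DAYS):
    """First trading day on or after `day`."""
    i = bisect_left(days, day)
    if i == len(days):
        raise ValueError(f"no trading day on or after {day}")
    return days[i]


def roll_backward(day, days=TRADING_DAYS):
    """Last trading day on or before `day`."""
    i = bisect_right(days, day) - 1
    if i < 0:
        raise ValueError(f"no trading day on or before {day}")
    return days[i]


def window(start, end, roll_start, roll_end, days=TRADING_DAYS):
    """Trading days between the rolled start and the rolled end, inclusive."""
    lo, hi = roll_start(start), roll_end(end)
    return [d for d in days if lo <= d <= hi]


jan1 = date(2016, 1, 1)
OPTIONS = {
    1: window(jan1, jan1, roll_forward, roll_forward),    # both forward
    2: window(jan1, jan1, roll_backward, roll_backward),  # both backward
    3: window(jan1, jan1, roll_backward, roll_forward),   # widest window
    4: window(jan1, jan1, roll_forward, roll_backward),   # narrowest window
}
for number, result in OPTIONS.items():
    print(number, [d.isoformat() for d in result])
```

With this calendar, Option (1) yields only 2016-01-04, Option (2) yields only 2015-12-31, Option (3) yields both of those days, and Option (4) yields nothing, because the rolled start lands after the rolled end.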