Faster Fundamental Data

Today, we launched an improved version of Morningstar fundamental data in Pipeline. The new implementation is faster and corrects many data issues that existed in the old system.

Fundamental queries in Pipeline are 2-20x faster. The biggest improvements will be noticed in fields that update less often (monthly, quarterly, etc.) and in queries that use many different fields.

Queries will also be more memory efficient and more consistent in how long they take to complete. Some queries are not faster yet, but we are actively working on improving them. Most notably, the Q1500US and Q500US are slower in the new implementation; they should be much faster soon.

In addition to performance improvements, there are changes to the underlying data. The new implementation includes a large number of corrections across many assets and many fields.

How can you use the new implementation?

The new version is accessible in Pipeline via a new namespace. The following example demonstrates how to get the operating_income field in both the new and old formats:

# New way: Fundamentals.my_field:  
from quantopian.pipeline.data import Fundamentals  
operating_income = Fundamentals.operating_income.latest

# Old way: morningstar.category.my_field  
from quantopian.pipeline.data import morningstar  
operating_income = morningstar.income_statement.operating_income.latest  

Built-in pipeline terms that use fundamental data also have a new module:

# New way: classifiers.fundamentals  
from quantopian.pipeline.classifiers.fundamentals import Sector  
sector = Sector()

# Old way: classifiers.morningstar  
from quantopian.pipeline.classifiers.morningstar import Sector  
sector = Sector()  
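
For instance, a minimal sketch of putting the classifier to work in research (assuming run_pipeline from quantopian.research, as in a notebook):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.classifiers.fundamentals import Sector
from quantopian.research import run_pipeline

# One-column pipeline: the Morningstar sector code for each asset.
pipe = Pipeline(columns={'sector': Sector()})
result = run_pipeline(pipe, '2017-08-01', '2017-08-01')
print(result.head())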

The attached notebook has more examples that use the new namespace. The notebook also includes examples of the performance and data correctness improvements that come with the new system.

What does this mean for you?

Until the end of September, both the new and old implementations of Morningstar fundamental data will be available in Pipeline. Over the next month, you should compare the old and new data. Try running your research notebooks and algorithms on the new data to understand the impact that any changes in the data might have on your work.

At the end of September, the old namespace will be deprecated and redirected to point at the new data. If you don’t manually update your notebooks or algorithms to use the new namespace, they will automatically start to use the new data. Of course, it is advisable to test the impact ahead of time.
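
For example, one way to run that comparison in research (a sketch assuming both namespaces remain importable during the transition):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals, morningstar
from quantopian.research import run_pipeline

# Run one pipeline with both implementations side by side.
pipe = Pipeline(columns={
    'old': morningstar.income_statement.operating_income.latest,
    'new': Fundamentals.operating_income.latest,
})
result = run_pipeline(pipe, '2016-01-04', '2016-06-30')

# Rows where the implementations disagree (treating shared NaNs as equal).
both_nan = result['old'].isnull() & result['new'].isnull()
mismatch = (result['old'] != result['new']) & ~both_nan
print(mismatch.sum(), 'differing rows out of', len(result))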

[Notebook attached]


Hi Jamie,

With regard to "The new implementation includes a large number of corrections across many assets and many fields.", is there a change log that can be made public or included in this thread? Some Morningstar fundamental data (for instance, peg_ratio) weren't available all the way back to 2003. Have those missing data been corrected?

I am very impressed with the speedup numbers that you reported.

Best Regards,
Leo

Thanks for the NB Jamie, very neat. 2 questions for you :)

1 - Is it safe to mix old Q1500US and new fundamentals to have the best performance?

2 - Out of curiosity, why did you change the hierarchy inside the fundamentals (i.e. Fundamentals.operating_income vs morningstar.income_statement.operating_income)? It's nicer to type now, but it's harder to switch between the old and new versions. If you had kept the same hierarchy, one could simply change the import statement to switch between the old and new way.

Thanks a lot for the faster API, Jamie!
And thanks also for having already migrated my Piotroski score implementation.

I have some questions:

  1. Now that the API is faster, it's possible to use a longer window_length without running into timeouts. Will that remain the only way to use fundamentals for past quarters or other timeframes?

  2. I already posted the following two questions by e-mail, but I think they're relevant for everyone: do the data corrections also apply to the old get_fundamentals() API?

  3. What about the future of this old API? Is it confirmed that it will be deprecated?

Thank you all for the positive feedback! I am going to try to answer your questions as they appear in this thread.

Leo M.

Some Morningstar fundamental data (for instance, peg_ratio) weren't available all the way back to 2003. Have those missing data been corrected?

Unfortunately we still do not have a long history for peg_ratio. We have only changed how we process the data that Morningstar provides, and they do not provide the history for this field.

Luca

Is it safe to mix old Q1500US and new fundamentals to have the best performance?

It is safe to mix any of the terms from the old API with the new API. For now, using the old Q1500US will be faster but you may want to test that your algorithm behaves okay with the new data. We should have the cache ready for the new Q1500US soon which will make it just as fast as before.
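
For example, a sketch of the kind of mixing that is safe during the transition: the cached universe filter from the old module screening a pipeline built on the new data.

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals             # new data
from quantopian.pipeline.filters.morningstar import Q1500US   # old, cached universe

pipe = Pipeline(
    columns={'operating_income': Fundamentals.operating_income.latest},
    screen=Q1500US(),
)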

why did you change the hierarchy inside the fundamentals?

We discussed this change a lot, and we decided that the flat namespace makes it easier for new users to find fields, and for experienced users to find less commonly used fields. When you are in the notebook and type morningstar., you only see the sub-namespaces, not the names of the fields you are looking for. Now, you no longer need to search all 13 sub-namespaces to find the one field you want. I hadn't considered that it makes it more tedious to port your algorithms to the new API. If you have Python installed locally, you can automatically change all of your morningstar.subnamespace.field references into Fundamentals.field by copying your algorithm to a local file and running this on the command line:

python -c "import re, sys; print(re.sub(r'morningstar\.[_a-zA-Z0-9]+\.([_a-zA-Z0-9]+)', r'Fundamentals.\1', open(sys.argv[1]).read()))" path/to/algorithm/file  

Be sure to replace path/to/algorithm/file with the actual path to your algorithm file. This command uses a regular expression to replace every occurrence of morningstar.subnamespace.field with Fundamentals.field and then prints the result. I realize this is not an ideal solution, but hopefully it makes it easier to switch over.
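
As an illustration, a field reference like the first line below comes out rewritten as the second (note that the import statement itself still needs to be updated by hand):

# before:
operating_income = morningstar.income_statement.operating_income.latest
# after:
operating_income = Fundamentals.operating_income.latest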

Costantino

Now that the API is faster, it's possible to use a longer window_length without running into timeouts?

This is correct, you should be able to use much larger windows across many fields in your backtests and in research.

Will that remain the only way to use fundamentals for past quarters or other timeframes?

For now, we have not planned any API changes around longer lookbacks, though it is something we have talked about. We would need to spend more time thinking about what that API would look like.

Do the data corrections also apply to the old get_fundamentals() API?

We have not updated the get_fundamentals() API, and we will be removing it in about a month. The main reason for keeping it after Pipeline was introduced was that it was faster for some use cases; we believe this performance improvement to the Pipeline API addresses that.

What about the future of this old API? Is it confirmed that it will be deprecated?

As I mentioned above, the get_fundamentals() function will be removed. The old morningstar.namespace.field API will switch over to the new system which will inherit the new performance improvements and data corrections. The reason we have made this "opt-in" for now was to give the community time to evaluate the new data and help us find any issues before we made the switch for everyone.

Thanks again, I am happy to answer any more questions regarding this change!


@Joe Jevnik, thank you for the detailed reply. One more question, if you don't mind: do default_us_equity_universe_mask or make_us_equity_universe have internal caching implemented the same way as Q500US/Q1500US do? I usually like to set a variable universe size via make_us_equity_universe, but if Q500US/Q1500US are much faster, then I will try to stick to them.

The three currently cached terms are: Q500US(), Q1500US(), and default_us_equity_mask(). If any of these terms get used in a pipeline, the results will be read from a pre-computed file instead of computed on the fly. The reason we can't cache make_us_equity_universe is that each invocation returns a new term. We would need to pre-compute the results for all possible inputs which is not possible. I should also note that the caching does not work if you pass a custom minimum market cap to any of the cached terms because we have only pre-computed the results for the default values.

This isn't to say that make_us_equity_universe cannot be used; in fact, it is now much faster than before. The same is true of Q500US/Q1500US and default_us_equity_mask with non-default minimum market cap values. Hopefully this helps explain how the cache works!
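
A short sketch of the distinction (assuming these filters accept a minimum_market_cap keyword, as the note above implies):

from quantopian.pipeline.filters.morningstar import Q1500US

# Default arguments: results are read from the pre-computed cache.
cached = Q1500US()

# A non-default minimum market cap returns a distinct term, so it is
# computed on the fly: still much faster than before, just not cached.
uncached = Q1500US(minimum_market_cap=100e6)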

Thanks a lot, it makes sense.

I guess this is the easiest way to switch between old and new implementation then:

# Right at the top of your algo/NB  
new_way = True

if new_way:  
    # New way: Fundamentals.my_field. The new namespace is flat, so aliasing
    # it under the old sub-namespace names lets the rest of the code run
    # unchanged (e.g. income_statement.operating_income still resolves).  
    from quantopian.pipeline.data import Fundamentals as income_statement  
    from quantopian.pipeline.data import Fundamentals as balance_sheet  
    from quantopian.pipeline.data import Fundamentals as valuation_ratios  
    from quantopian.pipeline.data import Fundamentals as operation_ratios  
    from quantopian.pipeline.data import Fundamentals as valuation  
    from quantopian.pipeline.classifiers.fundamentals import Sector  
    # from quantopian.pipeline.filters.fundamentals import Q1500US, Q500US  # not cached yet  
    from quantopian.pipeline.filters.morningstar import Q1500US, Q500US  
else:  
    # Old way: morningstar.category.my_field  
    from quantopian.pipeline.data.morningstar import income_statement  
    from quantopian.pipeline.data.morningstar import balance_sheet  
    from quantopian.pipeline.data.morningstar import valuation_ratios  
    from quantopian.pipeline.data.morningstar import operation_ratios  
    from quantopian.pipeline.data.morningstar import valuation  
    from quantopian.pipeline.classifiers.morningstar import Sector  
    from quantopian.pipeline.filters.morningstar import Q1500US, Q500US

Hi Joe and Jamie,

I'm unable to import some of Morningstar's grade data when using the new query syntax -- check out the examples below.

Please let me know how I should be adjusting my code or if there's something wrong on the backend.

Thanks!

James

[Notebook attached]

This is wonderful! I found that many algorithm ideas I had attempted could not run due to memory limits and execution timeouts.

Question though: is there anything in the works to make it a bit less awkward to work with quarterly and other infrequently updating data? Right now the best method I know of is creating a huge window length, then essentially looking back 252/4 + 1 days to get the last quarterly update. A massive data structure is created when I need only 1/64th of those values. I had an idea to do the lookbacks in chunks up front, within the limits, and then populate them into a much smaller structure stored in the context, but I never got around to it. It seems like a problem that anyone implementing a fundamentals-based algorithm with lookbacks is going to encounter (I have commented on a few posts to help users who are first trying to figure out how to do this).

First of all, my compliments to Jamie, Joe, and the other Q developers for this big improvement. I no longer get timeout errors, even in algos that load two or more years of data, like the one described here:
https://www.quantopian.com/posts/error-in-fundamental-data#56fbfe939426be39440000a4

Nevertheless, I completely agree with Kevin's comment. It's exactly why I posted the question above, "Will that remain the only way to use fundamentals for past quarters or other timeframes?"... and at least for the moment, the answer is yes.

Some time ago I proposed adding the possibility to pass only the needed data indexes instead of the complete window length:
https://www.quantopian.com/posts/pipeline-api-feature-request-retrive-single-data-points-instead-of-a-full-array-of-data-window-length

Fundamental data are now much faster, but if you have to load two or more years of data in a backtest, it still takes a while. Maybe a similar change could make fundamentals, or other quarterly or infrequently updating data, even faster!

basic_eps and dividend_rate, just for starters, are missing.

Can we please get access to sklearn.neural_network and/or update sklearn?
I noticed you downgraded it from 18 to 16 for whatever reason.

Jamie, Joe, big thank you for the new API!

The improvement in speed is huge! Something that took 170 seconds in research now takes only 6 seconds!

But I noticed something weird:

To get quarterly data, I used lookbacks with lengths of 1, 65, 130 and 195 days.
With the new API these lengths gave me some duplicate values for the quarters.
I had to change them to something like 1, 90, 160 and 230 to get the same data as with the old API. Why is that?

Anyone else experiencing the same? Be warned I should say!

Up until this moment, I noticed some small changes in Enterprise Value. Those changes decreased my best return by a third :-(.

@Jamie, I have to concur with Donny that backtest results are not identical when purely text-switching from (old format) morningstar.* to (new format) Fundamentals.*, likely for the same reason Donny pointed out: the offsets may be different now. Could you please run some data consistency checks on past values pulled with the old and new formats? I can provide logs if necessary; please let me know.

Regards,
Leo

@Leo M The backtest results are different because the new system has many data corrections. In my case the returns decreased by roughly a half.
It would be interesting to know what exactly the nature of the corrections was, and whether other users experienced an increase in performance.

@Donny I also use the indexes [-1, -65, -130, -195] for TTM data. I tested some symbols for operating_income but got no duplicates (see the attached notebook). Could you please give some examples?

[Notebook attached]

To simplify the use of TTM (Trailing Twelve Months) data, I created the following utility classes:

import numpy as np

from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.filters.morningstar import Q1500US


def make_pipeline(context):  
    quarter_length = 65  # approximate trading days per quarter


    class TTM(CustomFactor):  
        """  
        Trailing Twelve Months (TTM) current year  
        """  
        window_length = 3 * quarter_length + 1  
        def compute(self, today, assets, out, data):  
            ttm = np.array([-1, -quarter_length, -2 * quarter_length, -3 * quarter_length])  
            out[:] = np.nansum(data[ttm], axis=0)


    class TTM_PY(CustomFactor):  
        """  
        Trailing Twelve Months (TTM) previous_year (PY)  
        """  
        window_length = 7 * quarter_length + 1  
        def compute(self, today, assets, out, data):  
            ttm_py = np.array([-4 * quarter_length, -5 * quarter_length, -6 * quarter_length, -7 * quarter_length])  
            out[:] = np.nansum(data[ttm_py], axis=0)  


    return Pipeline(  
        columns = {  
            'ps_ratio_ttm': Fundamentals.market_cap.latest / TTM(inputs=[Fundamentals.total_revenue]),  
            'ps_ratio_ttm_py': Fundamentals.market_cap.latest / TTM_PY(inputs=[Fundamentals.total_revenue])  
        },  
        screen = Q1500US()  
    )  

I'd recommend following an exhaustive testing procedure: pull each fundamental at various intervals using the old and new formats and make sure they are the same. If we correct data and change the retrieval mechanism in the same release, it will be difficult to debug why performance is different, and it could cause large-scale disruption to algos across the board. Hence I recommend releasing in phases. Phase 1: change the retrieval method. Phase 2: correct the data.

If the performance of algos changes, it will be a difficult proposition, as we will have to change the algos and lose the X months of out-of-sample record built with the old format; hence I recommend not switching morningstar.* to Fundamentals.*.

Instead, if users want to make use of the new Fundamentals.* format, it would be better to start from scratch and write a new algo, thus preserving the performance and out-of-sample record of the old algo with the existing data and format (morningstar.*).

Redirecting morningstar.* to Fundamentals.* when it changes the performance of existing algos will be difficult to deal with, as opposed to a change that only increases speed while returning the same data and preserving algo performance.

I think there is a larger issue with the evaluation process: if you allocated capital based on performance and tear sheets generated with morningstar.*, assuming performance X, and then redirect morningstar.* to Fundamentals.*, the actual performance may be Y (and maybe not worthy of allocation), or the algo may produce unexpected results after allocation compared with what was predicted during evaluation (X from morningstar.*).

@Costantino

I have added an example of calculations with EBIT.

Am I overlooking or missing something here?

[Notebook attached]

Hi Donny,

I think the problem is the date '2003-01-01'. There are no data before that date, and instead of returning NaNs, the missing values are filled with the only value available.
Try with '2016-01-01' and there are no duplicates.

Anyway, there is a problem when the window_length is long.
I added the quarters in the previous year:

EBIT_Q1_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 260)  
EBIT_Q2_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 325)  
EBIT_Q3_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 390)  
EBIT_Q4_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 455)  

and compared the values for AAPL with Gurufocus (which also uses Morningstar as its source).
All values are the same, except for window_length = 455 (EBIT_Q4_PY):

Total Revenue

            Sep15  Dec15  Mar16  Jun16  Sep16  Dec16  Mar17  Jun17
Gurufocus   51501  75872  50557  42358  46852  78351  52896  45408
Quantopian   6789  75872  50557  42358  46852  78351  52896  45408

6789 vs 51501 must be an error. The same problem occurs for operating revenue.

Notebook with the computation described above attached.

[Notebook attached]

I added window_length = 520, and the result is surprising: window_length = 455 is now correctly 51501, but window_length = 520 is 6789!
There is something weird with the new data retrieval mechanism... at least in research; I don't know about algo backtesting.

@Jamie, @Joe: could you please take a look at this? It seems to be a major issue.

Hi @Costantino, thanks for posting an example highlighting the discrepancy. I'm hoping the backtest difference in performance can be explained by the offset issue that you and Donny encountered, and that fixing it will bring the backtest performance back to what it was before. Then we can isolate the effect of the data corrections on our algorithms.

-Leo

I investigated the issue further, and the problem is once again the date interval!
If you run the pipeline with result = run_pipeline(pipe, '2014-08-29', '2017-08-29') instead of result = run_pipeline(pipe, '2017-08-29', '2017-08-29'), everything is fine! See the new attached notebook.
I think the algorithm backtesting is not affected by this behaviour.

Conclusion: there are no discrepancies in the data, and the indexes [-1, -65, -130, -195] or [-260, -325, -390, -455] are okay for TTM or TTM-previous-year data.

Anyway, this confusion demonstrates again that we need a better method for accessing past quarterly data, as pointed out by Kevin two days ago (https://www.quantopian.com/posts/faster-fundamental-data#59a429c2675043000da06ac3)

[Notebook attached]

@James O'Brien: Thanks for finding this! There is currently a problem with a set of categorical fields, including profitability grade. The full list of affected fields is:

  • financial_health_grade
  • financial_growth_grade
  • profitability_grade
  • company_status
  • industry_template_code
  • share_class_status

For now, please continue fetching these fields through the old API. I will post an update when the fix has been deployed.

@Costantino: I agree that the current API makes it cumbersome to work with quarterly or lower frequency data. While it may seem like the current API is inefficient, due to the way pipeline is implemented this is not the case.

In algorithms, pipelines are computed in 6 month chunks. This means that every 6 months we fetch all data and perform all of the computations needed to produce the next 6 months of output values. One of the biggest costs in Pipeline is retrieving the input data so we try to read data in large batches. Querying for data in large batches reduces the number of times we need to go to the disk or database, which has a high constant cost regardless of the amount of data being read.

Imagine that we had an API that presented the user with the current value and the trailing quarter's value for some field. On the first day of the computation, when today = 2016-03-01, we would need to have read exactly two rows: the current day's value (2016-03-01) and one quarter ago's value (2016-01-01). The table on the left shows the raw source data, and the table on the right shows the slice of data that is presented in a custom factor.

On the second day of computation, when today = 2016-03-02, we would need to read two more rows. This means we have read four rows in total: 2016-01-01 and 2016-03-01 from the previous computation and 2016-01-02 and 2016-03-02 from the current day's computation. Again, the table on the left shows the raw source data and the table on the right shows the slice of data that is presented in a custom factor.

If you repeat this process for at least one quarter, we will have read every row from 2016-01-01 to 2016-03-01. Because we know that all of these values will be used eventually, it is more efficient to query for them as one dense block. In theory, we could hold only the two rows in memory at a time; however, the time cost of reading the data a few rows at a time would make this infeasible.

In research, users may choose to run smaller windows which would not require every value from 2016-01-01 to 2016-03-01. The optimization of querying in a dense block still holds because it is more efficient to read a contiguous block of data than to do random access. We would spend more time determining which rows to filter than we would just reading all of the rows. Even if we could build an indexing scheme to more efficiently read these non-contiguous regions, the absolute time saved when querying for less than one quarter of data would be fractions of a second, and the RAM cost of a few rows is negligible.

Hopefully this helps explain why using a long lookback window may be as efficient as querying for trailing quarters.
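
To make that access pattern concrete, here is a minimal sketch (an illustration, not from the original post) of reading "today" and "one quarter ago" out of a single dense window:

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

QUARTER = 65  # approximate trading days per quarter

class QuarterOverQuarterRevenue(CustomFactor):
    """Change in total revenue between today and ~one quarter ago."""
    inputs = [Fundamentals.total_revenue]
    window_length = QUARTER + 1

    def compute(self, today, assets, out, revenue):
        # revenue has shape (window_length, n_assets): row -1 is today,
        # row 0 is ~one quarter back. The dense block is read once per
        # chunk, so the long window costs little extra.
        out[:] = revenue[-1] - revenue[0]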

@Ian Worthington: I apologize for the inconvenience; a few of the fields have been renamed slightly:

manual_renames = {  
    'dividend_yield': 'trailing_dividend_yield',  
    'dividend_rate': 'forward_dividend',  
    'diluted_eps': 'diluted_eps_earnings_reports',  
    'basic_eps': 'basic_eps_earnings_reports',  
    'basic_eps_other_gains_losses': (  
        'basic_eps_other_gains_losses_earnings_reports'  
    ),  
    'diluted_eps_other_gains_losses': (  
        'diluted_eps_other_gains_losses_earnings_reports'  
    ),  
    'basic_average_shares': 'basic_average_shares_earnings_reports',  
    'diluted_average_shares': 'diluted_average_shares_earnings_reports',  
    'average_dilution_earn': 'average_dilution_earnings',  
    'basic_extraordinary': 'basic_extraordinary_earnings_reports',  
    'normalized_basic_eps': 'normalized_basic_eps_earnings_reports',  
    'diluted_discontinuous_operations': (  
        'diluted_discontinuous_operations_earnings_reports'  
    ),  
    'diluted_continuous_operations': (  
        'diluted_continuous_operations_earnings_reports'  
    ),  
    'basic_accounting_change': (  
        'basic_accounting_change_earnings_reports'  
    ),  
    'continuing_and_discontinued_basic_eps': (  
        'continuing_and_discontinued_basic_eps_earnings_reports'  
    ),  
    'diluted_extraordinary': 'diluted_extraordinary_earnings_reports',  
    'tax_loss_carryforward_diluted_eps': (  
        'tax_loss_carryforward_diluted_eps_earnings_reports'  
    ),  
    'tax_loss_carryforward_basic_eps': (  
        'tax_loss_carryforward_basic_eps_earnings_reports'  
    ),  
    'basic_discontinuous_operations': (  
        'basic_discontinuous_operations_earnings_reports'  
    ),  
    'continuing_and_discontinued_diluted_eps': (  
        'continuing_and_discontinued_diluted_eps_earnings_reports'  
    ),  
    'basic_continuous_operations': (  
        'basic_continuous_operations_earnings_reports'  
    ),  
    'normalized_diluted_eps': 'normalized_diluted_eps_earnings_reports',  
    'dividend_per_share': 'dividend_per_share_earnings_reports',  
    'diluted_accounting_change': 'diluted_accounting_change_earnings_reports',  
}

The _earnings_reports suffix denotes that the attribute is about an entire company; we may have an attribute of the same name for each share class.

The other renames clarify which direction a field is looking; for example, 'dividend_yield' is now 'trailing_dividend_yield'. We also have forward_dividend_yield, so we wanted to clarify what each of these fields means.
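
If it helps while porting, a tiny hypothetical helper (not part of the platform) can apply the mapping above to old field names:

def migrate_field_name(name):
    """Translate an old field name to its new equivalent, if it was renamed."""
    return manual_renames.get(name, name)

print(migrate_field_name('basic_eps'))         # basic_eps_earnings_reports
print(migrate_field_name('operating_income'))  # unchanged: operating_income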

I am currently looking into the other issues posted in this thread. Thank you all for the great feedback and testing!

@Costantino

2016-01-01 (the actual date is 2016-01-04; 2016-01-01 is a Friday) gives me duplicates...

I tried with dates 2003-01-01, 2004-01-01 and 2005-01-01, and each time I get duplicate values.
2006 and 2007 are fine. 2008-2011: duplicates again.
2012-2013 are fine. 2014-2016: duplicates again (all tested on 01-01).
2017-01-01 is fine again.
What is going on here??

EDIT: changing windows to 1, 70, 130 and 195 seems to solve it for now.

@Joe, there could be more Morningstar data that has offset or data changes. I will do some debugging on my end to narrow down why one of my algos' performance changed when switching to the new format. The returns have actually gone up, but my concern is a drawdown in one period that spiked in the new format. I will spend some cycles this weekend and email you some data points. I was hoping it was a generic offset issue, but apparently not.

To everyone on this thread, thank you for all of the bug reports. These are extremely helpful to us.

@Costantino: The issue that you reported with the pipeline results changing based on the lookback window length is indeed a bug. Joe has identified the problem and is working on a fix. Thanks for digging it up! We can post more detail on the issue once the fix is up.

@Leo: As you suggested, the changes to the data might not be systematic. It's more likely that there are changes that don't span an entire field or date. If you find a pattern, or simply report a collection of changes that you noticed between algos that you think might be incorrect, we can certainly take a look.

@Jamie, I have been able to narrow down the problem (why simple text substitution of morningstar.* to Fundamentals.* was giving me different performance), and I was also able to figure out how to change my algo using Fundamentals.* to get back my original performance (from morningstar.*). All I had to do was expand the window_length and negate the effects of the expanded window with a matching start offset, so as to access the original offsets I was using in the algo. I don't want to give out my algo here, but I will produce a different algo that illustrates the issue quite convincingly, and I will send it over the weekend.

@Jamie, I sent you two algos whose performances are different after switching from morningstar.* to Fundamentals.* in the custom factor. Hope that helps.

@Joe, I sent you two algorithms as well: Morningstar-operation_margin and Fundamentals-operation_margin. They are identical algorithms, except for morningstar.* changed to Fundamentals.*.
I ran each algo with the settings below; the performances differ as listed. I have tried 3 or 4 different Morningstar variables in this algo, and every time the performance is different, even though the only change is morningstar.* to the corresponding Fundamentals.*. Please let me know if the algos I sent (along with the ones I sent to Jamie) can be used for debugging, or if you need something in research that is easier to debug, and I can try to get that as well.
Settings:
From 2003-10-01 to 2017-07-01 with $10,000,000 initial capital

Metric             Morningstar-operation_margin   Fundamentals-operation_margin
Total Returns      -16.61%                        -7.75%
Benchmark Returns  217.1%                         217.1%
Alpha              -0.01                          -0.00
Beta               -0.01                          -0.02
Sharpe             -0.25                          -0.09
Sortino            -0.35                          -0.13
Volatility         0.05                           0.05
Max Drawdown       -26.12%                        -22.19%

@Jamie and Joe,

The attached notebook shows the difference in quarterly data between the old and new APIs with window lengths of 1, 65, 130 and 195.

You should also try it with dates 2014-01-05 and 2014-01-15!
Three different discrepancies between the APIs...

[Notebook attached]

@Joe, thanks for your full explanation; I'm quite impressed!
I understand there will be no performance gain, but the thing I don't like is using offsets like 1, 65, 130 and 195. What if the company filed later? The risk of duplicates is high... A way to avoid this problem could be to pass a timeframe along with window_length, for example window_length=4 and timeframe=quarterly to get the last four quarters, instead of window_length=196 and the indexes above.
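
Purely as an illustration of that proposal (this is not an existing Quantopian API), the idea might look something like:

# Hypothetical sketch -- 'timeframe' does not exist in the pipeline API today.
class TTMRevenue(CustomFactor):
    inputs = [Fundamentals.total_revenue]
    window_length = 4          # four observations...
    timeframe = 'quarterly'    # ...sampled once per quarter rather than per day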

I have another question about something that could maybe further speed up performance, at least in backtesting.
Pipeline data are updated every day, aren't they? At least it was so with the previous mechanism.
What if an algorithm trades less frequently, for example quarterly? Wouldn't it be better for performance if the data were updated only when pipeline_output() is invoked?

Thanks
Costantino

Leo,

Thanks for sending over those strategies. Once we have finished fixing the bug related to the lookback window length, we can run them again and see if the differences persist.

Just so I'm clear about this (modulo the outstanding bugs):

General conversion rule from morningstar fundamentals to Q Fundamentals:
- Replace morningstar with Fundamentals.
- Remove the intermediate hierarchy from the name.
e.g.

operating_income = morningstar.income_statement.operating_income.latest

transforms to

operating_income = Fundamentals.operating_income.latest  

I used this conversion rule with my algos and it works, yet I can't find it stated anywhere except by example.

@Jamie says above "At the end of September, the old namespace will be deprecated and redirected to point at the new data. If you don’t manually update your notebooks or algorithms to use the new namespace, they will automatically start to use the new data."

How does this impact algos currently running in a contest? E.g., if they were entered months ago and are midway through the six-month contest... what happens? I am not able to update them since they are locked for the contest. Or will existing contest entries continue to function in the namespace they were started with?

@Alan: I'm glad that you got the new version to work - you got the general conversion rule correct. The purpose of this post was to announce the upcoming change and demonstrate how to convert to the faster namespace. We're in the process of updating the documentation to the new namespace, but we're also working out a few bugs that have been reported further up in this thread.

@Marc: You shouldn't need to update your contest algos. At the end of September, the quantopian.pipeline.data.morningstar module will start pointing to the quantopian.pipeline.data.Fundamentals dataset. This will be treated like other data corrections in the contest. Your paper trading track record up to the end of the September will not be affected, but going forward, the algorithm will warm up and act on the new version of the dataset. No code action should be required, but we advise you to backtest a version of your contest algorithm with the new namespace. If you uncover any issues with the new version, let us know and we will work to fix them before the cutover date. Does this help to clarify things?

Hi Joe and Jamie,

I'm having trouble migrating my algorithms over to the new fundamental data. Some of the new fields' values don't look correct:

  • The new trailing_dividend_yield is always NaN (it does not match the old valuation_ratios.dividend_yield field).
  • The new file_date does not always hold an up-to-date earnings filing date, e.g. for Autodesk on 2017-09-01 the new filing date is 2017-05-31, but the old field (financial_statement_filing.file_date) is (correctly) 2017-08-24.
  • The BusinessDaysSincePreviousEvent() factor returns a large negative number if passed the new file_date, but works OK if passed the old field.

The attached notebook shows the problems. I'm a Python and Quantopian newbie, so it's possible that my code is incorrect.
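
For reference, the comparison described in the last bullet looks roughly like this (a sketch; field paths as used elsewhere in this thread):

from quantopian.pipeline.factors import BusinessDaysSincePreviousEvent
from quantopian.pipeline.data import Fundamentals, morningstar

# Old field: behaves as expected.
days_since_filing_old = BusinessDaysSincePreviousEvent(
    inputs=[morningstar.financial_statement_filing.file_date])

# New field: reportedly returns large negative numbers (the bug above).
days_since_filing_new = BusinessDaysSincePreviousEvent(
    inputs=[Fundamentals.file_date])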

Forgot to attach the notebook, so here it is!

[Notebook attached]

@Jamie/@Joe, I plan to test these fixes and enhancements quite a bit (with various fundamentals and window lengths) after you announce them.

I realized that the time spent now in QA/testing of this feature will potentially save 10-100x the time that could be wasted dealing with bugs in fundamentals retrieval later on, if those issues are not found and addressed now.

I have a long/short algo that I run live (until the end of the month, that is). I updated the algo with the new Fundamentals code, and as of live trading yesterday, the portfolio symbol set being traded is significantly different from the one produced by the backtest code. Is there a reason for this?

Can someone at Q help.

Thanks

@Christopher: Thanks for reporting that issue. We're investigating the cause of the NaNs.

@Leo: We'll post back here as soon as the fixes are made. We're hoping to have an update on Monday with the majority of the bugs fixed.

@Kamran: Did you compare a backtest of your strategy with the two versions? I'm curious if you only saw a problem in live trading or if you saw the same change in backtesting. There were several data corrections so it's possible that if you're using dynamic universe selection that the names coming out of the pipeline have changed. If you'd prefer to respond privately, feel free to email in to [email protected].

Jamie, I did compare with the backtest. I printed out the symbols, and the difference was for some reason significant (a delta of 60 for a 100-position portfolio). The live portfolio traded yesterday, so I ran the backtest today to compare the symbols and found the delta. This makes no sense, as the code running live is identical. I am confused!

BTW, one more test: if I start the backtest from only two weeks ago, then the backtest symbols match the live ones in the updated Fundamentals version. If I start the algo from earlier, the portfolio symbols no longer match. I usually run the backtest from the beginning of the year. This should not matter, as the list of symbols is generated from the fundamentals factors at the beginning of the day, so the start of the backtest should not impact the ending portfolio symbol set.

This is strange ...

Hi Kamran,

It sounds like you're running into a bug that was reported earlier in this thread. We're working on the fix and hope to have it updated some time next week. Sorry for the confusion.

Thanks Jamie, Yeah this was a bit confusing and created some unnecessary trades.

Can we assume, then, that we really don't need to change anything as far as the code is concerned, since these namespaces will be merged? That way I can leave the old code as-is and wait for the backend to switch over.

Yes, you can leave the old version and it will switch over at the end of the month. However, I recommend that you try out the new version again next week after we've posted the update to make sure that the odd behavior that you experienced goes away.

Ran into the same issue when using Q3000US() from quantopian.pipeline.filters.fundamentals instead of from quantopian.pipeline.filters.
Both in live trading and in backtests, the returned universe is way too small after mid-July 2017 (only some 100 equities instead of 3000).

Question: for affected live trading contest entries, is there any hope of rerunning them for the affected dates after the bug fixes are applied? That change created quite some weird effects in live trading.

Thank you for the work on this optimization. Could you also post a list of all the fundamental data available in the new implementation? A few metrics I would like to use do not seem to appear in it.
Something like https://www.quantopian.com/help/fundamentals would be great.

When I try the new fundamentals API out today (Sept 15), the Sector values are -1 for most companies.

[Notebook attached]

Furthermore, the country codes don't always match between the morningstar data and the new Fundamentals data.

[Notebook attached]

Furthermore, the shares outstanding numbers are different and often lacking altogether in the new Fundamentals data.

[Notebook attached]

Whoa! The free cash flow figures are even more off, sometimes on the opposite side of the positive-negative scale. Which numbers can we trust!?

[Notebook attached]

Hi Otto,

Thanks for your help! We strongly suspect that the problems you've identified, and those previously reported in this thread, have a common root cause. We're in the process of testing our solution and hope to push it out to the community next week.

Josh


Thanks for the great investigative work Otto! Great stuff.

Josh - any thoughts yet on if/how you're planning to handle affected contest entries?

Andy: Did you submit a new contest entry with the new Fundamentals API? Contest entries that were submitted with the old API won't start reading from the new data for a couple more weeks. We're expecting to have the various bugs and issues, including the one impacting the Q3000/Q1500, resolved before then.

Yes, that's exactly what I did, Jamie. Backtests looked fine, so I converted them directly, but then the more recent data seems to have issues.

Thanks for clarifying, Andy. Your best bet is probably to withdraw your submission (stop the algorithm) and resubmit with the old morningstar Pipeline API before the 9:30AM ET deadline on Oct. 2. Alternatively, you can wait until we publish the bug fixes and resubmit with the new API, but the old API will be redirected to the new system at the end of the month anyway, so there shouldn't be much difference between the two options. Unfortunately, we won't be able to re-run the affected dates of entries that have already been submitted with the new namespace. Sorry for the inconvenience.

Bummer. Thanks for clarifying Jamie. Appreciate your help.

Hey everyone, we released fixes in both research and algorithms for several of the problems that were reported in this thread:
- Data should no longer be changing with different window lengths. This issue was actually the root cause of many of the problems described in this thread. We believe the problem is fixed with all fields except for the file_date field, which we are still working to solve.
- The Q Universe should now return the appropriate number of securities for any window.
- Classifiers such as the profitability grade are now working.

We are still working on a couple of issues:
- As mentioned above, the file_date still changes in a couple of cases with the window length, so we are working to fix this.
- The trailing_dividend_yield still returns NaN. This is also on our list to fix.

If you are still having issues with fields other than file_date or trailing_dividend_yield, please let us know. And thank you for all your continued hard work uncovering these problems!

Here is a notebook demonstrating the fixes.

[Notebook attached]

@Jamie, I'm still seeing a difference in total returns and max drawdown in one QA algorithm that I created for this. I have emailed it to you. I will QA some more with different window_lengths and fundamentals.

Jamie, I ran my old test, and the symbol list issue I reported seems to be fixed in this version.
Thanks

BTW, on a separate note: the SPLS symbol gets selected for trading even though data.can_trade is checking for it. This symbol has not been tradable since last month. Is there a reason it is still being selected as tradable?

Kamran, I'm glad the issue with the symbol list has been resolved. Regarding the data.can_trade question, it's tough to say what the issue is without seeing the code. Would you be able to open a ticket with support and grant us permission to look?

Thanks,
Jamie