I have a bunch of fundamental data (e.g. Morningstar fields such as total_assets.latest and common_stock_equity) in a learning dataset that incorporates various factors, and it looks promising in terms of train/test accuracy in the Research environment. This learning data was created using Pipeline, which returns historical data as a multi-indexed pandas DataFrame in the Research environment.

Unfortunately, by design the backtesting IDE only returns the current day's data when running a pipeline. I can create custom factors that take windows of data and compare historical values to produce an output, but I don't think I can pass an entire 500-day window back through the pipeline output, since custom factors receive 2D numpy arrays but return only one value per stock.

What would be the cleanest way to create a 500-day history of total_assets.latest and common_stock_equity for each company in my universe, so that I can build a training dataset? If Q wants its users to build fundamental factors and use machine learning to combine them, it would really make sense to have an easier way to create large historical training datasets of fundamental data.

IT WOULD BE GREAT TO HAVE A SPECIAL LEARNING DATA PIPELINE IN THE IDE THAT COULD RETURN HISTORICAL DATA IN A SIMILAR FORMAT AS IN THE RESEARCH ENVIRONMENT! It would certainly speed up machine learning algo development...

12 responses

Hi Matthias,

You can use even longer time periods in your fundamentals custom factors. I remember using 780-day window lengths on this type of custom factor without any issue.

Hi Mathieu,

You are saying that you can assign the entire historical window in the custom factor to a column in your context.trainingDF? If so, my problem is solved.

Thanks for the suggestion. I was clearly not thinking out of the box on this one!

Hi Mathieu,

As I suspected, it is not possible to populate a training dataframe with Morningstar Fundamental Data in Custom Factor classes.

While CustomFactors can be populated with big windows of historical data (including Fundamental Data), I have not found a clean way to extract them out of CustomFactors as context variables can't be assigned or used within CustomFactors. The only output out of a CustomFactor is a 1D numpy array (one value per stock). For learning data to be extracted out of a CustomFactor, one would have to be able to extract a 2D array.
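To make the shape constraint described above concrete, here is a minimal sketch in plain Python rather than the Quantopian API. The function name and the sample numbers are made up for illustration; the point is only that the window going in is two-dimensional (days by stocks) while the result coming out has one value per stock.

```python
def reduce_window(window):
    """Collapse a (days x stocks) window to one value per stock.

    `window` is a list of rows: one row per day, one column per stock.
    This mirrors the 2D-in, 1D-out behaviour discussed in the thread.
    """
    # zip(*window) transposes rows-by-day into columns-by-stock.
    return [sum(column) / len(column) for column in zip(*window)]


# Hypothetical 3-day window of total assets for 2 companies.
window = [
    [100.0, 50.0],
    [110.0, 50.0],
    [120.0, 56.0],
]
per_stock = reduce_window(window)
print(len(window), "days in,", len(per_stock), "values out")
```

Whatever reduction is applied, the day dimension is gone by the time the result leaves the function, which is why the full history cannot be recovered from the output alone.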

Does anyone else know how to extract historical Fundamental Data outside of a custom factor in the IDE? (In Research, this is not a problem)

Hi Mattias,

Either use a custom factor as an input to another custom factor (the one you want to use for further computations) with a sufficient window_length, or do the further computations outside the pipeline (in before_trading_start, for example), or try to store the values in a context variable directly inside the custom factor.
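One way to read the "outside the pipeline" option is to keep each day's output and let the history build up over time. The sketch below is plain Python with simulated daily values, not Quantopian code; the class name RollingHistory, the tickers, and the numbers are all hypothetical.

```python
from collections import deque


class RollingHistory:
    """Keep the last `max_days` daily snapshots of per-stock values."""

    def __init__(self, max_days):
        # A bounded deque drops the oldest snapshot once it is full.
        self.snapshots = deque(maxlen=max_days)

    def append(self, date, values):
        """Store one day's output; `values` maps stock -> {field: value}."""
        self.snapshots.append((date, values))

    def to_rows(self):
        """Flatten the snapshots into (date, stock, field...) training rows."""
        rows = []
        for date, values in self.snapshots:
            for stock, fields in sorted(values.items()):
                rows.append((date, stock,
                             fields["total_assets"],
                             fields["common_stock_equity"]))
        return rows


# Simulated daily outputs standing in for each day's pipeline result.
history = RollingHistory(max_days=2)
history.append("2016-01-04", {"AAA": {"total_assets": 100.0, "common_stock_equity": 40.0}})
history.append("2016-01-05", {"AAA": {"total_assets": 101.0, "common_stock_equity": 41.0}})
history.append("2016-01-06", {"AAA": {"total_assets": 102.0, "common_stock_equity": 42.0}})
rows = history.to_rows()
print(len(rows), "rows kept, oldest date:", rows[0][0])
```

In an algorithm, the append step would run once per day (for example in before_trading_start), with the history stored on context. The obvious cost is warm-up: a 500-day history built this way is only complete after 500 trading days of the backtest have elapsed.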

Please post an example notebook as I can't guess exactly what you want to achieve without it.

Hi Mathieu,

I don't believe the options you suggest are valid as a custom factor only returns one value per stock in the pipeline in the IDE, and context variables can't be used inside CustomFactors. It would be nice if someone from the Quantopian team could come up with suggestions as to how to access historical fundamental data in the IDE to create a learning DF. Any examples out there?

@Mattias,

Have a look at this example ML algo by Q and see if it helps you: machine-learning-on-quantopian-part-3
Historical data is controlled by window_length in:

    def make_ml_pipeline(universe, window_length=21, n_forward_days=5):
        pipeline_columns = OrderedDict()

        # ensure that returns is the first input
        pipeline_columns['Returns'] = Returns(
            inputs=(USEquityPricing.open,),
        )

        # rank all the factors and put them after returns
        pipeline_columns.update({
            k: v.rank(mask=universe) for k, v in features.items()
        })

        # Create our ML pipeline factor. The window_length will control how much
        # lookback the passed in data will have.
        pipeline_columns['ML'] = ML(
            inputs=pipeline_columns.values(),
            window_length=window_length + 1,
        )

        pipeline_columns['Sector'] = Sector()

        return Pipeline(screen=universe, columns=pipeline_columns)



Let me just warn you about timeout exception errors if you use, say, window_length = 756 and number of stocks = 500. This is very limiting if you want a meaningful outcome.

Matthias,

context variables can't be used inside CustomFactors

Probably right, I've never used this method.
However, the other two options work.

@ James, you must rapidly encounter timeout exception with this solution. I don't think this is suited for Mattias

@Mathieu,

@ James, you must rapidly encounter timeout exception with this solution. I don't think this is suited for Mattias

Well, basically, Q's ML model is not designed the way ML algos are conventionally structured. It is a good toy example, but it is structured as if it were a linear regression prediction. The lookback period (the amount of data used for training) is 21 days, and it predicts 5 days forward for x number of stocks simultaneously, re-optimized at the desired frequency (e.g. daily). With this configuration it may run without a hitch, but if you increase the lookback period to, say, three years (756 days) to get more meaningful generalization, you would easily run into timeout issues because of the limited time allotted to pipeline.
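For a rough sense of scale, the row counts implied by the two lookbacks mentioned above can be worked out directly. This is a back-of-the-envelope sketch; the helper name is made up, and it counts stock-day rows only.

```python
def training_rows(window_length, n_stocks):
    """Number of stock-day rows in one training window."""
    return window_length * n_stocks


small = training_rows(21, 500)    # 21-day lookback, 500 stocks
large = training_rows(756, 500)   # 756-day lookback, 500 stocks
ratio = large / small
print(small, large, ratio)        # 10500 378000 36.0
```

Actual run time depends on the model and the number of features and need not scale linearly with row count, so the 36x figure is only an indication of how much more data the longer window asks the pipeline to handle.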

The other issue, in terms of design, is that under conventional ML methodology the entire training data is run through the ML/AI engine (e.g. Random Forests, Support Vector Machines, Neural Nets), and the optimized weights of the engine's weighting scheme are saved and locked, then used for validation on OOS data to see whether good generalization has been achieved. Under Q's design, weights are constantly re-optimized at the desired frequency, which actually lends itself to overfitting.

@James Villa: The current example runs with 1 year of training data, not just 21 days (although originally that was the limitation). Not sure if we can increase it to 3 years already but I wouldn't be surprised.

Also, I'm not sure what you mean by "conventional", as most ML algorithms like RF or SVMs are conventionally not run on non-stationary time-series data. Because of the non-stationarity, you have to retrain your model every once in a while. I also think the risk of overfitting is even higher if you don't do that, as you just evaluate a single set of parameters on some fixed time period. The walk-forward method functions in a manner similar to cross-validation. Of course, I agree that there needs to be enough training data in every window.
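As an illustration of the walk-forward idea, here is a generic sketch of how rolling train/test windows can be generated. It is not taken from the Q example algorithm; the function name and the window sizes are arbitrary choices for the demonstration.

```python
def walk_forward_splits(n_days, train_len, test_len):
    """Yield (train, test) ranges of day indices, rolling forward in time.

    Each test window immediately follows its training window, so the
    model is always evaluated on data it has not seen.
    """
    start = 0
    while start + train_len + test_len <= n_days:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        # Step by the test length so test windows do not overlap.
        start += test_len


splits = list(walk_forward_splits(n_days=10, train_len=4, test_len=2))
for train, test in splits:
    print(list(train), "->", list(test))
```

Each split plays the role of one fold: the model is refit on the training range and scored on the test range that follows it, which is where the similarity to cross-validation comes from.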

So, first let's assume you are using RF or SVM, which you say are "...conventionally not run on non-stationary time-series data." I'm assuming, then, that you are able to transform all your input data into stationary time series first, to make your point valid. Secondly, the model is trying to find a universal set of factors or factor combinations for x number of stocks that will work (perform well in the risk-adjusted return sense) by trying to predict returns or some other variable in the future, all simultaneously. This is the prediction stage through ML optimization; let's call it operation number 1.

You then put the results of operation number 1 through another optimization that handles the weights of the stocks in the portfolio, subject to various risk and structural constraints. Both of these operations retrain at the desired frequency, which is set by the author. While I do agree that ML algos need to be retrained from time to time, retraining too frequently, even with stationary time series, may lend itself to overfitting, especially if there are two optimization operations at play. If, however, you only retrain operation number 1 when instability in the transformed stationary time series occurs or is recognized (e.g. a spike in volatility), and then react to this transition, it gives less of the overfitting feel of just frequently optimizing and changing prediction weights every time. Does that make sense?

retraining too frequently, even with stationary time series, may lend itself to overfitting, especially if there are two optimization operations at play. If, however, you only retrain operation number 1 when instability in the transformed stationary time series occurs or is recognized (e.g. a spike in volatility), and then react to this transition, it gives less of the overfitting feel of just frequently optimizing and changing prediction weights every time. Does that make sense?

Yes, I agree with that.

To also put a bit more nuance on my previous point about cross-validation: It's certainly good practice during development of the ML pipeline to get some sense of how much it's overfitting by trying it on different time-periods. So retraining e.g. monthly makes sense there. But that doesn't necessarily mean that I actually want to retrain monthly when running this live, where other criteria like volatility could be used.
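A volatility-based retraining criterion of the kind discussed here could look something like the sketch below. The function name, the 2x threshold, and the return series are all assumptions made for illustration, not anything prescribed in this thread.

```python
from statistics import pstdev


def should_retrain(returns, window, baseline_vol, spike_ratio=2.0):
    """Return True when recent volatility exceeds spike_ratio * baseline.

    `returns` is a list of periodic returns, `window` is the number of
    most recent observations used to measure current volatility.
    """
    recent = returns[-window:]
    if len(recent) < window:
        return False  # not enough data to judge
    return pstdev(recent) > spike_ratio * baseline_vol


# Made-up return series: a calm stretch followed by a stressed one.
calm = [0.01, -0.01, 0.01, -0.01, 0.01]
stressed = calm + [0.05, -0.06, 0.07, -0.05, 0.06]
baseline = pstdev(calm)

print(should_retrain(calm, 5, baseline))      # calm period
print(should_retrain(stressed, 5, baseline))  # after the spike
```

With a trigger like this, the model's weights stay fixed through quiet periods and are only refit once the recent window looks materially different from the baseline.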

@James Villa, Thank you for pointing me in the right direction. I'll try to study this dedicated ML Pipeline to understand the inputs/outputs. If I can create a DF up to 9 months it should be sufficient.

In general, I have found that different factors go through regime changes at different frequencies (e.g. momentum factors with windows of 6+ months have done OK over time but did really poorly between Aug 2015 and June 2016, as markets were under stress for the first 6 months of this period and bounced back sharply in Feb/March 2016). I've found that either you want an extremely long learning window to create an "all-weather" model that doesn't need to be refitted, or you need much shorter windows, such as 10-21 days, so that you can benefit from the autocorrelated nature of these factors' performance (when a factor stops working, it tends to stop working for 2+ months, so your model will give it a lower weighting after seeing that it has not been working for a month). Some factors are not very autocorrelated (they flip-flop month to month), and monthly re-learning just kills performance with these kinds of factors...
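One simple way to check how autocorrelated a factor's performance is would be a lag-1 autocorrelation of its periodic returns. The sketch below uses made-up series, not real factor data, and the function name is hypothetical.

```python
def lag1_autocorr(series):
    """Lag-1 autocorrelation of a sequence of periodic factor returns."""
    n = len(series)
    if n < 2:
        return 0.0
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series carries no signal either way
    covariance = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    return covariance / variance


# Made-up monthly outcomes: +1 = factor worked, -1 = factor failed.
persistent = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]  # multi-month regimes
flip_flop = [1, -1, 1, -1, 1, -1, 1, -1]                 # reverses every month

print(lag1_autocorr(persistent) > 0)  # last month tends to repeat
print(lag1_autocorr(flip_flop) < 0)   # last month tends to reverse
```

A clearly positive value suggests that a short learning window can exploit the persistence, while a negative value is the flip-flopping case where monthly re-learning tends to chase the previous month's result.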