How to Leverage the Pipeline to Conduct Machine Learning in the IDE

Hey Everybody,

We all know the Quantopian platform is incredibly powerful. But those of us who have tried to coax it into backtesting machine learning algorithms on anything other than OHLCV data also know it is less flexible with fundamental data, because Morningstar Fundamentals and other datasets can't be accessed through data.history. When working with fundamental factors, it's straightforward to use the research environment to tune a static set of parameters for a machine learning algorithm and then hard-code them in the backtesting IDE. It's much less clear how to implement an algorithm in the IDE that retrains and updates its parameters over time.

A few months back, I spent a good deal of time digging through the documentation, trawling the forum, and hacking away at this, and I ultimately found a way to make it work. Even though it likely gives away a bit of a trading edge, I wanted to share the skeleton of my algo from back then to save a lot of you the headache of trying to stand something like this up.

A lot of this repeats material from Thomas Wiecki's awesome post on the subject, but at just over 100 lines of code, I hope this algorithm will be a little more digestible for folks than the very robust one in that thread.

Note that the performance metrics of the algorithm below aren't too relevant: if you pick it up for your own use, you'll likely have a more intentional and sensible portfolio allocation strategy. Likewise, the choice of simple OLS as the machine learning algorithm and the choice of these particular factors are just placeholders; you'll want to select fundamental inputs and a model that make sense for your strategy.
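
For orientation, the placeholder allocation in the algorithm below just demeans the model's predictions and rescales them so the absolute weights sum to one, which gives a dollar-neutral book at 1x gross leverage. Here is a minimal standalone sketch of that arithmetic in plain NumPy. The function name dollar_neutral_weights and the prediction values are made up for illustration; they are not part of the Quantopian API.

```python
import numpy as np

def dollar_neutral_weights(predictions):
    """Demean scores, then scale so the absolute weights sum to 1."""
    scores = np.asarray(predictions, dtype=float)
    demeaned = scores - scores.mean()
    gross = np.abs(demeaned).sum()
    if gross == 0:
        # All predictions identical: no relative view, so hold nothing.
        return np.zeros_like(demeaned)
    return demeaned / gross

# Made-up predicted returns for four hypothetical securities.
weights = dollar_neutral_weights([0.02, -0.01, 0.00, 0.03])
```

Above-average predictions become long positions and below-average ones become shorts, so any real strategy would likely replace this step with its own sizing and constraints.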

Good luck!

EDIT: Fixed a bug in the feature standardization step. Thanks to Kelvin Ho for the catch.

"""
This is a demo algorithm to show folks how to conduct some rudimentary machine learning on Morningstar Fundamental data in the Quantopian IDE.
"""

##################################################
# Imports
##################################################

from __future__ import division
from collections import OrderedDict
import time

# Pipeline, Morningstar, and Quantopian Trading Functions
from quantopian.algorithm import attach_pipeline, pipeline_output, order_optimal_portfolio
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.optimize import TargetWeights

# The basics
import pandas as pd
import numpy as np

# SKLearn :)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Imputer, StandardScaler

##################################################
# Globals
##################################################

num_holding_days = 5 # holding our stocks for five trading days.
days_for_fundamentals_analysis = 30

##################################################
# Initialize
##################################################

def initialize(context):
    """ Called once at the start of the algorithm. """

    # Configure the setup
    set_commission(commission.PerShare(cost=0.001, min_trade_cost=0))
    set_asset_restrictions(security_lists.restrict_leveraged_etfs)

    # Schedule our function
    schedule_function(rebalance, date_rules.week_start(), time_rules.market_open(minutes=1))

    # Build the Pipeline
    attach_pipeline(make_pipeline(), 'my_pipeline')

##################################################
# Pipeline-Related Code
##################################################

class Predictor(CustomFactor):
    """ Defines our machine learning model. """

    # The factors that we want to pass to the compute function. We use an ordered dict for clear labeling of our inputs.
    factor_dict = OrderedDict([
              ('Open Price' , USEquityPricing.open),
              ('Volume' , USEquityPricing.volume),
              ('cf_yield' , Fundamentals.cf_yield),
              ('earning_yield' , Fundamentals.earning_yield),
              ('pb_ratio' , Fundamentals.pb_ratio),
              ('pe_ratio' , Fundamentals.pe_ratio), 
              ('roa' , Fundamentals.roa)
              ])

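    # Materialize as lists so that columns[i] indexing also works where keys()/values() return views.
    columns = list(factor_dict.keys())
    inputs = list(factor_dict.values())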

    # Run it.
    def compute(self, today, assets, out, *inputs):
        """ Through trial and error, I determined that each item in the input array comes in with rows as days and securities as columns. Most recent data is at the "-1" index. Oldest is at 0.

        !!Note!! In the below code, I'm making the somewhat peculiar choice  of "stacking" the data... you don't have to do that... it's just a design choice... in most cases you'll probably implement this without stacking the data.
        """

        ## Import Data and define y.
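        # Wrap each input (rows = days, columns = securities) in a DataFrame and fill gaps
        # through time (axis=0): forward-fill first, then back-fill any leading nulls.
        # The original used axis=1, which fills across securities instead of across days.
        inputs = OrderedDict(
            (name, pd.DataFrame(arr).fillna(method='ffill', axis=0).fillna(method='bfill', axis=0))
            for name, arr in zip(self.columns, inputs)
        )
        num_secs = len(inputs['Open Price'].columns)
        # y for day t is the log return from the open on day t+1 to the open on day t+1+num_holding_days.
        # The last num_holding_days+1 days have no realized return yet, so those all-null rows are dropped.
        y = (np.log(inputs['Open Price']) - np.log(inputs['Open Price'].shift(num_holding_days))).shift(-num_holding_days-1).dropna(axis=0,how='all').stack(dropna=False)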
        
        ## Get rid of our y value as an input into our machine learning algorithm.
        del inputs['Open Price']

        ## Munge X and y
        x = pd.concat([df.stack(dropna=False) for df in inputs.values()], axis=1)
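        x = Imputer(strategy='median',axis=0).fit_transform(x) # fill remaining nulls with each factor's own median (axis=1 would take the median across unrelated factors).
        y = y.fillna(y.median()).values # fill remaining nulls with the median return; avoids passing a 1-D array to Imputer.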
        scaler = StandardScaler()
        x = scaler.fit_transform(x) # demean and normalize

        ## Run Model
        model = LinearRegression() # you'll probably use something more involved here... this is just quick, for demo purposes.
        model_x = x[:-num_secs*(num_holding_days+1),:]
        model.fit(model_x, y)

        out[:] = model.predict(x[-num_secs:,:])


def make_pipeline():

    universe = QTradableStocksUS()

    pipe = Pipeline(columns={'Model': Predictor(window_length=days_for_fundamentals_analysis, mask=universe)},screen = universe)

    return pipe

##################################################
# Execution Functions
##################################################

def rebalance(context,data):
    """ Execute orders according to our schedule_function() timing."""

    # Timeit!
    start_time = time.time()

    ## Run pipeline
    pipeline_output_df = pipeline_output('my_pipeline').dropna(how='any')
    
    todays_predictions = pipeline_output_df.Model

    # Demean pipeline scores
    target_weight_series = todays_predictions.sub(todays_predictions.mean())

    # Reweight scores to prepare for portfolio ordering.
    target_weight_series = target_weight_series/target_weight_series.abs().sum()

    order_optimal_portfolio(objective=TargetWeights(target_weight_series),constraints=[])

    # Print useful things. You could also track these with the "record" function.
    # The parenthesized form works as a statement in Python 2 and as a function call in Python 3.
    print('Full rebalance computed in seconds: ' + '{0:.2f}'.format(time.time() - start_time))
    # Count longs and shorts; the original counted only positive weights.
    print('Number of total securities trading: ' + str((target_weight_series != 0).sum()))
    print('Leverage: ' + str(context.account.leverage))
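
The trickiest part of compute is lining up each day's features with the return that is realized afterwards. The sketch below reproduces that alignment outside Quantopian with plain NumPy and random toy data, so the shapes can be checked by hand. The helper name build_training_set and the toy dimensions are assumptions for illustration, and least squares via np.linalg.lstsq stands in for LinearRegression.

```python
import numpy as np

def build_training_set(open_prices, features, holding_days):
    """Pair each (day, security) feature row with its forward log return.

    open_prices: (days, securities) array, oldest row first.
    features:    (days, securities, n_features) array.
    Row t is paired with log(open[t + holding_days + 1]) - log(open[t + 1]),
    i.e. the return earned by entering at the next open.
    """
    lag = holding_days + 1
    log_open = np.log(open_prices)
    # Forward return for every day that still has `lag` days of data after it.
    y = log_open[lag:] - log_open[1:-holding_days]
    n_days, n_secs, n_feats = features.shape
    # Flatten day-major (all securities for day 0, then day 1, ...),
    # which is the same ordering DataFrame.stack() produces in the algorithm.
    x_all = features.reshape(n_days * n_secs, n_feats)
    x_train = x_all[:-n_secs * lag]   # days with a known forward return
    x_latest = x_all[-n_secs:]        # most recent day, used for prediction
    return x_train, y.reshape(-1), x_latest

# Toy data: 30 days, 4 securities, 3 features, 5-day holding period.
rng = np.random.RandomState(0)
days, secs, feats, hold = 30, 4, 3, 5
prices = 100 * np.exp(rng.normal(0, 0.01, size=(days, secs)).cumsum(axis=0))
features = rng.normal(size=(days, secs, feats))
x_train, y_train, x_latest = build_training_set(prices, features, hold)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(len(x_train)), x_train])
coef = np.linalg.lstsq(design, y_train, rcond=None)[0]
predictions = np.column_stack([np.ones(len(x_latest)), x_latest]).dot(coef)
```

With a 30-day window and a 5-day hold, only the first 24 days have a realized return, so the training set has 24 x 4 = 96 rows and the model produces one prediction per security from the latest day's features.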
4 responses

Hi Jim,

This "predict" CustomFactor is very useful, thank you very much.

However, I think there is a minor bug in the line below: np.mean(x) and np.std(x) each return a single scalar for the whole matrix rather than one value per factor, so every factor is shifted and scaled by the same two numbers.

x = (x - np.mean(x))/np.std(x) # demean and normalize

It would probably be better to replace that line with the two lines below, which standardize each fundamental factor separately (this requires "from sklearn import preprocessing"):

scaler = preprocessing.StandardScaler().fit(x)

x = scaler.transform(x)
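
To see the difference on a concrete case, compare the two approaches on a tiny matrix whose columns live on very different scales. This is a NumPy-only sketch with made-up numbers; the column-wise version is the same arithmetic StandardScaler applies by default (per-column mean and population standard deviation).

```python
import numpy as np

# Rows are samples; columns are two made-up factors on different scales.
x = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# Scalar version: one mean and one std for the whole matrix.
scalar_scaled = (x - np.mean(x)) / np.std(x)

# Column-wise version: one mean and one std per factor.
column_scaled = (x - x.mean(axis=0)) / x.std(axis=0)
```

With the scalar version the large-scale column dominates and the small-scale column collapses to nearly constant values, while the column-wise version puts both factors on the same footing.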

All the best,
Kelvin

Hi Kelvin-- Good catch! I've edited the initial post to fix the scalar reduction.

Thanks!

Hi Jim

I used your great algorithm to create a "Naive Bayes High Low Return Prediction Algorithm" for my ML students, which is a sort of simple ML mean-reversion algorithm. Do you know how we can include the following in your algo?

  • Calculated factors of the type "asset_to_equity_ratio = (Fundamentals.total_assets.latest / Fundamentals.common_stock_equity.latest)"; and
  • Custom factors

We tried unsuccessfully.

Hi German,

Great to see someone picking up the code and using it for ML education. Unfortunately, it's been a while since I've touched this, and I don't know how to implement custom factors here. I think I'm gonna have to pass you off to the Quantopian staff.

Jim
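
One possible route for German's first bullet, offered as an untested sketch rather than a confirmed answer: the Predictor class already receives raw Fundamentals columns as inputs, so the two columns behind the ratio could be added to factor_dict and divided inside compute. The helper below shows only that division step on plain NumPy arrays. The name safe_ratio and the sample numbers are made up, and the wiring described in the comments has not been run on Quantopian.

```python
import numpy as np

def safe_ratio(numerator, denominator):
    """Element-wise ratio of two (days, securities) arrays.

    Zero or missing denominators yield NaN, so the existing imputation
    step can fill them like any other missing value.
    """
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.full(numerator.shape, np.nan)
    valid = np.isfinite(denominator) & (denominator != 0)
    out[valid] = numerator[valid] / denominator[valid]
    return out

# Hypothetical wiring inside Predictor (assumption, untested on Quantopian):
#   factor_dict gains ('total_assets', Fundamentals.total_assets) and
#   ('common_stock_equity', Fundamentals.common_stock_equity); compute()
#   then builds the ratio from those two inputs before stacking.
total_assets = np.array([[200.0, 50.0], [210.0, 55.0]])
common_equity = np.array([[100.0, 0.0], [105.0, np.nan]])
asset_to_equity = safe_ratio(total_assets, common_equity)
```

Computing the ratio from raw columns inside compute keeps everything within the pattern the algorithm already uses, at the cost of carrying two extra inputs through the window.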