Side comment to "Machine Learning on Quantopian Part 3: Building an Algorithm"

This is the algo posted by Thomas W. on https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm (Backtest ID: 58517784ee8d8363d0d9790d)

A couple issues:

The earliest start date for backtesting the algo is ~2003-08-01. Why? It would be nice if it automatically adjusted the start date; instead, I get an error for earlier dates.

Also, it runs out of memory before the backtest can complete:

There was a runtime error. MemoryError Algorithm used too much memory.

Perhaps it is due to the number of transactions and unrelated to the new optimization/ML stuff. Or is there a memory leak?

from quantopian.algorithm import attach_pipeline, pipeline_output, order_optimal_portfolio
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.factors import Latest, CustomFactor, SimpleMovingAverage, AverageDollarVolume, Returns, RSI
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q500US, Q1500US
from quantopian.pipeline.data.quandl import fred_usdontd156n as libor

# If you have eventvestor, it's a good idea to screen out acquisition
# targets. Comment out & ~IsAnnouncedAcqTarget() as well. You can also run
# this over the free period.
#from quantopian.pipeline.filters.eventvestor import IsAnnouncedAcqTarget

import quantopian.experimental.optimize as opt

import talib
import pandas as pd
import numpy as np
from time import time
from collections import OrderedDict

from scipy import stats
from sklearn import linear_model, decomposition, ensemble, preprocessing, isotonic, metrics, svm

####################################
# Global configuration of strategy

N_STOCKS_TO_TRADE = 1000 # Will be split 50% long and 50% short
ML_TRAINING_WINDOW = 21 # Number of days to train the classifier on, easy to run out of memory here
PRED_N_FWD_DAYS = 1 # train on returns over N days into the future
TRADE_FREQ = date_rules.week_start() # How often to trade, for daily, set to date_rules.every_day()

#################################
# Definition of alphas

# Pipeline factors
bs = morningstar.balance_sheet
cfs = morningstar.cash_flow_statement
is_ = morningstar.income_statement
or_ = morningstar.operation_ratios
er = morningstar.earnings_report
v = morningstar.valuation
vr = morningstar.valuation_ratios

class Sector(Sector):
    window_safe = True

def make_factors():
    def Asset_Growth_3M():
        return Returns(inputs=[bs.total_assets], window_length=63)

    def Asset_To_Equity_Ratio():
        return bs.total_assets.latest / bs.common_stock_equity.latest

    def Capex_To_Cashflows():
        return (cfs.capital_expenditure.latest * 4.) / \
            (cfs.free_cash_flow.latest * 4.)

    def EBITDA_Yield():
        return (is_.ebitda.latest * 4.) / \
            USEquityPricing.close.latest

    def EBIT_To_Assets():
        return (is_.ebit.latest * 4.) / \
            bs.total_assets.latest

    def Return_On_Total_Invest_Capital():
        return or_.roic.latest

    class Mean_Reversion_1M(CustomFactor):
        inputs = [Returns(window_length=21)]
        window_length = 252

        def compute(self, today, assets, out, monthly_rets):
            out[:] = (monthly_rets[-1] - np.nanmean(monthly_rets, axis=0)) / \
                np.nanstd(monthly_rets, axis=0)

    class MACD_Signal_10d(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 60

        def compute(self, today, assets, out, close):
            sig_lines = []
            for col in close.T:
                # get signal line only
                try:
                    _, signal_line, _ = talib.MACD(col, fastperiod=12,
                                                   slowperiod=26, signalperiod=10)
                    sig_lines.append(signal_line[-1])
                # if error calculating, return NaN
                except:
                    sig_lines.append(np.nan)
            out[:] = sig_lines

    class Moneyflow_Volume_5d(CustomFactor):
        inputs = [USEquityPricing.close, USEquityPricing.volume]
        window_length = 5

        def compute(self, today, assets, out, close, volume):
            mfvs = []
            for col_c, col_v in zip(close.T, volume.T):
                # denominator
                denominator = np.dot(col_c, col_v)
                # numerator
                numerator = 0.
                for n, price in enumerate(col_c.tolist()):
                    if price > col_c[n - 1]:
                        numerator += price * col_v[n]
                    else:
                        numerator -= price * col_v[n]
                mfvs.append(numerator / denominator)
            out[:] = mfvs

    def Net_Income_Margin():
        return or_.net_margin.latest

    def Operating_Cashflows_To_Assets():
        return (cfs.operating_cash_flow.latest * 4.) / \
            bs.total_assets.latest

    def Price_Momentum_3M():
        return Returns(window_length=63)

    class Price_Oscillator(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 252

        def compute(self, today, assets, out, close):
            four_week_period = close[-20:]
            out[:] = (np.nanmean(four_week_period, axis=0) /
                      np.nanmean(close, axis=0)) - 1.

    def Returns_39W():
        return Returns(window_length=215)

    class Trendline(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 252

        # using MLE for speed
        def compute(self, today, assets, out, close):
            # prepare X matrix (x_is - x_bar)
            X = range(self.window_length)
            X_bar = np.nanmean(X)
            X_vector = X - X_bar
            X_matrix = np.tile(X_vector, (len(close.T), 1)).T

            # prepare Y matrix (y_is - y_bar)
            Y_bar = np.nanmean(close, axis=0)
            Y_bars = np.tile(Y_bar, (self.window_length, 1))
            Y_matrix = close - Y_bars

            # prepare variance of X
            X_var = np.nanvar(X)

            # multiply X matrix and Y matrix and sum (dot product)
            # then divide by variance of X
            # this gives the MLE of Beta
            out[:] = (np.sum((X_matrix * Y_matrix), axis=0) / X_var) / \
                (self.window_length)

    class Vol_3M(CustomFactor):
        inputs = [Returns(window_length=2)]
        window_length = 63

        def compute(self, today, assets, out, rets):
            out[:] = np.nanstd(rets, axis=0)

    def Working_Capital_To_Assets():
        return bs.working_capital.latest / bs.total_assets.latest

    class Momentum(CustomFactor):
        """ Momentum factor """
        inputs = [USEquityPricing.close,
                  Returns(window_length=126)]
        window_length = 252

        def compute(self, today, assets, out, prices, returns):
            out[:] = ((prices[-21] - prices[-252]) / prices[-252] -
                      (prices[-1] - prices[-21]) / prices[-21]) / np.nanstd(returns, axis=0)

    # Commenting out some factors so as to not run out of memory
    all_factors = {
        'Asset Growth 3M': Asset_Growth_3M,
        #'Asset to Equity Ratio': Asset_To_Equity_Ratio,
        #'Capex to Cashflows': Capex_To_Cashflows,
        #'EBIT to Assets': EBIT_To_Assets,
        #'EBITDA Yield': EBITDA_Yield,
        'MACD Signal Line': MACD_Signal_10d,
        'Mean Reversion 1M': Mean_Reversion_1M,
        #'Moneyflow Volume 5D': Moneyflow_Volume_5d,
        'Net Income Margin': Net_Income_Margin,
        #'Operating Cashflows to Assets': Operating_Cashflows_To_Assets,
        'Price Momentum 3M': Price_Momentum_3M,
        'Price Oscillator': Price_Oscillator,
        'Return on Invest Capital': Return_On_Total_Invest_Capital,
        '39 Week Returns': Returns_39W,
        'Trendline': Trendline,
        'Vol 3M': Vol_3M,
    }

    return all_factors

def shift_mask_data(X, Y, upper_percentile=70, lower_percentile=30, n_fwd_days=1):
    # Shift X to match factors at t to returns at t+n_fwd_days
    # (we want to predict future returns after all)
    shifted_X = np.roll(X, n_fwd_days, axis=0)

    # Slice off rolled elements
    X = shifted_X[n_fwd_days:]
    Y = Y[n_fwd_days:]

    n_time, n_stocks, n_factors = X.shape

    # Look for biggest up and down movers
    upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
    lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]

    # Only try to predict whether a stock moved up/down relative to other stocks
    Y_binary = np.zeros(n_time * n_stocks)
    Y_binary[(Y > upper).flatten()] = 1
    Y_binary[(Y < lower).flatten()] = -1

    # Flatten X
    X = X.reshape((n_time * n_stocks, n_factors))

    # Drop stocks that did not move much (i.e. are in the 30th to 70th percentile)
    mask = Y_binary != 0
    X = X[mask]
    Y_binary = Y_binary[mask]

    return X, Y_binary

def get_last_values(input_data):
    last_values = []
    for dataset in input_data:
        last_values.append(dataset[-1])
    return np.vstack(last_values).T

# Definition of Machine Learning factor which trains a model and predicts forward returns
class ML(CustomFactor):
    init = False

    def compute(self, today, assets, out, returns, *inputs):
        # inputs is a list of factors. For example, assume we have 2 alpha
        # signals, 3 stocks, and a lookback of 2 days. Each element in the
        # inputs list will be the data of one signal, so len(inputs) == 2.
        # Each element will then contain a 2-D array of shape [time x stocks]:
        # inputs[0]:
        # [[1, 3, 2],  # factor 1 rankings of day t-1 for 3 stocks
        #  [3, 2, 1]]  # factor 1 rankings of day t for 3 stocks
        # inputs[1]:
        # [[2, 3, 1],  # factor 2 rankings of day t-1 for 3 stocks
        #  [1, 2, 3]]  # factor 2 rankings of day t for 3 stocks

        if (not self.init) or (today.weekday() == 0):  # Monday
            # Instantiate sklearn objects
            self.imputer = preprocessing.Imputer()
            self.scaler = preprocessing.MinMaxScaler()
            log.debug('Training classifier...')
            self.clf = ensemble.RandomForestClassifier()

            # Stack factor rankings
            X = np.dstack(inputs)  # (time, stocks, factors)
            Y = returns  # (time, stocks)

            # Shift data to match with future returns and binarize
            # returns based on their cross-sectional percentile
            X, Y = shift_mask_data(X, Y, n_fwd_days=PRED_N_FWD_DAYS)

            X = self.imputer.fit_transform(X)
            X = self.scaler.fit_transform(X)

            # Fit the classifier
            self.clf.fit(X, Y)
            #log.debug(self.clf.feature_importances_)
            self.init = True

        # Predict
        # Get most recent factor values (inputs always has the full history)
        last_factor_values = get_last_values(inputs)
        last_factor_values = self.imputer.transform(last_factor_values)
        last_factor_values = self.scaler.transform(last_factor_values)

        # Predict the probability for each stock going up
        # (column 2 of the output of .predict_proba()) and
        # return it via assignment to out.
        out[:] = self.clf.predict_proba(last_factor_values)[:, 1]

def make_ml_pipeline(factors, universe, window_length=21, n_fwd_days=5):
    factors_pipe = OrderedDict()
    # Create returns over last n days.
    factors_pipe['Returns'] = Returns(inputs=[USEquityPricing.open],
                                      mask=universe,
                                      window_length=n_fwd_days + 1)

    # Instantiate ranked factors
    for name, f in factors.iteritems():
        factors_pipe[name] = f().rank(mask=universe)

    # Create our ML pipeline factor. The window_length will control how much
    # lookback the passed in data will have.
    factors_pipe['ML'] = ML(inputs=factors_pipe.values(),
                            window_length=window_length + 1,
                            mask=universe)

    factors_pipe['Sector'] = Sector()

    pipe = Pipeline(screen=universe, columns=factors_pipe)

    return pipe

###########################################################
## Algo definition
def initialize(context):
    """
    Called once at the start of the algorithm.
    """
    schedule_function(my_rebalance, TRADE_FREQ,
                      time_rules.market_open(minutes=10))

    # Record tracking variables at the end of each day.
    schedule_function(my_record_vars, date_rules.every_day(),
                      time_rules.market_close())

    # Set up universe, alphas and ML pipeline
    context.universe = Q1500US()  # & ~IsAnnouncedAcqTarget()
    ml_factors = make_factors()
    ml_pipeline = make_ml_pipeline(ml_factors,
                                   context.universe,
                                   n_fwd_days=PRED_N_FWD_DAYS,
                                   window_length=ML_TRAINING_WINDOW)
    # Create our dynamic stock selector.
    attach_pipeline(ml_pipeline, 'alpha_model')

"""
Called every day before market open.
"""
context.predicted_probs = pipeline_output('alpha_model')['ML']
context.predicted_probs.index.rename(['date', 'equity'], inplace=True)

context.risk_factors = pipeline_output('alpha_model')[['Vol 3M', 'Sector']]
context.risk_factors.index.rename(['date', 'equity'], inplace=True)
context.risk_factors.Sector = context.risk_factors.Sector.map(Sector.SECTOR_NAMES)

# These are the securities that we are interested in trading each day.
context.security_list = context.predicted_probs.index

########################################################
# Portfolio construction

def my_rebalance(context, data):
    """
    Execute orders according to our schedule_function() timing.
    """
    risk_model_factors = context.risk_factors
    risk_model_factors = risk_model_factors.join(context.predicted_probs, how='right').dropna()

    predictions = risk_model_factors.ML

    # Filter out stocks that can not be traded
    predictions = predictions.loc[data.can_trade(predictions.index)]
    # Select top and bottom N stocks
    predictions = pd.concat([predictions.nlargest(N_STOCKS_TO_TRADE // 2),
                             predictions.nsmallest(N_STOCKS_TO_TRADE // 2)])

    todays_universe = predictions.index

    predictions -= 0.5  # predictions are probabilities ranging from 0 to 1

    # Setup Optimization Objective
    objective = opt.MaximizeAlpha(predictions)

    # Setup Optimization Constraints
    constrain_gross_leverage = opt.MaxGrossLeverage(1.0)
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(-.02, .02)
    market_neutral = opt.DollarNeutral()
    # TypeError: cannot do label indexing on <class 'pandas.indexes.base.Index'>
    # with these indexers [nan] of <type 'float'>
    sector_neutral = opt.NetPartitionExposure.with_equal_bounds(
        labels=context.risk_factors.Sector.dropna(),
        min=-0.0001,
        max=0.0001,
    )

    # Run the optimization. This will calculate new portfolio weights and
    # manage moving our portfolio toward the target.
    order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
            market_neutral,
            sector_neutral,
        ],
        universe=todays_universe,
    )

def my_record_vars(context, data):
    """
    Plot variables at the end of each day.
    """
    record(leverage=context.account.leverage,
           num_positions=len(context.portfolio.positions))

def handle_data(context, data):
    """
    Called every minute.
    """
    pass


Regarding the start-date error: it might be that some pipeline factors have a very long window_length, so long that they require data from before the start of the Q database. The same thing has happened to me sometimes.

The memory error could be due to the universe filter Q1500US(). I get the same error every time I choose a large universe. Try Q500US() instead.
I know it would be amazing to have no limits, and it's frustrating not being able to test what we like, but we have to cope with remote execution of our code. The more Q succeeds, the more resources we will have, I guess.

Yeah, Q1500US() has to be started on 11/1/2002 or later. There must be some longer windows in there that push the date out. My suggestion to Q support is to follow their paradigm of automatically finding the earliest start date and jumping to it, since the algo will crash otherwise. Or report the earliest possible start date in the error message. For the code I posted above, it just crashes, and one has to fiddle around to find the earliest start date.
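The arithmetic behind that kind of cutoff can be sketched with pandas. The database start date and the 252-day window below are assumptions for illustration, not Quantopian's actual numbers:

```python
import pandas as pd

# Assumed for illustration: the price database starts in early 2002 and the
# longest CustomFactor window in the algo is 252 trading days.
data_start = pd.Timestamp('2002-01-02')
longest_window = 252

# The first session on which a 252-day window can be fully populated is
# roughly 252 business days after the first day of data. Any backtest
# starting earlier than this cannot fill the window and errors out.
sessions = pd.bdate_range(data_start, periods=longest_window + 1)
earliest_start = sessions[-1]
print(earliest_start.date())
```

A 252-business-day window pushes the earliest feasible start date nearly a year past the start of the data, which is consistent with a database beginning in 2002 forcing backtests to start in late 2002 or 2003.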

The memory error could be due to the way transactions are stored. It is relatively easy to overload the memory with a long backtest, I've found. It is a mystery to me, since I'd think they could be buffered and written to disk as the backtest runs. Once the backtest ends, they must get stored there anyway, so why not just put them there as they become available? I guess the entire history is needed in working memory to compute some stats? But then, I think the backtest will also crash if it is run in the background. Perhaps when run in the background, the rolling stats could be suppressed, and the backtest could then be loaded into the research platform (where there is also a limitation that tends to top out for long backtests with portfolios of hundreds of stocks... but if adding memory to the backtester is too expensive, maybe it could be done more easily in the research platform?).
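The buffering idea above can be sketched in a few lines: append each day's transactions to disk as they occur instead of keeping the full history in memory. The file name and record layout here are made up for illustration:

```python
import csv
import os
import tempfile

# Hypothetical sink for transaction records; in a real backtester this would
# be managed storage, not a temp file.
path = os.path.join(tempfile.mkdtemp(), 'transactions.csv')

def flush_transactions(day_txns, path=path):
    # Append-only write: memory cost per day is just that day's transactions.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerows(day_txns)

# Simulate two trading days' worth of fills (fabricated rows).
flush_transactions([('2003-08-04', 'AAPL', 100, 10.0)])
flush_transactions([('2003-08-05', 'AAPL', -100, 10.5)])

with open(path) as f:
    n_rows = sum(1 for _ in f)
```

The trade-off is exactly the one raised above: any rolling statistic that needs the full transaction history would then have to re-read from disk rather than index into memory.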

Thanks for the feedback.

1. I agree, it would be nice if it automatically detected the start date in this case. I created a feature request. You've already discovered the workaround, though, which is to start the backtest at a later date.
2. The ML algorithm that Thomas shared does indeed push the limits of what the platform can support right now. His post noted that: "Ideally, we could train on a longer window, like 6 months. But it's very easy to run out of memory here. We are acutely aware of this problem and are working hard to resolve it. Until then, we have to make do with being limited in this regard." I'm not sure exactly what triggers this memory problem. Transactions are often a culprit, as are orders.

We are regularly making improvements in memory, speed, etc. to support algorithms. For now, as Thomas noted, this one is pushing the boundaries of what we can do. In the future it will be faster and more robust.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Regarding the memory limit Thomas' algo hits, it might be worthwhile just to make sure it is, in fact, the number of transactions/orders (transactions can be gauged by loading the backtest into the research platform), or whether there is a new problem, such as a memory leak somewhere in Thomas' code. Maybe running his code without transactions would help sort this out. --Grant
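Once a backtest is loaded in research (via `get_backtest`, which exposes the fills as a DataFrame), gauging transaction counts is a one-liner. The frame below is a fabricated stand-in so the sketch runs anywhere; the column names are assumptions:

```python
import pandas as pd

# Stand-in for the transactions frame of a loaded backtest result;
# rows are fabricated for illustration, indexed by fill timestamp.
transactions = pd.DataFrame(
    {'amount': [100, -100, 50],
     'price': [10.0, 10.5, 20.0]},
    index=pd.to_datetime(['2003-08-04', '2003-08-04', '2003-08-11']),
)

total_txns = len(transactions)                                  # overall count
txns_per_day = transactions.groupby(transactions.index.date).size()  # daily count
```

A rebalance over ~1000 names at weekly frequency would put `txns_per_day` in the hundreds on rebalance days, which is the scale at which stored transactions start to matter for memory.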

FWIW, I'm pretty sure it's pipeline running out of memory when too big of a window is requested. Although transactions and orders contribute too if you increase the number of traded instruments too much, or use a short rebalance window.
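A back-of-envelope for how the requested window drives pipeline memory; the term count, universe size, and per-term buffer model are made-up round numbers for illustration, not measured values:

```python
def window_footprint_bytes(window_length, n_stocks=1500, n_terms=10, itemsize=8):
    """Rough model: pipeline keeps one float64 history buffer of shape
    (window_length x n_stocks) per windowed term, so cost grows linearly
    in the window length."""
    return window_length * n_stocks * n_terms * itemsize

small = window_footprint_bytes(21)    # ~1-month training window
large = window_footprint_bytes(126)   # ~6-month training window
```

Under this model, going from the 21-day training window to the 6-month one Thomas wanted multiplies the buffer cost six-fold, which fits the observation that the longer window runs out of memory while 21 days squeaks by.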


Hi Thomas -

I didn't change the window in your code, and when I load the aborted backtest into the research platform, only 9% of the memory is consumed. So, I don't think it is the overall scale of the algo (transactions, orders, etc.). Not sure what's going on, but as you guys post examples, I'd expect users to try running over the longest time periods possible. Not sure how you did the testing on your end, but if you run all the way back when backtesting, then if some fixes are still required, at least you can add a caveat (which you did, but I'm not clear if it applies here, because apparently there are multiple factors that can cause large scale algos to crash due to memory limitations).

EDIT: Well, I actually had another error:

TimeoutException: Call to before_trading_start timed out
There was a runtime error on line 352.

So, it may be that the memory error I reported was due to transactions. I'll see if I can push past the time-out error...maybe some electrons/holes in your transistors got confused.

Thomas -

Attached, you'll find the code you have posted on https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm (backtest ID: 58517784ee8d8363d0d9790d). It no longer runs to completion, but gives the error:

There was a runtime error.
MemoryError

When I load the backtest into the research platform, it only consumes 9% of the memory.

2
Total Returns
--
Alpha
--
Beta
--
Sharpe
--
Sortino
--
Max Drawdown
--
Benchmark Returns
--
Volatility
--
 Returns 1 Month 3 Month 6 Month 12 Month
 Alpha 1 Month 3 Month 6 Month 12 Month
 Beta 1 Month 3 Month 6 Month 12 Month
 Sharpe 1 Month 3 Month 6 Month 12 Month
 Sortino 1 Month 3 Month 6 Month 12 Month
 Volatility 1 Month 3 Month 6 Month 12 Month
 Max Drawdown 1 Month 3 Month 6 Month 12 Month
from quantopian.algorithm import attach_pipeline, pipeline_output, order_optimal_portfolio
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.factors import Latest, CustomFactor, SimpleMovingAverage, AverageDollarVolume, Returns, RSI
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q500US, Q1500US
from quantopian.pipeline.data.quandl import fred_usdontd156n as libor

# If you have eventvestor, it's a good idea to screen out aquisition targets
# Comment out & ~IsAnnouncedAcqTarget() as well. You can also run this over
# the free period.
#from quantopian.pipeline.filters.eventvestor import IsAnnouncedAcqTarget

import quantopian.experimental.optimize as opt

import talib
import pandas as pd
import numpy as np
from time import time
from collections import OrderedDict

from scipy import stats
from sklearn import linear_model, decomposition, ensemble, preprocessing, isotonic, metrics, svm

####################################
# Global configuration of strategy

N_STOCKS_TO_TRADE = 1000 # Will be split 50% long and 50% short
ML_TRAINING_WINDOW = 21 # Number of days to train the classifier on, easy to run out of memory here
PRED_N_FWD_DAYS = 1 # train on returns over N days into the future
TRADE_FREQ = date_rules.week_start() # How often to trade, for daily, set to date_rules.every_day()

#################################
# Definition of alphas

# Pipeline factors
bs = morningstar.balance_sheet
cfs = morningstar.cash_flow_statement
is_ = morningstar.income_statement
or_ = morningstar.operation_ratios
er = morningstar.earnings_report
v = morningstar.valuation
vr = morningstar.valuation_ratios

class Sector(Sector):
window_safe = True

def make_factors():
def Asset_Growth_3M():
return Returns(inputs=[bs.total_assets], window_length=63)

def Asset_To_Equity_Ratio():
return bs.total_assets.latest / bs.common_stock_equity.latest

def Capex_To_Cashflows():
return (cfs.capital_expenditure.latest * 4.) / \
(cfs.free_cash_flow.latest * 4.)

def EBITDA_Yield():
return (is_.ebitda.latest * 4.) / \
USEquityPricing.close.latest

def EBIT_To_Assets():
return (is_.ebit.latest * 4.) / \
bs.total_assets.latest

def Return_On_Total_Invest_Capital():
return or_.roic.latest

class Mean_Reversion_1M(CustomFactor):
inputs = [Returns(window_length=21)]
window_length = 252

def compute(self, today, assets, out, monthly_rets):
out[:] = (monthly_rets[-1] - np.nanmean(monthly_rets, axis=0)) / \
np.nanstd(monthly_rets, axis=0)

class MACD_Signal_10d(CustomFactor):
inputs = [USEquityPricing.close]
window_length = 60

def compute(self, today, assets, out, close):

sig_lines = []

for col in close.T:
# get signal line only
try:
_, signal_line, _ = talib.MACD(col, fastperiod=12,
slowperiod=26, signalperiod=10)
sig_lines.append(signal_line[-1])
# if error calculating, return NaN
except:
sig_lines.append(np.nan)
out[:] = sig_lines

class Moneyflow_Volume_5d(CustomFactor):
inputs = [USEquityPricing.close, USEquityPricing.volume]
window_length = 5

def compute(self, today, assets, out, close, volume):
mfvs = []
for col_c, col_v in zip(close.T, volume.T):
# denominator
denominator = np.dot(col_c, col_v)
# numerator
numerator = 0.
for n, price in enumerate(col_c.tolist()):
if price > col_c[n - 1]:
numerator += price * col_v[n]
else:
numerator -= price * col_v[n]
mfvs.append(numerator / denominator)
out[:] = mfvs

def Net_Income_Margin():
return or_.net_margin.latest

def Operating_Cashflows_To_Assets():
return (cfs.operating_cash_flow.latest * 4.) / \
bs.total_assets.latest

def Price_Momentum_3M():
return Returns(window_length=63)

class Price_Oscillator(CustomFactor):
inputs = [USEquityPricing.close]
window_length = 252

def compute(self, today, assets, out, close):
four_week_period = close[-20:]
out[:] = (np.nanmean(four_week_period, axis=0) /
np.nanmean(close, axis=0)) - 1.

def Returns_39W():
return Returns(window_length=215)

class Trendline(CustomFactor):
inputs = [USEquityPricing.close]
window_length = 252

# using MLE for speed
def compute(self, today, assets, out, close):

# prepare X matrix (x_is - x_bar)
X = range(self.window_length)
X_bar = np.nanmean(X)
X_vector = X - X_bar
X_matrix = np.tile(X_vector, (len(close.T), 1)).T

# prepare Y matrix (y_is - y_bar)
Y_bar = np.nanmean(close, axis=0)
Y_bars = np.tile(Y_bar, (self.window_length, 1))
Y_matrix = close - Y_bars

# prepare variance of X
X_var = np.nanvar(X)

# multiply X matrix an Y matrix and sum (dot product)
# then divide by variance of X
# this gives the MLE of Beta
out[:] = (np.sum((X_matrix * Y_matrix), axis=0) / X_var) / \
(self.window_length)

class Vol_3M(CustomFactor):
inputs = [Returns(window_length=2)]
window_length = 63

def compute(self, today, assets, out, rets):
out[:] = np.nanstd(rets, axis=0)

def Working_Capital_To_Assets():
return bs.working_capital.latest / bs.total_assets.latest

""" Momentum factor """
inputs = [USEquityPricing.close,
Returns(window_length=126)]
window_length = 252

def compute(self, today, assets, out, prices, returns):
out[:] = ((prices[-21] - prices[-252])/prices[-252] -
(prices[-1] - prices[-21])/prices[-21]) / np.nanstd(returns, axis=0)

# Commenting out some factors to not run out-of-memory
all_factors = {
'Asset Growth 3M': Asset_Growth_3M,
#'Asset to Equity Ratio': Asset_To_Equity_Ratio,
#'Capex to Cashflows': Capex_To_Cashflows,
#'EBIT to Assets': EBIT_To_Assets,
#'EBITDA Yield': EBITDA_Yield,
'MACD Signal Line': MACD_Signal_10d,
'Mean Reversion 1M': Mean_Reversion_1M,
#'Moneyflow Volume 5D': Moneyflow_Volume_5d,
'Net Income Margin': Net_Income_Margin,
#'Operating Cashflows to Assets': Operating_Cashflows_To_Assets,
'Price Momentum 3M': Price_Momentum_3M,
'Price Oscillator': Price_Oscillator,
'Return on Invest Capital': Return_On_Total_Invest_Capital,
'39 Week Returns': Returns_39W,
'Trendline': Trendline,
'Vol 3M': Vol_3M,
}

return all_factors

def shift_mask_data(X, Y, upper_percentile=70, lower_percentile=30, n_fwd_days=1):
# Shift X to match factors at t to returns at t+n_fwd_days (we want to predict future returns after all)
shifted_X = np.roll(X, n_fwd_days, axis=0)

# Slice off rolled elements
X = shifted_X[n_fwd_days:]
Y = Y[n_fwd_days:]

n_time, n_stocks, n_factors = X.shape

# Look for biggest up and down movers
upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]

# Only try to predict whether a stock moved up/down relative to other stocks
Y_binary = np.zeros(n_time * n_stocks)

# Flatten X
X = X.reshape((n_time * n_stocks, n_factors))

# Drop stocks that did not move much (i.e. are in the 30th to 70th percentile)

return X, Y_binary

def get_last_values(input_data):
last_values = []
for dataset in input_data:
last_values.append(dataset[-1])
return np.vstack(last_values).T

# Definition of Machine Learning factor which trains a model and predicts forward returns
class ML(CustomFactor):
init = False

def compute(self, today, assets, out, returns, *inputs):
# inputs is a list of factors, for example, assume we have 2 alpha signals, 3 stocks,
# and a lookback of 2 days. Each element in the inputs list will be data of
# one signal, so len(inputs) == 2. Then each element will contain a 2-D array
# of shape [time x stocks]. For example:
# inputs[0]:
# [[1, 3, 2], # factor 1 rankings of day t-1 for 3 stocks
#  [3, 2, 1]] # factor 1 rankings of day t for 3 stocks
# inputs[1]:
# [[2, 3, 1], # factor 2 rankings of day t-1 for 3 stocks
#  [1, 2, 3]] # factor 2 rankings of day t for 3 stocks

if (not self.init) or (today.weekday() == 0): # Monday
# Instantiate sklearn objects
self.imputer = preprocessing.Imputer()
self.scaler = preprocessing.MinMaxScaler()
log.debug('Training classifier...')
#self.clf = ensemble.RandomForestClassifier()

# Stack factor rankings
X = np.dstack(inputs) # (time, stocks, factors)
Y = returns # (time, stocks)

# Shift data to match with future returns and binarize
# returns based on their
X, Y = shift_mask_data(X, Y, n_fwd_days=PRED_N_FWD_DAYS)

X = self.imputer.fit_transform(X)
X = self.scaler.fit_transform(X)

# Fit the classifier
self.clf.fit(X, Y)
#log.debug(self.clf.feature_importances_)
self.init = True

# Predict
# Get most recent factor values (inputs always has the full history)
last_factor_values = get_last_values(inputs)
last_factor_values = self.imputer.transform(last_factor_values)
last_factor_values = self.scaler.transform(last_factor_values)

# Predict the probability for each stock going up
# (column 2 of the output of .predict_proba()) and
# return it via assignment to out.
out[:] = self.clf.predict_proba(last_factor_values)[:, 1]

def make_ml_pipeline(factors, universe, window_length=21, n_fwd_days=5):
factors_pipe = OrderedDict()
# Create returns over last n days.
factors_pipe['Returns'] = Returns(inputs=[USEquityPricing.open],

# Instantiate ranked factors
for name, f in factors.iteritems():

# Create our ML pipeline factor. The window_length will control how much
# lookback the passed in data will have.
factors_pipe['ML'] = ML(inputs=factors_pipe.values(),
window_length=window_length + 1,

factors_pipe['Sector'] = Sector()

pipe = Pipeline(screen=universe, columns=factors_pipe)

return pipe

###########################################################
## Algo definition
def initialize(context):
    """
    Called once at the start of the algorithm.
    """
    # Rebalance according to the configured trade frequency, shortly after open.
    schedule_function(my_rebalance, TRADE_FREQ,
                      time_rules.market_open(minutes=10))

    # Record tracking variables at the end of each day.
    schedule_function(my_record_vars, date_rules.every_day(),
                      time_rules.market_close())

    # Set up universe, alphas and ML pipeline
    context.universe = Q1500US() # & ~IsAnnouncedAcqTarget()
    ml_factors = make_factors()
    ml_pipeline = make_ml_pipeline(ml_factors,
                                   context.universe,
                                   n_fwd_days=PRED_N_FWD_DAYS,
                                   window_length=ML_TRAINING_WINDOW)
    # Create our dynamic stock selector.
    attach_pipeline(ml_pipeline, 'alpha_model')

def before_trading_start(context, data):
    """
    Called every day before market open.
    """
    context.predicted_probs = pipeline_output('alpha_model')['ML']
    context.predicted_probs.index.rename(['date', 'equity'], inplace=True)

    context.risk_factors = pipeline_output('alpha_model')[['Vol 3M', 'Sector']]
    context.risk_factors.index.rename(['date', 'equity'], inplace=True)
    context.risk_factors.Sector = context.risk_factors.Sector.map(Sector.SECTOR_NAMES)

    # These are the securities that we are interested in trading each day.
    context.security_list = context.predicted_probs.index

########################################################
# Portfolio construction

def my_rebalance(context, data):
    """
    Execute orders according to our schedule_function() timing.
    """
    risk_model_factors = context.risk_factors
    risk_model_factors = risk_model_factors.join(context.predicted_probs, how='right').dropna()

    predictions = risk_model_factors.ML

    # Filter out stocks that can not be traded
    predictions = predictions.loc[data.can_trade(predictions.index)]
    # Select top and bottom N stocks
    n_long_short = N_STOCKS_TO_TRADE // 2
    predictions = pd.concat([predictions.nlargest(n_long_short),
                             predictions.nsmallest(n_long_short)])

    todays_universe = predictions.index

    predictions -= 0.5  # predictions are probabilities ranging from 0 to 1

    # Setup Optimization Objective
    objective = opt.MaximizeAlpha(predictions)

    # Setup Optimization Constraints
    constrain_gross_leverage = opt.MaxGrossLeverage(1.0)
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(-.02, .02)
    market_neutral = opt.DollarNeutral()
    # TypeError: cannot do label indexing on <class 'pandas.indexes.base.Index'> with these indexers [nan] of <type 'float'>
    sector_neutral = opt.NetPartitionExposure.with_equal_bounds(
        labels=context.risk_factors.Sector.dropna(),
        min=-0.0001,
        max=0.0001,
    )

    # Run the optimization. This will calculate new portfolio weights and
    # manage moving our portfolio toward the target.
    order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
            market_neutral,
            sector_neutral,
        ],
        universe=todays_universe,
    )

def my_record_vars(context, data):
    """
    Plot variables at the end of each day.
    """
    record(leverage=context.account.leverage,
           num_positions=len(context.portfolio.positions))

def handle_data(context, data):
    """
    Called every minute.
    """
    pass

There was a runtime error.

Grant, can you try it again? I just cloned your algo, and it ran to completion.

I'm not sure why it would be intermittent like that, but I'm looking.

Hi Dan,

I cloned a fresh copy of Thomas' code from https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm (Backtest ID: 58517784ee8d8363d0d9790d) and ran it, starting at 8/1/2003. I got:

There was a runtime error.
MemoryError

When I load the backtest into the research platform, only 11% of the research platform memory is consumed. So, I've tried it again, and it will not run to completion. Given that so little of the research platform memory is consumed when loading the backtest, I'm skeptical that orders/transactions are the culprit.

from quantopian.algorithm import attach_pipeline, pipeline_output, order_optimal_portfolio
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.factors import Latest, CustomFactor, SimpleMovingAverage, AverageDollarVolume, Returns, RSI
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q500US, Q1500US
from quantopian.pipeline.data.quandl import fred_usdontd156n as libor

# If you have eventvestor, it's a good idea to screen out acquisition targets.
# Comment out & ~IsAnnouncedAcqTarget() as well. You can also run this over
# the free period.
#from quantopian.pipeline.filters.eventvestor import IsAnnouncedAcqTarget

import quantopian.experimental.optimize as opt

import talib
import pandas as pd
import numpy as np
from time import time
from collections import OrderedDict

from scipy import stats
from sklearn import linear_model, decomposition, ensemble, preprocessing, isotonic, metrics, svm

####################################
# Global configuration of strategy

N_STOCKS_TO_TRADE = 1000 # Will be split 50% long and 50% short
ML_TRAINING_WINDOW = 21 # Number of days to train the classifier on, easy to run out of memory here
PRED_N_FWD_DAYS = 1 # train on returns over N days into the future
TRADE_FREQ = date_rules.week_start() # How often to trade, for daily, set to date_rules.every_day()

#################################
# Definition of alphas

# Pipeline factors
bs = morningstar.balance_sheet
cfs = morningstar.cash_flow_statement
is_ = morningstar.income_statement
or_ = morningstar.operation_ratios
er = morningstar.earnings_report
v = morningstar.valuation
vr = morningstar.valuation_ratios

class Sector(Sector):
    window_safe = True

def make_factors():
    def Asset_Growth_3M():
        return Returns(inputs=[bs.total_assets], window_length=63)

    def Asset_To_Equity_Ratio():
        return bs.total_assets.latest / bs.common_stock_equity.latest

    def Capex_To_Cashflows():
        return (cfs.capital_expenditure.latest * 4.) / \
            (cfs.free_cash_flow.latest * 4.)

    def EBITDA_Yield():
        return (is_.ebitda.latest * 4.) / \
            USEquityPricing.close.latest

    def EBIT_To_Assets():
        return (is_.ebit.latest * 4.) / \
            bs.total_assets.latest

    def Return_On_Total_Invest_Capital():
        return or_.roic.latest

    class Mean_Reversion_1M(CustomFactor):
        inputs = [Returns(window_length=21)]
        window_length = 252

        def compute(self, today, assets, out, monthly_rets):
            out[:] = (monthly_rets[-1] - np.nanmean(monthly_rets, axis=0)) / \
                np.nanstd(monthly_rets, axis=0)

    class MACD_Signal_10d(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 60

        def compute(self, today, assets, out, close):
            sig_lines = []
            for col in close.T:
                # get signal line only
                try:
                    _, signal_line, _ = talib.MACD(col, fastperiod=12,
                                                   slowperiod=26, signalperiod=10)
                    sig_lines.append(signal_line[-1])
                # if error calculating, return NaN
                except:
                    sig_lines.append(np.nan)
            out[:] = sig_lines

    class Moneyflow_Volume_5d(CustomFactor):
        inputs = [USEquityPricing.close, USEquityPricing.volume]
        window_length = 5

        def compute(self, today, assets, out, close, volume):
            mfvs = []
            for col_c, col_v in zip(close.T, volume.T):
                # denominator
                denominator = np.dot(col_c, col_v)
                # numerator
                numerator = 0.
                for n, price in enumerate(col_c.tolist()):
                    if price > col_c[n - 1]:
                        numerator += price * col_v[n]
                    else:
                        numerator -= price * col_v[n]
                mfvs.append(numerator / denominator)
            out[:] = mfvs

    def Net_Income_Margin():
        return or_.net_margin.latest

    def Operating_Cashflows_To_Assets():
        return (cfs.operating_cash_flow.latest * 4.) / \
            bs.total_assets.latest

    def Price_Momentum_3M():
        return Returns(window_length=63)

    class Price_Oscillator(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 252

        def compute(self, today, assets, out, close):
            four_week_period = close[-20:]
            out[:] = (np.nanmean(four_week_period, axis=0) /
                      np.nanmean(close, axis=0)) - 1.

    def Returns_39W():
        return Returns(window_length=215)

    class Trendline(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 252

        # using MLE for speed
        def compute(self, today, assets, out, close):
            # prepare X matrix (x_is - x_bar)
            X = np.arange(self.window_length)
            X_bar = np.nanmean(X)
            X_vector = X - X_bar
            X_matrix = np.tile(X_vector, (len(close.T), 1)).T

            # prepare Y matrix (y_is - y_bar)
            Y_bar = np.nanmean(close, axis=0)
            Y_bars = np.tile(Y_bar, (self.window_length, 1))
            Y_matrix = close - Y_bars

            # prepare variance of X
            X_var = np.nanvar(X)

            # multiply X matrix and Y matrix and sum (dot product)
            # then divide by variance of X
            # this gives the MLE of Beta
            out[:] = (np.sum((X_matrix * Y_matrix), axis=0) / X_var) / \
                (self.window_length)

    class Vol_3M(CustomFactor):
        inputs = [Returns(window_length=2)]
        window_length = 63

        def compute(self, today, assets, out, rets):
            out[:] = np.nanstd(rets, axis=0)

    def Working_Capital_To_Assets():
        return bs.working_capital.latest / bs.total_assets.latest

    class Momentum(CustomFactor):
        """ Momentum factor """
        inputs = [USEquityPricing.close,
                  Returns(window_length=126)]
        window_length = 252

        def compute(self, today, assets, out, prices, returns):
            out[:] = ((prices[-21] - prices[-252]) / prices[-252] -
                      (prices[-1] - prices[-21]) / prices[-21]) / np.nanstd(returns, axis=0)

    # Commenting out some factors to not run out-of-memory
    all_factors = {
        'Asset Growth 3M': Asset_Growth_3M,
        #'Asset to Equity Ratio': Asset_To_Equity_Ratio,
        #'Capex to Cashflows': Capex_To_Cashflows,
        #'EBIT to Assets': EBIT_To_Assets,
        #'EBITDA Yield': EBITDA_Yield,
        'MACD Signal Line': MACD_Signal_10d,
        'Mean Reversion 1M': Mean_Reversion_1M,
        #'Moneyflow Volume 5D': Moneyflow_Volume_5d,
        'Net Income Margin': Net_Income_Margin,
        #'Operating Cashflows to Assets': Operating_Cashflows_To_Assets,
        'Price Momentum 3M': Price_Momentum_3M,
        'Price Oscillator': Price_Oscillator,
        'Return on Invest Capital': Return_On_Total_Invest_Capital,
        '39 Week Returns': Returns_39W,
        'Trendline': Trendline,
        'Vol 3M': Vol_3M,
    }

    return all_factors

def shift_mask_data(X, Y, upper_percentile=70, lower_percentile=30, n_fwd_days=1):
    # Shift X to match factors at t to returns at t+n_fwd_days (we want to predict future returns after all)
    shifted_X = np.roll(X, n_fwd_days, axis=0)

    # Slice off rolled elements
    X = shifted_X[n_fwd_days:]
    Y = Y[n_fwd_days:]

    n_time, n_stocks, n_factors = X.shape

    # Look for biggest up and down movers
    upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
    lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]

    upper_mask = (Y >= upper)
    lower_mask = (Y <= lower)

    mask = upper_mask | lower_mask  # This also drops nans
    mask = mask.flatten()

    # Only try to predict whether a stock moved up/down relative to other stocks
    Y_binary = np.zeros(n_time * n_stocks)
    Y_binary[upper_mask.flatten()] = 1
    Y_binary[lower_mask.flatten()] = -1

    # Flatten X
    X = X.reshape((n_time * n_stocks, n_factors))

    # Drop stocks that did not move much (i.e. are in the 30th to 70th percentile)
    X = X[mask]
    Y_binary = Y_binary[mask]

    return X, Y_binary

def get_last_values(input_data):
    last_values = []
    for dataset in input_data:
        last_values.append(dataset[-1])
    return np.vstack(last_values).T

# Definition of Machine Learning factor which trains a model and predicts forward returns
class ML(CustomFactor):
    init = False

    def compute(self, today, assets, out, returns, *inputs):
        # inputs is a list of factors, for example, assume we have 2 alpha signals, 3 stocks,
        # and a lookback of 2 days. Each element in the inputs list will be data of
        # one signal, so len(inputs) == 2. Then each element will contain a 2-D array
        # of shape [time x stocks]. For example:
        # inputs[0]:
        # [[1, 3, 2], # factor 1 rankings of day t-1 for 3 stocks
        #  [3, 2, 1]] # factor 1 rankings of day t for 3 stocks
        # inputs[1]:
        # [[2, 3, 1], # factor 2 rankings of day t-1 for 3 stocks
        #  [1, 2, 3]] # factor 2 rankings of day t for 3 stocks

        if (not self.init) or (today.weekday() == 0): # Monday
            # Instantiate sklearn objects
            self.imputer = preprocessing.Imputer()
            self.scaler = preprocessing.MinMaxScaler()
            self.clf = ensemble.AdaBoostClassifier(n_estimators=50)  # classifier choice; any sklearn classifier works here
            #self.clf = ensemble.RandomForestClassifier()
            log.debug('Training classifier...')

            # Stack factor rankings
            X = np.dstack(inputs) # (time, stocks, factors)
            Y = returns # (time, stocks)

            # Shift data to match with future returns and binarize
            # returns based on their percentile rank
            X, Y = shift_mask_data(X, Y, n_fwd_days=PRED_N_FWD_DAYS)

            X = self.imputer.fit_transform(X)
            X = self.scaler.fit_transform(X)

            # Fit the classifier
            self.clf.fit(X, Y)
            #log.debug(self.clf.feature_importances_)
            self.init = True

        # Predict
        # Get most recent factor values (inputs always has the full history)
        last_factor_values = get_last_values(inputs)
        last_factor_values = self.imputer.transform(last_factor_values)
        last_factor_values = self.scaler.transform(last_factor_values)

        # Predict the probability for each stock going up
        # (column 2 of the output of .predict_proba()) and
        # return it via assignment to out.
        out[:] = self.clf.predict_proba(last_factor_values)[:, 1]

def make_ml_pipeline(factors, universe, window_length=21, n_fwd_days=5):
    factors_pipe = OrderedDict()
    # Create returns over last n days.
    factors_pipe['Returns'] = Returns(inputs=[USEquityPricing.open],
                                      mask=universe,
                                      window_length=n_fwd_days + 1)
    # Instantiate ranked factors
    for name, f in factors.iteritems():
        factors_pipe[name] = f().rank(mask=universe)

    # Create our ML pipeline factor. The window_length will control how much
    # lookback the passed in data will have.
    factors_pipe['ML'] = ML(inputs=factors_pipe.values(),
                            window_length=window_length + 1,
                            mask=universe)

    factors_pipe['Sector'] = Sector()

    pipe = Pipeline(screen=universe, columns=factors_pipe)

    return pipe

###########################################################
## Algo definition
def initialize(context):
    """
    Called once at the start of the algorithm.
    """
    # Rebalance according to the configured trade frequency, shortly after open.
    schedule_function(my_rebalance, TRADE_FREQ,
                      time_rules.market_open(minutes=10))

    # Record tracking variables at the end of each day.
    schedule_function(my_record_vars, date_rules.every_day(),
                      time_rules.market_close())

    # Set up universe, alphas and ML pipeline
    context.universe = Q1500US() # & ~IsAnnouncedAcqTarget()
    ml_factors = make_factors()
    ml_pipeline = make_ml_pipeline(ml_factors,
                                   context.universe,
                                   n_fwd_days=PRED_N_FWD_DAYS,
                                   window_length=ML_TRAINING_WINDOW)
    # Create our dynamic stock selector.
    attach_pipeline(ml_pipeline, 'alpha_model')

def before_trading_start(context, data):
    """
    Called every day before market open.
    """
    context.predicted_probs = pipeline_output('alpha_model')['ML']
    context.predicted_probs.index.rename(['date', 'equity'], inplace=True)

    context.risk_factors = pipeline_output('alpha_model')[['Vol 3M', 'Sector']]
    context.risk_factors.index.rename(['date', 'equity'], inplace=True)
    context.risk_factors.Sector = context.risk_factors.Sector.map(Sector.SECTOR_NAMES)

    # These are the securities that we are interested in trading each day.
    context.security_list = context.predicted_probs.index

########################################################
# Portfolio construction

def my_rebalance(context, data):
    """
    Execute orders according to our schedule_function() timing.
    """
    risk_model_factors = context.risk_factors
    risk_model_factors = risk_model_factors.join(context.predicted_probs, how='right').dropna()

    predictions = risk_model_factors.ML

    # Filter out stocks that can not be traded
    predictions = predictions.loc[data.can_trade(predictions.index)]
    # Select top and bottom N stocks
    n_long_short = N_STOCKS_TO_TRADE // 2
    predictions = pd.concat([predictions.nlargest(n_long_short),
                             predictions.nsmallest(n_long_short)])

    todays_universe = predictions.index

    predictions -= 0.5  # predictions are probabilities ranging from 0 to 1

    # Setup Optimization Objective
    objective = opt.MaximizeAlpha(predictions)

    # Setup Optimization Constraints
    constrain_gross_leverage = opt.MaxGrossLeverage(1.0)
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(-.02, .02)
    market_neutral = opt.DollarNeutral()
    # TypeError: cannot do label indexing on <class 'pandas.indexes.base.Index'> with these indexers [nan] of <type 'float'>
    sector_neutral = opt.NetPartitionExposure.with_equal_bounds(
        labels=context.risk_factors.Sector.dropna(),
        min=-0.0001,
        max=0.0001,
    )

    # Run the optimization. This will calculate new portfolio weights and
    # manage moving our portfolio toward the target.
    order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
            market_neutral,
            sector_neutral,
        ],
        universe=todays_universe,
    )

def my_record_vars(context, data):
    """
    Plot variables at the end of each day.
    """
    record(leverage=context.account.leverage,
           num_positions=len(context.portfolio.positions))

def handle_data(context, data):
    """
    Called every minute.
    """
    pass

There was a runtime error.

Grant, your most recent attached backtest looks like it's running over a longer time frame than Thomas's original. Thomas's run is less than 3 years. It appears that your recent attempt was going for 13 years. It's definitely known and understood that Thomas's ML algo is pushing the limits of what the Quantopian platform will do. It's unfortunate, but I'm not surprised, that it would fail when the backtest parameters are expanded.

What I don't yet understand is why your backtest from yesterday failed. That was only 3 years, and I was able to get it to run. I'm still looking into the cause of that failure.

O.K. No big deal. It just seems odd that my example immediately above (Backtest ID: 585ba441c3d625638d593532) crashes due to a memory limit, yet when the backtest is loaded into the research platform, there isn't much being stored. My understanding is that Thomas is doing learning/training on a rolling basis, and that the amount of memory required for that activity should be fixed, not grow with time; but as you suggest, maybe there is effectively a memory leak. If you can look at the memory usage during or after the backtest run, then maybe it could be sorted out (perhaps you have admin-level debug tools?).

@Grant, I bet lots of memory is used by the pipeline and you won't see any trace of that memory usage when loading the backtest into research platform.

@ Luca -

Yes, but if things are running on a rolling basis, with finite trailing windows, then memory usage should be constant over time (aside from the orders/transactions, which, by definition, will grow). Of course, if expanding windows are used, or if ML/pipeline stores a bunch of stuff on every run and doesn't "recycle" the memory, then usage could grow; that would effectively be a memory leak, if the code could be rewritten to provide the same functionality without the growth.
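For a rough sense of scale, the rolling window described here can be costed out directly. The universe size, factor count, and window length below are assumptions for illustration, not numbers measured from the backtester:

```python
def window_bytes(n_days, n_stocks, n_factors, itemsize=8):
    """Footprint of one dense float64 array holding the rolling
    factor window: one value per (day, stock, factor)."""
    return n_days * n_stocks * n_factors * itemsize

# Assumed figures: Q1500US universe, 8 active factors, a 252-day window
mb = window_bytes(252, 1500, 8) / 1e6
print(round(mb, 1))  # 24.2 (MB) for a single copy of the window
```

Even allowing for several copies (the stacked input, the imputed and scaled arrays), this stays far below a ~4 GB budget, which is consistent with the point that a fixed trailing window by itself shouldn't exhaust memory.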

Edit - It is not clear what actually needs to reside in RAM. It would seem that a lot of working memory could be freed up for computing if orders/transactions were written to disk as a backtest runs (unless explicitly stored in context or in some other fashion, as specified in the code).

@Grant

I agree with you, but it could also be that memory usage is very high from the beginning, and as soon as the algorithm stores orders and positions the limit is reached and an error is raised. That would explain the failure without involving a memory leak.

I remember a reply by Scott that explains pipeline memory usage. He was talking about the research platform, but most of the concepts apply to backtesting too. Considering that the algorithm has quite a lot of pipeline factors, unfiltered, I'd guess the memory consumption is very high.

@Luca -

Yeah, the algo could be "on the hairy edge of disaster," and the orders/transactions push it over the edge. Either way, one is flying blind, without even being able to see the gross amount of memory consumed compared to the amount available (~4 GB?). I have confidence that the Q team will eventually fix this; I can't imagine an algo running $10M in capital that could run out of memory unexpectedly. Maybe internally they have a way of flagging which live-trading algos might be at risk?

Another comment that was deemed inappropriate for the thread https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm . I am copying it here, in case anyone is interested:

Hi Thomas,

If I'm reading things correctly, you are applying all factors to the same universe, the Q1500US. There is a more general case, though, of having N factors and M universes, with each factor assigned a universe. That seems straightforward, but then there is a choice of whether the ML is run once, or M+1 times, where the alpha combination step would be run separately on factors grouped by the M universes, and then a final time to combine the results.

The other comment is that it would be nice to be able to combine factors that use minute data (i.e., computed outside pipeline) with pipeline factors. I think you'd said that because the ML is integral to pipeline, this would not be possible, correct? Can the framework be formulated so that the ML runs outside of pipeline (e.g., within before_trading_start as a callable function)?

I'm not sure how important either use case above might end up being, but I figure now's the time to consider them, as you develop the tools.

Hi Grant,

I'm not sure why we would want to work with multiple universes?

Re running ML outside of pipeline: I think that should be possible. Actually, I'm not sure why we moved it into pipeline in the first place, which makes me think that there is probably a reason.

Hello Thomas,

Regarding multiple universes, my thinking is that some factors may work better (or solely) on certain securities. For example, one might have some that work on broad universes of stocks (e.g. Q500US/Q1500US or some such thing), some that work on certain ETFs, some that are IPO-focused, some on stocks that aren't in the Q500US/Q1500US (the subversive/contrarian universe), etc. So although the workflow chart shows a single universe, one could have multiple universes, each with its associated alpha factors. Then (I think) one would want to combine the alpha factors by universe first, and then do a final combination, prior to the portfolio construction step. This would mean that your architecture would need to support this approach (if it makes sense).
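The "combine by universe first, then do a final combination" idea can be sketched with plain pandas on made-up alpha values (the universe labels and numbers here are hypothetical; on Quantopian the grouping would come from the pipeline universes themselves):

```python
import pandas as pd

def zscore(s):
    # normalize within a group so different universes become comparable
    return (s - s.mean()) / s.std(ddof=0)

# Hypothetical per-stock alphas, each tagged with its source universe
alphas = pd.DataFrame({
    'universe': ['Q500US', 'Q500US', 'Q500US', 'ETF', 'ETF', 'ETF'],
    'alpha':    [0.2, 0.5, 0.8, 10.0, 30.0, 50.0],
}, index=['A', 'B', 'C', 'X', 'Y', 'Z'])

# Step 1: combine/normalize within each universe
within = alphas.groupby('universe')['alpha'].transform(zscore)

# Step 2: final combination across universes; the raw alphas were on
# wildly different scales, but the per-universe z-scores line up
final_ranking = within.rank()
```

The point of the two-step scheme shows up in the toy data: stock X's raw alpha (10.0) dwarfs stock C's (0.8), but after per-universe normalization they rank comparably.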

Running ML outside of pipeline would seem to be a lot more flexible. For factors that can be expressed in pipeline, then by all means, implement them using that API. But if one wants to use minute bars for some alpha factors, and your ML module lives in pipeline, there will be no way to feed in alpha factors other than from pipeline (I think... unless I'm missing something about pipeline). That is not to say that you would support running the ML more than daily (within before_trading_start, I guess); it would just have the flexibility of taking in alpha factors both from pipeline and from regular functions/classes.

Hi Grant,

Re multiple universes: I see, yeah, I think that makes sense. One would probably just have an ML classifier pipeline factor for each universe then.

Re ML outside of pipeline: I remembered why it's not possible. A pipeline factor can get historical pipeline values; in before_trading_start() you only get the current factor value. It would be nice to make that history available, though.

Regarding ML/alpha combination outside of pipeline, I'd think that the historical pipeline values could be stored in context, right?
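Storing the history in context is easy to sketch. The helper names below (`record_factor_history`, `history_as_panel`) are hypothetical, and a plain object stands in for the algo's `context`; the real version would be called from before_trading_start with the day's pipeline_output():

```python
import pandas as pd
from collections import deque

MAX_HISTORY = 21  # trailing days to keep (bounded, so memory stays constant)

class Context:  # stand-in for the algo's `context` object
    pass

def record_factor_history(context, todays_factors):
    """Append today's pipeline output (a DataFrame indexed by equity)
    to a bounded deque of daily snapshots stored on context."""
    if not hasattr(context, 'factor_history'):
        context.factor_history = deque(maxlen=MAX_HISTORY)
    context.factor_history.append(todays_factors)

def history_as_panel(context):
    """Stack the stored snapshots into one (day, equity)-indexed frame."""
    return pd.concat(dict(enumerate(context.factor_history)),
                     names=['day', 'equity'])

# Usage with fake pipeline output for two days:
ctx = Context()
for day in range(2):
    snap = pd.DataFrame({'factor1': [0.1 + day, 0.2 + day]},
                        index=['AAPL', 'MSFT'])
    record_factor_history(ctx, snap)
print(history_as_panel(ctx).shape)  # (4, 1)
```

Because the deque is bounded, memory stays constant over time, which is exactly the property being debated above.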

The other approach would be to allow feeding of non-pipeline factors into pipeline, so you could still do the combination step within pipeline.

A simple use case would be to use minutely history to compute a daily VWAP of prices, which would then feed into a non-pipeline technical analysis factor that would return daily values (same as a pipeline factor). You'd combine pipeline factors with non-pipeline factors on a daily basis.
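The VWAP step of that use case is easy to sketch outside pipeline. The minute bars below are synthetic stand-ins for what a minute-frequency data.history call would return during one session:

```python
import numpy as np
import pandas as pd

def daily_vwap(minute_close, minute_volume):
    """Volume-weighted average price over one day of minute bars."""
    return float(np.dot(minute_close, minute_volume) / minute_volume.sum())

# Synthetic stand-in for one session of minute data (390 bars)
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 0.1, 390).cumsum())
volume = pd.Series(rng.integers(1_000, 10_000, 390).astype(float))

vwap = daily_vwap(close, volume)
# A daily factor built on this value is what would then need to be
# combined with pipeline factors at the alpha-combination step.
```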

My sense is that you'll have a much more universal workflow if you can incorporate non-pipeline factors at the alpha combination step, but apparently there are some fundamental technical limitations.

Another note is that you might do a basic head-scratch on sampling/noise. My intuition is that by using daily OHLCV bars in pipeline, which amounts to very limited sampling of individual trades, with no smoothing, you end up with a lot more noise than necessary. If the idea is to trade daily/weekly (or even monthly?), you'll be in the under-sampled regime, trying to make relatively high-frequency decisions on low-frequency, noisy data. But maybe I'm wrong, and there is good evidence that the kind of equity long-short strategies you are looking to support will work just fine on daily OHLCV bars.
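The noise argument has a concrete form: the standard deviation of an average of n independent daily observations falls by a factor of sqrt(n), so a decision made from a single unsmoothed daily bar is roughly 4-5x noisier than one made from a 20-day average. A quick simulation under the idealized assumption of pure iid noise:

```python
import numpy as np

rng = np.random.default_rng(42)
daily = rng.normal(0.0, 1.0, size=(10_000, 20))  # 10,000 trials of 20 noisy days

std_single = daily[:, 0].std()         # noise in one raw daily observation
std_mean20 = daily.mean(axis=1).std()  # noise after 20-day averaging

ratio = std_single / std_mean20        # close to sqrt(20), i.e. ~4.5
```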

Hi Thomas -

Any progress in devising a way to determine if the ML is doing what it is supposed to? As I suggested, I think you have to be able to input synthetic data/factors that have a predictable/ideal output, to see if the ML combines them properly. Also, one could formulate synthetic factors that are purely noise, to verify that the ML isn't somehow over-fitting.
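A minimal version of that synthetic-data test can be sketched with plain sklearn (a generic AdaBoost stands in for whatever classifier the ML factor actually uses; all data here is made up): one factor with genuine predictive power is mixed with pure-noise factors, and the check is whether the trained model concentrates on the real signal.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 2000

signal = rng.normal(size=n)      # synthetic factor with real predictive power
noise = rng.normal(size=(n, 3))  # three pure-noise factors
X = np.column_stack([signal, noise])

# Future "returns" driven by the signal plus some randomness
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# A sane combiner should load almost entirely on the true signal (feature 0)
importances = clf.feature_importances_
```

Running the same check with the signal column replaced by noise should drive out-of-sample accuracy to ~50%, which is the over-fitting sanity check described above.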