DNN and beyond

Starting here as a new thread, the following posts are specifically on the topic of DNNs and related forms of ML. They are copied from another thread (Risk Model) in which this discussion developed, but it was a bit far off the original thread topic, so we moved here. Cheers, best wishes, from TonyM.

56 responses

COPIED from post in "Risk Model Example" thread.

Tony Morland
Yesterday

I just watched Delaney's webinar. It does provide an excellent overview of the workflow associated with quant strategy development. Many thanks to you for that @Delaney; a great webinar! Although it was not specifically focused on the risk model side of algo development, Delaney made several points that gave me some additional insights.

1) The first key point for me was Delaney's comment that, upon careful analysis, new algo models are often found to give very similar results to other pre-existing models, either of risk or of alpha, and that, as he said: "anything that is already in a risk model is probably NOT alpha".

Now in that context, I think I can start to make some sense of Q's use of RSI with default value 14 in relation to mean-reversion (MR) risk. RSI-14 is probably one of the most over-used and worn-out indicators employed by people looking for reversals. As Delaney says, in seeking alpha, what we need to do is find something new and innovative, not just a re-hash of old, over-worked ideas. The implication is that the more we re-work old ideas, the more risk we are implicitly taking and the less benefit we are likely to find. In that sense, the greater the similarity between an algo's results and those of a pure RSI-14 strategy, the lower the potential reward and the greater the inherent risk. So, if the statement "Q uses RSI-14 for mean reversion risk" actually means that the Q risk model calculates the correlation between an algo's returns and those of a pure RSI-14 strategy, and then equates some function of that correlation to risk, then I think it makes good sense. However, my question then is: is this actually what Q's risk model is doing with regard to MR risk? @Delaney, @Rene, please can you clarify?
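Tony's reading of "RSI-14 as mean-reversion risk" can be sketched numerically. The snippet below is only an illustration of that interpretation, not Q's actual risk model: it computes a simple-average RSI and then measures the correlation between an algo's daily returns and those of a pure RSI-14 strategy. The `rsi` and `mr_exposure` names are my own.

```python
import numpy as np

def rsi(prices, period=14):
    """Simple-average RSI over a price series (NaN until `period` bars)."""
    prices = np.asarray(prices, dtype=float)
    deltas = np.diff(prices)
    out = np.full(len(prices), np.nan)
    for i in range(period, len(prices)):
        window = deltas[i - period:i]
        gains = window[window > 0].sum()
        losses = -window[window < 0].sum()
        if losses == 0:
            out[i] = 100.0           # all gains in the window
        else:
            rs = gains / losses
            out[i] = 100.0 - 100.0 / (1.0 + rs)
    return out

def mr_exposure(algo_returns, rsi14_strategy_returns):
    """Correlation of an algo's daily returns with a pure RSI-14 strategy's
    returns -- one plausible reading of 'mean-reversion risk'."""
    return np.corrcoef(algo_returns, rsi14_strategy_returns)[0, 1]
```

On this reading, an algo whose returns correlate near 1.0 with the RSI-14 strategy would carry maximal MR "risk" and, by Delaney's argument, little novel alpha.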

2) Another nice idea Delaney mentioned is the concept of a "totally risk aware" algo that could dynamically adjust its risk constraints on the fly, in accordance with current market conditions. I think it is a very neat idea, but before attempting to implement it on top of Q's risk model, we still have a long way to go and need a much better understanding of Q's Risk Model itself. So we come back to: please Q, give us a lot more detail & clarity about the Risk Model!!

COPIED from post in "Risk Model Example" thread. Please note: the material below is by JAMES

Gravatar avatar
James Villa
Yesterday

@Grant, as I have not yet watched Delaney's webinar, I can only go by your description above regarding a multivariate regression of a linear model. Programmatically, this is achieved with the Optimize API, where the objective function is to maximize alpha (factors, or combinations of factors, that create alpha or excess returns) subject to the 8 constraints they laid out, which are different risk measures designed to mitigate risk through diversification and risk dispersal. The optimization algo does this serially in a loop as it adjusts the weights of your trading universe to achieve the model's optimal alpha.

Personally, my weapon of choice is deep learning, a multi-layered neural network that does something like a nonlinear multivariate regression. Unfortunately, Q doesn't allow AI libraries like Keras, TensorFlow, or Theano, to name a few, that do this type of algo. I guess it is because it would take up a lot of their computing resources. They do cater to some basic machine learning algos such as Random Forest and SVR, which I tried, but I get constant computational timeouts because of long training periods and/or too many factors, which renders them useless.

COPIED from post in "Risk Model Example" thread. Please note: the material below is by KARL

Karl
16 hours ago

Agree @Tony: mechanisms that are "totally risk aware", dynamically adjusting their risk constraints, and that could be appended to the Optimize API would be great, in particular to constrain exposures to negative returns attributable to specific sectors point-in-time.

As for RSI-measured "risk", I consider that to be a matter of definition by degree of magnitude in the reversal. Any technical trader knows that using RSI on its own is a terrible risk and a complete folly. But that does not negate its usefulness in combination with other technical indicators.

Therein lies what Delaney was referring to as new & innovative "alpha": if the strategy is able to leverage insights into existing regimes to extract more value in the construct, I would transcend the technical indicators, RSI notwithstanding, to get new metrics to signalise. It's another way of saying: yes, the jet engine is ubiquitous, but see how the Sabre can fly a Skylon in one single stage to space!

COPIED from post in "Risk Model Example" thread. Please note: the material below is by TONY

Tony Morland
15 hours ago

@Karl,
Your statement: "... insights into existing regimes ..... would transcend the technical indicators..." is, I believe, essential for taking trading to a much higher level, and the key is that little word "regimes". Personally I believe that understanding the market regime, and in particular seeking the answer to the question: "What regime is the market in right now? -- i.e. trending, mean reverting, or something else?" is THE most important question in trading. If you have that answer, then you know the way to trade, and the choice of "indicators" or even "entry signals" becomes of minor importance. If the market is trending, then trade in the direction of the trend using any kind of trend-following (TF) technique; if the market is mean-reverting, then trade that using an MR technique; and if you are not sure whether the market is TF or MR, then stand aside. What always amazes me is how little thought & effort most people seem to put into the question: "What regime is the market in now?"
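One common quantitative stab at Tony's question, offered here only as a sketch (the estimator, the thresholds, and the `hurst`/`regime` names are my own choices, not anything from the thread), is the Hurst exponent: roughly, H > 0.5 suggests a trending series, H < 0.5 a mean-reverting one, and H near 0.5 a random walk.

```python
import numpy as np

def hurst(prices, max_lag=20):
    """Estimate the Hurst exponent from the scaling of lagged differences.
    For fractional-Brownian-like series, std(p[t+lag] - p[t]) ~ lag**H,
    so H is the slope of log(std) versus log(lag)."""
    prices = np.asarray(prices, dtype=float)
    lags = range(2, max_lag)
    tau = [np.std(prices[lag:] - prices[:-lag]) for lag in lags]
    return np.polyfit(np.log(list(lags)), np.log(tau), 1)[0]

def regime(prices, band=0.1):
    """Crude regime classifier: a band around H = 0.5 is left 'unclear'."""
    h = hurst(prices)
    if h > 0.5 + band:
        return 'trending'
    if h < 0.5 - band:
        return 'mean-reverting'
    return 'unclear'
```

The "stand aside when unsure" rule Tony describes maps naturally onto the 'unclear' band.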

@James,
You write: "my weapon of choice is deep learning, a multi-layered neural network". That may indeed be a great choice, and I wonder why you selected it. I did a lot of work with NNs years ago, and the results were not as good as I had hoped, but I know the technology has come a long way since then. However, most times when I see people using any sort of ML for trading, I end up thinking their efforts are mis-directed: using great tools but applying them to the wrong questions, such as generating trading signals, and then the results turn out disappointing. I'm not currently using any sort of ML, but I am very interested in knowing what specific questions intelligent traders would seek to answer with the new generation of NNs.

COPIED from post in "Risk Model Example" thread. Please note: the material below is by KARL

Karl
14 hours ago

@Tony: I think "regime" is a metaphor for order and system that one can postulate as probable premises :) I actually meant the regime of technical indicators that are used to visualise and describe "risks".

Let's take short-term-reversal as a "risk" by RSI indicator, in mitigation I can think of several measures:

1) Optimize the constituents in the portfolio with respect to each one's individual exposure to the market, i.e. not to curtail alpha generation/aggregation but to seek an optimised outcome in position sizing and execution that is market specific, allowing opt.order_optimal_portfolio() to perform as and when the portfolio interacts with the live market. No loss in alpha, possible casualty in exhaust.

ps: Dan Whitnable posted an example of the measure here.

2) Impose a risk exposure mask a priori in Pipeline to filter out defined risks, i.e. a CustomFactor by risk parameters. Possible loss in alpha, no new constraint to optimise.

ps: Grant Kiehne posted an example of the measure here.

3) Modify process workflow to adapt, or better to improve performance metrics. No loss in alpha, no casualty in exhaust, compliant risk metrics.

I am sure our colleagues will have more to add here.. indeed, as Q are working to provide more tools for 1 and 2, I am adapting the code for 3 to study intraday and medium-term trades.

COPIED from post in "Risk Model Example" thread. Please note: the material below is by JAMES

James Villa
14 hours ago

@Tony,
I was an early adopter of plain vanilla neural networks for financial prediction, way back in the early 1990s, with limited success. My first portfolio using a NN made a cool 55% in the first year, only to start decaying in the second year, when I made an executive decision to stop trading it before all the profits were gone. The culprit, as you've touched upon with Karl, is regime change. And you hit the nail on the head: being able to identify what regime the market is in would be the game changer. My solution at the time was to retrain the NN after confirmation of the regime change. I was not totally satisfied with this stop-and-go approach.

Fast forward to the present: many key developments in AI have come about, together with increased computing power. One of these was Prof. Hinton's discovery that stacking more hidden layers in the NN architecture does a better job of deciphering nonlinear relationships among the input variables, and thus gives better prediction accuracy. This is now termed "deep learning" and is very effective for stationary data like image recognition, computer vision and other such spatial domains. Non-stationary data such as financial time series remains elusive. However, a new neuron architecture called Long Short-Term Memory (LSTM) looks very promising for time series analysis and non-stationary data, because it incorporates a kind of memory, making it possible to capture long-term dependencies in the temporal space.
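To make the "memory" James describes concrete, here is a minimal single-step LSTM forward pass in plain numpy, a from-scratch sketch rather than Keras/TensorFlow code; the toy dimensions, random weights, and all names are my own. The cell state `c` is what carries information across time steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked input/forget/output/candidate
    parameters; c is the persistent 'memory' updated multiplicatively."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b      # stacked pre-activations, shape (4n,)
    i = sigmoid(z[0:n])             # input gate: how much new info to write
    f = sigmoid(z[n:2*n])           # forget gate: how much old memory to keep
    o = sigmoid(z[2*n:3*n])         # output gate: how much memory to expose
    g = np.tanh(z[3*n:4*n])         # candidate memory content
    c = f * c_prev + i * g          # memory update
    h = o * np.tanh(c)              # hidden output
    return h, c

# Run a toy sequence of 1-d "returns" through a 3-unit LSTM.
rng = np.random.default_rng(0)
n_hidden, n_in = 3, 1
W = rng.standard_normal((4 * n_hidden, n_in)) * 0.1
U = rng.standard_normal((4 * n_hidden, n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for r in [0.01, -0.02, 0.005, 0.03]:
    h, c = lstm_step(np.array([r]), h, c, W, U, b)
```

Because the forget gate can stay near 1, `c` can preserve information over many steps, which is the long-term-dependency property James is counting on for non-stationary series.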

I attack financial prediction as a spatio-temporal problem, covering both the space and time domains, which is why my only inputs are variations of prices. I also believe in the principles of Chaos Theory as a possible solution. A deep learning architecture with LSTM neurons seems to me like the right tool/fit for my hypothesis on the dynamics of price evolution, and I am hoping that it also captures regime changes and adapts accordingly. The same questions still hold: how accurate are your predictions on out-of-sample data, is the model able to adapt to regime changes, and are your results consistent across training, validation and out-of-sample tests? Hope this helps!

COPIED from post in "Risk Model Example" thread. Please note: the material below is by JAMES

James Villa
11 hours ago

@Karl, I'm curious as to what inputs and target variables you use in your DNN for feature engineering? Q's current infrastructure does not allow for any meaningful application of DNNs, as they eat up a lot of memory and computational resources. I have asked Q before, to no avail, to allow offline implementation of DNNs or other computationally intensive methods, with the results then uploaded to the Q platform for further processing. The problem, I think, is that you cannot download their data for offline use.

COPIED from post in "Risk Model Example" thread

Tony Morland
52 minutes ago
@James, as we share some similarity of background in terms of NN application & interest, and as you & I and @Karl obviously share an interest in where modern DNN/ML might usefully go from here, I think this deserves a separate thread. Do I have your permission to copy your/our last few posts on this topic and kick off a new thread with them?

COPIED from post in "Risk Model Example" thread. Please note: the material below is by JAMES
James Villa
47 minutes ago

@Tony, yes by all means. Really needs a new thread, as this might be off topic here. Thanks.

COPIED from post in "Risk Model Example" thread. Please note: the material below is by JAMES
James Villa
11 hours ago

@Karl, very interesting, thanks!

Hi @James, @Karl: James writes: "...believe in the principles of Chaos Theory as a possible solution. Deep Learning architecture with LSTM neurons seems to me like the right tool". I'm certainly aligned with you in "believing in the principles" of Chaos Theory. I'm sure you know that Benoit Mandelbrot was a great follower of markets as well as being the "father of fractals", and I like his book "The (Mis-)Behavior of Markets", which you have no doubt also read. However, the problem I kept finding with Chaos Theory and fractals in financial markets was that, although they seem to offer a lot of potential, I always found it difficult to work out how to actually convert that potential into reality. Then the same thing happened with NNs and other forms of ML, and also with fractals in geology in the oil industry (where I worked in a "previous life"), where all of these went from being "interesting" to becoming "mainstream" tools. Perhaps the same has happened in the trading world since I last looked carefully at this topic. Any comments?

In case you guys missed it, Q has a kind of play into ML with:

It is important to note:

Quantopian started collecting this dataset live on March 6, 2017. Why this matters: https://www.quantopian.com/posts/quantopian-partner-data-how-is-it-collected-processed-and-surfaced

So, as of 12/10/2017, there are 279 calendar days of out-of-sample data.

Hi Tony, first thank you for starting this thread.

I'm glad you mentioned Mandelbrot, because he was one of the first prominent mathematicians to conclude, based on his extensive research on the price movements of cotton, that financial time series do not follow a normal (Gaussian) distribution but rather something more like the Pareto-Levy stable distribution. You may have noticed that I have been harping on this in my other posts. However, even today most investment professionals are still stuck with the notion that financial prices are normally distributed, and compute performance metrics and risk measures on this premise. Case in point: the calculation of volatility as a measure of risk, computed as the standard deviation of daily percent changes, annualized. As empirical evidence already shows that prices are not normally distributed, this is like fitting a square into a circle. To come closer to reality, it should be computed using daily log differences, not daily percent changes. This gives what is called a log-normal distribution with fat tails, which is closer to the Pareto-Levy distribution.
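James's volatility point can be shown in a few lines. This is only a sketch of the two conventions he contrasts (the function name and the sample prices are mine): for small daily moves the two are nearly identical, and log returns have the added property of being additive over time.

```python
import numpy as np

def annualized_vol(prices, use_log=True, periods=252):
    """Annualized volatility from daily closes. use_log=True follows James's
    recommendation (daily log differences); use_log=False is the common
    percent-change convention he criticizes."""
    prices = np.asarray(prices, dtype=float)
    if use_log:
        rets = np.diff(np.log(prices))          # log returns: additive over time
    else:
        rets = np.diff(prices) / prices[:-1]    # simple percent changes
    return np.std(rets, ddof=1) * np.sqrt(periods)
```

For a ~1% daily move, log(1 + r) differs from r by roughly r^2/2, which is why the two measures agree closely in quiet markets and diverge precisely when the fat tails James cares about show up.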

That being said, I stopped using technical indicators and other data transformations that are derived from the normal distribution assumption, the mean / variance kind. Ooops, Karl may disagree! The beauty of DNN is that it does not assume what the distribution is, it inherently extracts it! DNN is quite a powerful tool but one has to understand the underpinnings of it as it has a lot of moving parts to tweak to get meaningful results. As always, the choice of right inputs, target variables and pre-processing them the right way are very important.

@James,
Yes indeed, Pareto-Levy is reality, but it's a lot harder to work with than the Normal, so academics use the Normal and then try to convince reality to be like their theory!!

Somehow to me this seems reminiscent of Buffett's comments about academics & the EMH: They observed that markets are OFTEN efficient, then concluded that they are ALWAYS efficient, then tried to force everyone to believe in a theory that assumes markets are always efficient. "And if that was the case then I would be a bum on the street with a tin cup" says Buffett, with his usual style of humor.

Your final comment: " the choice of right inputs, target variables and pre-processing them the right way are very important" is absolutely true and misunderstanding this was the cause of almost all the problems that I saw with the (mis-)application of NNs in the industry where I formerly worked, where people often just threw data in and said "no need for pre-processing, the NN will take care of it", which of course it didn't.

The other part of this, in addition to the right inputs, the right pre-processing etc., is, as you say, the right target variables or, as I would put it, "asking the right questions and in the right way". Asking a NN "what will be tomorrow's (or next week's) stock price?" is not a good question, as I'm sure we both know, although that's what a lot of people seem to do, and it is why a lot of people are disappointed with NNs, and ML in general.

You can possibly already guess what my own idea of "the one very best question" would be, and I'm interested to know what would be yours.

This whole discussion around ML would be much more interesting if the Q team would lay out a vision for how it might (eventually) be implemented properly on the Q platform. I gather that Thomas Wiecki is the lead. @Thomas, any thoughts? I realize that Q tends to operate in a kind of start-up stealth mode in many ways, and also needs to hyper-focus on the tasks at hand, so I'm not expecting you to bare your soul. But some rough plans would help put things in context. I would also be curious whether you could de-couple from the paradigm of breadth, with on-demand computing, toward depth, with old-school submit-a-compute-job-and-wait batch processing (think punch cards and print-outs, to be picked up from the computing center). It seems that if Q continues on the current path, with the goal of a million users being able to do what 160,000 are doing today, it will be all about breadth, with little depth. Doing both would be feasible, I think. Any thoughts?

@Grant, does Q have out of sample performance data on precog_top_100 & 500? Curious to find out.

@Tony, actually my target variable is a pre-processed next-day log difference, as I do regression, as opposed to Karl, who does classification. So basically I'm asking the NN to predict whether tomorrow will be an up day or a down day: > 0 or < 0, which is just direction. Directional accuracy is an important aspect of profitable trading, especially the crossover part (i.e. from up to down and vice versa).
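That target and the accuracy measure are just a little arithmetic, sketched below (the helper names are mine); the label for day t is the sign of the day t to t+1 log difference.

```python
import numpy as np

def directional_target(prices):
    """Target as James describes: the next day's log difference, reduced to
    direction (+1 up day, -1 down day). target[t] describes the move into t+1."""
    logp = np.log(np.asarray(prices, dtype=float))
    return np.sign(np.diff(logp))

def directional_accuracy(predicted, actual):
    """Fraction of days on which the predicted direction matches the realized one."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.sign(predicted) == np.sign(actual))
```

With a regression output, any positive prediction counts as "up" and any negative one as "down", which is how a continuous model gets scored on direction.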

@Grant, I do wonder if Q has any vision for ML at all. They do have provisions for some ML algos via sklearn, like Random Forest, AdaBoost and SVR, but even these, which are less computationally intensive than neural networks, cannot be implemented properly because they often result in a compute timeout error. I read the three-part ML demo and I giggled when people asked why the results are not profitable. The basic mistake is that they are training with a year or less of data. I don't know if this is by design or just what their computing resources allow. Implementing ML the right way would require a lot of computing power being allocated to the users, and that could be very expensive.

@ James -

As I understand, Q stored all of the PreCog data prior to March 6, 2017 and thus it cannot be revised by Alpha Vertex. Q then appends the data point-in-time to preclude any look-ahead bias. Thus, starting on March 6, 2017, we have out-of-sample raw data (of course, if it is used in an un-raw form, then bias can be injected after March 6, 2017).

I am not aware of Q doing any out-of-sample analysis, but users have (e.g. https://www.quantopian.com/posts/alpha-vertex-precog-test). You can straightforwardly run your own analysis using Alphalens, etc.

The Alpha Vertex team has been incommunicado on the Q forum, which I find revealing. They were all chatty when the data set was released, but now that there is decent out-of-sample data, they have had nothing to say. It would be nice if they would comment on the in-sample versus out-of-sample performance, vis-a-vis their model and training process.

@Grant, man, I thought I would be the only one around here old enough to actually remember those punched-card days at uni! I wish I had kept a few of them as antiques ;-)

@James, I understand clearly the difference, both conceptually and computationally, between treating the problem as regression or as classification, although it appears that eventually you are using a regression to drive an up/down (i.e. directional classification) decision for trading. I have a friend who works with ML (mostly SVM-type stuff) and occasionally we help each other a bit with ideas. One thing we found to be useful, both computationally and from a practical trading perspective, was to re-frame the question just slightly. Instead of asking whether tomorrow is up or down, it can work better to ask two separate questions: 1) Is tomorrow UP AND greater than some small threshold, below which taking the trade would probably not be profitable anyway? And 2) Is tomorrow DOWN AND of magnitude greater than the threshold? Doing it this way helps the NN/ML computationally by not having it waste effort agonizing (so to speak) over uncertain but very small price moves that are not practically tradable anyway. Cheers, best wishes.
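Tony's two-question re-framing amounts to a three-way labelling of next-day returns. A minimal sketch (the function name and the 0.2% default threshold are my own illustrative choices, not values from the thread):

```python
import numpy as np

def threshold_labels(next_day_returns, threshold=0.002):
    """Tony's re-framing: 1 = up AND beyond a tradable threshold,
    -1 = down AND beyond the threshold, 0 = too small to be worth trading."""
    r = np.asarray(next_day_returns, dtype=float)
    labels = np.zeros(len(r), dtype=int)
    labels[r > threshold] = 1
    labels[r < -threshold] = -1
    return labels
```

The 0-labelled days are exactly the small, uncertain moves the model is no longer asked to agonize over; in live use the threshold would plausibly be set from transaction costs.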

@ James -

For research, I don't see the hardware expense as the issue (although to do it for production algos, running real money, it could get messy, since everything might need to run in parallel, quickly, just before market open). On a research basis, my understanding is that even deep learning can be done on a desktop PC, with a GPU. We aren't talking about exotic, supercomputer-type hardware; it is commodity stuff.

The thing is, I get the impression that there are a lot of folks who would pay a flat monthly fee and/or a per-minute compute fee to be able to do ML for real on Quantopian. It is a mystery why the Q response has been so lukewarm. Maybe they just need to focus on other things, for the time being...

@ Tony -

I actually just missed the punch-card era (but my father had to use them, when I was a kid).

@Grant, it looks like the out-of-sample tests run by Aqua Rooster show that it is a big disappointment. It could be an overfit or, as I suspect, data snooping (lookahead bias) in preprocessing the datasets. This is a common rookie mistake in ML: one takes the whole dataset and preprocesses it before splitting into training, validation and test sets. That creates the lookahead bias.
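The rookie mistake James describes, and its fix, in sketch form (the names are mine): fit any scaling statistics on the training slice only, then apply those same statistics to the held-out data. Normalizing the full dataset first leaks test-set means and standard deviations into training.

```python
import numpy as np

def split_then_scale(X, train_frac=0.7):
    """Correct order: split first, then fit the scaler on the training slice
    only. The lookahead-biased version computes mu/sigma over all of X
    before splitting."""
    X = np.asarray(X, dtype=float)
    n_train = int(len(X) * train_frac)
    train, test = X[:n_train], X[n_train:]
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma, mu, sigma
```

The biased version looks almost identical, `(X - X.mean(axis=0)) / X.std(axis=0)` followed by the split, which is why the mistake is so easy to make.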

I'm a fan of examples, in case someone has some really simple ones in mind that could describe ML and DNN, how each could look here in a backtest, and the ways the two would differ. I'm interested in how you picture them in this framework specifically (maybe using SMA or something), rather than in general discussions of the concepts by academics elsewhere.

users being able to do what 160,000 are doing today

So everyone who logs in on Quantopian and tries it out is still writing algorithms; impressive. Except I always thought that, particularly in relation to the stock market, success is directly proportional to our ability to be courageously frank with ourselves about what is really happening. Meh, probably just quant humor.

@Tony, yes, your target variable questions are more in tune with reality considering transaction costs, some small upward movement can be a loss.

@Grant, I ran my DNN on an i7 laptop without a GPU; it works fine, but I have the luxury of time. In Q research, though, you will often get a timeout error; I guess their maximum allocation is 5 minutes. Q is probably not focused on this yet, but like you said, if there is enough demand it could be a revenue stream for them too.

@Blue Seahawk, it is impossible to show an example of a DNN on the Q platform because they do not allow the use of libraries like Keras or TensorFlow that do this kind of algo.

I was just hoping for a description in words, say, if Keras were available, by way of example.

Hi @Blue, welcome to the conversation.
Just to pick up on your: " ...maybe using SMA or something) rather than general discussions of the concepts by academics..." and to mix a bit of humor with speculation as to why Q might be lukewarm about ML in general (i suspect they remain unconvinced that it is actually worthwhile) here is a true story that might, i hope, cause a smile.

Thanks Tony! Yeah I'd love to be able to follow the discussion better, just that I draw a blank. I've been stuck with needing examples to learn all my life. I'm guessing it has to do with keeping track of the reasons for each stock's position from back in time and how those worked out and then automatically making adjustments, not sure.

How can I paste a picture (png file) on this board? I want to show the results of a DNN model.

@Grant, thanks! I'll try it.

Here's an example of what the PROFIT results of a DNN model could look like.
The model is designed to predict SP500 index returns the next day.
Model: DNN 12 layers deep

TRAINING:

VALIDATION:

OUT OF SAMPLE TEST:

As you can see, the model was able to train and validate very well in both bull and bear markets. The out-of-sample test is not bad either. My problem here is high drawdowns. I surmise that since my target variable was next-day returns only, the model learned to focus on taking profits without considering drawdowns or risks. It may also have simply learned that in order to get high profits you have to take a considerable amount of risk. There is still a lot of room for improvement.

Hi @James, you write: "I surmised that since my target variable was next day returns only, the model only learned to focus on taking profits without considering drawdowns or risks", and no doubt what you say there is exactly the case. One of the big learnings for me with (the old-style) NNs, in addition to the importance of pre-processing data, was that I got exactly what I asked for (or at least as close to it as the ML tool could manage), and sometimes changing the "question" (i.e. the target) slightly made a lot of difference in terms of accuracy, and therefore usefulness for practical trading. Maybe a better target might be profit / DD risk? Maybe the best way to do that is with 2 separate NNs, one to predict profit and the other to predict DD, and then combine the results?
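Tony's suggested profit / DD target can be sketched as below. This is only one way to define it (the function names and the definition of drawdown as the largest peak-to-trough fraction are my own choices); the thread leaves the exact formulation open.

```python
import numpy as np

def max_drawdown(equity):
    """Largest peak-to-trough drop of an equity curve, as a fraction of the peak."""
    equity = np.asarray(equity, dtype=float)
    peaks = np.maximum.accumulate(equity)
    return np.max((peaks - equity) / peaks)

def profit_over_drawdown(equity):
    """A candidate training target that rewards profit but penalizes drawdown."""
    equity = np.asarray(equity, dtype=float)
    total_return = equity[-1] / equity[0] - 1.0
    dd = max_drawdown(equity)
    return total_return / dd if dd > 0 else np.inf
```

Under this definition, the two-network variant Tony floats would have one NN predicting the numerator and another the denominator, combined at decision time.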

Honestly, I don't know how much effort to invest in looking at ML for trading-signal generation. Maybe the new DNNs are way better than anything I even imagined, but actually I'm a lot more hopeful about the possibility of new-generation ML (of whatever type) for other applications in "trading assistance" rather than for generating trading signals themselves.

Hi Tony,

You read my mind, making the target profit/DD is now my focus . Will update you on the progress.

@Karl, the prediction signal is executed as a buy when greater than 0, a sell when less than 0, and a hold when the sign of the new signal equals the sign of the current one. Confidence level is not used. The model is designed for directional accuracy of predictions.
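That execution rule is simple enough to sketch directly (the function name is mine, and the handling of an exactly-zero prediction is my own assumption, since the post does not cover it):

```python
def signals_to_actions(predictions):
    """Map a sequence of prediction values to buy/sell/hold actions:
    buy when the sign turns positive, sell when it turns negative,
    hold while the sign is unchanged."""
    actions, prev_sign = [], 0
    for p in predictions:
        sign = 1 if p > 0 else (-1 if p < 0 else 0)
        if sign == prev_sign or sign == 0:
            actions.append('hold')   # zero prediction treated as hold (assumption)
        elif sign > 0:
            actions.append('buy')
        else:
            actions.append('sell')
        if sign != 0:
            prev_sign = sign
    return actions
```

A confidence-weighted version, as discussed just below, would scale the position by the magnitude of the prediction instead of using only its sign.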

@Karl, basically the execution process does not account for any of these; thus, it is naive. Probably a more sophisticated execution algo, one that takes statistics of the training and validation signals to determine the strength and confidence level of each signal, would churn out better results on out-of-sample data.

Another thing I want to test is to run these out-of-sample signals in the Q research platform using a pipeline of USTradableStocks and rolling correlations of returns with the SP500: say, take the top 10% quantile of positively correlated and the bottom 10% of negatively correlated stocks as long and short bets when the signal is a buy, and conversely when the signal is a sell, and run these through the Optimize API subject to their constraints. I just found out about fetch_csv yesterday and I hope it will make this test possible.

Just updating this thread. After experimenting with how to fit the SPY DNN predictions (the out-of-sample results of the examples above) into the Q framework here, this approach shows the results below. The out-of-sample predictions of this example have a directional accuracy of only 56%. The settings of this model:
alpha = 21 day stock beta
maximum exposure = 1
min/max position size = 1%
Optimize = alpha * sign(DNN prediction)
Constraints: leverage = 1, position exposure =1%
Commissions = 0.001 per share
New Slippage Model

# Note: the original post's code was incomplete (missing indentation, an
# unfinished fetch_csv call, and undefined names). The version below fills
# those gaps; assumptions are flagged in comments.

import quantopian.algorithm as algo
import quantopian.optimize as opt

from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import AverageDollarVolume, SimpleBeta
from quantopian.pipeline.filters import Q1500US  # assumption: base_universe was undefined in the original

# Algorithm Parameters
# --------------------
UNIVERSE_SIZE = 1000
LIQUIDITY_LOOKBACK_LENGTH = 100

MAX_GROSS_LEVERAGE = 1.0
MAX_SHORT_POSITION_SIZE = 0.01  # 1%
MAX_LONG_POSITION_SIZE = 0.01   # 1%


def initialize(context):
    set_slippage(slippage.FixedBasisPointsSlippage())

    # Fetch the SP500 DNN predictions. The CSV URL was not included in the
    # original post; replace the placeholder with the real file location.
    # The file is assumed to have 'Date' and 'Predicted' columns.
    fetch_csv(
        'URL_OF_DNN_PREDICTIONS_CSV',  # placeholder
        date_column='Date',
        date_format='%m/%d/%y',
    )
    context.stock = symbol('SPY')

    # Universe Selection
    # ------------------
    base_universe = Q1500US()  # assumption; see note above

    # Each week, take the top UNIVERSE_SIZE stocks by average dollar volume.
    weekly_top_volume = (
        AverageDollarVolume(window_length=LIQUIDITY_LOOKBACK_LENGTH)
        .top(UNIVERSE_SIZE, mask=base_universe)
        .downsample('week_start')
    )
    # &-ing with the base universe is necessary because the top-volume filter
    # is computed at the start of each week, and an asset might fall out of
    # the base universe during that week.
    universe = weekly_top_volume & base_universe

    # Market Beta Factor
    # ------------------
    stock_beta = SimpleBeta(
        target=sid(8554),      # SPY
        regression_length=21,  # alpha = 21-day stock beta, per the settings above
    )

    # Alpha Generation
    # ----------------
    # The original snippet had `alpha = b.top(100)` with `b` undefined;
    # `stock_beta` is assumed here, with the top-100 filter used as the screen.
    alpha = stock_beta
    screen = stock_beta.top(100, mask=universe)

    # Create and register a pipeline computing our alpha for every stock
    # in our universe. We'll use these values in the optimization below.
    pipe = Pipeline(
        columns={'alpha': alpha},
        screen=screen,
    )
    algo.attach_pipeline(pipe, 'pipe')

    # Schedule portfolio construction to run every day, ten minutes after
    # market open (the time_rule was missing in the original; assumed here).
    algo.schedule_function(
        do_portfolio_construction,
        date_rule=algo.date_rules.every_day(),
        time_rule=algo.time_rules.market_open(minutes=10),
        half_days=False,
    )

    schedule_function(
        func=record_vars,
        date_rule=date_rules.every_day(),
        time_rule=time_rules.market_close(),
        half_days=True,
    )


def before_trading_start(context, data):
    # Call pipeline_output here so that pipeline computations happen in the
    # 5-minute timeout of before_trading_start instead of the 1-minute
    # timeout of handle_data / scheduled functions.
    context.pipeline_data = algo.pipeline_output('pipe')
    context.Predict = data.current(context.stock, 'Predicted')


# Portfolio Construction
# ----------------------
def do_portfolio_construction(context, data):
    pipeline_data = context.pipeline_data
    perf = context.Predict

    # Objective
    # ---------
    # Use beta scaled by the DNN prediction as a naive alpha coefficient and
    # maximize it. (Multiplying by the raw prediction gives the same optimal
    # portfolio as multiplying by sign(prediction), since a uniform positive
    # rescaling of alpha does not change the solution.)
    #
    # This is a **very** naive model. A more sophisticated model would apply
    # re-scaling here to generate meaningful predictions of future returns.
    objective = opt.MaximizeAlpha(pipeline_data.alpha * perf)

    # Constraints
    # -----------
    # Constrain gross leverage to 1.0 or less: the absolute value of our long
    # and short positions should not exceed the value of our portfolio.
    constrain_gross_leverage = opt.MaxGrossExposure(MAX_GROSS_LEVERAGE)

    # Constrain individual position size to no more than a fixed percentage
    # of our portfolio.
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(
        -MAX_SHORT_POSITION_SIZE,
        MAX_LONG_POSITION_SIZE,
    )

    # Run the optimization: calculate new portfolio weights and move the
    # portfolio toward the target.
    algo.order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
        ],
    )


def record_vars(context, data):
    """Called at the end of each day; plots leverage and position counts."""
    longs = shorts = 0
    for position in context.portfolio.positions.itervalues():
        if position.amount > 0:
            longs += 1
        elif position.amount < 0:
            shorts += 1

    # Even in minute mode, only the end-of-day leverage is plotted.
    record(leverage=context.account.leverage, long_count=longs, short_count=shorts)


def handle_data(context, data):
    pass
record(Predict = data.current(context.stock, 'Predicted'))

Improved the above with alpha based on 63-day stock beta:

import pandas as pd
import numpy as np
import quantopian.algorithm as algo
import quantopian.optimize as opt

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import builtin, Fundamentals, psychsignal
from quantopian.pipeline.factors import AverageDollarVolume, SimpleBeta, Returns, Factor
from quantopian.pipeline.factors.fundamentals import MarketCap
from quantopian.pipeline.classifiers.fundamentals import Sector
from quantopian.pipeline.factors import CustomFactor, RollingPearsonOfReturns, RollingSpearmanOfReturns

# Algorithm Parameters
# --------------------
UNIVERSE_SIZE = 1000
LIQUIDITY_LOOKBACK_LENGTH = 100

MAX_GROSS_LEVERAGE = 1.0
MAX_SHORT_POSITION_SIZE = 0.01  # 1%
MAX_LONG_POSITION_SIZE = 0.01   # 1%

def vectorized_beta(spy, assets):
    """Calculate beta between every column of assets and spy.

    Parameters
    ----------
    spy : np.array
        An (n x 1) array of returns for SPY.
    assets : np.array
        An (n x m) array of returns for m assets.
    """
    assert len(spy.shape) == 2 and spy.shape[1] == 1, "Expected a column vector for spy."

    asset_residuals = assets - assets.mean(axis=0)
    spy_residuals = spy - spy.mean()

    covariances = (asset_residuals * spy_residuals).sum(axis=0)
    spy_variance = (spy_residuals ** 2).sum()
    return covariances / spy_variance

daily_returns = Returns(window_length=2)
daily_log_returns = daily_returns.log1p()
SPY_asset = sid(8554)  # symbols('SPY')

class MyBeta(CustomFactor):
    # Daily returns for every asset, plus the daily returns for just SPY
    # as a column vector (via factor slicing).
    inputs = [daily_log_returns, daily_log_returns[SPY_asset]]
    # Window length of 63 trading days (roughly 3 months; the original
    # comment said "2 years", which did not match the value).
    window_length = 63

    def compute(self, today, assets, out, all_returns, spy_returns):
        out[:] = vectorized_beta(spy_returns, all_returns)

def initialize(context):
    set_slippage(slippage.FixedBasisPointsSlippage())
    # Fetch SP500 DNN predictions. (The fetch_csv call was truncated in the
    # original post; the CSV URL is not shown.)
    fetch_csv('...',  # elided URL from the original
              date_column='Date',
              date_format='%m/%d/%y')
    context.stock = symbol('SPY')

    # Universe Selection
    # ------------------
    # Each week, take the top UNIVERSE_SIZE stocks by average dollar volume.
    monthly_top_volume = (
        AverageDollarVolume(window_length=LIQUIDITY_LOOKBACK_LENGTH)
        .downsample('week_start')
    )
    # The final universe is the top-volume filter &-ed with the original base
    # universe, because the top-volume universe is only recalculated
    # periodically and an asset might fall out of the base universe meanwhile.
    # NOTE: base_universe is never defined in this listing, and `universe` is
    # never applied to the pipeline below.
    universe = monthly_top_volume & base_universe

    # Market Beta Factor
    # ------------------
    stock_beta = MyBeta()  # previously SimpleBeta(target=sid(8554), regression_length=21)

    # Alpha Generation / Combination
    # ------------------------------
    # Earlier versions combined z-scores of fundamental value measures:
    # combined_alpha = (fcf_zscore + yield_zscore + sentiment_zscore).rank().demean()
    # Here we simply take the 100 highest-beta stocks. (The original read
    # `b.top(100)`; `b` is presumably the beta factor above.)
    alpha = stock_beta.top(100)

    # Create and register a pipeline computing our alpha for every stock in
    # our universe. We'll use these values in our optimization below.
    pipe = Pipeline(
        columns={
            'alpha': alpha,
        },
        # alpha will be NaN for all stocks not in our universe.
        screen=alpha,
    )
    algo.attach_pipeline(pipe, 'pipe')

    # Schedule 'do_portfolio_construction' to run every full trading day.
    algo.schedule_function(
        do_portfolio_construction,
        date_rule=algo.date_rules.every_day(),
        half_days=False,
    )

    schedule_function(func=record_vars,
                      date_rule=date_rules.every_day(),
                      time_rule=time_rules.market_close(),
                      half_days=True)

# Call pipeline_output in before_trading_start so that pipeline computations
# happen in the 5 minute timeout of BTS instead of the 1 minute timeout of
# handle_data/scheduled functions. (The `def` line was missing from the
# original listing.)
def before_trading_start(context, data):
    context.pipeline_data = algo.pipeline_output('pipe')
    context.Predict = data.current(context.stock, 'Predicted')

# Portfolio Construction
# ----------------------
def do_portfolio_construction(context, data):
    pipeline_data = context.pipeline_data
    perf = context.Predict

    # Objective
    # ---------
    # For our objective, we simply use our naive ranks as an alpha coefficient
    # and try to maximize that alpha.
    #
    # This is a **very** naive model. Since our alphas are so widely spread out,
    # we should expect to always allocate the maximum amount of long/short
    # capital to assets with high/low ranks.
    #
    # A more sophisticated model would apply some re-scaling here to try to
    # generate more meaningful predictions of future returns.
    objective = opt.MaximizeAlpha(pipeline_data.alpha * perf)

    # Constraints
    # -----------
    # Constrain our gross leverage to 1.0 or less. This means that the absolute
    # value of our long and short positions should not exceed the value of our
    # portfolio.
    constrain_gross_leverage = opt.MaxGrossExposure(MAX_GROSS_LEVERAGE)

    # Constrain individual position size to no more than a fixed percentage of
    # our portfolio. Because our alphas are so widely distributed, we should
    # expect to end up hitting this max for every stock in our universe.
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(
        -MAX_SHORT_POSITION_SIZE,
        MAX_LONG_POSITION_SIZE,
    )

    # Run the optimization. This will calculate new portfolio weights and
    # manage moving our portfolio toward the target.
    algo.order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
        ],
    )

def record_vars(context, data):
    """Called at the end of each day; plots certain variables."""
    # Count our long and short positions.
    longs = shorts = 0
    for position in context.portfolio.positions.itervalues():
        if position.amount > 0:
            longs += 1
        if position.amount < 0:
            shorts += 1

    # Record and plot the portfolio's leverage over time, along with the
    # number of long and short positions. Even in minute mode, only the
    # end-of-day leverage is plotted.
    record(leverage=context.account.leverage, long_count=longs, short_count=shorts)

def handle_data(context, data):
    record(Predict=data.current(context.stock, 'Predicted'))

It gets better, and beats the market, with alpha based on 126-day stock beta:

import pandas as pd
import numpy as np
import quantopian.algorithm as algo
import quantopian.optimize as opt

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import builtin, Fundamentals, psychsignal
from quantopian.pipeline.factors import AverageDollarVolume, SimpleBeta, Returns, Factor
from quantopian.pipeline.factors.fundamentals import MarketCap
from quantopian.pipeline.classifiers.fundamentals import Sector
from quantopian.pipeline.factors import CustomFactor, RollingPearsonOfReturns, RollingSpearmanOfReturns

# Algorithm Parameters
# --------------------
UNIVERSE_SIZE = 1000
LIQUIDITY_LOOKBACK_LENGTH = 100

MAX_GROSS_LEVERAGE = 1.0
MAX_SHORT_POSITION_SIZE = 0.01  # 1%
MAX_LONG_POSITION_SIZE = 0.01   # 1%

def vectorized_beta(spy, assets):
    """Calculate beta between every column of assets and spy.

    Parameters
    ----------
    spy : np.array
        An (n x 1) array of returns for SPY.
    assets : np.array
        An (n x m) array of returns for m assets.
    """
    assert len(spy.shape) == 2 and spy.shape[1] == 1, "Expected a column vector for spy."

    asset_residuals = assets - assets.mean(axis=0)
    spy_residuals = spy - spy.mean()

    covariances = (asset_residuals * spy_residuals).sum(axis=0)
    spy_variance = (spy_residuals ** 2).sum()
    return covariances / spy_variance

daily_returns = Returns(window_length=2)
daily_log_returns = daily_returns.log1p()
SPY_asset = sid(8554)  # symbols('SPY')

class MyBeta(CustomFactor):
    # Daily returns for every asset, plus the daily returns for just SPY
    # as a column vector (via factor slicing).
    inputs = [daily_log_returns, daily_log_returns[SPY_asset]]
    # Window length of 126 trading days (roughly 6 months; the original
    # comment said "2 years", which did not match the value).
    window_length = 126

    def compute(self, today, assets, out, all_returns, spy_returns):
        out[:] = vectorized_beta(spy_returns, all_returns)

def initialize(context):
    set_slippage(slippage.FixedBasisPointsSlippage())
    # Fetch SP500 DNN predictions. (The fetch_csv call was truncated in the
    # original post; the CSV URL is not shown.)
    fetch_csv('...',  # elided URL from the original
              date_column='Date',
              date_format='%m/%d/%y')
    context.stock = symbol('SPY')

    # Universe Selection
    # ------------------
    # Each week, take the top UNIVERSE_SIZE stocks by average dollar volume.
    monthly_top_volume = (
        AverageDollarVolume(window_length=LIQUIDITY_LOOKBACK_LENGTH)
        .downsample('week_start')
    )
    # The final universe is the top-volume filter &-ed with the original base
    # universe, because the top-volume universe is only recalculated
    # periodically and an asset might fall out of the base universe meanwhile.
    # NOTE: base_universe is never defined in this listing, and `universe` is
    # never applied to the pipeline below.
    universe = monthly_top_volume & base_universe

    # Market Beta Factor
    # ------------------
    stock_beta = MyBeta()  # previously SimpleBeta(target=sid(8554), regression_length=21)

    # Alpha Generation / Combination
    # ------------------------------
    # Earlier versions combined z-scores of fundamental value measures:
    # combined_alpha = (fcf_zscore + yield_zscore + sentiment_zscore).rank().demean()
    # Here we simply take the 100 highest-beta stocks. (The original read
    # `b.top(100)`; `b` is presumably the beta factor above.)
    alpha = stock_beta.top(100)

    # Create and register a pipeline computing our alpha for every stock in
    # our universe. We'll use these values in our optimization below.
    pipe = Pipeline(
        columns={
            'alpha': alpha,
        },
        # alpha will be NaN for all stocks not in our universe.
        screen=alpha,
    )
    algo.attach_pipeline(pipe, 'pipe')

    # Schedule 'do_portfolio_construction' to run every full trading day.
    algo.schedule_function(
        do_portfolio_construction,
        date_rule=algo.date_rules.every_day(),
        half_days=False,
    )

    schedule_function(func=record_vars,
                      date_rule=date_rules.every_day(),
                      time_rule=time_rules.market_close(),
                      half_days=True)

# Call pipeline_output in before_trading_start so that pipeline computations
# happen in the 5 minute timeout of BTS instead of the 1 minute timeout of
# handle_data/scheduled functions. (The `def` line was missing from the
# original listing.)
def before_trading_start(context, data):
    context.pipeline_data = algo.pipeline_output('pipe')
    context.Predict = data.current(context.stock, 'Predicted')

# Portfolio Construction
# ----------------------
def do_portfolio_construction(context, data):
    pipeline_data = context.pipeline_data
    perf = context.Predict

    # Objective
    # ---------
    # For our objective, we simply use our naive ranks as an alpha coefficient
    # and try to maximize that alpha.
    #
    # This is a **very** naive model. Since our alphas are so widely spread out,
    # we should expect to always allocate the maximum amount of long/short
    # capital to assets with high/low ranks.
    #
    # A more sophisticated model would apply some re-scaling here to try to
    # generate more meaningful predictions of future returns.
    objective = opt.MaximizeAlpha(pipeline_data.alpha * perf)

    # Constraints
    # -----------
    # Constrain our gross leverage to 1.0 or less. This means that the absolute
    # value of our long and short positions should not exceed the value of our
    # portfolio.
    constrain_gross_leverage = opt.MaxGrossExposure(MAX_GROSS_LEVERAGE)

    # Constrain individual position size to no more than a fixed percentage of
    # our portfolio. Because our alphas are so widely distributed, we should
    # expect to end up hitting this max for every stock in our universe.
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(
        -MAX_SHORT_POSITION_SIZE,
        MAX_LONG_POSITION_SIZE,
    )

    # Run the optimization. This will calculate new portfolio weights and
    # manage moving our portfolio toward the target.
    algo.order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
        ],
    )

def record_vars(context, data):
    """Called at the end of each day; plots certain variables."""
    # Count our long and short positions.
    longs = shorts = 0
    for position in context.portfolio.positions.itervalues():
        if position.amount > 0:
            longs += 1
        if position.amount < 0:
            shorts += 1

    # Record and plot the portfolio's leverage over time, along with the
    # number of long and short positions. Even in minute mode, only the
    # end-of-day leverage is plotted.
    record(leverage=context.account.leverage, long_count=longs, short_count=shorts)

def handle_data(context, data):
    record(Predict=data.current(context.stock, 'Predicted'))

@Blue, thanks for the detailed trade info. I didn't quite understand your comments except for the part showing many partial fills. So what do you think is the bottom line here? My takeaway is that even with only 56% directional accuracy from the DNN model on SPY, trading the top 100 stocks ranked by 126-day beta to SPY can be profitable and mirror market returns, but with big drawdowns, and is therefore highly risky. The hope is that improved DNN prediction accuracy, to say 66%, would translate into higher returns and lower drawdowns.

@Karl, I tried your pipeline settings but they made things worse; see below. I would like to see your DNN/ML model with this fetcher template. From the way you described it, it sounds more sophisticated than my naive approach. Can you please post it here?

import pandas as pd
import numpy as np
import quantopian.algorithm as algo
import quantopian.optimize as opt

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import builtin, Fundamentals, psychsignal
from quantopian.pipeline.factors import AverageDollarVolume, SimpleBeta, Returns, Factor
from quantopian.pipeline.factors.fundamentals import MarketCap
from quantopian.pipeline.classifiers.fundamentals import Sector
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import CustomFactor, RollingPearsonOfReturns, RollingSpearmanOfReturns
from quantopian.pipeline.filters import QTradableStocksUS  # used below; missing from the original listing

# Algorithm Parameters
# --------------------
UNIVERSE_SIZE = 1000
LIQUIDITY_LOOKBACK_LENGTH = 100

MAX_GROSS_LEVERAGE = 1.0
MAX_SHORT_POSITION_SIZE = 0.01  # 1%
MAX_LONG_POSITION_SIZE = 0.01   # 1%

class AvgDailyDollarVolumeTraded(CustomFactor):
    # Average daily dollar volume. (These two lines were orphaned in the
    # original post; the class statement is restored from the call site below.)
    inputs = [USEquityPricing.close, USEquityPricing.volume]

    def compute(self, today, assets, out, close_price, volume):
        out[:] = np.nanmean(close_price * volume, axis=0)

def vectorized_beta(spy, assets):
    """Calculate beta between every column of assets and spy.

    Parameters
    ----------
    spy : np.array
        An (n x 1) array of returns for SPY.
    assets : np.array
        An (n x m) array of returns for m assets.
    """
    assert len(spy.shape) == 2 and spy.shape[1] == 1, "Expected a column vector for spy."

    asset_residuals = assets - assets.mean(axis=0)
    spy_residuals = spy - spy.mean()

    covariances = (asset_residuals * spy_residuals).sum(axis=0)
    spy_variance = (spy_residuals ** 2).sum()
    return covariances / spy_variance

daily_returns = Returns(window_length=2)
daily_log_returns = daily_returns.log1p()
SPY_asset = sid(8554)  # symbols('SPY')

class MyBeta(CustomFactor):
    # Daily returns for every asset, plus the daily returns for just SPY
    # as a column vector (via factor slicing).
    inputs = [daily_log_returns, daily_log_returns[SPY_asset]]
    # Window length of 126 trading days (roughly 6 months; the original
    # comment said "2 years", which did not match the value).
    window_length = 126

    def compute(self, today, assets, out, all_returns, spy_returns):
        out[:] = vectorized_beta(spy_returns, all_returns)

def initialize(context):
    set_slippage(slippage.FixedBasisPointsSlippage())
    # Fetch SP500 DNN predictions. (The fetch_csv call was truncated in the
    # original post; the CSV URL is not shown.)
    fetch_csv('...',  # elided URL from the original
              date_column='Date',
              date_format='%m/%d/%y')
    context.stock = symbol('SPY')

    '''
    # Universe Selection (previous version)
    # -------------------------------------
    # Each week, take the top UNIVERSE_SIZE stocks by average dollar volume.
    monthly_top_volume = (
        AverageDollarVolume(window_length=LIQUIDITY_LOOKBACK_LENGTH)
        .downsample('week_start')
    )
    universe = monthly_top_volume & base_universe
    '''
    # Market capitalisation: 5th-95th percentile (~2 sigma).
    marketCap = Fundamentals.market_cap.latest.percentile_between(5, 95)  # Next try (8, 98)
    isLiquid = AvgDailyDollarVolumeTraded(window_length=21, mask=marketCap) > 5e6  # Next try > 10e6

    universe = QTradableStocksUS() & marketCap & isLiquid
    # NOTE: as in the earlier listings, `universe` is never actually applied
    # to the pipeline below.

    # Market Beta Factor
    # ------------------
    stock_beta = MyBeta()  # previously SimpleBeta(target=sid(8554), regression_length=21)

    # Alpha Generation / Combination
    # ------------------------------
    # Earlier versions combined z-scores of fundamental value measures:
    # combined_alpha = (fcf_zscore + yield_zscore + sentiment_zscore).rank().demean()
    # Here we simply take the 100 highest-beta stocks. (The original read
    # `b.top(100)`; `b` is presumably the beta factor above.)
    alpha = stock_beta.top(100)

    # Create and register a pipeline computing our alpha for every stock in
    # our universe. We'll use these values in our optimization below.
    pipe = Pipeline(
        columns={
            'alpha': alpha,
        },
        # alpha will be NaN for all stocks not in our universe.
        screen=alpha,
    )
    algo.attach_pipeline(pipe, 'pipe')

    # Schedule 'do_portfolio_construction' to run every full trading day.
    algo.schedule_function(
        do_portfolio_construction,
        date_rule=algo.date_rules.every_day(),
        half_days=False,
    )

    schedule_function(func=record_vars,
                      date_rule=date_rules.every_day(),
                      time_rule=time_rules.market_close(),
                      half_days=True)

# Call pipeline_output in before_trading_start so that pipeline computations
# happen in the 5 minute timeout of BTS instead of the 1 minute timeout of
# handle_data/scheduled functions. (The `def` line was missing from the
# original listing.)
def before_trading_start(context, data):
    context.pipeline_data = algo.pipeline_output('pipe')
    context.Predict = data.current(context.stock, 'Predicted')

# Portfolio Construction
# ----------------------
def do_portfolio_construction(context, data):
    pipeline_data = context.pipeline_data
    perf = context.Predict

    # Objective
    # ---------
    # For our objective, we simply use our naive ranks as an alpha coefficient
    # and try to maximize that alpha.
    #
    # This is a **very** naive model. Since our alphas are so widely spread out,
    # we should expect to always allocate the maximum amount of long/short
    # capital to assets with high/low ranks.
    #
    # A more sophisticated model would apply some re-scaling here to try to
    # generate more meaningful predictions of future returns.
    objective = opt.MaximizeAlpha(pipeline_data.alpha * perf)

    # Constraints
    # -----------
    # Constrain our gross leverage to 1.0 or less. This means that the absolute
    # value of our long and short positions should not exceed the value of our
    # portfolio.
    constrain_gross_leverage = opt.MaxGrossExposure(MAX_GROSS_LEVERAGE)

    # Constrain individual position size to no more than a fixed percentage of
    # our portfolio. Because our alphas are so widely distributed, we should
    # expect to end up hitting this max for every stock in our universe.
    constrain_pos_size = opt.PositionConcentration.with_equal_bounds(
        -MAX_SHORT_POSITION_SIZE,
        MAX_LONG_POSITION_SIZE,
    )

    # Run the optimization. This will calculate new portfolio weights and
    # manage moving our portfolio toward the target.
    algo.order_optimal_portfolio(
        objective=objective,
        constraints=[
            constrain_gross_leverage,
            constrain_pos_size,
        ],
    )

def record_vars(context, data):
    """Called at the end of each day; plots certain variables."""
    # Count our long and short positions.
    longs = shorts = 0
    for position in context.portfolio.positions.itervalues():
        if position.amount > 0:
            longs += 1
        if position.amount < 0:
            shorts += 1

    # Record and plot the portfolio's leverage over time, along with the
    # number of long and short positions. Even in minute mode, only the
    # end-of-day leverage is plotted.
    record(leverage=context.account.leverage, long_count=longs, short_count=shorts)

def handle_data(context, data):
    record(Predict=data.current(context.stock, 'Predicted'))

@Karl, I noticed you have deleted some of your posts; have you found the holy grail? I remember you describing your ML classifier (OK, not a DNN) with confidence-level tracking, maybe your golden goose. I'd like to see that under this fetcher template, if you don't mind sharing without giving away your secret sauce :)

Anyway, the problem with partial fills was not insufficient liquidity in the stocks but rather position concentration across only 100 stocks, and it was solved by upping the number of tradeable stocks to 500.

@James I apologize if I'm being obvious with these comments. Tweaking parameters to maximize total return is a form of overfitting, and as such you're not likely to see the same boost out-of-sample. In addition, when an algorithm performs flat for the vast majority of 5 years and all its gains come from a two-or-three-month spurt, then the stats are pretty meaningless as far as expected future performance is concerned. That brief period of positive returns is a statistical outlier. What the backtest tells me is that you could expect the algo to perform more-or-less flat for years on end. There's no reason to believe that a returns-generating event will repeat, since it only happens once in the backtest. Ideally you want the backtest to show consistent, even returns.

@Viridian, yes, what you are saying is generally correct. But if you read the full context of this thread, the point of this exercise is a deep neural network (DNN) that predicts SPY, and how to apply it within the Q framework. The DNN was trained on 15 years of data, validated on 4 years, and has 6 years of out-of-sample predictions. I was trying to see how those out-of-sample predictions would translate if traded within the Q framework, where hundreds of stocks are traded. I chose to map the SPY prediction onto individual stocks via their beta to SPY, with the beta window length as the parameter to tweak to see which setting brings the most return. Is that overfitting? It's comparable to determining which lookback period for a moving average gives the best returns.
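For what it's worth, the window-length comparison I describe can be sketched offline with plain NumPy. The helper names below are hypothetical, but the beta calculation mirrors the `vectorized_beta` function in the algo listings, evaluated over several trailing windows so the resulting rankings can be compared:

```python
import numpy as np

def beta_vs_market(market, assets):
    """Beta of each asset column against a market-return column vector,
    computed as cov(asset, market) / var(market)."""
    a = assets - assets.mean(axis=0)
    m = market - market.mean()
    return (a * m).sum(axis=0) / (m ** 2).sum()

def sweep_beta_windows(market, assets, windows=(21, 63, 126)):
    """Compute betas over the trailing `w` rows for each candidate window.
    market: (n x 1) returns, assets: (n x m) returns; n >= max(windows)."""
    return {w: beta_vs_market(market[-w:], assets[-w:]) for w in windows}
```

Comparing rank correlations of the resulting betas across windows, rather than picking the single window with the best backtest return, would be one way to argue the choice is not pure overfitting.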

Thanks Karl for the preview. Looks fantastic! Good luck with that.

OK, well, my main point is that the out-of-sample returns aren't consistent enough to confirm that the model is indeed predictive. That could perhaps be explained by the difference in volatility between the training period and the out-of-sample period, and perhaps the Brexit-announcement freakout gave it market conditions it understood better.

I'm curious, what's the hypothesis behind a strategy like this? From my own observations, short-term stock movements are quite often reactions to news events (however, which events move the market and which do not continues to baffle me) or waves from large funds moving their money around (e.g. rebalancing). While there are likely some underlying momentum and mean reversion forces an AI can help predict, I would think there are too many wrenches being thrown into the gears (e.g. randomness of news events) throughout the day, every day for them to be significant.

Also worthwhile is to consider how much tail risk a strategy takes on.

@Viridian, I think you're reading too much into the example above. I already know the OOS results aren't consistent enough to confirm significant predictive power; that was not the purpose of the exercise, as I said above.

While your observations about the factors that drive short-term price movements are valid, there are many, many others. So how do you quantify all these contributing factors in a predictive model? One way is to assume that all of them are already reflected in the price, based on all the information market participants have at each point in time, including news events. Under this assumption, the inputs to the model are simply prices, or transformations of them, nothing more. DNN/AI is just another modeling tool among many. Its lure is the ability to generalize the nonlinear interrelationships of its input variables into something with predictive capability; put simply, learning from data through training, validation and out-of-sample testing. The hypothesis is also simple: if the model is tuned to one-day prediction and achieves predictive accuracy that is statistically significant and consistent across the board (training, validation and out-of-sample), then you trade according to the prediction, and if everything holds you should be profitable.
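On "statistically significant accuracy": an exact binomial test against a fair-coin null makes the sample-size question concrete. The sketch below uses illustrative counts (they are not taken from my backtests): a 56% hit rate over a 6-year daily out-of-sample period is highly significant, while the same hit rate over 100 days is not.

```python
from fractions import Fraction
from math import comb

def binom_sf(k, n):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, 1/2):
    the chance a 50/50 coin gets at least k directional calls right in n days."""
    total = sum(comb(n, i) for i in range(k, n + 1))
    return float(Fraction(total, 2 ** n))

# Illustrative only: ~1500 trading days (~6 years) at a 56% hit rate.
p_long_sample = binom_sf(840, 1500)   # 840 correct calls out of 1500
p_short_sample = binom_sf(56, 100)    # same hit rate over only 100 days
```

By a normal-approximation ballpark, the 840/1500 case gives a p-value on the order of 1e-6, while 56/100 gives roughly 0.14, i.e. indistinguishable from luck; consistency over a long out-of-sample window is what does the work.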

Just updating this old thread with the new Self-Serve Data capability. In the examples above, I used fetch_csv to upload my OOS DNN predictive signals. This update uses Self-Serve Data instead. The main difference, and improvement, is that fetch_csv output cannot be used in a pipeline, while Self-Serve Data can! In this notebook, the uploaded OOS DNN predictive signals are combined with other alpha factors in a pipeline and processed by another ML algo within the Q framework, with stock selection and the Optimize API including contest constraints, which it passed given compute timeout limitations. I was only able to evaluate around 100-120 stock-selection candidates; beyond that I get the dreaded timeout error. Here's the notebook:

Notebook previews are currently unavailable.
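For anyone trying the same fetch_csv-to-Self-Serve migration: the upload wants a long-format file with a primary date column, a symbol column, and one or more value columns. A minimal pandas sketch of preparing such a file (the column names and values here are just illustrative, not from my actual dataset):

```python
import pandas as pd

# Hypothetical signal file: one row per (date, symbol) pair.
signals = pd.DataFrame({
    'date':   ['2018-01-02', '2018-01-02', '2018-01-03', '2018-01-03'],
    'symbol': ['SPY', 'AAPL', 'SPY', 'AAPL'],
    'signal': [1.0, -0.5, -1.0, 0.75],
})
signals['date'] = pd.to_datetime(signals['date'])

# Write without the index so the header is exactly date,symbol,signal.
signals.to_csv('dnn_signals.csv', index=False)
```

Once the dataset is uploaded and symbol-mapped, the value column becomes a pipeline column you can screen, rank, and combine with other factors, which is exactly the capability Fetcher lacked.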

@james would you consider posting the updated version of your algo that uses Self-Serve Data? I'm working to better understand how to transition an algo away from Fetcher, and another example would be extremely helpful. If not, no problem at all!

@kay: I just posted an example in another thread where I uploaded a Fetcher .csv file as a Self-Serve dataset. I'm not sure if it's the exact same as what James is doing, but you might find it helpful.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Hi Folks -

Wondering if anyone has thought about the mechanics of using the new Self-Serve Data API for computing a factor offline (e.g. using machine learning, ML)?

Some considerations:

--Point-in-time list of the QTradableStocksUS.
--Point-in-time symbol matching with Quantopian database (i.e. ability to check that the offline XYZ is the same company as the Quantopian XYZ).

All of the data would need to be available N minutes/hours prior to the daily cut-off for upload of the factor, via the Self-Serve Data API (presumably, the cut-off time is the same for both the contest and for an algo that receives an allocation).

I saw an example by James Villa of an offline ML computation on SPY, which is an easy use case; I'm wondering how to generalize the approach.

In my mind, offline factor computation may be the leading use case for the new Self-Serve Data API, but maybe I'm missing something. If the idea is for users to upload data (the example provided is campaign contributions), the data is likely to be public-domain and free/inexpensive, so why wouldn't Quantopian just add it to their online data store? Maybe the idea is that some users are already paying for data sets for other work, and the licensing would allow upload to Quantopian as well? I just don't see users developing alternative data sources from the ground up just for use on Quantopian (e.g. creating a personal version of 13-D filings from public records, so as not to have to pay for them on Quantopian).

Any thoughts on ML and the new Quantopian Self-Serve Data API?

Hi Grant,

Here's some of my thoughts on your queries.

--Point-in-time list of the QTradableStocksUS.
--Point-in-time symbol matching with Quantopian database (i.e. ability to check that the offline XYZ is the same company as the Quantopian XYZ).

The Self-Serve Data mechanism will automatically do this for you, except, I guess, for the ability to check that the offline XYZ is the same company as the Quantopian XYZ. You will only be doing yourself a disservice if the stock symbols you specify in your uploaded data file do not correspond to Q's symbol conventions.
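As a quick offline sanity check before uploading, one could diff the symbol column of the data file against a reference list exported from Quantopian (e.g. logged from a pipeline run). Both symbol sets below are fabricated placeholders:

```python
# Symbols found in the offline data file (placeholder values).
offline_symbols = {"AAA", "BBB", "CCC", "DDD"}

# Symbols exported from Quantopian, e.g. logged from a pipeline run
# (placeholder values).
quantopian_symbols = {"AAA", "BBB", "CCC", "EEE"}

# Rows carrying these symbols would silently fail to match on upload.
unmatched = sorted(offline_symbols - quantopian_symbols)
print(unmatched)  # -> ['DDD']
```

This doesn't solve the "same ticker, different company" problem, but it at least catches symbols that won't map at all.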

Several sites offer free historical OHLCV data: Yahoo Finance and Quandl, to name a few. For company fundamentals, there are probably free sources too, though automatically scraping that data on a daily basis would be tedious (it can be done). There are also probably low-cost subscriptions for both.

an offline ML computation on SPY, which is an easy use case; I'm wondering how to generalize the approach.

I take it that what you mean by a generalized approach is being able to do an offline ML prediction on hundreds of stocks simultaneously, which can then be uploaded via Self-Serve Data as inputs to a Q pipeline for portfolio construction. Again, this could be done, but one has to have data and compute infrastructure capable of processing huge amounts of data in a limited timeframe. The benefit we get from Q is free access to clean OHLCV and fundamental data from reputable data vendors; the downside is that we cannot work on those data offline.
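As a toy illustration of that generalization, the loop below scores a handful of symbols from fabricated price histories and builds (symbol, score) rows of the kind one would write out for a Self-Serve upload. The trailing-return "model" is a stand-in; a real version would call an actual ML model's prediction here:

```python
# Fabricated per-symbol price histories standing in for real OHLCV data.
price_history = {
    "AAA": [10.0, 10.5, 11.0, 10.8, 11.2],
    "BBB": [50.0, 49.0, 48.5, 48.0, 47.5],
    "CCC": [20.0, 20.0, 20.1, 20.2, 20.4],
}

def score(prices):
    # Placeholder "model": trailing return over the window. A real
    # pipeline would call something like model.predict(features) here.
    return round(prices[-1] / prices[0] - 1.0, 4)

# One (symbol, score) row per name, ready to be written out as the
# value column of an upload file.
rows = [(sym, score(p)) for sym, p in sorted(price_history.items())]
print(rows)  # -> [('AAA', 0.12), ('BBB', -0.05), ('CCC', 0.02)]
```

The structure is embarrassingly parallel across symbols, which is exactly why the binding constraint is data access and compute, not the algorithmic shape.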

My approach to SPY prediction, using purely data transformations of OHLC price data processed by a deep neural network, is a shortcut or workaround for the lack of the capabilities/resources mentioned above.
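For flavor, the transformations feeding such a network might look something like the scale-free features below. This is my guess at the general idea (normalize away the price level so the net sees comparable inputs), not James's actual feature set:

```python
import math

# One fabricated OHLC bar, plus the prior close.
prev_close = 100.0
o, h, l, c = 101.0, 103.0, 100.5, 102.0

# Scale-free transformations of the bar -- the kind of inputs a DNN
# digests better than raw price levels.
features = {
    "log_return": math.log(c / prev_close),  # close-to-close move
    "range_pct": (h - l) / prev_close,       # bar range vs. prior close
    "close_in_range": (c - l) / (h - l),     # where the close sat in the bar
}
print(features)
```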

A good example of a generalized approach with a specific factor ("news sentiment") is Accern's; I guess, as a company and data provider, they have both the resources and the infrastructure. That said, a news-sentiment dataset could also be built by scraping data from, say, Google Trends and processing it through NLP and/or DNN/ML, but that takes a lot of work.

Hope this helps.

I'd be interested in hearing from the ML folks on the need for a point-in-time QTU. My general impression is that one is not operating on each symbol independently and assigning it a score; if one were, uploading everything and then selecting only QTU symbols would be o.k. But to do a proper job with a model that looks across symbols, one really needs to limit the universe to the QTU prior to applying the offline ML, right?

Grant -- I don't think there's currently any way to get the current day's QTU, use it for offline processing, and then upload your data before market starts, though I'm sure it'd be useful for your purposes if Quantopian published a QTU feed for the next day after close the day before. However, the churn on QTU is quite low, so if you have a QTU that's from the day before, or even a few days old for that matter, that's probably good enough to narrow down your universe for your ML processing.
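Viridian's low-churn point can be made concrete with a little set arithmetic; the two universes below are tiny fabricated stand-ins for consecutive days' QTU snapshots:

```python
# Fabricated stand-ins for two consecutive days' QTU membership.
qtu_yesterday = {"AAA", "BBB", "CCC", "DDD", "EEE"}
qtu_today = {"AAA", "BBB", "CCC", "DDD", "FFF"}

union = qtu_yesterday | qtu_today

# Fraction of names that changed (symmetric difference over union).
churn = len(qtu_yesterday ^ qtu_today) / len(union)

# Fraction shared between the two days (Jaccard overlap).
overlap = len(qtu_yesterday & qtu_today) / len(union)
print(churn, overlap)
```

If the measured overlap on real snapshots stays near 1.0 day-over-day, a slightly stale QTU should indeed be good enough to pre-filter the offline universe.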

@ Viridian Hawk -

So where did the QTU file come from (http://www.mediafire.com/file/9xh0r5m8ab0y0rd/qtu2018-06-23.csv)? It goes all the way back to 1/3/2003? Has it been checked against the online QTU?

One thought: I gather that, aside from the Eventvestor data, if one had access to fundamentals data, the QTU could perhaps be re-constructed offline. Then, for the Eventvestor component, it could be checked daily (although I suppose the issue is that the terms of use would not allow any kind of automated "scraping," so the check would need to be manual... hmmf!).

Oops, I just read the TOS, and it's pretty explicit that creating a derivative work based on somebody else's idea is against the rules. (Talk about broad!) So I'm going to delete the CSV file unless I hear otherwise that it's ok. I figured it would be because the purpose is to use the list for an offline dataset to use on Quantopian, but who knows. But basically it's the log output from this algorithm.

import quantopian.algorithm as algo
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.pipeline.factors import SimpleBeta


def initialize(context):
    algo.attach_pipeline(make_pipeline(), 'pipeline')
    context.last_QTU = []


def make_pipeline():
    beta = SimpleBeta(target=symbol('SPY'), regression_length=252)
    return Pipeline(
        columns={'beta': beta},
        # Restrict to the QTU (the import was otherwise unused) and
        # drop names without a beta estimate.
        screen=QTradableStocksUS() & beta.notnull(),
    )


# Pipeline output must be consumed inside before_trading_start; the
# original listing had this code at module level, causing the error.
def before_trading_start(context, data):
    context.output = algo.pipeline_output('pipeline')

    record(QTradableAndBetaNotNull=len(context.output.beta))

    additions = ''  # was never initialized in the original
    subtractions = ''

    # Log universe additions in chunks, since log lines are length-limited.
    # (The IDE ran Python 2 at the time, hence the print statements.)
    for stock in context.output.index:
        if stock not in context.last_QTU:
            context.last_QTU.append(stock)
            additions += stock.symbol + ','
            if len(additions) > 500:
                print "+:" + additions[:-1]
                additions = ''  # reset so the chunk isn't re-printed

    if len(additions) > 0:
        print "+:" + additions[:-1]

    # Iterate over a copy: removing from a list while iterating it skips items.
    for stock in list(context.last_QTU):
        if stock not in context.output.index:
            context.last_QTU.remove(stock)
            subtractions += stock.symbol + ','

    if len(subtractions) > 0:
        print "-:" + subtractions[:-1]



@ Viridian Hawk -

Probably a good move. My understanding is that Q needs to stick with the licensing agreements with data vendors, which of course forbid download of direct or derivative data. They'd eventually catch up with you, and put the kibosh on your efforts to democratize finance.