The 101 Alphas Project

101 Formulaic Alphas, a newly published paper from Zura Kakushadze, Geoffrey Lauprete and Igor Tulchinsky, spells out pseudo-code for 101 "alphas": volume- and price-based factors that can be computed across a large universe of stocks. The stock-by-stock values of these alphas can be used as a trading signal (positive for buy, negative for sell). The authors propose that while the alphas may not be individually profitable, their aggregate "mega-alpha" signal is capable of generating significant market outperformance.

Here at Quantopian HQ, we're very curious to see if the aggregate signal discussed in 101 Formulaic Alphas is truly capable of generating returns in excess of the market. That's where you come in. We need the Quantopian community's help turning each of the 101 alphas into a Pipeline algorithm. When we're done, we'll pull together data from all of the algorithms you wrote and share our findings here. We'll also release the aggregated signal data, so you can craft your own mega-alpha algorithm.

For an example, check out my implementation of Alpha #101 below. Alpha #101, ((close - open) / ((high - low) + .001)), spells out a factor (in this case a momentum factor) that is computed cross-sectionally by a custom Pipeline factor and used to rank the top 2,500 stocks by market cap. Before trading starts each day, these rankings are used to generate baskets of 200 names to buy and 200 to short.
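For a sense of how such a factor looks in code, here is a minimal sketch of Alpha #101 as a Pipeline CustomFactor (note that the attached algorithm actually substitutes the prior day's close for the open, for reasons discussed later in this thread):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import USEquityPricing

class Alpha101(CustomFactor):
    # Alpha #101: ((close - open) / ((high - low) + .001))
    inputs = [USEquityPricing.open, USEquityPricing.high,
              USEquityPricing.low, USEquityPricing.close]
    window_length = 1

    def compute(self, today, assets, out, open, high, low, close):
        # Intraday move scaled by the day's trading range.
        out[:] = (close[-1] - open[-1]) / ((high[-1] - low[-1]) + 0.001)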

Here are some guidelines to help you get started:

  • Your algorithm must be implemented using the Pipeline API.
  • Your algorithm should hold a minimum of 100 stocks in its portfolio at the end of each trading day.
  • Your algorithm should use the alpha factor to rebalance daily.
  • Each alpha gets its own thread. Please check for an existing thread for your alpha before starting a new one.
  • The title of your thread should match the following format: "101 Alphas Project: Alpha #10"
  • Attach the "101 Alphas Project" tag to your post.
  • Share a backtest from 1/1/10 to 1/1/15 with your post.

Some questions:

Q: Why Pipeline?
A: The Pipeline API is designed to support exactly the type of cross-sectional strategies described by the alphas in this paper. We also hope that this project will help us identify areas where Pipeline can be improved.

Q: How should I implement my algo's ordering logic?
A: Feel free to test out different trading logic implementations. Just make sure your algorithm holds at least 100 names and trades daily.

Q: What if I don't understand what some of these factors mean?
A: Let's discuss! We'll work together to determine how exactly each of the provided pseudo-code expressions can be translated into an algorithm.

Q: Someone else has already shared the alpha I want to work on. What should I do?
A: Share your alternate interpretation in the existing thread for that alpha. We expect there to be some debate over the proper implementation of some of the alphas.

Q: I'm new to Pipeline. Where should I start?
A: Take a look at the example implementation of Alpha #101 shared below. Also, check out the Pipeline help docs and example algorithms. To get a better sense of an alpha's functionality, try implementing your Pipeline factor in the research environment before moving it into an algorithm. If you do use research to build your alpha algorithm, be sure to share your notebook along with your backtest! Still stuck? Always feel free to send your questions to [email protected].
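For example, a quick research check of a factor might look something like this (using the Alpha101 sketch above; the dates are just placeholders):

from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline

# Compute the factor across the universe for a short date range and
# eyeball the results before wiring the factor into an algorithm.
pipe = Pipeline(columns={'alpha101': Alpha101()})
results = run_pipeline(pipe, '2014-01-02', '2014-03-03')
results.head()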

Got it? Pick an alpha and clone the algo below to get started with your implementation!

Clone Algorithm
# Backtest ID: 568afc2f61b19f11664e632c
Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

59 responses

Andrew,

I backtested your algo in daily mode over a full market cycle, with the default slippage and commission models enabled, and I have some questions:

Why are the default slippage and commission models disabled in the sample algo?

Why should the algorithm hold a minimum of 100 stocks in its portfolio at the end of each trading day?

Why should the algorithm rebalance daily?

Why should the backtest run from 1/1/10 to 1/1/15 (a pure bull market) and not over a full market cycle from 1/1/07 to 1/1/16?

Are the results of my backtest in line with the goal of the 101 Alphas Project?
Clone Algorithm
# Backtest ID: 56bd2ae67a398f12b7700eac

Vladimir,

These are all great questions. Thank you for asking.

// Why are the default slippage and commission models disabled in the sample algo?

In the interest of learning more about each alpha individually, we'd like to observe their signals in isolation. As you showed in your backtest, transaction costs can mask a weak signal. It'd be more interesting to look at the impact of transaction costs on our aggregated alpha algorithm. As the authors mention, combining multiple alphas can dramatically reduce transaction costs by crossing trades (combining multiple buy and sell signals into a single order).

// Why should the algorithm hold a minimum of 100 stocks in its portfolio at the end of each trading day?

Since we are using the orders placed by these individual algorithms to build an aggregate "mega-alpha," we'd like to have the ability to observe a substantial number of portfolio weights from each alpha.

// Why should the algorithm rebalance daily?

Just as we want to observe weights for an adequately large number of stocks, we also want to capture the weights prescribed by each alpha with high (daily) frequency.

// Why should the backtest run from 1/1/10 to 1/1/15 (a pure bull market) and not over a full market cycle from 1/1/07 to 1/1/16?

The authors of 101 Alphas use backtest dates of 1/1/10 to 1/1/14 for their analysis. In the spirit of scientific reproducibility, we've suggested also using those dates, with the addition of one year of true out-of-sample data. That said, you make a good point: a more robust analysis would attempt to capture a more diverse set of market regimes.

// Are the results of my backtest in line with the goal of the 101 Alphas Project?

Yes, while your backtest doesn't stick to the slippage/commissions guidelines, your post is helpful. The main goal of this project is to facilitate a discussion about the 101 Alphas. You've brought up some great points and revealed the weakness of Alpha #101's signal when taken individually.

Want to implement #12? (sign(delta(volume, 1)) * (-1 * delta(close, 1)))
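A minimal sketch of how that might look as a CustomFactor, assuming delta(x, 1) means today's value minus yesterday's:

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import USEquityPricing

class Alpha12(CustomFactor):
    # Alpha #12: (sign(delta(volume, 1)) * (-1 * delta(close, 1)))
    inputs = [USEquityPricing.volume, USEquityPricing.close]
    window_length = 2

    def compute(self, today, assets, out, volume, close):
        delta_volume = volume[-1] - volume[-2]
        delta_close = close[-1] - close[-2]
        out[:] = np.sign(delta_volume) * (-1 * delta_close)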

Andrew

Andrew,

Since Alpha #101, ((close - open) / ((high - low) + .001)), uses the open, and there is a known issue with USEquityPricing.open, is close[-2] in the snippet out[:] = (close[-1] - close[-2]) / ((high[-1] - low[-1]) + .001) your workaround for that broken access to USEquityPricing.open? Just want to verify. I'm new.

Assuming it is the workaround, is there any ETA on the fix?

# Alpha #101: ((close - open) / ((high - low) + .001))
# Custom factor that uses yesterday's close as a proxy for the open.

class RandomFactor(CustomFactor):
    # Pre-declare inputs and window_length
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]
    window_length = 2

    # Compute the alpha value for each asset
    def compute(self, today, assets, out, close, high, low):
        out[:] = (close[-1] - close[-2]) / ((high[-1] - low[-1]) + .001)

Georg,
That is correct. I am using the previous day's close as a proxy for the open price.

Andrew,

// Are the results of my backtest in line with the goal of the 101 Alphas Project?

Yes, while your backtest doesn't stick to the slippage/commissions
guidelines...

First of all, the slippage model is not mine, it is Quantopian's, and Quantopian applies it to every contest participant.
If you tell me that it is not realistic, I may agree with you.
But disabling it completely is unrealistic too.
There is only one way: fix it.

There are some more things to be fixed before you start on scientific reproducibility of the 101 Alphas.
The Quantopian backtest engine does not support the MOC and MOO orders required by most of the "alphas" in daily mode, which is what you chose for the sample algo.

I started a backtest of the same algo in minute mode with the slippage model disabled, and stopped it when, 10 months into a still-bull market, Alpha #101 had lost more than 50%.

Are the results of my backtest below, with the slippage model disabled, in line with the goal of the 101 Alphas Project?

P.S.
To my mind, the problem may not be in Alpha #101 but in your guidelines:

  • Generate baskets of 200 names to buy and 200 to short.
  • Your algorithm should hold a minimum of 100 stocks in its portfolio at the end of each trading day.
  • Your algorithm should use the alpha factor to rebalance daily.

Simple arithmetic.
In order to follow your guidelines, the algo every day should:
sell 200 stocks
buy 200 stocks
cover 200 stocks
short 200 stocks
That is 800 orders per day, or about 200,000 a year.
If we take the lowest possible commission of $1 per trade, then we pay $200,000 in commissions a year. If Alpha #101 makes more than $200,000 a year, the algo will be in profit; if it is roughly flat, which is more likely for a long-short system, then in 3 years you will lose more than half of the initial capital, and so on.
My backtest of pure Alpha #101, without any commission or slippage charges, in minute mode, spent almost the whole of the first 4 years flat or under water, and ended with this:

There was a runtime error.  
MemoryError  
Algorithm used too much memory. Need to optimize your code for better performance.  
Backtest id: 56beca845e48a812ad5ffe44  
Clone Algorithm
# Backtest ID: 56bea19371785b11d65ed71a

Hi Andrew,

I skimmed over the paper, and found:

In this section we describe empirical properties of our formulaic alphas based on data
proprietary to WorldQuant LLC, which is used here with its express permission. We provide as
many details as possible within the constraints of the proprietary nature of this dataset.

Perhaps I've missed it, but it seems that the authors don't actually describe which securities are being used to produce their results. I guess I'm confused about your objective here? Are you trying to reproduce results from the paper? Wouldn't you need to know what universe they are using?

Also, in the "alphas" I see what appear to be lots of "magic numbers" taken out to many significant digits. For example, we have:

Alpha#69: ((rank(ts_max(delta(IndNeutralize(vwap, IndClass.industry), 2.72412),
4.79344))^Ts_Rank(correlation(((close * 0.490655) + (vwap * (1 - 0.490655))), adv20, 4.92416),
9.0615)) * -1)

It seems like one would need to contact the authors to understand exactly what this means and how it was derived, before launching into an exercise of attempting to replicate it.

Frankly, I guess I can't see through to the end how this is gonna play out.

By the way, there are a bunch of relevant questions/comments on https://www.quantopian.com/posts/quantopian-lecture-series-long-short-equity-algorithm that you may want to consider.

One thing you might consider, if you haven't done it already, is calling the authors and describing Quantopian's data sets, platform, etc. to see if they have any advice (or hop on the train and meet them in person). Is the Quantopian platform at all applicable to their approach? They may be willing to share this sort of infrastructure advice. How much platform "horsepower" is required to pull off their strategy?

Grant

We also hope that this project will help us identify areas where Pipeline can be improved.

One challenge in using Q is that there is no user-facing issue/bug/suggestion tracker (although perhaps sometimes stuff gets flagged in the bowels of zipline). Taking pipeline as an example, information is scattered around the forums with no way to track it (e.g. when the USEquityPricing.open problem gets fixed, it'll be important to notify everyone who cares). Presumably, for pipeline, someone at Q has a list. Could it be published on an interactive forum?

One architectural problem you might consider is that if you are dealing with 101 factors (or whatever big number), not being able to do the computations in parallel may make the whole thing impractical. Has there been any discussion at Quantopian HQ along these lines? Assuming you get a bunch of nice little 'alphas', how does the computing problem scale when you cobble them together into a big honkin' optimized portfolio? Is the idea to run lots of backtests every day with an optimizer to determine the weights? If so, presently (at least for users), there is no way to do it that I'm aware of. In fact, there is no way to call even a serial backtest function from within the trading platform.

Hi Grant,

I think Andrew's intent here is to open a collaborative project for people who are interested in working on this type of signal research. I'm pretty excited about this work and I think it is pushing Quantopian signal development closer to the state of the art techniques used in the industry.

In light of that goal I would personally find it helpful to keep posts in this thread focused on the 101 Alphas project and have asked Jamie to curate this thread accordingly to keep things on topic.

A few thoughts on your question about the paper's universe selection methodology:

This is a great point - defining the tradable universe, especially knowing what securities to filter OUT, becomes more and more critical to building a successful quant strategy as you use dynamic selection criteria across increasingly large universes. This is something we've come to appreciate more and more as we vet strategies for potential allocations on the research team.

Here Andrew is making a good first cut that fits within current Q platform constraints whereby he's targeted the top 500 stocks by market cap. I don't know if it's precisely laid out in Zura's paper, but I'd expect they used something like the top 3000 stocks based on some liquidity filter while also excluding certain problematic symbols that tend to introduce noise.

A nice first improvement to this sample algorithm would be to redefine the investible universe as the top 500 most liquid symbols excluding ADRs, OTCs and non-primary shares. I think Andrew will also take a crack at that update, unless someone else here beats him to it :)

-Jess


Jessica,

I see that your guidelines for these state-of-the-art techniques differ from Andrew's (3000, 200, 200, every day).
Just two questions that are very critical to the outputs of this project:
In which mode should it be run, daily or minute?
Should commission and slippage charges be included in the equity calculation?

I see that your guidelines for these state-of-the-art techniques differ from Andrew's (3000, 200, 200, every day).

Not sure I follow? I wasn't intending to set guidelines, just trying to answer Grant's question about what I thought the universe the researchers were studying might be. Given the current platform constraints on Q it makes sense to use the largest universe size you can to most closely approximate sorting 'all' tradable equities.

Just two questions that are very critical to the outputs of this project:
In which mode should it be run, daily or minute?

Personally I think minute. More accurate, flexible and powerful in the long run. Worth the tradeoff in compute time (which we're continuing to work hard on improving).

Should commission and slippage charges be included in the equity calculation?

For determining whether there is any predictive power in these 'alphas', I think it's instructive to do research with slippage and commissions disabled. Certainly there's the follow-on issue of how you'd actually trade these alphas. But it can be powerful to decouple the alpha generation step from the portfolio construction and trade execution steps. I see this project as focusing more on the first part of the process initially.

Just a small thought for those interested: to me, those magic numbers mentioned by Grant and others suggest that these alphas were generated via genetic algorithms. These kinds of magic algos are extremely easy to generate with GAs for a fixed period, but out of sample they usually fail. (Note that the authors use only 80 of these in production; it's quite possible that they generate a lot of them and let the market decide which ones survive after a period, a very reasonable tactic.) I have done something quite similar in the past, but the real problem is not generating these algos, it's finding which ones really do work and which ones are just curve-fitted to the period.

The results posted here suggest that these might be fitted to the given period.

Note: I'm not sure this is true, but to check, I would test a few at random on out-of-sample data (dates outside their given dates) and see whether the out-of-sample results are in line with the in-sample ones. That would give an indication of how useful these algos actually are.

If Quantopian sees this project as a purely academic exercise decoupled from execution reality, then this should be mentioned at the top of the algo.

Thanks Jess,

Regarding your comment:

In light of that goal I would personally find it helpful to keep posts in this thread focused on the 101 Alphas project and have asked Jamie to curate this thread accordingly to keep things on topic.

If I've posted something irrelevant, I can move it to a separate thread, for feedback or to be ignored. Just let me know what material is objectionable.

Grant

There's a general problem here that a canonical clean universe needs to be established, if you are looking for users to work on establishing the "alphas". Instead of leaving it up to users, could you set it up like this:

security_lists.101_alphas  

Then, at each point in time (every day, I guess), users could refer to the list of valid securities prepared by Q.

Also, there needs to be a standard approach to handling securities that are de-listed over the timespan of the backtest. Maybe Q could bake that into security_lists.101_alphas?

It might be prudent to have code to cancel all orders prior to the close, to sorta replicate live trading behavior on Quantopian (or better yet, update the backtesting engine so that orders are canceled after the close, as they are in live trading).

If, indeed, there was some "secret sauce" such as genetic algorithms as mentioned above to establish the "alphas" and the authors are just giving us a snapshot in time (i.e. the parameters within the "alphas" are tweaked versus time), then it may be hopeless to try and reproduce their result (in fact, the authors make it clear that they are not giving away the farm with their white paper). It's not exactly an academic paper, but more of a "look how smart we are, come invest in our fund" type publication, so I'm just wondering if by definition, the reverse engineering exercise will be frustrating. Hence, if you contact the authors and they tell you to go pound sand, it might be an indication that reproducing their results will be tricky.

Take a look at the WorldQuant business model. These are the guys behind the Formulaic Alphas paper. Have a look at their terms of business. In particular look at Clause 3 of their WebSim Agreement:

**3. License to Input. In consideration of Our License, you hereby grant to us a royalty-free, worldwide, fully transferable, sublicenseable, irrevocable, non-exclusive, perpetual right and license to freely use, copy, distribute, make available to the public by any means, modify, adapt, translate, exploit, perform, display, make, sell, offer for sale, import, export and prepare derivative works based upon any and all code, algorithms, ideas, comments, recommendations, methods, designs, plans, techniques, processes, calculations, analyses, or other feedback, data or information (other than personal data or information) that you input into or provide at Websim or otherwise provide to us or that is incorporated into or reflected or effectuated by any of the foregoing (in whole or part) (any of the foregoing, individually or collectively, "Input") and/or to incorporate any and all of Input into other works in any form, media, or technology now known or later developed (such right and license, collectively hereinafter referred to as "Your License"). Moreover, unless otherwise restricted by applicable law, you hereby waive and, to the extent not waiveable, agree not to exercise any "moral rights" or corresponding rights in other jurisdictions in having Input edited, removed, modified, published, transmitted or displayed in a manner not agreeable to you or having Input attributed to you.**  

I don't think these are people I would want to share with. They reputedly have 450 staff. I wonder where they are based and how much they are paid.

I think I prefer the Quantopian model. At least these guys "claim" your IP remains your own.

At the risk of having my comments "curated" it would be interesting to understand what Q has in mind here, in a more global sense. Say I write some "alphas" (kinda the idea behind the crowd-sourced fund in the first place, I thought). How do I get paid if they end up being used in the Q hedge fund? Or is the idea that I would freely contribute "alphas" that could then be combined with those from other users into my own "mega-alpha" that could be in the running for Q hedge fund money? I see above:

We'll also release the aggregated signal data, so you can craft your own mega-alpha algorithm.

What does this mean? Would it simply be that each "alpha" would be coded up and released under an API (sorta like TA-LIB), and then users would pick and choose to formulate their own algos (that hopefully would be uncorrelated from everyone else's)? Or is there some other vision here? If I find an "alpha" that actually works, why would I want to share it? I guess I don't get the whole concept yet.

Guys,

Imagine you run a 100 billion dollar hedge fund and you have a team of PhDs working round the clock at your disposal. To this point they have found 50 reliable signals which work on US stocks, ignoring transaction costs. What do you do?

All signals are summed and you have an overall target quantity to be long/short at any given moment. Now the aim is simply to match that target as best as possible throughout the day. What do you do now?

Use your team of researchers to identify the most cost-effective way of meeting the target within some given time, i.e. minimizing transaction costs. This is a problem all on its own and real hedge funds have a team dedicated to this research task alone.

Obviously any signal that fails with the introduction of a 0.00001 commission fee is a dead duck. However, taking it to the limit, there is no point in considering transaction costs per signal -- this is really a separate job for another department to handle. As laws/taxes/costs come and go over time, it will affect the buying/selling strategy to minimize the transaction costs to meet the given target.

Throw in access to dark pools (where you buy/sell away from the exchange) and suddenly the transaction-cost problem has sky-rocketed in complexity. Again, it's not the job of the signal to consider transaction fees.

It might be prudent to have code to cancel all orders prior to the close, to sorta replicate live trading behavior on Quantopian (or better yet, update the backtesting engine so that orders are canceled after the close, as they are in live trading).

My comment above. In light of Jack's comments, probably not relevant in the long run. Obviously, one would not have such an arbitrary cancellation at the close for a real hedge fund. It would be part of the same optimization process. And perhaps there would be overnight signals, as well. That said, if the idea is to go live on the present Q platform in the near-term, then it might be important to have the cancel-all-orders-after-close rule in place for development. Even with the slippage model turned off, for thinly traded stocks, one could end up with open orders at the end of the day, which would be automatically cancelled by the current real-money live-trading system with IB.

Grant,

This project is purely a thought experiment. If we are able to reproduce some of the authors' results, great. If not, we've still learned something. Weak signal aggregation is a common challenge in the quant world.

I'm working on a Pipeline template that will filter out illiquid names, ADRs and OTCs. I'll also post some ideas on how we might go about aggregating these alphas.

Here's an example notebook showing how you can use Research to build out your alpha factor. In this example, I've limited my pipeline to stocks in the consumer cyclical sector. I'm generally curious to see the effect of sector neutrality when it comes to evaluating the alphas.

I imagine it'd be interesting to try out a Spearman rank correlation between each alpha (or combination of alphas) and the 1-5 day price movement.
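As a rough sketch of that calculation (this is not the attached notebook itself), the rank information coefficient for a single day might look like this, where alpha_values and forward_returns are aligned arrays with one entry per stock:

import numpy as np
from scipy import stats

def rank_ic(alpha_values, forward_returns):
    # Spearman rank correlation between today's factor values and the
    # subsequent price movement across the universe.
    ic, p_value = stats.spearmanr(alpha_values, forward_returns)
    return ic

# Toy example with made-up numbers:
# rank_ic(np.array([0.2, -0.1, 0.5]), np.array([0.01, -0.02, 0.03]))  # -> 1.0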

(Notebook attached; preview currently unavailable.)

Here are the results of an Alpha #101 simulation on WorldQuant WebSim, the site of one of the paper's authors.

Alpha#101: ((close - open) / ((high - low) + .001))

Sim Settings
Asset: EQUITY | Region: USA | Universe: TOP2000 | Date: 2016-02-15 | Language: EXPRESSION | Simulation: 5 Years | Decay: 4 | Delay: 1 | MaxStockWeight: 0.1

IS Summary
Year Size Long Short Pnl Sharpe Fitness Returns Drawdown Turnover Margin
2010 20.0M 1000 990 -954K -1.90 0.00 -10.06% 5.16% 81.10% -2.48bpm
2011 20.0M 998 993 -727K -0.95 0.00 -7.21% 5.05% 80.25% -1.80bpm
2012 20.0M 998 993 -1.07M -2.00 0.00 -10.66% 5.18% 80.83% -2.64bpm
2013 20.0M 1003 986 -863K -1.82 0.00 -8.57% 5.17% 80.93% -2.12bpm
2014 20.0M 991 993 285K 0.40 0.06 2.83% 6.29% 82.15% 0.69bpm

FYI, the simulation time was 15 seconds.

Vladimir,
It is not at all clear to me what this simulation represents. Is this just one person's work? Is this the combined result of applying all 101 algos? Or is this some sort of combined effort?

Either way, disappointing doesn't begin to describe it.

Anthony,
That was a single-alpha simulation of Alpha #101: ((close - open) / ((high - low) + .001)) on a universe of the top 2000 US stocks, from 2010 to 2014, with initial capital of $20,000,000.

Thanks Andrew,

I'm working on a Pipeline template that will filter out illiquid names, ADRs and OTCs

Is that all? For example, Simon filters out a bunch of other stuff in the code he posted here:

https://www.quantopian.com/posts/equity-long-short

There are also stocks that are de-listed...how to manage those? And would it make sense to avoid ETFs, etc.?

If the idea is for all "alphas" written under this project to use the same universe, then perhaps you should simply establish a list. Does the object security_lists get updated each day of a backtest? Or is it static? If it is actually point-in-time, then it would simply be a matter of your pulling together a list that would allow an apples-to-apples comparison across all of the "alphas".

If the fix to the USEquityPricing.open bug is just around the corner, wouldn't it make sense to postpone the project until it is implemented? Not being able to calculate open-to-close (or close-to-open) changes seems like a pretty severe limitation?

This project is purely a thought experiment.

I guess I'm confused. It doesn't fit with the hedge fund effort? If you aren't interested in eventually making money (and sharing some of it with users), then what's the point?

I thought this was all more of a fun crowd-sourcing game to try to code up all the (clearly algorithmically generated) signals in this paper and see what the replication aggregate looks like. A lot of these objections seem to be taking the fun out of it! Sorry I can't help write a few up, I am very busy at work. It might be quicker to write an expression parser which converts zura-formatted alphas into CustomFactors, though! :)

Sorry to be such a downer. I'll tone down the cranky old man routine. Code away!

Regarding execution in minute versus daily mode, you might consider daily, with the slippage model published here:

https://www.quantopian.com/posts/trade-at-the-open-slippage-model

Since this is a thought experiment/back-of-the-envelope effort, a first cut would run a lot faster (~390X) in daily mode and probably yield the same information (some checks could be performed to compare the results to minute mode). The pipeline data are daily, so daily trading would be consistent with the timescale of the information available to make decisions. And if the weak signal aggregation involves an optimization routine that requires calling backtests, then you might want to have things set up in daily mode initially, since there will be a lot of computations, it would seem.

On another point, it is interesting that the signals ("alphas") were algorithmically generated, as Simon points out. It suggests that there may be code to basically spit out the alpha formulas automatically (not just their parameters), by churning over trade data (versus a bunch of human monkeys on typewriters generating the code). And the more signals the merrier. So I'm wondering if a more fundamental approach would be to figure out how the authors came up with the alpha signals in the first place. It might be better to start from scratch and write the code to generate the alphas such that they would be automatically compatible with Q/pipeline.

As I gather, the magic happens by figuring out which signals to glom onto at any given time and in what combination (the "ephemeral" part mentioned in the paper), drawing from a large pool of alphas. Anybody know how this is done?

Any feedback Q team? Despite my seemingly skeptical comments, generally this seems like an interesting path forward for Q, if it can be made to work--particularly if you can figure out how to pay contributors of ephemeral signals, should the signals end up getting used profitably in the hedge fund. This would be at the opposite end of the spectrum to the institutional-grade, scalable long-short algos you seem to be requiring to be considered for the managers program--an approach that may not scale easily to your 60,000 users (especially if the strategies need to be uncorrelated). So, this project seems like a cool idea.

Having thought about the 101 Alphas project a little more, I think that a good first step would be to implement and test a couple of the alphas in Research. By separating the alpha discovery/analysis step from the ordering logic of an algo, we'll be able to more easily and clearly evaluate the raw signal from each alpha factor. Once we have a handle on a number of the alphas in research, we can think about ways to aggregate signals and write optimal execution logic.

I took a stab at a notebook to compute the Information Coefficient of a Pipeline factor. I started by computing the daily alpha values and tacking on columns for the 1, 5, and 10 day forward price movements for each equity/date combination. I then grouped by sector and computed the Spearman rank correlation between each forward price movement column and the alpha factor. I also tinkered with displaying monthly and daily IC for each sector over time and computing mean returns by factor decile over different time horizons.

I'm eager to hear your feedback on this factor analysis. We're interested in eventually building out a factor tear sheet in pyfolio. Hopefully, this project can help us determine exactly which features should go into that toolset.
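In rough sketch form (this is not the notebook itself), the core of that analysis looks something like the following, assuming a DataFrame df with one row per equity/date and columns 'alpha', 'sector', and forward-movement columns such as 'fwd_5d':

import pandas as pd

def sector_ic(df, horizon='fwd_5d'):
    # Spearman rank correlation between the alpha and the forward price
    # movement, computed within each sector.
    return df.groupby('sector').apply(
        lambda g: g['alpha'].corr(g[horizon], method='spearman'))

def decile_returns(df, horizon='fwd_5d'):
    # Mean forward price movement by alpha decile.
    deciles = pd.qcut(df['alpha'], 10, labels=False)
    return df.groupby(deciles)[horizon].mean()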

(Notebook attached; preview currently unavailable.)

Hi Andrew,

Does your notebook filter out undesirable "junk" that you would not want to trade? For example, Simon filters a bunch of stuff out of the universe in before_trading_start as illustrated on https://www.quantopian.com/posts/equity-long-short. Or does the fact that you are analyzing only stocks assigned sector codes take care of that?

Also, I'm still a little murky on the distinction between factors/alphas and signals. Is the end goal here to have a bunch of user-contributed black boxes that emit signals on a minute-by-minute basis (with the universe encapsulated in the black box)? And would each security signal from a given black box have a floating point weight, -1.0 to +1.0, corresponding to the strength of the short-long signal? Then a Q (or user-written?) aggregator would magically combine all of the signals every minute, manage the orders, slippage, transaction costs, etc. forming a glob of hedge fund perfection? Or would the factors/alphas be more universal (kinda like TA-LIB), with the universe up for grabs as part of the optimization? I'm just trying to picture the end game here.

I realized that you may already be working on the aggregation problem, since you have a stream of virtual signals from all of the contest submissions, no? Or maybe they aren't so good, or you don't have enough? So without even coding any new alphas, you could just use all of the live-running contest algos as signal sources (and I assume you store copies, so that even if an algo is stopped, you could still run it), to show how aggregation might be done. Just another thought for the pile...

@Grant
That notebook limits our universe to the 1000 most liquid names on each day. You are right in noticing that the screen criterion sector.eq(sector) filters out rows with NaN sector classifications. I think this should remove things like ETFs from our universe. Let me know if you see otherwise. As soon as support for boolean fields gets merged into Pipeline, we can get more sophisticated with our universe filtering.
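A bare-bones sketch of that kind of screen (make_universe_pipeline is a hypothetical helper, and the window length and universe size here are just illustrative):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import AverageDollarVolume

def make_universe_pipeline(sector_factor):
    # sector_factor: any Pipeline factor exposing the sector code, as in
    # the notebook above. NaN != NaN, so sector_factor.eq(sector_factor)
    # drops rows with no sector classification (e.g. most ETFs).
    dollar_volume = AverageDollarVolume(window_length=30)
    liquid = dollar_volume.top(1000)
    return Pipeline(screen=liquid & sector_factor.eq(sector_factor))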

I've been using the terms "alpha", "factor", and "signal" interchangeably. Apologies for the confusion. Yes, when we aggregate the alphas, we'd want to normalize their values in a way that gives each alpha an equal (or perhaps intentionally unequal) "vote" in the aggregate algo's portfolio construction logic. This signal aggregation and rebalancing would occur, at most, once each day.

Again, there isn't a direct for-profit motivation behind this project. The goal is simply to prompt greater discussion about the construction and evaluation of Pipeline factors. If there is enthusiasm for the alphas, we'd definitely be open to the idea of adding them to an importable factor library.

I'm still confused by the alpha/factor/signal distinction and how the aggregation would work. In my mind, every day, each minion robot generates a long/short list (say a Python dictionary of floating point numbers, keyed by security). Then, a master robot grabs the lists and sorts out what to do with them. Would the master robot only be allowed to weight each list? Or would it put together a global list of longs and shorts, with their weights as defined by the minions, and then optimize? And would the master robot be allowed to influence his minion robots? Or would they operate independently?

Don't you already have minion robots, in the form of the contest entries? It seems you could put the master robot to work.

Don't tell your management that there is no "direct for-profit motivation behind this project." This sort of talk usually doesn't go over well. Just make up a story about how this will generate oodles of profits. Or you could continue to say that there is no clear path to profit from the project, and see how that plays out...

A few more random thoughts for the pile:

As I see it, the minions are synthetic instruments, right? Like itsy bitsy ETFs, I guess. It is not obvious that they should conform to anything like what you are looking for in the contest. But what characteristics should a good minion have? I guess if the master finds a given minion useful, then that would be the measure. So, any idea how to do the aggregation? Or is the idea that a given minion could be judged in isolation?

Your intuition is correct and the questions you raise are the right ones too. There is not a single answer on how to do the aggregation, though. In practice, equal weighting seems to work surprisingly well. But you can get arbitrarily smarter/more complex than that. For example, you could do something like mean-variance optimization (selecting factors that have high mean information coefficient and low sd). Then, some factors might work well only during certain times so you want some factor rotation strategy. Then some factors might work better for some sectors than others. Some factors might also have non-linear relationships with the returns so you could build a model to allow for that as well. You could also use machine learning techniques like random forest to learn the optimal weightings. Then once you have the weightings you'd want to do some optimization to reduce trading costs and do risk control.

That is really the purpose of this research project, playing around with how to do all that. So once we have a couple of individual factors to play around with, that'll be the next part -- how to do the aggregation. If we knew how to do each step exactly, it wouldn't be called research.
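To make the simplest of those options concrete, here is a toy sketch of equal weighting (not the project's actual aggregation code), where factors maps each factor name to a pandas Series of one day's values indexed by equity:

import pandas as pd

def equal_weight_mega_alpha(factors):
    # Rank each factor cross-sectionally and recenter to [-0.5, 0.5] so
    # every factor gets an equal "vote" regardless of its raw scale.
    normalized = {name: values.rank(pct=True) - 0.5
                  for name, values in factors.items()}
    # The mega-alpha is the simple average of the normalized factors.
    return pd.DataFrame(normalized).mean(axis=1)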


Hello Andrew,

I looked over your example notebook above, Information Coefficient of a Pipeline factor. It's hard for me to follow due to several of my shortcomings: I'm not up the learning curve on Pipeline, I'm not familiar with the statistics you use, and generally, I'm not good enough at Python to be able to "read" through code and follow the flow. So, could you provide an executive summary and discussion? In a nutshell, what approach are you taking and why? Did you learn anything, or just end up with a bunch of nice plots?

I'll try to learn something about the statistics you selected.

By the way, when you have access to minute bars, I don't think it makes sense to use daily bars for Pipeline. My rationale is that you are introducing noise unnecessarily (out of convenience and maybe infrastructure cost, I think). Daily OHLC data are derived from single trades, and are inherently noisy. If I were thinking about this project, I'd go to my colleagues and say "Hey, this is crazy. We have all of these wonderful data, and you are forcing me to use daily bar data to make daily trading decisions. How about I synthesize my own smoothed OHLCV daily bars from the minutely data, and use those for Pipeline?" I'm not saying that Pipeline should run on minutely bars (Scott S. gives acceptable rationale on https://www.quantopian.com/posts/introducing-the-pipeline-api), but that users should be able to synthesize their own daily bars, so that smoothing and other forms of pre-processing can be applied. In the research platform, is it feasible to supply a custom data source to Pipeline?

equal weighting seems to work surprisingly well

That's interesting. But then wouldn't I just be pasting together a bunch of high-performing strategies? To get a 10% return, wouldn't I need returns from the strategies centered around 10%? I couldn't start with a bunch of sucky strategies and just cobble them together, right?

The paper seems to suggest that you want a large pool of transient (ephemeral) "alphas" and that the secret sauce might be in deciding, at any given point in time, which ones should be applied to the portfolio. So, knowing how the aggregation will be done could influence how the alphas are set up and evaluated for goodness. Looking for the kind of uber-algos that you require for the contest/Q fund might be the wrong approach.

do some optimization to reduce trading costs and do risk control

I gather that there are third parties who could do this for Quantopian. Overnight, you would just queue up the portfolio du jour and their execution engine would do the rest, subject to objectives and constraints. There's no point in using the Quantopian platform, since you'd have to code up your own order execution engine (although maybe you are already doing this for the Q fund?).

If I try to run the notebook above, I get:

In [10]:

len(equities)


NameError Traceback (most recent call last)
in ()
----> 1 len(equities)

NameError: name 'equities' is not defined

Presumably y'all are still working on this?

In any case, I've heard that backtesting can be vectorized. It might help to have a vectorized backtester for this project. An event-driven backtester would seem like a non-starter, and might be overkill anyway for comparing alphas in a relative sense. Also, I gather that there is a lot of pulling data from (remote?) databases/disk as Q simulations run. Could a bunch of RAM be allocated and all of the necessary data then loaded? Or is the hardware platform out of scope for this research project, and you'll just have to use what's available?

Also, you aren't really set up to collaborate with users. This work should be on github, and then integrated into the research platform, or something along these lines. Feasible? Trying to do this sort of thing on the forum seems doomed; it's the wrong tool.

Hi Andrew,

I attempted to read through your code again (the notebook you posted above). You would benefit from a write-up describing your plan of attack, rationale, discussion, conclusions, etc. in the form of a little report, including formulas and maybe a few references/examples. I think that the text could be incorporated into the notebook, right?

Also, rather than being constrained by the pipeline's use of only daily data (and no access to opening prices, as well), you could think in terms of sampling random stocks and then pulling their minutely data, basically bypassing pipeline altogether.

Hi Grant,

I'm working on a cleaner version of that factor analysis notebook. Stay tuned for more.

Also, in both research and live trading, Pipeline only works with daily data.

I just came across this brand new paper from the 101 Alphas authors on alpha aggregation: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2739219

Definitely worth a look.

Thanks for the update and paper reference.

Also, in both research and live trading, Pipeline only works with daily data.

Yes, but as a researcher, is that what you want? I suspect that the daily data used by pipeline results in noisy signals, since the OHLC values are individual trades of arbitrary volume. Intuitively, using such data for making sound trading decisions on a daily to few day timescale doesn't make sense--especially when you have access to minute bars (and as a Q employee, you could probably get at the tick stream used to generate the live bars, too).

Andrew,

Your IC notebook is so good. Would it make sense to see the same analysis for stock return volatility?


Of the 101 alphas in the paper, 43 of them use VWAP ("daily volume-weighted average price"). Daily VWAP is usually calculated from tick data. It can be approximated as (sum(V*(H+L+C)/3) / sum(V)) using the 390 minutely bars of each day. With Quantopian's current time/memory constraints, it is not practical to do this calculation for a broad universe of stocks, especially when testing factors that do not otherwise require intraday data (e.g. these 101). Ideally, VWAP would be calculated during database updates and added as a field in the daily price history series, where it could easily be accessed by pipeline custom factors. (The built-in "VWAP" factor does not calculate actual VWAP. Using window_length = 1, it simply returns the daily close. For longer windows, it returns the volume-weighted average of daily closes.)

mhp -

Yes, I've been kinda confused about Q's approach with regard to pipeline factors, and my comments, similar to yours, have fallen on deaf ears, seemingly. I think the game plan is to get the workflow in place for daily bars, and then go from there. However, Q does have a minutely database, from which daily factors could be derived en masse. It would require an overnight/weekend crunching of numbers, but it would be a single crunch, not 90,000+ crunches. This would work for common factors that are part of the API (a different story for user-defined custom factors). For backtesting, it could be a massive undertaking, though, since the factors would need to be pre-computed back to 2002, across all securities point-in-time. And as data errors are found, re-computed (at least for affected securities). Hence, the approach is to fit everything into the 5-minute before_trading_start window (at least that's my understanding of the pipeline mechanics). When backtesting, there is some periodic grabbing of data from databases, but all of the computations are done within before_trading_start (I think).

Currently, Quantopian is calculating the standard daily price fields (OHLCV) using their 1-min database, so that their daily bars reflect actual regular-hours trading activity.* It would take only a few lines of code and add virtually no extra processing time to also calculate VWAP during this process and add it as a field in the daily bars.

* This is as it should be. Most daily bar data services provide a strange and distorted view of the world by using the "official close" and especially by including pre-open and post-close trades in their daily volume calculations, but not in the daily high and low. This is a bizarre practice that I've never understood.

mhp -

Yes, but it would seem better to sort out the more general problem, and then daily VWAP (computed from minute bars) would be just another factor. In other words, within pipeline, if one could define factors that take in minute bars and output daily values, then it would be a more universal solution.

I've attached a crude example of how to do VWAP in before_trading_start across all of the Q1500US securities. In theory, one could try to take advantage of the before_trading_start 5-minute compute window to implement a bunch of trailing minutely factors, and then plug them into the workflow starting at the alpha combination step, but it would seem to be a total hack versus modifying pipeline.
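Roughly, the approach looks like this (a sketch, not the attached code; it assumes context.security_list has already been populated from a Q1500US pipeline):

def before_trading_start(context, data):
    # Trailing minute bars covering roughly the prior session.
    prices = data.history(context.security_list, 'price', 390, '1m')
    volumes = data.history(context.security_list, 'volume', 390, '1m')
    # Crude VWAP approximation: volume-weighted minute closes.
    context.vwap = (prices * volumes).sum() / volumes.sum()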

Clone Algorithm
# Backtest ID: 580b67e65c74201056270af0

Grant, thanks. That seems to work reasonably well. I changed it slightly to improve speed by only making one history call and to make the VWAP approximation a bit better:

hist = data.history(context.security_list, ('high', 'low', 'price', 'volume'), 390, '1m')
# Minute dollar volume, using the typical price (high + low + close) / 3
dv = hist.volume * (hist.high + hist.low + hist.price) / 3
# Volume-weighted average price over the trailing 390 minute bars
vwap = dv.sum(axis=0) / hist.volume.sum(axis=0)

I still think it could be argued that VWAP is a special case that could be added as a field to the daily bars, not just an example of a factor.

Note that the 390 minute window will capture data from more than one trading day, if the market closes early. There are some other potential problems, as well. I just did it for the quick example; there are better approaches (e.g. using the datetime stamps to isolate the data to the prior trading day, supposing that is what is needed).

Note also that I'm not sure if splits are adjusted for the current (not yet opened) trading day, when history is used in before_trading_start. Presumably the answer is 'yes' per the help page:

Prices are split- and dividend-adjusted as of the current date in the simulation.

Hi there guys,

Some time ago I tried adding up some of the alpha signals. The idea behind my work was that many almost-useless signals can add up to a good one using machine learning. As you may know by now, some of the alphas have very weird behavior. The work ended up being a mess, but I made the attached module with some of the alphas as methods.

Once the Alphas class is instantiated, e.g.

alpha_generator = Alphas(pn_data)  

where pn_data is a panel with stocks as items, time as the major axis and OHLCV as the minor axis.
You can retrieve the names of the available alphas as

alphas = [alpha for alpha in dir(alpha_generator) if alpha.startswith('alpha')]  

Then retrieve a particular alpha as

alpha003 = getattr(alpha_generator, alphas[3])()  

The code wasn't fully tested, and I know there are some pretty lousy implementations, like the decay_linear method. Also, I didn't use it within the Quantopian framework, so I don't know if it will work here.

Anyway, I hope someone will find this useful.

Cheers, JJ

(Notebook attached; preview currently unavailable.)

Incredible.

You are a beast of burden. And I mean that in a good way...

This is what I should have done all along but was too stupid/lazy to attempt.

Many thanks

Bravo, JJ!

Linking to this thread where someone wrote a compiler to add all 101 Alphas: https://www.quantopian.com/posts/alpha-compiler

Hi Q Support -

Just curious if this is still a project? It is still a tag one could apply to a forum post, and you must have lots of factors that could be combined by some magic. I'll share a few factors, if you still want to give it a go.

As a reminder, the plan was:

"When we're done, we'll pull together data from all of the algorithms you wrote and share our findings here. We'll also release the aggregated signal data, so you can craft your own mega-alpha algorithm."

Hi Grant,

Definitely! If you have a few factors we could add, we'll try to combine them with the ML 3 workflow.

I'll see what I can come up with. I'll be impressed if you can get to 101 factors, but maybe Pipeline is architected cleverly so that it scales gently with the number of factors. I got the impression that you were already bumping up against time/memory limitations.

The other thing is that your ML 3 workflow is going to get clunky with 101 factors. Is there any way they could be set up on GitHub and imported? For example:

from quantopian.pipeline.experimental import alphas101_factors

My interpretation of this project was that Q would be doing an offline combination of the alpha factors, and then publishing the result as a free data set that users could then use for their algos. Just adding a few more factors to your ML 3 workflow doesn't really get us there.

On a more general note, as you suggested, there needs to be modularity to the code so that various Pipeline alpha factor combination/aggregation techniques can be switched out and then compared (one wouldn't have to limit the scope to Pipeline alpha factors...minutely ones could be considered, as well...but that probably opens up a can of worms). I'm not so much of a Python and Pipeline whiz, so input would be welcome.

How many from the 101 still work? :-)