Alpha Compiler

I was quite excited about the 101 Alphas Project and the paper "101 Formulaic Alphas", so I wrote a compiler that takes an alpha equation and generates a Quantopian Pipeline factor. Attached is a notebook with 77 of the 101 alphas; see below for why there are only 77. I have tested the following alphas on Quantopian: 1-20, 23, 33, 41, 57, and 101. I could use some help testing the others, spotting bugs, and/or any useful feedback. (If you think the notion of alphas, or long-short equity, is a waste of time, please save it for another thread.)

The biggest problem is correlation: 51 of the 101 alphas contain the correlation operator, and while I did implement it properly, it is slow. One month of data takes 14 minutes to process. The real problem is that even though I set the screen to 500 or 1,500 equities, the custom factor still processes over 8,000 equities. Perhaps I am not using the screen correctly? From what I have read this is the intended behavior. There seems to be a workaround that I may try to speed things up.

Things I didn't implement (why there are 77, not 101, alphas):

-Run-time time series: Most alphas involve some time-series operator, and for most of them the number of days over which the operator is applied is known at compile time. However, there are nine alphas (71, 73, 76, 77, 82, 87, 88, 92, 96) with a time-series window that is not known until run-time. This could be done, but it would involve a nasty for-loop iterating over every equity. If you look at each of these you will notice that the dynamic operator is actually applied to a max() or min(). The paper's notes say that max() = ts_max(), but I suspect that in these examples max() actually behaves by returning the largest of its arguments. I contacted the author of the paper, but he couldn't say anything that was not already in the paper.
-IndNeutralize: I simply didn't get around to this. See Alpha#97. I think it is very valuable; any ideas?
-Fractional time-series days: A handful of time-series operators use a number of days that is not a whole number, like 9.991009 (see Alpha#62 as an example). For these I simply round to the nearest integer.
-Logical to floating point: In a few places a logical operation is used like a floating-point vector. See Alpha#95. Here I was not sure what value to assign to False (assuming 1.0 for True): 0.0 or -1.0?

A note on the distasteful for-loops: I originally wrote the for-loops as a way to reason about how the data needed to be arranged for the proper time-series operations, with the intention of later rewriting them as matrix operations. For-loops are slower than matrix operations, since in a matrix operation the looping is done in C or Fortran. What I noticed, however, is that these for-loops are not that bad. There is never a loop over the equities; it is always over days, which are usually in the single digits. Even an alpha with five embedded loops, like Alpha#29, does not appear to take much longer than an alpha with no loops. For the time being I have no plans to remove the for-loops.
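To make that concrete, here is a toy sketch (not the compiler's actual output) of the shape these loops take, using the paper's decay_linear operator as an example:

    import numpy as np

    # decay_linear(x, d): linearly weighted average over the last d days,
    # with weights d, d-1, ..., 1 (heaviest on the most recent day),
    # rescaled to sum to 1. `close` is a (days, assets) window, so the
    # loop runs over the day axis (single digits) while each iteration
    # is vectorized across all ~8000 assets.
    def decay_linear(close, d=5):
        weights = np.arange(d, 0, -1, dtype=float)  # [d, d-1, ..., 1]
        weights /= weights.sum()
        out = np.zeros(close.shape[1])
        for i in range(d):                  # loops d times, never over equities
            out += weights[i] * close[-(i + 1)]
        return out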


Hi Peter,

This looks like very interesting work! I'd love to see your compiler if you're willing to share. A few comments/responses to your particular questions:

Using screen: You're correct that passing a screen to your Pipeline doesn't change the data supplied to your CustomFactor compute functions. Passing a screen defines a post-processing step applied to the final output of your pipeline: all your terms get computed without any knowledge of the screen, and as a final step we throw out the rows for any asset that doesn't pass it. In general, it would be very hard to automatically apply a screen to all the terms in your pipeline. Consider, for example, a Pipeline that wants to get the returns for all assets whose daily returns are greater than zero:

rets = Returns(window_length=2)  
pipe = Pipeline({'rets': rets}, screen=(rets > 0))  

there's no way for us to pre-emptively apply the screen to the rets term of the pipeline, because we need the rets term to calculate the screen in the first place!

If you want to tell a Factor (or a Filter or Classifier) to compute on only a subset of all assets, you want to pass a Filter as the mask parameter to that factor. For example, here's a pipeline that computes SomeFactor on just the assets that have positive daily returns for the previous day:

class SomeFactor(CustomFactor):  
    window_length = ...   # Omitted for brevity  
    inputs = (...)

rets = Returns(window_length=2)

pipe = Pipeline({'my_factor': SomeFactor(mask=(rets > 0))}, screen=(rets > 0))  

Computing Ranks: One thing I notice is that you're manually calling rankdata() on each output row for many of your factors. Most of the time, if you want a rank of some numerical value, a more efficient and composable implementation is to calculate the raw value in your CustomFactor and then call rank() on your factor instance. rank() produces a new factor that computes ranks over the original factor. For example, here's a pipeline that computes ranks of daily returns (I'm using returns in all the examples here for simplicity, but Returns is just a vanilla CustomFactor defined in Zipline):

pipe = Pipeline({'rets_rank': Returns(window_length=2).rank()})  

There are three major benefits to this approach:

  1. It's faster, because the rank calculation is applied in one vectorized loop over the time period for which your factor is computed.
  2. rank supports mask and groupby parameters, which allow you to filter out and/or group data in interesting ways.
  3. It's easy to switch to another normalization method like zscore or demean, both of which provide the same interface as rank. (See the sketch after this list.)
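For instance, here's a minimal sketch of those variants (the Morningstar Sector classifier is my choice of grouping here, not something from the original post):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.factors import Returns
    from quantopian.pipeline.classifiers.morningstar import Sector

    rets = Returns(window_length=2)
    sector = Sector()

    pipe = Pipeline({
        # Rank only the assets with positive returns; the rest get NaN.
        'masked_rank': rets.rank(mask=(rets > 0)),
        # Rank each asset against the other members of its sector.
        'sector_rank': rets.rank(groupby=sector),
        # Same interface, different normalization.
        'zscored': rets.zscore(),
    })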

Computing Correlations: You mention that computing correlation coefficients is slow. You might want to take a look at the pearsonr and spearmanr methods. One complication is that those terms are not currently marked as window_safe, which means they're not allowed to be used as inputs to other Factors. Both correlations are safe to use in lookback calculations, however, so I've opened a PR in Zipline to make them window-safe.
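As a sketch of what that looks like (the SPY benchmark choice, and the `symbols` lookup from the research environment, are my assumptions):

    from quantopian.pipeline import Pipeline
    from quantopian.pipeline.factors import Returns

    # 10-day rolling Pearson correlation of each asset's daily returns
    # against SPY's daily returns.
    returns = Returns(window_length=2)
    spy_returns = returns[symbols('SPY')]   # slice one asset's column out of the factor

    pipe = Pipeline({
        'corr_to_spy': returns.pearsonr(target=spy_returns, correlation_length=10),
    })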

All in all, this is awesome work. Thanks for sharing!


+1 to take a look at your compiler!!

I found two bugs: one with delay, another that swapped corr and cov. Will push a new notebook later.

You guys don't want to see the code; it is really ugly.

Updated Notebook with bug fixes applied.

@Scott, thanks for your feedback. I will apply some of these suggestions.


Great!

This will definitely be useful for everyone. Perhaps Q can offer a bounty to the author to contribute this to Zipline?

Here is a notebook with 83/101 alphas. I have working prototype code for the remaining alphas that I will try to push by the end of the weekend.


This is really some great work!!!

Cool stuff. Unfortunately, none of the ones that use VWAP are correct, because the built-in "VWAP" factor doesn't actually calculate VWAP. See https://www.quantopian.com/posts/the-101-alphas-project#580b485b714ff6bdb800088b for details.

@mhp THANKS a lot for bringing this to my attention. I will look into that.

101/101 alphas!
I finished the remaining items and fixed a few bugs.

One note of caution: the Quantopian built-in factor for VWAP is not calculated the way the author of the original paper (and the rest of the world) calculates VWAP. Thanks to @mhp for pointing that out. I feel this topic deserves much more discussion, but that would be better suited to another thread.
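For reference, a minimal sketch of VWAP as the paper (and the rest of the world) defines it, sum(price * volume) / sum(volume) over the window; note that using the daily close as the price is itself an approximation at daily resolution:

    from quantopian.pipeline import CustomFactor
    from quantopian.pipeline.data import USEquityPricing

    class StandardVWAP(CustomFactor):
        """Volume-weighted average price over the trailing window."""
        inputs = [USEquityPricing.close, USEquityPricing.volume]
        window_length = 10

        def compute(self, today, assets, out, close, volume):
            # close and volume arrive as (days, assets) arrays.
            out[:] = (close * volume).sum(axis=0) / volume.sum(axis=0)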


I have made a primitive website where you can use the compiler to generate code for any alpha factor.
http://alphacompiler.com/

Let me know if you use it and encounter any problems.

Great stuff. What does the -1 mean? E.g., I replaced it with -5 in

(-1 * correlation(open, volume, 10))

Then I get

    class AlphaX(CustomFactor):  
        inputs = [USEquityPricing.volume, USEquityPricing.open]  
        window_length = 10

        def compute(self, today, assets, out, volume, open):  
            v0 = np.full(out.shape[0], -5.0)  
            v10 = np.empty((10, out.shape[0]))  
            for i0 in range(1, 11):  
                v10[-i0] = open[-i0]  
            v11 = np.empty((10, out.shape[0]))  
            for i0 in range(1, 11):  
                v11[-i0] = volume[-i0]  
            v1 = pd.DataFrame(v10).rolling(window=10).corr(pd.DataFrame(v11)).tail(1).as_matrix()[-1]  
            out[:] = v0 * v1  

Only the line

            v0 = np.full(out.shape[0], -5.0)  

changes.

In this case, what does the -5 mean?

Also, can you list more example equations and what they mean?

@Suminda, that is a great question. The -1.0 only serves to change the sign of the outputs. In this case, (-1 * correlation(open, volume, 10)) means that we want to short the securities that have a high correlation between open and volume over the last 10 days, and go long the ones with low correlation. We also feel more strongly about shorting the equities with a higher correlation. Each alpha factor is a prediction of the relative prices of these equities.

Depending on what you do with this, changing from -1 to -5 probably won't make any difference, as you are scaling all of the equities together, and you will probably apply some sort of normalization step afterwards that will rescale the values.
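A quick toy illustration of that point, using scipy's rankdata as the stand-in normalization:

    import numpy as np
    from scipy import stats

    corr = np.array([0.9, 0.1, -0.4, 0.5])   # toy correlation values for four equities
    print(stats.rankdata(-1 * corr))          # [1. 3. 4. 2.]
    print(stats.rankdata(-5 * corr))          # [1. 3. 4. 2.] -- same ordering, same ranks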

So in plain English, what does (-1 * correlation(open, volume, 10)) mean? It looks like an overbought indicator. If the price and the volume have been increasing together, that may mean the security is too highly priced and will come down. That makes sense, and sounds like it should work.

The given equation was Alpha#6 from the original paper. To be honest, I have not tried to explain all 101 alphas from that paper. Some of the alpha equations appear to have been written by a machine, for example Alpha#51: (0 - (1 * ((close - vwap) / decay_linear(rank(ts_argmax(close, 30)), 2)))). This is not the wildest alpha in the paper, and you could come up with an explanation, but the fact that a simple algebraic simplification is left undone makes me think it was generated by a machine.

Is it reckless to use an equation that you can't explain but you are very confident that it works?

Hi Peter,

Excellent work on compiling the 101 alphas. I've been testing a few out and found some that look promising. A question for you (if you could help a noob out here): I'm trying to rank 3 of these factors that I like and add them to the pipeline as their own column. I want to see how the combined factor looks in Alphalens. Any ideas on how to do that using your latest notebook? Any help you (or anyone else listening to this thread) could provide would be appreciated.

Attached is the notebook. Thanks


Is it reckless to use an equation that you can't explain but you are very confident that it works?

I don't think so. I recently listened to a podcast where Bert Mouler was the guest (I don't remember if it was Chat with Traders or Better System Trader), and he talked about how machine learning will find things that the human mind can't or won't. For example, he described a structure designed to absorb vibrations: engineers repeatedly failed to design one strong enough, so they let a machine learning algorithm loose on the problem, and what resulted was a very chaotic-looking structure that no one would ever have conceived, but it worked perfectly. The point is that just because we can't initially explain something doesn't mean it isn't valid. Some people are here to be right or to explain market dynamics; others just want to make money.

I love this community.

@Goldmember I'm glad you have found some useful signals. I have attached a modified version of your notebook with a new column called "sum589", which is the sum of the ranks of Alphas 5, 8, and 9, as an example of how to combine factors into another factor. I hope that helps. Please ask if you have more questions.

The last notebook I posted with all 101 alphas is not the most recent version. I have made about five improvements, but I thought no one cared, so I stopped posting here. I have pushed the latest code to http://alphacompiler.com/; you can cut and paste an alpha equation into the text box there and compile it to get the most recent code. I couldn't test all 101 alphas on Quantopian, so I built a Zipline version on my local machine. I plan to make most of this code (how to run Pipeline on your local machine) public soon.


@Peter - you are the man. Thanks so much for your help combining and ranking these factors!

So I went through and tested almost all of the 101 factors from the notebook individually the other night with Alphalens (except the ones with correlations; I couldn't get them to run in a decent amount of time). Did the five improvements you mentioned help the correlation code's runtime at all? Also, were the changes you made significant enough that you would recommend re-testing the factors using the code from http://alphacompiler.com/ ?

Thanks again!

I would rerun anything with ADV, anything with two or more for-loops (those were producing incorrect code), and anything with a delay inside a ts element like ts_rank(). Oh yeah, and anything with rank() or ts_rank(), and anything with scale(). Do you see why I automated this now?

I also normalized rank() and ts_rank() so that the maximum value is 1.0. For example, rank(open) produces the following code.

    class AlphaX(CustomFactor):  
        inputs = [USEquityPricing.open]  
        window_length = 1

        def compute(self, today, assets, out, open):  
            v0 = open[-1]  
            out[:] = stats.rankdata(v0)/np.float(out.shape[0])  

It became obvious that the original authors have a system made up of components that vary from -1.0 to 1.0. If you don't scale rank(), it will "wash out" other components; for example, the equation rank(open) + correlation(open, volume, 10) would be dominated by rank() if it were not scaled.
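A toy illustration of the wash-out (the 8,000-asset cross-section is just for scale):

    import numpy as np
    from scipy import stats

    n = 8000                                        # roughly the full US cross-section
    open_rank = stats.rankdata(np.random.rand(n))   # raw ranks run from 1 to 8000
    corr_term = np.random.uniform(-1, 1, n)         # correlation lives in [-1, 1]

    unscaled = open_rank + corr_term                # ranks swamp the correlation term
    scaled = open_rank / float(n) + corr_term       # both components now comparable in scale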

Thanks for the great work here, Peter!

I added a quick check with Alphalens to your last notebook, ranking Alphas 5, 8, and 9. It should be pretty straightforward to add up, rank, and check the performance of any other alphas people are interested in.


Thank you for your post. I am getting a compilation error about not being able to find the demean_by_group method.
Is that an out-of-date Quantopian built-in method?

@yira, demean_by_group() is not a Quantopian built-in function. It is a function I wrote to demean by sector, industry, or sub-industry. You could use any grouping in theory, but these are the groupings used in the original paper and also supported by the Morningstar fundamental data. I'm not sure which notebook you are running; if you look at the notebook Ian just posted, demean_by_group() is the first function. Make sure it is in your notebook and that its cell has been run. I hope that helps.
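For anyone reconstructing it, a minimal sketch of such a group-demeaning helper might look like this (the actual demean_by_group in the notebook may differ):

    import pandas as pd

    def demean_by_group(values, groups):
        """Subtract each group's mean from its members.

        values: 1-D array of factor values, one per asset.
        groups: 1-D array of group labels (e.g. sector codes), same length.
        """
        s = pd.Series(values)
        return (s - s.groupby(pd.Series(groups)).transform('mean')).values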

Is there still an aspiration that the 101 Alphas Project will be executed on Quantopian? Or has it been abandoned? Does anyone know the status? It would seem to fit the workflow, with a bunch of alphas followed by a combination step. Is it executable on Quantopian? What is the motivation behind this compiler effort?

Hi Grant,
In my mind, the 101 Alphas Project is a completed project. If you take any of the 101 alphas and compile it with my compiler (available here: http://alphacompiler.com/), you will get a CustomFactor that you can run on Quantopian. I personally have run over half of them on Quantopian and all of them on Zipline on my own machine. I know other members have run them as well.

The motivation behind writing the alpha compiler was:
1. to automate the translation of the 101 alphas into Quantopian executable code
2. to be able to generate new alphas quickly.

Some of the alphas involve correlation, and this takes a long time to compute. (Alpha#92 took 54,557 seconds to process one year's worth of data on the US 500 on my laptop, and my laptop is about 2x as fast as the Quantopian notebooks.) I would have run all of them in Quantopian research, but notebooks running longer than one hour usually error out, especially during US trading hours, and an alpha that takes 30 hours to compute wouldn't work in a Quantopian trading algorithm. Perhaps the alphas can be rewritten so that the correlation does not take so long; I would welcome any input on that.
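One possible direction (my suggestion, not something the compiler currently emits) is to replace the per-pair pandas rolling corr with a column-wise Pearson computed directly in NumPy over the (days, assets) window:

    import numpy as np

    def columnwise_pearson(x, y):
        """Pearson correlation of two (days, assets) panels, per column,
        with no Python loop over assets."""
        xd = x - x.mean(axis=0)
        yd = y - y.mean(axis=0)
        num = (xd * yd).sum(axis=0)
        den = np.sqrt((xd ** 2).sum(axis=0) * (yd ** 2).sum(axis=0))
        return num / den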

The correlation alphas actually performed well. Given this and some other reasons, I chose to spend my time developing for the Zipline platform.

Peter, thank you for the great work; we certainly care and are learning much from your sharing.

Kind regards,
Warsame

(If I recall, VWAP had not been fully addressed, at least not in a way that works cleanly with the Pipeline API.)

I've attached a modified VWAP function that leverages minute data and can be used within the Pipeline API. Feel free to try it out. FYI, in order to minimize the use of get_pricing within Pipeline factors, the function is a bit convoluted. It's more of a PoC, so there is a lot of opportunity for clean-up and further optimization.

Sam


Hi,

I just learned that the notebook I attached above, which includes VWAP with minute data, would not work as-is in a backtest. So I've attached a version of the above code that will work as part of a backtest.

It's also a bit hacky since it's just to prove you can use minute data within a Pipeline for purposes of calculating VWAP. But the code can always be cleaned up (and further vetted).

Let me know if you have any questions.

Sam

[Attached backtest: ID 59136802ed6a34659629274b]

@Sam I don't think your minutely VWAP is doing quite what you think it is. While I commend the effort in your workaround, I can promise that there's no way right now to reliably get minutely pricing data into a Pipeline calculation.

When the backtester runs a pipeline, it pre-computes all of your Factors/Filters/Classifiers in large chunks (we do this for a variety of reasons, but primarily for speed and memory savings). When you make a call to pipeline_output, the backtester simply loads the appropriate pre-computed chunk out of its cache. When that cache gets exhausted, we run another batch. You can see the implementation for this in Zipline here.

What this means for your workaround is that every time you load minutely pricing data in your compute function, you're mostly getting the same 390 minutes of pricing data.

Sorry for the confusion here.
- Scott

@Scott,

Yeah, I figured that Quantopian's architectural design intent is to prevent access to minute data within the Pipeline API due to the server-load implications, so any workaround would be nothing more than a temporary loophole.

Regarding your stated reasoning, I got deceived by the declarative nature of the Pipeline API :). It's easy to forget where the actual trigger/starting point is.

Is there any intent to provide an alternative option, e.g., pre-calculated VWAP data as part of the data source?

Thanks,
Sam