Thanks to everybody who submitted questions. I've gone through an initial batch below in the order in which they were asked. I will work to get another batch answered in the next week or two. I'm going to try to run some webinars and record some videos on the topics below, so watch out for those.

Does a positive IC skew mean that there's more predictive information in the top quantile (for longs) and vice versa for negative skew? If not, what does the IC skew tell me?

We have found that this statistic is not super useful in practice and will likely remove it from Alphalens in the future. IC skew refers to the distribution of information coefficients, not to anything about the percentiles. The IC is computed each day, which means we have a whole set of measurements of the model's predictiveness. Each IC alone is likely not that informative, as strategies are rarely highly predictive on any single day; they often become more and more predictive as you average together more and more predictions. Given that, we look at the distribution of IC estimates and try to decide whether the IC is drawn from a distribution with a mean of zero (not predictive) or nonzero (predictive). Skew is one moment of that distribution, and refers to how the mass is distributed in the histogram. Skew is a bit unintuitive to explain, but I recommend reading up on it if you're curious. It informs you about the distribution of times your model is and isn't predictive.
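For concreteness, here's a minimal sketch (with randomly generated data standing in for a real factor and real returns) of what that distribution looks like: one Spearman IC per day, and the skew statistic computed over those daily values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days, n_stocks = 252, 100

daily_ics = []
for _ in range(n_days):
    factor = rng.normal(size=n_stocks)            # today's factor values
    # fake forward returns with a weak link to the factor
    fwd_returns = 0.05 * factor + rng.normal(size=n_stocks)
    ic, _ = stats.spearmanr(factor, fwd_returns)  # one day's IC
    daily_ics.append(ic)

daily_ics = np.array(daily_ics)
ic_mean = daily_ics.mean()
ic_skew = stats.skew(daily_ics)  # the "IC Skew" statistic in question
print(ic_mean, ic_skew)
```

The point is that the skew is a property of the histogram of `daily_ics`, not of the factor values themselves.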

If I set the 'null hypothesis' threshold to 5%, and if the p-value is below 5% for the 5 day holding period but above 5% for 1 day (or vice versa), do I accept or reject the 'null hypothesis?' What 'bias' am I vulnerable to if I 'cherrypick' holding period based on what the IC p-value tell me?

Generally, models will have a predictive 'sweet spot' at which they work best. You can think of the days-forward as a parameter to the model, and you're trying to decide the best value of that parameter without overfitting. You are definitely vulnerable to p-hacking (multiple comparisons bias) if you look at a ton of horizons and pick one that happens to pass your null hypothesis test. As such, you wanna treat it no differently from any other parameter choice. I would start by picking wide parameters, like 1, 5, 20, then zoom in on the one that works best to see if it's indeed a smooth local peak, or whether that number of days just happened to randomly work well. For instance, if 5 works best, then try 3, 5, 7 to get a better sense of the total parameter space. Then, once you've chosen what you believe to be the best number of days, run an out-of-sample test on new data and check the ranges again to make sure that the same structure exists in that parameter space.
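A rough sketch of that wide-then-narrow sweep, using synthetic factor and price data in place of a real pipeline:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n_days, n_stocks = 300, 50
factor = pd.DataFrame(rng.normal(size=(n_days, n_stocks)))
# random-walk prices with extra rows so forward returns exist at every horizon
prices = pd.DataFrame(100 + rng.normal(size=(n_days + 20, n_stocks)).cumsum(axis=0))

def mean_ic(factor, prices, horizon):
    """Average Spearman IC between the factor and `horizon`-day forward returns."""
    fwd = prices.shift(-horizon) / prices - 1  # forward returns per stock
    ics = []
    for day in range(len(factor)):
        ic, _ = stats.spearmanr(factor.iloc[day], fwd.iloc[day])
        ics.append(ic)
    return np.nanmean(ics)

# Start wide (1, 5, 20); then you'd zoom in around whichever horizon works best
results = {h: mean_ic(factor, prices, h) for h in (1, 5, 20)}
print(results)
```

On random data like this no horizon should stand out; the danger the answer describes is that one of them will anyway, by chance, if you test enough of them.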

Does choosing a longer test period reduce the likelihood of getting a false-negative p-value? (maybe I should be more concerned with false-positives?) What's a good balance? Is one year + one month of future returns sufficient?

Yes, more data while running the same number of tests reduces the chance of both false negatives and false positives by upping your certainty in general. However, if you also run more tests, you're back to the same level of risk (p-hacking). The amount of time you need to validate depends on how often your strategy trades. If it trades many securities each day, you build up confidence in the predictive power much more quickly, and a month might be okay. A slower strategy will need more time. In general you wanna pick a confidence level that makes sense for your situation. If this is the only strategy you are personally investing in, then you wanna be super confident. If this is one of 100 strategies being invested in by an institutional investor, you probably are actually okay with more false positives, because the flip side is that you'll discard fewer good strategies. There's some recent research on this here. Sorry I don't have an easy answer; it's very context dependent and there's no general rule.
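As rough intuition for why more observations help: if daily ICs were independent, the standard error of the mean IC would shrink like 1/sqrt(N). A toy calculation (the 0.15 IC volatility is an arbitrary assumption, and real ICs are autocorrelated, so treat this as a lower bound on the uncertainty):

```python
import numpy as np

ic_std = 0.15                    # assumed day-to-day IC volatility
for n_days in (21, 252, 1260):   # ~1 month, ~1 year, ~5 years of daily ICs
    stderr = ic_std / np.sqrt(n_days)  # uncertainty in the mean IC estimate
    print(n_days, round(stderr, 4))
```

A strategy that produces many independent predictions per day accumulates effective observations faster, which is why it can be validated on a shorter window.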

If I still 'believe' in a factor, but p-value > 0.05, is it wrong / bad practice to test a different time series to see if it returns a p-value < 0.05? Essentially p-value hunting to 'fit' my hypothesis (did I just answer my own question there?). It's possible that a factor has predictive power during some time periods, and not during others though, correct? (non-stationarity (?))

Generally yes, tweaking something until it works is textbook p-hacking. It's okay to test an idea, notice it doesn't quite work yet, and then use what you learned to improve the idea. But you definitely have to out-of-sample test it afterwards to make sure you didn't just overfit to the in-sample data. Just make sure you aren't repeatedly testing (a few times is probably okay) on out-of-sample data, as that just puts you back at square one.

Similar to above, for 'false-positive' p-values (below 0.05) and IC, in general, is it a good idea to re-test a 'good' factor (p-value of <0.05 and high IC) during a different time period (cross-validation(?)), to ensure it wasn't just a 'false-positive' during the first test? If so, by how much does a second test reduce the likelihood of a 'false-positive' if both tests have p-value of <0.05 and high-ish IC?

Yes. Cross-validation is a specific technique, though; I think what you're describing is generic out-of-sample testing. A model that holds up with similar accuracy statistics in out-of-sample data is a very strong indication that you have found a good model. Just make sure you aren't repeatedly testing (a few times is probably okay) on out-of-sample data, as that just puts you back at square one.

If a 'combined_factor' has some individual factors with p_values above 0.05 and/or negative IC during the test period, but when removing them from the 'combined_factor' results in significantly lower returns in backtests (i.e. fitted to market noise), what should I do with the 'bad' factors? Should I try to see if any combination of non-predictive factors, when combined have a p-value of < 0.05? Or just remove them altogether?

This is a very interesting case. It's very possible for models that are not predictive alone to be predictive in combination; this is known as non-linearity. It means that either some interaction between the models is doing the predicting, or you just happen to be overfit. To investigate this further, I would look at all the possible combinations of your individual models. Watch what happens as you add and subtract different ones, and notice whether there's a specific combination that's causing the predictiveness. Also look at the IC and returns as you add and subtract, not just the p-value.

Some research has been done looking at alpha factor interactions. Think about the ranking of two factors: when you combine both, each stock gets assigned a point in 2D space, so now instead of quintiles you have a 5x5 grid of buckets into which stocks can fit. While it's definitely true that adding complexity increases your risk of overfitting, if you start from a hypothesis that two models should interact in an interesting way, there are definitely some cool effects you can explore here. For instance, you might hypothesize that whereas price/equity ratio alone is not related to returns, stocks with a high price/equity ratio and high amounts of debt will have negative excess returns in the future.
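A toy sketch of that 5x5 grid idea, with synthetic data where returns depend only on the *interaction* of two factors (neither predicts anything on its own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
# planted effect: returns depend only on the product of the two factors
fwd_returns = 0.1 * f1 * f2 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"f1": f1, "f2": f2, "ret": fwd_returns})
df["q1"] = pd.qcut(df.f1, 5, labels=False)  # quintile bucket on factor 1
df["q2"] = pd.qcut(df.f2, 5, labels=False)  # quintile bucket on factor 2

# 5x5 grid of mean forward returns; the corners stand out even though
# neither factor has any univariate predictive power
grid = df.pivot_table(values="ret", index="q1", columns="q2", aggfunc="mean")
print(grid.round(3))
```

Scanning a grid like this for each pair of models is one concrete way to "watch what happens as you add and subtract" factors.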

Does a higher IC Kurtosis mean that there are more extreme outliers and therefore a stronger case for 'winsorizing' the factor? My factors tend to have fairly low IC Kurtosis values (well below 3, which I believe is the value for a normal distribution (?)). Is this good or bad?

As before in the skewness question, the kurtosis here refers to the distribution of the IC values, and not to the specific values of the factor itself. If it’s particularly high, then it’s worth investigating why that’s the case. It means that the model is not uniformly predictive day over day, and that there may be structures governing the days on which it predicts better.

IC Std. - the lower the better in relation to the IC Mean?? What does IC Std. actually tell me?

The lower the better. A low IC standard deviation means that you can have higher confidence in the mean value of the IC. A large amount of variance means you can’t be as certain. This is really no different from standard concepts around confidence intervals.

Risk-adjusted IC - is this 'volatility' adjusted IC? If so, is it essentially: IC / IC Std.?

The nice part of Alphalens being open-source is that one can just check, which is precisely what I did. It is the IC mean / IC standard deviation. Again coming back to notions of confidence intervals, it just describes the IC mean as a number of standard deviations away from 0, no different from a z-score. However, you need to be careful not to assume anything about the distribution of IC values, as you don't know whether they're normal. Basically it's a way of comparing two different models. If one has a mean IC of 0.1 but an IC std. of 0.1, it will get a risk-adjusted IC score of 1. If another has a mean IC of 0.05 but an IC std. of 0.01, it will get a score of 5. This doesn't mean the second is definitely better; it just means that you have relatively more confidence that the second IC is meaningfully different from 0, and less in the first case.
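The computation itself is trivial; here it is with the hypothetical numbers from the answer above:

```python
model_a = {"ic_mean": 0.10, "ic_std": 0.10}
model_b = {"ic_mean": 0.05, "ic_std": 0.01}

def risk_adjusted_ic(ic_mean, ic_std):
    # mean IC expressed in units of IC standard deviations, like a z-score
    return ic_mean / ic_std

score_a = risk_adjusted_ic(**model_a)  # roughly 1
score_b = risk_adjusted_ic(**model_b)  # roughly 5
print(score_a, score_b)
```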

T-stat(IC) - what does this tell me, in relation to alpha research? Higher the better?

This is just the raw t-stat from the t-test that checks whether the IC values were likely drawn from a distribution with a mean of 0 (not predictive) or not (predictive). Remember that the t-test makes distributional assumptions about the underlying data, so this test is certainly not perfect; it's more of a rule-of-thumb estimate. The t-stat itself is not particularly useful, I would mostly just look at the p-value.
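A sketch of the underlying test using `scipy.stats.ttest_1samp` on a synthetic series of daily ICs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# a year of fake daily ICs with a small positive mean
daily_ics = rng.normal(loc=0.03, scale=0.15, size=252)

# one-sample t-test of the IC series against a population mean of zero
t_stat, p_value = stats.ttest_1samp(daily_ics, popmean=0.0)
print(t_stat, p_value)  # reject "mean IC = 0" when p_value is small
```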

In this Q Short video on IC and P-values, Delaney mentions at around 3:15 in the video that there's currently a lot of debate whether p-value analysis is meaningful or not. Could you expand a bit on this please? Is there a school of thought that argues that p-values should be seen as 'relative' (i.e. the lower the better) rather than binary?

Here is a series of blog posts discussing why the author thinks that p-values are not a great way to test hypotheses. One of the most compelling arguments, in my opinion, is just how poorly they are understood, and how often they are consequently misused. P-values are delicate and complex things, and they are only as useful as your interpretation of them. Be careful, and read up if you want, but for now it's probably enough to just treat them as binary: either above or below 0.05. Make sure you also check the effect size, in this case the volatility adjusted IC and mean IC.

In the graph, IC Observed Quantile (y-axis) to Normal Distribution Quantile (x-axis), does one want to see the bottom left tail of the S-shaped plot to be above the divider line, and the upper right S-tail to be below the line? Or does it not matter as long as it's a clear S-shape?

Quantile-Quantile plots tell you how closely your data follow a baseline distribution; in our case we use the normal distribution as the baseline. If you notice a deviation, it's because the distribution of IC values is not behaving in a normal fashion. That's generally to be expected in real data, I think, but the plots can give you clues as to how the data might be deviating. Normally distributed data will fall exactly on a straight line; any deviation from that line indicates a dearth or surplus of observations in that quantile.
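If you want to quantify the deviation rather than eyeball the plot, `scipy.stats.probplot` fits the straight line for you and reports the fit's correlation coefficient. A sketch on synthetic data, comparing a normal sample to a fat-tailed one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
normal_ics = rng.normal(size=500)
fat_tailed_ics = rng.standard_t(df=3, size=500)  # heavier tails than normal

# probplot returns ordered data vs. theoretical quantiles plus a
# least-squares line fit; r is the correlation coefficient of that fit
(_, _), (_, _, r_normal) = stats.probplot(normal_ics)
(_, _), (_, _, r_fat) = stats.probplot(fat_tailed_ics)
print(r_normal, r_fat)  # the fat-tailed sample fits the line worse
```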

Does an IC value of 0.1 mean that the factor if predictive 10% of the time, and the other 90% of the time it's just random noise / coin-flipping? And an IC value of 1.0 means the factor is predictive 100% of the time?

Basically yes, but be careful about taking this too far without checking the actual math. A mean IC of 0.1 means that on average the correlation between your model's predictions and real returns is 0.1. A perfect model would have a mean IC of 1.0 and a std. of 0.0. A coin flip will have a mean IC of 0.0, though I'm not actually sure what std. you'd get.
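A quick simulation sketch of the coin-flip case (synthetic, completely unrelated factor and returns): the daily ICs scatter around zero, with a spread of roughly 1/sqrt(n_stocks - 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_days, n_stocks = 1000, 100

ics = []
for _ in range(n_days):
    factor = rng.normal(size=n_stocks)
    returns = rng.normal(size=n_stocks)  # no relationship to the factor
    ic, _ = stats.spearmanr(factor, returns)
    ics.append(ic)

ics = np.array(ics)
# mean near 0; std near 1/sqrt(n_stocks - 1) ≈ 0.1 for 100 stocks
print(ics.mean(), ics.std())
```

So even a signal-free factor produces nonzero daily ICs; the std. of the null distribution depends mostly on how many stocks you rank each day.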

How can I check the correlation of two different alpha factors?

I'm attaching a notebook that does this in a separate comment. Basically, you construct a portfolio based on each factor, then you check the correlation of the returns. This methodology relies on your choice of portfolio construction, but a portfolio that longs the top quintile and shorts the bottom is common in industry. If you're worried, you can try a variety of portfolio methods, or a method that better fits how you actually trade.
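For anyone who wants the gist without the notebook, here's a minimal sketch of that approach on synthetic data (this is an illustration, not the attached notebook): build an equal-weight top-minus-bottom-quintile portfolio for each factor, then correlate the two return streams.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n_days, n_stocks = 252, 100
returns = pd.DataFrame(rng.normal(scale=0.01, size=(n_days, n_stocks)))

def factor_returns(factor_values, returns):
    """Daily returns of an equal-weight long-top / short-bottom quintile portfolio."""
    out = []
    for day in range(len(returns)):
        q = pd.qcut(factor_values.iloc[day], 5, labels=False)  # quintile buckets
        longs = returns.iloc[day][q == 4].mean()
        shorts = returns.iloc[day][q == 0].mean()
        out.append(longs - shorts)
    return pd.Series(out)

factor_a = pd.DataFrame(rng.normal(size=(n_days, n_stocks)))
factor_b = pd.DataFrame(rng.normal(size=(n_days, n_stocks)))

corr = factor_returns(factor_a, returns).corr(factor_returns(factor_b, returns))
print(corr)  # near zero here, since these random factors are unrelated
```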

If I combine two uncorrelated alpha factors that I've found, is there any need to run the combined factor through Alphalens (over a different training period?)?

Yes, absolutely. You don't know what kind of non-linear effects are introduced by combining your models, so you want to check that the stats are at least as good as each model independently. Ideally the mean IC will be at least the average of the two independent mean ICs, and the IC std. will be strictly lower than the average of the independent IC stds.
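A sketch of that sanity check on synthetic data: two weak, independent signals, combined here by averaging per-day z-scores (the signal strengths and the combination rule are arbitrary assumptions for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_days, n_stocks = 252, 100

f1 = rng.normal(size=(n_days, n_stocks))
f2 = rng.normal(size=(n_days, n_stocks))
# forward returns with a weak, independent contribution from each factor
fwd = 0.1 * f1 + 0.1 * f2 + rng.normal(size=(n_days, n_stocks))

def ic_stats(factor, fwd):
    """Mean and std of the daily Spearman ICs."""
    ics = [stats.spearmanr(factor[d], fwd[d])[0] for d in range(len(fwd))]
    return np.mean(ics), np.std(ics)

def zscore(x):
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

combined = zscore(f1) + zscore(f2)

mean1, std1 = ic_stats(f1, fwd)
mean2, std2 = ic_stats(f2, fwd)
mean_c, std_c = ic_stats(combined, fwd)
print(mean_c, (mean1 + mean2) / 2)  # combined IC vs. average of the two
```

In this setup the combined factor's mean IC comes out above the average of the individual means, which is the kind of improvement you're hoping to confirm in Alphalens before trusting the combination.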

By combining two uncorrelated (and seemingly conflicting) alpha factors, for example one Momentum factor and one ST Reversal factor, wouldn't you lose alpha by combining them?

Certainly possible. To the extent that the factors are based on contradictory models, you will definitely lose alpha. To be clear, when we talk about combining factors, we mean factors that are independent and have no reason not to be combined. Another thing to think about is the predictive time frame of the factors: you probably want to combine factors with similar time frames, or at least make sure that they help each other. Momentum factors tend to be slower than short-term reversal in my experience, so combining the two might never actually conflict. However, the frequency at which you're trading will decide which factor's signal you're actually using, which complicates things. It's certainly possible to combine some slightly slower (say weekly) factors with some faster ones (say daily) and get benefits. This can be especially helpful if you have a weekly model and need to increase turnover to get within our contest criteria bounds.

Taking the above example again, one Momentum Factor and one ST Reversal factor combined into one. The 'Strategic Intent' for the Momentum factor is the Newtonian idea that 'stocks in motion tend to stay in motion' whereas the 'Strategic Intent' for the ST Reversal factor is essentially the opposite (Mr. Market overreacts in the short-term). Is it ok to have separate Economic Rational for each factor, or is it not a good idea to combine two factors that have conflicting rationale (as they might be negatively correlated (?))?

Conflicting rationale is definitely a worrying issue, but again let’s think about the time frames. Momentum will generally target weekly/monthly timeframes from what I’ve seen. Momentum says that an upwards swing is indicative of a long term upwards trend. You can still trade on the up/down noise that happens on that upwards trend. So you could even combine momentum and mean reversion models in an intelligent way by using momentum to effectively detrend the series.
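A very rough toy sketch of that detrending idea on a single synthetic price series, using a moving average as a stand-in for a momentum/trend estimate (real implementations would be considerably more careful):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
n = 500
trend = np.linspace(0, 5, n)           # slow upward drift (the "momentum" part)
noise = pd.Series(rng.normal(size=n))  # fast up/down wiggle around the trend
price = 100 + pd.Series(trend) + noise

slow_ma = price.rolling(50).mean()     # trend estimate from a slow moving average
residual = price - slow_ma             # detrended series to mean-revert on

# naive reversal signal on the residual: short above the trend, long below
signal = -np.sign(residual.fillna(0))
print((signal == 1).mean(), (signal == -1).mean())
```

The momentum component decides the trend you lean into; the reversal component trades the noise around it, so the two rationales end up complementary rather than conflicting.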

It would seem that for Alphalens to be helpful, the risk model would need to be incorporated. For example, say I decide to cook up a factor, and it just happens to be similar to one of these:

from quantopian.pipeline.experimental import Momentum, ShortTermReversal, Size, Value, Volatility

I could spend a lot of time analyzing and perfecting my factor, only to have a very unpleasant surprise when I try to use it in an algo for the contest/fund.

Any thoughts on how to address this problem?

We've built an integration between Alphalens and Pyfolio that allows you to construct a basic portfolio based on your alpha factor and then run that portfolio's returns through Pyfolio's risk exposure analysis. We'd love to build a more integrated risk breakdown into Alphalens; this is what we currently have. As mentioned above, I'm attaching a notebook in a separate comment that shows how to do this.