One thing that i believe probably contributes to the problem is missing or NaN entries in some data fields of some stocks at some times. Often i have observed that inputting a factor as one of the terms in the "combined factor", and then inactivating that term later by multiplying by zero does NOT produce the same results as omitting the factor.

For example, i naively assumed that:

combined_factor = (weight1 * factor1.zscore() + weight2 * factor2.zscore() + 0.0 * factor3.zscore())

"should" be the same as:
combined_factor = (weight1 * factor1.zscore() + weight2 * factor2.zscore())

because "mathematically" the equation: Result = weight1*factor1 + weight2*factor2

is completely equivalent to: Result = weight1*factor1 + weight2*factor2 + 0.0*factor3

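But in fact they turn out NOT to be the same, because each factor is computed from elements of a data set, and if a data item is NaN then 0*NaN evaluates to NaN rather than 0. The NaN then propagates through the sum, so the combined factor is missing for that stock, which is not the same as omitting the term.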
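To illustrate the point outside of Pipeline, here is a minimal sketch in plain pandas. The tickers, factor values, and weights are all made up for illustration, and the series stand in for already z-scored factors; this is not the Pipeline API, just the same arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical z-scored factor values for four made-up tickers.
# factor3 is missing (NaN) for "CCC".
idx = ["AAA", "BBB", "CCC", "DDD"]
f1 = pd.Series([1.0, -0.5, 0.2, -0.7], index=idx)
f2 = pd.Series([0.3, 0.8, -1.1, 0.0], index=idx)
f3 = pd.Series([0.9, -0.4, np.nan, -0.5], index=idx)
w1, w2 = 0.6, 0.4  # hypothetical weights

# 0.0 * NaN is NaN, so "CCC" loses its combined value entirely.
with_zero_term = w1 * f1 + w2 * f2 + 0.0 * f3

# Omitting the term keeps a value for every stock.
without_term = w1 * f1 + w2 * f2

# Filling the NaN before multiplying makes the zero-weighted term harmless.
zero_filled = w1 * f1 + w2 * f2 + 0.0 * f3.fillna(0.0)

print(with_zero_term)
print(without_term)
print(zero_filled)
```

In this toy case the zero-filled version matches the version with the term omitted, while the unfilled version has a NaN for the stock with the missing data point. Whether filling with zero is the right choice for a real factor is a separate question, since a zero z-score quietly treats the stock as "average".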

So, before coming to the issue that you @Joakim raise of factors being predictable, i sometimes get caught up with the problem of whether or not the factor even EXISTS at some times for some stocks. Every time i have tried cleaning up this problem by explicitly setting all NaNs to zero, i have also run into the problem that my code doesn't run properly. This is probably just because of limitations in my python skills, so if you can help me by showing me the required modification to your code snippet above to "zero the NaNs" correctly, then i would be most appreciative. Please ...

Now, coming on to the issue of factor PREDICTABILITY, let's consider using our general "combined_factor" and calling the result of it the "solution landscape". Just as with technical analysis inputs, i observe that with fundamental inputs the solution landscape is sometimes very smooth (good) and we can find high plateaux & gently sloping hills in multiple dimensions where the tops are meaningful extrema, whereas sometimes i infer that parts of the solution landscape surface are very jagged and ill-behaved indeed. That's not only a problem mathematically for optimizing, but also a problem in terms of robustness of the (financial) solution. Of course we would like to have smooth solution landscape surfaces.

One way of helping with this is to be very careful with ratios. A common but actually very nasty fundamental ratio is PE. Of course the stock price P is always positive > 0, but EPS can be positive, negative, zero, or missing. And so, if earnings decline, go negative, then rise again, we will have 2 singularities in PE, as well as whatever might happen with the NaNs for missing EPS data. Even if we avoid those singularities, we still get wild extreme values of PE at low values of EPS, so any solution landscape obtained from using PE as an input is potentially unstable, ill-behaved, or at least somewhat unpredictable.

The solution in this case is to use Earnings Yield rather than PE as input. The information content is essentially the same, but the stability or predictability characteristics are very different. Of course this is well-known in the case of PE ratio, but something similar happens in many other fundamental ratios as well. So, in answer to your question about factors & predictability, the first step is to take a lot of care of which ratios we use. Sometimes the solution is just to invert the ratio, or sometimes to choose a near-equivalent that is better-behaved.
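A small numerical sketch of the PE versus Earnings Yield point, using numpy. The price and the EPS path are invented numbers, chosen so that EPS declines, goes negative, and then recovers; EPS is kept away from exactly zero to avoid a division by zero.

```python
import numpy as np

price = 50.0  # hypothetical share price, held constant for simplicity

# Hypothetical EPS path: declines, goes negative, then rises again.
eps = np.array([2.0, 1.0, 0.1, -0.1, -1.0, -0.1, 0.1, 1.0, 2.0])

pe = price / eps              # blows up and flips sign as EPS crosses zero
earnings_yield = eps / price  # stays bounded and moves in small steps

# Largest step between consecutive observations for each ratio.
max_pe_jump = np.abs(np.diff(pe)).max()
max_ey_jump = np.abs(np.diff(earnings_yield)).max()

print("PE:            ", pe)
print("Earnings yield:", earnings_yield)
print("largest PE step:", max_pe_jump)
print("largest EY step:", max_ey_jump)
```

Both series carry the same sign and the same ordering information within each sign, but PE swings between large positive and large negative values at the two zero crossings, while Earnings Yield passes through zero without any drama.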

Personally i always use zscore() rather than rank(), mainly just because i started that way, but also for two other reasons:

1) zscore carries more info about relative size differences than rank, so i like zscore better.

2) Adding zscores in our combined factor seems at least reasonably meaningful to me, but i'm not so sure about the meaning of adding rank values.

In terms of the end results, does it matter which method (zscore or rank) one chooses? Honestly i don't know as i haven't investigated it, but maybe it is a useful thing to check.
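One rough way to start such a check, sketched in plain pandas rather than Pipeline. The two factors and their values are invented, the `zscore` helper is my own stand-in (population standard deviation, no NaN handling), and the "agreement" number is just the correlation between the orderings that the two combined factors produce.

```python
import pandas as pd

def zscore(s):
    """Cross-sectional z-score; a stand-in helper, not the Pipeline method."""
    return (s - s.mean()) / s.std(ddof=0)

# Hypothetical raw factor values for five made-up stocks; "E" is an outlier in f.
names = list("ABCDE")
f = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], index=names)
g = pd.Series([5.0, 3.0, 4.0, 1.0, 2.0], index=names)

z = zscore(f)
r = f.rank()

# zscore keeps the size of the gaps; rank flattens every gap to one step.
print("zscore gaps:", z.diff().dropna().round(3).tolist())
print("rank gaps:  ", r.diff().dropna().tolist())

# Equal-weight combinations built each way.
z_combo = zscore(f) + zscore(g)
r_combo = f.rank() + g.rank()

# Correlation between the two resulting orderings (rank correlation).
agreement = z_combo.rank().corr(r_combo.rank())
print("ordering agreement:", round(agreement, 3))
```

On real data the same comparison could be repeated per date, to see how often the two methods would actually pick different stocks for the long and short ends.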