Has anyone examined the technique proposed by Lee et al.? I'm having a go (using free Quandl data) but I'm finding it difficult to follow. I can handle the ML aspects. But I'm not quite sure how they are packaging the data.

I think it's something like this:

For a given moment in time for a particular stock we can construct a (labelled) training item by using the previous 13 months' worth (and the subsequent 1 month's worth) of daily data for that stock.

We use this data to construct 12 monthly cumulative returns ending a month short of our moment. So I'm guessing we turn daily Adj_Close prices into daily returns, compound them, & spit out the value every 21 or so trading days (roughly one month). Now it gets interesting. They do the same thing for every other stock at this moment, and get a z-value for our stock over this set (i.e. # of standard deviations from the mean). So the movement of this z-value is showing the growth of this particular stock relative to the whole market. Since the algorithm is going to have a fixed amount of money invested in the market, and just shift it between stocks, this is what you want!
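Here is a rough sketch of how I read that step, assuming Python with NumPy. The function names (`monthly_cumrets`, `cross_sectional_z`), the 21-trading-days-per-month figure, and the random prices are all my own assumptions for illustration, not anything taken from the paper.

```python
import numpy as np

def monthly_cumrets(adj_close, n_months=12, days_per_month=21, skip_days=21):
    """Cumulative returns sampled once per (assumed 21-day) month.

    adj_close: daily adjusted closes for ONE stock, oldest first, ending at 'now'.
    The window stops skip_days before 'now' (i.e. a month short of our moment).
    """
    prices = np.asarray(adj_close, dtype=float)
    n_ret = n_months * days_per_month
    end = len(prices) - skip_days          # exclusive end of the window
    start = end - n_ret - 1                # one extra price as the base
    if start < 0:
        raise ValueError("not enough price history")
    window = prices[start:end]
    cum = window[1:] / window[0] - 1.0     # compounded return since window start
    return cum[days_per_month - 1::days_per_month]

def cross_sectional_z(features):
    """Z-score each feature column across stocks at a single moment.

    features: array of shape (n_stocks, n_features).
    """
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    return (x - x.mean(axis=0)) / np.where(std > 0, std, 1.0)

# Made-up prices for 5 stocks over 300 days, just to exercise the functions.
rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=(5, 300)), axis=1)
feats = np.array([monthly_cumrets(p) for p in prices])  # shape (5, 12)
z = cross_sectional_z(feats)                            # shape (5, 12)
```

Each row of `z` would then be the 12 monthly inputs for one stock at this moment.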

Looks like they do this for each of the 12 monthly cumrets.

And then they do the same process for the previous 30 days.
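For the daily part, a minimal sketch under the same kind of assumptions (Python with NumPy). `recent_daily_cumrets` is a name I made up, and the default of 21 trading days is my stand-in for "30 or so calendar days".

```python
import numpy as np

def recent_daily_cumrets(adj_close, n_days=21):
    """Daily cumulative returns over the most recent n_days for ONE stock.

    adj_close: daily adjusted closes, oldest first, ending at 'now'.
    Returns n_days values: the return from the start of the window to each day.
    """
    prices = np.asarray(adj_close, dtype=float)
    if len(prices) < n_days + 1:
        raise ValueError("need at least n_days + 1 prices")
    window = prices[-(n_days + 1):]        # one extra price as the base
    return window[1:] / window[0] - 1.0

# Made-up prices for 5 stocks; z-score each day's value across the stocks.
rng = np.random.default_rng(1)
prices = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=(5, 60)), axis=1)
daily = np.array([recent_daily_cumrets(p) for p in prices])  # shape (5, 21)
std = daily.std(axis=0)
daily_z = (daily - daily.mean(axis=0)) / np.where(std > 0, std, 1.0)
```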

That actually makes a lot of sense because you want to be feeding in data with mean 0, roughly in the (-1, +1) range, into your NN.

So that covers input data (there is one extra input that is a 'start of year' flag). But a complete supervised training item also requires an associated output value. It looks as though they are just using whether that particular stock went up in the ensuing month. Although I don't understand their language: they talk about 'above the median'. And median of what??? It seems a really weird way of doing it. Why not just look at whether the price one month later is higher or lower than the price at this particular moment & output 1 or 0 accordingly? I think that's what I will do as I don't understand what they're saying.
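To make the two labelling options concrete, here is a small sketch in Python with NumPy. One plausible reading of 'above the median' (my assumption, not confirmed) is the median of the next-month returns across all stocks at that moment; the function names and the numbers are hypothetical.

```python
import numpy as np

def label_above_median(next_month_returns):
    """1 if a stock's next-month return beats the cross-sectional median, else 0."""
    r = np.asarray(next_month_returns, dtype=float)
    return (r > np.median(r)).astype(int)

def label_up_down(next_month_returns):
    """1 if the price is higher one month later, else 0."""
    r = np.asarray(next_month_returns, dtype=float)
    return (r > 0).astype(int)

# Made-up next-month returns for four stocks in a month where everything rose.
rets = [0.05, 0.02, 0.10, 0.01]
by_median = label_above_median(rets)   # half the stocks get 1, half get 0
by_sign = label_up_down(rets)          # every stock gets 1
```

The median version keeps the two classes roughly balanced at every moment, whereas the up/down version can come out nearly all 1s or all 0s when the whole market moves together.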

Then I can only assume that everything is shunted forwards by a single day and the algorithm is repeated to generate another sample.
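A sketch of that sliding step in plain Python, listing which days can serve as the "moment" for a sample. `sample_anchors` and its parameters are hypothetical, and the 21-trading-days-per-month figure is an assumption.

```python
def sample_anchors(n_days, lookback=13 * 21, horizon=21, step=1):
    """Indices t that have `lookback` days of history before them and
    `horizon` days of future data after them (for the label).

    n_days: number of daily rows available for the stock.
    step:   how far to shift between samples (1 = shift by a single day).
    """
    return range(lookback, n_days - horizon, step)

# With 300 days of data only a handful of anchors fit.
anchors = list(sample_anchors(300))
first, last = anchors[0], anchors[-1]
```

With `step=1` consecutive samples share almost all of their underlying data, so they are far from independent; a larger `step` gives fewer but less overlapping samples.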

It seems strange to me that they don't make use of daily volume.

π