Training and Testing Sets

I watched the "live risk model tearsheet review" recently (wish I could have made the actual webcast when I saw my tearsheet was picked!), which got me thinking about the training/testing set.

Personally, I'd like to be able to test using the whole time frame if possible and build my training/testing sets by holding out some other form of data. I've heard of others using US data and then testing on foreign markets, but we don't have foreign data so we can't do that.

Instead of that, I'm trying to use the CIK number from pipeline to build a custom filter. I thought a simple and unbiased rule would be that even CIKs are my training set and odd CIKs are my testing set. Alternatively, you could take the CIK modulo any number, which gives you several groups, so that you can test multiple revisions of your hypothesis on fresh data without multiple-comparisons bias.
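The modulo rule described above can be sketched in plain Python, outside of Pipeline. This is only an illustration: the function names `cik_fold` and `split_by_fold` and the sample CIK strings are hypothetical, and it assumes each CIK is either a string of digits or missing (`None`).

```python
def cik_fold(cik, m=2):
    """Map a CIK string such as '0000000010' to a fold index in range(m)."""
    return int(cik) % m


def split_by_fold(ciks, m=2, training_fold=0):
    """Split CIK strings into (training, held_out), skipping missing values."""
    training, held_out = [], []
    for cik in ciks:
        if cik is None:
            continue  # no CIK available, so the asset lands in neither set
        if cik_fold(cik, m) == training_fold:
            training.append(cik)
        else:
            held_out.append(cik)
    return training, held_out


# Hypothetical CIK strings, for illustration only.
sample = ["0000000010", "0000000011", None, "0000000012", "0000000015"]
train, test = split_by_fold(sample, m=2)
```

With `m=2` this reproduces the even/odd split. A larger `m` leaves `m - 1` remainders that can each be kept aside as a separate held-out group.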

Two things. First, does this make sense? I couldn't find anything that says how CIKs are assigned, but the ones digit seems sufficiently random. Second, I'm having trouble constructing a filter from this. The CIK is provided as a string (I think), which is giving me trouble. Right now I have:

class trainingset(CustomFactor):  
    inputs = [Fundamentals.cik]  
    window_length=1  
    def compute(self, today, assets, out, cik):  
        cik = cik.astype(int)  
        out[:] = cik % 2 # can be any number  

and in my pipeline construction I have:
cik_filter = (trainingset() == 0) # training is 0, test is 1

but I keep getting this error:
ValueError: Bin edges must be unique: array([ nan, nan, nan, nan, nan, nan])

3 responses

Holding out individual assets is definitely an approach to OOS testing. One thing to consider is whether the covariance across assets will keep it from being a true OOS test. At the very least it will reduce the power of the test, since the effect you're seeing in one set of assets can easily emerge in another set whose pricing is highly covarying with the first.

That said, I generally encourage more types of OOS testing and this is one of a few ways you can start to get a good sense. Cross validation is also a pretty decent way to check for overfitting in-sample. I'm going to make sure you get an answer to your pipeline question.

Not sure if this helps, James. Here's an exploratory notebook on cik.

Yes. It is a string.

Pipeline does not like it if you try to convert the CIK into an integer within the pipeline definition (this might have something to do with my limited understanding of Pipeline). The error message states:

TypeError: int() argument must be a string or a number, not 'Latest'

TypeError                                 Traceback (most recent call last)  
<ipython-input-14-1e9b4345bf3b> in <module>()  
----> 1 my_pipe_2 = make_pipeline_2()  
      2 result_2 = run_pipeline(my_pipe_2, '2015-05-05', '2015-05-05')  
      3 result_2['cik'] = result_2['cik'].dropna().astype('int')  
      4 result_2.head()

<ipython-input-13-a31e0eaca6e4> in make_pipeline_2()  
      3     return Pipeline(  
      4         columns={  
----> 5             'cik': np.int(df),  
      6         },  
      7     )

TypeError: int() argument must be a string or a number, not 'Latest'  

I had to work outside of Pipeline to obtain the result you wanted - modulo.

@Delaney - I'll try to keep that in mind while testing this thing

@Anthony - Thanks! I had to use a similar workaround when I posted. I got this to work after posting:

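class trainingset(CustomFactor):  
    inputs = [Fundamentals.cik]  
    window_length = 1  
    def compute(self, today, assets, out, cik):  
        m = 2  # can be any number  
        # cik[0] is the single row of the window: one CIK string (or None) per asset  
        cik = [int(string) % m if string is not None else None for string in cik[0]]  
        # wrap the row back into a 2-D list so it matches the input's shape  
        cik = [cik]  
        out[:] = np.array(cik)  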
I'm sure it's not the most efficient method, but it gets the job done.
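The per-element logic of the factor above can be checked on its own with NumPy, without Pipeline. This is a minimal sketch under stated assumptions: `cik_remainders` is a hypothetical helper, the sample row is made up, and the input is assumed to be an object array of digit strings with `None` where the CIK is missing.

```python
import numpy as np


def cik_remainders(cik_row, m=2):
    """Return int(cik) % m per asset as floats, with NaN where the CIK is missing."""
    out = np.full(len(cik_row), np.nan)
    for i, value in enumerate(cik_row):
        if value is not None:
            out[i] = int(value) % m
    return out


# Hypothetical row of CIK strings, for illustration only.
row = np.array(["0000000010", None, "0000000013"], dtype=object)
remainders = cik_remainders(row)
is_training = remainders == 0  # NaN compares False, so missing CIKs are excluded
```

Because NaN is equal to neither 0 nor 1, assets without a CIK end up in neither the training set nor the testing set under this scheme.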