Can CustomFactor return a multi dimensional array?

The API docs refer to the out return of CustomFactor as:

out will be an empty array of length N.

But can that array contain only scalar values, or is there a way for it to return arrays of scalars?

I have a time-intensive CustomFactor from which I would like to return more than one value. It would be prohibitively expensive to create a variation of that CustomFactor for each value I'm computing.

10 responses

Hi @Pumplerod,

A Factor can have multiple outputs, but you can only pass a single value per security to each output. For an example, check out the implementation of the built-in RollingLinearRegression factor from the Zipline repository.
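As a rough illustration of that constraint, the sketch below uses plain NumPy rather than the Pipeline API to mimic an `out` buffer with two named outputs. The helper `make_out` and the output names `window_mean` and `window_range` are made up for this example; the only point is that each named output holds exactly one scalar per security.

```python
import numpy as np

def make_out(n_assets, output_names):
    """Hypothetical stand-in for a multi-output `out` buffer:
    one float per security for each named output."""
    dtype = [(name, np.float64) for name in output_names]
    out = np.recarray(n_assets, dtype=dtype)
    for name in output_names:
        out[name][:] = np.nan  # start with "no value" everywhere
    return out

# Made-up closing prices, shape (window_length, n_assets).
closes = np.array([[10.0, 20.0, 30.0],
                   [11.0, 19.0, 33.0],
                   [12.0, 18.0, 36.0]])

out = make_out(closes.shape[1], ["window_mean", "window_range"])
# Each output receives a 1-D array: one scalar per security.
out.window_mean[:] = closes.mean(axis=0)
out.window_range[:] = closes.max(axis=0) - closes.min(axis=0)
```

Several values per security therefore become several named outputs, each of length N, rather than one output holding arrays.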

I hope this helps.


Hi Ernesto -

I gathered that Thomas W. somehow got around this limitation here:

https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm

To train, doesn't one need a trailing window of values for each factor, across the universe of securities? Or am I missing something? Is there a trick?

Thanks,

Grant

The post "New Feature: Multiple Output Pipeline Custom Factors" shows how to return multiple outputs, and "New Pipeline Features: Slicing and Correlation/Regression Methods" shows how to use a factor as an input to other factors. You can use the latter to create a factor that performs a partial calculation and pass a trailing window of it as input to your other factor(s), which in turn compute the final values.

Thanks Luca -

On https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm#58cfaf06c4df8a3e3b1ba9f5, Thomas says:

"The problem is still that pipeline only returns the most recent row, rather than the full history of factors required to train the classifier."

So, has this changed? Or is the idea that one would not actually output the results of a custom factor to the algo (which is a single scalar per security), but do the alpha combination step within pipeline, where presumably the custom factors support vector (or higher dimensional?) values, point-in-time, for each security?

Sorry, this is probably a gap in my basic understanding of how pipeline is architected, and how it provides output to the algo.

Also, I'd note that Thomas W. hinted that using globals may be a way of getting multi-valued outputs out of pipeline. Is this correct?

"Or is the idea that one would not actually output the results of a custom factor to the algo (which is a single scalar per security), but do the alpha combination step within pipeline, where presumably the custom factors support vector (or higher dimensional?) values, point-in-time, for each security?"

Exactly! Your CustomFactor would take a list of other factors as input, combine them, and return one value per security.

class MyFactor(CustomFactor):
    def compute(self, today, assets, out, *inputs):
        # 'inputs' holds one array per input factor, each spanning 'window_length' rows
        # combine them and write the final value for each security into 'out'
        [...]


myfactor = MyFactor(inputs=[someFactor1, someFactor2, someFactor3], window_length=some_historical_days_of_data)
pipe.add(myfactor, 'myfactor')
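To make the "combine them" step concrete, here is a minimal NumPy-only sketch of what the body of such a compute method might do. It assumes each input arrives as an array of shape (window_length, n_assets); the function name `combine_windows` and the equal-weight average are placeholders for illustration, not a recommended combination rule.

```python
import numpy as np

def combine_windows(out, *inputs):
    """Write one combined value per security into `out`.

    Each input is assumed to have shape (window_length, n_assets).
    """
    stacked = np.stack(inputs)                    # (n_inputs, window_length, n_assets)
    per_input_mean = np.nanmean(stacked, axis=1)  # (n_inputs, n_assets)
    out[:] = per_input_mean.mean(axis=0)          # (n_assets,)

# Two made-up input windows covering 2 days and 2 securities.
factor_a = np.array([[1.0, 2.0], [3.0, 4.0]])
factor_b = np.array([[5.0, 6.0], [7.0, 8.0]])

out = np.empty(2)
combine_windows(out, factor_a, factor_b)
```

However many windows go in, `out` still ends up with a single scalar per security.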

MyFactor can also return multiple outputs, though.

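import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class MultipleOutputs(CustomFactor):
    # Define inputs and outputs.
    inputs = [USEquityPricing.close]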

    # Specify and name the different outputs.  
    outputs = ['highs', 'lows']  
    window_length = 10

    def compute(self, today, assets, out, close):  
        highs = np.nanmax(close, axis=0)  
        lows = np.nanmin(close, axis=0)  
        # Write the desired return values into `out.<output_name>` for each output name in `self.outputs`.  
        out.highs[:] = highs  
        out.lows[:] = lows

# Each output is returned as its own Factor upon instantiation.  
highs, lows = MultipleOutputs(mask=high_dollar_volume)

EDIT:

I believe Thomas W. suggested the global-variable approach as a way to store minutely data (by calling data.history inside handle_data); you can then read that global variable inside your CustomFactor. Otherwise you cannot use minutely data, because the inputs to a Pipeline factor have daily frequency.

So, to make the final example for @Pumplerod, you can create custom factors containing the partial computation you want to share between multiple factors:

class PartialComputationVeryCPUintensive1(CustomFactor):
    window_length = 30  # some_historical_days_of_data
    window_safe = True  # see the explanation below
    [...]

class PartialComputationVeryCPUintensive2(CustomFactor):
    window_length = 30  # some_historical_days_of_data
    window_safe = True  # see the explanation below
    [...]

Then you can create multiple factors that receive the same computation as input:

# 'inputs' must contain factor instances, so instantiate the partial computations
factor1 = MyFactor1(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=1)
factor2 = MyFactor2(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=1)
factor3 = MyFactor3(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=1)

pipe.add(factor1, 'factor1')  
pipe.add(factor2, 'factor2')  
pipe.add(factor3, 'factor3')  

Or you can create a factor that returns multiple outputs:

factor1, factor2, factor3 = MyFactor123(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=1)

pipe.add(factor1, 'factor1')  
pipe.add(factor2, 'factor2')  
pipe.add(factor3, 'factor3')  
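The motivation for both variants is the same: run the costly step once and derive several values from its result. The plain-Python sketch below shows that idea outside of Pipeline; the function names and the squared-price "expensive" step are invented for illustration only.

```python
call_count = {"expensive": 0}

def expensive_partial(prices):
    """Placeholder for the costly step; counts how often it runs."""
    call_count["expensive"] += 1
    return [p * p for p in prices]

def compute_all(prices):
    """Run the costly step once and derive three values from it."""
    partial = expensive_partial(prices)
    return {
        "factor1": sum(partial),
        "factor2": max(partial),
        "factor3": min(partial),
    }

results = compute_all([1.0, 2.0, 3.0])
```

Three values come back, but `call_count["expensive"]` stays at 1, which is the saving the original question was after.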

Also note that if your CPU-expensive computation can be done with window_length=1, you can then create MyFactor instances with different window lengths:

class PartialComputationVeryCPUintensive1(CustomFactor):
    window_length = 1  # a single day of data
    window_safe = True
    [...]

class PartialComputationVeryCPUintensive2(CustomFactor):
    window_length = 1  # a single day of data
    window_safe = True
    [...]


factor1 = MyFactor(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=10)
factor2 = MyFactor(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=4)
factor3 = MyFactor(inputs=[PartialComputationVeryCPUintensive1(), PartialComputationVeryCPUintensive2()], window_length=7)

pipe.add(factor1, 'factor1')  
pipe.add(factor2, 'factor2')  
pipe.add(factor3, 'factor3')  

EDIT:

added window_safe=True to PartialComputationVeryCPUintensive

Here is what Quantopian said about using factors as inputs to other factors:

"[...] is now allowed for a select few factors deemed safe for use as inputs. This includes Returns and any factors created from rank or zscore. The main reason that these factors can be used as inputs is that they are comparable across splits. Returns, rank and zscore produce normalized values, meaning that they can be meaningfully compared in any context."

So, if you are sure your factor is window-safe but you are not going to call zscore or rank on it, set window_safe = True in the class.
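One way to read the "comparable across splits" remark is with a toy example. The prices below are made up, and the 2-for-1 split adjustment is applied by hand; this is only a sketch of the idea, not of how Pipeline adjusts data internally.

```python
def pct_returns(prices):
    """Simple period-over-period returns."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

# The same three days of prices, seen before and after a 2-for-1 split.
raw_prices = [100.0, 102.0, 104.0]
split_adjusted_prices = [p / 2 for p in raw_prices]

prices_match = raw_prices == split_adjusted_prices
returns_match = pct_returns(raw_prices) == pct_returns(split_adjusted_prices)
```

Here `prices_match` is False while `returns_match` is True: the price level depends on when you look at it, but the returns do not, which is the property a window-safe factor needs.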

Thanks Luca -

I'll have to sort through this as time allows. The pipeline API just seems overwrought with complexity, but like anything, there's a learning curve, I suppose.

@ Luca, Ernesto -

By the way, I was under the impression that minute data (or anything else for that matter) couldn't be fed into pipeline. But maybe a hack using globals works?

Yes Grant, you cannot use minute data with pipeline, and I believe the only workaround is to save minute data in a global variable and access it from your CustomFactor(s). But then why use the data inside CustomFactors at all? If we need minute data, we can do the computation inside 'before_trading_start', since both the pipeline output and the history function are available there.
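A minimal sketch of the 'before_trading_start' idea, written in plain Python so it runs without the platform. The dictionaries stand in for the pipeline output and for minute bars fetched with the history function; the function name, the tickers, and the rule of adding the intraday return to the daily factor are all assumptions made for illustration.

```python
def combine_daily_and_minute(daily_factor, minute_closes):
    """Merge a daily factor value with a minute-bar statistic per asset.

    daily_factor:  {asset: value}        (stand-in for the pipeline output)
    minute_closes: {asset: [closes...]}  (stand-in for minute history)
    """
    combined = {}
    for asset, value in daily_factor.items():
        closes = minute_closes.get(asset)
        if not closes:
            continue  # no minute bars for this asset
        intraday_return = closes[-1] / closes[0] - 1.0
        combined[asset] = value + intraday_return
    return combined

daily = {"AAA": 0.5, "BBB": -0.25, "CCC": 1.0}
minutes = {"AAA": [10.0, 10.5, 12.5], "BBB": [8.0, 6.0]}
result = combine_daily_and_minute(daily, minutes)
```

Assets without minute bars ("CCC" here) are simply dropped; a real algorithm would need to decide how to treat them.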

Thanks Luca -

I'll have to give it a try. At this point I'm just curious about the mechanics of getting minute bar data into pipeline. The use case I have in mind is to perform the alpha combination step (which could include ML) all within pipeline. Then pipeline simply provides an input to the optimize API (and its embedded order management system).