Cutting down on computation time: How to get multiple variables from a single custom factor

I have created a custom factor to detect the presence of a long-term trend. In the pipeline, the custom factor builds a dataframe of historic data per asset and passes it to a function that tries to fit a trend and returns the start date of that trend. By doing so, I avoid having to make lots of database calls.

However, I currently get only one variable per asset as output, whereas the same calculation could easily produce many more.

import numpy as np
import pandas as pd
# CustomFactor and USEquityPricing come from the platform's pipeline imports (not shown in the post)

class weeklyConfirmations(CustomFactor):
    inputs = [USEquityPricing.low, USEquityPricing.high, USEquityPricing.volume]
    window_safe = True

    # resamples daily data to weekly
    def compute(self, today, assets, out, low, high, volume):
        # One long-format column per input, indexed by (day, sid)
        low_df = pd.DataFrame(low, columns=assets).stack()
        high_df = pd.DataFrame(high, columns=assets).stack()
        volume_df = pd.DataFrame(volume, columns=assets).stack()
        df = pd.concat([low_df, high_df, volume_df], axis=1)
        df.columns = ['low', 'high', 'volume']
        df.index = df.index.rename(['day', 'sid'])

        # Since this historic data does not have a timeseries index, we cannot use the resample
        # function to produce weekly data. We use the following instead:
        n = 5
        df2 = df.reset_index()
        df2.sort_values(by=['sid', 'day'], inplace=True)
        # Group every n consecutive rows per sid into one weekly bar
        df3 = df2.groupby([np.arange(len(df2))//n, 'sid']).agg({'day': max, 'low': min, 'high': max, 'volume': sum})
        df4 = df3.reset_index()
        df4.sort_values(by=['day', 'sid'], inplace=True)
        df4.set_index(['day', 'sid'], inplace=True)
        df4.drop('level_0', axis=1, inplace=True)
        # print('with the new index it looks like \n', df4.tail())

        def my_df_function(sid_df):
            my_result = longTermConfirmed(sid_df)
            return my_result

        # Rather than looping over each security it's much faster to group by security and apply a function



The actual logic resides in the function that is called as longTermConfirmed(sid_df) in the CustomFactor above:

from scipy.signal import argrelextrema

def longTermConfirmed(stock_hist):
    trendStartDate = np.nan
    my_df = pd.DataFrame(columns=['Start date', 'Confirmations', 'Squared distance', 'Porosity', 'Porosity at latest conf date'])
    length = len(stock_hist)
    n = 2
    # Indices of local minima in the weekly lows: candidate trend start points
    launchpoints = list(argrelextrema(stock_hist['low'].values, np.less, axis=0, order=n))
    newlist = launchpoints[0].tolist()

    # We consider only trends that are at least 2 months old, hence subtracting 8 weeks
    newlist2 = [i for i in newlist if i < length - 8]
    for i in newlist2:
        dft = stock_hist.copy().iloc[i:]
        trend = calcTrend(dft)
        result = confirmTrend(trend, 'S', .002, .005, 1)

        if not result.empty:
            datelist = result.index
            porosityAtLatestConfDate = result.iloc[-1]['porosity']
            result['dist_squared'] = result['distance']**2
            confCount = len(result.index)
            latestConfirmationDate = datelist[-1][0]
            latestRelOBV = result.iloc[-1]['OBV_rel']
            squared = result['dist_squared'].sum()
            new_data = {'Start date': i,
                        'Confirmations': confCount,
                        'Squared distance': squared,
                        # 'Porosity': porosity,
                        'Porosity at latest conf date': porosityAtLatestConfDate,
                        'Latest Confirmation Date': latestConfirmationDate,
                        'Rel OBV': latestRelOBV}
            my_df = my_df.append(new_data, ignore_index=True)

    if not my_df.empty:
        my_df = my_df.sort_values(["Confirmations", "Squared distance"], ascending=(False, True))
        mask = (my_df['Confirmations'] >= 2)  # require at least two confirmations to be considered
        # Assumed: df2 is the filtered frame; this assignment is missing from the posted code
        df2 = my_df[mask]
        if not df2.empty:
            # sort_values returns a new frame, so keep the result
            df2 = df2.sort_values(['Latest Confirmation Date', 'Rel OBV'], ascending=[False, False])
            if df2.iloc[0]['Latest Confirmation Date'] >= length * 5 - 1:
                trendStartDate = length * 5 - df2.iloc[0]['Start date'] * 5

    return trendStartDate


This code is currently set up to return trendStartDate, the date at which a particular uptrend started.

That same longTermConfirmed() function could output additional valuable information (e.g. the number of times the trend has been confirmed) that I intend to use in my algo. As there is some computationally heavy work going on, I obviously want to avoid running close variants of that same function if I could do it all in one pass.
However, I am drawing a complete blank on how to pass multiple variables per asset back from longTermConfirmed() to the CustomFactor.

Does anyone know how I could solve this?

4 responses

Good call to note that running a function multiple times to get multiple outputs often isn't the most efficient approach. It is typically much faster to execute the function just once. But how does this work? There are really two questions here: first, how to return multiple values from the pandas apply method, and second, how to return multiple outputs from a factor.

The apply method can be used to iterate through the assets and pass a column of input data to a user-defined function. That function should typically return a single value. However, Python (and pandas) is pretty flexible about what a 'single value' is. While normally this would be a single scalar value, it could also be a tuple or a list of scalar values. In the latter case the 'single value' is simply a single tuple or list.
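To make the 'single value' point concrete, here is a minimal sketch in plain Python. The function summarize_trend and its two return values are hypothetical stand-ins, not the code from the question. It only shows that `return a, b` hands back one tuple, and that returning NaN placeholders when nothing is found keeps every asset's result the same length.

```python
import math

def summarize_trend(weekly_lows):
    """Hypothetical stand-in: return (weeks_since_low, low_value) as one tuple."""
    if not weekly_lows:
        # No data: still return two values so every result has the same shape
        return math.nan, math.nan
    lowest = min(weekly_lows)
    weeks_since_low = len(weekly_lows) - 1 - weekly_lows.index(lowest)
    return weeks_since_low, lowest

result = summarize_trend([5.0, 4.0, 6.0, 7.0])   # a single tuple object
weeks_since_low, lowest = result                 # unpack it into two names
empty_result = summarize_trend([])               # two NaN placeholders
```

Keeping the number of returned values fixed matters later on, because each position in the tuple ends up in its own output.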

So, for the user-defined function, simply return multiple values. Something like this:

        def my_function(column):
            """
            Return three separate values for each asset column.
            Here we return the mean, min, and max price for each asset.
            """
            mean_price = column.mean()
            min_price = column.min()
            max_price = column.max()
            return mean_price, min_price, max_price



That's pretty straightforward. However, what to do with the results, and how to reference them, requires a bit of Python magic. Two Python features, the * unpacking operator and the built-in zip function, can be used to first 'unpack' the results and then 'zip' them back into separate sequences. I must admit I don't use these a lot and find myself googling 'zip unpack python' to refresh my memory of how they work. I won't go into those details here, but here is how to get the three sequences of values when applying the function above.

        mean_prices, min_prices, max_prices = zip(*close_prices_df.apply(my_function))



As I said, Python magic. The result is three sequences holding the mean, min, and max values for each asset. The applied function could return any values and any number of values. I just used these three as an example.
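For anyone who wants to see the unpack-and-zip step in isolation, here is a small sketch that uses plain Python lists instead of a dataframe. The asset names and prices are made up, and the list of tuples is built with a comprehension rather than apply so the example runs without pandas.

```python
from statistics import mean

def my_function(column):
    """Return the mean, min, and max of one asset's prices as a tuple."""
    return mean(column), min(column), max(column)

# Hypothetical closing prices, one list per asset
close_prices = {
    "AAA": [10.0, 11.0, 12.0],
    "BBB": [20.0, 18.0, 22.0],
}

# One (mean, min, max) tuple per asset, in asset order
per_asset = [my_function(prices) for prices in close_prices.values()]

# The * operator passes each tuple to zip as a separate argument;
# zip then groups the first items together, the second items together, and so on
mean_prices, min_prices, max_prices = zip(*per_asset)
```

Each result is a tuple with one entry per asset, in the same order as the assets were processed, which is what makes it safe to assign them to per-asset outputs afterwards.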

The final step is to assign these values to separate factor outputs. First, the outputs must be named, and then values are assigned to each one just as one would do for a basic custom factor with a single output. Something like this:

    # define the factor's outputs (a class attribute, next to inputs)
    outputs = ['mean_value', 'min_value', 'max_value']

    # then set the values (inside compute)
    out.mean_value[:] = mean_prices
    out.min_value[:] = min_prices
    out.max_value[:] = max_prices
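
The attribute-style assignment can be tried outside the platform. The sketch below assumes that, for a factor with named outputs, `out` behaves like a NumPy record array with one field per output name. The array here is built by hand as a stand-in for what compute() would receive, and the values are made up.

```python
import numpy as np

outputs = ['mean_value', 'min_value', 'max_value']
n_assets = 2

# Stand-in for the `out` argument of compute(): one float field per output name
out = np.zeros(n_assets, dtype=[(name, 'f8') for name in outputs]).view(np.recarray)

# Made-up per-asset results, in asset order
mean_prices = (11.0, 20.0)
min_prices = (10.0, 18.0)
max_prices = (12.0, 22.0)

# Each field is a view into `out`, so slice assignment fills it in place
out.mean_value[:] = mean_prices
out.min_value[:] = min_prices
out.max_value[:] = max_prices
```

The same pattern would extend to the question's case by naming one output per value returned for each asset, for example a start date and a confirmation count.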



Hope that's a start on adding multiple outputs. See the attached notebook for an example. Good luck.
