@Dan. Good call noticing that `(a * b).sum()` is just `a.dot(b)`. I ended up simplifying the `vectorized_beta` implementation in the post a bit in the interest of clarity, though I actually didn't think about just using `.dot`! In the real implementation in zipline, we're not using `dot` because there's no nan-aware version of it in numpy (at least not one that I'm aware of).
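To make the `dot` point concrete, here is a small sketch (not the zipline code). The two spellings agree on clean data, a single nan poisons `dot`, and `np.nansum` over the elementwise product is one possible nan-aware stand-in. The `nan_aware_dot` helper name is made up for this example.

```python
import numpy as np

def nan_aware_dot(a, b):
    # Hypothetical helper: skip any position where either input is nan.
    # Unlike a.dot(b), this allocates a temporary for a * b.
    return np.nansum(a * b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# On clean data the two spellings agree.
clean_sum = (a * b).sum()
clean_dot = a.dot(b)

# One missing value makes the plain dot product nan...
b_missing = np.array([4.0, np.nan, 6.0])
poisoned = a.dot(b_missing)
# ...while the nan-aware version just drops that term.
skipped = nan_aware_dot(a, b_missing)
```

The temporary created by `a * b` is the price of skipping nans this way, which is part of why the nan-aware path is slower.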

For what it's worth, I think this is the fastest pure vectorized beta implementation I can muster without resorting to Cython or C:

```
import numpy as np

def fastest_vectorized_beta_I_can_muster(spy, assets):
    # Allocate one len(assets) array and fill it initially with the
    # column-means of assets. We'll re-use this buffer several times in the
    # course of this function.
    buf = assets.mean(axis=0)
    # Subtract the means from each asset in-place.
    # Note: This mutates the input in place, so don't do this if the caller
    # expects to use `assets` again!
    np.subtract(assets, buf, out=assets)
    # Overwrite the output of the "covariance" dot product into `buf`.
    # `spy` doesn't need demeaning here: each column of `assets` now sums
    # to zero, so the spy.mean() term would contribute nothing.
    spy.dot(assets, out=buf)
    # The variance term does need demeaned benchmark returns. This is a
    # small len(spy) allocation, not a len(spy) x len(assets) one.
    spy_residuals = spy - spy.mean()
    # Overwrite the output of the division into `buf` again.
    np.divide(buf, spy_residuals.dot(spy_residuals), out=buf)
    # buf now holds our expected output.
    return buf
```

On my laptop for a 504-day lookback with 1000 assets, this is about 3 times faster than the `vectorized_beta` function in the notebook:

```
In [82]: spy.shape
Out[82]: (504,)
In [83]: assets.shape
Out[83]: (504, 1000)
In [84]: spy2d = spy[:, np.newaxis]; spy2d.shape
Out[84]: (504, 1)
In [85]: def vectorized_beta(spy, assets): # This is the version from the notebook.
    ...:     asset_residuals = assets - assets.mean(axis=0)
    ...:     spy_residuals = spy - spy.mean()
    ...:
    ...:     covariances = (asset_residuals * spy_residuals).sum(axis=0)
    ...:     spy_variance = (spy_residuals ** 2).sum()
    ...:     return covariances / spy_variance
    ...:
In [86]: %timeit -n500 vectorized_beta(spy2d, assets)
500 loops, best of 3: 2.81 ms per loop
In [87]: def fastest_vectorized_beta_I_can_muster(spy, assets):
    ...:     buf = assets.mean(axis=0)
    ...:     np.subtract(assets, buf, out=assets)
    ...:     spy.dot(assets, out=buf)
    ...:     np.divide(buf, spy.dot(spy), out=buf)
    ...:     return buf
    ...:
In [88]: %timeit -n500 fastest_vectorized_beta_I_can_muster(spy, assets)
500 loops, best of 3: 936 µs per loop
```
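A quick way to confirm that a buffer-reusing version computes the same betas as the notebook version is to run both on synthetic data. This is only a sketch: `fast_beta` restates the function above with the benchmark demeaned for the variance term (dividing by `spy.dot(spy)` matches the notebook only when `spy` already has zero mean), and the random returns are made up.

```python
import numpy as np

def vectorized_beta(spy, assets):
    # Notebook version; expects spy as a (days, 1) column.
    asset_residuals = assets - assets.mean(axis=0)
    spy_residuals = spy - spy.mean()
    covariances = (asset_residuals * spy_residuals).sum(axis=0)
    spy_variance = (spy_residuals ** 2).sum()
    return covariances / spy_variance

def fast_beta(spy, assets):
    # Buffer-reusing version; mutates `assets` in place.
    buf = assets.mean(axis=0)
    np.subtract(assets, buf, out=assets)
    spy.dot(assets, out=buf)
    spy_residuals = spy - spy.mean()
    np.divide(buf, spy_residuals.dot(spy_residuals), out=buf)
    return buf

# Synthetic daily returns with a non-zero mean (illustrative values only).
rng = np.random.default_rng(0)
spy = rng.normal(0.0005, 0.01, size=504)
assets = rng.normal(0.0005, 0.02, size=(504, 1000))

expected = vectorized_beta(spy[:, np.newaxis], assets)
# Pass a copy, since fast_beta overwrites its input.
actual = fast_beta(spy, assets.copy())
agree = np.allclose(expected, actual)
```

Passing `assets.copy()` matters here: without it, the in-place subtraction would leave `assets` demeaned for any later use.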

I'm not sure whether `fastest_vectorized_beta_I_can_muster` gets most of its speedup just from avoiding the cost of extra allocations, or if it's because we're getting better cache locality. I suspect it's a mix of both.

One thing I'd like to do in a future update would be to add a fast path to `SimpleBeta` that uses something like the above if you pass `allowed_missing_percentage=0`. The current implementation pays a steeper performance cost than I'd like in exchange for handling missing data robustly. We can probably claw some of that back by moving the implementation to Cython (which would let us remove at least one large allocation), but ultimately handling nans correctly requires a branch per array element, which is a real cost at this level of abstraction.
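To illustrate where that cost comes from, here is a rough nan-aware variant. It is a sketch, not the `SimpleBeta` code: `nan_aware_beta` and its `allowed_missing` count are names made up for this example, and it assumes `spy` itself contains no nans. Every step needs either a mask or a `nan*` reduction, and each of those touches the full days-by-assets array.

```python
import numpy as np

def nan_aware_beta(spy, assets, allowed_missing):
    # Hypothetical sketch. spy: (days,) with no nans; assets: (days, n).
    # allowed_missing: max number of nan rows tolerated per asset.
    missing = np.isnan(assets)  # large boolean allocation
    # Each asset gets its own copy of spy, blanked where that asset is nan.
    spy_masked = np.where(missing, np.nan, spy[:, np.newaxis])
    asset_resid = assets - np.nanmean(assets, axis=0)
    spy_resid = spy_masked - np.nanmean(spy_masked, axis=0)
    cov = np.nansum(asset_resid * spy_resid, axis=0)
    var = np.nansum(spy_resid ** 2, axis=0)
    beta = cov / var
    # Assets with too much missing data get nan instead of a noisy estimate.
    beta[missing.sum(axis=0) > allowed_missing] = np.nan
    return beta

# Made-up data: one clean asset, one with a few gaps, one with many.
rng = np.random.default_rng(0)
spy = rng.normal(0.0, 0.01, size=504)
assets = rng.normal(0.0, 0.02, size=(504, 3))
assets[:10, 1] = np.nan    # 10 missing days: tolerated
assets[:200, 2] = np.nan   # 200 missing days: over the limit
betas = nan_aware_beta(spy, assets, allowed_missing=126)
```

Compared with the fast path, this version allocates several full-size temporaries and checks every element for nan, which is the overhead a no-missing-data fast path would skip.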