
A stock's price is the result of the trading decisions of many individuals. But what if you could peek into the information-gathering process that precedes those decisions?

A recent paper uses Wikipedia page views to predict market changes.

You can try out how different Wikipedia pages affect performance by clicking "Clone Algorithm" and editing the specified Wikipedia page.

[Attached backtest — Clone Algorithm (223 clones); performance details not shown.]

This algorithm looks at page view counts of specific Wikipedia pages. The theory is that spikes in the number of page views can be used to predict a price change.

The hard part is collecting the data, but we have already done that for you. We extracted the page-view history of certain Wikipedia pages from http://stats.grok.se/. You can use the data by calling the function fetch_wikipedia(), which takes either the name of a single Wikipedia page or a list of Wikipedia pages. The average daily viewing history is then made available in handle_data().
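
In case it helps, the setup looks roughly like this (a minimal sketch, assuming SPY via sid(8554) as the S&P 500 proxy; the exact name under which the view counts appear in handle_data() depends on the fetcher, so check the cloned algorithm for details):

def initialize(context):
    # One page name or a list of page names both work.
    fetch_wikipedia('Opportunity cost')
    context.spy = sid(8554)  # SPY as the S&P 500 proxy

def handle_data(context, data):
    # handle_data now has access to the average daily view history of the
    # fetched page(s); see the cloned algorithm for the exact field name.
    pass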

For this algorithm I used the Wikipedia page 'Opportunity cost'. Once the weekly average of page views is smaller than the moving average over the past delta_t weeks (in this case delta_t == 5), we buy and hold the S&P 500 for one week. If the weekly average is higher than the moving average, we sell and re-buy the S&P 500 after one week.
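
Roughly, the trading rule boils down to the following (a sketch of the logic only, not the cloned algorithm verbatim; the order_target_percent calls and the history bookkeeping are stand-ins for what the real code does):

DELTA_T = 5  # moving-average window in weeks, as in the paper

def check_signal(context, weekly_views):
    # context.history holds the last DELTA_T weekly view averages.
    context.history.append(weekly_views)
    if len(context.history) < DELTA_T:
        return  # not enough history to form the moving average yet

    moving_average = sum(context.history) / float(len(context.history))

    if weekly_views < moving_average:
        # Fewer views than the recent average: buy and hold the S&P 500 for a week.
        order_target_percent(context.spy, 1.0)
    else:
        # More views than the recent average: sell and re-buy a week later.
        order_target_percent(context.spy, 0.0)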

Suggestions for improvement (please share improvements in this thread):

  • The authors used many different Wikipedia pages, listed here
  • Can you find a Wikipedia page that outperforms the one I found?
  • delta_t == 5 is what the authors of the paper used. It would be interesting to see how the algorithm performs when this is changed.
  • The underlying algorithm is a very basic moving-average crossover. A cleverer strategy might do a much better job.

If this idea is interesting to you, please also see the example using Google page views.

'Fear' performs pretty well

[Attached backtest — Clone Algorithm (31 clones); performance details not shown.]

Neat, but I keep getting timeouts for most search terms. Even "fear" timed out on me. I'm not sure what the solution is here, as the reliability seems low and it is probably dependent on Wikipedia's servers, right?

The data comes from http://stats.grok.se/; I also got quite a lot of timeouts when I played with it.

You should be able to use a deque instead of a list to hold the history, which would make your state tracking a lot simpler:

from collections import deque

# A deque with maxlen automatically discards the oldest entry once it is full,
# so there is no need to trim the history by hand.
c.history = deque(maxlen=delta_t)

c.history.append(weekly_views)
if len(c.history) < delta_t:
    return  # not enough weeks of history yet

@John: Neat, I was aware of deque but not the maxlen kwarg. Thanks! Do you want to post an updated, simplified version?

By the way, we are also increasing the time-out for fetching, so hopefully it will work more reliably soon!

@Thomas: If I did this properly, the backtest below should contain the simpler approach using deques.

[Attached backtest — Clone Algorithm (11 clones); performance details not shown.]

@John: Yep, looks correct. Thanks! You might want to use delta_t instead of 5, though, to make it parameterizable (or remove delta_t entirely).

Agreed. I do use delta_t as the parameter for context.history, but context.weekly_history is hard-coded to 5 (the number of weekdays in a week).
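
To make that concrete, the bookkeeping is roughly the following (a sketch, not the cloned code verbatim):

from collections import deque

def initialize(context):
    context.delta_t = 5  # weeks in the moving-average window
    # Last 5 daily view counts (one trading week) -> one weekly average.
    context.weekly_history = deque(maxlen=5)
    # Last delta_t weekly averages -> the moving average.
    context.history = deque(maxlen=context.delta_t)

def record_views(context, daily_views):
    context.weekly_history.append(daily_views)
    if len(context.weekly_history) == context.weekly_history.maxlen:
        context.history.append(sum(context.weekly_history) / 5.0)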


Sorry about the performance problems, and thank you all for bringing them up. Some of this is due to limitations we knew about, and some of it is a bug. I'll start with a bit more info about what's happening under the covers. For hundreds of Wikipedia pages, particularly the ones in the paper that Thomas refers to, we maintain a local cache of the data. If a page is in the local cache, there shouldn't be a data problem.

For the ones that aren't in the local cache, we query http://stats.grok.se/. The data is organized by month. So we issue serial requests to get each month. Each request takes 2-5 seconds, and we give one minute for the data to load. So there is a limit on the number of "page-months" that can be loaded if the pages aren't in our cache.
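
To make the mechanics concrete, the loader essentially does something like the following (a simplified sketch, not our production code; the stats.grok.se JSON URL pattern is my recollection of that site's endpoint):

import time
import requests

def fetch_page_months(page, months, per_request_timeout=5, overall_budget=60):
    # Illustrative only: pull monthly view counts serially from stats.grok.se.
    start = time.time()
    results = {}
    for month in months:  # e.g. ['201301', '201302', ...]
        if time.time() - start > overall_budget:
            break  # overall one-minute budget exhausted
        url = 'http://stats.grok.se/json/en/%s/%s' % (month, page)
        resp = requests.get(url, timeout=per_request_timeout)
        results[month] = resp.json()
    return results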

We found a couple problems:

  1. We have a bug where we're being case-sensitive where we shouldn't be. For instance, you'll see very different behavior if you look at 'debt' vs. 'Debt'. We'll fix that.
  2. We're finding that individual page calls sometimes take longer than 5 seconds, so we're bumping that timeout to 10 seconds. The overall limit of 1 minute is unchanged, though.

Other things to make this better:

  1. If you have other pages you'd like us to cache, let me know at [email protected]. We're happy to cache more, but we aren't in a place where we can cache them all!
  2. We find that if we query the website a second time, it responds faster. So for non-cached pages, a second attempt may succeed where the first one failed.

Quick edit: overall limit is 1 minute, not 2.

I tried to live-trade Wiecki's code and received this error:

FunctionCalledOutsideOfInitialize:
File test_algorithm_sycheck.py:10, in initialize
File algoproxy.py:1332, in fetch_wikipedia

Anyway, I found impressive results using these words for context.article:

  • positive
  • confident
  • defeated
  • cautious

Source: BlackRock

[Attached backtest — Clone Algorithm (19 clones); performance details not shown.]

@Ethan: I was able to reproduce the problem. We'll fix it, thanks for reporting!
