
I was reading an article talking about David Einhorn's holdings via Greenlight Capital. This quote was pretty interesting:

Our calculations show that imitating David Einhorn’s 13F filings would
have yielded a monthly alpha of 47 basis points between 1999 and 2011.
The monthly alpha declines to 25 basis points between 2008 and 2011
...

They continue to talk about how if you mirror just his small cap holdings, you'd do even better.

All the 13F reports are on the SEC's site (here is Greenlight's latest). Does anyone know of a good data source that converts these into a structured format? If not, I'll try to write one.
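In the meantime, here's a minimal sketch of what "converting to a structured format" might look like. The XML below is a simplified illustration of a 13F information table; real EDGAR filings wrap each holding in namespaced `<infoTable>` elements, so the tag lookups would need a namespace prefix against an actual filing.

```python
# A minimal sketch of turning a 13F information table into structured rows.
# SAMPLE_13F is an illustrative fragment, not a real filing.
import xml.etree.ElementTree as ET

SAMPLE_13F = """
<informationTable>
  <infoTable>
    <nameOfIssuer>APPLE INC</nameOfIssuer>
    <cusip>037833100</cusip>
    <value>152000</value>
    <shrsOrPrnAmt><sshPrnamt>240000</sshPrnamt></shrsOrPrnAmt>
  </infoTable>
  <infoTable>
    <nameOfIssuer>GENERAL MOTORS CO</nameOfIssuer>
    <cusip>37045V100</cusip>
    <value>98000</value>
    <shrsOrPrnAmt><sshPrnamt>310000</sshPrnamt></shrsOrPrnAmt>
  </infoTable>
</informationTable>
"""

def parse_13f(xml_text):
    """Return a list of dicts, one per reported holding."""
    root = ET.fromstring(xml_text)
    holdings = []
    for entry in root.findall("infoTable"):
        holdings.append({
            "issuer": entry.findtext("nameOfIssuer"),
            "cusip": entry.findtext("cusip"),
            "value_kusd": int(entry.findtext("value")),
            "shares": int(entry.findtext("shrsOrPrnAmt/sshPrnamt")),
        })
    return holdings

holdings = parse_13f(SAMPLE_13F)
```

From there the holdings could be diffed quarter over quarter to infer what a manager bought and sold.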

Would be interesting to play around with this and see how well it works to draft off big investors (with the caveat, of course, that most of this alpha might evaporate, because 13Fs can come out significantly after the actual purchases were made).

I had been wondering something similar (about being able to use data from websites).

There's some discussion on extracting text from HTML in this Stack Overflow question - http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python - so it certainly seems possible.
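As a rough sketch of the idea in that Stack Overflow thread, text can be pulled out of a page with nothing but the standard library's `html.parser` (no BeautifulSoup required):

```python
# Minimal HTML-to-text extraction using only the standard library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style and non-empty.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

text = extract_text("<html><head><style>p{}</style></head>"
                    "<body><p>Markets rallied today.</p></body></html>")
```

A real scraper would need to cope with malformed markup and encodings, but this is the core of it.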

What I would be interested in using it for is monitoring the front pages of news websites and performing sentiment analysis on the language they were using on each given day, to see whether a greater occurrence of certain words could be used to predict changes in market direction (perhaps similar to the Twitter sentiment analysis done by the guys over at Derwent Capital Markets, who started a Twitter-based hedge fund - at a profit).
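The simplest version of that word-occurrence idea might look something like this. The word list here is purely illustrative - choosing and validating a real one is the hard part:

```python
# A toy "bearish word frequency" score for a day's front-page text.
# BEARISH_WORDS is a hand-picked illustrative list, not a tested signal.
import re
from collections import Counter

BEARISH_WORDS = {"crisis", "panic", "default", "crash", "recession"}

def bearish_score(front_page_text):
    """Fraction of words on the page that appear in the bearish list."""
    words = re.findall(r"[a-z']+", front_page_text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    hits = sum(counts[w] for w in BEARISH_WORDS)
    return hits / len(words)

score = bearish_score("Panic selling as crash fears grow")
```

Tracked daily per site, a series like this could then be tested for any lead/lag relationship with market moves.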

With current developments in the use of Big Data, I expect some of the better hedge funds in the future will have algorithms that use data from all over the web (perhaps even algorithms with their own web spiders that crawl the web performing worldwide sentiment analysis from every blog, newspaper or social media feed they can find).

Or even, imagine the scenario of an algorithm that watched the social media profiles and blogs of the world's top executives, waiting for them to accidentally let something slip ;)

Adam, totally agreed. A very important part of our product map this year is to let users plug in custom data sources in backtests, and then obviously in live trading too. Any time series data should be usable as a data source.

I imagine you'd define the data source in the algorithm's initialize method, and then our backtesting engine would pull events, sort them, and put them in the right place in the incoming event stream in Zipline. It would be nice if we either let you create a generator for a data source, or, even better, did the heavy lifting of creating generators so that you just had to define how events were emitted. How does that sound to you?
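The generator idea can be sketched with the standard library: a custom source yields timestamped events, and a sorted merge by timestamp slots them into the main stream. The event shapes below are stand-ins, not the actual Zipline internals:

```python
# Sketch: merge a custom data source into an already-sorted event stream.
# Event tuples and payloads are illustrative, not Zipline's real format.
import heapq
from datetime import datetime

def price_events():
    # Stand-in for the simulated market feed (already sorted by time).
    yield (datetime(2013, 1, 2, 9, 30), "trade", {"sid": 24, "price": 520.0})
    yield (datetime(2013, 1, 2, 16, 0), "trade", {"sid": 24, "price": 524.5})

def custom_source():
    # Stand-in for a user-defined source, e.g. scraped web data.
    yield (datetime(2013, 1, 2, 12, 0), "web", {"signal": "buy"})

# heapq.merge lazily merges sorted iterators - the engine never has to
# materialise either stream in full.
merged = list(heapq.merge(price_events(), custom_source(),
                          key=lambda event: event[0]))
```

Because `heapq.merge` is lazy, an engine built this way could interleave arbitrarily many user-defined sources without buffering them.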

Anyway, idle musings for now :). Looking forward to building this functionality soon.

I'm going to set out how I imagine this could work below :P

You have a website (let's call it www.example.com).

You want to extract some data from that website, so you initialise it in your algorithm using:

context.example_website = url("http://www.example.com")  

What that does is fetch the source code of that url, which might look like below (a very basic illustrative example):

<html>

<head>  
    <title>Example Website</title>  
    <meta name="description" content="A description of the website."/>  
    <link href="css/style.css" rel="stylesheet" type="text/css" />  
</head>

<body>  
    <div id="header"><h1 class="logo">Example.com</h1></div>

    <div id="horizontal_navigation"><ul><li><a href="#">Home</a></li><li><a href="contact.html">Contact</a></li></ul></div>

    <div id="content">  
        <h2>Content Title.</h2>  
        <p>Some random article text saying something possibly of interest. Oh and here comes a table!</p>  
        <table>  
            <tr>  
                <th>Header 1. | Stock 1.</th>  
                <th>Header 2. | Stock 2.</th>  
                <th>Header 3. | Stock 3.</th>  
                <th>Header 4. | Stock 4.</th>  
            </tr>  
            <tr>  
                <td>row 1, cell 1. | Sell.</td>  
                <td>row 1, cell 2. | Sell.</td>  
                <td>row 1, cell 3. | Sell.</td>  
                <td>row 1, cell 4. | Sell.</td>  
            </tr>  
            <tr>  
                <td>row 2, cell 1. | Sell.</td>  
                <td>row 2, cell 2. | Sell.</td>  
                <td>row 2, cell 3. | Buy.</td>  
                <td>row 2, cell 4. | Sell.</td>  
            </tr>  
        </table>  
    </div>

    <div id="footer">Copyright © 2013 - Example.com.</div>  
</body>

</html>  

Now, you will want to define only the data from this page that you are interested in.

The code will do this by analyzing the above HTML code and creating lists of the data in the body of the website code, separated by both their tag type (such as a list of all strings contained between `<p>` tags) and their containing div (so only `<p>` tags contained within the div of id = "content").

** Note Made While Editing: Some things might need to be stored a bit more creatively (such as tables and such)... I'm too tired atm to work it out ;) **
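The "only tags inside a given div" part is doable today with the standard library's `html.parser` - here's a sketch that collects `<p>` text only while the parser is inside `<div id="content">` (class and variable names are my own invention):

```python
# Sketch: collect <p> text only inside a target div, tracking div nesting.
from html.parser import HTMLParser

class ContentDivParser(HTMLParser):
    def __init__(self, div_id="content"):
        super().__init__()
        self.div_id = div_id
        self.depth = 0        # nesting depth inside the target div (0 = outside)
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1                  # a div nested inside the target
            elif dict(attrs).get("id") == self.div_id:
                self.depth = 1                   # entered the target div
        elif tag == "p" and self.depth:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

page = ('<div id="header"><p>ignored</p></div>'
        '<div id="content"><p>Keep this.</p></div>')
parser = ContentDivParser()
parser.feed(page)
```

The same depth-tracking trick extends to tables: note when you enter a `<table>` inside the target div and collect `<td>` cells row by row.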

So you are now going to call data from the table in the content div.

    def handle_data(context, data):  
        table_1 = data[context.example_website].body.div_id("content").table[0]  

You will now also want to specify variables for reaching the data in that table.

        table_1_row = data[context.example_website].body.div_id("content").table[0].table_row  
        table_1_cell = data[context.example_website].body.div_id("content").table[0].table_cell  

So now let's make an algorithm that checks each row of Stock 3's column for the word "Buy" (or "buy").

    def initialize(context):  
        context.example_website = url("http://www.example.com")  
        context.column_to_check = 3  
        context.rows_to_check = [1, 2]  

    def handle_data(context, data):  
        table_1 = data[context.example_website].body.div_id("content").table[0]  
        buy_signal = False  
        for row in context.rows_to_check:  
            cell_content = table_1.cell(row, context.column_to_check)  # content of the selected cell  
            if cell_content.lower() == "buy":  
                buy_signal = True  
        if buy_signal:  
            order(sid(24), +1)  

Ok, so my code's getting sloppier near the end (long post and I'm new to Python, lol), but I think the gist of what I'm suggesting is just about clear (and if not, I'm sorry for subjecting you to a long post)... N.B. I think tables will be one of the harder parts of this idea to find a workable solution for ;)
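For what it's worth, once the table has been parsed into rows (something the platform would presumably do for you), the checking logic itself is tiny. The function name and table layout below are my own stand-ins:

```python
# Runnable take on the row/column-checking pseudocode above.
# `table` is assumed to be pre-parsed into a list of rows of cell strings.
def check_table_for_buy(table, column, rows):
    """Return True if any of the given rows says 'buy' in `column` (1-based)."""
    for r in rows:
        cell = table[r - 1][column - 1]
        if cell.strip().lower() == "buy":
            return True
    return False

# The example table from the HTML above, reduced to its signal cells.
table = [
    ["Sell", "Sell", "Sell", "Sell"],   # row 1
    ["Sell", "Sell", "Buy",  "Sell"],   # row 2
]

if check_table_for_buy(table, column=3, rows=[1, 2]):
    pass  # this is where order(sid(24), +1) would go
```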

Hmm, it appears some words that I had used in the above to illustrate possible tags accidentally worked as html in the post... :P

Hello Adam,

How did you manage to embed the cats-with-lightsabers image into your post? Was it the below HTML code:

<img src="http://mememachine.viralvideochart.com/storage/jedi%20kittens.jpg" width="275px" height="151px">  

However, while it could be potentially useful to use images in posts, it could also be a security risk if people are able to embed any HTML they want into a post... Quantopian may want to take a look at that.

Hello Jean and Adam,

This might be of interest:

http://www.kdnuggets.com/datasets/

I poked around a bit on the 13F report question, but did not come up with any useful Google hits. As long as the SEC maintains a fixed format in the text files, it seems you should be able to wrangle it (and you can include some error checking in case the format changes).
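The "wrangle it with some error checking" part might look like this. The pipe-delimited layout here is purely illustrative, not the SEC's actual format - the point is just that each line is validated and bad lines are reported rather than silently dropped:

```python
# Sketch: parse fixed-format holding lines, collecting errors for any line
# that no longer matches the expected shape. The format is illustrative.
def parse_holdings(lines):
    holdings, errors = [], []
    for n, line in enumerate(lines, 1):
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            errors.append((n, "expected 3 fields"))
            continue
        name, cusip, shares = parts
        if not shares.isdigit():
            errors.append((n, "bad share count"))
            continue
        holdings.append({"issuer": name, "cusip": cusip, "shares": int(shares)})
    return holdings, errors

good = "APPLE INC | 037833100 | 240000"
bad = "GARBLED LINE"
holdings, errors = parse_holdings([good, bad])
```

If the SEC ever changes the layout, the `errors` list lights up immediately instead of the parser quietly producing garbage.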

Adam, thanks for the guidance on sticking images into posts.

Have any of you come across Whale Wisdom before? They process 13F filings specifically for this purpose and have a backtesting tool and API. Unfortunately it looks like they've monetized since I last looked, and now offer only the last nine months of data for free.

I should also mention this related post at Abnormal Returns:

"Hedge funds can’t generate enough alpha in the large-cap space to justify their high fees. They invest in the large caps because they have too much money to manage, and they don’t want to give up juicy management fees that enable them buy a condo on New York City’s Park Avenue. How do hedge fund managers get away with murder (i.e. murder of their clients’ assets)?

The answer is simple.

They generate significantly higher alpha in their small-cap stock investments. Generally speaking, there are fewer analysts covering the little guys, and these stocks are less efficiently priced. Hedge funds spend enormous resources to analyze and uncover data about these stocks because this is one of the places where they can generate significant outperformance. Our analysis also shows that this is also a fertile ground for piggyback investors.
Between 1999 and 2009, the 15 most popular small-cap stocks among hedge funds managed to beat the market by 1.4 percentage points per month."
