NetSquared enables social benefit organizations to leverage the tools of the social web.

Ian Elwood's blog

Screen Scraping Tools for SEC's EDGAR Database

Update: Continuing my theme of working backwards and getting things done faster than if I had done them the "right way," here is a tool developed by Joshua Tauberer that exposes the entire SEC EDGAR database in the way we need and outputs RDF. Now it is a matter of linking to this output and creating a visualization. One more step!

The Hunt For Contributors

The hunt for contributors to our project continues, but being one of the winners sure does help. Our corporate malfeasance wiki, Crocodyl.org, has attracted a large number of qualified and interested people, so reaching out to those folks has been really useful.

We have one contributor who works for Mozilla, some people from Wikileaks and a couple of avid Wikipedians with expert technical know-how, so we are keeping our fingers crossed that these folks will join up officially soon.

Anyone interested in helping out, please contact me at ian [at] corpwatch.org. Thanks!

FOAFCorp RDF Language

Theyrule.net was an early inspiration for this project, and we recently found a data format that might be useful for it, depending on how we take it on. FOAFCorp is based on FOAF, a data format for describing relationships between people in RDF. RDFWeb.org has some really good information, as well as a list of projects that are developing along these same lines.

Tax havens and the SEC Database

The issue of tax havens recently came up in discussion with some coworkers. Many companies have multiple subsidiaries that are shell companies, which wouldn't really be interesting for the scope of this project. The Stop Tax Haven Abuse Act, a bill now in Congress, has a good list of tax havens that we could use to filter out the "shell" companies, so that the mashup we create can distinguish between valid business entities and companies that are just being used to hide money from the IRS.
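As a rough illustration of that filtering idea, here is a minimal Python sketch. It assumes each subsidiary has already been reduced to a name and a jurisdiction; the `TAX_HAVENS` set and the `split_by_haven` function are hypothetical names, and the two entries in the set are placeholders rather than the bill's actual list. Being registered in a listed jurisdiction does not prove a company is a shell, so the sketch only flags those entries instead of deleting them.

```python
# Placeholder entries for illustration only; the real list should be
# copied from the text of the bill.
TAX_HAVENS = {"bermuda", "cayman islands"}


def split_by_haven(subsidiaries, havens=TAX_HAVENS):
    """Split subsidiaries into (operating, flagged) by jurisdiction.

    Each subsidiary is a dict with "name" and "jurisdiction" keys.
    Matching ignores case and surrounding whitespace.
    """
    operating, flagged = [], []
    for sub in subsidiaries:
        jurisdiction = sub["jurisdiction"].strip().lower()
        if jurisdiction in havens:
            flagged.append(sub)
        else:
            operating.append(sub)
    return operating, flagged


subsidiaries = [
    {"name": "Example Holdings Ltd.", "jurisdiction": "Bermuda"},
    {"name": "Example Retail, Inc.", "jurisdiction": "Delaware"},
]
operating, flagged = split_by_haven(subsidiaries)
```

Keeping the flagged companies in a separate list, rather than dropping them, would let the visualization show or hide them later.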

Technical Specifications for the CorpWatch Mashup

The job here is to screen scrape the subsidiary pages on the EDGAR database and translate them into an open format that will be published to the web and can be fed into the Prefuse visualization software or another API.

The SEC's EDGAR website in theory has lists of subsidiaries for all publicly traded companies in the USA. They are listed as "Exhibit 21" on forms S-1, S-4, S-11, F-1, F-4, 10, and the annual report filed on Form 10-K. Here is an example:

http://www.sec.gov/Archives/edgar/data/831001/000119312507038505/dex2101.htm

Google Hackathon and Project Scope

The Google Hackathon went really well. I got a lot of really great ideas from the developers, as well as the other NetSquared Contest entrants. I really like Metavid; they are doing similar work by cataloging video footage from the House and Senate floor. Awesome! MAPLight is also really cool.

One concrete suggestion I got from the Google developers was to narrow the scope of our proposal. I am going to edit the proposal to focus specifically on the task of parsing and translating the EDGAR database into a format that CorpWatch and Crocodyl can use for our visualization of parent company/subsidiary relationships, and to build that format on open standards so that others can build additional APIs on top of it.
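One possible shape for that open format is plain RDF triples. The Python sketch below writes parent/subsidiary relationships as N-Triples lines. The `example.org` namespace and the `hasSubsidiary` predicate are placeholders invented for this illustration, not FOAFCorp terms or anything the project has settled on; only the `rdfs:label` property is a real, standard term.

```python
import re

# Hypothetical namespace and predicate, used only for illustration.
BASE = "http://example.org/company/"
HAS_SUBSIDIARY = "http://example.org/terms/hasSubsidiary"
# Standard RDF Schema property for a human-readable name.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def slug(name):
    """Turn a company name into a lowercase, hyphenated URL fragment."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(parent, subsidiaries):
    """Return N-Triples text linking a parent company to its subsidiaries."""
    parent_uri = f"<{BASE}{slug(parent)}>"
    lines = [f"{parent_uri} <{RDFS_LABEL}> {literal(parent)} ."]
    for name in subsidiaries:
        child_uri = f"<{BASE}{slug(name)}>"
        lines.append(f"{parent_uri} <{HAS_SUBSIDIARY}> {child_uri} .")
        lines.append(f"{child_uri} <{RDFS_LABEL}> {literal(name)} .")
    return "\n".join(lines)


triples = to_ntriples("Acme Corp", ['Acme "Retail" Inc.'])
```

Deriving identifiers from names is fragile, because two different companies can share a name. A real version would more likely key each company on a stable identifier from the filing itself.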

Crocodyl blog

I do a lot of blogging on the Crocodyl blog but will also be posting regular updates on the state of the mashup here.
