Data and the Web

Archive for the ‘data repositories’ Category

Further Sunlight on Government Data

Monday, July 20th, 2009

In a previous post, we discussed some of the interesting things the US government is doing to make its data more widely available, culminating in the Data.gov website.  This website is now up and running and has definitely made some progress since we last discussed it.  Data.gov is broken down into three main catalogs:

  1. Raw Data Catalog (with data files available in XML, CSV, KML, etc.)
  2. Tools Catalog (list of tools built to work with various open data sets)
  3. Geodata Catalog (links to Federal geospatial data)

They’ve also tried to make it easier to search for data sets, which, like video, relies heavily on good, meaningful descriptions and related metadata.  It’s a hard nut to crack.  For example, government agencies tend to release data sets on an annual basis, so you’ll have, say, 5 different data sets (and counting) for the “Public Libraries Survey” from 2004 through 2008.  If your search terms aren’t specific enough, these repetitious items tend to clutter up the search results.  As Data.gov continues to add more data sets, hopefully they can refine this area further.
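For what it's worth, some of that clutter could be handled on the client side. Here's a minimal Python sketch (the survey titles are hypothetical examples, not actual catalog entries) of collapsing annual releases of the same data set into a single search result:

```python
import re
from collections import defaultdict

def group_annual_datasets(titles):
    """Group catalog entries that differ only by a four-digit year."""
    groups = defaultdict(list)
    for title in titles:
        # Strip the year so "Public Libraries Survey 2004" and
        # "Public Libraries Survey 2008" share one group key.
        key = re.sub(r"\b(19|20)\d{2}\b", "", title)
        groups[" ".join(key.split())].append(title)
    return dict(groups)

titles = [
    "Public Libraries Survey 2004",
    "Public Libraries Survey 2005",
    "Public Libraries Survey 2008",
    "Toxics Release Inventory 2007",
]
grouped = group_annual_datasets(titles)
# "Public Libraries Survey" now maps to three annual releases
```

A real catalog would of course need smarter normalization than a year-stripping regex, but even this crude grouping would cut the repetition considerably.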

But, then again, maybe they won’t have to.  The folks at Sunlight Labs, whose mission is to build technology that makes government more transparent and accountable, have recently announced a project called The National Data Catalog.  It will be a tool that aims to take the Data.gov concept and improve upon it.  From the announcement:

“We think we can add value on top of things like Data.gov and the municipal data catalogs by autonomously bringing them into one system, manually curating and adding other data sources and providing features that, well, Government just can’t do. There’ll be community participation so that people can submit their own data sources, and we’ll also catalog non-commercial data that is derivative of government data like OpenSecrets. We’ll make it so that people can create their own documentation for much of the undocumented data that government puts out and link to external projects that work with the data being provided.”

This should be interesting to watch.  As the Sunlight folks say in a later post, they are not out to replicate Data.gov, but to stand on its shoulders (similar to how, say, commercial weather sites rely on and improve upon the National Weather Service).  Given the nature of the beast, data sets need to be described really well in order to be both searchable and useful.  Hopefully the community aspect, in particular, can help give this data more utility.  If any of you tech-savvy folks are interested in following the project or contributing code, here’s the project page.

Thursday, March 5th, 2009

We recently posted an article about Vivek Kundra, who was named United States CIO this morning by the Obama administration. He’s got $71 billion in IT spending under his care. Hmm, that’s a lot of data browsers.

One interesting tidbit appeared in this Saul Hansell NY Times article:

Another initiative will be to create a new site, Data.gov, that will become a repository for all the information the government collects. He pointed to the benefits that have already come from publishing the data from the Human Genome Project by the National Institutes of Health, as well as the information from military satellites that is now used in GPS navigation devices.

“There is a lot of data the federal government has and we need to make sure that all the data that is not private, or restricted for national security reasons, can be made public,” [Kundra] said.

In another bit of interesting news, Jonathan Stein at Mother Jones notes that Mike Honda (D-Calif.) added a provision to the recent appropriations bill that requires government entities to make their public data available in raw form:

If the Senate passes the bill with the provision intact, citizens seeking information about Congress’ activities—such as bill names and numbers, amendments, votes, and committee reports—won’t have to rely on government websites, which often filter information, are incomplete, or are difficult to use. Instead, the underlying data will be available to anyone who wants to build a superior site or tool to sift through it. “The language is groundbreaking in that it supports providing unfiltered legislative information to the public,” says Honda’s online communications director, Rob Pierson. “Instead of silo-ing the information, and only allowing access through a limited web form, access to the raw data will make it easier for people to learn what their government is doing.”

Kim Zetter from Wired has more on the story here.

Maybe once the data is made more accessible, some clever folks can put an interface on it that tames the complex aftermath of the “laws and sausages” routine. I did my best to search for Honda’s three-sentence provision in the latest omnibus bill with no luck. Anyone know what the actual provision stated? [UPDATE: Rob Pierson, Online Communications Director of Congressman Honda’s office, provided a link to an O’Reilly post with the full text of the provision. Give the full article a read — it’s quite worthwhile.]

And, for posterity, here are some of the data repositories mentioned in the articles above:

AWS Public Data Sets Continues to Expand

Wednesday, February 25th, 2009

Previously, we posted some information on Amazon’s foray into making huge public data sets available to users of its web services.  Yesterday they announced some very sizable additions:

If you use AWS, the announcement provides more info on these datasets as well as how to access them.  If you don’t use AWS, you can still access much of this data directly from the websites linked above.

More Government Data Coming to a Browser Near You…

Friday, February 6th, 2009

It was intriguing to see how all this newfangled Web 2.0 technology was applied during the US presidential campaign this past year (organization, multimedia, etc.).  It’s also quite interesting to hear about some of the big ideas for how the new administration wants to change how government works.  And, not to be outdone, the opposition party is also getting into the Web 2.0 game.

According to Nextgov, it appears that Vivek Kundra, current CTO of the District of Columbia, is going to be given the nod as the next e-government liaison.  From the article:

Kundra also is a strong proponent of giving the public access to government data. “Why does the government keep information secret?” he asked rhetorically during an interview with Nextgov. “Why not put it all out in the government domain? [Since arriving in Washington], I’ve made all the government databases public. Every 311 call, every abandoned automobile, who has responded, etc. It provides high-level oversight of the daily tasks of government.”

A more in-depth bio of Kundra can be found in this recent Washington Post article.  A couple of the more intriguing things he promoted in the District of Columbia were the DC Data Catalog and “Apps for Democracy.”

The data catalog covers all kinds of DC data, from crime statistics to — ahem — the most recent roadkill pickups.  It’s also available in a wide variety of formats. “Apps for Democracy” was a mashup contest to see what kinds of apps could be developed to improve DC residents’ access to data.  It was highly successful, producing 47 different applications for a fraction of the cost of formally contracting out these projects.

Of course, changing such a huge, bureaucratic system as the Federal government will not happen overnight, but it is encouraging to see more of a focus on making data available in a timely manner (and in usable formats).

For those interested in this sort of thing, I’d also recommend checking out the Sunlight Foundation, which is focused on government transparency.  Also, TechPresident and Nextgov are both news sources focused on following all things e-gov.

Got any other interesting links on this topic?  Please feel free to post ‘em in the comments below.

Amazon Gets into the Public Data Sets Game

Thursday, December 4th, 2008

Amazon announced the launch of its Public Data Sets service this evening.  Bottom line: they asked people for different public or non-proprietary data sets and they got ’em.  Here’s a sample of the (pretty hefty) stuff they are hosting for free:

  • Annotated Human Genome Data provided by ENSEMBL
  • A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University
  • Various US Census Databases provided by The US Census Bureau
  • Various Labor Statistics Databases provided by The Bureau of Labor Statistics

Though the individual sets are huge, there aren’t many of them at this point; it appears that Amazon will be filling this out over time.

How do you access them?  Well, there’s a slight hitch.  You need to fire up an EC2 instance, hook into the set and then perform your analysis.  You just pay for the cost of the EC2 service.  Given how massive these tables are, it seems like a pretty good way to go.  A step closer to the supercomputer in the cloud.

We’re devoted users of Amazon S3 here and have also done some work with EC2, which is quite impressive.  Overall, this is another example of a nice trend where large data sets are becoming more easily accessible.


If anyone has the chance to play with this service, let us know how it goes.

Infochimps and Numbrary: More Data Than You Can Shake a Stick At

Thursday, July 10th, 2008

These are some very good times for those of you out there who like public data.  I ran across Kevin Chai’s research website today, which has a nice listing of various data sets, blog articles and other data-related goodies.

This reminded me of a couple other really interesting websites that are trying to solve the problem of data accessibility.  Check ‘em out:

Infochimps wins the award for compiling massive data sets.  If this is your thing, you may want to have a look.  For instance, in a recent blog post, they provided a peek into some of the hidden gems of their collection, including:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, pitch by pitch trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the Wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets - the chemical and safety information required by OSHA
  • 100 years of hourly weather data; from 1973 on, there are about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings, weighing in at ~15 GB compressed.
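If you've never built a word-frequency table like the British National Corpus one mentioned above, here's a toy sketch using only Python's standard library (the sample sentence is made up; a real run would feed in the full corpus text):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens in a chunk of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "Data is useful. Data about data is even more useful."
freqs = word_frequencies(sample)
# freqs["data"] == 3, freqs["useful"] == 2
```

Real corpus tokenization is fussier than a single regex, of course, but the core of the job really is just counting.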

Break out that baseball data and you’ll be sure to impress your friends during the upcoming All-Star game.

As an aside, if any of you do end up taking this data for a spin with Kirix Strata™, let us know how it goes.  Strata’s got a theoretical limit of about 60 billion records per table.  Internally, we’ve tested on about 1 billion records, but have only pushed it past 100 million records or so in the corporate setting.  Strata tends to eat data for lunch, so if you push it past the 100 million record mark, we’d love to hear about it.


I recently ran across Numbrary and, for the little time I’ve played with it, I’m pretty impressed.  It has a lot of public data available, with a heavy emphasis on economic indicators but a load of other stuff too.  Best of all, it offers the data to the user in CSV format, which Strata happily opens directly.

Here’s their mission statement, summarized:

Finding data is a pain.
Working with data is a drag.
Talking usefully about data is nearly impossible.
Numbrary® aims to change this.

Search engines don’t help much. Numbers are not words, which can be scanned and indexed for rapid search and retrieval.

Collections of numbers need as much attention online as do collections of words. With Numbrary®, they will receive that attention.

So, if you need a data set and a Google search set to filetype:csv doesn’t help, give these two websites a spin.  Got any other good data repositories to share?  Let us know.

Show Me the Data!

Wednesday, April 9th, 2008

Great post today by Bret Taylor, seeking a utopian Wikipedia for structured data. There are currently various attempts at this type of thing; Freebase, in particular, comes to mind.

But what Bret is talking about is less about the semantic web — where all data everywhere is linked together by certain tags and/or infrastructure — and more just about getting access to really useful chunks of information, like mapping data and white pages. Heck, it’d be great to just get all US area codes with related information in a nice, accessible CSV file.
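To show just how low the bar is: a file like the hypothetical one below (the column names and schema are invented for illustration) opens up in a few lines with Python's standard csv module:

```python
import csv
import io

# A hypothetical area-code file; the schema is made up for this example.
raw = """area_code,state,city
212,NY,New York
312,IL,Chicago
415,CA,San Francisco
"""

# DictReader turns each row into a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
by_code = {row["area_code"]: row for row in rows}
# by_code["312"]["city"] == "Chicago"
```

That's the whole appeal of plain CSV: no scraping, no parsing heroics, just data.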

Hard to know if this sort of thing could ever come to pass, given that there are likely far fewer people who would edit a large financial data table than, say, the latest goings-on in American Idol — however, I’d love to see it happen.

In the meantime, ReadWriteWeb has compiled a bunch of data sources that I thought might prove useful to you as well. Some of this stuff was new to me and I look forward to exploring them. The one thing that the article didn’t mention was the great publicdata tag archive still happening on Delicious that we talked about many moons ago. Make sure to check that out too if you are looking for some good public data sets.

For Large Data Sets and the People Who Love Them…

Friday, January 18th, 2008

You may remember one of our earlier quests on this blog related to tagging the world’s public data. We’re still eagerly adding and tagging publicly available data sets when we find them.

Today, a friend of mine alerted me to another lode of mine-able data offered by Peter Skomoroch of Data Wrangling. There are a bunch of great sets here; hope you find something to your liking as well.

Besides enjoying his great blog name, I was also happy to be directed to a site that simply states it is “for people with large data sets.” It’s a site built to gather data-enjoyers from across the web to collaborate in three areas:

  • Get: scrapers, crawlers, phone calls, buyouts
  • Process: conversions, queries, regressions, collaborative filtering
  • View: tables, graphs, maps, websites

Here’s a quick peek into their mission:

Some of us have spent years scraping news sites. Others have spent them downloading government data. Others have spent them grabbing catalog records for books. And each time, in each community, we reinvent the same things over and over again: scripts for doing crawls and notifying us when things are wrong, parsers for converting the data to RDF and XML, visualizers for plotting it on graphs and charts.

It’s time to start sharing our knowledge and our tools. But more than that, it’s time for us to start building a bigger picture together. To write robust crawl harnesses that deal gracefully with errors and notify us when a regexp breaks. To start converting things into common formats and making links between data sets. To build visualizers that will plot numbers on graphs or points on maps, no matter what the source of the input.

We’ve all been helping to build a Web of data for years now. It’s time we acknowledge that and start doing it together.

If you love data, this appears to be well worth checking out.

Have a great weekend!

Finding Data Tables on the Web

Saturday, November 3rd, 2007

GraphWise LogoI’m slightly (fashionably?) late to this party, but I just came across a new website called GraphWise that sets out to be the search engine for tabular data. In a recent press release, they state, “…if you want to search for videos, you go to YouTube, and if you want music, you go to iTunes. If you’re looking for tables of data we aim for users to go to GraphWise.” The comparison may not be entirely accurate since YouTube and iTunes search only their own catalogs, but the vision has some potential if they can pull it off.

Currently, when I look for a data set on the web, I start with these standard tactics:

  1. Google Search, by keyword only
  2. Google Search, by keyword with file type qualifier (e.g., filetype:csv)
  3. Delicious Search, by keyword
  4. Delicious Search, by tag (e.g., publicdata)
  5. Data “Repository” Search, such as Swivel, Data360 or ManyEyes
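Tactic #2 is easy to script, too. Here's a small sketch that builds a Google search URL with a filetype qualifier (the keywords are just an example; the URL format is standard query-string encoding):

```python
from urllib.parse import urlencode

def google_query(keywords, filetype=None):
    """Build a Google search URL, optionally with a filetype qualifier."""
    q = keywords if filetype is None else f"{keywords} filetype:{filetype}"
    return "https://www.google.com/search?" + urlencode({"q": q})

url = google_query("US area codes", filetype="csv")
# https://www.google.com/search?q=US+area+codes+filetype%3Acsv
```

Handy if you find yourself running the same data-hunting searches over and over.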

GraphWise provides an additional option to find data. It apparently spiders data (from HTML tables, CSV files, licensed sources and user uploads), then imports and normalizes the data and, ultimately, develops graphs based on the data (similar to Swivel or Data360). I rarely have need for auto-generated visualizations, but I really like the fact that they provide the URL to the original source table. With Kirix Strata™, it’s obviously a piece of cake to just import the raw table and start using it.
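GraphWise's actual spider is proprietary, but as a rough sketch of the table-extraction step, here's how HTML table cells can be pulled out with Python's standard html.parser (the table contents below are invented):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of each row in an HTML table."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []          # start a fresh row
        elif tag in ("td", "th"):
            self.in_cell = True    # only capture text inside cells

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

html = ("<table><tr><th>Year</th><th>Price</th></tr>"
        "<tr><td>2007</td><td>84.84</td></tr></table>")
parser = TableExtractor()
parser.feed(html)
# parser.rows == [["Year", "Price"], ["2007", "84.84"]]
```

Identifying tables is the easy part; as the post notes, deciding which of them contain meaningful data, and ranking them for search, is the hard part.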

I did have some trouble finding useful data sets based on my search queries (forgivable, as the service is still in beta). For instance, in my previous blog post, we needed to find area code data in tabular format. So, I searched for US area codes in GraphWise, but got nothing even close to what I was looking for. For a simpler example, I searched for Apple’s stock price. It looks like GraphWise licenses historical stock information from a company called CSI, but it only displayed the data in bite-sized chunks. I know I can easily download the full set of Apple’s historical stock data as a CSV from Yahoo Finance, but that wasn’t listed as a resource.

It appears GraphWise has done well with the spidering technology to identify and capture table information across the web. The next big step will be to make the search queries more relevant. Because HTML and CSV files aren’t often linked to directly, it would be really difficult to apply the kind of PageRank algorithm that makes Google so valuable. I can imagine some other issues as well, like trying to separate a table name (if available) and the actual text within a given table. Hopefully they’ll be able to overcome these hurdles; it would be great to have a Google-like place to identify tabular data on the web.

(via Swivel)

Tagging the World’s publicdata

Tuesday, July 17th, 2007

There’s a surprising amount of publicly available data on the web — government statistics, economic information, sports data, etc. And lots of it is in good ol’ fashioned CSV files ripe for analysis.

Jon Udell has recently begun tracking this kind of data using Delicious and has asked anyone who is so inclined to follow along. All you have to do to join in the fun is tag your bookmarks publicdata.

With Kirix Strata™, we’ve been interested in identifying public data sources as well and have been jotting bookmarks down as we’ve come across them. We’re quite pleased to finally have a useful, publicly available place to put them:


We’ve only added a few to start with, but you’ll see more added in the coming weeks.

Got any good publicdata to share?


Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.