Data and the Web

Archive for the ‘data tags/search’ Category

Infochimps and Numbrary: More Data Than You Can Shake a Stick At

Thursday, July 10th, 2008

These are some very good times for those of you out there who like publicdata.  I ran across Kevin Chai’s research website today, which has a nice listing of various data sets, blog articles and other data-related goodies.

This reminded me of a couple other really interesting websites that are trying to solve the problem of data accessibility.  Check ‘em out:

Infochimps wins the award for compiling massive data sets.  If this is your thing, you may want to have a look.  For instance, in a recent blog post, they provided a peek into some of the hidden gems of their collection, including:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, pitch by pitch trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the Wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets - the chemical and safety information required by OSHA
  • 100 years of Hourly weather data; from 1973 on there’s about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings and weighs in at ~15 GB compressed.

Break out that baseball data and you’ll be sure to impress your friends during the upcoming All-Star game.

As an aside, if any of you do end up taking this data for a spin with Kirix Strata™, let us know how it goes.  Strata’s got a theoretical limit of about 60 billion records per table.  Internally, we’ve tested on about 1 billion records, but have only pushed it past 100 million records or so in the corporate setting.  Strata tends to eat data for lunch, so if you push it past the 100 million record mark, we’d love to hear about it.


I recently ran across Numbrary and, for the little time I’ve played with it, I’m pretty impressed.  It has a lot of public data available with a heavy emphasis on economic indicators but with a load of other stuff too.  Best of all, it offers the data to the user in CSV format, which Strata happily opens up directly.

Here’s their mission statement, summarized:

Finding data is a pain.
Working with data is a drag.
Talking usefully about data is nearly impossible.
Numbrary® aims to change this.

Search engines don’t help much. Numbers are not words, which can be scanned and indexed for rapid search and retrieval.

Collections of numbers need as much attention online as do collections of words. With Numbrary®, they will receive that attention.

So, if you need a data set and a Google search set to filetype:csv doesn’t help, give these two websites a spin.  Got any other good data repositories to share?  Let us know.

Another Reason We Need the Semantic Web

Tuesday, May 13th, 2008

Whenever I take the train in and out of Chicago, I’m reminded of how much better things would be if there were greater adoption of the Semantic Web. In order to find the train times, I have to navigate through the esoteric organization of the Chicago Metra train website, and every time, I’m struck by how much useful information is just sitting there, waiting to be set free with semantic markup.

The Metra site itself is easy enough to use, if you’re already familiar with the train system in Chicago. However, it’s got to be quite a challenge for anyone who’s new to it.

The problem is that the train schedules are organized according to train lines, rather than by what station you’re traveling to or from. For instance, when you click the “Quick Schedule” link, you just get a list of all the train lines in the system, with options like the “Metra Heritage Corridor Line” and “Metra BNSF Railway Line.” This works great if you know where these train lines run. Unfortunately, if all you know is that you want to get from Chicago to Elmhurst, well, you’ll need to dig around quite a bit to figure out the correct train line to take.

Metra Schedule Navigation

This is where the Semantic Web could really help.

When the data on the Metra site gets marked up semantically, the information it offers will no longer be tied to the way it is presented on the page or limited to being organized and consumed in only one way. So, if the train schedules are given uniform resource identifiers (URIs) and other semantic markup, they will be available directly to the rest of the web and can be accessed and used independently of the way they’re organized on the Metra site. The data itself would be its own web-based resource.

As a result, Metra could continue to list their schedules according to each train line, if they think this is the best approach, but other users and applications would have the ability to re-use this information and present it differently. For instance, a person might be able to type “Chicago” and “Elmhurst” into a trip planner on an iPhone and have it look up the train schedule automatically.
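
To make that a little more concrete, here’s a rough Python sketch of what such a trip planner could do once the schedule is available as plain data rather than as pages organized by train line. Everything in it is made up for illustration: the record layout, the train numbers and the times are hypothetical, not Metra’s actual data or any real feed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stop:
    line: str      # train line name, the way the Metra site groups schedules
    train: str     # hypothetical train number
    station: str
    time: str      # "HH:MM" on a 24-hour clock


# Hypothetical sample records; the train numbers and times are invented.
STOPS = [
    Stop("Union Pacific West", "UPW-41", "Chicago", "17:04"),
    Stop("Union Pacific West", "UPW-41", "Elmhurst", "17:33"),
    Stop("Union Pacific West", "UPW-43", "Chicago", "17:40"),
    Stop("Union Pacific West", "UPW-43", "Elmhurst", "18:09"),
    Stop("BNSF Railway", "BNSF-1255", "Chicago", "17:10"),
    Stop("BNSF Railway", "BNSF-1255", "Naperville", "17:45"),
]


def find_trips(stops, origin, destination):
    """Return (line, train, depart, arrive) for trains serving origin before destination."""
    # Regroup the flat records by train, ignoring how the source site organizes them.
    by_train = {}
    for stop in stops:
        by_train.setdefault((stop.line, stop.train), {})[stop.station] = stop.time

    trips = []
    for (line, train), times in by_train.items():
        if origin in times and destination in times and times[origin] < times[destination]:
            trips.append((line, train, times[origin], times[destination]))
    # Zero-padded "HH:MM" strings sort in chronological order.
    return sorted(trips, key=lambda trip: trip[2])


if __name__ == "__main__":
    for line, train, depart, arrive in find_trips(STOPS, "Chicago", "Elmhurst"):
        print(f"{line} {train}: departs {depart}, arrives {arrive}")
```

The point is that the traveler never has to know which line serves Elmhurst; the line name comes back as part of the answer instead of being the way in.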

And this is obviously just one drop in an ocean of possibilities. As Tim Berners-Lee notes in his “Giant Global Graph” article:

“Now, people are making another mental move. There is realization now, ‘It’s not the documents, it is the things they are about which are important’. Obvious, really.”

The web is mainly a set of connected documents right now. But, as the Semantic Web grows, an increasing number of data resources will have the ability to be connected to each other, with the potential for being re-mixed and re-purposed.

That will definitely be a good day. But until then, I suppose I’ll just have to remember to take the Union Pacific West Line…

Update (01/05/2009):  Looks like Google is trying to make this process easier with their Google Transit Feed Specification, although it appears that there is a bit of resistance out there from the transport agencies…

Show Me the Data!

Wednesday, April 9th, 2008

Great post today by Bret Taylor, seeking a utopian Wikipedia for structured data. There are currently various attempts at this type of thing; Freebase, in particular, comes to mind.

But what Bret is talking about is less about the semantic web — where all data everywhere is linked together by certain tags and/or infrastructure — and more just about getting access to really useful chunks of information, like mapping data and white pages. Heck, it’d be great to just get all US area codes with related information in a nice, accessible CSV file.

Hard to know if this sort of thing could ever come to pass, given that there are likely far fewer people who would edit a large financial data table than, say, the latest goings-on in American Idol — however, I’d love to see it happen.

In the meantime, ReadWriteWeb has compiled a bunch of data sources that I thought might prove useful to you as well. Some of this stuff was new to me and I look forward to exploring it. The one thing that the article didn’t mention was the great publicdata tag archive still happening on Delicious that we talked about many moons ago. Make sure to check that out too if you are looking for some good public data sets.

For Large Data Sets and the People Who Love Them…

Friday, January 18th, 2008

You may remember one of our earlier quests on this blog related to tagging the world’s public data. We’re still eagerly adding and tagging publicly available data sets when we find them.

Today, a friend of mine alerted me to another lode of mine-able data offered by Peter Skomoroch of Data Wrangling. There are a bunch of great sets here; hope you find something to your liking as well.

Besides enjoying his great blog name, I was also happy to be directed to a site that simply states it is “for people with large data sets.” It’s a site built to gather data-enjoyers across the web and collaborate in three areas:

  • Get: scrapers, crawlers, phone calls, buyouts
  • Process: conversions, queries, regressions, collaborative filtering
  • View: tables, graphs, maps, websites

Here’s a quick peek into their mission:

Some of us have spent years scraping news sites. Others have spent them downloading government data. Others have spent them grabbing catalog records for books. And each time, in each community, we reinvent the same things over and over again: scripts for doing crawls and notifying us when things are wrong, parsers for converting the data to RDF and XML, visualizers for plotting it on graphs and charts.

It’s time to start sharing our knowledge and our tools. But more than that, it’s time for us to start building a bigger picture together. To write robust crawl harnesses that deal gracefully with errors and notify us when a regexp breaks. To start converting things into common formats and making links between data sets. To build visualizers that will plot numbers on graphs or points on maps, no matter what the source of the input.

We’ve all been helping to build a Web of data for years now. It’s time we acknowledge that and start doing it together.
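
As a rough illustration of the “notify us when a regexp breaks” idea from the passage above, here’s a small Python sketch. The pattern, the sample pages and the notify callback are all hypothetical, and a real harness would fetch pages over the network rather than take strings.

```python
import re

# Hypothetical pattern for a price embedded in a scraped page.
PRICE_RE = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')


def scrape_prices(pages, notify):
    """Extract one price per page; report pages where the pattern no longer matches."""
    prices = {}
    for url, html in pages.items():
        match = PRICE_RE.search(html)
        if match is None:
            # The page layout probably changed: flag it and keep crawling.
            notify(f"pattern found nothing on {url}")
            continue
        prices[url] = float(match.group(1))
    return prices


if __name__ == "__main__":
    sample_pages = {
        "http://example.com/a": '<span class="price">$19.99</span>',
        "http://example.com/b": '<div class="cost">19.99 USD</div>',  # layout changed
    }
    print(scrape_prices(sample_pages, notify=print))
```

Passing the notifier in as a function keeps the sketch neutral about how the alert goes out (a log line, an email, whatever the community settles on).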

If you love data, this appears to be well worth checking out.

Have a great weekend!

Finding Data Tables on the Web

Saturday, November 3rd, 2007

I’m slightly (fashionably?) late to this party, but I just came across a new website called GraphWise that sets out to be the search engine for tabular data. In a recent press release, they state, “…if you want to search for videos, you go to YouTube, and if you want music, you go to iTunes. If you’re looking for tables of data we aim for users to go to GraphWise.” The comparison may not be entirely accurate since YouTube and iTunes search only their own catalogs, but the vision has some potential if they can pull it off.

Currently, when I look for a data set on the web, I start with these standard tactics:

  1. Google Search, by keyword only
  2. Google Search, by keyword with file type qualifier (e.g., filetype:csv)
  3. Delicious Search, by keyword
  4. Delicious Search, by tag (e.g., publicdata)
  5. Data “Repository” Search, such as Swivel, Data360 or ManyEyes
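
For what it’s worth, the file type qualifier in tactic 2 is just extra text in the query, so it’s easy to script. Here’s a minimal Python sketch that builds such a search URL; the function name is my own invention, and it only constructs the link rather than fetching any results.

```python
from urllib.parse import urlencode


def csv_search_url(keywords, filetype="csv"):
    """Build a Google search URL that adds a filetype: qualifier to the keywords."""
    query = f"{keywords} filetype:{filetype}"
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(csv_search_url("US area codes"))
```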

GraphWise provides an additional option to find data. It apparently spiders data (from HTML tables, CSV files, licensed sources and user uploads), then imports and normalizes the data and, ultimately, develops graphs based on the data (similar to Swivel or Data360). I rarely have need for auto-generated visualizations, but I really like the fact that they provide the URL to the original source table. With Kirix Strata™, it’s obviously a piece of cake to just import the raw table and start using it.
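
I have no idea how GraphWise does its spidering internally, but the basic step of lifting an HTML table off a page and turning it into CSV can be sketched with nothing more than Python’s standard library. This is a simplified illustration under my own assumptions (one flat table, no nested tables or spanning cells), not their implementation.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> on a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace so cells come out clean.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    # Small sample table for illustration only.
    page = """
    <table>
      <tr><th>Area Code</th><th>City</th></tr>
      <tr><td>312</td><td>Chicago</td></tr>
      <tr><td>212</td><td>New York</td></tr>
    </table>
    """
    print(html_table_to_csv(page))
```

The hard parts mentioned in this post (normalizing messy tables and ranking them for search) are exactly what this sketch leaves out.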

I did have some trouble finding useful data sets based on my search queries (forgivable, as the service is still in beta). For instance, in my previous blog post, we needed to find area code data in tabular format. So, I searched for US Area Codes in GraphWise, but got nothing even close to what I was looking for. For a simpler example, I searched for Apple’s stock price. It looks like GraphWise licenses historic stock information from a company called CSI, but it only displayed the data in bite-sized chunks. I know I can easily download the full set of Apple’s historical stock data via CSV at Yahoo Finance, but that wasn’t listed as a resource.

It appears GraphWise has done well with the spidering technology to identify and capture table information across the web. The next big step will be to make the search queries more relevant. Because HTML and CSV files aren’t often linked to directly, it would be really difficult to apply the kind of PageRank algorithm that makes Google so valuable. I can imagine some other issues as well, like trying to separate a table name (if available) and the actual text within a given table. Hopefully they’ll be able to overcome these hurdles; it would be great to have a Google-like place to identify tabular data on the web.

(via Swivel)

Tagging the World’s publicdata

Tuesday, July 17th, 2007

There’s a surprising amount of publicly available data on the web: government statistics, economic information, sports data, etc. And lots of it is in good ol’ fashioned CSV files ripe for analysis.

Jon Udell has recently begun tracking this kind of data and has asked anyone who is so inclined to follow along. All you have to do to join in the fun is tag your bookmark publicdata.

With Kirix Strata™, we’ve been interested in identifying public data sources as well and have been jotting bookmarks down as we’ve come across them. We’re quite pleased to finally have a useful, publicly available place to put them:


We’ve only added a few to start with, but you’ll see more added in the coming weeks.

Got any good publicdata to share?


Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.