Data and the Web

Archive for the ‘data mining’ Category

Thursday, March 5th, 2009

We recently posted an article about Vivek Kundra, who was named United States CIO this morning by the Obama administration. He's got $71 billion in IT spending under his care. Hmm, that's a lot of data browsers.

One interesting tidbit appeared in this Saul Hansell NY Times article:

Another initiative will be to create a new site that will become a repository for all the information the government collects. He pointed to the benefits that have already come from publishing the data from the Human Genome Project by the National Institutes of Health, as well as the information from military satellites that is now used in GPS navigation devices.

"There is a lot of data the federal government has and we need to make sure that all the data that is not private, or restricted for national security reasons, can be made public," [Kundra] said.

In another bit of interesting news, Jonathan Stein at Mother Jones notes that Mike Honda (D-Calif.) added a provision to the recent appropriations bill that requires government entities to make their public data available in raw form:

If the Senate passes the bill with the provision intact, citizens seeking information about Congress' activities—such as bill names and numbers, amendments, votes, and committee reports—won't have to rely on government websites, which often filter information, are incomplete, or are difficult to use. Instead, the underlying data will be available to anyone who wants to build a superior site or tool to sift through it. “The language is groundbreaking in that it supports providing unfiltered legislative information to the public,” says Honda's online communications director, Rob Pierson. “Instead of silo-ing the information, and only allowing access through a limited web form, access to the raw data will make it easier for people to learn what their government is doing.”

Kim Zetter from Wired has more on the story here.

Maybe once the data is made more accessible, some clever folks can put an interface on things that improves the complex aftermath of the “laws and sausages” routine. I did my best to search for Honda's three-sentence provision in the latest omnibus bill, with no luck. Anyone know what the actual provision stated? [UPDATE: Rob Pierson, Online Communications Director of Congressman Honda's office, provided a link to an O'Reilly post with the full text of the provision. Give the full article a read — it's quite worthwhile.]

And, for posterity, here are some of the data repositories mentioned in the articles above:

AWS Public Data Sets Continues to Expand

Wednesday, February 25th, 2009

Previously, we posted some information on Amazon's foray into making huge public data sets available to users of their web services. Yesterday they announced several very sizable additions:

If you use AWS, the announcement provides more info on these datasets as well as how to access them. If you don't use AWS, you can still access much of this data directly from the websites linked above.

Amazon Gets into the Public Data Sets Game

Thursday, December 4th, 2008

Amazon announced the launch of its Public Data Sets service this evening. Bottom line, they asked people for different public or non-proprietary data sets and they got ‘em. Here's a sample of the (pretty hefty) stuff they are hosting for free:

  • Annotated Human Genome Data provided by ENSEMBL
  • A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University
  • Various US Census Databases provided by The US Census Bureau
  • Various Labor Statistics Databases provided by The Bureau of Labor Statistics

Though the individual sets are huge, there aren't many of them at this point; it appears that Amazon will be filling out the collection over time.

How do you access them? Well, there's a slight hitch. You need to fire up an EC2 instance, hook into the set and then perform your analysis. You just pay for the cost of the EC2 service. Given how massive these tables are, it seems like a pretty good way to go. A step closer to the supercomputer in the cloud.

We're devoted users of Amazon S3 here and have also done some work with EC2, which is quite impressive. Overall, this is another example of a nice trend where large data sets are becoming more easily accessible.


If anyone has the chance to play with this service, let us know how it goes.

A CSV File You Can Believe In

Monday, December 1st, 2008

This is not a blog that delves into political issues, but I happened to notice that the Obama transition team released the names of all their donors today. However, inexplicably, they don't offer them in CSV format for easy slicing and dicing in your favorite data analysis software.

A couple of clicks in Kirix Strata™ took care of that pretty quickly. (*.csv, 120 KB)
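If you'd rather script that conversion than click through a data browser, Python's standard csv module handles it in a few lines. This is a minimal sketch, not the Strata workflow; the tab-delimited sample input is made up for illustration.

```python
import csv
import io

def delimited_to_csv(text, delimiter="\t"):
    """Re-write delimited text (e.g. tab-separated) as standard CSV."""
    out = io.StringIO()
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    writer = csv.writer(out)  # quotes fields containing commas automatically
    writer.writerows(reader)
    return out.getvalue()

# Hypothetical tab-separated input standing in for the real donor file
tsv = "Name\tEmployer\tAmount\nA. Smith\tGoogle\t2300\n"
csv_text = delimited_to_csv(tsv)
```

The same function works for pipe- or semicolon-delimited files by swapping the `delimiter` argument.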

Some interesting bits of information:

  • Google is the employer with the most total donations at $14,200 (from “Google” and “Google, Inc.”; 8 employees).
  • Microsoft employees gave only $500 (2 employees).
  • 74 different colleges and universities were represented, for a total of $25,900 (81 employees).
  • 4 people who defined themselves as “Not Employed” gave a total of $11,250.
  • There are 1,776 donors in the list. Mere coincidence… or more evidence that Obama is truly “that one” (alternatively, the list could have been hacked because he is “the one”)?
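For anyone who wants to reproduce the employer roll-up above in code, here's a minimal sketch. The column names (`Employer`, `Amount`) and the inline sample rows are assumptions standing in for the real file, so check the actual CSV header first.

```python
import csv
import io
from collections import defaultdict

def donations_by_employer(csv_text):
    """Sum donation amounts and count donors per employer.

    Assumes 'Employer' and 'Amount' columns (hypothetical names --
    verify against the real file's header).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        employer = row["Employer"].strip()
        totals[employer] += float(row["Amount"])
        counts[employer] += 1
    return totals, counts

# Tiny made-up sample standing in for the real donor list
sample = """Name,Employer,Amount
A. Smith,Google,2300
B. Jones,"Google, Inc.",1000
C. Lee,Microsoft,250
"""

totals, counts = donations_by_employer(sample)
```

Note that, as with the real list, “Google” and “Google, Inc.” stay separate in the raw totals until the Employer field is cleaned up.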

The data is a little bit dirty (particularly the “Employer” field), but you might have some fun poking around. Shoot us a message in the comments if you find anything interesting.

P.S. Also, I saw this article about data overload during the campaign… looks like the Federal Election Commission could have used the Kirix Strata government discount. ;)

Update: Also, it looks like George Lucas jumped in, and we see an employee of the notorious Dewey, Cheetham & Howe.

The Dirty Data of Data Mining

Tuesday, October 28th, 2008

Today I came across a survey on data mining by a consulting firm called Rexer Analytics. Their survey took into account 348 responses from data mining professionals around the world. A few interesting tidbits:

  • Dirty data, data access issues, and explaining data mining to others remain the top challenges faced by data miners.
  • Data miners spend only 20% of their time on actual modeling. More than a third of their time is spent accessing and preparing data.
  • In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.

We've found these issues to hold true with our clients as well, particularly in various auditing industries. Auditors will get a hold of their client's data, maybe in some delimited text file. The data set is inevitably too large for Excel to handle easily, so they may try Access (and once they're thoroughly frustrated, they give Kirix Strata™ a shot).

Once they can actually see that data set, they start exploring it to learn about what they're looking at and then inevitably find out how dirty it is. Multiple fields are mashed together or individual ones are stripped apart. Company names appear multiple times in various forms (“I.B.M” vs. “IBM”). An important bit of numeric information is embedded in a text field. There is no end of time spent “purifying” the data set to make sure to avoid the “garbage in, garbage out” syndrome.

Often overlooked, data cleansing is really as important as the analysis itself. Only once this step is complete can you move on to your data mining or other data analysis.

Check out the survey summary yourself and let us know if it matches your experience.


Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.