Dirty Data | Data and the Web

Data and the Web

Archive for the ‘dirty data' Category

Data Clean Up, Brought to You by Google

Thursday, November 18th, 2010

Google Refine LogoI recently saw this announcement for an open source tool and thought it might be interesting to some folks that deal with messy data sets.

Google Refine provides an interesting take on grouping and filtering data and then getting it cleaned up. It also does some pretty interesting stuff using web APIs to transform data (see video 3, in particular).

The tool focuses on the data clean-up side of things, rather than analysis and reporting. You may end up running into some trouble with larger data sets, as, I believe, the processing needs to be performed entirely in memory.

However, for data geeks out there, it's definitely worth a look and might even be a nice complement for Kirix Strata at times.

If you have a chance to play with it, feel free to let us know what you think in the comments below.

Cooking the (Quick)Books

Wednesday, January 14th, 2009

Illinois ST-1 ImageAh, tax season… could there be a more thrilling time of the year?

So, today I was reviewing a sales & use tax form for the State of Illinois. Since our governor really isn't helping matters in our state these days, we felt the least we could do to help was to make sure to pay our taxes on time.

So, I was looking at our sales tax report in Quickbooks and, like a good accountant, just quickly checked to make sure it matched up against the total revenues in the income statement. They didn't match.

Hmm… funny thing about accounting, things really ought to balance.

It was a small discrepancy, but after searching unsuccessfully for the difference, it was clear that the issue involved more than one transaction. And, unfortunately, there were just far too many transactions to try and come up with a solution manually.

So, since I happened to have this data browser laying around, I exported both reports as CSV files and opened them up in Kirix Strata™.

The Quickbooks CSVs were obviously meant for spreadsheet export (as it included subtotals and odd record breaks), so I quickly did some clean up and then did a few gymnastics to compare the tables. Turns out there were a few manual journal entries that weren't mapped to the sales tax codes required by Quickbooks. And here I was hoping to blame Quickbooks… oh well.

Running through this process was a 5 minute affair, but it made me wonder about all these other small data manipulation tasks that are out there. There have got to be millions, nay, billions, of these things — 5 minute one-off, ad hoc data tasks that just can't be solved with the help of a spreadsheet (in this case, grouping or relationships were needed to do this quickly).

What do people normally do in these situations? I fear that they probably spend hours working the problem manually. Got a similar story and/or solution? Feel free to share in the comments section below.

A CSV File You Can Believe In

Monday, December 1st, 2008

Change.gov logoThis is not a blog that delves into political issues, but I happened to notice that the Obama transition team released the names of all their donors today. However, inexplicably, they don't have them in a CSV format for easy slicing and dicing in your favorite data analysis software.

A couple clicks in Kirix Strata™ took care of that pretty quickly. (*.csv, 120 KB)

Some interesting bits of information:

  • Google is the employer with the most total donations at $14,200 (from “Google” and “Google, Inc.”, 8 employees).
  • Microsoft employees only gave $500 (2 employees)
  • 74 different colleges and universities were represented for $25,900 (81 employees)
  • 4 people who defined themselves as “Not Employed” gave a total of $11,250.
  • There are 1,776 donors in the list. Mere coincidence… or more evidence that Obama is truly “that one” (alternatively, the list could have been hacked because he is “the one“)?

The data is a little bit dirty (particularly the “Employer” field), but you might have some fun poking around. Shoot us a message in the comments if you find anything interesting.

P.S. Also, I saw this article about data overload during the campaign… looks like the Federal Election Commission could have used the Kirix Strata government discount. ;)

Update: Also, looks like George Lucas jumped in and we see an employee of the notorious Dewey, Cheetham & Howe

The Dirty Data of Data Mining

Tuesday, October 28th, 2008

Bulldozers in LandfillToday I came across a survey on data mining by a consulting firm called Rexer Analytics. Their survey took into account 348 responses from data mining professionals around the world. A few interesting tidbits:

* Dirty data, data access issues, and explaining data mining to others remain the top challenges faced by data miners.

* Data miners spend only 20% of their time on actual modeling. More than a third of their time is spent accessing and preparing data.

* In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.

We've found these issues to hold true with our clients as well, particularly in various auditing industries. Auditors will get a hold of their client's data, maybe in some delimited text file. The data set is inevitably too large for Excel to handle easily, so they may try Access (of course, once they are eternally frustrated, they give Kirix Strata™ a shot).

Once they can actually see that data set, they start exploring it to learn about what they're looking at and then inevitably find out how dirty it is. Multiple fields are mashed together or individual ones are stripped apart. Company names appear multiple times in various forms (”I.B.M” vs. “IBM”). An important bit of numeric information is embedded in a text field. There is no end of time spent “purifying” the data set to make sure to avoid the “garbage in, garbage out” syndrome.

Often overlooked, data cleansing is really as important as the analysis itself. Only once this step is complete can you move on to your data mining or other data analysis.

Check out the survey summary yourself and let us know if it matches your experience.


Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.