Data and the Web

Archive for October, 2008

The Dirty Data of Data Mining

Tuesday, October 28th, 2008

Today I came across a survey on data mining by a consulting firm called Rexer Analytics.  Their survey drew on 348 responses from data mining professionals around the world.  A few interesting tidbits:

* Dirty data, data access issues, and explaining data mining to others remain the top challenges faced by data miners.

* Data miners spend only 20% of their time on actual modeling.  More than a third of their time is spent accessing and preparing data.

* In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.

We’ve found these issues to hold true with our clients as well, particularly in various auditing industries.  Auditors will get hold of their client’s data, maybe in some delimited text file.  The data set is inevitably too large for Excel to handle easily, so they may try Access (and of course, once Access frustrates them too, they give Kirix Strata™ a shot).

Once they can actually see that data set, they start exploring it to learn what they’re looking at and then inevitably find out how dirty it is.  Multiple fields are mashed together or individual ones are split apart.  Company names appear multiple times in various forms (“I.B.M” vs. “IBM”).  An important bit of numeric information is embedded in a text field.  No end of time is spent “purifying” the data set to avoid the “garbage in, garbage out” syndrome.
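To make the kind of scrubbing we mean concrete, here is a minimal Python sketch of two of the fixes above: unifying variant company names and pulling a number out of a free-text field.  The column names and the alias table are purely illustrative, not from any real client data set.

```python
import re

# Hypothetical alias table mapping variant spellings to one canonical form.
ALIASES = {"I.B.M": "IBM", "I.B.M.": "IBM"}

def clean_row(row):
    # Normalize company-name variants to a single spelling.
    name = row["company"].strip()
    row["company"] = ALIASES.get(name, name)
    # Extract a numeric amount buried in a free-text field, if present.
    m = re.search(r"-?\d+(?:\.\d+)?", row["notes"])
    row["amount"] = float(m.group()) if m else None
    return row

dirty = [
    {"company": "I.B.M", "notes": "invoice total 1042.50"},
    {"company": "IBM",   "notes": "no amount recorded"},
]
clean = [clean_row(dict(r)) for r in dirty]
```

In practice the alias table grows as you explore the data, which is exactly why getting a full view of the data set first matters so much.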

Though often overlooked, data cleansing is as important as the analysis itself.  Only once this step is complete can you move on to your data mining or other data analysis.

Check out the survey summary yourself and let us know if it matches your experience.

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.