Data and the Web

Archive for the ‘data analysis’ Category

Data Clean Up, Brought to You by Google

Thursday, November 18th, 2010

I recently saw this announcement for an open source tool and thought it might be interesting to some folks that deal with messy data sets.

Google Refine provides an interesting take on grouping and filtering data and then getting it cleaned up. It also does some pretty interesting stuff using web APIs to transform data (see video 3, in particular).

The tool focuses on the data clean-up side of things, rather than analysis and reporting. You may end up running into some trouble with larger data sets, as, I believe, the processing needs to be performed entirely in memory.

However, for data geeks out there, it's definitely worth a look and might even be a nice complement for Kirix Strata at times.

If you have a chance to play with it, feel free to let us know what you think in the comments below.

Posted in data analysis, dirty data | Comments Off

AWS Public Data Sets Continues to Expand

Wednesday, February 25th, 2009

Previously, we posted some information on Amazon's foray into making huge public data sets available to users of their web services. Yesterday they announced some very sizable additions:

US Bureau of Transportation Statistics
DBpedia Knowledge Base (67 GB)
Freebase Data Dump (66 GB)
Genbank Genetic Sequence Database (250 GB)

If you use AWS, the announcement provides more info on these datasets as well as how to access them. If you don't use AWS, you can still access much of this data directly from the websites linked above.

Posted in data analysis, data mining, data repositories | Comments Off

Cooking the (Quick)Books

Wednesday, January 14th, 2009

Ah, tax season… could there be a more thrilling time of the year?

So, today I was reviewing a sales & use tax form for the State of Illinois. Since our governor really isn't helping matters in our state these days, we felt the least we could do to help was to make sure to pay our taxes on time.

So, I was looking at our sales tax report in Quickbooks and, like a good accountant, just quickly checked to make sure it matched up against the total revenues in the income statement. They didn't match.

Hmm… funny thing about accounting, things really ought to balance.

It was a small discrepancy, but after searching unsuccessfully for the difference, it was clear that the issue involved more than one transaction. And, unfortunately, there were just far too many transactions to try and come up with a solution manually.

So, since I happened to have this data browser lying around, I exported both reports as CSV files and opened them up in Kirix Strata™.

The Quickbooks CSVs were obviously meant for spreadsheet export (as they included subtotals and odd record breaks), so I quickly did some clean up and then did a few gymnastics to compare the tables. Turns out there were a few manual journal entries that weren't mapped to the sales tax codes required by Quickbooks. And here I was hoping to blame Quickbooks… oh well.
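For anyone without a data browser handy, the same clean-up-and-compare idea can be sketched in a few lines of Python. This is only an illustration, not what was actually done in Strata: the column names (`Txn`, `Amount`), the sample rows and the helper names are all made up for the example, and a real Quickbooks export would need its own handling.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

def load_totals(csv_text, key_col, amount_col):
    """Sum amounts per key, skipping subtotal rows and blank record breaks."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row.get(key_col) or "").strip()
        amount = (row.get(amount_col) or "").strip()
        # Spreadsheet-style exports tend to carry "Total" lines and empty rows
        if not key or not amount or key.lower().startswith("total"):
            continue
        totals[key] += Decimal(amount.replace(",", ""))
    return dict(totals)

def find_mismatches(left, right):
    """Return {key: (left_amount, right_amount)} wherever the two sides differ."""
    diffs = {}
    for key in sorted(set(left) | set(right)):
        a = left.get(key, Decimal("0"))
        b = right.get(key, Decimal("0"))
        if a != b:
            diffs[key] = (a, b)
    return diffs

# Hypothetical exports: transaction 1003 is missing from the sales tax report
sales_tax_report = 'Txn,Amount\n1001,"1,200.00"\n1002,350.00\n\nTotal,"1,550.00"\n'
income_statement = 'Txn,Amount\n1001,"1,200.00"\n1002,350.00\n1003,75.50\nTotal,"1,625.50"\n'

mismatches = find_mismatches(
    load_totals(sales_tax_report, "Txn", "Amount"),
    load_totals(income_statement, "Txn", "Amount"),
)
print(mismatches)
```

Using `Decimal` rather than floats keeps the cents exact, which matters when the whole point is to find a small discrepancy.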

Running through this process was a 5 minute affair, but it made me wonder about all these other small data manipulation tasks that are out there. There have got to be millions, nay, billions, of these things — 5 minute one-off, ad hoc data tasks that just can't be solved with the help of a spreadsheet (in this case, grouping or relationships were needed to do this quickly).

What do people normally do in these situations? I fear that they probably spend hours working the problem manually. Got a similar story and/or solution? Feel free to share in the comments section below.

Posted in data analysis, dirty data, spreadsheets | 2 Comments »

Amazon Gets into the Public Data Sets Game

Thursday, December 4th, 2008

Amazon announced the launch of its Public Data Sets service this evening. Bottom line, they asked people for different public or non-proprietary data sets and they got ’em. Here's a sample of the (pretty hefty) stuff they are hosting for free:

Annotated Human Genome Data provided by ENSEMBL
A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University
Various US Census Databases provided by The US Census Bureau
Various Labor Statistics Databases provided by The Bureau of Labor Statistics

Though the individual sets are huge, there aren't many of them at this point, but it appears that Amazon will be filling this out over time.

How do you access them? Well, there's a slight hitch. You need to fire up an EC2 instance, hook into the set and then perform your analysis. You just pay for the cost of the EC2 service. Given how massive these tables are, it seems like a pretty good way to go. A step closer to the supercomputer in the cloud.

We're devoted users of Amazon S3 here and have also done some work with EC2, which is quite impressive. Overall, this is another example of a nice trend where large data sets are becoming more easily accessible.

If anyone has the chance to play with this service, let us know how it goes.

Posted in data analysis, data mining, data repositories | 3 Comments »

A CSV File You Can Believe In

Monday, December 1st, 2008

This is not a blog that delves into political issues, but I happened to notice that the Obama transition team released the names of all their donors today. However, inexplicably, they don't have them in a CSV format for easy slicing and dicing in your favorite data analysis software.

A couple clicks in Kirix Strata™ took care of that pretty quickly. (*.csv, 120 KB)

Some interesting bits of information:

Google is the employer with the most total donations at $14,200 (from “Google” and “Google, Inc.”, 8 employees).
Microsoft employees only gave $500 (2 employees)
74 different colleges and universities were represented for $25,900 (81 employees)
4 people who defined themselves as “Not Employed” gave a total of $11,250.
There are 1,776 donors in the list. Mere coincidence… or more evidence that Obama is truly “that one” (alternatively, the list could have been hacked because he is “the one”)?

The data is a little bit dirty (particularly the “Employer” field), but you might have some fun poking around. Shoot us a message in the comments if you find anything interesting.
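If you'd rather poke at the CSV in a script, here is a rough Python sketch of the kind of grouping behind the numbers above. The rows are invented for illustration (they are not from the donor list), and the suffix list in `normalize_employer` is just a guess at what a dirty “Employer” field might need.

```python
import re
from collections import defaultdict

def normalize_employer(name):
    """Collapse case, punctuation and a few common corporate suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", name).strip().lower()
    return re.sub(r"\s+(inc|llc|corp|co)$", "", cleaned)

def totals_by_employer(rows):
    """rows: iterable of (employer, amount). Returns {employer: (total, donors)}."""
    totals = defaultdict(lambda: [0, 0])
    for employer, amount in rows:
        entry = totals[normalize_employer(employer)]
        entry[0] += amount   # running donation total
        entry[1] += 1        # number of donors
    return {name: tuple(values) for name, values in totals.items()}

# Made-up sample rows, not real donor data
sample_rows = [("Google", 5000), ("Google, Inc.", 2500), ("Microsoft", 250)]
summary = totals_by_employer(sample_rows)
print(summary)
```

The normalization step is what lets “Google” and “Google, Inc.” land in the same bucket; without it, each spelling would be counted as its own employer.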

P.S. Also, I saw this article about data overload during the campaign… looks like the Federal Election Commission could have used the Kirix Strata government discount.

Update: Also, looks like George Lucas jumped in and we see an employee of the notorious Dewey, Cheetham & Howe…

Posted in data analysis, data mining, dirty data, government | 1 Comment »

The Dirty Data of Data Mining

Tuesday, October 28th, 2008

Today I came across a survey on data mining by a consulting firm called Rexer Analytics. Their survey took into account 348 responses from data mining professionals around the world. A few interesting tidbits:

* Dirty data, data access issues, and explaining data mining to others remain the top challenges faced by data miners.

* Data miners spend only 20% of their time on actual modeling. More than a third of their time is spent accessing and preparing data.

* In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.

We've found these issues to hold true with our clients as well, particularly in various auditing industries. Auditors will get a hold of their client's data, maybe in some delimited text file. The data set is inevitably too large for Excel to handle easily, so they may try Access (of course, once they are eternally frustrated, they give Kirix Strata™ a shot).

Once they can actually see that data set, they start exploring it to learn about what they're looking at and then inevitably find out how dirty it is. Multiple fields are mashed together or individual ones are stripped apart. Company names appear multiple times in various forms (“I.B.M” vs. “IBM”). An important bit of numeric information is embedded in a text field. There is no end of time spent “purifying” the data set to make sure to avoid the “garbage in, garbage out” syndrome.
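Two of those problems, inconsistent company names and numbers buried in text, can be illustrated with a small Python sketch. The helper names and sample strings are hypothetical, and the rules are deliberately crude; real data usually needs more careful matching than this.

```python
import re

def company_key(name):
    """Build a comparison key: drop punctuation and whitespace, then upper-case."""
    return re.sub(r"[\W_]+", "", name).upper()

def extract_amount(text):
    """Pull the first number (thousands separators and decimals allowed) out of free text."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

# Both spellings collapse to the same key
same_company = company_key("I.B.M") == company_key("IBM")

# Hypothetical text field with a number embedded in it
amount = extract_amount("Invoice total: $1,250.75 due on receipt")
print(same_company, amount)
```

A key like this is only meant for grouping and comparison; the original name should be kept alongside it so nothing is lost.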

Often overlooked, data cleansing is really as important as the analysis itself. Only once this step is complete can you move on to your data mining or other data analysis.

Check out the survey summary yourself and let us know if it matches your experience.

Posted in data analysis, data mining, dirty data | Comments Off

Moving Toward Business Intelligence 2.0

Tuesday, June 3rd, 2008

I just read a pretty interesting article by Neil Raden called “Business Intelligence 2.0: Simpler, More Accessible, Inevitable” (HT: Snaplogic) and would recommend giving it a read.

Historically, business intelligence hasn't been all that it's cracked up to be. Very expensive data warehousing systems are put in place. Existing reports are re-created and all kinds of new objects/reports are added. Everyone is thoroughly trained on the system. Pretty 3D graphics are added to the dashboard. The project goes over budget. Users revert to using Excel.

Some would say that BI is just a fancy way to do organizational reporting. There's a lot of truth to this; why else do people continue to rely on their spreadsheets when they need to do some quick and dirty analysis? I think the answer is that there is a substantial ad hoc component to the “intelligence” part of business intelligence that will never be captured by a large, centralized system.

Having a few BI gurus setting up reports for everyone just isn't an efficient use of resources. Nor does it capture the collective brain power of the organization. And there is quite a bit of this power ready to be tapped, even in the deepest corners of a company.

For example, we've done a lot of work with folks in the accounts payable industry. AP is not what you'd call a very sexy part of the organization; however, as the keeper of the company checkbook, it has billions of dollars flowing through it each year. There are efficiencies to be gained, analyses to be done and, in our experience, a whole slew of people eager to do a bang-up job. However, when an AP manager needs to get something from the legacy system or just wants to create a new type of report, they have one of two options: either go to IT and hope they can get a report created within the next couple weeks, or go to the mattresses with Excel/Access and do what they need to do themselves.

Neil echoes this when comparing BI 1.0 to BI 2.0:

BI 1.0 Fallacy: Most users want to be spoon-fed information and will never take the initiative to create their own environment or investigate the best way to get the answers they need.

BI 2.0 Reality: The Consumer Web invalidates this idea. When given simple tools to do something that is important and/or useful to them, people find a way to “mash up” what they need.

We've seen people's initiative on display time and again and are really happy that Kirix Strata is playing a part in making this type of ad hoc analysis, integration and reporting easier than ever.

So, give those articles a read and see what you think. Also, please consider joining us on Wednesday at 1 PM EDT for our free hour-long web seminar with Snaplogic called “BI 2.0: Data Visualization and Spreadsheets on Steroids.” All the pertinent details can be found here. Hope to see you then!

Posted in business intelligence, data analysis, spreadsheets | 2 Comments »

A Business Intelligence Browser

Wednesday, April 23rd, 2008

During our journey from the MySQL Conference last week to the Web 2.0 Expo this week, we've continued to see a common pattern. There are a tremendous number of back-end business systems and almost all of them seem to rely on the humble Web browser for the front-end.

Now there is a very good reason for this — the Web is a tremendous distribution platform and companies are taking advantage of the same protocols and tools on their intranet that have already propelled the growth of the Internet. And, bottom line, browsers are really, really easy to use.

Data Integration 2.0

As an example of this situation, we met some folks last week from Snaplogic. Their product makes it really simple to integrate data from back-end systems and enables you to access data from anywhere, mash it up and publish it. However, when they were demoing the software, there wasn't particularly much to show off, since a normal browser can't do much with the data except let you “Save As” to your desktop.

Thankfully, this type of front-end work is perfectly suited for a data browser. So, we decided to demo each other's products at the Web 2.0 Expo this week. Snaplogic is able to easily show the value they bring to the table by actually displaying the data integration they can do. We put together a quick dashboard with Snaplogic data and were able to show off all the cool stuff you can do once you've got a data feed to work with. Indeed, this was like the serendipitous combination of chocolate and peanut butter.

The Last Mile

This reminded me of a great metaphor used in a post by Juice Analytics called “The Last Mile of Business Intelligence.” The idea is that businesses have built lots of large back end systems — from data integration and warehousing to BI and reporting systems. But, in the end, it still seems that actual users are left out in the cold. The “last mile” between the heavy machinery and the actual knowledge worker has not been adequately bridged.

So unfortunately, the typical pattern involves working with a back end system via a Web browser and then, ultimately, exporting a CSV file to work with in Excel.

This is not terribly efficient. A data browser enables you to open up those same intranet pages and perform your normal browsing tasks. However, because it's a data analysis tool at heart, you can actually open up data sets and start working with them directly. Then you can do fun things like bookmark tables, build calculations, create reports and write queries and have it refresh the next day with the latest data. Because of this, it plays very nicely in the worlds of exploration and reporting per Juice's “data value framework.”

Why Incorporate the Web?

In the past we've shown off some of the fun things Kirix Strata™ does, like opening up HTML tables natively or showing RSS feeds in a table view, but often we'll get a response like, “that's cool, but what do I do with it?” There is no question that tools like these can be very useful for some tasks (e.g., integrating data from the web with local tables), but they're really just icing on the cake.

The important thing is how tightly the web is integrated with the data analysis tool. This opens up all kinds of possibilities for direct interaction with both the Web infrastructure and the database systems that power business intelligence in the enterprise. Add some Web APIs into the mix, and now you could have access to the contact information in your ACT database, the corporate newsletter contacts in a MySQL system and all the recent sales information from your Salesforce.com account. Explore and integrate at will.

Given that the meme “Enterprise 2.0” resounds louder each day, we should only expect to see Web infrastructure reaching deeper and deeper into organizations. In the end, maybe part of the BI “last mile” solution is just giving businesses a browser with a bit more intelligence.

Posted in business intelligence, data analysis, spreadsheets | 1 Comment »

Pill Bugs, Potato Bugs or Doodlebugs?

Wednesday, November 14th, 2007

If you haven't checked out what the fine folks at Many Eyes are doing with “community” data visualization, it is well worth a peek. I took a look at one of their recent blog posts today regarding the new map visualizations they are offering. Very nice stuff indeed.

However, the thing that really caught my [many] eye[s] was the mention of a data set denoting “the regional slang for those odd little bugs that curl into balls.” This is definitely not something that keeps me up at night, but I've always wondered about the monikers of these benign little creatures. I grew up calling them “pill bugs.” However, probably due to some deep psychological desire to be accepted in elementary school, I eventually started referring to them as roly-polies, since that is what my friends called them.

I dug up the visualization in question and was pleased to see that I probably wasn't the only kid in Chicagoland that may have struggled with these entomological naming conventions; in fact, there are a bunch of other names given to these li'l guys across the US. For posterity, Illinoisans call this thing a Pill Bug, Roly-Poly, Potato Bug, Sow Bug and Doodlebug 15%, 41%, 7%, 6% and 3% of the time, respectively.

The one downside that I've encountered with Many Eyes is they only seem to provide the underlying data set in a .txt file (albeit in tab delimited format). Kirix Strata™ doesn't recognize the tsv-in-txt's-clothing just yet, so you need to save the file to the desktop and then open it up in Strata (selecting a text-delimited file type with a tab delimiter). But, once you get it in there, fire up your analytical skills and manipulate away.
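The same “it's really a TSV” trick works in a script, too. Here is a minimal Python sketch that reads tab-delimited text regardless of the file extension. The column names (`State`, `Term`, `Percent`) are assumptions for the example and not the actual Many Eyes layout; the sample values are the Illinois figures quoted above.

```python
import csv
import io

def read_tab_delimited(text):
    """Parse tab-delimited text (whatever the file extension) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

# Illinois figures from the post, laid out in a made-up column structure
sample = (
    "State\tTerm\tPercent\n"
    "Illinois\tPill Bug\t15\n"
    "Illinois\tRoly-Poly\t41\n"
    "Illinois\tPotato Bug\t7\n"
    "Illinois\tSow Bug\t6\n"
    "Illinois\tDoodlebug\t3\n"
)

rows = read_tab_delimited(sample)
most_common = max(rows, key=lambda row: int(row["Percent"]))
print(most_common["Term"])
```

For a file on disk, the same reader can be pointed at `open(path, newline="")` instead of the in-memory string.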

Also, please report any bugs while you're at it.

Edit: Per Wikipedia, there are a bunch of other common names associated with these critters, including my favorite, “cheeselog.”

Posted in data analysis, data visualization | Comments Off

Embedded phpBB Search Terms within Apache Web Logs

Friday, August 24th, 2007

This afternoon I was doing some analysis on our web logs and thought it might make for a good screencast and blog post. We currently use a combination of AWstats and Google Analytics for our web stats but are increasingly using Kirix Strata™ to dig deeper into the raw web logs for the more customized things that aren't readily available otherwise.

Also, honestly, it is kind of fun to plow through almost a million records on your own. Hmmm, maybe I should get out more.

The topic of the screencast below is the search terms people enter to find things in our phpBB3 support forums. These terms are embedded in the “request” field of the Apache logs and I couldn't find a way to get them without digging into the logs themselves (NOTE: I wouldn't doubt that there is some way to do this via a mod to phpBB or a filter in Google Analytics… but since I couldn't find anything via a quick Google search, using Strata just ended up being a lot faster).

An example of a search string we're dealing with is:

GET /forums/search.php?keywords=proxy HTTP/1.1

So the trick was to parse the search keywords out of the field and then group them together to see what people were searching for… and in turn give us the chance to improve our support area by targeting some of these search terms and expanding our documentation accordingly.
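For readers who want to try the parse-and-group step outside of Strata, here is a rough Python equivalent. It is a sketch rather than the method used in the screencast: the function names are made up, it assumes the request field looks like the example above, and every sample request other than the `proxy` one is invented.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

def search_keywords(request):
    """Return the phpBB search keywords from an Apache request field, or None."""
    parts = request.split(" ")
    if len(parts) < 2:
        return None
    url = urlsplit(parts[1])
    if not url.path.endswith("/search.php"):
        return None
    values = parse_qs(url.query).get("keywords")
    # Lower-case so "Proxy" and "proxy" are grouped together
    return values[0].strip().lower() if values else None

def count_keywords(requests):
    """Group the extracted keywords and count how often each one appears."""
    found = (search_keywords(request) for request in requests)
    return Counter(keyword for keyword in found if keyword)

sample_requests = [
    "GET /forums/search.php?keywords=proxy HTTP/1.1",
    "GET /forums/search.php?keywords=Proxy HTTP/1.1",
    "GET /forums/search.php?keywords=import+csv HTTP/1.1",
    "GET /forums/index.php HTTP/1.1",
]
counts = count_keywords(sample_requests)
print(counts.most_common())
```

`parse_qs` also takes care of URL decoding, so a search for `import+csv` comes out as “import csv” rather than needing a separate clean-up pass.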

Hope this video proves helpful:

(And here's an embeddable YouTube version…)

TECHNICAL NOTE:

I downloaded the Apache logs from the server and, due to the file size, decided to import them into Strata rather than open the file and work with it directly. To import your logs, go to Import, select text-delimited files, and then import as space delimited with quotation marks as the text qualifier. Update: You can now use a handy little log parsing extension to pull in your web log files without having to mess around with a straight text import.
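The same import settings (space delimited, quotation marks as the text qualifier) map onto Python's `csv` module fairly directly. The sketch below uses an invented log line in the common “combined” layout; note that with a plain space split the bracketed timestamp lands in two columns, so the request field ends up at index 5. Lines containing backslash-escaped quotes would need extra handling that isn't shown here.

```python
import csv
import io

def parse_log_lines(text):
    """Split each log line on spaces, treating double quotes as the text qualifier."""
    reader = csv.reader(io.StringIO(text), delimiter=" ", quotechar='"')
    return [row for row in reader if row]

# Invented sample line; the address, timestamp and sizes are placeholders
sample_log = (
    '192.0.2.10 - - [24/Aug/2007:14:02:11 -0500] '
    '"GET /forums/search.php?keywords=proxy HTTP/1.1" 200 5120 "-" "Mozilla/5.0"\n'
)

rows = parse_log_lines(sample_log)
request_field = rows[0][5]
print(request_field)
```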

TECHNICAL NOTE 2:

For posterity, here are the functions that were used in this screencast:

STRPART(string, section [, delimiter])
SUBSTR(string, start [, length])
CONTAINS(string, search string)
IIF(boolean test, true value, false value)

Posted in data analysis, examples, videos, web log analysis | 1 Comment »