Dirty Data | Kirix Strata Blog

Kirix Strata Blog

Archive for the ‘dirty data' Category

Tutorial: Creating Calculations and Cleaning up Dirty Data

Tuesday, December 16th, 2008

Continuing with our video tutorial series, here's one on creating calculations and using formulas. This one briefly looks at how to create calculated fields, but quickly delves into more complex formulas and functions. This one would be worth watching, even if you're an expert — it may give you a few new ideas.

Play Calculations Video Tutorial

Updating and Replacing Values in Cells

Friday, July 18th, 2008

Strata Tips and TricksWe had a really good question come up in our forums regarding Strata's ability to replace values within cells.

Bottom line, if you have a value in one field and want to auto-replace it with a different value, you can easily do this using the Update Records tool (Data > Update Records…). In a nutshell, it works like this:

  1. Choose the field you want to update.
  2. Choose the new value you want to update with.
  3. Create the logic to let Strata know when to update a cell and when not to update a cell.

Both #2 and #3 enable you to get pretty complex, since you can use formulas in both of these areas. The “update with” (#2) area can be particularly tricky, since it gives you the ability to transform values on the fly.

However, it had been a while since I'd personally gone through the various options one can use to replace records, but as I did, I quickly realized our documentation wasn't comprehensive enough.

So, I added the “replace” rules to the post here.

Hope this helps. Now, if I only could figure out a way to replace Friday with Saturday… :) Hope everyone has a good weekend!

P.S. Got a question about Strata that you've been curious about? Please post a note on the support forums and we'll be happy to help!

Checking Date Ranges Prior to Analyzing New Data Sets

Thursday, June 26th, 2008

Date Range Extension ImageSeasoned data analysts know that one of the first things you need to do with a new, unfamiliar data set, is to run some basic tests to determine what kind of animal you're working with. This is particularly important when working with larger data sets that may be amalgamated from multiple systems or appended together from archived files.

One of these tests is a date range check. So, for example, if a client has shipped you all the data from the first 6 months of 2007, you want to make sure you actually have a full, complete 6 months of data to work with. In fact, you'd like to see something like this:

12/2006 - 43 records
01/2007 - 255 records
02/2007 - 249 records
03/2007 - 265 records
04/2007 - 287 records
05/2007 - 259 records
06/2007 - 263 records
07/2007 - 53 records

The outlying dates on the end (12/2006 and 7/2007) do provide some comfort that the data set is truly complete. However, it is surprising how often you'll actually see something like this:

01/1999 - 196 records
12/2006 - 43 records
01/2007 - 255 records
02/2007 - 249 records
03/2007 - 96 records
04/2007 - 287 records
05/2007 - 259 records
06/2007 - 263 records
07/2007 - 53 records

This second example is a dirtier data set; there is a strange, high-count outlier from 1999 and we also see that there was a significant drop in the record count during March 2007.

Before you actually start performing your analysis, you'd want to investigate the items from 1999, which could just be empty records that can be ignored or, worse, could be something wrong with the formatting of these records. The precipitous drop in March 2007 is a little more worrisome. Was it because sales dipped drastically that month or was it because there was an error when the IT department appended this data set together?

Whatever the cause, it's better to get your data in order and make sure you have a complete set before jumping into your analysis and providing that client with incorrect or skewed results. In order to help you to do this, we've created a simple date range analysis extension. Running this utility on a new data set from the get-go can save you a lot of time and hassle later on.

You can install the date range analysis extension and learn how to use it here. Got some other data utilities you'd like in your toolkit? Let us know.