Removing or Consolidating Duplicate Email Addresses from Website Form Data

If you have text entry forms on your website, you probably have data with duplicate email addresses. These duplicates can either be from data entered twice into the same form or are duplicates from merging data from multiple applications/forms.

So, two questions arise:

How do you identify these duplicates?
How do you either remove them or group them together to track the related information?

Using Kirix Strata's grouping functionality, it's actually pretty easy. You can quickly identify duplicates from your website data and then either remove the duplicates or consolidate the different records into groups of related records. Let's look at the problem more closely.

Suppose you have a web page that asks your visitors for the following feedback information:

Dupformdata1

Visitors will enter their information, including their email address, which allows you to respond to them. However, if the visitor stops by again in the future, you'll have multiple records from the same person and therefore duplicated email addresses:

Screenshot - Dupform2

Normally, these duplicates aren't a problem since the database is just storing two different comments by the same person. The problem with these multiple records comes when we want to analyze the data further, either to create a unique list of email addresses, or to find users that have submitted multiple messages and see the multiple messages they submitted.

Before we can analyze the data, we have to access it; for web data, we might have the data in a MySQL or Oracle database, or we might have it in raw text files, such as CSV files. In either case, we can readily open it in Strata by either importing it, connecting to it or opening it directly.

Grouping Data Together

Once we've got the data in, we just need to create a list of unique email addresses. If we were performing this task using SQL, we would write:

select email, count(*)
  from feedback
 group by email;

Of course, consolidating email lists with SQL in this manner is pretty straightforward; the only moderate challenges, other than knowing SQL syntax, are accessing the data, knowing the tables and fields to query and then actually doing something with the results. Each of these steps can be a little tricky if you don't regularly interact with databases or you're using tools designed more for database administration than analysis.

In Strata, consolidating email lists is just as straightforward as doing this with SQL, but it includes the following advantages:

you can easily see and interact with the data
you can readily access the fields and structure so it is easy to design the test
you can quickly manipulate your results, similar to how you would with a spreadsheet

To create a unique list of emails in Strata, simply open the table, then group on the email field:

Screenshot - Dupform3

Hit the “Run” button and we get our grouped results of unique email addresses and the number of times it appears in the table:

Screenshot - Dupform4

Grouping Data Together and Filtering the Results

Now that we've solved the first issue, we want to solve the second problem, which is to find all the messages from visitors that have submitted multiple messages. In SQL, we would have to create a "select of a select":

select *
  from feedback
 where email in
           (select email
              from feedback
             group by email
            having count(*) > 1);

For most casual users of SQL, this statement is a bit more complicated. It tells the database to return only the records in the feedback table when the email address is used more than once.

Fortunately, in Strata, finding and grouping all these messages is very simple. We just open the table and group the email field like we did before, and then just add a couple extra parameters:

Screenshot - Dupform5

As shown above, we are doing the following things:

Grouping the table based upon unique email address.
Counting the number of records with a given email address.
Assigning a unique group ID to each group.
Only outputting the groups that appear more than once (count() > 1) and include all the detail records in the output table.

The resulting table shows us only the records with duplicated email addresses:

Screenshot - Dupform6

Summary

This is just a simple example that shows off some fairly useful functionality. Grouping is one of those data analysis tools that is used all the time for summarizing data in a table. The group filter provides even more power by letting you look at the individual records that make up a certain statistic or metric, based on some aggregate function (e.g., sum() > 2000, avg() <= 50, count() > 1). In the original group, you see that two records are duplicated by the grouping operation. Once you have that statistic, you want to see the detail related to the group. This is trivial with only a few records but is pure gold when you can't just eyeball a data set.

This little tool comes in handy quite a bit. For example, you may want to see:

duplicate records by name, address, city, zip, or some other field/combination of fields.
individual telephone calls to a particular telephone number that when combined exceed a certain total call time.
individual IP addresses in a web log that at some time in their history went to a particular page and downloaded a file.
all the pages visited from a given IP address, when at some point that IP address went to a particular page.
all the pages visited from a given IP address, when that IP address has visited at least 10 times over a one month span
all the purchases of a given customer when that customer purchased at least two or more particular types of items.

All of these tasks can be accomplished using the grouping tool. For a bit more information on using the grouping panel, check out the Strata grouping help page.

Do you have any website data analysis that you need help with? Just let us know and we'll be happy to explore the issues with you.

This entry was posted on Thursday, June 19th, 2008 at 2:57 pm and is filed under duplicates, email, groups, web data. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

Benni says:
July 18th, 2008 at 10:43 am

Hey, this sounds really helpful. I need to do this every once in a while and it's nice to have a tutorial like that at hand.

Using Relationships to Compare Lists of Emails | Strata Blog says:
July 31st, 2008 at 3:41 pm

[…] a previous article, we were concerned with identifying and removing duplicate emails within a single table. But let's say that email addresses are located in multiple tables — how do we find […]

Kirix Strata Blog