I am using Strata to identify duplicated entries in a table of contacts.
I have a table called contacts. Every contact has a contactID, firstName, lastName, title etc.
I have another table called addresses. Every entry here has a contactID, an addressID, and the standard address fields.
So far things have been working great... In under 30 minutes I was able to create a list of potential duplicate contacts while working with a dataset of 75,000 entries.
I couldn't even view the entire contacts table in Excel... let alone merge it with the address table to drag the related fields into my view. Strata has been a *huge* help.
The process I have gone through so far looks like this:
Set a one to many relationship from contacts to addresses, joining on contactID.
Drag the pertinent address fields into my contact table view -- street1, city, state, zip.
Create a calculated field called "dupeChecker" which concatenates firstName, lastName, title, street1, city, state, zip.
Group the contacts on dupeChecker. Add a count field to this report.
With the results of the grouping operation, filter on count > 1. This produces a list of all dupeCheckers that appear multiple times throughout the contact dataset.
Save this dataset as "dupeList". Set a one to many relationship from contacts to dupeList, joining on dupeChecker.
Drag the field dupeList.dupeChecker into my contacts table. If the contact appears on the dupeList, this field should be populated. If it doesn't, this field should be blank.
Filter contacts on dupeList.dupeChecker != "". Sort by dupeChecker to clump together potential dupes. Insert group breaks on dupeChecker to view clusters of potential dupes.
Save this dataset as "potentialDupes".
My original contacts dataset is 75,000 lines. potentialDupes is 15,000 lines.
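For anyone who wants to reproduce the steps above outside of Strata, here is a rough sketch of the same pipeline in Python/pandas. The column names are the real ones from my tables; the tiny inline data is made up purely for illustration, and the "|" separator in the key is my own choice, not anything Strata does:

```python
import pandas as pd

# Toy stand-ins for the real tables; the actual dataset has 75,000 contacts.
contacts = pd.DataFrame({
    "contactID": [1, 2, 3],
    "firstName": ["Ann", "Ann", "Bob"],
    "lastName":  ["Lee", "Lee", "Ray"],
    "title":     ["CEO", "CEO", "CTO"],
})
addresses = pd.DataFrame({
    "addressID": [10, 11, 12],
    "contactID": [1, 2, 3],
    "street1":   ["1 Driftwood Dr", "1 Driftwood Dr", "9 Oak St"],
    "city":      ["Austin", "Austin", "Waco"],
    "state":     ["TX", "TX", "TX"],
    "zip":       ["78701", "78701", "76701"],
})

# Steps 1-2: one-to-many join from contacts to addresses on contactID.
merged = contacts.merge(addresses, on="contactID", how="left")

# Step 3: build the concatenated dupeChecker key.
key_cols = ["firstName", "lastName", "title",
            "street1", "city", "state", "zip"]
merged["dupeChecker"] = merged[key_cols].astype(str).agg("|".join, axis=1)

# Steps 4-5: group on dupeChecker, count, keep keys that appear more than once.
counts = merged.groupby("dupeChecker").size()
dupe_list = counts[counts > 1].index          # this is the "dupeList"

# Steps 6-8: keep only contacts whose key is on the dupeList,
# sorted so potential dupes clump together.
potential_dupes = (merged[merged["dupeChecker"].isin(dupe_list)]
                   .sort_values("dupeChecker"))
print(potential_dupes[["contactID", "dupeChecker"]])
```

Same logic, same result: every surviving row's key appears at least twice, and sorting on the key groups the clusters.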
Word on the street is that you guys have some experience in identifying duplicate entries in various datasets. I have some questions for your expertise...
Improving the quality of the search:
-What can I do to improve my dupeChecker field? What fields have you found to be the best indicators of dupes? Is street1 really necessary? And if so, what about the difference between "Driftwood Dr" and "Driftwood Drive"?
-Is there a function I can use to strip all non-alphanumeric characters? Quotes and various punctuation were giving me issues.
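I don't know whether Strata exposes a function like this, but for reference, here is the kind of normalization I'm picturing, sketched in plain Python. The suffix map is just my own guess at a starting point, not any standard list:

```python
import re

# Hypothetical normalizer for the dupeChecker input fields:
# lowercase, expand a few common street-suffix abbreviations,
# then strip everything that isn't a letter or digit.
SUFFIXES = {"dr": "drive", "st": "street", "ave": "avenue",
            "rd": "road", "ln": "lane"}

def normalize(value):
    words = value.lower().split()
    # Drop trailing punctuation before the lookup so "Dr." matches "dr".
    words = [SUFFIXES.get(w.strip(".,"), w) for w in words]
    return re.sub(r"[^a-z0-9]", "", " ".join(words))

# "Driftwood Dr." and "Driftwood Drive" now produce the same key.
print(normalize("Driftwood Dr."), normalize("Driftwood Drive"))
```

Stripping the punctuation after the suffix expansion also takes care of the quotes and stray characters that were breaking my key.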
The next steps...
Identifying potential dupes is one thing. Actually removing them from the dataset is another. What processes have you guys seen as the best way to work through the list of potential dupes? Out of three potential dupes, it's not always easy to figure out which is the "correct" entry. And sometimes the "correct" entry may be a merging of the fields. I assume this phase is going to require some degree of manual labor. Any tips you've found for speeding it up?
I'm sure there are many more questions to come.