Merge/Purge and Duplicate Detection

Techniques for removing redundant records in an address database.

Duplicate detection is accomplished by finding all records that contain exactly or approximately the same data in one or more fields. For example, consider this file fragment of five records containing six fields in each record:

   Name    Address           City         St ZIP®       Phone         
   ------  ----------------- ------------ -- ---------- --------------
1  SMITH   2 E 13TH ST       CHICAGO      IL 60601-2407 (312) 458 9992
2          2 13 ST EAST      CHICAGO      IL 60605      SMITH
3  SMTH    2 EAST 13TH       CHICAGO LAWN IL            312-458-9992
4  SMITH   2 E THIRTEENTH ST CHICAGO      IL 60605      458-9992
5  SMITH   TWO EAST 13TH ST  CHICAGO IL      60602      312-458-9991

Although all five of the above records intend to refer to the same person at the same address, no two records are exactly alike. If the above records were part of a larger file, consider the following attempts at finding duplicates in the file:

Attempt 1: Select records with the same address field. Finds none of the above records.

Attempt 2: Select records with the same name and same five-digit ZIP. Misses records 1, 3, and 5.

Attempt 3: Select records with the name "SMITH". Misses records 2 and 3 (while probably matching lots of other SMITHs at other addresses).

Obviously, a good strategy for duplicate detection requires a good strategy for dealing with inconsistent data. An excellent first step toward removing inconsistent data is to standardize addresses with CASS Certified™ address correction software. Another important technique is to use data entry programs that validate field formats, to prevent errors such as allowing names in the phone number field. For example, after address correction and field validations, the above examples become:

   Name   Address     City    St ZIP        Phone       
   -----  ----------- ------- -- ---------- ------------
1  SMITH  2 E 13TH ST CHICAGO IL 60601-2407 312-458-9992
2  SMITH  2 E 13TH ST CHICAGO IL 60601-2407
3  SMTH   2 E 13TH ST CHICAGO IL 60601-2407 312-458-9992
4  SMITH  2 E 13TH ST CHICAGO IL 60601-2407 XXX-458-9992
5  SMITH  2 E 13TH ST CHICAGO IL 60601-2407 312-458-9991
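The field-validation step mentioned above can be sketched as a simple check run at data entry. This is a hypothetical example (not any particular validation product's API); it only verifies that the phone field holds a plausible 7- or 10-digit phone number rather than stray text like a name:

```python
import re

def valid_phone(value):
    """Accept a US phone field only if it contains no letters and
    reduces to 7 or 10 digits; rejects stray text like 'SMITH'."""
    digits = re.sub(r"\D", "", value)
    return not re.search(r"[A-Za-z]", value) and len(digits) in (7, 10)
```

A check like this would have caught record 2 in the original fragment, where a name landed in the phone field.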

After standardizing as shown above, any attempt at duplicate detection will have a much better chance of finding the correct group of duplicates. In fact, selecting "records with the same address, ZIP, and soundex name" is an attempt that works perfectly on the above example. However, that attempt will still fail to find duplicates like the following (one person at one phone number, with two different versions of his name, and a shipping address different from his mailing address):
Name         Address     City  St ZIP   Phone       
------------ ----------- ----- -- ----- ------------
BOB JAMES    100 MAIN ST APTOS CA 95003 408-662-2717
ROBERT JAMES PO BOX 200  APTOS CA 95001 408-662-2717
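The "same address, ZIP, and soundex name" test can be sketched in Python. The soundex function below is a simplified American Soundex coded for this illustration (not part of any address correction package), and the record layout is an assumption for the example:

```python
from collections import defaultdict

def soundex(name):
    """Simplified American Soundex: first letter plus three digits.
    Vowels reset a run of equal codes; H and W do not break one."""
    codes = str.maketrans("BFPVCGJKQSXZDTLMNR", "111122222222334556")
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    def code(ch):
        c = ch.translate(codes)
        return c if c.isdigit() else ""
    prev, out = code(name[0]), []
    for ch in name[1:]:
        c = code(ch)
        if c and c != prev:
            out.append(c)
        if ch not in "HW":          # H and W do not break a run
            prev = c
    return (name[0] + "".join(out) + "000")[:4]

def find_duplicates(records):
    """Group records on (address, 5-digit ZIP, soundex of name)."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["address"], rec["zip"][:5], soundex(rec["name"]))
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]
```

On the standardized SMITH records above, all five land in one group, because SMITH and SMTH share the same soundex code (S530).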

The basic struggle in developing a good duplicate detection procedure is finding the right combination of comparisons to test for. If the duplicate test is too "tight" (for example, selecting only records with the exact same unedited contents in a large number of fields), then too many actual duplicates will be overlooked. For example, here's a file of names, Social Security numbers, and phone numbers:
Name     SSN          Phone       
-------  -----------  ------------
Johnson  555-99-2223  408-662-2717
Johnsen  555-99-2223  408-662-2717
Johnson  555-99-2223  408-662-2717

For the above data, the test "records with the same name, Social Security number, and phone number" is too tight, and misses the "Johnsen" record. Testing for only the same Social Security number and phone number would have been better.

But if the duplicate test is too "loose" (for example, testing only one field, or multiple fields without careful regard to field content), then too many actual non-duplicates will incorrectly be considered as duplicates. Obviously, "records with the same last name" (and no other matching criteria) is far too loose for most mailing lists. Another example is "records with the same phone number", which will match two different people who happen to use the same phone number, a common occurrence at many business addresses.

A good approach toward determining the proper "tightness" for a duplicate test is to start with loose tests, then apply more restrictive selections if too many "false" duplicates are found. For example, if selecting "records with the same phone number" in a file of 100,000 records reports 30,000 duplicates (which might happen when the list contains lots of business addresses with more than one employee at each business), you might tighten up the test by changing the selection to "records with the same personal name and phone number". Continue adding restrictions until the number of duplicates drops to a small enough number of records that you are willing to inspect and review.
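The loosen-then-tighten loop can be sketched as counting how many records collide under a given key, then adding fields until the count is reviewable. The field names are assumptions for the example:

```python
from collections import Counter

def duplicate_count(records, key_fields):
    """Count the records that share a key built from the given fields."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return sum(n for n in counts.values() if n > 1)

# Start loose, then add fields until the report is small enough to review:
#   duplicate_count(recs, ["phone"])                      too loose
#   duplicate_count(recs, ["name", "phone"])              tighter
#   duplicate_count(recs, ["name", "address", "phone"])   tighter still
```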

Most mailing lists contain at least six fields: personal name, company name, address, city, state, and ZIP Code™. (If the city-state-ZIP are in a single field, break them out as separate fields using address correction software.) If addresses have been standardized, the state field is redundant, leaving five fields that can be used in 31 different combinations to test for duplicate records. Of course, if your file contains additional fields, many more combinations are possible. The most frequently used duplicate test is probably "records with the same name and address", but many other tests may be appropriate, depending on the list being processed. See the Appendix for a description of each of the possible field combinations in a typical list, and what that combination finds when it is used to search for duplicates.
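The count of 31 comes from the non-empty subsets of five fields (2^5 - 1 = 31). Enumerating every candidate test is a short exercise:

```python
from itertools import combinations

fields = ["name", "company", "address", "city", "zip"]

# Every non-empty subset of the five fields is a possible duplicate test.
tests = [combo for n in range(1, len(fields) + 1)
               for combo in combinations(fields, n)]
print(len(tests))  # 31 possible field combinations
```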

In general, comparing more fields makes a test tighter, and comparing fewer fields loosens a test. However, a test also can vary its tightness, without changing the number of fields being tested, by "filtering" the contents of the fields it is examining. For example, instead of using all characters in a field, the test might keep only the non-blank characters, or only the numeric characters. Consider this example:

Name     Phone       
-------  --------------
Johnson  408-662-2717
Johnson  (408) 662-2717
Johnson  408 662 2717

With the above data, the test "records with the same phone" produces no matches, if all characters in the phone field are considered significant. But the test "records with the same digits in the phone number" successfully matches all three records. Other common filters are "take the leftmost n characters" and "the equivalent soundex code for a name". In general, applying a filter loosens up a test, because more records will match the filtered field compared to the unfiltered field.
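The "same digits in the phone number" filter is simply a matter of stripping non-digits before comparing:

```python
def phone_digits(phone):
    """Filter a phone field down to its digits only, so that differing
    punctuation and spacing no longer prevent a match."""
    return "".join(ch for ch in phone if ch.isdigit())
```

All three formats in the example above reduce to the same ten-digit key, so a test comparing the filtered field matches all three records.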

Before trying to detect duplicates, it's important that each field in the file is consistently used for only one type of data: names should be in the name field and in no other field, addresses in the address field, ZIP codes in the ZIP field, and so on. Using one field for more than one purpose (sometimes using an address field for company names, sometimes for addresses, sometimes for mailstops, and so on), is bad database maintenance and makes it impossible to do thorough duplicate detection.

Also note that individual fields sometimes make better duplicate detection candidates if the field is split into two or more fields. A common candidate for field splitting is personal names. For example, it's very difficult to automatically determine the duplicates in the following file:

Name          
--------------
BOB JAMES
MR BOB JAMES
DR R JAMES PHD

However, when the above data is split into multiple fields, duplicate detection becomes easier:

Prefix  First  Last  Suffix
------  ------ ----- ------
        BOB    JAMES
MR      BOB    JAMES
DR      R      JAMES PHD
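A name splitter for a case like this can be a small heuristic. The prefix and suffix lists below are tiny illustrative assumptions; real name-parsing software handles far more titles, suffixes, and name orders:

```python
PREFIXES = {"MR", "MRS", "MS", "DR"}
SUFFIXES = {"JR", "SR", "II", "III", "PHD", "MD"}

def split_name(full_name):
    """Split a name like 'DR R JAMES PHD' into (prefix, first, last, suffix)."""
    parts = full_name.upper().split()
    prefix = parts.pop(0) if parts and parts[0] in PREFIXES else ""
    suffix = parts.pop() if parts and parts[-1] in SUFFIXES else ""
    first = parts[0] if len(parts) > 1 else ""
    last = parts[-1] if parts else ""
    return (prefix, first, last, suffix)
```

Once names are split, a test like "same last name (or its soundex) at the same address" becomes possible, which the unsplit field could not support.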

Be sure to carefully decide exactly when you consider two records to be duplicates. Do you have a duplicate when you find two different employees at the same business address? No, if you're mailing to individuals. Yes, if you're mailing to companies. Do you have a duplicate when you find a male and female with the same last name at the same residential address? No, if you're mailing to individuals. Yes, if you're mailing to families or households. What about two different last names at the same residential address? Depending on your application, what might be considered a duplicate for one mailing might not be a duplicate for the next mailing.

Personal name and company name aren't the only fields to consider when determining which records qualify as duplicates. A good example is finding the same name at a street address and at a post office box address. Do you keep both or choose only one? Street addresses can provide better demographic information, and are easier to correct and less prone to undeliverability because of typos, but are sometimes used only for shipping (not mailing), and occupants usually change street addresses more frequently than their box addresses. (Note that because street ZIP codes and post office box ZIP codes are often different within the same city, searching for name-address-ZIP duplicates will often fail to find one person having both a street address and box address. Use a name-address-city duplicate test instead.)

After a pair of duplicate records has been found, there still has to be a decision about which data to prefer if the two records don't match exactly. For example, consider these two "duplicates":

Version #1         Version #2         
-----------------  -------------------
100 MAIN ST        100 MAIN ST
DENVER CO 80211    DENVER CO 80211
303-333-3000 X12   303-333-3112

Three of the five fields in the above two records don't match. How do you reconcile the two records? Use the newer version? The older, original version? Somehow merge the two? Or arbitrarily delete one?
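One common (though by no means the only) reconciliation policy is a field-by-field merge: keep each value from the preferred record, but fall back to the other record wherever the preferred one is blank. A sketch, with "prefer the newer record" chosen as an assumption:

```python
def merge_records(preferred, fallback):
    """Field-by-field merge of two duplicate records: take the preferred
    record's value, using the fallback's value when preferred is blank."""
    return {field: preferred.get(field) or fallback.get(field, "")
            for field in set(preferred) | set(fallback)}
```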

Note that regardless of the method you decide on to choose between or resolve ambiguous duplicates, it's often best NOT to simply delete the unwanted records. Instead, mark them as duplicates, but keep them on file. That way, if the duplicate records are ever again reacquired, they won't have to be re-examined to determine if they're really duplicates. For example, imagine a file containing:

ID # Name        Company   City    
---- ----------  --------  --------
236  EVAN SMITH  IBM CORP  LAKESIDE

A new prospect list is acquired, the two files are merged, and duplicate detection reports the file now contains this possible match:
ID # Name        Company                 City    
---- ----------  ----------------------  --------
236  EVAN SMITH  IBM CORP                LAKESIDE
543  EVAN SMITH  INTL BUSINESS MACHINES  LAKESIDE

If, after review, one of the two records is simply deleted, the review process will have to be repeated if the deleted form ever shows up again. But, if both records are retained and one is marked as a duplicate, the file might instead become something like:
ID # Name        Company                 City     Useable?   
---- ----------  ----------------------  -------- -----------
236  EVAN SMITH  IBM CORP                LAKESIDE YES
543  EVAN SMITH  INTL BUSINESS MACHINES  LAKESIDE DUP OF 236

In the above file, undesirable record 543 has been marked as a duplicate for record 236. Now, if the prospect "EVAN SMITH, INTL BUSINESS MACHINES, LAKESIDE" is reacquired from some subsequent source, it will exactly and completely match record 543 and therefore can be automatically deleted without any review. If record 543 had been deleted after the first acquisition and review, the review process will have to be done over again if the unwanted version is reintroduced to the file.

The technique of marking duplicates instead of deleting them also pays off when doing change-of-address processing. For example, let's say you acquire a prospect list and mail to the following record:

ID # Name      Address      City       
---- --------  -----------  -----------
563  JANE DOE  100 MAIN ST  NEW YORK

But the piece comes back undeliverable and with a forwarding address because the recipient has moved, so you delete the old address and change it to the new address:
ID # Name      Address    City       
---- --------  ---------  -----------
563  JANE DOE  BOX 5000   NEW YORK

Unfortunately, if you reacquire the original address (say, from another prospect list), there won't be any way to tell you've reintroduced an obsolete record:
ID # Name      Address      City       
---- --------  -----------  -----------
563  JANE DOE  BOX 5000     NEW YORK
801  JANE DOE  100 MAIN ST  NEW YORK

Instead of wasting money remailing to undeliverable addresses you've already tried mailing to, a better scheme is "mark, don't delete". When a forwarding address is received, mark the obsolete address so the file becomes something like this:
ID # Name      Address      City        Status Ref
---- --------  -----------  ----------- ------ ---
563  JANE DOE  100 MAIN ST  NEW YORK    MOVED  688
688  JANE DOE  BOX 5000     NEW YORK    OK

With the above scheme, any attempt to reacquire an obsolete prospect can immediately be detected and ignored. Even if an undeliverable address doesn't have a new forwarding address, simply mark the bad address "obsolete" and retain it to serve as a sentinel that can trigger duplicate detection if reacquisition of the obsolete data ever occurs.
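The mark-don't-delete scheme can be sketched as a lookup that consults every record on file, including obsolete and duplicate ones, before accepting a new acquisition. The field and status names here are assumptions for the example:

```python
def acquire(file_records, incoming):
    """Add an incoming record unless it matches a record already on
    file -- including records marked MOVED, OBSOLETE, or DUP, which
    act as sentinels blocking reintroduction of unwanted data."""
    key = (incoming["name"], incoming["address"], incoming["city"])
    for rec in file_records:
        if (rec["name"], rec["address"], rec["city"]) == key:
            return False                 # already known; ignore it
    incoming.setdefault("status", "OK")
    file_records.append(incoming)
    return True
```

Because the obsolete record stays on file, the rejection happens automatically, with no review needed.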

To summarize the strategy and tactics described above: standardize addresses with CASS Certified address correction software, validate data at entry, use each field consistently for one type of data, start with loose duplicate tests and then tighten them, test with appropriate field combinations, filter field contents as needed, split combined fields into separate ones, decide carefully what qualifies as a duplicate, and mark duplicates and obsolete records instead of deleting them.

Copyright © 1996-2007 by Semaphore Corporation
Semaphore Corporation is a non-exclusive licensee of the United States Postal Service®. The prices of Semaphore Corporation products are not established, controlled or approved by the Postal Service™. The following trademarks are owned by the United States Postal Service: CASS Certified, ZIP, ZIP Code, Postal Service, and United States Postal Service. [DA#4.07]