This is DUPDTECT HELP from Semaphore Corporation

This document describes how to detect duplicate records in database and text files using the DUPDTECT program. While viewing this document, use Find or Search menu commands to quickly locate any word or phrase of interest.

Contents of this document
Section 0. If You Don't Have Time To Read Documentation...
Section 1. What is DUPDTECT?
Section 2. Starting DUPDTECT and Selecting the Input File
Section 3. Specifying the Format of a Fixed-Field (Undelimited) Text File
Section 4. Examining the Input File
Section 5. Specifying Filters
Section 6. Marking Duplicates
Section 7. Examing and Editing Duplicates


Section 0. If You Don't Have Time To Read Documentation...

Step 1. Start DUPDTECT.EXE running and click OK in the initial dialog box. Find and open a .DBF file you want to examine for duplicates.

Step 2. In the data grid, click on the row corresponding to the field containing data that might be duplicated between records (such as the name field, or the address field, or the ZIP field).

Step 3. Click the "All" radio button in the panel titled "Keep what field data?".

Step 4. Click the "Find duplicates" button.

Step 5. If duplicates are found, click the "Edit duplicates" button.


Section 1. What is DUPDTECT?

DUPDTECT is a "duplicate detector", a program that finds duplicate records in mailing list files. DUPDTECT allows you to find duplicates using tests that can range from very simple (such as "find records with the same phone number") to very complex (such as "find records containing the same unpunctuated alphabetics in the address field plus the same first three digits in the ZIP field plus the same sounding name in the recipient's name field").

DUPDTECT is a Windows application that processes dBase (.DBF) databases or fixed-field (undelimited) text files, using a convenient drag-and-drop interface. DUPDTECT does not use any proprietary drivers, so DUPDTECT can't process proprietary formats such as Access (.MDB) or Excel (.XLS) databases. If you wish to process a database in a proprietary format, first save the records in a .DBF or undelimited text file.

DUPDTECT includes the ability to edit and delete records in the file being processed. Or, you can instruct DUPDTECT to simply mark any records found to be duplicates. Deletion of duplicate records can then be done within a database or mailing list program, after the duplicate records have been reviewed.

DUPDTECT does not compare two lists, but instead compares all records inside one list. To find the duplicates between two lists, first combine the two lists, then use DUPDTECT to find duplicates in the combined list.

The rest of this document describes how to operate DUPDTECT. If you are not familiar with the general subject of duplicate detection, read this discussion of strategy and tactics before trying to understand or use DUPDTECT.


Section 2. Starting DUPDTECT and Selecting the Input File

Although DUPDTECT includes the ability to edit dBase (.DBF) and fixed-field (undelimited) text files "in place" by directly updating the original input file, DUPDTECT also needs to create temporary files on the disk volume it is running from, so DUPDTECT must be started from a writeable disk. (The amount of temporary space required varies with the DUPDTECT options that are selected, but a good rule of thumb is to be sure the disk has twice as much free space as the size of the file being examined for duplicates.)

Put DUPDTECT.EXE on a writeable volume, then double-click the DUPDTECT icon to start the program running. DUPDTECT will display a small dialog box showing the program's current version number along with OK and Cancel buttons. Click the OK button and use the standard system dialog displayed by DUPDTECT to find and open any file you wish to process.

IMPORTANT: DUPDTECT can be used to modify files. Never process a file with DUPDTECT unless a backup of the original data is available!

If you open a file with a ".DBF" extension, DUPDTECT assumes the file is a dBase database, and you can skip Section 3, because DUPDTECT will automatically know the size of each database field.

If you open a file with any extension other than ".DBF", DUPDTECT will assume the file you opened contains fixed-length text fields, and DUPDTECT will display a dialog for defining the file's field format, as described in the next section.


Section 3. Specifying the Format of a Fixed-Field (Undelimited) Text File

When you open a fixed-field (undelimited) text file as described in the previous section, DUPDTECT assumes the file contains records with fixed-length fields. That is, every record has the same number of fields, and each field has a certain fixed size. For example, here's a file of three fixed-length records, where each record contains a company, address, city, ZIP, and state field:

    SEMAPHORE CORP207 GRANADA   APTOS      95003     CA
    SEMAFOUR      207 GRENADA DRRIO DEL MAR95003-5007CA
    SEMAFOR CO    207 GRAND DR             95003     CA

Because each field has the same fixed length in every record (creating perfectly aligned columns of data through the entire file), fields can run together, without the need for any special "delimiter" characters like tabs or commas to separate fields from one another. Records in a fixed-field text file may or may not be separated by record delimiters, such as carriage return or line feed characters.

Before DUPDTECT can properly understand a fixed-field text file, you need to tell DUPDTECT how many fields are in each record, and how big each field is. This is accomplished with a dialog that DUPDTECT displays after you open a fixed-field text file. (To properly use the dialog, you must already know the number of fields and the sizes of fields in the text file you are processing.)

If each input record is delimited by at least a carriage return, isn't longer than 1,000 characters, and doesn't contain more than 100 fields, DUPDTECT will try to guess the file's field layout. Otherwise, DUPDTECT will use a default layout that assumes the file contains records with ten fields of fifteen characters each. In either case, you'll probably have to adjust the layout as described below.

The DUPDTECT dialog for indicating the layout of a fixed-field text file always displays the file's first five records in a grid, with one record in each grid row. If the correct number of fields and the correct size for each field is properly set, you will see the file's data line up correctly in each grid column, similar to the aligned sample data shown above. If the number of fields or any field size is set incorrectly, or if the file does not contain fixed-length records, the file's data will appear jumbled and not line up properly in each column. If the file contains variable-length records instead of fixed-length records, you will not be able to correctly use the layout dialog, and DUPDTECT will not be able to make sense of the file. If the file contains fixed-length fields and records, adjust the number of fields and/or the sizes of fields until the data lines up properly in the dialog's grid as described below.

To increase the number of fields in a record, double-click on a field to split it into two fields at the spot you double-clicked. To change the number of characters in a field, first click on the field, then use the dialog's control to increase or decrease the field width. Decreasing a field's width to zero deletes the field. The maximum width for an individual field is 255 characters.

If a field isn't completely visible in the grid, clicking the partially hidden field will scroll the grid to display as much of the field as possible. Since the first click of a double-click will scroll a partially hidden field, it's best to click a partially hidden field once to bring it completely into view, then double-click the unobscured field to split the field as desired. (Otherwise, the second click of a double-click on a partially hidden field will probably not be in the desired place because of the scroll that occurs between clicks.)

The keyboard's tab, shift-tab, and arrow keys can be used to change the grid's currently selected field.

Note that control characters such as carriage returns and line feeds will display as black boxes in the layout grid. If the file contains such delimiters (typically at the end of each record), set up a field for the delimiters just as you would any other text in the file. For example, a final field of width 2 should be created for the carriage return and line feed at the end of each record, if those control character record delimiters are present in the file.

Because DUPDTECT is only concerned with fields you wish to compare for duplicate data, other contiguous runs of fields that you don't care about can be grouped as one large field for convenience. For example, if the file contains a 12-character phone number field alongside an 11-character Social Security number field alongside an 8-character date field, and you won't be testing any of those fields for duplicate data, then you can tell DUPDTECT those three fields are simply a single contiguous 31-character field, since DUPDTECT won't need to inspect the three fields individually. This can save time setting up the layout for a file with lots of fields.

Once you have the proper number of fields and field sizes set up in the dialog, click the OK button. DUPDTECT will automatically save the file's layout in a "schema" file with an .SCH extension. For example, if you define a layout for file TEST.TXT, DUPDTECT will create a schema file named TEST.SCH. The next time you open TEST.TXT, DUPDTECT will automatically use TEST.SCH to initialize the dialog's layout (if TEST.SCH is in the same directory as TEST.TXT), so you don't have to manually create the same layout every time you process the file.


Section 4. Examining the Input File

After a dBase file is opened as described in Section 2, or after the layout of a fixed-field text file is specified as described in the previous section, DUPDTECT will display the file's first record in a large, three-column grid. Each row of the grid corresponds to one field in the record. The leftmost column shows the record's field sizes and names, and the middle column shows the contents of each field in the record. The rightmost "Filters" column is used as described in the next section to indicate how DUPDTECT should process each field in the file. The width of the rightmost two grid columns can be adjusted by clicking and dragging the vertical grid lines between the column headings. Data in the middle column of the grid can be edited by clicking on a field in the column to select it, then typing in the changes you want. (Any edits you make are immediately saved, so always back up your file before processing it with DUPDTECT.)

The group of four arrowhead buttons below the grid allow jumping to the first, previous, next, and last database records. The Jump button can be used to jump to any particular record in the file. The Search button can be used to find any particular data in the file.

DUPDTECT displays and processes every record in a dBase file, whether or not records were previously flagged for deletion by database software. If the database contains records flagged for deletion but not yet physically removed from the database, use your database software to "pack" the file and remove the deleted records before processing the file with DUPDTECT.

DUPDTECT treats every field in a dBase file as a "character" field, regardless of the actual type (such as date, numeric, or logical) defined for the field by the database software. Also, DUPDTECT does not re-index dBase files. If you change the contents of an indexed dBase field with DUPDTECT, you should re-index the file with your database software after DUPDTECT processing.

DUPDTECT can't change the actual widths of the fields in the file being processed, so DUPDTECT will display a warning message if you attempt to save data in any field that isn't wide enough.


Section 5. Specifying Filters

In order to find duplicates, you must tell DUPDTECT which fields to examine when comparing records. This is accomplished by assigning "filters" to the fields that will be compared. Filters are chosen using the radio buttons in DUPDTECT's panel titled Keep what field data? For example, if you wish to find all records that have the same address, click on the file's address field (one of the rows in the grid), then click on the radio button for any filter, such as the All filter or the Alphanumeric filter.

By assigning a filter to one or more fields, you are telling DUPDTECT you want to find all instances of records that have the same data in those fields. For example, if you assign a filter to the file's phone number field, DUPDTECT will be able to find all records that have the same phone number. If you assign a filter to the name field and the address and the city field, DUPDTECT will find all records that have the same name and address and city. You can assign filters to as many fields as you wish.

When you select a filter, it must be one of the five listed in the Keep what field data? panel: All, Nonblank, Alphanumeric, Alpha, or Numeric. If you select a grid row that already has a filter assigned, the corresponding radio button in the panel will be selected. To change the filter for a field, just select some other filter in the panel. To remove a filter selected in the grid, click on None in the panel.

When a filter is assigned to a field, the grid is updated to show the filter attached to the field, and what the field's data looks like when the filter is applied. For example, here is a record with the All filter assigned to fields 1 and 4, the Nonblank filter on field 2, and the Numeric filter on field 5:

Field 1: John Smith        (All = JOHN SMITH)
Field 2: XYZ CO            (Nonblank = XYZCO)
Field 3: 100 MAIN ST
Field 4: NEW YORK NY 10012 (All = NEW YORK NY 10012)
Field 5: 212/443-2233      (Numeric = 2124432233)

The All filter tells DUPDTECT to use all characters found in the field when comparing records. The Nonblank filter indicates that only characters in the field that aren't a blank should be used when comparing that field between records. Similarly, the Alphanumeric filter means keep only characters that are letters or numbers (not blanks, punctuation, or other special characters), the Alpha filter means keep only letters, and Numeric means keep only numbers.

In the "Filters" column of the grid, the filter's name appears on the left of the equal sign (=), and the field's data after filtering appears on the right of the equal sign. In the above example, "(Numeric = 2124432233)" indicates the Numeric filter assigned to field 5 will cause DUPDTECT to keep only the digits found in the phone number field before comparing that field between records. In other words, two records with phone numbers "(212) 443-2233" and "212-443-2233" will look like duplicates if the Numeric filter is assigned, because DUPDTECT will treat each field as though it contains "2124432233". If the All filter is assigned to those two phone numbers, they won't be equal and won't be considered duplicates because of the non-numeric punctuation characters found in the field.

Note that DUPDTECT always converts a field's data to all upper case text before applying any filter. If a field's data is already upper case, the All filter changes nothing in the field. Here are some more examples of filtered data:

Data            All             Nonblank      Alphanumeric Alpha   Numeric
--------------- --------------- ------------- ------------ ------- ---------
J Smith         J SMITH         JSMITH        JSMITH       JSMITH
J SMITH         J SMITH         JSMITH        JSMITH       JSMITH
100 W Main ST   100 W MAIN ST   100WMAINST    100WMAINST   WMAINST 100
100 W. MAIN ST. 100 W. MAIN ST. 100W.MAINST.  100WMAINST   WMAINST 100
10012-5668      10012-5668      10012-5668    100125668            100125668
(212) 443-2233  (212) 443-2233  (212)443-2233 2124432233           2124432233

After a field is processed by one of the five available filters, the field's data may optionally receive one or two additional modifications: DUPDTECT may extract only a selected number of leftmost characters, and/or convert the data to a phonetic "soundex" code. This is accomplished using the Leftmost and Phonetic controls in the Keep what field data? panel to additionally modify a field's current filter. For example, here is a record with the Phonetic modifier added to the All filter on field 1, and the Leftmost modifier (set for the leftmost five characters) added to the Numeric filter on field 3:

Field 1: JOHN SMITH             (All,sound = JMCM)
Field 2: 100 MAIN ST
Field 3: NEW YORK NY 10012-5668 (Numeric[5] = 10012)

Any number entered in the Leftmost edit box appears in brackets after the current filter name, like "Numeric[5]" on field 3 above. You must select a filter before the Leftmost edit box is enabled. To remove the Leftmost modifier, remove any number in the edit box, or choose None to remove the current filter. Here are more examples of filters with Leftmost modifiers:

Data            All[5] Nonblank[5] Alphanumeric[5] Alpha[5] Numeric[5]
--------------- ------ ----------- --------------- -------- ----------
J Smith         J SMI  JSMIT       JSMIT           JSMIT
J SMITH         J SMI  JSMIT       JSMIT           JSMIT
100 W Main ST   100 W  100WM       100WM           WMAIN    100
100 W. MAIN ST. 100 W  100W.       100WM           WMAIN    100
10012-5668      10012  10012       10012                    10012
(212) 443-2233  (212)  (212)       21244                    21244

If the Phonetic checkbox is checked, DUPDTECT converts the field's data to a phonetic soundex code after the filter is applied, and the soundex code then appears after the filter name, like "All,sound=JMCM" in the earlier example above. You must select a filter before the Phonetic checkbox is enabled. To remove the Phonetic modifier, uncheck it or choose None to remove the current filter.

Soundex conversion, which is normally used to find names that sound alike but are spelled differently, essentially consists of ignoring vowels and converting different consonants with the same sound to one consonant. The final form of the resulting soundex code isn't important, but the fact that two different words that sound alike will have the same soundex code allows two different names to be matched simply by comparing their codes. For example, "Laskowski" and "Lyczkowski" have the same soundex code ("LCC"). Soundex processing automatically ignores nonalphanumeric characters, so the Phonetic modifier generates the same result when the filter is All, Nonblank, or Alphanumeric. Here are some examples of filters with Phonetic modifiers:

Data     All,sound  All[5],sound  Alpha,sound
-------- ---------- ------------- ------------
MICHAEL  MCL        MC            MCL
MIKE     MC         MC            MC            
FISCHER  FCR        FC            FCR     
FISHER   FCR        FC            FCR     
BOX 50   BC50       BC5           BC
BX 50    BC50       BC50          BC     

Use the Save filters button to save the current configuration of filters in a file. Then the Load filters button can be used to load the configuration from the same file, to avoid having to manually adjust filters each time a file with the same format is processed.

Check the box labeled Ignore first record if the first record in the file is a header record (such as a record containing field names instead of actual data). When this box is checked, DUPDTECT will skip and ignore the first record in the file during the search for duplicates.

When the Skip if any filter empty box is checked, DUPDTECT's search for duplicates will skip and ignore any record containing a field that generated no data after the field's filter was applied. This is useful for comparing only those records having something in all their filtered fields.

After at least one filter has been assigned to a field, the Find duplicates button will be enabled. Click Find duplicates to locate duplicate records as described in the next sections.


Section 6. Marking Duplicates

After at least one field filter has been assigned and the Find duplicates button is clicked as described in the previous section, DUPDTECT announces how many, if any, duplicates have been found.

DUPDTECT computes a Group number and Member number for any record that is found to be a duplicate. All records that match each other are assigned to numbered groups, and each record in each group is given a member number. For example, assume the input file contains:

Name     Company
-------- --------
Smith    IBM
Johnson  Apple
Johnson
Adams    Lotus
Jackson  IBM
Smith
Adams    Apple
Davis    Apple

If the only input filter was All on the "Company" field, DUPDTECT would search for all records with the same company and find three groups: an Apple group with three members, an IBM group with two members, and a two-member group of records with no company name. The Lotus record does not duplicate another record (at least not when the "Company" field is the only field filtered), so DUPDTECT does not assign a Group number or Member number to that record. Group numbers are assigned starting from 1, and Member numbers are assigned in each group beginning with 1:

Name     Company Group number Member number
-------- ------- ------------ -------------
Johnson               1             1
Smith                 1             2
Johnson  Apple        2             1
Adams    Apple        2             2
Davis    Apple        2             3
Smith    IBM          3             1
Jackson  IBM          3             2
Adams    Lotus

If the All input filter was on the "Name" field instead of the "Company" field for the above file, DUPDTECT would search for all records with the same name and find these three groups:

Name     Company Group number Member number
-------- ------- ------------ -------------
Adams    Apple        1             1
Adams    Lotus        1             2
Johnson               2             1
Johnson  Apple        2             2
Smith                 3             1
Smith    IBM          3             2
Davis    Apple
Jackson  IBM

By saving the Group number in the file being processed, you can examine the file with any database tool and determine which records are duplicates. Any record that is a duplicate will have a Group number and a Member number. All records in the same group are duplicates of each other and will have the same Group number and a different Member number. If a file contains duplicates, it will have at least one Group number. Every group has at least two members.

When DUPDTECT displays the number of duplicates found, the dialog box will also include checkbox and combobox controls for saving the Group number and/or Member number in any database field you wish. (Note that the saved numbers will overwrite any data already in the fields you select with the comboboxes.) If you don't want DUPDTECT to save the Group number and Member number in your mailing list, don't check the Save group number in field or Save member number in field checkboxes.

If you do save the Group number in your database, you can later use any database tool to sort by the Group number field, so all members of the group can be examined together to determine what records should be edited, merged, or deleted.

The Member number is only useful when manipulating groups in a database, such as when finding groups with the most members, or limiting each group to a certain number of members and deleting the rest. Note that the Member number without the Group number is insufficient for comparing duplicates, because two records with the same Member number will belong to two different groups.

As another example of DUPDTECT processing, here's a file where All filters were on the "Name" and "Address" fields to find duplicate records with the exact same name and address:

Name     Address     City      Group number  Member number
-------- ----------- --------- ------------  -------------
Smith    15 2nd Ave  Miami          3              1
Adams    Box 10      Atlanta        1              2
Adams    PO Box 10   Atlanta
Johnson  10 Main St  New York       2              2
Johnson  Ten Main    St Paul
Jackson  2 Maple Dr  Dallas
Johnson  10 Main St  Dallas         2              1
Jackson  2 Maple     Dallas
Smith    15 2nd Ave  New York       3              2
Smith    15 Second   New York
Adams    Box 10      New York       1              1
Davis    70 5th Ave  Dallas
Davis    70 Fifth    Dallas


Section 7. Examing and Editing Duplicates

After DUPDTECT announces duplicates have been found, click the Edit duplicates button to examine and edit the duplicate records (whether or not you chose to save the Group number and Member number in the file).

DUPDTECT displays two duplicate records at a time in a four-column grid. The first column shows the database field names and sizes. The second column shows the filters used to detect duplicates. The third and fourth columns display a pair of duplicate records, which can be edited by clicking and typing or by dragging fields from one grid location to another.

To see the next group of duplicates in the file, use the arrowhead buttons titled Group... to jump to the first, previous, next, or last group. Similarly, the arrowhead buttons titled Member... will jump to the first, previous, next, or last member of the currently displayed group.

The red X buttons can be used to "delete" the currently displayed member. If a dBase (.DBF) record is "deleted", the record is simply marked for deletion, and is not actually physically removed from the file. The red X button remains depressed to indicate the record has been marked for deletion. A depressed button can be pressed again to release the button and unmark the record. After using DUPDTECT, a database tool can be used to "pack" the dBase file and physically remove all records marked for deletion.

If the red X button is pressed while examining a fixed-field text record (not a dBase record), all fields in the record are blanked to "delete" the record (although the now-blank record still occupies the same position in the file), and the red X button is then disabled for that record. There is no way to automatically restore such a blanked record. However, any data may be typed or dragged into the blank fields to "undelete" that record.