FAST external sorting of variable-length records

7/17/95

TO:  Frank C. Lin
Dept. of Math & CS
Univ. of MD Eastern Shore
410-651-6426
fax 651-6259

Greetings,

We saw your sorting article in the Computer Journal v. 35 #2. It inspired us to try something similar for externally sorting files of variable length records, which has proved very successful. Assuming an input file contains contiguous, variable-length records (for example, carriage-return-delimited text lines), our approach is:

1. Divide available memory into two equal-size buffers.

2. Input a contiguous block of variable length records, filling buffer 1.

3. Create a block of pointers, with each pointer pointing to a record in buffer 1. Save the first pointer at the end of buffer 1, and work backwards as each new pointer in the block of pointers is created, overwriting the last few records in buffer 1. Create pointers until you have run out of records in buffer 1 (which only happens at end of file) or until the record you're pointing to will be overwritten by its own pointer. This means the maximum record size is the size of buffer 1 less the size of one pointer.

4. Seek backwards on the input file to account for the number of records overwritten by pointers in buffer 1, so the records will be re-read on the next iteration.

5. Sort block 1 by sorting the pointers, leaving the buffer's data records in place.

6. Step through the sorted pointers, and copy each referenced record to buffer 2, so buffer 2 ends up containing a contiguous block of buffer 1's variable length records, but in sorted order. Since buffer 1 contains records and their pointers, there will always be enough room in buffer 2 to store just the records.

7. Write the sorted records in buffer 2 to an output file as one contiguous block of data. Note that the block can overwrite the original input records, since the block of records as a group will be the same size. If the original input file is overwritten, step 4 can be omitted, but the input position at step 2 will have to be remembered and seeked-to for the overwriting operation in this step.

8. Return to step 2 and repeat the above for any remaining input.

9. After writing all sorted output blocks, merge them into the final sorted file using any convenient merging algorithm. When allocating buffers or making other preparations, the merging steps can take advantage of the fact that every variable-length block written by step 7 has the same maximum size (just less than the size of buffer 1 or 2).

In most of the computer systems we use, I/O and/or operating system overhead occurs with the number of operations, not the amount of data being read/written. In other words, the time penalty for writing a single variable-length record is about the same as writing one large contiguous block of variable-length records, so the subtle internal nuances of different sorting algorithms becomes insignificant compared to the number of times an I/O call is made. That's why all the trouble of the above steps pays off: instead of lots of single-record read/writes, we do a minimum number of read/writes of large blocks, and overall sorting of the entire file ends up very fast.

Semaphore Corporation