Smalltalk/X Webserver

Documentation of class 'PhoneticStringUtilities::SoundexStringComparator':

Class: SoundexStringComparator (private in PhoneticStringUtilities)

This class is only visible from within PhoneticStringUtilities.

Inheritance
Description
Instance protocol
- api
- private

Inheritance:

   Object
   |
   +--PhoneticStringUtilities::PhoneticStringComparator
      |
      +--PhoneticStringUtilities::SingleResultPhoneticStringComparator
         |
         +--PhoneticStringUtilities::SoundexStringComparator
            |
            +--PhoneticStringUtilities::MiracodeStringComparator
            |
            +--PhoneticStringUtilities::MySQLSoundexStringComparator

Package: stx:libbasic2

Category: Collections-Text-Support

Owner: PhoneticStringUtilities

Description:

WARNING: this is the so-called 'simplified soundex' algorithm;
other variants exist, such as miracode (American soundex) and
mysqlSoundex.

Be sure to use the correct algorithm if the generated codes must be compatible
with those of another program. Otherwise, the differences are probably too small
to notice, although your search results may differ.

The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm

SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
components of names, but by doing so reports more matches.

There are some variations around in the literature;
the following is called 'simplified soundex', and the rules for coding a name are:

1. The first letter of the name is used in its un-coded form to serve as the prefix
character of the code. (The rest of the code is numerical).

2. Thereafter, W and H are ignored entirely.

3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).

4. Other letters of the name are converted to a numerical equivalent:
   B, P, F, V               1
   C, G, J, K, Q, S, X, Z   2
   D, T                     3
   L                        4
   M, N                     5
   R                        6

5. There are two exceptions:
   1. Letters that follow the prefix letter and would, if coded, have the same
      numerical code as it are ignored, unless a 'separator' (see Step 3) precedes them.

   2. The second letter of any pair of consonants having the same code number is likewise
      ignored, unless there is a 'separator' between them in the name.

6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.
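
The six steps above can be sketched in a few lines. The following is a minimal Python illustration, not the ST/X implementation; the names `CODES` and `simplified_soundex` are hypothetical. One assumption is made to reproduce the codes documented for this class (for example 'Ashcraft' -> 'A226'): W and H produce no digit, but, like the vowels, they end a run of equal codes.

```python
# Illustrative sketch of the 'simplified soundex' rules; not the ST/X source.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit            # step 4: letter -> digit table


def simplified_soundex(word):
    """Return the prefix letter plus three digits, e.g. 'W252'."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                  # step 1: prefix letter stays uncoded
    previous = CODES.get(letters[0], "") # step 5.1: prefix code suppresses repeats
    for ch in letters[1:]:
        # Steps 2 and 3: vowels, Y and (by assumption) H, W yield no digit.
        code = CODES.get(ch, "")
        if code and code != previous:    # step 5.2: skip same-code neighbours
            result += code
        previous = code                  # an uncoded letter acts as a separator
    return (result + "000")[:4]          # step 6: pad with zeros or truncate


for name in ("washington", "lee", "Pfister", "Ashcraft"):
    print(name, simplified_soundex(name))
```

Under these assumptions the sketch yields the same codes as the examples listed for `encode:` in the instance protocol.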

Note that in another variant, W and H are treated slightly differently.
This is only relevant if you need to reproduce the soundex codes generated by other programs
or those of the original 1880 US census data.
   SoundexStringComparator new encode:'Ashcraft' -> 'A226'
vs.
   MiracodeStringComparator new encode:'Ashcraft' -> 'A261'
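
The difference between the two results comes down to whether H and W break a run of equal codes. The Python sketch below is an illustration only; the function `soundex` and its `hw_transparent` flag are hypothetical and do not mirror the ST/X classes. It assumes that the miracode variant follows the usual American soundex rule, in which H and W are skipped without separating the consonants on either side.

```python
# Illustrative comparison of the two H/W treatments; not the ST/X source.
CODES = {letter: digit
         for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                ("L", "4"), ("MN", "5"), ("R", "6"))
         for letter in letters}


def soundex(word, hw_transparent=False):
    """hw_transparent=False: H/W separate equal codes (simplified variant).
    hw_transparent=True:  H/W are skipped without separating (assumed miracode rule)."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if hw_transparent and ch in "HW":
            continue                     # keep 'previous': the run is not broken
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


print(soundex("Ashcraft"))                       # S and C both coded
print(soundex("Ashcraft", hw_transparent=True))  # C merged into the preceding S
```

In 'Ashcraft' the S and C both map to 2. With H acting as a separator both digits are kept, giving A226; with H transparent the C is dropped and the R and F follow, giving A261.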

Also note that soundex works best with English.
For German and other languages, other algorithms may give better results.

Instance protocol:

api

encode: word
   self new encode:'washington' -> 'W252'
   self new encode:'lee' -> 'L000'
   self new encode:'Gutierrez' -> 'G362'
   self new encode:'Pfister' -> 'P236'
   self new encode:'Jackson' -> 'J250'
   self new encode:'Tymczak' -> 'T522'

private

translate: aCharacter
   use simple if's for more speed when compiled

ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Tue, 05 Aug 2025 22:16:50 GMT