Smalltalk/X Webserver

Documentation of class 'PhoneticStringUtilities::SoundexStringComparator':

Class: SoundexStringComparator (private in PhoneticStringUtilities)

This class is only visible from within PhoneticStringUtilities.

Inheritance:

   Object
   |
   +--PhoneticStringUtilities::PhoneticStringComparator
      |
      +--PhoneticStringUtilities::SingleResultPhoneticStringComparator
         |
         +--PhoneticStringUtilities::SoundexStringComparator
            |
            +--PhoneticStringUtilities::MiracodeStringComparator
            |
            +--PhoneticStringUtilities::MySQLSoundexStringComparator

Package:
stx:libbasic2
Category:
Collections-Text-Support
Owner:
PhoneticStringUtilities

Description:


WARNING: this is the so-called 'simplified soundex' algorithm;
  other variants, such as miracode (american soundex) and
  mysqlSoundex, also exist.
  
  Be sure to use the correct algorithm if the generated codes must be compatible
  with those of another program. The variants differ for only a few names, so a
  mismatch is easy to overlook, but it will change your search results.

The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm

SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable
components of names, but by doing so reports more matches. 

There are some variations around in the literature; 
the following is called 'simplified soundex', and the rules for coding a name are:

1. The first letter of the name is used in its un-coded form to serve as the prefix
   character of the code. (The rest of the code is numerical).

2. Thereafter, W and H are ignored entirely.

3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).

4. Other letters of the name are converted to a numerical equivalent:
             B, P, F, V              1 
             C, G, J, K, Q, S, X, Z  2 
             D, T                    3 
             L                       4 
             M, N                    5 
             R                       6 

5. There are two exceptions:
    1. A letter that follows the prefix letter and would, if coded, have the same
       numerical code as the prefix letter is ignored, unless a 'separator' (see Step 3) precedes it.

    2. The second letter of any pair of consonants having the same code number is likewise ignored,
       unless there is a 'separator' between them in the name.

6. The final SOUNDEX code consists of the prefix letter plus three numerical characters.
   Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.

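Note that other variants treat W and H slightly differently.
This is only relevant if you need to reproduce the soundex codes generated by other programs
or those of the original 1880 US census data.
For example, in 'Ashcraft' the S and C (both code 2) are separated only by an H;
this class codes both of them, whereas miracode codes them only once: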
 SoundexStringComparator  new encode:'Ashcraft' -> 'A226'
vs.
 MiracodeStringComparator new encode:'Ashcraft' -> 'A261'

Also note that soundex works best with English names.
For German and other languages, other algorithms may provide better results.
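
The coding rules above can be sketched in a few lines. The following is a minimal,
language-neutral illustration in Python, not the Smalltalk/X implementation of this class;
the names CODES, soundex_sketch and hw_transparent are hypothetical. It assumes, judging
from the 'Ashcraft' example above, that in the simplified variant an H or W receives no
code but still keeps two equal codes apart, and that the other variant skips H and W
without doing so.

```python
# Letter-to-digit table from Step 4; vowels, Y, H and W have no entry.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_sketch(word, hw_transparent=False):
    """Return a 4-character soundex-style code for word.

    hw_transparent=False: H and W act like separators (assumed behavior
    of the simplified variant, inferred from 'Ashcraft' -> 'A226').
    hw_transparent=True: H and W are skipped without separating equal
    codes (assumption that reproduces 'Ashcraft' -> 'A261').
    """
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]              # Step 1: uncoded prefix letter
    prev = CODES.get(letters[0])     # code of the previous letter, or None
    for ch in letters[1:]:
        if hw_transparent and ch in "HW":
            continue                 # skip without resetting prev
        code = CODES.get(ch)         # None for separators (and H, W)
        if code is not None and code != prev:
            result += code           # Step 5: equal neighbors coded once
            if len(result) == 4:
                break                # Step 6: truncate to 4 characters
        prev = code
    return result.ljust(4, "0")      # Step 6: pad with zeros


for name in ("washington", "lee", "Pfister", "Ashcraft"):
    print(name, soundex_sketch(name))
```

With these assumptions the sketch reproduces the codes listed in this document, for
example 'W252' for 'washington' and 'P236' for 'Pfister'; soundex_sketch('Ashcraft')
gives 'A226', and soundex_sketch('Ashcraft', hw_transparent=True) gives 'A261'.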


Instance protocol:

api
o  encode: word
self new encode:'washington' -> 'W252'
self new encode:'lee' -> 'L000'
self new encode:'Gutierrez' -> 'G362'
self new encode:'Pfister' -> 'P236'
self new encode:'Jackson' -> 'J250'
self new encode:'Tymczak' -> 'T522'

private
o  translate: aCharacter
use simple if's for more speed when compiled



ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Fri, 26 Apr 2024 18:12:19 GMT