Smalltalk/X Webserver
Documentation of class 'PhoneticStringUtilities::SoundexStringComparator':
Class: SoundexStringComparator (private in PhoneticStringUtilities)
This class is only visible from within PhoneticStringUtilities.

Inheritance:
   Object
   |
   +--PhoneticStringUtilities::PhoneticStringComparator
      |
      +--PhoneticStringUtilities::SingleResultPhoneticStringComparator
         |
         +--PhoneticStringUtilities::SoundexStringComparator
            |
            +--PhoneticStringUtilities::MiracodeStringComparator
            |
            +--PhoneticStringUtilities::MySQLSoundexStringComparator
Description:
WARNING: this is the so-called 'simplified soundex' algorithm; there are other variants around, such as miracode (american soundex) or mysqlSoundex. Be sure to use the correct algorithm if the generated strings must be compatible with those of another program (otherwise, the differences are probably too small to be noticed, but your search results will be different).

The following was copied from http://www.civilsolutions.com.au/publications/dedup.htm

SOUNDEX is a phonetic coding algorithm that ignores many of the unreliable components of names, but by doing so reports more matches. There are some variations around in the literature; the following is called 'simplified soundex', and the rules for coding a name are:

1. The first letter of the name is used in its un-coded form to serve as the prefix character of the code. (The rest of the code is numerical.)
2. Thereafter, W and H are ignored entirely.
3. A, E, I, O, U, Y are not assigned a code number, but do serve as 'separators' (see Step 5).
4. Other letters of the name are converted to a numerical equivalent:
      B, P, F, V                 1
      C, G, J, K, Q, S, X, Z     2
      D, T                       3
      L                          4
      M, N                       5
      R                          6
5. There are two exceptions:
   1. Letters that follow prefix letters which would, if coded, have the same numerical code, are ignored in all cases unless a 'separator' (see Step 3) precedes them.
   2. The second letter of any pair of consonants having the same code number is likewise ignored, i.e. unless there is a 'separator' between them in the name.
6. The final SOUNDEX code consists of the prefix letter plus three numerical characters. Longer codes are truncated to this length, and shorter codes are extended to it by adding zeros.

Notice that in another variant, w and h are treated slightly differently. This is only of relevance if you need to reconstruct the original soundex codes of other programs or of the original 1880 US census data.

    SoundexStringComparator new encode:'Ashcraft' -> 'A226'
vs.
    MiracodeStringComparator new encode:'Ashcraft' -> 'A261'

Also notice that soundex deals better with English. For German and other languages, other algorithms may provide better results.

Instance protocol:
    api
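As a rough illustration of the coding rules in the Description, here is a minimal sketch in Python. It is not the Smalltalk source of this class; the names CODES, soundex_sketch and hw_separates are hypothetical. One point is inferred from the two 'Ashcraft' results rather than stated in the rules: 'A226' can only arise if the H between S and C keeps them from collapsing into one code, while 'A261' arises if H is skipped without breaking the pair. The hw_separates flag models that difference and is an assumption about how the two variants differ.

```python
# Letter-to-digit table from step 4 of the description.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex_sketch(name, hw_separates=True):
    """Encode name as a prefix letter plus three digits.

    hw_separates=True : W and H get no code but break a run of equal codes
                        (assumption, consistent with the 'A226' result).
    hw_separates=False: W and H are skipped without breaking a run
                        (assumption, consistent with the 'A261' result).
    """
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    prefix = letters[0]                 # step 1: first letter stays un-coded
    digits = []
    previous = CODES.get(prefix)        # step 5.1: the prefix's own code counts
    for c in letters[1:]:
        if c in "HW" and not hw_separates:
            continue                    # skipped; 'previous' is left unchanged
        code = CODES.get(c)             # None for separators (steps 2 and 3)
        if code is not None and code != previous:
            digits.append(code)         # step 5.2: no repeated adjacent code
        previous = code
    # step 6: truncate to three digits, or pad with zeros
    return prefix + "".join(digits)[:3].ljust(3, "0")


print(soundex_sketch("Ashcraft"))                      # A226
print(soundex_sketch("Ashcraft", hw_separates=False))  # A261
```

Names without W or H, such as 'Robert', encode identically under both settings of the flag.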
ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Wed, 04 Dec 2024 08:38:49 GMT