Class: PhoneticStringUtilities
Object
|
+--PhoneticStringUtilities
- Package: stx:libbasic2
- Category: Collections-Text-Support
- Version: rev: 1.43 date: 2023/09/08 14:10:04
- user: cg
- file: PhoneticStringUtilities.st directory: libbasic2
- module: stx stc-classLibrary: libbasic2
Utilities which help to perform phonetic string searches or comparisons.
These are all variations or improvements of the soundex algorithm, which usually fails
to provide good results for non-English languages.
soundexCode
    this algorithm was originally contained in the CharacterArray class.
nysiis
    a modified soundex algorithm.
miracode
    another modified soundex algorithm ('American soundex'), used in the 1880 census.
mySQLSoundex
    another modified soundex algorithm, used in MySQL.
koelner phoneticCode
    provides functionality similar to soundex, but tuned much more towards the German language.
Double metaphone
    works with most European languages.
phonem
    described in Georg Wilde and Carsten Meyer, 'Doppelgaenger gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung',
    from 'ct Magazin fuer Computer & Technik 25/1999'.
mra
    Match Rating Approach, a phonetic algorithm developed by Western Airlines in 1977.
caverphone2
    the revised Caverphone algorithm; intended to give better results than soundex.
spanish phonetic code
    an algorithm slightly adjusted to Spanish names.
More info for German readers is found in:
http://www.uni-koeln.de/phil-fak/phonetik/Lehre/MA-Arbeiten/magister_wilz.pdf
copyright
COPYRIGHT (c) 1994 by Claus Gittinger
COPYRIGHT (c) 2009 by eXept Software AG
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
sampleData
for the 50 most common German names, we get:
name       soundex  ext.soundex  metaphone  phonet   phonet2  phonix    daitsch  phonem  koeln  caverphone2  mra
müller     M460     54600000     MLR        MÜLA     NILA     M4000000  689000   MYLR    657    MLA1111111   MLR
schmidt    S530     25300000     SKMTT      SHMIT    ZNIT     S5300000  463000   CMYD    862    SKMT111111   SCHMDT
schneider  S536     25360000     SKNTR      SHNEIDA  ZNEITA   S5300000  463900   CNAYDR  8627   SKNTA11111   SCHNDR
fischer    F260     12600000     FSKR       FISHA    FIZA     F8000000  749000   VYCR    387    FSKA111111   FSCHR
weber      W160     16000000     WBR        WEBA     FEBA     $1000000  779000   VBR     317    WPA1111111   WBR
meyer      M600     56000000     MYR        MEIA     NEIA     M0000000  619000   MAYR    67     MA11111111   MYR
wagner     W256     25600000     WKNR       WAKNA    FAKNA    $2500000  756900   VACNR   3467   WKNA111111   WGNR
schulz     S420     24200000     SKLS       SHULS    ZULZ     S4800000  484000   CULC    858    SKS1111111   SCHLZ
becker     B260     12600000     BKR        BEKA     BEKA     B2000000  759000   BCR     147    PKA1111111   BCKR
hoffmann   H155     15500000     HFMN       HOFMAN   UFNAN    $7550000  576600   OVMAN   036    AFMN111111   HFMN
schäfer    S16ß     21600000     SKFR       SHEFA    ZEFA     S7000000  479000   CVR     837    SKFA111111   SCHFR
|cls|

"choose one of the following comparator classes;
 if all four lines are evaluated in sequence, the last assignment wins"
cls := MRAStringComparator.
cls := SoundexStringComparator.
cls := KoelnerPhoneticCodeStringComparator.
cls := Caverphone2StringComparator.

"print each name together with its phonetic code"
#('müller' 'schmidt' 'schneider' 'fischer' 'weber' 'meyer'
  'wagner' 'schulz' 'becker' 'hoffmann' 'schäfer')
do:[:name |
    Transcript show:''''; show:name; show:''' -> '''; show:(cls encode:name); showCR:''''.
].

KoelnerPhoneticCodeStringComparator encode:'Müller-Lüdenscheidt'. "-> '65752682'"
phonetic codes
-
koelnerPhoneticCodeOf: aString
-
Return the koelner phonetic code (Kölner Phonetik) of aString.
The koelner phonetic code is for the German language what the soundex code is for English;
it returns similar strings for similar sounding words.
There are some differences to soundex, though:
    its length is not limited to 4, but depends on the length of the original string;
    it does not start with the first character of the input.
The algorithm was described by Hans Joachim Postel in 1969.
Usage example(s):
#(
'Müller'
'Mulier'
'Moliere'
'Miller'
'Mueller'
'Mühler'
'Mühlherr'
'Mülherr'
'Myler'
'Millar'
'Myller'
'Müllar'
'Müler'
'Muehler'
'Mülller'
'Müllerr'
'Muehlherr'
'Muellar'
'Mueler'
'Mülleer'
'Mueller'
'Nüller'
'Nyller'
'Niler'
'Czerny'
'Tscherny'
'Czernie'
'Tschernie'
'Schernie'
'Scherny'
'Scherno'
'Czerne'
'Zerny'
'Tzernie'
'Breschnew'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities koelnerPhoneticCodeOf:w)
].
Usage example(s):
PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschnew'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Breschneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Braeschneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Braessneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Pressneff'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Presznäph'. '17863'.
PhoneticStringUtilities koelnerPhoneticCodeOf:'Preschnjiev'. '17863'.
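The codes can also serve as keys for grouping names which sound alike. The following is a minimal sketch and not part of the class: it uses only koelnerPhoneticCodeOf: as documented above, plus the standard Dictionary and OrderedCollection classes; the selection of names is arbitrary.

```smalltalk
|groups|

"map each phonetic code to the collection of names which produce it"
groups := Dictionary new.
#('Müller' 'Mueller' 'Miller' 'Czerny' 'Tscherny' 'Breschnew' 'Breschneff')
do:[:name |
    |code|

    code := PhoneticStringUtilities koelnerPhoneticCodeOf:name.
    (groups at:code ifAbsentPut:[OrderedCollection new]) add:name.
].

"print one line per code, followed by the names in that group"
groups keysAndValuesDo:[:code :names |
    Transcript show:code printString; show:' -> '; showCR:names printString.
].
```

According to the examples above, 'Breschnew' and 'Breschneff' both encode to '17863' and should therefore end up in the same group.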
-
miracodeCodeOf: aString
-
Return a miracode soundex phonetic code or nil.
Miracode is a slightly modified soundex algorithm.
Notice that there are better algorithms around (doubleMetaphone).
Usage example(s):
PhoneticStringUtilities miracodeCodeOf:'claus'
PhoneticStringUtilities miracodeCodeOf:'clause'
PhoneticStringUtilities miracodeCodeOf:'close'
PhoneticStringUtilities miracodeCodeOf:'smalltalk'
PhoneticStringUtilities miracodeCodeOf:'smaltalk'
PhoneticStringUtilities miracodeCodeOf:'smaltak'
PhoneticStringUtilities miracodeCodeOf:'smaltok'
PhoneticStringUtilities miracodeCodeOf:'smoltok'
PhoneticStringUtilities miracodeCodeOf:'aa'
PhoneticStringUtilities miracodeCodeOf:'by'
PhoneticStringUtilities miracodeCodeOf:'bab'
PhoneticStringUtilities miracodeCodeOf:'bob'
PhoneticStringUtilities miracodeCodeOf:'bop'
PhoneticStringUtilities miracodeCodeOf:'pub'
-
mySQLSoundexCodeOf: aString
-
Return the MySQL soundex code. The MySQL soundex code differs from the miracode ('American') soundex:
there is no 4-character limitation, and duplicate vowels and duplicate codes are eliminated in a different order.
Notice that there are better algorithms around (doubleMetaphone).
Usage example(s):
#(
'Müller'
'Miller'
'Mueller'
'Mühler'
'Mühlherr'
'Mülherr'
'Myler'
'Millar'
'Myller'
'Müllar'
'Müler'
'Muehler'
'Mülller'
'Müllerr'
'Muehlherr'
'Muellar'
'Mueler'
'Mülleer'
'Mueller'
'Nüller'
'Nyller'
'Niler'
'Czerny'
'Tscherny'
'Czernie'
'Tschernie'
'Schernie'
'Scherny'
'Scherno'
'Czerne'
'Zerny'
'Tzernie'
'Breschnew'
) do:[:w |
Transcript show:w; show:'->'; showCR:(PhoneticStringUtilities mySQLSoundexCodeOf:w)
].
Usage example(s):
PhoneticStringUtilities mySQLSoundexCodeOf:'Breschnew'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Breschneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Braeschneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Braessneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Pressneff'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Presznäph'.
PhoneticStringUtilities mySQLSoundexCodeOf:'Preschnjiev'.
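To see the effect of the missing length limit, both variants can be printed side by side. This is a small sketch which uses only the two selectors documented above; the names are taken from the example lists, and printString is used so that a nil result is also shown.

```smalltalk
"print the miracode and the MySQL soundex code of each name on one line"
#('Mühlherr' 'Tschernie' 'Breschnew')
do:[:name |
    Transcript
        show:name;
        show:'  miracode: ';
        show:(PhoneticStringUtilities miracodeCodeOf:name) printString;
        show:'  mySQL: ';
        showCR:(PhoneticStringUtilities mySQLSoundexCodeOf:name) printString.
].
```

Longer names are the interesting cases here, because the description above suggests that only the miracode variant is cut to 4 characters.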
-
soundexCodeOf: aString
-
Return a soundex phonetic code or nil.
Soundex (patented in 1918 and 1922) returns similar codes for similar sounding words, making it a useful
tool when searching for words where the correct spelling is unknown
(read Knuth or search the web if you don't know what a soundex code is).
Caveat: 'similar sounding words' means: 'similar sounding in English'.
Notice that there are better algorithms around (doubleMetaphone).
Usage example(s):
PhoneticStringUtilities soundexCodeOf:'claus'
PhoneticStringUtilities soundexCodeOf:'clause'
PhoneticStringUtilities soundexCodeOf:'close'
PhoneticStringUtilities soundexCodeOf:'smalltalk'
PhoneticStringUtilities soundexCodeOf:'smaltalk'
PhoneticStringUtilities soundexCodeOf:'smaltak'
PhoneticStringUtilities soundexCodeOf:'smaltok'
PhoneticStringUtilities soundexCodeOf:'smoltok'
PhoneticStringUtilities soundexCodeOf:'aa'
PhoneticStringUtilities soundexCodeOf:'by'
PhoneticStringUtilities soundexCodeOf:'bab'
PhoneticStringUtilities soundexCodeOf:'bob'
PhoneticStringUtilities soundexCodeOf:'bop'
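Because soundexCodeOf: may return nil, a comparison based on it should not treat two nil results as a match. The following sketch shows one way to do this; the block name soundsAlike is hypothetical and not part of the class.

```smalltalk
|soundsAlike|

"answer true if both words have the same non-nil soundex code"
soundsAlike := [:a :b |
    |codeA codeB|

    codeA := PhoneticStringUtilities soundexCodeOf:a.
    codeB := PhoneticStringUtilities soundexCodeOf:b.
    codeA notNil and:[codeA = codeB]
].

Transcript showCR:(soundsAlike value:'smalltalk' value:'smaltalk') printString.
Transcript showCR:(soundsAlike value:'claus' value:'close') printString.
```

The same pattern applies to miracodeCodeOf:, which is also documented as returning a code or nil.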
queries
-
isUtilityClass
-
(comment from inherited method)
A utility class is one which is not to be instantiated,
but only provides a number of utility functions on the class side.
It is usually also abstract.
Caverphone2StringComparator
DaitchMokotoffStringComparator
DoubleMetaphoneStringComparator
ExtendedSoundexStringComparator
KoelnerPhoneticCodeStringComparator
MRAStringComparator
MetaphoneStringComparator
MiracodeStringComparator
MySQLSoundexStringComparator
NYSIISStringComparator
PhonemStringComparator
PhoneticStringComparator
SingleResultPhoneticStringComparator
SoundexStringComparator
SpanishPhoneticCodeStringComparator