Smalltalk/X Webserver

Documentation of class 'CharacterEncoder':

Class: CharacterEncoder

Inheritance
Description
Class protocol
Instance protocol
Private classes
Examples

Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoder::CompoundEncoder
      |
      +--CharacterEncoder::InverseEncoder
      |
      +--CharacterEncoder::NullEncoder
      |
      +--CharacterEncoder::OtherEncoding
      |
      +--CharacterEncoder::TwoStepEncoder
      |
      +--CharacterEncoderImplementations::FixedBytesEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_1
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder

Package: stx:libbasic

Category: Collections-Text-Encodings

Version: rev: 1.204 date: 2024/02/10 08:11:19; user: cg; file: CharacterEncoder.st directory: libbasic; module: stx stc-classLibrary: libbasic

Description:

Please read howToAddMoreCoders.

Character mappings are based on information in character maps found at either:
    http://std.dkuug.dk/i18n/charmaps
or:
    http://www.unicode.org/Public/MAPPINGS

No Warranty.

All the ISO 8859 codesets include ASCII as a proper subset:

ISO-8859-1: Latin 1 - Western European Languages.
ISO-8859-2: Latin 2 - Eastern European Languages.
ISO-8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German,
                      Italian, Maltese, Spanish and Turkish.
ISO-8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian.
ISO-8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian.
ISO-8859-6: Latin/Arabic - Arabic.
ISO-8859-7: Latin/Greek - Greek.
ISO-8859-8: Latin/Hebrew - Hebrew.
ISO-8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian,
                      Norwegian, Portuguese, Spanish, Swedish and Turkish.
ISO-8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic,
                      Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.

copyright

COPYRIGHT (c) 2004 by eXept Software AG
             All Rights Reserved

This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice.   This software may not
be provided or otherwise made available to, or used by, any
other person.  No title to or ownership of the software is
hereby transferred.

howToAddMoreCoders

Coders can be hand-written or automagically generated via a mapping table.
Examples for hand-written coders are UTF8_to_ISO10464 or JIS0208_to_JIS7.

The table driven encode/decode methods can be generated from a character mapping document
as found on the unicode consortium host
    (for example: 'http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT')

or from the i18n character maps:
    (for example: 'http://std.dkuug.dk/i18n/charmaps/ISO-8859-1')

In order to add another coder (for example: for EBCDIC or ms-codePage 278),
perform the following steps:
    - create a public subclass of CharacterEncoderImplementations::CharacterEncoderImplementation named (for example) CharacterEncoderImplementations::CP278.

    - define the mapFileURL1_relativePathName method (if the table is found on 'www.unicode.org')
      or the mapFileURL2_relativePathName method (if it is found on 'std.dkuug.dk'), which
      should return the name of the table's file, relative to the top directory there
      (which is '.../Public/MAPPINGS' on 'www.unicode.org' and '.../i18n/charmaps' on 'std.dkuug.dk').

      In this example, the table from 'std.dkuug.dk' is used, and named 'EBCDIC-CP-FI' there.

    - generate code by evaluating (make sure that CharacterEncoderGenerator is loaded from stx:goodies):
        CharacterEncoderImplementations::CP278 generateCode

That's all!
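
As a rough sketch of the second step: assuming the table is the one named 'EBCDIC-CP-FI' on 'std.dkuug.dk', and assuming the selector is the one listed under "private-mapping setup" in the class protocol (mapFileURL2_relativePathName), the class-side method of the new coder might look like the following. The exact form of the returned name is an assumption; compare with an existing generated coder before relying on it.

```smalltalk
"class-side method of the new coder class (sketch, not taken from the library)"
mapFileURL2_relativePathName
    "answer the name of the table file, relative to '.../i18n/charmaps'
     on std.dkuug.dk (assumed: the plain table name is sufficient)"

    ^ 'EBCDIC-CP-FI'
```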


The existing code was generated by:

    CharacterEncoder::SingleByteEncoder subclassesDo:[:cls | Transcript showCR:cls name. cls flushCode; generateCode ]
    CharacterEncoder::SingleByteEncoder subclassesDo:[:cls | cls allSubclassesDo:[:sub | Transcript showCR:sub name. sub flushCode; generateSubclassCode]]

or individually:
    CharacterEncoder::ASCII flushCode; generateCode.
    CharacterEncoder::ISO8859_1 flushCode; generateCode.
    CharacterEncoder::ISO8859_2 flushCode; generateCode.
    CharacterEncoder::ISO8859_3 flushCode; generateCode.
    CharacterEncoder::ISO8859_4 flushCode; generateCode.
    CharacterEncoder::ISO8859_5 flushCode; generateCode.
    CharacterEncoder::ISO8859_6 flushCode; generateCode.
    CharacterEncoder::ISO8859_7 flushCode; generateCode.
    CharacterEncoder::ISO8859_8 flushCode; generateCode.
    CharacterEncoder::ISO8859_9 flushCode; generateCode.
    CharacterEncoder::ISO8859_10 flushCode; generateCode.
    CharacterEncoder::ISO8859_11 flushCode; generateCode.
    CharacterEncoder::ISO8859_13 flushCode; generateCode.
    CharacterEncoder::ISO8859_14 flushCode; generateCode.
    CharacterEncoder::ISO8859_15 flushCode; generateCode.
    CharacterEncoder::ISO8859_16 flushCode; generateCode.
    CharacterEncoder::KOI8_R flushCode; generateCode.
    CharacterEncoder::GSM0338 flushCode; generateCode.

    CharacterEncoder::KOI8_U flushCode; generateSubclassCode.

    CharacterEncoder::JIS0208 flushCode; generateCode.

Please check if your encoder tables are complete; for example, with:
    0 to:255 do:[:ebc |
        |asc ebc2|

        asc := CharacterEncoderImplementations::EBCDIC new decode:ebc.
        asc notNil ifTrue:[
           ebc2 := CharacterEncoderImplementations::EBCDIC new encode:asc.
           self assert:(ebc2 = ebc)
        ].
    ].

    0 to:255 do:[:asc |
        |ebc asc2|

        ebc := CharacterEncoderImplementations::EBCDIC new encode:asc.
        ebc notNil ifTrue:[
           asc2 := CharacterEncoderImplementations::EBCDIC new decode:ebc.
           self assert:(asc2 = asc)
        ].
    ].

Class protocol:

Compatibility-ST80

encoderNamed: encoderName

quick and dirty hack:

Usage example(s):

     self encoderNamed:'foo'
     self encoderNamed:'utf8'
     self encoderNamed:'cp850'

platformName

accessing

nullEncoderInstance

class initialization

encoderClassesByName

initialize

already initialized

Usage example(s):

     self initialize

initializeEncoderClassesByName

initialize the dictionary which maps commonly used names
to encoder classes.
This is done because some encodings are known under several different names.

Usage example(s):

     self initializeEncoderClassesByName

constants

jis7KanjiEscapeSequence: return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.
jis7KanjiOldEscapeSequence: return the escape sequence used to switch to kanji in some old jis7 encoded strings.
jis7RomanEscapeSequence: return the escape sequence used to switch to roman in jis7 encoded strings
jisISO2022EscapeSequence: return the escape sequence used to switch to kanji in iso2022 encoded strings

encoding & decoding

decodeString: anEncodedStringOrByteCollection

CharacterEncoderImplementations::ISO8859_1 decodeString:'hello'
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello' asByteArray

decodeString: aString from: oldEncoding

self encodeString:'hello' into:#ebcdic

self decodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic
self decodeString:(self encodeString:'hello' into:#binary) from:#binary

encode: codePoint from: oldEncodingArg into: newEncodingArg

encodeString: aUnicodeString

given a string in unicode, return a string in my encoding for it

Usage example(s):

     CharacterEncoderImplementations::ISO8859_1 encodeString:'hello'

encodeString: aString from: oldEncodingArg into: newEncodingArg

some hard coded aliases

Usage example(s):

     self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii    
     self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode    
     self encodeString:(self encodeString:'Äh ... hello' into:#ebcdic) from:#ebcdic into:#utf8

encodeString: aString into: newEncoding

self encodeString:'hello' into:#ebcdic

self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#utf8

instance creation

decoderForUTF8

return an encoder-instance which can map utf8 to/from unicode

Usage example(s):

     self encoderForUTF8 
     self decoderForUTF8

encoderFor: encodingNameSymbol

given the name of an encoding, return an encoder-instance which can map it from/into unicode.

Usage example(s):

     CharacterEncoder encoderFor:#'blabla2'       
     CharacterEncoder encoderFor:#'latin1'       
     self encoderFor:#'arabic'       
     self encoderFor:#'ms-arabic'       
     self encoderFor:#'cp1250'       
     self encoderFor:#'cp1251'       
     self encoderFor:#'cp1252'       
     self encoderFor:#'cp1253'       
     self encoderFor:#'iso8859-5'    
     self encoderFor:#'koi8-r'      
     self encoderFor:#'koi8-u'      
     self encoderFor:#'jis0208'      
     self encoderFor:#'jis7'      
     self encoderFor:#'utf8'      
     (self encoderFor:#'utf16le') encodeString:'hello'      
     (self encoderFor:#'utf16le') encode:5    
     (self encoderFor:#'utf16be') encodeString:'hello'      
     (self encoderFor:#'utf16be') encode:5      
     (self encoderFor:#'utf32le') encodeString:'hello'      
     (self encoderFor:#'utf32be') encodeString:'hello'      
     self encoderFor:#'sgml'      
     self encoderFor:#'java'      
     self encoderFor:#'cp850'      
     self encoderFor:#'CP850'      
     self encoderFor:#'ms-1258'      
     self encoderFor:#'windows-1258'

encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue

given the name of an encoding, return an encoder-instance which can map it from/into unicode.
If no encoder is known for that name, exceptionValue is used instead.

Usage example(s):

     CharacterEncoder encoderFor:#'latin1'      => unicode->iso8859-1 
     self encoderFor:#'iso10646-1'              
     self encoderFor:#'ms-ansi'           
     self encoderFor:#'arabic'              
     self encoderFor:#'ms-arabic'           
     self encoderFor:#'iso8859-5'           
     self encoderFor:#'koi8-r'      
     self encoderFor:#'koi8-u'      
     self encoderFor:#'jis0208'      
     self encoderFor:#'jis7'      
     self encoderFor:#'unicode'      
     self encoderFor:#'UTF-8'      
     self encoderFor:'UTF-8'      
     self encoderFor:'iso_8859_1'      => unicode->iso8859-1
     self encoderFor:'iso_8859-1'      => unicode->iso8859-1
     self encoderFor:'iso88591'      
     self encoderFor:'l1'      
     self encoderFor:'ms-1258'
     CharacterEncoder encoderFor:'ISO-2022-JP'
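
The usage examples above all call encoderFor: without a fallback. A minimal sketch of the ifAbsent: variant follows; it assumes that exceptionValue may be a block which is evaluated when no encoder is known (as is usual for ifAbsent: arguments), and that 'blabla2' is not a known encoding name.

```smalltalk
|enc|

"answer nil instead of raising an error if the name is unknown (assumed behavior)"
enc := CharacterEncoder encoderFor:#'blabla2' ifAbsent:[nil].
enc isNil ifTrue:[
    Transcript showCR:'no encoder found'
] ifFalse:[
    Transcript showCR:enc nameOfEncoding printString
].
```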

encoderForUTF8

return an encoder-instance which can map unicode into/from utf8

Usage example(s):

     self encoderForUTF8

encoderToEncodeFrom: oldEncodingArg into: newEncodingArg

unicode -> something

Usage example(s):

  CharacterEncoder initialize
  CharacterEncoder encoderToEncodeFrom:#'latin1' into:#'jis7'      
  CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'mac-cyrillic'              
  CharacterEncoder encoderToEncodeFrom:#'ms-arabic' into:#'mac-arabic'           
  CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'koi8-r'           
  CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'unicode'           
  CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'koi8-u'       
  CharacterEncoder encoderToEncodeFrom:#'utf-8' into:#unicode

private

flushCode: self flushCode

private-mapping setup

generateCode
generateSubclassCode
mapFileURL1_codeColumn
mapFileURL1_relativePathName: must be redefined in concrete subclass(es)
mapFileURL2_relativePathName: must be redefined in concrete subclass(es)
mappingURL1
mappingURL2

queries

alternativeNamesFor: encodingName do: aBlock

try replacing '_' by '-'

alternativeNamesFor: encodingName doWithExit: aBlock

try replacing '_' by '-'

Usage example(s):

     self alternativeNamesFor:'bla-Foo_123' doWithExit:[:nm :exit |
         Transcript showCR:nm
     ].
     
     self alternativeNamesFor:'bla-Foo_123' doWithExit:[:nm :exit |
         nm = 'bla-foo-123' ifTrue:[exit value].
         Transcript showCR:nm
     ]

bomBytes

return the BOM (byte order mark) bytes or nil.
Only applicable for UTF encoders.

encodingNames

subclasses can provide a list of names;
(see https://encoding.spec.whatwg.org/#visualization)
eg: ISO_8859-5 may return
#( 'iso8859_5' 'iso8859-5' 'iso-8859-5' 'cyrillic' 'iso-ir-144' 'csisocyrillic' 'iso88595' 'iso_8859-5' 'iso_8859-5:1988')

isAbstract

Return true if this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine this again.

isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg

return true, if superSetEncoding encoding includes all characters of subSetEncoding.
(this means: characters are included - not that they have the same encoding)

maxCode

** This method must be redefined in concrete classes (subclassResponsibility) **

minCode

nameOfDecodedCode

Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

nameOfEncoding

supportedExternalEncodings

return an array of arrays containing the names of the encodings
which are supported for external resources (i.e. files).
In each inner array, the first element contains the internally used symbolic name,
the second contains a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.
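
Usage example(s) (a sketch, assuming that each inner array has at least the two elements described above):

```smalltalk
"list the symbolic name and the description of every supported external encoding"
CharacterEncoder supportedExternalEncodings do:[:pair |
    Transcript showCR:(pair first printString , ' - ' , (pair at:2) printString)
].
```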

userFriendlyNameOfEncoding

utilities

detectAndSkipBOMInStream: stream

skips over the BOM and returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
If no BOM is detected, the stream is repositioned to where it was before.

Usage example(s):

     |s enc|

     s := #[1 2 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == nil).
     self assert:(s position == 0).

     s := #[16rEF 16rBB 16rBF 4 5 6] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf8).
     self assert:(s position == 3).

     s := #[16rFF 2 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == nil).
     self assert:(s position == 0).

     s := #[16rFF 16rFE 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf16le).
     self assert:(s position == 2).

     s := #[16rFF 16rFE 0 4 5] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf16le).
     self assert:(s position == 2).

     s := #[16rFE 16rFF 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf16be).
     self assert:(s position == 2).

     s := #[16rFF 16rFE 0 0 5 6] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf32le).
     self assert:(s position == 4).

     s := #[0 0 16rFE 16rFF 5 6] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf32be).
     self assert:(s position == 4).

detectBOMInBuffer: buffer

returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
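
Usage example(s) (a sketch, assuming that a ByteArray is accepted as buffer; for these bytes the stream-based detectAndSkipBOMInStream: answers #utf8 and nil respectively, so the same is expected here):

```smalltalk
|enc|

"UTF-8 BOM followed by some data"
enc := CharacterEncoder detectBOMInBuffer:#[16rEF 16rBB 16rBF 4 5 6].
Transcript showCR:enc printString.

"no BOM at all"
enc := CharacterEncoder detectBOMInBuffer:#[1 2 3 4].
Transcript showCR:enc printString.
```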

guessEncodingOfBuffer: buffer

try to guess a string-buffer's encoding.
Basically looks for BOM (byte order marks)
or a special string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually found within the first few bytes of a textFile).
Many editors and tools write such comments (eg. emacs, st/x, etc.)

guessEncodingOfFile: aFilename

look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
within the given buffer
(which is usually found in the first few bytes of a textFile).
If that's not found, use heuristics (in CharacterArray) to guess.
Return a symbol like #utf8.

Usage example(s):

     self guessEncodingOfFile:'../../libview/resources/de.rs' asFilename 
     self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/de.rs' asFilename 

     self guessEncodingOfFile:'../../libview/resources/ru.rs' asFilename  
     self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/ru.rs' asFilename  

     self guessEncodingOfFile:'../../libview/resources/th.rs' asFilename  
     self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/th.rs' asFilename

guessEncodingOfStream: aStream

look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
in the first few bytes of aStream.
Return a symbol like #utf8.

initializeEncodingDetectors

setup the list of encoding detectors.
This is a list of blocks, which get a buffer as argument,
and return an encoding symbol or nil.
Can be customized for more detectors
(used to be hard-coded in guessEncodingOfBuffer:)

showCharacterSet

font := (Font family:'courier' face:'medium' style:'roman' size:12 encoding:'iso10646-1').

Usage example(s):

     CharacterEncoderImplementations::MS_Ansi showCharacterSet
     CharacterEncoderImplementations::ISO8859_1 showCharacterSet
     CharacterEncoderImplementations::ISO8859_2 showCharacterSet
     CharacterEncoderImplementations::ISO8859_3 showCharacterSet
     CharacterEncoderImplementations::ISO8859_4 showCharacterSet
     CharacterEncoderImplementations::ISO8859_5 showCharacterSet
     CharacterEncoderImplementations::ISO8859_6 showCharacterSet
     CharacterEncoderImplementations::ISO8859_7 showCharacterSet
     CharacterEncoderImplementations::ISO8859_8 showCharacterSet
     CharacterEncoderImplementations::ISO8859_9 showCharacterSet

Instance protocol:

encoding & decoding

decodeString: anEncodedStringOrByteCollection: given a string in my encoding, return a unicode-string for it

** This method must be redefined in concrete classes (subclassResponsibility) **
encodeCharacter: aUnicodeCharacterOrCodePoint: encode aUnicodeCharacterOrCodePoint to an (8-bit) String or ByteArray
encodeString: aUnicodeString: given a string in unicode, return a string or ByteArray in my encoding for it

** This method must be redefined in concrete classes (subclassResponsibility) **

error handling

decodesToUnicode: answer true, if this encoder decodes data to unicode
decodingError: report an error that there is no unicode codePoint for a given codePoint in this encoding
(which is unlikely), or that the encoding is undefined for that value
(for example, holes in the ISO-8859-3 encoding).
defaultDecoderValue: placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint. (typically 16rFFFF).
defaultDecoderValueFor: codePoint: no code exists when decoding codePoint;
concrete classes may provide a specific default value (typically 16rFFFF)
defaultEncoderValue: placed into an encoded string, in case there is no codePoint
for a given unicode codePoint. (typically $?).
defaultEncoderValueFor: unicodePoint: no code exists when encoding unicodePoint;
concrete classes may provide a specific default value (typically $?)
encodingError: report an error that some unicode-codePoint cannot be represented by this encoder

printing

printOn: aStream: (comment from inherited method)
append a user printed representation of the receiver to aStream.
The format is suitable for a human - not meant to be read back.

The default here is to output the receiver's class name.
BUT: this method is heavily redefined for objects which
can print prettier.

queries

characterSize: charOrCodePoint: return the number of bytes required to encode codePoint
isEncoderFor: encoding: does this encode to encoding?
isNullEncoder
maxCode
minCode
nameOfDecodedCode: Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
nameOfEncoding
userFriendlyNameOfEncoding

stream support

encodeCharacter: aUnicodeCharacter on: aStream: given a character in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
encodeString: aUnicodeString on: aStream: given a string in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
(but must then also redefine encodeString:aUnicodeString to collect the characters)
readNext: countArg charactersFrom: aStream
readNextCharacterFrom: aStream: ** This method must be redefined in concrete classes (subclassResponsibility) **
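
Usage example(s) (a sketch of the stream variant; it assumes that the encoder answered for #'utf8' writes its result as characters onto a String stream, which is not stated above):

```smalltalk
|enc out|

enc := CharacterEncoder encoderFor:#'utf8'.
out := WriteStream on:(String new).

"encode directly onto the stream instead of building an intermediate string"
enc encodeString:'hello' on:out.
Transcript showCR:out contents printString.
```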

testing

isISO8859_1Encoder: answer true, if this encodes from/to ISO8859_1
isUnicodeSubsetEncoder: answer true, if this encodes a subset of Unicode, that is, a 1-to-1
mapping to unicode
isUtf16Encoder: answer true, if this encodes from/to UTF-16 (regardless of byte-order)
isUtfEncoder: answer true, if this encodes from/to any UTF (regardless of how many bytes and byte-order).
In other words: does it make sense to prepend a BOM?
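
Usage example(s) (a sketch; the encoding name #'utf16le' is taken from the encoderFor: examples in the class protocol):

```smalltalk
|enc|

enc := CharacterEncoder encoderFor:#'utf16le'.

"only UTF encoders are candidates for a byte order mark"
enc isUtfEncoder ifTrue:[
    Transcript showCR:'a BOM makes sense for ' , enc nameOfEncoding printString
].
```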

Private classes:

    CompoundEncoder
    DefaultEncoder
    InverseEncoder
    NullEncoder
    OtherEncoding
    TwoStepEncoder

Examples:

    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
    s2

    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
    s2

ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Sun, 17 Aug 2025 02:21:43 GMT