Smalltalk/X Webserver

Documentation of class 'CharacterEncoder':

Class: CharacterEncoder


Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoder::CompoundEncoder
      |
      +--CharacterEncoder::InverseEncoder
      |
      +--CharacterEncoder::NullEncoder
      |
      +--CharacterEncoder::OtherEncoding
      |
      +--CharacterEncoder::TwoStepEncoder
      |
      +--CharacterEncoderImplementations::FixedBytesEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_1
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder

Package:
stx:libbasic
Category:
Collections-Text-Encodings
Version:
rev: 1.204 date: 2024/02/10 08:11:19
user: cg
file: CharacterEncoder.st directory: libbasic
module: stx stc-classLibrary: libbasic

Description:


Please read howToAddMoreCoders.

Character mappings are based on information in character maps found at either:
    http://std.dkuug.dk/i18n/charmaps
or:
    http://www.unicode.org/Public/MAPPINGS

No Warranty.

All ISO 8859 codesets include ASCII as a proper subset:

ISO-8859-1: Latin 1 - Western European Languages.
ISO-8859-2: Latin 2 - Eastern European Languages.
ISO-8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German,
                      Italian, Maltese, Spanish and Turkish.
ISO-8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian.
ISO-8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian.
ISO-8859-6: Latin/Arabic - Arabic.
ISO-8859-7: Latin/Greek - Greek.
ISO-8859-8: Latin/Hebrew - Hebrew.
ISO-8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian,
                      Norwegian, Portuguese, Spanish, Swedish and Turkish.
ISO-8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic,
                      Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.

copyright

COPYRIGHT (c) 2004 by eXept Software AG All Rights Reserved This software is furnished under a license and may be used only in accordance with the terms of that license and with the inclusion of the above copyright notice. This software may not be provided or otherwise made available to, or used by, any other person. No title to or ownership of the software is hereby transferred.

howToAddMoreCoders

Coders can be hand-written or automagically generated via a mapping table.
Examples of hand-written coders are UTF8_to_ISO10464 or JIS0208_to_JIS7.

The table-driven encode/decode methods can be generated from a character mapping document
as found on the Unicode Consortium host
(for example: 'http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT')
or from the i18n character maps
(for example: 'http://std.dkuug.dk/i18n/charmaps/ISO-8859-1').

In order to add another coder (for example: for EBCDIC or ms-codePage 278),
perform the following steps:

- create a public subclass of CharacterEncoderImplementations::CharacterEncoderImplementation
  named (for example) CharacterEncoderImplementations::CP267.

- define the mappingURL1_relativeName (if the table is found on 'www.unicode.org')
  or the mappingURL2_relativeName (if it is found on 'std.dkuug.dk') method,
  which should return the name of the table's file, relative to the top directory there
  (which is '.../Public/MAPPINGS' on 'www.unicode.org' and '.../i18n/charmaps' on 'std.dkuug.dk').
  In this example, the table from 'std.dkuug.dk' is used, and named 'EBCDIC-CP-FI' there.

- generate code by evaluating
  (make sure that CharacterEncoderGenerator is loaded from stx:goodies):

     CharacterEncoder::CP267 generateCode

That's all!

The existing code was generated by:

     CharacterEncoder::SingleByteEncoder subclassesDo:[:cls |
         Transcript showCR:cls name.
         cls flushCode; generateCode
     ]

     CharacterEncoder::SingleByteEncoder subclassesDo:[:cls |
         cls allSubclassesDo:[:sub |
             Transcript showCR:sub name.
             sub flushCode; generateSubclassCode
         ]
     ]

or individually:

     CharacterEncoder::ASCII flushCode; generateCode.
     CharacterEncoder::ISO8859_1 flushCode; generateCode.
     CharacterEncoder::ISO8859_2 flushCode; generateCode.
     CharacterEncoder::ISO8859_3 flushCode; generateCode.
     CharacterEncoder::ISO8859_4 flushCode; generateCode.
     CharacterEncoder::ISO8859_5 flushCode; generateCode.
     CharacterEncoder::ISO8859_6 flushCode; generateCode.
     CharacterEncoder::ISO8859_7 flushCode; generateCode.
     CharacterEncoder::ISO8859_8 flushCode; generateCode.
     CharacterEncoder::ISO8859_9 flushCode; generateCode.
     CharacterEncoder::ISO8859_10 flushCode; generateCode.
     CharacterEncoder::ISO8859_11 flushCode; generateCode.
     CharacterEncoder::ISO8859_13 flushCode; generateCode.
     CharacterEncoder::ISO8859_14 flushCode; generateCode.
     CharacterEncoder::ISO8859_15 flushCode; generateCode.
     CharacterEncoder::ISO8859_16 flushCode; generateCode.
     CharacterEncoder::KOI8_R flushCode; generateCode.
     CharacterEncoder::GSM0338 flushCode; generateCode.
     CharacterEncoder::KOI8_U flushCode; generateSubclassCode.
     CharacterEncoder::JIS0208 flushCode; generateCode.

Please check if your encoder tables are complete; for example, with:

     0 to:255 do:[:ebc |
         |asc ebc2|

         asc := CharacterEncoderImplementations::EBCDIC new decode:ebc.
         asc notNil ifTrue:[
             ebc2 := CharacterEncoderImplementations::EBCDIC new encode:asc.
             self assert:(ebc2 = ebc)
         ].
     ].

     0 to:255 do:[:asc |
         |ebc asc2|

         ebc := CharacterEncoderImplementations::EBCDIC new encode:asc.
         ebc notNil ifTrue:[
             asc2 := CharacterEncoderImplementations::EBCDIC new decode:ebc.
             self assert:(asc2 = asc)
         ].
     ].
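
The completeness check described above is a round trip in both directions. As a language-neutral illustration only (this is Python using its standard codecs, not ST/X code, and the function name round_trip_failures is hypothetical), the same idea might be sketched like this for single-byte encodings:

```python
def round_trip_failures(codec):
    """Return (direction, value) pairs whose round trip is not the identity."""
    failures = []

    # Direction 1: encoded byte -> Unicode -> encoded byte.
    for byte in range(256):
        try:
            char = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # hole in the table (compare `asc notNil ifTrue:` in the text)
        if char.encode(codec) != bytes([byte]):
            failures.append(("decode-encode", byte))

    # Direction 2: Unicode code point -> encoded byte -> Unicode code point.
    for code_point in range(256):
        try:
            encoded = chr(code_point).encode(codec)
        except UnicodeEncodeError:
            continue  # this code point has no representation in the codec
        if encoded.decode(codec) != chr(code_point):
            failures.append(("encode-decode", code_point))

    return failures
```

An empty result means that every defined table entry survives the round trip; calling round_trip_failures('cp037') would run the check against Python's own EBCDIC table.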

Class protocol:

Compatibility-ST80
o  encoderNamed: encoderName
q & d hack:

Usage example(s):

     self encoderNamed:'foo'
     self encoderNamed:'utf8'
     self encoderNamed:'cp850'

o  platformName

accessing
o  nullEncoderInstance

class initialization
o  encoderClassesByName

o  initialize
already initialized

Usage example(s):

     self initialize

o  initializeEncoderClassesByName
initialize the dictionary which maps commonly used names
to encoder classes.
This is done because some encodings are known under several different names.

Usage example(s):

     self initializeEncoderClassesByName

constants
o  jis7KanjiEscapeSequence
return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.

o  jis7KanjiOldEscapeSequence
return the escape sequence used to switch to kanji in some old jis7 encoded strings.

o  jis7RomanEscapeSequence
return the escape sequence used to switch to roman in jis7 encoded strings

o  jisISO2022EscapeSequence
return the escape sequence used to switch to kanji in iso2022 encoded strings

encoding & decoding
o  decodeString: anEncodedStringOrByteCollection
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello'
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello' asByteArray

o  decodeString: aString from: oldEncoding
self encodeString:'hello' into:#ebcdic

self decodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic
self decodeString:(self encodeString:'hello' into:#binary) from:#binary

o  encode: codePoint from: oldEncodingArg into: newEncodingArg

o  encodeString: aUnicodeString
given a string in unicode, return a string in my encoding for it

Usage example(s):

     CharacterEncoderImplementations::ISO8859_1 encodeString:'hello'

o  encodeString: aString from: oldEncodingArg into: newEncodingArg
some hard coded aliases

Usage example(s):

     self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii    
     self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode    
     self encodeString:(self encodeString:'Äh ... hello' into:#ebcdic) from:#ebcdic into:#utf8    

o  encodeString: aString into: newEncoding
self encodeString:'hello' into:#ebcdic

self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#utf8

instance creation
o  decoderForUTF8
return an encoder-instance which can map utf8 to/from unicode

Usage example(s):

     self encoderForUTF8 
     self decoderForUTF8

o  encoderFor: encodingNameSymbol
given the name of an encoding, return an encoder-instance which can map these from/into unicode.

Usage example(s):

     CharacterEncoder encoderFor:#'blabla2'       
     CharacterEncoder encoderFor:#'latin1'       
     self encoderFor:#'arabic'       
     self encoderFor:#'ms-arabic'       
     self encoderFor:#'cp1250'       
     self encoderFor:#'cp1251'       
     self encoderFor:#'cp1252'       
     self encoderFor:#'cp1253'       
     self encoderFor:#'iso8859-5'    
     self encoderFor:#'koi8-r'      
     self encoderFor:#'koi8-u'      
     self encoderFor:#'jis0208'      
     self encoderFor:#'jis7'      
     self encoderFor:#'utf8'      
     (self encoderFor:#'utf16le') encodeString:'hello'      
     (self encoderFor:#'utf16le') encode:5    
     (self encoderFor:#'utf16be') encodeString:'hello'      
     (self encoderFor:#'utf16be') encode:5      
     (self encoderFor:#'utf32le') encodeString:'hello'      
     (self encoderFor:#'utf32be') encodeString:'hello'      
     self encoderFor:#'sgml'      
     self encoderFor:#'java'      
     self encoderFor:#'cp850'      
     self encoderFor:#'CP850'      
     self encoderFor:#'ms-1258'      
     self encoderFor:#'windows-1258'      

o  encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue
given the name of an encoding, return an encoder-instance which can map these from/into unicode.

Usage example(s):

     CharacterEncoder encoderFor:#'latin1'      => unicode->iso8859-1 
     self encoderFor:#'iso10646-1'              
     self encoderFor:#'ms-ansi'           
     self encoderFor:#'arabic'              
     self encoderFor:#'ms-arabic'           
     self encoderFor:#'iso8859-5'           
     self encoderFor:#'koi8-r'      
     self encoderFor:#'koi8-u'      
     self encoderFor:#'jis0208'      
     self encoderFor:#'jis7'      
     self encoderFor:#'unicode'      
     self encoderFor:#'UTF-8'      
     self encoderFor:'UTF-8'      
     self encoderFor:'iso_8859_1'      => unicode->iso8859-1
     self encoderFor:'iso_8859-1'      => unicode->iso8859-1
     self encoderFor:'iso88591'      
     self encoderFor:'l1'      
     self encoderFor:'ms-1258'
     CharacterEncoder encoderFor:'ISO-2022-JP'

o  encoderForUTF8
return an encoder-instance which can map unicode into/from utf8

Usage example(s):

     self encoderForUTF8      

o  encoderToEncodeFrom: oldEncodingArg into: newEncodingArg
unicode -> something

Usage example(s):

  CharacterEncoder initialize
  CharacterEncoder encoderToEncodeFrom:#'latin1' into:#'jis7'      
  CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'mac-cyrillic'              
  CharacterEncoder encoderToEncodeFrom:#'ms-arabic' into:#'mac-arabic'           
  CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'koi8-r'           
  CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'unicode'           
  CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'koi8-u'       
  CharacterEncoder encoderToEncodeFrom:#'utf-8' into:#unicode       

private
o  flushCode
self flushCode

private-mapping setup
o  generateCode

o  generateSubclassCode

o  mapFileURL1_codeColumn

o  mapFileURL1_relativePathName
must be redefined in concrete subclass(es)

o  mapFileURL2_relativePathName
must be redefined in concrete subclass(es)

o  mappingURL1

o  mappingURL2

queries
o  alternativeNamesFor: encodingName do: aBlock
try replacing '_' by '-'

o  alternativeNamesFor: encodingName doWithExit: aBlock
try replacing '_' by '-'

Usage example(s):

     self alternativeNamesFor:'bla-Foo_123' doWithExit:[:nm :exit |
         Transcript showCR:nm
     ].
     
     self alternativeNamesFor:'bla-Foo_123' doWithExit:[:nm :exit |
         nm = 'bla-foo-123' ifTrue:[exit value].
         Transcript showCR:nm
     ]

o  bomBytes
return the BOM (byte order mark) bytes or nil.
Only applicable for UTF encoders.

o  encodingNames
subclasses can provide a list of names;
(see https://encoding.spec.whatwg.org/#visualization)
eg: ISO_8859-5 may return
#( 'iso8859_5' 'iso8859-5' 'iso-8859-5' 'cyrillic' 'iso-ir-144' 'csisocyrillic' 'iso88595' 'iso_8859-5' 'iso_8859-5:1988')
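
How such name lists can feed a lookup table is sketched below in Python, as an illustration only. The names build_alias_table and canonical_name are hypothetical, the canonical key 'iso8859-5' is chosen arbitrarily, and the case-insensitive matching is an assumption; only the list of aliases is taken from the text above.

```python
def build_alias_table(names_by_encoding):
    """Map every (lower-cased) alias to its canonical encoding name."""
    table = {}
    for canonical, aliases in names_by_encoding.items():
        for alias in aliases:
            table[alias.lower()] = canonical
    return table


# Alias list copied from the encodingNames comment; the key is an assumption.
NAMES = {
    "iso8859-5": [
        "iso8859_5", "iso8859-5", "iso-8859-5", "cyrillic", "iso-ir-144",
        "csisocyrillic", "iso88595", "iso_8859-5", "iso_8859-5:1988",
    ],
}
ALIASES = build_alias_table(NAMES)


def canonical_name(name):
    """Return the canonical encoding name for an alias, or None if unknown."""
    return ALIASES.get(name.lower())
```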

o  isAbstract
Return if this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine this again.

o  isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg
return true, if superSetEncoding encoding includes all characters of subSetEncoding.
(this means: characters are included - not that they have the same encoding)
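
The distinction between "same characters" and "same encoding" can be illustrated with a small Python sketch. It is not the ST/X implementation: the function name is_subset_of is hypothetical, Python's codec names are used instead of ST/X symbols, and only single-byte encodings are handled.

```python
def is_subset_of(sub_codec, super_codec):
    """True if every character of sub_codec can also be encoded in super_codec.

    Only character membership is compared; the byte values used by the two
    encodings may differ.
    """
    for byte in range(256):
        try:
            char = bytes([byte]).decode(sub_codec)
        except UnicodeDecodeError:
            continue  # undefined position in the subset encoding
        try:
            char.encode(super_codec)
        except UnicodeEncodeError:
            return False  # character is missing from the superset candidate
    return True
```

With Python's codecs, is_subset_of('ascii', 'latin-1') holds, while the reverse direction does not.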

o  maxCode

** This method must be redefined in concrete classes (subclassResponsibility) **

o  minCode

o  nameOfDecodedCode
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

o  nameOfEncoding

o  supportedExternalEncodings
return an array of arrays containing the names of the encodings
which are supported for external resources (i.e. files).
The first element contains the internally used symbolic name,
the second contains a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.

o  userFriendlyNameOfEncoding

utilities
o  detectAndSkipBOMInStream: stream
skips over the BOM and returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
If no BOM is detected, the stream is repositioned to where it was before.

Usage example(s):

     |s enc|

     s := #[1 2 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == nil).
     self assert:(s position == 0).

     s := #[16rEF 16rBB 16rBF 4 5 6] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf8).
     self assert:(s position == 3).

     s := #[16rFF 2 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == nil).
     self assert:(s position == 0).

     s := #[16rFF 16rFE 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf16le).
     self assert:(s position == 2).

     s := #[16rFF 16rFE 0 4 5] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf16le).
     self assert:(s position == 2).

     s := #[16rFE 16rFF 3 4] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf16be).
     self assert:(s position == 2).

     s := #[16rFF 16rFE 0 0 5 6] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf32le).
     self assert:(s position == 4).

     s := #[0 0 16rFE 16rFF 5 6] readStream.
     enc := self detectAndSkipBOMInStream:s.
     self assert:(enc == #utf32be).
     self assert:(s position == 4).
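
The behaviour shown by these assertions can be mirrored in a short Python sketch. This is an illustration under assumptions, not the ST/X source: the function name detect_and_skip_bom is hypothetical, and the ordering of the table (4-byte marks before 2-byte marks) is inferred from the examples, where FF FE 00 00 is reported as #utf32le but FF FE 00 04 as #utf16le.

```python
import io

# Longer byte order marks come first, so that FF FE 00 00 is recognised as
# UTF-32LE before the shorter UTF-16LE mark FF FE gets a chance to match.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf32be"),
    (b"\xff\xfe\x00\x00", "utf32le"),
    (b"\xef\xbb\xbf", "utf8"),
    (b"\xfe\xff", "utf16be"),
    (b"\xff\xfe", "utf16le"),
]


def detect_and_skip_bom(stream):
    """Return the detected encoding name and leave the stream after the BOM.

    Returns None and restores the original position if no BOM is found.
    """
    start = stream.tell()
    head = stream.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return name
    stream.seek(start)
    return None


# Usage: the stream ends up positioned on the first payload byte.
s = io.BytesIO(bytes([0xEF, 0xBB, 0xBF, 4, 5, 6]))
encoding = detect_and_skip_bom(s)
```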

o  detectBOMInBuffer: buffer
returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil

o  guessEncodingOfBuffer: buffer
try to guess a string-buffer's encoding.
Basically looks for BOM (byte order marks)
or a special string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually found within the first few bytes of a textFile).
Many editors and tools write such comments (eg. emacs, st/x, etc.)
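
A minimal Python sketch of this two-stage guess (BOM first, then an encoding comment) is given below. It is an assumption-laden illustration, not the ST/X detector list: the function name, the 1024-byte scan limit, the case-insensitive match and the exact pattern for the two comment forms are all hypothetical.

```python
import re

# Byte order marks, longest first so UTF-32LE wins over UTF-16LE.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf32be"),
    (b"\xff\xfe\x00\x00", "utf32le"),
    (b"\xef\xbb\xbf", "utf8"),
    (b"\xfe\xff", "utf16be"),
    (b"\xff\xfe", "utf16le"),
]

# Matches "encoding: name" and "encoding #name" (pattern is an assumption).
_ENCODING_COMMENT = re.compile(rb"encoding(?::\s*|\s+#)([\w.-]+)", re.IGNORECASE)


def guess_encoding_of_buffer(buffer, scan_limit=1024):
    """Guess an encoding name from a bytes buffer, or return None."""
    for bom, name in _BOMS:
        if buffer.startswith(bom):
            return name
    # Only the beginning of the buffer is scanned for an encoding comment.
    match = _ENCODING_COMMENT.search(buffer[:scan_limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```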

o  guessEncodingOfFile: aFilename
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
within the given file
(which is usually found in the first few bytes of a textFile).
If that's not found, use heuristics (in CharacterArray) to guess.
Return a symbol like #utf8.

Usage example(s):

     self guessEncodingOfFile:'../../libview/resources/de.rs' asFilename 
     self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/de.rs' asFilename 

     self guessEncodingOfFile:'../../libview/resources/ru.rs' asFilename  
     self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/ru.rs' asFilename  

     self guessEncodingOfFile:'../../libview/resources/th.rs' asFilename  
     self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/th.rs' asFilename  

o  guessEncodingOfStream: aStream
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
in the first few bytes of aStream.
Return a symbol like #utf8.

o  initializeEncodingDetectors
setup the list of encoding detectors.
This is a list of blocks, which get a buffer as argument,
and return an encoding symbol or nil.
Can be customized for more detectors
(used to be hard-coded in guessEncodingOfBuffer:)

o  showCharacterSet
font := (Font family:'courier' face:'medium' style:'roman' size:12 encoding:'iso10646-1').

Usage example(s):

     CharacterEncoderImplementations::MS_Ansi showCharacterSet
     CharacterEncoderImplementations::ISO8859_1 showCharacterSet
     CharacterEncoderImplementations::ISO8859_2 showCharacterSet
     CharacterEncoderImplementations::ISO8859_3 showCharacterSet
     CharacterEncoderImplementations::ISO8859_4 showCharacterSet
     CharacterEncoderImplementations::ISO8859_5 showCharacterSet
     CharacterEncoderImplementations::ISO8859_6 showCharacterSet
     CharacterEncoderImplementations::ISO8859_7 showCharacterSet
     CharacterEncoderImplementations::ISO8859_8 showCharacterSet
     CharacterEncoderImplementations::ISO8859_9 showCharacterSet


Instance protocol:

encoding & decoding
o  decodeString: anEncodedStringOrByteCollection
given a string in my encoding, return a unicode-string for it

** This method must be redefined in concrete classes (subclassResponsibility) **

o  encodeCharacter: aUnicodeCharacterOrCodePoint
encode aUnicodeCharacterOrCodePoint to a (8-bit) String or ByteArray

o  encodeString: aUnicodeString
given a string in unicode, return a string or ByteArray in my encoding for it

** This method must be redefined in concrete classes (subclassResponsibility) **

error handling
o  decodesToUnicode
answer true, if this encoder decodes data to unicode

o  decodingError
report an error that there is no unicode-codePoint for a given codePoint in this encoding
(which is unlikely), or that the encoding is undefined for that value
(for example, holes in the ISO-8859-3 encoding)

o  defaultDecoderValue
placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint (typically 16rFFFF).

o  defaultDecoderValueFor: codePoint
no code exists when decoding codePoint;
concrete classes may provide a specific default value (typically 16rFFFF)

o  defaultEncoderValue
placed into an encoded string, in case there is no codePoint
for a given unicode codePoint (typically $?).

o  defaultEncoderValueFor: unicodePoint
no code exists when encoding unicodePoint;
concrete classes may provide a specific default value (typically $?)
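
The effect of such default values can be illustrated with a Python sketch. It assumes single-byte encodings and uses Python's codecs; the function names are hypothetical, and the defaults ('?' when encoding, 16rFFFF when decoding) are the "typical" values named in the comments above, not a statement about any particular ST/X encoder.

```python
def encode_with_default(text, codec, default=b"?"):
    """Encode text; unrepresentable characters become the default byte."""
    out = bytearray()
    for char in text:
        try:
            out += char.encode(codec)
        except UnicodeEncodeError:
            out += default  # no code exists for this Unicode code point
    return bytes(out)


def decode_with_default(data, codec, default="\uffff"):
    """Decode single-byte data; undefined bytes become the default character."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            chars.append(default)  # hole in the decoding table
    return "".join(chars)
```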

o  encodingError
report an error that some unicode-codePoint cannot be represented by this encoder

printing
o  printOn: aStream
(comment from inherited method)
append a user printed representation of the receiver to aStream.
The format is suitable for a human - not meant to be read back.

The default here is to output the receiver's class name.
BUT: this method is heavily redefined for objects which
can print prettier.

queries
o  characterSize: charOrCodePoint
return the number of bytes required to encode charOrCodePoint
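
For variable-width encodings the answer depends on the code point. A small Python sketch of the idea follows; the function name and the default codec are hypothetical, and Python codec names are used rather than ST/X encoding symbols.

```python
def character_size(char_or_code_point, codec="utf-8"):
    """Return the number of bytes needed to encode one character in codec."""
    if isinstance(char_or_code_point, int):
        char = chr(char_or_code_point)  # accept a numeric code point as well
    else:
        char = char_or_code_point
    return len(char.encode(codec))
```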

o  isEncoderFor: encoding
does this encode to encoding?

o  isNullEncoder

o  maxCode

o  minCode

o  nameOfDecodedCode
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

o  nameOfEncoding

o  userFriendlyNameOfEncoding

stream support
o  encodeCharacter: aUnicodeCharacter on: aStream
given a character in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.

o  encodeString: aUnicodeString on: aStream
given a string in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
(but must then also redefine encodeString:aUnicodeString to collect the characters)

o  readNext: countArg charactersFrom: aStream

o  readNextCharacterFrom: aStream

** This method must be redefined in concrete classes (subclassResponsibility) **

testing
o  isISO8859_1Encoder
answer true, if this encodes from/to ISO8859_1

o  isUnicodeSubsetEncoder
answer true, if this encodes a subset of Unicode, that is, a 1-to-1
mapping to unicode

o  isUtf16Encoder
answer true, if this encodes from/to UTF-16 (regardless of byte-order)

o  isUtfEncoder
answer true, if this encodes from/to any UTF (regardless of how many bytes and byte-order).
In other words: does it make sense to prepend a BOM?


Private classes:

    CompoundEncoder
    DefaultEncoder
    InverseEncoder
    NullEncoder
    OtherEncoding
    TwoStepEncoder

Examples:


    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
    s2

    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
    s2
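
One common way to convert between two non-Unicode encodings is to go through Unicode as an intermediate step. The Python sketch below illustrates that idea only; whether ST/X performs the second example above in exactly this way is an assumption, and the function name transcode is hypothetical.

```python
def transcode(data, old_encoding, new_encoding):
    """Convert encoded bytes from old_encoding to new_encoding via Unicode."""
    text = data.decode(old_encoding)   # step 1: old encoding -> Unicode
    return text.encode(new_encoding)   # step 2: Unicode -> new encoding


# ASCII-only text has the same bytes in ISO 8859-1 and ISO 8859-7.
converted = transcode(b"hello", "iso8859-1", "iso8859-7")
```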


ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Mon, 18 Nov 2024 04:44:16 GMT