Documentation of class 'CharacterEncoder':

Class: CharacterEncoder


Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoder::CompoundEncoder
      |
      +--CharacterEncoder::InverseEncoder
      |
      +--CharacterEncoder::NullEncoder
      |
      +--CharacterEncoder::OtherEncoding
      |
      +--CharacterEncoder::TwoStepEncoder
      |
      +--CharacterEncoderImplementations::FixedBytesEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_1
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder

Package:
stx:libbasic
Category:
Collections-Text-Encodings
Version:
rev: 1.174 date: 2019/07/27 12:58:29
user: stefan
file: CharacterEncoder.st directory: libbasic
module: stx stc-classLibrary: libbasic
Author:
Claus Gittinger

Description:


Please read howToAddMoreCoders.

Character mappings are based on information in character maps found at either:
    http://std.dkuug.dk/i18n/charmaps
or:
    http://www.unicode.org/Public/MAPPINGS

No Warranty.

All the ISO 8859 codesets include ASCII as a proper subset:

ISO-8859-1: Latin 1 - Western European Languages.
ISO-8859-2: Latin 2 - Eastern European Languages.
ISO-8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German,
                      Italian, Maltese, Spanish and Turkish.
ISO-8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian.
ISO-8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian.
ISO-8859-6: Latin/Arabic - Arabic.
ISO-8859-7: Latin/Greek - Greek.
ISO-8859-8: Latin/Hebrew - Hebrew.
ISO-8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian,
                      Norwegian, Portuguese, Spanish, Swedish and Turkish.
ISO-8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic,
                      Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.


Related information:

    EncodedStream
    Base64Coder

Class protocol:

Compatibility-ST80
o  encoderNamed: encoderName
quick and dirty hack

o  platformName

accessing
o  nullEncoderInstance

class initialization
o  encoderClassesByName

o  initialize
already initialized

usage example(s):

     self initialize

o  initializeEncoderClassesByName
initialize the dictionary which maps commonly used names
to encoder classes.
This is done because some encodings are known under several different names.

usage example(s):

     self initializeEncoderClassesByName

constants
o  jis7KanjiEscapeSequence
return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.

o  jis7KanjiOldEscapeSequence
return the escape sequence used to switch to kanji in some old jis7 encoded strings.

o  jis7RomanEscapeSequence
return the escape sequence used to switch to roman in jis7 encoded strings

o  jisISO2022EscapeSequence
return the escape sequence used to switch to kanji in iso2022 encoded strings

encoding & decoding
o  decodeString: anEncodedStringOrByteCollection
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello'
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello' asByteArray

o  decodeString: aString from: oldEncoding
self encodeString:'hello' into:#ebcdic

self decodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic

o  encode: codePoint from: oldEncodingArg into: newEncodingArg

o  encodeString: aUnicodeString
given a string in unicode, return a string in my encoding for it

usage example(s):

     CharacterEncoderImplementations::ISO8859_1 encodeString:'hello'

o  encodeString: aString from: oldEncodingArg into: newEncodingArg
some hard coded aliases

usage example(s):

     self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii    
     self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode    
     self encodeString:(self encodeString:'Äh ... hello' into:#ebcdic) from:#ebcdic into:#utf8    

o  encodeString: aString into: newEncoding
self encodeString:'hello' into:#ebcdic

self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#utf8

instance creation
o  decoderForUTF8
return an encoder-instance which can map utf8 to/from unicode

usage example(s):

     self encoderForUTF8 
     self decoderForUTF8

o  encoderFor: encodingNameSymbol
given the name of an encoding, return an encoder-instance which can map these from/into unicode.

usage example(s):

     CharacterEncoder encoderFor:#'blabla2'       
     CharacterEncoder encoderFor:#'latin1'       
     self encoderFor:#'arabic'       
     self encoderFor:#'ms-arabic'       
     self encoderFor:#'cp1250'       
     self encoderFor:#'cp1251'       
     self encoderFor:#'cp1252'       
     self encoderFor:#'cp1253'       
     self encoderFor:#'iso8859-5'    
     self encoderFor:#'koi8-r'      
     self encoderFor:#'koi8-u'      
     self encoderFor:#'jis0208'      
     self encoderFor:#'jis7'      
     self encoderFor:#'utf8'      
     (self encoderFor:#'utf16le') encodeString:'hello'      
     (self encoderFor:#'utf16le') encode:5    
     (self encoderFor:#'utf16be') encodeString:'hello'      
     (self encoderFor:#'utf16be') encode:5      
     (self encoderFor:#'utf32le') encodeString:'hello'      
     (self encoderFor:#'utf32be') encodeString:'hello'      
     self encoderFor:#'sgml'      
     self encoderFor:#'java'      

o  encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue
given the name of an encoding, return an encoder-instance which can map these from/into unicode.

usage example(s):

     CharacterEncoder encoderFor:#'latin1'       
     self encoderFor:#'iso10646-1'              
     self encoderFor:#'arabic'              
     self encoderFor:#'ms-arabic'           
     self encoderFor:#'iso8859-5'           
     self encoderFor:#'koi8-r'      
     self encoderFor:#'koi8-u'      
     self encoderFor:#'jis0208'      
     self encoderFor:#'jis7'      
     self encoderFor:#'unicode'      
     self encoderFor:#'UTF-8'      
     self encoderFor:'UTF-8'      
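
The examples above all use the one-argument form. A minimal sketch of the ifAbsent: variant follows; it assumes that exceptionValue is a block which is evaluated when no encoder is known under the given name, and that #'blabla2' is such an unknown name (both are assumptions, not confirmed by this page):

```smalltalk
"Sketch only: assumes exceptionValue is a block that is evaluated
 when no encoder is registered under the given name."
|encoder|

encoder := CharacterEncoder encoderFor:#'blabla2' ifAbsent:[nil].
encoder isNil ifTrue:[
    "fall back to an encoding name that is listed in the examples above"
    encoder := CharacterEncoder encoderFor:#'utf8'
].
encoder encodeString:'hello'
```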

o  encoderForUTF8
return an encoder-instance which can map unicode into/from utf8

usage example(s):

     self encoderForUTF8      

o  encoderToEncodeFrom: oldEncodingArg into: newEncodingArg
unicode -> something

usage example(s):

  CharacterEncoder initialize
  CharacterEncoder encoderToEncodeFrom:#'latin1' into:#'jis7'      
  CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'mac-cyrillic'              
  CharacterEncoder encoderToEncodeFrom:#'ms-arabic' into:#'mac-arabic'           
  CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'koi8-r'           
  CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'unicode'           
  CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'koi8-u'       
  CharacterEncoder encoderToEncodeFrom:#'utf-8' into:#unicode       

private
o  flushCode
self flushCode

private-mapping setup
o  generateCode

o  generateSubclassCode

o  mapFileURL1_codeColumn

o  mapFileURL1_relativePathName
must be redefined in concrete subclass(es)

o  mapFileURL2_relativePathName
must be redefined in concrete subclass(es)

o  mappingURL1

o  mappingURL2

queries
o  isAbstract
Return true if this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine this again.

o  isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg
return true, if superSetEncoding encoding includes all characters of subSetEncoding.
(this means: characters are included - not that they have the same encoding)
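
A possible way to try this query, assuming that the encoding names used elsewhere on this page (#ascii, #'iso8859-1', #unicode) are also accepted here; the results are not shown because they depend on the installed encoders:

```smalltalk
"Sketch only: the encoding names are taken from other examples on this page;
 whether this query accepts them is an assumption."
CharacterEncoder isEncoding:#'ascii' subSetOf:#'iso8859-1'.
CharacterEncoder isEncoding:#'iso8859-1' subSetOf:#'unicode'.
```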

o  nameOfDecodedCode
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

o  nameOfEncoding

o  supportedExternalEncodings
return an array of arrays containing the names of encodings
which are supported for external resources (i.e. files).
The first element of each entry contains the internally used symbolic name,
the second contains a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.

o  userFriendlyNameOfEncoding

utilities
o  detectAndSkipBOMInStream: stream
skips over the BOM and returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
if no BOM is detected, the stream is repositioned to where it was before.

o  detectBOMInBuffer: buffer
returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
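
A minimal sketch of how the BOM check might be exercised. It assumes that the buffer may be a ByteArray; the byte sequence EF BB BF is the standard UTF-8 byte order mark. Going by the list above, the first call should answer one of the listed symbols and the second should answer nil:

```smalltalk
"Sketch only: passing a ByteArray as buffer is an assumption."
|withBOM withoutBOM|

"16rEF 16rBB 16rBF is the UTF-8 byte order mark, followed by the bytes of 'hi'"
withBOM := #[16rEF 16rBB 16rBF 16r68 16r69].
"the same payload without any byte order mark"
withoutBOM := #[16r68 16r69].

CharacterEncoder detectBOMInBuffer:withBOM.
CharacterEncoder detectBOMInBuffer:withoutBOM.
```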

o  guessEncodingOfBuffer: buffer
try to guess a string-buffer's encoding.
Basically looks for BOM (byte order marks)
or a special string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually found within the first few bytes of a textFile).
Many editors and tools write such comments (eg. emacs, st/x, etc.)
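
As an illustration of the comment form described above, the following sketch passes a string that starts with such an encoding hint. The exact spelling and position rules the detector accepts are not documented here, so treat the buffer content as an assumption:

```smalltalk
"Sketch only: the buffer mimics the 'encoding: name' form described above;
 which spellings are actually recognized is an assumption."
|buffer|

buffer := 'encoding: utf8 -- some text follows here'.
CharacterEncoder guessEncodingOfBuffer:buffer
```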

o  guessEncodingOfFile: aFilename
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
within the given file
(which is usually found in the first few bytes of a textFile).
If that's not found, use heuristics (in CharacterArray) to guess.
Return a symbol like #utf8.

usage example(s):

     self guessEncodingOfFile:'../../libview/resources/de.rs' asFilename
     self guessEncodingOfFile:'../../libview/resources/ru.rs' asFilename
     self guessEncodingOfFile:'../../libview/resources/th.rs' asFilename

o  guessEncodingOfStream: aStream
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
in the first few bytes of aStream.
Return a symbol like #utf8.

o  initializeEncodingDetectors
setup the list of encoding detectors.
This is a list of blocks, which get a buffer as argument,
and return an encoding symbol or nil.
Can be customized for more detectors
(used to be hard-coded in guessEncodingOfBuffer:)

o  showCharacterSet
font := (Font family:'courier' face:'medium' style:'roman' size:12 encoding:'iso10646-1').

usage example(s):

     CharacterEncoderImplementations::MS_Ansi showCharacterSet
     CharacterEncoderImplementations::ISO8859_1 showCharacterSet
     CharacterEncoderImplementations::ISO8859_2 showCharacterSet
     CharacterEncoderImplementations::ISO8859_3 showCharacterSet
     CharacterEncoderImplementations::ISO8859_4 showCharacterSet
     CharacterEncoderImplementations::ISO8859_5 showCharacterSet
     CharacterEncoderImplementations::ISO8859_6 showCharacterSet
     CharacterEncoderImplementations::ISO8859_7 showCharacterSet
     CharacterEncoderImplementations::ISO8859_8 showCharacterSet
     CharacterEncoderImplementations::ISO8859_9 showCharacterSet


Instance protocol:

encoding & decoding
o  decodeString: anEncodedStringOrByteCollection
given a string in my encoding, return a unicode-string for it

** This method raises an error - it must be redefined in concrete classes **

o  encodeCharacter: aUnicodeCharacterOrCodePoint
encode aUnicodeCharacterOrCodePoint to a (8-bit) String or ByteArray

o  encodeString: aUnicodeString
given a string in unicode, return a string or ByteArray in my encoding for it

** This method raises an error - it must be redefined in concrete classes **

error handling
o  decodingError
report an error that there is no unicode-codePoint for a given codePoint in this encoding
(which is unlikely), or that the encoding is undefined for that value
(for example, holes in the ISO-8859-3 encoding)

o  defaultDecoderValue
placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint.
(typically 16rFFFF).

o  defaultEncoderValue
placed into an encoded string, in case there is no codePoint
for a given unicode codePoint.
(typically $?).

o  encodingError
report an error that some unicode-codePoint cannot be represented by this encoder

printing
o  printOn: aStream

queries
o  characterSize: charOrCodePoint
return the number of bytes required to encode charOrCodePoint

** This method raises an error - it must be redefined in concrete classes **

o  isEncoderFor: encoding
does this encode to encoding?

o  isNullEncoder

o  nameOfDecodedCode
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

o  nameOfEncoding

o  userFriendlyNameOfEncoding

stream support
o  encodeCharacter: aUnicodeCharacter on: aStream
given a character in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.

o  encodeString: aUnicodeString on: aStream
given a string in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
(but must then also redefine encodeString:aUnicodeString to collect the characters)
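
A minimal sketch of the stream-based variant, assuming that a WriteStream on a String is an acceptable target; some encoders may expect a byte-oriented stream instead:

```smalltalk
"Sketch only: the kind of stream an encoder expects is an assumption."
|encoder stream|

encoder := CharacterEncoder encoderFor:#'utf8'.
stream := WriteStream on:(String new).

"encode directly onto the stream instead of building an intermediate string"
encoder encodeString:'hello' on:stream.
stream contents
```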

o  readNext: countArg charactersFrom: aStream

o  readNextCharacterFrom: aStream

** This method raises an error - it must be redefined in concrete classes **

testing
o  isUnicodeSubsetEncoder
answer true, if this encodes a subset of Unicode, that is a 1-to-1
mapping to unicode

o  isUtf16Encoder
answer true, if this encodes from/to UTF-16 (regardless of byte-order)


Private classes:

    CompoundEncoder
    DefaultEncoder
    InverseEncoder
    NullEncoder
    OtherEncoding
    TwoStepEncoder

Examples:


    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
    s2

    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
    s2


ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Fri, 19 Apr 2024 00:17:30 GMT