Smalltalk/X Webserver

Documentation of class 'CharacterEncoder':

Class: CharacterEncoder


Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoder::CompoundEncoder
      |
      +--CharacterEncoder::DefaultEncoder
      |
      +--CharacterEncoder::InverseEncoder
      |
      +--CharacterEncoder::NullEncoder
      |
      +--CharacterEncoder::OtherEncoding
      |
      +--CharacterEncoder::TwoStepEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_1
      |
      +--CharacterEncoderImplementations::SingleByteEncoder
      |
      +--CharacterEncoderImplementations::TwoByteEncoder

Package:
stx:libbasic
Category:
Collections-Text-Encodings
Version:
rev: 1.106 date: 2009/12/11 16:54:00
user: cg
file: CharacterEncoder.st directory: libbasic
module: stx stc-classLibrary: libbasic
Author:
Claus Gittinger

Description:


unfinished code - please read howToAddMoreCoders.

Character mappings are based on information in character maps found at either:
    http://std.dkuug.dk/i18n/charmaps
or:
    http://www.unicode.org/Public/MAPPINGS

No Warranty.

All ISO 8859 codesets include ASCII as a proper subset:

ISO 8859-1: Latin 1 - Western European Languages. 
ISO 8859-2: Latin 2 - Eastern European Languages. 
ISO 8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German, 
                      Italian, Maltese, Spanish and Turkish. 
ISO 8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian. 
ISO 8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian. 
ISO 8859-6: Latin/Arabic - Arabic. 
ISO 8859-7: Latin/Greek - Greek. 
ISO 8859-8: Latin/Hebrew - Hebrew. 
ISO 8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian, 
                      Norwegian, Portuguese, Spanish, Swedish and Turkish. 
ISO 8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic, 
                      Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.


Class protocol:

Compatibility-ST80
o  encoderNamed: encoderName
quick and dirty hack

o  platformName

accessing
o  nullEncoderInstance

class initialization
o  initialize

constants
o  jis7KanjiEscapeSequence
return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.

o  jis7KanjiOldEscapeSequence
return the escape sequence used to switch to kanji in some old jis7 encoded strings.

o  jis7RomanEscapeSequence
return the escape sequence used to switch to roman in jis7 encoded strings

o  jisISO2022EscapeSequence
return the escape sequence used to switch to kanji in iso2022 encoded strings

encoding & decoding
o  decode: aCodePoint

o  decodeString: aString

o  decodeString: aString from: oldEncoding

o  encode: aCodePoint

o  encode: codePoint from: oldEncodingArg into: newEncodingArg

o  encodeString: aUnicodeString
given a string in unicode, return a string in my encoding for it

o  encodeString: aString from: oldEncodingArg into: newEncodingArg

o  encodeString: aString into: newEncoding

instance creation
o  encoderFor: encodingNameSymbol
given the name of an encoding, return an encoder-instance which can map these from/into unicode.

o  encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue
given the name of an encoding, return an encoder-instance which can map these from/into unicode.

o  encoderForUTF8
return an encoder-instance which can map unicode into/from utf8

o  encoderToEncodeFrom: oldEncodingArg into: newEncodingArg
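
Example (a minimal sketch, not taken from the class source): asking for an encoder by name with a fallback. The name #'no-such-encoding' is made up for illustration, and it is assumed here that exceptionValue may be a block whose value is returned when no encoder is found.

    |encoder|

    "the encoding name below is hypothetical and is expected to be unknown"
    encoder := CharacterEncoder
                    encoderFor:#'no-such-encoding'
                    ifAbsent:[ CharacterEncoder encoderForUTF8 ].
    encoder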

private
o  flushCode

private-mapping setup
o  generateCode

o  generateSubclassCode

o  mapFileURL1_codeColumn

o  mapFileURL1_relativePathName
raise an error: must be redefined in concrete subclass(es)

o  mapFileURL2_relativePathName
raise an error: must be redefined in concrete subclass(es)

o  mappingURL1
raise an error: must be redefined in concrete subclass(es)

o  mappingURL2
raise an error: must be redefined in concrete subclass(es)

queries
o  isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg
return true, if superSetEncoding encoding includes all characters of subSetEncoding.
(this means: characters are included - not that they have the same encoding)

o  nameOfDecodedCode
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

o  nameOfEncoding

o  supportedExternalEncodings
return an array of arrays containing the names of the encodings
which are supported for external resources (i.e. files).
The first element contains the internally used symbolic name,
the second contains a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.

o  userFriendlyNameOfEncoding
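
Example (a sketch under assumptions): the queries above might be used as follows. The encoding names are the ones used in the examples of this page; each entry of supportedExternalEncodings is printed as is, since its exact shape is not assumed here.

    |isSubset|

    "are all characters of iso8859-1 also present in unicode ?"
    isSubset := CharacterEncoder isEncoding:#'iso8859-1' subSetOf:#'unicode'.
    Transcript showCR:isSubset printString.

    "list the entries describing the supported external encodings"
    CharacterEncoder supportedExternalEncodings do:[:entry |
        Transcript showCR:entry printString
    ].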

testing
o  isAbstract
Return true if this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine again.

utilities
o  guessEncodingOfBuffer: buffer
look for a string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually the first few bytes of a textFile).

o  guessEncodingOfFile: aFilename
look for a string of the form
encoding #name
or:
encoding: name
in the first few bytes of the given file
(which is usually a textFile).
If that is not found, use heuristics (in CharacterArray) to guess.

o  guessEncodingOfStream: aStream
look for a string of the form
encoding #name
or:
encoding: name
in the first few bytes of aStream.

o  showCharacterSet
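
Example (a minimal sketch): guessing the encoding from a buffer which contains a marker of the second form described above. The buffer contents are made up for illustration; the kind of object returned (presumably the name of the encoding, or nil if nothing is found) is an assumption and is therefore simply printed.

    |buffer guess|

    "a hypothetical file header containing an 'encoding: name' marker"
    buffer := 'some header text; encoding: iso8859-1'.
    guess := CharacterEncoder guessEncodingOfBuffer:buffer.
    Transcript showCR:guess printString.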


Instance protocol:

encoding & decoding
o  decode: anEncoding
given an integer in my encoding, return a unicode codePoint for it

** This method raises an error - it must be redefined in concrete classes **

o  decodeString: anEncodedString
given a string in my encoding, return a unicode-string for it

o  encode: aCodePoint
given a codePoint in unicode, return a byte in my encoding for it

** This method raises an error - it must be redefined in concrete classes **

o  encodeString: aUnicodeString
given a string in unicode, return a string in my encoding for it
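
Example (a minimal sketch using only selectors documented on this page): a round trip through an encoder instance. A plain ASCII string is used on purpose, so that no encodingError or decodingError is expected to be involved.

    |encoder unicodeString encoded decoded|

    encoder := CharacterEncoder encoderForUTF8.
    unicodeString := 'hello'.

    "unicode -> utf8"
    encoded := encoder encodeString:unicodeString.

    "utf8 -> unicode"
    decoded := encoder decodeString:encoded.

    "compare the round trip result against the original"
    decoded = unicodeString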

error handling
o  decodingError
report an error that there is no unicode-codePoint for a given codePoint
in this encoding (which is unlikely), or that the encoding is undefined
for that value (for example, holes in the ISO8859-3 encoding)

o  defaultDecoderValue
placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint.
(typically 16rFFFF).

o  defaultEncoderValue
placed into an encoded string, in case there is no codePoint
for a given unicode codePoint.
(typically $?).

o  encodingError
report an error that some unicode-codePoint cannot be represented by this encoder

printing
o  printOn: aStream

private
o  newString: size

** This method raises an error - it must be redefined in concrete classes **

queries
o  characterSize: codePoint
return the number of bytes required to encode codePoint

** This method raises an error - it must be redefined in concrete classes **

o  isNullEncoder

o  nameOfDecodedCode
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.

o  nameOfEncoding

o  userFriendlyNameOfEncoding

stream support
o  readNext: charactersToRead charactersFrom: stream

o  readNextCharacterFrom: aStream

o  readNextInputCharacterFrom: aStream


Private classes:

    CompoundEncoder
    DefaultEncoder
    InverseEncoder
    NullEncoder
    OtherEncoding
    TwoStepEncoder

Examples:



    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
    s2       


    |s1 s2|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
    s2      
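

A further sketch, not part of the original examples: converting back with the class-side decodeString:from:. It is assumed here that decodeString:from: returns the unicode string for a string given in the named encoding.

    |s1 s2 s3|

    s1 := 'hello'.
    s2 := CharacterEncoder encodeString:s1 from:#'unicode' into:#'iso8859-7'.

    "assumption: decodes from iso8859-7 back into unicode"
    s3 := CharacterEncoder decodeString:s2 from:#'iso8859-7'.
    s3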


ST/X 6.1.1; WebServer 1.620 at exept:8081; Wed, 23 May 2012 08:03:17 GMT