Class: CharacterEncoder
Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoder::CompoundEncoder
      |
      +--CharacterEncoder::InverseEncoder
      |
      +--CharacterEncoder::NullEncoder
      |
      +--CharacterEncoder::OtherEncoding
      |
      +--CharacterEncoder::TwoStepEncoder
      |
      +--CharacterEncoderImplementations::FixedBytesEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_1
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder
- Package: stx:libbasic
- Category: Collections-Text-Encodings
- Version: rev: 1.174 date: 2019/07/27 12:58:29
- user: stefan
- file: CharacterEncoder.st directory: libbasic
- module: stx stc-classLibrary: libbasic
- Author: Claus Gittinger
Please read howToAddMoreCoders.
Character mappings are based on information in character maps found at either:
http://std.dkuug.dk/i18n/charmaps
or:
http://www.unicode.org/Public/MAPPINGS
No Warranty.
All the ISO 8859 codesets include ASCII as a proper subset:
ISO-8859-1: Latin 1 - Western European Languages.
ISO-8859-2: Latin 2 - Eastern European Languages.
ISO-8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German,
Italian, Maltese, Spanish and Turkish.
ISO-8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian.
ISO-8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian.
ISO-8859-6: Latin/Arabic - Arabic.
ISO-8859-7: Latin/Greek - Greek.
ISO-8859-8: Latin/Hebrew - Hebrew.
ISO-8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian,
Norwegian, Portuguese, Spanish, Swedish and Turkish.
ISO-8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic,
Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.
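The subset property stated above can be checked with a short sketch. It assumes that the class-side encodeString: shown in the usage examples of this page is understood by both ISO8859_1 and ISO8859_2, and that the result answers asByteArray (it may be a String or a ByteArray). The variable names are illustrative only.

```smalltalk
|ascii latin1 latin2|

"a pure ASCII string; every ISO 8859 codeset should encode it unchanged"
ascii := 'hello'.
latin1 := CharacterEncoderImplementations::ISO8859_1 encodeString:ascii.
latin2 := CharacterEncoderImplementations::ISO8859_2 encodeString:ascii.

"compare the raw bytes, independent of the result's class"
Transcript showCR:(latin1 asByteArray = latin2 asByteArray) printString.
Transcript showCR:(latin1 asByteArray = ascii asByteArray) printString.
```

Characters outside the ASCII range are expected to differ between the codesets, which is why the comparison is limited to an ASCII-only string.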
EncodedStream
Base64Coder
Compatibility-ST80
-
encoderNamed: encoderName
-
quick & dirty hack
-
platformName
-
accessing
-
nullEncoderInstance
-
class initialization
-
encoderClassesByName
-
-
initialize
-
already initialized
usage example(s):
-
initializeEncoderClassesByName
-
initialize the dictionary which maps commonly used names
to encoder classes.
This is done because some encodings are known under several different names.
usage example(s):
self initializeEncoderClassesByName
constants
-
jis7KanjiEscapeSequence
-
return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.
-
jis7KanjiOldEscapeSequence
-
return the escape sequence used to switch to kanji in some old jis7 encoded strings.
-
jis7RomanEscapeSequence
-
return the escape sequence used to switch to roman in jis7 encoded strings
-
jisISO2022EscapeSequence
-
return the escape sequence used to switch to kanji in iso2022 encoded strings
encoding & decoding
-
decodeString: anEncodedStringOrByteCollection
-
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello'
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello' asByteArray
-
decodeString: aString from: oldEncoding
-
self encodeString:'hello' into:#ebcdic
self decodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic
-
encode: codePoint from: oldEncodingArg into: newEncodingArg
-
-
encodeString: aUnicodeString
-
given a string in unicode, return a string in my encoding for it
usage example(s):
CharacterEncoderImplementations::ISO8859_1 encodeString:'hello'
-
encodeString: aString from: oldEncodingArg into: newEncodingArg
-
some hard coded aliases
usage example(s):
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'Äh ... hello' into:#ebcdic) from:#ebcdic into:#utf8
-
encodeString: aString into: newEncoding
-
self encodeString:'hello' into:#ebcdic
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#utf8
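The expressions above can be combined into a round-trip check. This is a sketch that uses only the class-side selectors listed in this category (encodeString:into: and decodeString:from:); the variable names are illustrative, and it assumes that EBCDIC can represent every character of the sample string.

```smalltalk
|original ebcdic roundTrip|

original := 'hello'.

"unicode -> EBCDIC"
ebcdic := CharacterEncoder encodeString:original into:#ebcdic.

"EBCDIC -> unicode again"
roundTrip := CharacterEncoder decodeString:ebcdic from:#ebcdic.

Transcript showCR:(roundTrip = original) printString.
```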
instance creation
-
decoderForUTF8
-
return an encoder-instance which can map utf8 to/from unicode
usage example(s):
self encoderForUTF8
self decoderForUTF8
-
encoderFor: encodingNameSymbol
-
given the name of an encoding, return an encoder-instance which can map it from/into unicode.
usage example(s):
CharacterEncoder encoderFor:#'blabla2'
CharacterEncoder encoderFor:#'latin1'
self encoderFor:#'arabic'
self encoderFor:#'ms-arabic'
self encoderFor:#'cp1250'
self encoderFor:#'cp1251'
self encoderFor:#'cp1252'
self encoderFor:#'cp1253'
self encoderFor:#'iso8859-5'
self encoderFor:#'koi8-r'
self encoderFor:#'koi8-u'
self encoderFor:#'jis0208'
self encoderFor:#'jis7'
self encoderFor:#'utf8'
(self encoderFor:#'utf16le') encodeString:'hello'
(self encoderFor:#'utf16le') encode:5
(self encoderFor:#'utf16be') encodeString:'hello'
(self encoderFor:#'utf16be') encode:5
(self encoderFor:#'utf32le') encodeString:'hello'
(self encoderFor:#'utf32be') encodeString:'hello'
self encoderFor:#'sgml'
self encoderFor:#'java'
-
encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue
-
given the name of an encoding, return an encoder-instance which can map it from/into unicode.
usage example(s):
CharacterEncoder encoderFor:#'latin1'
self encoderFor:#'iso10646-1'
self encoderFor:#'arabic'
self encoderFor:#'ms-arabic'
self encoderFor:#'iso8859-5'
self encoderFor:#'koi8-r'
self encoderFor:#'koi8-u'
self encoderFor:#'jis0208'
self encoderFor:#'jis7'
self encoderFor:#'unicode'
self encoderFor:#'UTF-8'
self encoderFor:'UTF-8'
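None of the expressions above passes the ifAbsent: argument. The following sketch shows one way it might be used. It assumes that exceptionValue is evaluated (sent value) when no encoder is known under the given name, and that #'blabla2' (taken from the encoderFor: examples) is not a registered encoding name.

```smalltalk
|encoder|

"fall back to nil instead of raising an error for an unknown name"
encoder := CharacterEncoder encoderFor:#'blabla2' ifAbsent:[nil].

encoder isNil
    ifTrue:[ Transcript showCR:'no encoder found' ]
    ifFalse:[ Transcript showCR:encoder printString ].
```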
-
encoderForUTF8
-
return an encoder-instance which can map unicode into/from utf8
usage example(s):
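A minimal round-trip sketch, assuming the returned instance understands encodeString: and decodeString: as listed in the instance protocol of this class. The sample string is the one used in the encodeString:from:into: examples; the variable names are illustrative.

```smalltalk
|utf8 original encoded decoded|

utf8 := CharacterEncoder encoderForUTF8.
original := 'Äh ... hello'.

"unicode -> UTF-8; the non-ASCII character needs more than one byte"
encoded := utf8 encodeString:original.

"UTF-8 -> unicode"
decoded := utf8 decodeString:encoded.

Transcript showCR:encoded size printString.
Transcript showCR:(decoded = original) printString.
```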
-
encoderToEncodeFrom: oldEncodingArg into: newEncodingArg
-
unicode -> something
usage example(s):
CharacterEncoder initialize
CharacterEncoder encoderToEncodeFrom:#'latin1' into:#'jis7'
CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'mac-cyrillic'
CharacterEncoder encoderToEncodeFrom:#'ms-arabic' into:#'mac-arabic'
CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'koi8-r'
CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'unicode'
CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'koi8-u'
CharacterEncoder encoderToEncodeFrom:#'utf-8' into:#unicode
private
-
flushCode
-
self flushCode
private-mapping setup
-
generateCode
-
-
generateSubclassCode
-
-
mapFileURL1_codeColumn
-
-
mapFileURL1_relativePathName
-
must be redefined in concrete subclass(es)
-
mapFileURL2_relativePathName
-
must be redefined in concrete subclass(es)
-
mappingURL1
-
-
mappingURL2
-
queries
-
isAbstract
-
Return true if this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine this again.
-
isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg
-
return true, if superSetEncoding encoding includes all characters of subSetEncoding.
(this means: characters are included - not that they have the same encoding)
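A short sketch of how this query might be called. The encoding names are taken from other examples on this page; which names the query actually recognizes is an assumption, so no results are claimed here.

```smalltalk
"every ISO 8859-1 character also exists in unicode"
Transcript showCR:(CharacterEncoder isEncoding:#'iso8859-1' subSetOf:#unicode) printString.

"ASCII is included in all ISO 8859 codesets"
Transcript showCR:(CharacterEncoder isEncoding:#ascii subSetOf:#'iso8859-1') printString.
```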
-
nameOfDecodedCode
-
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
-
nameOfEncoding
-
-
supportedExternalEncodings
-
return an array of arrays containing the names of the encodings
which are supported for external resources (i.e. files).
In each entry, the first element contains the internally used symbolic name,
the second contains a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.
-
userFriendlyNameOfEncoding
-
utilities
-
detectAndSkipBOMInStream: stream
-
skips over the BOM and returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
if no BOM is detected, the stream is repositioned to where it was before.
-
detectBOMInBuffer: buffer
-
returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
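A sketch of BOM detection on two small buffers. The byte sequence EF BB BF is the UTF-8 byte order mark; that a ByteArray is accepted as the buffer argument is an assumption, and the variable names are illustrative.

```smalltalk
|withBOM withoutBOM|

"UTF-8 BOM followed by the bytes of 'hi'"
withBOM := #[16rEF 16rBB 16rBF 16r68 16r69].

"the same text without a BOM"
withoutBOM := #[16r68 16r69].

Transcript showCR:(CharacterEncoder detectBOMInBuffer:withBOM) printString.
Transcript showCR:(CharacterEncoder detectBOMInBuffer:withoutBOM) printString.
```

When reading from a stream, detectAndSkipBOMInStream: additionally positions the stream behind the BOM, so decoding can start at the first real character.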
-
guessEncodingOfBuffer: buffer
-
try to guess a string-buffer's encoding.
Basically looks for BOM (byte order marks)
or a special string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually found within the first few bytes of a textFile).
Many editors and tools write such comments (eg. emacs, st/x, etc.)
-
guessEncodingOfFile: aFilename
-
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
within the given file
(which is usually found in the first few bytes of a textFile).
If that's not found, use heuristics (in CharacterArray) to guess.
Return a symbol like #utf8.
usage example(s):
self guessEncodingOfFile:'../../libview/resources/de.rs' asFilename
self guessEncodingOfFile:'../../libview/resources/ru.rs' asFilename
self guessEncodingOfFile:'../../libview/resources/th.rs' asFilename
-
guessEncodingOfStream: aStream
-
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
in the first few bytes of aStream.
Return a symbol like #utf8.
-
initializeEncodingDetectors
-
setup the list of encoding detectors.
This is a list of blocks, which get a buffer as argument,
and return an encoding symbol or nil.
Can be customized for more detectors
(used to be hard-coded in guessEncodingOfBuffer:)
-
showCharacterSet
-
font := (Font family:'courier' face:'medium' style:'roman' size:12 encoding:'iso10646-1').
usage example(s):
CharacterEncoderImplementations::MS_Ansi showCharacterSet
CharacterEncoderImplementations::ISO8859_1 showCharacterSet
CharacterEncoderImplementations::ISO8859_2 showCharacterSet
CharacterEncoderImplementations::ISO8859_3 showCharacterSet
CharacterEncoderImplementations::ISO8859_4 showCharacterSet
CharacterEncoderImplementations::ISO8859_5 showCharacterSet
CharacterEncoderImplementations::ISO8859_6 showCharacterSet
CharacterEncoderImplementations::ISO8859_7 showCharacterSet
CharacterEncoderImplementations::ISO8859_8 showCharacterSet
CharacterEncoderImplementations::ISO8859_9 showCharacterSet
encoding & decoding
-
decodeString: anEncodedStringOrByteCollection
-
given a string in my encoding, return a unicode-string for it
** This method raises an error - it must be redefined in concrete classes **
-
encodeCharacter: aUnicodeCharacterOrCodePoint
-
encode aUnicodeCharacterOrCodePoint to an (8-bit) String or ByteArray
-
encodeString: aUnicodeString
-
given a string in unicode, return a string or ByteArray in my encoding for it
** This method raises an error - it must be redefined in concrete classes **
error handling
-
decodingError
-
report an error: either there is no unicode-codePoint for a given codePoint in this encoding
(which is unlikely), or the encoding is undefined for that value
(for example, holes in the ISO-8859-3 encoding).
-
defaultDecoderValue
-
placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint.
(typically 16rFFFF).
-
defaultEncoderValue
-
placed into an encoded string, in case there is no codePoint
for a given unicode codePoint.
(typically $?).
-
encodingError
-
report an error that some unicode-codePoint cannot be represented by this encoder
printing
-
printOn: aStream
-
queries
-
characterSize: charOrCodePoint
-
return the number of bytes required to encode codePoint
** This method raises an error - it must be redefined in concrete classes **
-
isEncoderFor: encoding
-
does this encode to encoding?
-
isNullEncoder
-
-
nameOfDecodedCode
-
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
-
nameOfEncoding
-
-
userFriendlyNameOfEncoding
-
stream support
-
encodeCharacter: aUnicodeCharacter on: aStream
-
given a character in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
-
encodeString: aUnicodeString on: aStream
-
given a string in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
(but must then also redefine encodeString:aUnicodeString to collect the characters)
-
readNext: countArg charactersFrom: aStream
-
-
readNextCharacterFrom: aStream
-
** This method raises an error - it must be redefined in concrete classes **
testing
-
isUnicodeSubsetEncoder
-
answer true, if this encodes a subset of Unicode, that is a 1-to-1
mapping to unicode
-
isUtf16Encoder
-
answer true, if this encodes from/to UTF-16 (regardless of byte-order)
CompoundEncoder
DefaultEncoder
InverseEncoder
NullEncoder
OtherEncoding
TwoStepEncoder
|s1 s2|
s1 := 'hello'.
s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
s2
|
|s1 s2|
s1 := 'hello'.
s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
s2