Class: CharacterEncoder
Object
|
+--CharacterEncoder
|
+--CharacterEncoder::CompoundEncoder
|
+--CharacterEncoder::InverseEncoder
|
+--CharacterEncoder::NullEncoder
|
+--CharacterEncoder::OtherEncoding
|
+--CharacterEncoder::TwoStepEncoder
|
+--CharacterEncoderImplementations::FixedBytesEncoder
|
+--CharacterEncoderImplementations::ISO10646_1
|
+--CharacterEncoderImplementations::VariableBytesEncoder
- Package:
- stx:libbasic
- Category:
- Collections-Text-Encodings
- Version:
- rev:
1.204
date: 2024/02/10 08:11:19
- user: cg
- file: CharacterEncoder.st directory: libbasic
- module: stx stc-classLibrary: libbasic
Please read howToAddMoreCoders.
Character mappings are based on information in character maps found at either:
http://std.dkuug.dk/i18n/charmaps
or:
http://www.unicode.org/Public/MAPPINGS
No Warranty.
All the ISO 8859 codesets include ASCII as a proper codeset within them:
ISO-8859-1: Latin 1 - Western European Languages.
ISO-8859-2: Latin 2 - Eastern European Languages.
ISO-8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German,
Italian, Maltese, Spanish and Turkish.
ISO-8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian.
ISO-8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian.
ISO-8859-6: Latin/Arabic - Arabic.
ISO-8859-7: Latin/Greek - Greek.
ISO-8859-8: Latin/Hebrew - Hebrew.
ISO-8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian,
Norwegian, Portuguese, Spanish, Swedish and Turkish.
ISO-8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic,
Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.
copyright
COPYRIGHT (c) 2004 by eXept Software AG
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
howToAddMoreCoders
Coders can be hand-written or automagically generated via a mapping table.
Examples for hand-written coders are UTF8_to_ISO10464 or JIS0208_to_JIS7.
The table driven encode/decode methods can be generated from a character mapping document
as found on the unicode consortium host
(for example: 'http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT')
or from the i18n character maps:
(for example: 'http://std.dkuug.dk/i18n/charmaps/ISO-8859-1')
In order to add another coder (for example: for EBCDIC or ms-codePage 278),
perform the following steps:
- create a public subclass of CharacterEncoderImplementations::CharacterEncoderImplementation named (for example) CharacterEncoderImplementations::CP267.
- define the mapFileURL1_relativePathName (if the table is found on 'www.unicode.org')
or the mapFileURL2_relativePathName (if it is found on 'std.dkuug.dk') method, which
should return the name of the table file, relative to the top directory there
(which is '.../Public/MAPPINGS' on 'www.unicode.org' and '.../i18n/charmaps' on 'std.dkuug.dk').
In this example, the table from 'std.dkuug.dk' is used, and named 'EBCDIC-CP-FI' there.
- generate code by evaluating (make sure that CharacterEncoderGenerator is loaded from stx:goodies):
CharacterEncoder::CP267 generateCode
That's all!
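A minimal sketch of the second step, assuming the table from 'std.dkuug.dk' is used as in the example. The selector is the one listed under "private-mapping setup" in this class's protocol; that it is defined on the class side of the new encoder class is an assumption based on generateCode being sent to the class.

```smalltalk
mapFileURL2_relativePathName
    "class-side method of the new encoder class (an assumption, see above).
     Answers the name of the table file, relative to '.../i18n/charmaps'
     on std.dkuug.dk"

    ^ 'EBCDIC-CP-FI'
```

With this method in place, the generateCode step shown above should be able to locate the table.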
The existing code was generated by:
CharacterEncoder::SingleByteEncoder subclassesDo:[:cls | Transcript showCR:cls name. cls flushCode; generateCode ]
CharacterEncoder::SingleByteEncoder subclassesDo:[:cls | cls allSubclassesDo:[:sub | Transcript showCR:sub name. sub flushCode; generateSubclassCode]]
or individually:
CharacterEncoder::ASCII flushCode; generateCode.
CharacterEncoder::ISO8859_1 flushCode; generateCode.
CharacterEncoder::ISO8859_2 flushCode; generateCode.
CharacterEncoder::ISO8859_3 flushCode; generateCode.
CharacterEncoder::ISO8859_4 flushCode; generateCode.
CharacterEncoder::ISO8859_5 flushCode; generateCode.
CharacterEncoder::ISO8859_6 flushCode; generateCode.
CharacterEncoder::ISO8859_7 flushCode; generateCode.
CharacterEncoder::ISO8859_8 flushCode; generateCode.
CharacterEncoder::ISO8859_9 flushCode; generateCode.
CharacterEncoder::ISO8859_10 flushCode; generateCode.
CharacterEncoder::ISO8859_11 flushCode; generateCode.
CharacterEncoder::ISO8859_13 flushCode; generateCode.
CharacterEncoder::ISO8859_14 flushCode; generateCode.
CharacterEncoder::ISO8859_15 flushCode; generateCode.
CharacterEncoder::ISO8859_16 flushCode; generateCode.
CharacterEncoder::KOI8_R flushCode; generateCode.
CharacterEncoder::GSM0338 flushCode; generateCode.
CharacterEncoder::KOI8_U flushCode; generateSubclassCode.
CharacterEncoder::JIS0208 flushCode; generateCode.
Please check if your encoder tables are complete; for example, with:
    0 to:255 do:[:ebc |
        |asc ebc2|
        asc := CharacterEncoderImplementations::EBCDIC new decode:ebc.
        asc notNil ifTrue:[
            ebc2 := CharacterEncoderImplementations::EBCDIC new encode:asc.
            self assert:(ebc2 = ebc)
        ].
    ].
    0 to:255 do:[:asc |
        |ebc asc2|
        ebc := CharacterEncoderImplementations::EBCDIC new encode:asc.
        ebc notNil ifTrue:[
            asc2 := CharacterEncoderImplementations::EBCDIC new decode:ebc.
            self assert:(asc2 = asc)
        ].
    ].
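The two loops above stop at the first failing assertion. While a table is still incomplete, a variant that collects every code failing the round trip may be easier to work with. This is only a sketch: it reuses the EBCDIC encoder and the encode:/decode: messages from the check above, and the variable names are made up for illustration.

```smalltalk
|coder failures|

coder := CharacterEncoderImplementations::EBCDIC new.
failures := OrderedCollection new.

0 to:255 do:[:ebc |
    |asc|

    "decode first; codes without a mapping (nil) are skipped, as in the check above"
    asc := coder decode:ebc.
    (asc notNil and:[ (coder encode:asc) ~= ebc ]) ifTrue:[
        failures add:ebc
    ].
].
Transcript showCR:'codes failing the round trip: ' , failures printString.
```

An empty collection indicates that decode followed by encode is the identity for all mapped codes; the opposite direction can be checked the same way.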
Compatibility-ST80
-
encoderNamed: encoderName
-
quick and dirty hack:
Usage example(s):
self encoderNamed:'foo'
self encoderNamed:'utf8'
self encoderNamed:'cp850'
|
-
platformName
-
accessing
-
nullEncoderInstance
-
class initialization
-
encoderClassesByName
-
-
initialize
-
already initialized
-
initializeEncoderClassesByName
-
initialize the dictionary which maps commonly used names
to encoder classes.
This is done because some encodings are known under several different names.
Usage example(s):
self initializeEncoderClassesByName
|
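To illustrate the alias mapping, the following sketch looks up one encoding under several names and prints what each lookup returns. The names are taken from the usage examples on this page; whether all of them resolve to the same encoder class depends on the contents of the dictionary and is not guaranteed here.

```smalltalk
#( 'latin1' 'iso8859-1' 'iso_8859-1' 'iso88591' 'l1' ) do:[:eachName |
    |encoder|

    "each alias is resolved separately"
    encoder := CharacterEncoder encoderFor:eachName.
    Transcript showCR:(eachName , ' -> ' , encoder printString).
].
```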
constants
-
jis7KanjiEscapeSequence
-
return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.
-
jis7KanjiOldEscapeSequence
-
return the escape sequence used to switch to kanji in some old jis7 encoded strings.
-
jis7RomanEscapeSequence
-
return the escape sequence used to switch to roman in jis7 encoded strings
-
jisISO2022EscapeSequence
-
return the escape sequence used to switch to kanji in iso2022 encoded strings
encoding & decoding
-
decodeString: anEncodedStringOrByteCollection
-
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello'
CharacterEncoderImplementations::ISO8859_1 decodeString:'hello' asByteArray
-
decodeString: aString from: oldEncoding
-
self encodeString:'hello' into:#ebcdic
self decodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic
self decodeString:(self encodeString:'hello' into:#binary) from:#binary
-
encode: codePoint from: oldEncodingArg into: newEncodingArg
-
-
encodeString: aUnicodeString
-
given a string in unicode, return a string in my encoding for it
Usage example(s):
CharacterEncoderImplementations::ISO8859_1 encodeString:'hello'
|
-
encodeString: aString from: oldEncodingArg into: newEncodingArg
-
some hard coded aliases
Usage example(s):
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'Äh ... hello' into:#ebcdic) from:#ebcdic into:#utf8
|
-
encodeString: aString into: newEncoding
-
self encodeString:'hello' into:#ebcdic
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#ascii
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#unicode
self encodeString:(self encodeString:'hello' into:#ebcdic) from:#ebcdic into:#utf8
instance creation
-
decoderForUTF8
-
return an encoder-instance which can map utf8 to/from unicode
Usage example(s):
self encoderForUTF8
self decoderForUTF8
|
-
encoderFor: encodingNameSymbol
-
given the name of an encoding, return an encoder-instance which can map these from/into unicode.
Usage example(s):
CharacterEncoder encoderFor:#'blabla2'
CharacterEncoder encoderFor:#'latin1'
self encoderFor:#'arabic'
self encoderFor:#'ms-arabic'
self encoderFor:#'cp1250'
self encoderFor:#'cp1251'
self encoderFor:#'cp1252'
self encoderFor:#'cp1253'
self encoderFor:#'iso8859-5'
self encoderFor:#'koi8-r'
self encoderFor:#'koi8-u'
self encoderFor:#'jis0208'
self encoderFor:#'jis7'
self encoderFor:#'utf8'
(self encoderFor:#'utf16le') encodeString:'hello'
(self encoderFor:#'utf16le') encode:5
(self encoderFor:#'utf16be') encodeString:'hello'
(self encoderFor:#'utf16be') encode:5
(self encoderFor:#'utf32le') encodeString:'hello'
(self encoderFor:#'utf32be') encodeString:'hello'
self encoderFor:#'sgml'
self encoderFor:#'java'
self encoderFor:#'cp850'
self encoderFor:#'CP850'
self encoderFor:#'ms-1258'
self encoderFor:#'windows-1258'
|
-
encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue
-
given the name of an encoding, return an encoder-instance which can map these from/into unicode.
Usage example(s):
CharacterEncoder encoderFor:#'latin1' => unicode->iso8859-1
self encoderFor:#'iso10646-1'
self encoderFor:#'ms-ansi'
self encoderFor:#'arabic'
self encoderFor:#'ms-arabic'
self encoderFor:#'iso8859-5'
self encoderFor:#'koi8-r'
self encoderFor:#'koi8-u'
self encoderFor:#'jis0208'
self encoderFor:#'jis7'
self encoderFor:#'unicode'
self encoderFor:#'UTF-8'
self encoderFor:'UTF-8'
self encoderFor:'iso_8859_1' => unicode->iso8859-1
self encoderFor:'iso_8859-1' => unicode->iso8859-1
self encoderFor:'iso88591'
self encoderFor:'l1'
self encoderFor:'ms-1258'
CharacterEncoder encoderFor:'ISO-2022-JP'
|
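A short sketch of the ifAbsent: variant, which supplies a fallback for an unknown name. It assumes that exceptionValue is evaluated like a block (the usual ifAbsent: convention) and uses nullEncoderInstance from the accessing protocol as the fallback; the name #'no-such-encoding' is made up.

```smalltalk
|encoder|

encoder := CharacterEncoder
                encoderFor:#'no-such-encoding'
                ifAbsent:[ CharacterEncoder nullEncoderInstance ].

"prints whichever encoder was chosen"
Transcript showCR:encoder printString.
```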
-
encoderForUTF8
-
return an encoder-instance which can map unicode into/from utf8
-
encoderToEncodeFrom: oldEncodingArg into: newEncodingArg
-
unicode -> something
Usage example(s):
CharacterEncoder initialize
CharacterEncoder encoderToEncodeFrom:#'latin1' into:#'jis7'
CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'mac-cyrillic'
CharacterEncoder encoderToEncodeFrom:#'ms-arabic' into:#'mac-arabic'
CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'koi8-r'
CharacterEncoder encoderToEncodeFrom:#'iso8859-5' into:#'unicode'
CharacterEncoder encoderToEncodeFrom:#'koi8-r' into:#'koi8-u'
CharacterEncoder encoderToEncodeFrom:#'utf-8' into:#unicode
|
private
-
flushCode
-
self flushCode
private-mapping setup
-
generateCode
-
-
generateSubclassCode
-
-
mapFileURL1_codeColumn
-
-
mapFileURL1_relativePathName
-
must be redefined in concrete subclass(es)
-
mapFileURL2_relativePathName
-
must be redefined in concrete subclass(es)
-
mappingURL1
-
-
mappingURL2
-
queries
-
alternativeNamesFor: encodingName do: aBlock
-
try replacing '_' by '-'
-
alternativeNamesFor: encodingName doWithExit: aBlock
-
try replacing '_' by '-'
Usage example(s):
self alternativeNamesFor:'bla-Foo_123' doWithExit:[:nm :exit |
Transcript showCR:nm
].
self alternativeNamesFor:'bla-Foo_123' doWithExit:[:nm :exit |
nm = 'bla-foo-123' ifTrue:[exit value].
Transcript showCR:nm
]
|
-
bomBytes
-
return the BOM (byte order mark) bytes or nil.
Only applicable for UTF encoders.
-
encodingNames
-
subclasses can provide a list of names;
(see https://encoding.spec.whatwg.org/#visualization)
eg: ISO_8859-5 may return
#( 'iso8859_5' 'iso8859-5' 'iso-8859-5' 'cyrillic' 'iso-ir-144' 'csisocyrillic' 'iso88595' 'iso_8859-5' 'iso_8859-5:1988')
-
isAbstract
-
Return true if this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine this again.
-
isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg
-
return true, if superSetEncoding encoding includes all characters of subSetEncoding.
(this means: characters are included - not that they have the same encoding)
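A minimal sketch of such a query. It assumes that the arguments are encoding names in the same form that encoderFor: accepts, which this page does not confirm. Since the ISO 8859 codesets include ASCII, one would expect a true answer for this pair, but the result is printed rather than asserted.

```smalltalk
|isSubset|

"does iso8859-1 include all characters of ascii?"
isSubset := CharacterEncoder isEncoding:#'ascii' subSetOf:#'iso8859-1'.
Transcript showCR:isSubset printString.
```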
-
maxCode
-
** This method must be redefined in concrete classes (subclassResponsibility) **
-
minCode
-
-
nameOfDecodedCode
-
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
-
nameOfEncoding
-
-
supportedExternalEncodings
-
return an array of arrays containing the names of the
encodings which are supported for external resources (i.e. files).
The first element contains the internally used symbolic name,
the second contains a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.
-
userFriendlyNameOfEncoding
-
utilities
-
detectAndSkipBOMInStream: stream
-
skips over the BOM and returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
If no BOM is detected, the stream is repositioned to where it was before.
Usage example(s):
|s enc|
s := #[1 2 3 4] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == nil).
self assert:(s position == 0).
s := #[16rEF 16rBB 16rBF 4 5 6] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == #utf8).
self assert:(s position == 3).
s := #[16rFF 2 3 4] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == nil).
self assert:(s position == 0).
s := #[16rFF 16rFE 3 4] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == #utf16le).
self assert:(s position == 2).
s := #[16rFF 16rFE 0 4 5] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == #utf16le).
self assert:(s position == 2).
s := #[16rFE 16rFF 3 4] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == #utf16be).
self assert:(s position == 2).
s := #[16rFF 16rFE 0 0 5 6] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == #utf32le).
self assert:(s position == 4).
s := #[0 0 16rFE 16rFF 5 6] readStream.
enc := self detectAndSkipBOMInStream:s.
self assert:(enc == #utf32be).
self assert:(s position == 4).
|
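The detected symbol can be passed on to encoderFor: in order to decode the rest of the stream. This is a sketch under two assumptions: that the returned symbols are valid encoding names for encoderFor: (the encoderFor: examples on this page show this for #'utf8'), and that decodeString: accepts the bytes returned by upToEnd.

```smalltalk
|s enc text|

"UTF-8 BOM followed by the bytes of 'hello'"
s := #[16rEF 16rBB 16rBF 104 101 108 108 111] readStream.

enc := CharacterEncoder detectAndSkipBOMInStream:s.
enc notNil ifTrue:[
    "the stream is now positioned behind the BOM"
    text := (CharacterEncoder encoderFor:enc) decodeString:s upToEnd.
    Transcript showCR:text.
].
```

If no BOM is found, enc is nil and the caller has to pick an encoding by other means, for example with guessEncodingOfBuffer:.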
-
detectBOMInBuffer: buffer
-
returns one of
#utf8
#utf32be
#utf32le
#utf16le
#utf16be
nil
-
guessEncodingOfBuffer: buffer
-
try to guess a string-buffer's encoding.
Basically looks for BOM (byte order marks)
or a special string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually found within the first few bytes of a textFile).
Many editors and tools write such comments (eg. emacs, st/x, etc.)
-
guessEncodingOfFile: aFilename
-
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
within the given file
(which is usually found in the first few bytes of a textFile).
If that's not found, use heuristics (in CharacterArray) to guess.
Return a symbol like #utf8.
Usage example(s):
self guessEncodingOfFile:'../../libview/resources/de.rs' asFilename
self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/de.rs' asFilename
self guessEncodingOfFile:'../../libview/resources/ru.rs' asFilename
self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/ru.rs' asFilename
self guessEncodingOfFile:'../../libview/resources/th.rs' asFilename
self guessEncodingOfFile:'/Users/exept/cg_work/stx/libview/resources/th.rs' asFilename
|
-
guessEncodingOfStream: aStream
-
look for a BOM (byte order mark) or a special string of the form:
encoding #name
or:
encoding: name
in the first few bytes of aStream.
Return a symbol like #utf8.
-
initializeEncodingDetectors
-
setup the list of encoding detectors.
This is a list of blocks, which get a buffer as argument,
and return an encoding symbol or nil.
Can be customized for more detectors
(used to be hard-coded in guessEncodingOfBuffer:)
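A sketch of what a single detector looks like, based only on the description above: a one-argument block which gets a buffer and answers an encoding symbol or nil. How detectors are added to the list is not shown on this page, so the sketch only builds and evaluates one block. That detectBOMInBuffer: accepts a ByteArray as buffer is an assumption, and the variable name is illustrative.

```smalltalk
|bomDetector|

"a detector is a block: buffer in, encoding symbol or nil out"
bomDetector := [:buffer | CharacterEncoder detectBOMInBuffer:buffer ].

"a buffer starting with the UTF-8 BOM, and one without any BOM"
Transcript showCR:(bomDetector value:#[16rEF 16rBB 16rBF 65 66]) printString.
Transcript showCR:(bomDetector value:#[65 66 67 68]) printString.
```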
-
showCharacterSet
-
font := (Font family:'courier' face:'medium' style:'roman' size:12 encoding:'iso10646-1').
Usage example(s):
CharacterEncoderImplementations::MS_Ansi showCharacterSet
CharacterEncoderImplementations::ISO8859_1 showCharacterSet
CharacterEncoderImplementations::ISO8859_2 showCharacterSet
CharacterEncoderImplementations::ISO8859_3 showCharacterSet
CharacterEncoderImplementations::ISO8859_4 showCharacterSet
CharacterEncoderImplementations::ISO8859_5 showCharacterSet
CharacterEncoderImplementations::ISO8859_6 showCharacterSet
CharacterEncoderImplementations::ISO8859_7 showCharacterSet
CharacterEncoderImplementations::ISO8859_8 showCharacterSet
CharacterEncoderImplementations::ISO8859_9 showCharacterSet
|
encoding & decoding
-
decodeString: anEncodedStringOrByteCollection
-
given a string in my encoding, return a unicode-string for it
** This method must be redefined in concrete classes (subclassResponsibility) **
-
encodeCharacter: aUnicodeCharacterOrCodePoint
-
encode aUnicodeCharacterOrCodePoint to an (8-bit) String or ByteArray
-
encodeString: aUnicodeString
-
given a string in unicode, return a string or ByteArray in my encoding for it
** This method must be redefined in concrete classes (subclassResponsibility) **
error handling
-
decodesToUnicode
-
answer true, if this encoder decodes data to unicode
-
decodingError
-
report an error that there is no unicode-codePoint for a given codePoint in this encoding
(which is unlikely), or that the encoding is undefined for that value
(for example, holes in the ISO-8859-3 encoding)
-
defaultDecoderValue
-
placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint. (typically 16rFFFF).
-
defaultDecoderValueFor: codePoint
-
no code exists when decoding codePoint;
concrete classes may provide a specific default value (typically 16rFFFF)
-
defaultEncoderValue
-
placed into an encoded string, in case there is no codePoint
for a given unicode codePoint. (typically $?).
-
defaultEncoderValueFor: unicodePoint
-
no code exists when encoding unicodePoint;
concrete classes may provide a specific default value (typically $?)
-
encodingError
-
report an error that some unicode-codePoint cannot be represented by this encoder
printing
-
printOn: aStream
-
(comment from inherited method)
append a user printed representation of the receiver to aStream.
The format is suitable for a human - not meant to be read back.
The default here is to output the receiver's class name.
BUT: this method is heavily redefined for objects which
can print prettier.
queries
-
characterSize: charOrCodePoint
-
return the number of bytes required to encode charOrCodePoint
-
isEncoderFor: encoding
-
does this encode to encoding?
-
isNullEncoder
-
-
maxCode
-
-
minCode
-
-
nameOfDecodedCode
-
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
-
nameOfEncoding
-
-
userFriendlyNameOfEncoding
-
stream support
-
encodeCharacter: aUnicodeCharacter on: aStream
-
given a character in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
-
encodeString: aUnicodeString on: aStream
-
given a string in unicode, encode it onto aStream.
Subclasses can redefine this to avoid allocating many new string instances.
(but must then also redefine encodeString:aUnicodeString to collect the characters)
-
readNext: countArg charactersFrom: aStream
-
-
readNextCharacterFrom: aStream
-
** This method must be redefined in concrete classes (subclassResponsibility) **
testing
-
isISO8859_1Encoder
-
answer true, if this encodes from/to ISO8859_1
-
isUnicodeSubsetEncoder
-
answer true, if this encodes a subset of Unicode, that is a 1-to-1
mapping to unicode
-
isUtf16Encoder
-
answer true, if this encodes from/to UTF-16 (regardless of byte-order)
-
isUtfEncoder
-
answer true, if this encodes from/to any UTF (regardless of how many bytes and byte-order).
In other words: does it make sense to prepend a BOM?
CompoundEncoder
DefaultEncoder
InverseEncoder
NullEncoder
OtherEncoding
TwoStepEncoder
|s1 s2|
s1 := 'hello'.
s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
s2
|
|s1 s2|
s1 := 'hello'.
s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
s2
|
|