Class: CharacterEncoder
Object
|
+--CharacterEncoder
   |
   +--CharacterEncoder::CompoundEncoder
   |
   +--CharacterEncoder::DefaultEncoder
   |
   +--CharacterEncoder::InverseEncoder
   |
   +--CharacterEncoder::NullEncoder
   |
   +--CharacterEncoder::OtherEncoding
   |
   +--CharacterEncoder::TwoStepEncoder
   |
   +--CharacterEncoderImplementations::ISO10646_1
   |
   +--CharacterEncoderImplementations::SingleByteEncoder
   |
   +--CharacterEncoderImplementations::TwoByteEncoder
- Package: stx:libbasic
- Category: Collections-Text-Encodings
- Version: rev: 1.106 date: 2009/12/11 16:54:00
- user: cg
- file: CharacterEncoder.st directory: libbasic
- module: stx stc-classLibrary: libbasic
- Author: Claus Gittinger
unfinished code - please read howToAddMoreCoders.
Character mappings are based on information in character maps found at either:
http://std.dkuug.dk/i18n/charmaps
or:
http://www.unicode.org/Public/MAPPINGS
No Warranty.
All the ISO 8859 codesets include ASCII as a proper subset:
ISO 8859-1: Latin 1 - Western European Languages.
ISO 8859-2: Latin 2 - Eastern European Languages.
ISO 8859-3: Latin 3 - Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish and Turkish.
ISO 8859-4: Latin 4 - Danish, English, Estonian, Finnish, German, Greenlandic, Lappish and Latvian.
ISO 8859-5: Latin/Cyrillic - Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croat and Ukrainian.
ISO 8859-6: Latin/Arabic - Arabic.
ISO 8859-7: Latin/Greek - Greek.
ISO 8859-8: Latin/Hebrew - Hebrew.
ISO 8859-9: Latin 5 - Danish, Dutch, English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish and Turkish.
ISO 8859-10: Latin 6 - Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic, Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese and Swedish.
Compatibility-ST80
-
encoderNamed: encoderName
-
quick-and-dirty hack
-
platformName
-
accessing
-
nullEncoderInstance
-
class initialization
-
initialize
-
constants
-
jis7KanjiEscapeSequence
-
return the escape sequence used to switch to kanji in jis7 encoded strings.
This happens to be the same as ISO2022-JP's escape sequence.
-
jis7KanjiOldEscapeSequence
-
return the escape sequence used to switch to kanji in some old jis7 encoded strings.
-
jis7RomanEscapeSequence
-
return the escape sequence used to switch to roman in jis7 encoded strings
-
jisISO2022EscapeSequence
-
return the escape sequence used to switch to kanji in iso2022 encoded strings
encoding & decoding
-
decode: aCodePoint
-
-
decodeString: aString
-
-
decodeString: aString from: oldEncoding
-
-
encode: aCodePoint
-
-
encode: codePoint from: oldEncodingArg into: newEncodingArg
-
-
encodeString: aUnicodeString
-
given a string in unicode, return a string in my encoding for it
-
encodeString: aString from: oldEncodingArg into: newEncodingArg
-
-
encodeString: aString into: newEncoding
-
instance creation
-
encoderFor: encodingNameSymbol
-
given the name of an encoding, return an encoder-instance which can map these from/into unicode.
-
encoderFor: encodingNameSymbolArg ifAbsent: exceptionValue
-
given the name of an encoding, return an encoder-instance which can map these from/into unicode.
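A minimal sketch of how the two lookup methods might be called, using only selectors listed on this page. The name #'no-such-encoding' is hypothetical, and passing nil as the exceptionValue is an assumption; this page does not say what kind of object that argument must be or what is returned when the lookup fails.

```smalltalk
|encoder fallback|

"look up an encoder by symbolic name (name taken from the examples on this page)"
encoder := CharacterEncoder encoderFor:#'iso8859-1'.

"hypothetical unknown name; nil is assumed to be acceptable as the exceptionValue"
fallback := CharacterEncoder encoderFor:#'no-such-encoding' ifAbsent:nil.

"print both results; nothing is assumed here about what they look like"
Transcript showCR:encoder printString.
Transcript showCR:fallback printString.
```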
-
encoderForUTF8
-
return an encoder-instance which can map unicode into/from utf8
-
encoderToEncodeFrom: oldEncodingArg into: newEncodingArg
-
private
-
flushCode
-
private-mapping setup
-
generateCode
-
-
generateSubclassCode
-
-
mapFileURL1_codeColumn
-
-
mapFileURL1_relativePathName
-
raise an error: must be redefined in concrete subclass(es)
-
mapFileURL2_relativePathName
-
raise an error: must be redefined in concrete subclass(es)
-
mappingURL1
-
raise an error: must be redefined in concrete subclass(es)
-
mappingURL2
-
raise an error: must be redefined in concrete subclass(es)
queries
-
isEncoding: subSetEncodingArg subSetOf: superSetEncodingArg
-
return true if superSetEncoding includes all characters of subSetEncoding.
(this means that the characters are included, not that they have the same code values)
-
nameOfDecodedCode
-
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
-
nameOfEncoding
-
-
supportedExternalEncodings
-
return an array of arrays containing the names of the encodings
which are supported for external resources (i.e. files).
The first element of each entry is the internally used symbolic name,
the second is a user-readable string (description).
More than one external name may be mapped onto the same symbolic name.
-
userFriendlyNameOfEncoding
-
testing
-
isAbstract
-
Return whether this class is an abstract class.
True is returned for CharacterEncoder here; false for subclasses.
Abstract subclasses must redefine this again.
utilities
-
guessEncodingOfBuffer: buffer
-
look for a string of the form
encoding #name
or:
encoding: name
within the given buffer
(which is usually the first few bytes of a textFile).
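The lookup described above could be tried along these lines. This is only a sketch: the sample buffer is made up, and since this page does not document the return value (or what happens when no hint is found), the result is simply printed.

```smalltalk
|buffer guess|

"made-up sample standing in for the first few bytes of a text file;
 it carries a hint of the form  encoding: name"
buffer := 'encoding: iso8859-1
hello world'.

guess := CharacterEncoder guessEncodingOfBuffer:buffer.

"the kind of object returned is not documented here, so just print it"
Transcript showCR:guess printString.
```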
-
guessEncodingOfFile: aFilename
-
look for a string of the form
encoding #name
or:
encoding: name
within the first few bytes of the given file.
If that is not found, use heuristics (in CharacterArray) to guess.
-
guessEncodingOfStream: aStream
-
look for a string of the form
encoding #name
or:
encoding: name
in the first few bytes of aStream.
-
showCharacterSet
-
encoding & decoding
-
decode: anEncoding
-
given an integer in my encoding, return a unicode codePoint for it
** This method raises an error - it must be redefined in concrete classes **
-
decodeString: anEncodedString
-
given a string in my encoding, return a unicode-string for it
-
encode: aCodePoint
-
given a codePoint in unicode, return a byte in my encoding for it
** This method raises an error - it must be redefined in concrete classes **
-
encodeString: aUnicodeString
-
given a string in unicode, return a string in my encoding for it
error handling
-
decodingError
-
report an error that there is no unicode-codePoint for a given codePoint in this encoding
(which is unlikely), or that the encoding is undefined for that value
(for example, holes in the ISO8859-3 encoding)
-
defaultDecoderValue
-
placed into a decoded string, in case there is no unicode codePoint
for a given encoded codePoint (typically 16rFFFF).
-
defaultEncoderValue
-
placed into an encoded string, in case there is no codePoint
for a given unicode codePoint (typically $?).
-
encodingError
-
report an error that some unicode-codePoint cannot be represented by this encoder
printing
-
printOn: aStream
-
private
-
newString: size
-
** This method raises an error - it must be redefined in concrete classes **
queries
-
characterSize: codePoint
-
return the number of bytes required to encode codePoint
** This method raises an error - it must be redefined in concrete classes **
-
isNullEncoder
-
-
nameOfDecodedCode
-
Most coders decode from their code into unicode / encode from unicode into their code.
There are a few exceptions to this, though - these must redefine this.
-
nameOfEncoding
-
-
userFriendlyNameOfEncoding
-
stream support
-
readNext: charactersToRead charactersFrom: stream
-
-
readNextCharacterFrom: aStream
-
-
readNextInputCharacterFrom: aStream
-
CompoundEncoder
DefaultEncoder
InverseEncoder
NullEncoder
OtherEncoding
TwoStepEncoder
|s1 s2|
s1 := 'hello'.
s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'unicode'.
s2

|s1 s2|
s1 := 'hello'.
s2 := CharacterEncoder encodeString:s1 from:#'iso8859-1' into:#'iso8859-7'.
s2
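A further sketch, in the same spirit as the examples above, of an encode/decode round trip through the encoder returned by encoderForUTF8. It uses only selectors listed on this page (encoderForUTF8, encodeString:, decodeString:); that decoding returns a string equal to the original is an expectation, not something this page states.

```smalltalk
|utf8 encoded decoded|

"encoder mapping unicode into/from utf8"
utf8 := CharacterEncoder encoderForUTF8.

"unicode string -> utf8-encoded string -> unicode string"
encoded := utf8 encodeString:'hello'.
decoded := utf8 decodeString:encoded.

"prints whether the round trip reproduced the original string"
Transcript showCR:(decoded = 'hello') printString.
```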