Class: ISO10646_to_UTF8_MAC (in CharacterEncoderImplementations)
Object
|
+--CharacterEncoder
   |
   +--CharacterEncoderImplementations::VariableBytesEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_to_UTF8
         |
         +--CharacterEncoderImplementations::ISO10646_to_UTF8_MAC
- Package:
  - stx:libbasic
- Category:
  - Collections-Text-Encodings
- Version:
  - rev: 1.12 date: 2018/01/19 13:43:38
  - user: stefan
  - file: CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st directory: libbasic
  - module: stx stc-classLibrary: libbasic
- Author:
  - Claus Gittinger
Unicode (and therefore UTF-8) can represent some characters with diacritics (e.g. umlauts) in more than one way:
- either as a single precomposed Unicode code point (e.g. ae -> ä (U+00E4) -> C3 A4)
- or in so-called 'Normalization Form Canonical Decomposition' (NFD), i.e. as a regular 'a' followed by a
combining diacritical mark (for example: diaeresis or acute).
Mac OS X needs the second form for its file names.
However, OS X does not decompose the ranges U+2000-U+2FFF, U+F900-U+FAFF and U+2F800-U+2FAFF.
This is a quick-and-dirty hack to support at least the characters of the first page (Latin-1).
It will be enhanced for the 2nd and 3rd Unicode pages when I find time.
[caveat:]
only a small subset of multi-composes (for example: trema plus acute) is supported so far
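The two representations described above can be made concrete with a short sketch. It is written in Python rather than Smalltalk and uses the standard `unicodedata` module as a stand-in for the class's own DecomposeMap, so it only illustrates the byte-level difference and is not the implementation of this class.

```python
import unicodedata

# 'ä' as a single precomposed code point (U+00E4)
composed = "\u00e4"
# The same character in NFD: 'a' followed by U+0308 COMBINING DIAERESIS
decomposed = unicodedata.normalize("NFD", composed)

print(list(composed.encode("utf-8")))    # [195, 164]      i.e. C3 A4
print(list(decomposed.encode("utf-8")))  # [97, 204, 136]  i.e. 61 CC 88
```

The second byte sequence is the shape that appears in the UTF-8-MAC usage examples of this class.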
[instance variables:]
[class variables:]
ComposeMap DecomposeMap
http://developer.apple.com/library/mac/#qa/qa2001/qa1173.html
initialization
-
initializeDecomposeMap
-
initialize the map which decomposes a diacritical character into its two components
encoding & decoding
-
compositionOf: baseChar with: diacriticalChar to: outStream
-
compose two characters into one,
e.g. a + umlaut-diacritic-mark -> ä.
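As a rough Python analogue of this composition step (a sketch only), the hypothetical helper `composition_of` below writes the composed character to a stream. It relies on `unicodedata.normalize("NFC", ...)` instead of the class's ComposeMap, and it assumes that a pair without a precomposed form is written unchanged; how the Smalltalk method handles that case is not stated here.

```python
import io
import unicodedata

def composition_of(base_char, diacritical_char, out_stream):
    """Hypothetical helper: write base + mark as one composed character if one exists."""
    # NFC leaves the pair as two characters when no precomposed form exists.
    out_stream.write(unicodedata.normalize("NFC", base_char + diacritical_char))

out = io.StringIO()
composition_of("a", "\u0308", out)   # 'a' + COMBINING DIAERESIS
print(out.getvalue() == "\u00e4")    # True: the stream now holds 'ä'
```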
-
decodeString: aStringOrByteCollection
-
return a Unicode string from the passed-in UTF-8-MAC encoded string.
UTF-8-MAC is UTF-8 with composed characters decomposed
(i.e. as separate codes, not as single combined characters).
For now, this is a limited version, which should work
at least for most European languages...
usage example(s):
(ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
-> #[97 195 164 111 195 182 117 195 188]
(ISO10646_to_UTF8 new decodeString:
(ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray)
(ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
-> #[97 97 204 136 111 111 204 136 117 117 204 136]
(ISO10646_to_UTF8_MAC new decodeString:
(ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray)
-
decompositionOf: codePointIn into: outBlockWithTwoArgs
-
if required, decompose a diacritical character into a base character and a combining diacritical mark;
e.g. ä -> a + umlaut-diacritic-mark.
Pass both as arguments to the given block.
For non-diacritical characters, pass nil as the diacritical-mark value.
Return true if a decomposition was done.
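The contract described here (callback with base and mark, boolean result) might look as follows in Python. The function name `decomposition_of` and the use of `unicodedata.decomposition` in place of the DecomposeMap are assumptions made for illustration only.

```python
import unicodedata

def decomposition_of(code_point, into):
    """Hypothetical helper: call into(base, mark); return True if a split was done."""
    fields = unicodedata.decomposition(chr(code_point)).split()
    # A canonical two-part decomposition has exactly two code points
    # and no "<tag>" prefix (tags mark compatibility decompositions).
    if len(fields) == 2 and not fields[0].startswith("<"):
        base, mark = (int(field, 16) for field in fields)
        into(base, mark)
        return True
    # Not decomposable: pass the code point itself and no mark.
    into(code_point, None)
    return False

calls = []
print(decomposition_of(0xE4, lambda b, m: calls.append((b, m))))  # True
print(decomposition_of(0x61, lambda b, m: calls.append((b, m))))  # False
print(calls)  # [(97, 776), (97, None)]
```

Here 776 is U+0308, the combining diaeresis.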
-
encodeCharacter: aUnicodeCharacter on: aStream
-
write the UTF-8-MAC representation of aUnicodeCharacter onto aStream.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).
For now, this is a limited version, which should work
at least for most European languages...
-
encodeString: aUnicodeString
-
return the UTF-8-MAC representation of aUnicodeString.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).
For now, this is a limited version, which should work
at least for most European languages...
usage example(s):
(self encodeString:'hello') asByteArray -> #[104 101 108 108 111]
(self encodeString:(Character value:16r40) asString) asByteArray -> #[64]
(self encodeString:(Character value:16r7F) asString) asByteArray -> #[127]
(self encodeString:(Character value:16r80) asString) asByteArray -> #[194 128]
(self encodeString:(Character value:16rFF) asString) asByteArray -> #[195 191]
(ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
-> #[97 195 164 111 195 182 117 195 188]
(ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
-> #[97 97 204 136 111 111 204 136 117 117 204 136]
ISO10646_to_UTF8_MAC new decodeString:
(ISO10646_to_UTF8_MAC new encodeString:'Packages aus VSE für Smalltalk_X') asByteArray
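For comparison, the encoding rule from the class description, including the ranges that OS X leaves composed, can be sketched in Python. The names `EXCLUDED_RANGES`, `is_excluded` and `encode_utf8_mac` are hypothetical, and full per-character NFD is assumed where the class uses its limited DecomposeMap, so results may differ outside the Latin-1 characters.

```python
import unicodedata

# Ranges that, according to the class description, OS X does not decompose
EXCLUDED_RANGES = ((0x2000, 0x2FFF), (0xF900, 0xFAFF), (0x2F800, 0x2FAFF))

def is_excluded(code_point):
    """True if the code point lies in a range that is left composed."""
    return any(low <= code_point <= high for low, high in EXCLUDED_RANGES)

def encode_utf8_mac(text):
    """Hypothetical helper: decompose each character unless excluded, then UTF-8 encode."""
    parts = []
    for ch in text:
        if is_excluded(ord(ch)):
            parts.append(ch)                               # keep as-is
        else:
            parts.append(unicodedata.normalize("NFD", ch))  # base + combining marks
    return "".join(parts).encode("utf-8")

print(list(encode_utf8_mac("aäoöuü")))
# [97, 97, 204, 136, 111, 111, 204, 136, 117, 117, 204, 136]
```

A character such as U+F900 falls in an excluded range and is therefore encoded unchanged, whereas plain NFD would replace it with its canonical equivalent.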
-
encodeString: aUnicodeString on: aStream
-
write the UTF-8-MAC representation of aUnicodeString onto aStream.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).
For now, this is a limited version, which should work
at least for most European languages...
-
readNextCharacterFrom: aStream
-
(comment from inherited method)
decode the next character or byte on aStream from UTF-8 to Unicode
queries
-
nameOfEncoding
-