Class: ISO10646_to_UTF8_MAC (in CharacterEncoderImplementations)
Object
|
+--CharacterEncoder
   |
   +--CharacterEncoderImplementations::VariableBytesEncoder
      |
      +--CharacterEncoderImplementations::ISO10646_to_UTF8
         |
         +--CharacterEncoderImplementations::ISO10646_to_UTF8_MAC
- Package: stx:libbasic
- Category: Collections-Text-Encodings
- Version: rev: 1.14 date: 2023/05/24 08:33:23
- user: stefan
- file: CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st directory: libbasic
- module: stx stc-classLibrary: libbasic
UTF-8 can encode some diacritical characters (e.g. umlauts) in multiple ways:
- either as a single Unicode code point (e.g. ä -> U+00E4 -> C3 A4)
- or in so-called 'Normalization Form Canonical Decomposition' (NFD), i.e. as a regular 'a' followed by a
combining diacritical mark (for example: diaeresis or acute).
Mac OS X needs the second form for its file names.
However, OS X does not decompose the ranges U+2000-U+2FFF, U+F900-U+FAFF and U+2F800-U+2FAFF.
This is a quick-and-dirty hack, to at least support the first page (Latin-1) characters.
It will be enhanced for the 2nd and 3rd Unicode pages when I find time.
[caveat:]
only a small subset of multi-composes is supported so far (for example: trema plus acute)
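The two forms described above can be cross-checked outside Smalltalk/X. The following is a minimal Python sketch, not this class's implementation: it relies on the standard `unicodedata` module, and the helper name `to_utf8_mac` as well as the per-character handling of the excluded ranges are assumptions made for illustration only.

```python
import unicodedata

# Ranges that, per the description above, OS X leaves undecomposed.
EXCLUDED_RANGES = ((0x2000, 0x2FFF), (0xF900, 0xFAFF), (0x2F800, 0x2FAFF))

def to_utf8_mac(text):
    """Hypothetical helper: decompose each character canonically (NFD),
    skip the excluded ranges, then encode the result as UTF-8.
    Simplification: marks are not reordered across character boundaries."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES):
            out.append(ch)  # excluded range: keep the character as it is
        else:
            out.append(unicodedata.normalize("NFD", ch))
    return "".join(out).encode("utf-8")

composed = "\u00e4".encode("utf-8")   # single code point U+00E4
decomposed = to_utf8_mac("\u00e4")    # 'a' followed by U+0308
print(list(composed), list(decomposed))
```

For 'ä' this gives the bytes C3 A4 in the composed form and 61 CC 88 in the decomposed form, which agrees with the byte arrays shown in the usage examples of this class.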
[instance variables:]
[class variables:]
ComposeMap DecomposeMap
copyright
COPYRIGHT (c) 2015 by eXept Software AG
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
initialization
-
initializeDecomposeMap
-
the map which decomposes a diacritical character into its two components
encoding & decoding
-
compositionOf: baseChar with: diacriticalChar to: outStream
-
compose two characters into one,
e.g. a + umlaut-diacritic-mark -> ä.
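As an illustration of the composition step, here is a small Python sketch. It is an analogue under stated assumptions, not the method's actual code: `composition_of` is a hypothetical name, composition is delegated to `unicodedata.normalize` with NFC, and the boolean return value is a choice made for this sketch.

```python
import io
import unicodedata

def composition_of(base_char, diacritical_char, out_stream):
    """Hypothetical analogue: write the single composed character if one
    exists, otherwise write both inputs unchanged. Returns True if composed."""
    composed = unicodedata.normalize("NFC", base_char + diacritical_char)
    out_stream.write(composed)
    return len(composed) == 1

out = io.StringIO()
was_composed = composition_of("a", "\u0308", out)  # a + combining diaeresis
print(repr(out.getvalue()), was_composed)
```

A pair without a precomposed form, such as a digit followed by a combining diaeresis, is written through unchanged.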
-
decodeString: aStringOrByteCollection
-
return a Unicode string from the passed-in UTF-8-MAC encoded string.
This is UTF-8 with composed characters decomposed
(i.e. as separate codes, not as single combined characters).
For now, this is a limited version, which should work
at least for most European countries...
Usage example(s):
(ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
-> #[97 195 164 111 195 182 117 195 188]
(ISO10646_to_UTF8 new decodeString:
(ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray)
(ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
-> #[97 97 204 136 111 111 204 136 117 117 204 136]
(ISO10646_to_UTF8_MAC new decodeString:
(ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray)
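The decoding direction of the examples above can be reproduced with a short Python sketch. This is an assumption-laden cross-check, not the class's algorithm: `decode_utf8_mac` is a hypothetical helper, and it applies full NFC recomposition, which covers more characters than the limited subset this class supports.

```python
import unicodedata

def decode_utf8_mac(data):
    """Hypothetical helper: decode UTF-8 bytes, then recompose
    base character + combining mark pairs via NFC."""
    return unicodedata.normalize("NFC", bytes(data).decode("utf-8"))

# Byte array taken from the ISO10646_to_UTF8_MAC usage example above.
mac_bytes = [97, 97, 204, 136, 111, 111, 204, 136, 117, 117, 204, 136]
text = decode_utf8_mac(mac_bytes)
print(text)
```

Feeding it the plain UTF-8 bytes of the first example yields the same string, since already-composed input is left unchanged by NFC.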
-
decompositionOf: codePointIn into: outBlockWithTwoArgs
-
if required, decompose a diacritical character into a base character and a combining diacritical mark;
e.g. ä -> a + umlaut-diacritic-mark.
Pass both as args to the given block.
For non-diacritical characters, pass nil as the diacritical mark value.
Return true if a decomposition was done.
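The contract described above (a two-argument block, a nil mark for non-diacritical characters, a boolean result) can be sketched in Python as follows. This is a hypothetical analogue: it looks up the decomposition in the standard `unicodedata` tables rather than in this class's DecomposeMap, and `decomposition_of` is an illustrative name.

```python
import unicodedata

def decomposition_of(code_point, out_block):
    """Hypothetical analogue: call out_block(base, mark) with the two
    components of a canonical decomposition, or out_block(code_point, None)
    if there is none. Returns True if a decomposition was done."""
    fields = unicodedata.decomposition(chr(code_point)).split()
    # Compatibility decompositions start with a tag such as '<compat>'; skip them.
    if len(fields) == 2 and not fields[0].startswith("<"):
        base, mark = (int(f, 16) for f in fields)
        out_block(base, mark)
        return True
    out_block(code_point, None)
    return False

calls = []
done = decomposition_of(0xE4, lambda base, mark: calls.append((base, mark)))
print(done, calls)
```

Only one level is decomposed per call, so a character carrying two marks yields a base that is itself still a composed character.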
-
encodeCharacter: aUnicodeCharacter on: aStream
-
write the UTF-8-MAC representation of aUnicodeCharacter to aStream.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).
For now, this is a limited version, which should work
at least for most European countries...
-
encodeString: aUnicodeString
-
return the UTF-8-MAC representation of aUnicodeString.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).
For now, this is a limited version, which should work
at least for most European countries...
Usage example(s):
(self encodeString:'hello') asByteArray
-> #[104 101 108 108 111]
(self encodeString:(Character value:16r40) asString) asByteArray
-> #[64]
(self encodeString:(Character value:16r7F) asString) asByteArray
-> #[127]
(self encodeString:(Character value:16r80) asString) asByteArray
-> #[194 128]
(self encodeString:(Character value:16rFF) asString) asByteArray
-> #[195 191]
(ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
-> #[97 195 164 111 195 182 117 195 188]
(ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
-> #[97 97 204 136 111 111 204 136 117 117 204 136]
ISO10646_to_UTF8_MAC new decodeString:
(ISO10646_to_UTF8_MAC new encodeString:'Packages aus VSE für Smalltalk_X') asByteArray
-
encodeString: aUnicodeString on: aStream
-
write the UTF-8-MAC representation of aUnicodeString to aStream.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).
For now, this is a limited version, which should work
at least for most European countries...
-
readNextCharacterFrom: aStream
-
(comment from inherited method)
decode the next character or byte on aStream from UTF-8 to Unicode
queries
-
nameOfEncoding
-