Smalltalk/X Webserver

Documentation of class 'CharacterEncoderImplementations::ISO10646_to_UTF8_MAC':

Class: ISO10646_to_UTF8_MAC (in CharacterEncoderImplementations)


Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder
         |
         +--CharacterEncoderImplementations::ISO10646_to_UTF8
            |
            +--CharacterEncoderImplementations::ISO10646_to_UTF8_MAC

Package:
stx:libbasic
Category:
Collections-Text-Encodings
Version:
rev: 1.12 date: 2018/01/19 13:43:38
user: stefan
file: CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st directory: libbasic
module: stx stc-classLibrary: libbasic
Author:
Claus Gittinger

Description:


UTF-8 can encode some characters with diacritical marks (e.g. umlauts) in more than one way:
    - either as a single precomposed Unicode code point (e.g. ae -> ä -> U+00E4 -> C3 A4)
    - or in the so-called 'Normalization Form Canonical Decomposition' (NFD), i.e. as a regular 'a'
      followed by a combining diacritical mark (for example: diaeresis U+0308 -> CC 88, or acute).

Mac OS X needs the second (decomposed) form for its file names.
However, OS X does not decompose the ranges U+2000-U+2FFF, U+F900-U+FAFF and U+2F800-U+2FAFF.

This is a quick-and-dirty hack to support at least the characters of the first page (Latin-1).
It will be enhanced for the 2nd and 3rd Unicode pages when I find time.

[caveat:]
    only a small subset of multiple compositions is supported so far (for example: trema plus acute)


[instance variables:]

[class variables:]
    ComposeMap DecomposeMap


Related information:

    http://developer.apple.com/library/mac/#qa/qa2001/qa1173.html

Class protocol:

initialization
o  initializeDecomposeMap
the map which decomposes a diacritical character into its two components


Instance protocol:

encoding & decoding
o  compositionOf: baseChar with: diacriticalChar to: outStream
compose two characters into one
a + umlaut-diacritic-mark -> ä.

o  decodeString: aStringOrByteCollection
return a Unicode string from the passed-in UTF-8-MAC encoded string.
UTF-8-MAC is UTF-8 with composed characters decomposed
(i.e. as separate codes, not as single combined characters).

For now, this is a limited version, which should work
at least for most European languages...

usage example(s):
     (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
        -> #[97 195 164 111 195 182 117 195 188]

     (ISO10646_to_UTF8 new decodeString:
            (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray)

     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
        -> #[97 97 204 136 111 111 204 136 117 117 204 136]

     (ISO10646_to_UTF8_MAC new decodeString:
            (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray)
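
The round trip can also be checked in a workspace. The following is a minimal sketch, not part of the class: it assumes evaluation outside the CharacterEncoderImplementations namespace (hence the fully qualified class name), and the variable names are made up for illustration. The test string is built from code points so that the snippet does not depend on the encoding of the source text.

```smalltalk
|encoder original bytes decoded|

"assumption: evaluated in a workspace, so the class name is fully qualified"
encoder := CharacterEncoderImplementations::ISO10646_to_UTF8_MAC new.

"a followed by a-umlaut (U+00E4), built from code points"
original := 'a' , (Character value:16rE4) asString.

"encode to UTF-8-MAC, then decode the resulting bytes again"
bytes := (encoder encodeString:original) asByteArray.
decoded := encoder decodeString:bytes.

"show the encoded bytes and the length of the decoded string"
Transcript showCR:bytes printString.
Transcript showCR:decoded size printString.
```

Going by the byte arrays listed in the usage examples, the encoded bytes should start with 97 97 204 136. A decoded size of 2 would indicate that the base character and the combining mark were recomposed into a single character.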

o  decompositionOf: codePointIn into: outBlockWithTwoArgs
if required, decompose a diacritical character into a base character and a combining diacritical mark;
e.g. ä -> a + umlaut-diacritic-mark.
Pass both as args to the given block.
For non-diacritical characters, pass nil as the diacritical mark.
Return true if a decomposition was done.
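
A minimal workspace sketch of how this method might be called, based only on the description above. The helper block `report` and all variable names are hypothetical. The block arguments are printed with printString, so the sketch makes no assumption about whether they are characters or code points.

```smalltalk
|encoder report|

"assumption: evaluated in a workspace, so the class name is fully qualified"
encoder := CharacterEncoderImplementations::ISO10646_to_UTF8_MAC new.

"hypothetical helper: runs one decomposition, logs the block arguments and the return value"
report := [:codePoint |
    |done|
    done := encoder
                decompositionOf:codePoint
                into:[:base :mark |
                    Transcript showCR:('base: ' , base printString , '  mark: ' , mark printString)
                ].
    Transcript showCR:('decomposed: ' , done printString).
].

report value:16rE4.    "a-umlaut: per the description, a base plus a diacritical mark"
report value:16r61.    "plain a: per the description, the mark is nil"
```

According to the description, the first call should report a decomposition and the second should pass nil as the mark.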

o  encodeCharacter: aUnicodeCharacter on: aStream
write the UTF-8-MAC representation of aUnicodeCharacter to aStream.
UTF-8-MAC is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European languages...

o  encodeString: aUnicodeString
return the UTF-8-MAC representation of aUnicodeString.
UTF-8-MAC is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European languages...

usage example(s):
     (self encodeString:'hello') asByteArray                             #[104 101 108 108 111]
     (self encodeString:(Character value:16r40) asString) asByteArray    #[64]
     (self encodeString:(Character value:16r7F) asString) asByteArray    #[127]
     (self encodeString:(Character value:16r80) asString) asByteArray    #[194 128]
     (self encodeString:(Character value:16rFF) asString) asByteArray    #[195 191]

     (ISO10646_to_UTF8     new encodeString:'aäoöuü') asByteArray
        -> #[97 195 164 111 195 182 117 195 188]
     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
        -> #[97 97 204 136 111 111 204 136 117 117 204 136]

     ISO10646_to_UTF8_MAC new decodeString:
         (ISO10646_to_UTF8_MAC new encodeString:'Packages aus VSE für Smalltalk_X') asByteArray

o  encodeString: aUnicodeString on: aStream
write the UTF-8-MAC representation of aUnicodeString to aStream.
UTF-8-MAC is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European languages...

o  readNextCharacterFrom: aStream
(comment from inherited method)
decode the next character or byte on aStream from utf-8 to unicode

queries
o  nameOfEncoding



ST/X 7.1.0.0; WebServer 1.663 at exept.de:8081; Sat, 22 Sep 2018 07:54:01 GMT