Documentation of class 'CharacterEncoderImplementations::ISO10646_to_UTF8_MAC':

Class: ISO10646_to_UTF8_MAC (in CharacterEncoderImplementations)


Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder
         |
         +--CharacterEncoderImplementations::ISO10646_to_UTF8
            |
            +--CharacterEncoderImplementations::ISO10646_to_UTF8_MAC

Package:
stx:libbasic
Category:
Collections-Text-Encodings
Version:
rev: 1.14 date: 2023/05/24 08:33:23
user: stefan
file: CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st directory: libbasic
module: stx stc-classLibrary: libbasic

Description:


Unicode text encoded as UTF-8 can represent some characters with diacritical marks (such as umlauts) in more than one way:
    - either as a single precomposed Unicode code point (e.g. ae -> ä -> U+00E4 -> UTF-8 bytes C3 A4)
    - or in the so-called 'Normalization Form Canonical Decomposition' (NFD), i.e. as a regular 'a' followed by a
      combining diacritical mark (for example: diaeresis or acute).

Mac OS X needs the second (decomposed) form for its file names.
However, OS X does not decompose the ranges U+2000-U+2FFF, U+F900-U+FAFF and U+2F800-U+2FAFF.
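
The difference between the two forms can be seen by encoding the same string with both encoders. The following is a minimal sketch, meant to be evaluated in a Smalltalk/X workspace. It uses only the encodeString: and decodeString: messages documented for this class; the sample string and the variable names are illustrative, and the byte counts in the comments are derived from the byte arrays listed in the usage examples of this class, not from a separate run.

```smalltalk
"Sketch only: compare plain UTF-8 with UTF-8-MAC for the same string.
 The variable names and the sample string are illustrative assumptions."
| name composedBytes decomposedBytes decoded |

name := 'aäoöuü'.

"Plain UTF-8: each umlaut is one precomposed code point (2 bytes each)."
composedBytes := (CharacterEncoderImplementations::ISO10646_to_UTF8 new
                    encodeString: name) asByteArray.

"UTF-8-MAC: each umlaut becomes its base letter (1 byte)
 followed by a combining mark (2 bytes)."
decomposedBytes := (CharacterEncoderImplementations::ISO10646_to_UTF8_MAC new
                    encodeString: name) asByteArray.

"9 bytes versus 12 bytes, according to the documented byte arrays."
Transcript showCR: composedBytes size printString.
Transcript showCR: decomposedBytes size printString.

"Decoding the decomposed bytes is expected to recombine the umlauts."
decoded := CharacterEncoderImplementations::ISO10646_to_UTF8_MAC new
                decodeString: decomposedBytes.
Transcript showCR: decoded.
```

The decomposed form is longer because every composed character is written as two code points; this matters when comparing file names byte by byte.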

This is a quick-and-dirty hack to support at least the characters of the first page (Latin-1).
It will be enhanced for the 2nd and 3rd Unicode pages when I find time.

[caveat:]
    only a small subset of multiple compositions is supported so far (for example: trema plus acute)


[instance variables:]

[class variables:]
    ComposeMap DecomposeMap

copyright

COPYRIGHT (c) 2015 by eXept Software AG All Rights Reserved This software is furnished under a license and may be used only in accordance with the terms of that license and with the inclusion of the above copyright notice. This software may not be provided or otherwise made available to, or used by, any other person. No title to or ownership of the software is hereby transferred.

Class protocol:

initialization
o  initializeDecomposeMap
the map which decomposes a diacritical character into its two components


Instance protocol:

encoding & decoding
o  compositionOf: baseChar with: diacriticalChar to: outStream
compose two characters into one
a + umlaut-diacritic-mark -> ä.

o  decodeString: aStringOrByteCollection
return a Unicode string decoded from the given UTF-8-MAC encoded string.
UTF-8-MAC is UTF-8 with composed characters decomposed
(i.e. as separate codes, not as single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

Usage example(s):

     (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray
        -> #[97 195 164 111 195 182 117 195 188]

     (ISO10646_to_UTF8 new decodeString:
            (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray)

     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
        -> #[97 97 204 136 111 111 204 136 117 117 204 136]

     (ISO10646_to_UTF8_MAC new decodeString:
            (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray)

o  decompositionOf: codePointIn into: outBlockWithTwoArgs
if required, decompose a character with a diacritical mark into a base character and a combining diacritical mark;
e.g. ä -> a + umlaut-diacritic-mark.
Pass both as arguments to the given block.
For characters without a diacritical mark, pass nil as the diacritical-mark value.
Return true if a decomposition was done.

o  encodeCharacter: aUnicodeCharacter on: aStream
encode aUnicodeCharacter in its UTF-8-MAC representation onto aStream.
UTF-8-MAC is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

o  encodeString: aUnicodeString
return the UTF-8-MAC representation of aUnicodeString.
UTF-8-MAC is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

Usage example(s):

     (self encodeString:'hello') asByteArray                             -> #[104 101 108 108 111]
     (self encodeString:(Character value:16r40) asString) asByteArray    -> #[64]
     (self encodeString:(Character value:16r7F) asString) asByteArray    -> #[127]
     (self encodeString:(Character value:16r80) asString) asByteArray    -> #[194 128]
     (self encodeString:(Character value:16rFF) asString) asByteArray    -> #[195 191]

     (ISO10646_to_UTF8     new encodeString:'aäoöuü') asByteArray   
        -> #[97 195 164 111 195 182 117 195 188]
     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray 
        -> #[97 97 204 136 111 111 204 136 117 117 204 136]  

     ISO10646_to_UTF8_MAC new decodeString:
         (ISO10646_to_UTF8_MAC new encodeString:'Packages aus VSE für Smalltalk_X') asByteArray 

o  encodeString: aUnicodeString on: aStream
encode aUnicodeString in its UTF-8-MAC representation onto aStream.
UTF-8-MAC is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

o  readNextCharacterFrom: aStream
(comment from inherited method)
decode the next character or byte on aStream from utf-8 to unicode

queries
o  nameOfEncoding



ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Wed, 22 Jan 2025 11:48:35 GMT