Smalltalk/X Webserver

Documentation of class 'CharacterEncoderImplementations::ISO10646_to_UTF8_MAC':

Class: ISO10646_to_UTF8_MAC (in CharacterEncoderImplementations)

Inheritance
Description
Class protocol
- initialization
Instance protocol
- encoding & decoding
- queries

Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder
         |
         +--CharacterEncoderImplementations::ISO10646_to_UTF8
            |
            +--CharacterEncoderImplementations::ISO10646_to_UTF8_MAC

Package: stx:libbasic

Category: Collections-Text-Encodings

Version: rev: 1.14 date: 2023/05/24 08:33:23; user: stefan; file: CharacterEncoderImplementations__ISO10646_to_UTF8_MAC.st directory: libbasic; module: stx stc-classLibrary: libbasic

Description:

Unicode can represent some characters with diacritics (e.g. umlauts) in more than one way:
    - either as a single precomposed code point (e.g. ä -> U+00E4 (&#228;) -> UTF-8 bytes C3 A4)
    - or in the so-called 'Normalization Form Canonical Decomposition' (NFD), i.e. as a regular 'a' followed by a
      combining diacritical mark (for example: a diaeresis or an acute accent).

Mac OS X needs the second form for its file names.
However, OSX does not decompose the ranges U+2000-U+2FFF, U+F900-U+FAFF and U+2F800-U+2FAFF.
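
To make the byte values above concrete, here is a minimal sketch of the UTF-8 arithmetic for code points in the two-byte range U+0080..U+07FF. The block utf8Pair is a hypothetical helper written only for this illustration; it is not part of this class and says nothing about how the class is implemented.

```smalltalk
"Sketch: UTF-8 bytes of a code point in the two-byte range U+0080..U+07FF.
 utf8Pair is a hypothetical helper block, not part of the class."
| utf8Pair |
utf8Pair := [:codePoint |
    Array
        with: ((codePoint bitShift: -6) bitOr: 16rC0)     "lead byte 110xxxxx"
        with: ((codePoint bitAnd: 16r3F) bitOr: 16r80)].  "continuation byte 10xxxxxx"

"Precomposed form: U+00E4 (ä) gives the bytes 195 164 (C3 A4)"
Transcript showCR: (utf8Pair value: 16rE4) printString.

"Decomposed form: 'a' is the single byte 97; the combining diaeresis
 U+0308 that follows it gives the bytes 204 136 (CC 88)"
Transcript showCR: (utf8Pair value: 16r308) printString.
```

These are the same values that appear in the usage examples of this class: 195 164 for the precomposed ä, and 97 204 136 for the decomposed form.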

This is a quick-and-dirty hack to support at least the characters of the first page (Latin-1).
It will be enhanced for the 2nd and 3rd Unicode pages when I find time.

[caveat:]
    only a small subset of multi-composes is supported so far (for example: trema plus acute)


[instance variables:]

[class variables:]
    ComposeMap DecomposeMap

copyright

COPYRIGHT (c) 2015 by eXept Software AG
             All Rights Reserved

This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice.   This software may not
be provided or otherwise made available to, or used by, any
other person.  No title to or ownership of the software is
hereby transferred.

Class protocol:

initialization

initializeDecomposeMap: the map which decomposes a diacritical character into its two components

Instance protocol:

encoding & decoding

compositionOf: baseChar with: diacriticalChar to: outStream

compose two characters into one
a + umlaut-diacritic-mark -> ä.

decodeString: aStringOrByteCollection

return a Unicode string from the passed-in UTF-8-MAC encoded string.
This is UTF-8 with composed characters decomposed
(i.e. as separate codes, not as single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

Usage example(s):

     (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray   
        -> #[97 195 164 111 195 182 117 195 188]

     (ISO10646_to_UTF8 new decodeString:
            (ISO10646_to_UTF8 new encodeString:'aäoöuü') asByteArray)    

     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray
        -> #[97 97 204 136 111 111 204 136 117 117 204 136]

     (ISO10646_to_UTF8_MAC new decodeString:
            (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray)
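
The round trip in the last example can also be checked step by step. The following is a sketch only: it uses the selectors documented on this page plus the fully qualified class name from the header, and it assumes the class is loaded and that the expressions are evaluated in a workspace.

```smalltalk
"Sketch: encode to UTF-8-MAC, decode again, and inspect both results."
| encoder original encoded decoded |
encoder := CharacterEncoderImplementations::ISO10646_to_UTF8_MAC new.
original := 'aäoöuü'.

"Decomposed bytes, as listed in the usage examples above"
encoded := (encoder encodeString: original) asByteArray.
Transcript showCR: encoded printString.

"Decode the bytes again and show the size of the resulting string"
decoded := encoder decodeString: encoded.
Transcript showCR: decoded size printString.
```

If decoding recomposes each base character and its combining mark, the decoded string should have the same size as the original (6 characters), while the encoded form has 12 bytes.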

decompositionOf: codePointIn into: outBlockWithTwoArgs

if required, decompose a diacritical character into a base character and a combining diacritical mark;
e.g. ä -> a + umlaut-diacritic-mark.
Pass both as args to the given block.
For non-diacritical chars, pass nil as the diacritical-mark value.
Return true, if a decomposition was done.
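
A possible call is sketched below. It assumes that codePointIn is an Integer code point (as the argument name suggests) and makes no assumption about whether the block receives characters or code points; it simply prints whatever is passed. Treat it as an illustration of the calling convention described above, not as verified output.

```smalltalk
"Sketch only: assumes codePointIn is an Integer code point."
| encoder wasDecomposed |
encoder := CharacterEncoderImplementations::ISO10646_to_UTF8_MAC new.
wasDecomposed := encoder
    decompositionOf: 16rE4      "U+00E4, ä"
    into: [:base :mark |
        "mark is nil for characters without a decomposition"
        Transcript showCR: 'base: ' , base printString , ' mark: ' , mark printString].
Transcript showCR: 'decomposed: ' , wasDecomposed printString.
```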

encodeCharacter: aUnicodeCharacter on: aStream

write the UTF-8-MAC representation of aUnicodeCharacter to aStream.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

encodeString: aUnicodeString

return the UTF-8-MAC representation of aUnicodeString.
This is UTF-8 with composed characters decomposed (i.e. as separate codes, not as
single combined characters).

For now, this is a limited version, which should work
at least for most European countries...

Usage example(s):

     (self encodeString:'hello') asByteArray                             #[104 101 108 108 111]
     (self encodeString:(Character value:16r40) asString) asByteArray    #[64]
     (self encodeString:(Character value:16r7F) asString) asByteArray    #[127]
     (self encodeString:(Character value:16r80) asString) asByteArray    #[194 128]
     (self encodeString:(Character value:16rFF) asString) asByteArray    #[195 191]

     (ISO10646_to_UTF8     new encodeString:'aäoöuü') asByteArray   
        -> #[97 195 164 111 195 182 117 195 188]
     (ISO10646_to_UTF8_MAC new encodeString:'aäoöuü') asByteArray 
        -> #[97 97 204 136 111 111 204 136 117 117 204 136]  

     ISO10646_to_UTF8_MAC new decodeString:
         (ISO10646_to_UTF8_MAC new encodeString:'Packages aus VSE für Smalltalk_X') asByteArray

encodeString: aUnicodeString on: aStream

readNextCharacterFrom: aStream

(comment from inherited method)
decode the next character or byte on aStream from UTF-8 to Unicode

queries

nameOfEncoding
