Documentation of class 'CharacterEncoderImplementations::ISO10646_to_UTF8':


Class: ISO10646_to_UTF8 (in CharacterEncoderImplementations)


Inheritance:

   Object
   |
   +--CharacterEncoder
      |
      +--CharacterEncoderImplementations::VariableBytesEncoder
         |
         +--CharacterEncoderImplementations::ISO10646_to_UTF8
            |
            +--CharacterEncoderImplementations::ISO10646_to_UTF8_MAC
            |
            +--CharacterEncoderImplementations::ISO10646_to_XMLUTF8

Package:
stx:libbasic
Category:
Collections-Text-Encodings
Version:
rev: 1.36 date: 2024/01/29 16:07:57
user: stefan
file: CharacterEncoderImplementations__ISO10646_to_UTF8.st directory: libbasic
module: stx stc-classLibrary: libbasic

Description:


I can encode Unicode characters into UTF-8 and
decode UTF-8 byte sequences into Unicode.

Notice the naming (many are confused):
    Unicode is the set of number-to-character assignments (code points),
whereas:
    UTF-8 is a concrete way of transmitting Unicode code points (numbers).
UTF-16 is another concrete encoding, for example.

ST/X NEVER uses UTF-8 internally; all characters are full 24-bit characters.
Only when exchanging data are these converted into UTF-8 (or other) byte sequences.
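
To make the distinction between code points and their byte encoding concrete, the following sketch asks the encoder how many bytes a few code points need in UTF-8. It uses only `new` and `characterSize:` from the protocol listed on this page. The fully qualified class name, the sample code points and the Transcript output are assumptions made for illustration.

```smalltalk
"A minimal sketch: one code point may need one or several bytes in UTF-8."
|encoder|

encoder := CharacterEncoderImplementations::ISO10646_to_UTF8 new.

"Sample code points (chosen for illustration):
 16r41 = A, 16rE9 = e-acute, 16r20AC = euro sign, 16r1F600 = an emoji"
#( 16r41 16rE9 16r20AC 16r1F600 ) do:[:codePoint |
    Transcript showCR:
        ((codePoint printString)
            , ' needs '
            , ((encoder characterSize:codePoint) printString)
            , ' byte(s)')
].
```

The code point stays the same number inside ST/X; only the byte count of its external UTF-8 form varies.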

copyright

COPYRIGHT (c) 2004 by eXept Software AG All Rights Reserved This software is furnished under a license and may be used only in accordance with the terms of that license and with the inclusion of the above copyright notice. This software may not be provided or otherwise made available to, or used by, any other person. No title to or ownership of the software is hereby transferred.

Class protocol:

instance creation
o  flushSingleton
flushes the cached singleton

Usage example(s):

     self flushSingleton

o  new
returns a singleton

o  theOneAndOnlyInstance
returns a singleton

queries
o  bomBytes
(comment from inherited method)
return the BOM (byte order mark) bytes or nil.
Only applicable for UTF encoders.

o  bytesToReadFor: firstByte


Instance protocol:

encoding & decoding
o  decodeString: aStringOrByteCollection
given a string in UTF-8 encoding,
return a new string containing the same characters, decoded to Unicode.
Returns either a normal String, a Unicode16String or a Unicode32String instance.
This is only useful when reading from external sources or communicating with
other systems
(ST/X never uses UTF-8 internally, but always uses strings of fully decoded Unicode characters).
This only handles characters of up to 30 bits.

o  encodeString: aUnicodeString
return the UTF-8 representation of a Unicode string.
The resulting string is only useful for storage in an external file,
not for use inside ST/X.
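
A round trip through `encodeString:` and `decodeString:` might look like the following sketch. It relies only on the selectors documented above. The fully qualified class name, the sample text and the Transcript output are assumptions made for illustration, and the comparison at the end assumes that the decoded string compares equal to the original.

```smalltalk
"A minimal round-trip sketch using the instance-side protocol."
|encoder original encoded decoded|

encoder := CharacterEncoderImplementations::ISO10646_to_UTF8 new.
original := 'Grüße'.        "sample text, chosen for illustration"

encoded := encoder encodeString:original.   "UTF-8 bytes, held in a string"
decoded := encoder decodeString:encoded.    "back to decoded Unicode characters"

"Non-ASCII characters take more than one byte, so the encoded form is longer"
Transcript showCR:('characters:       ' , (original size printString)).
Transcript showCR:('encoded size:     ' , (encoded size printString)).
Transcript showCR:('round trip equal: ' , ((decoded = original) printString)).
```

Only `decoded` is meant to be used as text inside ST/X; `encoded` is the form to write to a file or socket.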

queries
o  characterSize: charOrCodePoint
return the number of bytes required to encode charOrCodePoint in UTF-8

o  nameOfEncoding

stream support
o  encodeCharacter: aUnicodeCharacter on: aStream
given a character in Unicode, encode it onto aStream.

o  encodeString: aUnicodeString on: aStream
given a string in Unicode, encode it onto aStream.

o  readNext: charactersToReadArg charactersFrom: aStream
decode the next charactersToReadArg characters on aStream from UTF-8 to Unicode

o  readNextCharacterFrom: aStream
decode the next character or byte on aStream from UTF-8 to Unicode
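
The stream protocol could be used along these lines. This is a sketch under stated assumptions: it assumes that the encoder accepts String-based write and read streams, and the fully qualified class name, the sample text and the Transcript output are chosen for illustration only.

```smalltalk
"A minimal sketch of the stream protocol; assumes String-based streams are accepted."
|encoder out in char|

encoder := CharacterEncoderImplementations::ISO10646_to_UTF8 new.

"Encode a Unicode string onto a write stream"
out := WriteStream on:(String new).
encoder encodeString:'aä€' on:out.

"Decode one character at a time from a read stream over the encoded data"
in := ReadStream on:(out contents).
[in atEnd] whileFalse:[
    char := encoder readNextCharacterFrom:in.
    Transcript showCR:(char codePoint printString).
].
```

Checking `atEnd` before each read avoids depending on what `readNextCharacterFrom:` answers at the end of the stream, which this page does not specify.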

testing
o  isUtfEncoder
answer true if this encodes from/to any UTF (regardless of how many bytes and byte order).
In other words: does it make sense to prepend a BOM?


Examples:


Encoding (Unicode to UTF-8):
   ISO10646_to_UTF8 encodeString:'hello'.


Decoding (UTF-8 to Unicode):
   |t|

   t := ISO10646_to_UTF8 encodeString:'Helloœ'.
   ISO10646_to_UTF8 decodeString:t.

