Documentation of class 'Character':

Class: Character


Inheritance:

   Object
   |
   +--Magnitude
      |
      +--Character

Package:
stx:libbasic
Category:
Magnitude-General
Version:
rev: 1.215 date: 2019/08/09 08:52:31
user: cg
file: Character.st directory: libbasic
module: stx stc-classLibrary: libbasic
Author:
Claus Gittinger

Description:


This class represents characters.

Note that actual character objects are not used when characters
are stored in strings, symbols etc.
These only store a character's asciiValue/codePoint for a more compact representation.
The word 'asciiValue' is a historic leftover: any integer
code is allowed and used (i.e. characters are not limited to 8 bits).
Also, the encoding is Unicode, of which ASCII is a subset; the first 128 characters
(codePoints 0 to 127) have the same encoding values as in ASCII.

Some heavily used Characters are kept as singletons; i.e. for every asciiValue (0..N),
there exists exactly one instance of Character, which is shared.
Character value:xxx checks for this, and returns a reference to an existing instance.
For N<=255, this is guaranteed; i.e. in all Smalltalks, the single byte characters are always
handled like this, and you can therefore safely compare them using == (identity compare).

Other characters (i.e. codepoint > N) are not guaranteed to be shared;
i.e. these may or may not be created as required.
Actually, do NOT depend on which characters are and which are not shared.
Always compare using #= if there is any chance of a non-ascii character being involved.

Once again (because beginners sometimes make this mistake):
    This means: you may compare characters using #== ONLY IF you are certain
    that the characters' range is 0..255.
    Otherwise, you HAVE TO compare using #= (if in doubt, always compare using #=).
    Sorry for this inconvenience, but it is (practically) impossible to keep
    the possible maximum of 2^32 characters (Unicode) around, for that convenience alone.

In ST/X, N is (currently) 1024. This means that all the Latin characters and some others are
kept as singletons in the CharacterTable class variable (which is also used by the VM when characters
are instantiated).
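
The comparison rule above can be sketched as follows. This is only an illustration: it assumes that code point 16r20AC (the Euro sign) lies above the shared range 0..N, and the result of the last line is deliberately left unspecified.

```smalltalk
| small a b |

small := Character value:97.
"assumption: 16r20AC lies above the shared range 0..N"
a := Character value:16r20AC.
b := Character value:16r20AC.

Transcript showCR:(small == $a) printString.   "true: single-byte characters are shared"
Transcript showCR:(a = b) printString.         "true: same codePoint"
Transcript showCR:(a == b) printString.        "unspecified: do not rely on the result"
```

Code that may see characters outside 0..255 should therefore use #= throughout.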

Methods marked as (JS) come from the Manchester Character goody
(CharacterComparing) by Jan Steinman, which allow Characters to be used as
Interval elements (i.e. ($a to:$z) do:[...] );
They are not a big deal, but convenient add-ons.
Some of these have been modified a bit.
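
As a small sketch of the Interval usage mentioned above, the following collects the lowercase ASCII letters into a string. It assumes that the interval enumerates Character objects, as the description suggests; the variable name is illustrative only.

```smalltalk
| letters |

letters := WriteStream on:String new.
"enumerate $a .. $z and append each element to the stream"
($a to:$z) do:[:each | letters nextPut:each].
Transcript showCR:letters contents.
```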

WARNING: characters are known by compiler and runtime system -
         do not change the instance layout.

Also, although you can create subclasses of Character, the compiler always
creates instances of Character for literals ...
... and other classes are hard-wired to always return instances of characters
in some cases (i.e. String>>at:, Symbol>>at: etc.).
Therefore, it may not make sense to create a character-subclass.

Case Mapping in Unicode:
    There are a number of complications to case mappings that occur once the repertoire
    of characters is expanded beyond ASCII.

    * Because of the inclusion of certain composite characters for compatibility,
      such as U+01F1 'DZ' capital dz, there is a third case, called titlecase,
      which is used where the first letter of a word is to be capitalized
      (e.g. Titlecase, vs. UPPERCASE, or lowercase).
      For example, the title case of the example character is U+01F2 'Dz' capital d with small z.

    * Case mappings may produce strings of different length than the original.
      For example, the German character U+00DF small letter sharp s expands when uppercased to
      the sequence of two characters 'SS'.
      This also occurs where there is no precomposed character corresponding to a case mapping.
      *** This is not yet implemented (in 5.2) ***

    * Characters may also have different case mappings, depending on the context.
      For example, U+03A3 capital sigma lowercases to U+03C3 small sigma if it is followed
      by another letter, but lowercases to U+03C2 small final sigma if it is not.
      *** This is not yet implemented (in 5.2) ***

    * Characters may have case mappings that depend on the locale.
      For example, in Turkish the letter U+0049 'I' capital letter i lowercases to U+0131 small dotless i.
      *** This is not yet implemented (in 5.2) ***

    * Case mappings are not, in general, reversible.
      For example, once the string 'McGowan' has been uppercased, lowercased or titlecased,
      the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
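
Two of the points above (titlecase and non-reversibility) can be tried out with a short sketch. It assumes that String responds to asUppercase and asLowercase in the usual way; the titlecase result is the one expected from the description, not a verified output.

```smalltalk
| dz name |

dz := Character value:16r01F1.     "U+01F1, the composite capital DZ"
"expected to print 498 (16r01F2), the titlecase form described above"
Transcript showCR:dz asTitlecase codePoint printString.

name := 'McGowan'.
Transcript showCR:name asUppercase.
Transcript showCR:name asUppercase asLowercase.
"false: the original mixed casing cannot be recovered"
Transcript showCR:(name asUppercase asLowercase = name) printString.
```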

Collation Sequence:
    *** This is not yet implemented (in 5.2) ***


Related information:

    String
    TwoByteString
    Unicode16String
    Unicode32String
    StringCollection
    Text

Class protocol:

accessing untypeable characters
o  controlCharacter: char
Answer the Character representing ctrl-char.
ctrl-a -> 1; ctrl-@ -> 0

usage example(s):

     self controlCharacter:$@ -> 0
     self controlCharacter:$a -> 1
     self controlCharacter:$d -> 4
     self controlCharacter:$z -> 26
     self controlCharacter:$[ -> 27    
     self controlCharacter:$\ -> 28     
     self controlCharacter:$] -> 29
     self controlCharacter:$_ -> 31     

o  endOfInput
Answer the Character representing ctrl-d (Unix-EOF).

o  leftParenthesis
Answer the Character representing a left parenthesis.

o  period
Answer the Character representing a period character.

o  poundSign
Answer the Character representing a pound sign (hash).

o  rightParenthesis
Answer the Character representing a right parenthesis.

constants
o  backspace
return the backspace character

o  bell
return the bell character

o  byteOrderMark
the unicode BOM character as a singleton

usage example(s):

     self byteOrderMark
     self codePoint:16rFEFF

o  cr
return the lineEnd character
- actually (in unix) this is a newline character

o  del
return the delete character

o  doubleQuote
return the double-quote character

o  esc
return the escape character

o  etx
return the end-of-text character

o  euro
The Euro currency sign (notice: not all fonts support it).
The Unicode encoding is U+20AC

usage example(s):

     Transcript font:(Font family:'courier' size:12 encoding:'iso10646-1').
     Transcript showCR:Character euro

o  excla
return the exclamation-mark character

o  ff
return the form-feed character

o  lf
return the newline/linefeed character

o  linefeed
squeak compatibility: return the newline/linefeed character

o  maxImmediateCodePoint
return the maximum codePoint until which the characters are shared

usage example(s):

      self maxImmediateCodePoint

o  maxValue
return the maximum codePoint a character may have

o  newPage
return the form-feed (newPage) character

o  nl
return the newline character

o  null
return the null character;
Notice, that in ST/X strings have an invisible (and w.r.t the string's size uncounted)
terminating NULL character, to make it easier to pass strings to C-functions.
However, this is ONLY true for non-single-byte strings.

o  pageUp
return the pageUp control character

o  quote
return the single-quote character

o  return
return the (carriage) return character.
In ST/X, this is different from cr - for Unix reasons.

o  space
return the blank character

o  tab
return the tabulator character

instance creation
o  basicNew
catch new - Characters cannot be created with new

o  codePoint: anInteger
return a character with codePoint anInteger

usage example(s):

      self codePoint:16r34.
      self codePoint:16r3455.
      (self codePoint:16rFEFF) == (self codePoint:16rFEFF).
      self codePoint:16rFFFFFFFFFFFFFFFFFFF.

o  digitValue: anInteger
return a character that corresponds to anInteger.
0-9 map to $0-$9, 10-35 map to $A-$Z
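
A brief sketch of the mapping, with the expected results taken from the description above rather than from a verified run:

```smalltalk
Transcript showCR:(Character digitValue:7) printString.     "expected $7"
Transcript showCR:(Character digitValue:10) printString.    "expected $A"
Transcript showCR:(Character digitValue:35) printString.    "expected $Z"
"round trip through the instance-side digitValue; expected 15"
Transcript showCR:(Character digitValue:15) digitValue printString.
```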

o  utf8DecodeFrom: aStream
read and return a single unicode character from an UTF8 encoded stream.
Answer nil, if Stream>>#next answers nil.

usage example(s):

      Character utf8DecodeFrom:'a' readStream
      Character utf8DecodeFrom:#[195 188] asString readStream

o  value: anInteger
return a character with codePoint anInteger - backward compatibility

primitive input
o  fromUser
return a character from the keyboard (C's standard input stream)
- this should only be used for emergency evaluators and the like.

queries
o  allCharacters
added for squeak compatibility: return a collection of all singleton chars.
Notice, for memory efficiency reasons, only some of the low-codepoint characters
are actually kept as singletons. Less frequently used character instances are created on the fly,
as wide string elements are accessed (and hopefully garbage collected sooner or later)

usage example(s):

     Character allCharacters

o  hasSharedInstances
return true if this class has shared instances, that is, instances
with the same value are identical.
Although not always shared (TwoByte CodePoint-Characters), these should be treated
as such, so that code is independent of the underlying implementation

o  isBuiltInClass
return true if this class is known by the run-time-system.
Here, true is returned for myself, false for subclasses.

o  isLegalUnicodeCodePoint: anInteger
answer true, if anInteger is a valid unicode code point

o  separators
return a collection of separator chars.
Added for squeak compatibility

usage example(s):

     Character separators


Instance protocol:

Compatibility-Dolphin
o  isAlphaNumeric
Compatibility method for dolphin and VSE - do not use in new code.
Return true, if I am a letter or a digit
Please use isLetterOrDigit for compatibility reasons (which is ANSI).

o  isAlphabetic
Compatibility method - do not use in new code.
Return true, if I am a letter.
Please use isLetter for compatibility reasons (which is ANSI).

o  isControl
Compatibility method - do not use in new code.
Return true if I am a control character (i.e. ascii value < 32)

o  isHexDigit
return true if I am a valid hexadecimal digit

usage example(s):

     $a isHexDigit

o  isPunctuation
Compatibility method - do not use in new code.
The code below is not unicode aware

Compatibility-Squeak
o  asUnicode
( an extension from the stx:libcompat package )
the same as #codePoint

o  charCode
( an extension from the stx:libcompat package )
(self asInteger bitAnd: 16r3FFFFF).

accessing
o  codePoint
return the codePoint of myself.
Traditionally, this was named 'asciiValue';
however, characters are not limited to 8bit characters.
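
For illustration, a minimal sketch of reading code points; the second value is simply 16r20AC written in decimal.

```smalltalk
Transcript showCR:$A codePoint printString.                              "65"
Transcript showCR:(Character codePoint:16r20AC) codePoint printString.   "8364"
"asInteger is documented as being the same as codePoint"
Transcript showCR:($A codePoint = $A asInteger) printString.
```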

o  instVarAt: index put: anObject
catch instvar access - asciivalue may not be changed

arithmetic
o  + aMagnitude
Return the Character that is <aMagnitude> higher than the receiver.
Wrap if the resulting value is not a legal Character value. (JS)

usage example(s):

     $A + 5

o  - aMagnitude
Return the Character that is <aMagnitude> lower than the receiver.
Wrap if the resulting value is not a legal Character value. (JS)
claus:
return the difference as integer, if the argument is another character.
If the argument is a number, a character is returned.

usage example(s):

     $z - $a
     $d - 3

o  // aMagnitude
Return the Character whose value is the receiver divided by <aMagnitude>.
Wrap if the resulting value is not a legal Character value. (JS)

o  \\ aMagnitude
Return the Character whose value is the receiver modulo <aMagnitude>.
Wrap if the resulting value is not a legal Character value. (JS)

comparing
o  < aMagnitude
return true, if the argument's asciiValue is greater than the receiver's

o  <= aMagnitude
return true, if the argument's asciiValue is greater than or equal to the receiver's

o  = aCharacter
return true, if the argument, aCharacter is the same character
Redefined to take care of character sizes > 8bit.

usage example(s):

	$A = (Character value:65)
	$A = (Character codePoint:65)
	$A = ($B-1)
	$A = 65

o  > aMagnitude
return true, if the argument's asciiValue is less than the receiver's

o  >= aMagnitude
return true, if the argument's asciiValue is less than or equal to the receiver's

o  hash
return an integer useful for hashing

o  identityHash
return an integer useful for hashing on identity

usage example(s):

      $a identityHash.
      (Character value:1234) identityHash

o  sameAs: aCharacter
return true, if the argument, aCharacter is the same character,
ignoring case differences.

usage example(s):

      $x sameAs:$X
      (Character value:345) sameAs:(Character value:345)
      $Ж sameAs:$ж     -- u0416 - u0436
      $ж sameAs:$Ж     -- u0436 - u0416  

o  ~= aCharacter
return true, if the argument, aCharacter is not the same character
Redefined to take care of character sizes > 8bit.

converting
o  asCharacter
usually sent to integers, but redefined here to allow integers
and characters to be used commonly without a need for a test.

usage example(s):

     32 asCharacter

o  asInteger
the same as #codePoint.
Use #asInteger, if you need protocol compatibility with Numbers etc..
Use #codePoint in any other case for better stc optimization

o  asLowercase
return a character with same letter as the receiver, but in lowercase.
Returns the receiver if it is already lowercase or if there is no lowercase equivalent.
CAVEAT:
for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)

usage example(s):

     $A asLowercase
     $a asLowercase
     (Character value:16r01F5) asUppercase asLowercase
     (Character value:16r0205) asUppercase asLowercase
     (Character value:16r03B1) asUppercase asLowercase
     (Character value:16r1E00) asLowercase

o  asString
return a string of len 1 with myself as contents

usage example(s):

     (Character value:16rB5) asString
     (Character value:16r1B5) asString

o  asSymbol
Return a unique symbol with the name taken from the receiver's characters.
Here, a single character symbol is returned.

o  asTitlecase
return a character with same letter as the receiver, but in titlecase.
Returns the receiver if it is already titlecase or if there is no titlecase equivalent.

usage example(s):

     $A asTitlecase
     $a asTitlecase
     (Character value:16r01F1) asTitlecase
     (Character value:16r01F2) asTitlecase

o  asUnicodeString
return a unicode string of len 1 with myself as contents.
This will vanish, as we now (rel5.2.x) use Unicode as default.

o  asUppercase
return a character with same letter as the receiver, but in uppercase.
Returns the receiver if it is already uppercase or if there is no uppercase equivalent.
CAVEAT:
for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)

usage example(s):

     $A asUppercase
     $a asUppercase
     (Character value:16r01F5) asUppercase
     (Character value:16r0205) asUppercase
     (Character value:16r03B1) asUppercase

o  digitValue
return my digitValue for any base (up to 37).
Notice: in case of an invalid character,
ST/X is not X3J20 conform:
ST/X raises an error,
X3J20 returns -1

o  digitValueRadix: base
return my digitValue for base.
Return nil, if it is not a valid character for that base

usage example(s):

     self assert:($0 digitValueRadix:10) == 0.
     self assert:($9 digitValueRadix:10) == 9.
     self assert:($a digitValueRadix:10) == nil.
     self assert:($a digitValueRadix:11) == 10.
     self assert:($A digitValueRadix:11) == 10.
     self assert:($a digitValueRadix:16) == 10.
     self assert:($A digitValueRadix:16) == 10.
     self assert:($f digitValueRadix:16) == 15.
     self assert:($F digitValueRadix:16) == 15.
     self assert:($g digitValueRadix:16) == nil.
     self assert:($G digitValueRadix:16) == nil.
     self assert:($g digitValueRadix:17) == 16.
     self assert:($G digitValueRadix:17) == 16.

o  literalArrayEncoding
encode myself as an array literal, from which a copy of the receiver
can be reconstructed with #decodeAsLiteralArray.

o  to: aMagnitude
Return an Interval over the characters from the receiver to <aMagnitude>.
Wrap <aMagnitude> if it is not a legal Character value. (JS)
CG: why wrap - is this a good idea?

o  to: aMagnitude by: inc
Return an Interval over the characters from the receiver to <aMagnitude>.
Wrap <aMagnitude> if it is not a legal Character value. (JS)
CG: why wrap - is this a good idea?

o  utf8Encoded
convert a character to its UTF-8 encoding.
this returns a String

usage example(s):

     $ä utf8Encoded
     $a utf8Encoded

o  withoutDiacritics
return a character with the same letter as the receiver, but without diacritic modifiers
(mapping e.g. Ä to A).
Returns the receiver if it has no diacritics modifiers.

copying
o  , aStringOrCharacter
return a string containing the concatenation of the receiver character
and the argument, a string or character.
Added for symmetry: as we allow string,char, char,string should also be allowed

usage example(s):

     $. , $:
     $. , 'abc' , $.

      Time millisecondsToRun:[ 10000000 timesRepeat:[ $a , $b ]]
      Time millisecondsToRun:[ 10000000 timesRepeat:[ $a , 'b' ]]
      Time millisecondsToRun:[ 10000000 timesRepeat:[ 'a' , 'b' ]]
      Time millisecondsToRun:[ 10000000 timesRepeat:[ 'a' , $b ]]

o  copy
return a copy of myself
reimplemented since characters are unique

o  deepCopyUsing: aDictionary postCopySelector: postCopySelector
return a deep copy of myself
reimplemented since characters are immutable

o  shallowCopy
return a shallow copy of myself
reimplemented since characters are immutable

o  simpleDeepCopy
return a deep copy of myself
reimplemented since characters are immutable
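
Since the copy methods are reimplemented for unique, immutable characters, one would expect them to answer the receiver itself for shared characters. The following sketch checks this; the results are an expectation based on the comments above, not a documented guarantee.

```smalltalk
"each line is expected to print true for a shared character such as $a"
Transcript showCR:($a copy == $a) printString.
Transcript showCR:($a shallowCopy == $a) printString.
Transcript showCR:($a simpleDeepCopy == $a) printString.
```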

dependents access
o  addDependent: someOne
It doesn't make sense to add dependents to a shared instance.
Silently ignore ...

o  onChangeSend: selector to: someOne
It doesn't make sense to add dependents to a shared instance.
Silently ignore ...

encoding
o  rot13
[Usenet: from `rotate alphabet 13 places']
The simple Caesar-cypher encryption that replaces each English
letter with the one 13 places forward or back along the alphabet,
so that 'The butler did it!' becomes 'Gur ohgyre qvq vg!'
Most Usenet news reading and posting programs include a rot13 feature.
It is used to enclose the text in a sealed wrapper that the reader must choose
to open -- e.g., for posting things that might offend some readers, or spoilers.
A major advantage of rot13 over rot(N) for other N is that it
is self-inverse, so the same code can be used for encoding and decoding.

usage example(s):

     $h rot13
     $h rot13 rot13
     'The butler did it!' rot13             -> 'Gur ohgyre qvq vg!'
     'The butler did it!' rot13 rot13       -> 'The butler did it!'

o  rot: n
[Usenet: from `rotate alphabet N places']
The simple Caesar-cypher encryption that replaces each English
letter with the one N places forward or back along the alphabet,
so that 'The butler did it!' becomes 'Gur ohgyre qvq vg!' by rot:13
Most Usenet news reading and posting programs include a rot13 feature.
It is used to enclose the text in a sealed wrapper that the reader must choose
to open -- e.g., for posting things that might offend some readers, or spoilers.
A major advantage of rot13 over rot(N) for other N is that it
is self-inverse, so the same code can be used for encoding and decoding.

usage example(s):

     'The butler did it!' rot:13                -> 'Gur ohgyre qvq vg!'
     ('The butler did it!' rot:13) rot:13       -> 'The butler did it!'

inspecting
o  inspectorExtraAttributes
( an extension from the stx:libtool package )
extra (pseudo instvar) entries to be shown in an inspector.

o  inspectorValueListIconFor: anInspector
( an extension from the stx:libtool package )
returns the icon to be shown alongside the value list of an inspector

o  inspectorValueStringInListFor: anInspector
( an extension from the stx:libtool package )
returns a string to be shown in the inspector's list

obsolete
o  asciiValue
return the asciivalue of myself.
The name 'asciiValue' is a historic leftover:
characters are not limited to 8bit characters.
So the actual value returned is a codePoint (i.e. full potential for 31bit encoding).
PP has removed this method with 4.1 and provides asInteger instead.
ANSI defines #codePoint, please use this method

** This is an obsolete interface - do not use it (it may vanish in future versions) **

printing & storing
o  displayOn: aGCOrStream
Compatibility
append a printed description on some stream (Dolphin, Squeak)
OR:
display the receiver in a graphicsContext at 0@0 (ST80).
This method allows for any object to be displayed in some view
(although the fallBack is to display its printString ...)

o  isLiteral
return true, if the receiver can be used as a literal constant in ST syntax
(i.e. can be used in constant arrays)

o  print
print myself on stdout.
If Stdout is nil, this method does NOT (on purpose) use the stream classes and
will therefore work even in case of emergency or very early startup (but only, if Stdout is nil).

o  printOn: aStream
print myself on aStream

o  printString
return a string to print me

o  storeOn: aStream
store myself on aStream

private-accessing
o  setCodePoint: anInteger
very private - set the codePoint.
- use this only for newly created characters with codes > MAX_IMMEDIATE_CHARACTER.
DANGER alert:
funny things happen, if this is applied to
one of the shared characters with codePoints 0..MAX_IMMEDIATE_CHARACTER.

queries
o  bitsPerCharacter
return the number of bits I require for storage.
(i.e. am I an Ascii/ISO8859-1 Character or will I need more
bits for storage).

o  bytesPerCharacter
return the number of bytes I require for storage

o  characterSize
return the number of bits I require for storage.
Protocol compatibility with CharacterArray.

o  isSafeForHTTP
( an extension from the stx:libcompat package )
whether a character is 'safe', or needs to be escaped when used, eg, in a URL

o  stringSpecies
return the type of string that is needed to store me

o  unicodeBlock
return the name of the unicode block in which this character is.
incomplete

usage example(s):

     (Character value:16r200) unicodeBlock

o  utf8BytesPerCharacter
return the number of bytes I require for storage in utf-8 encoding
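
As an illustration, the sketch below queries characters from each UTF-8 length class. The expected values in the comments follow from the UTF-8 encoding rules, not from a verified run of this method.

```smalltalk
Transcript showCR:$a utf8BytesPerCharacter printString.                           "expected 1 (ASCII)"
Transcript showCR:(Character value:16rE4) utf8BytesPerCharacter printString.      "expected 2 (U+00E4)"
Transcript showCR:(Character value:16r20AC) utf8BytesPerCharacter printString.    "expected 3 (U+20AC)"
Transcript showCR:(Character value:16r1F600) utf8BytesPerCharacter printString.   "expected 4 (U+1F600)"
```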

testing
o  isCharacter
return true, if the receiver is some kind of character

o  isControlCharacter
return true if I am a control character (i.e. ascii value < 32 or == 16rFF)

usage example(s):

     (Character value:1) isControlCharacter
     $a isControlCharacter

o  isDigit
return true, if I am a digit (i.e. $0 .. $9)

o  isDigitRadix: r
return true, if I am a digit of a base r number

o  isEndOfLineCharacter
return true if I am a line delimiting character

o  isImmediate
return true if I am an immediate object
i.e. I am represented in the pointer itself and
no real object header/storage is used by me.
For VW compatibility, shared characters (i.e. in the range 0..MAX_IMMEDIATE_CHARACTER)
also return true here

usage example(s):

        $a isImmediate.
        (Character value:255) isImmediate.
        (Character value:256) isImmediate.
        (Character value:1566) isImmediate.

o  isLetter
return true, if I am a letter in the 'a'..'z' range.
Use isNationalLetter, if you are interested in national (non-ASCII) letters.

o  isLetterOrDigit
return true, if I am a letter (a..z or A..Z) or a digit (0..9)
Use isNationalAlphaNumeric, if you are interested in national (non-ASCII) letters and digits.

o  isLetterOrDigitOrUnderline
return true, if I am a letter or a digit or $_

o  isLetterOrUnderline
return true, if I am a letter or $_

o  isLowercase
return true, if I am a lower-case letter.
This one does care for national characters.
Caveat:
only returns the correct value for codes up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)

o  isPrintable
return true, if the receiver is a useful printable character
(see fileBrowser's showFile:-method on how it can be used)

o  isSeparator
return true if I am a space, cr, tab, nl, or newPage

o  isUppercase
return true, if I am an upper-case letter.
This one does care for national characters.
Caveat:
only returns the correct value for codes up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)

o  isVowel
return true, if I am a vowel (lower- or uppercase)
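
A few of the tests above, applied to plain ASCII characters. This is a sketch only; the expected results are read off the method descriptions.

```smalltalk
Transcript showCR:$a isLetter printString.                      "true"
Transcript showCR:$5 isDigit printString.                       "true"
Transcript showCR:($f isDigitRadix:16) printString.             "true: f is a hexadecimal digit"
Transcript showCR:($g isDigitRadix:16) printString.             "false"
Transcript showCR:$_ isLetterOrDigitOrUnderline printString.    "true"
Transcript showCR:Character space isSeparator printString.      "true"
Transcript showCR:$e isVowel printString.                       "true"
Transcript showCR:$x isVowel printString.                       "false"
```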

testing - national
o  asNonDiacritical
return a new character which represents the receiver without diacritics.
This is used with string search and when lists are to be ordered/sorted by base character order.
CAVEAT:
for now, this method is only correct for unicode characters up to u+2FF,
i.e. latin languages

usage example(s):

     $e asNonDiacritical
     $é asNonDiacritical
     $ä asNonDiacritical
     $å asNonDiacritical

o  isGreekLetter
return true, if the receiver is a greek letter (alpha, beta,...).

usage example(s):

     $a isGreekLetter
     $π isGreekLetter  -- pi
     $Ω isGreekLetter  -- omega

o  isNationalAlphaNumeric
return true, if the receiver is a letter or digit.
This assumes unicode encoding.

o  isNationalDigit
return true, if the receiver is a digit.
This assumes unicode encoding.
WARNING: this method is not complete.

o  isNationalLetter
return true, if the receiver is a letter.
CAVEAT:
for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)

tracing
o  traceInto: aRequestor level: level from: referrer
double dispatch into tracer, passing my type implicitly in the selector

visiting
o  acceptVisitor: aVisitor with: aParameter
dispatch for visitor pattern; send #visitCharacter:with: to aVisitor



ST/X 7.2.0.0; WebServer 1.670 at bd0aa1f87cdd.unknown:8081; Thu, 28 Mar 2024 14:30:44 GMT