Class: Character
Object
   |
   +--Magnitude
         |
         +--Character
- Package: stx:libbasic
- Category: Magnitude-General
- Version: rev: 1.139 date: 2009/09/16 19:22:24
- user: cg
- file: Character.st directory: libbasic
- module: stx stc-classLibrary: libbasic
- Author: Claus Gittinger
This class represents characters.
Note that actual character objects are not used when characters
are stored in strings, symbols etc.;
these only store a character's asciiValue/codePoint for a more compact representation.
The word 'asciiValue' is a historic leftover: in fact, any integer
code is allowed and used (i.e. characters are not limited to 8 bits).
Single-byte characters are unique; i.e. for every asciiValue (0..255) there exists exactly
one instance of Character, which is shared
(Character value:xxx checks for this and returns a reference to an existing instance).
Other characters (i.e. asciiValue > 255) are not guaranteed to be shared;
i.e. these might be created as required. Do NOT depend on which characters are and
which are not shared; always compare using #= if there is any chance of a non-ascii character
being involved.
This means: you may compare characters using #== ONLY if you are certain
that the character range is 0..255.
Otherwise, you HAVE TO compare using #= (if in doubt, always compare using #=).
Sorry for this inconvenience, but it is (practically) impossible to keep
the possible maximum of 2^32 characters (Unicode) around, for that convenience alone.
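The sharing rule can be modelled in a few lines. The sketch below is written in Python rather than Smalltalk and is only an illustration: the class and its `value` method are hypothetical stand-ins for `Character value:`, not the ST/X implementation. Python's `is` plays the role of #== and `==` the role of #=.

```python
class Character:
    """Toy model of the sharing rule (not the ST/X implementation)."""

    _shared = {}  # at most one instance per code point in 0..255

    def __init__(self, code_point):
        self.code_point = code_point

    @classmethod
    def value(cls, code_point):
        # Single-byte characters come from the shared table.
        if 0 <= code_point <= 255:
            if code_point not in cls._shared:
                cls._shared[code_point] = cls(code_point)
            return cls._shared[code_point]
        # Wider characters are created on demand in this model,
        # so two requests yield two distinct objects.
        return cls(code_point)

    def __eq__(self, other):
        return isinstance(other, Character) and self.code_point == other.code_point

    def __hash__(self):
        return hash(self.code_point)


a1, a2 = Character.value(97), Character.value(97)
e1, e2 = Character.value(0x20AC), Character.value(0x20AC)

print(a1 is a2)  # True: identity holds in the shared range
print(e1 is e2)  # False: identity fails outside it in this model
print(e1 == e2)  # True: equality works for every code point
```

In the model, an identity comparison silently gives the wrong answer for the Euro sign, which is why #= is the safe default.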
Methods marked as (JS) come from the Manchester Character goody
(CharacterComparing) by Jan Steinman; they allow Characters to be used as
Interval elements (i.e. ($a to:$z) do:[...] ).
They are not a big deal, but convenient add-ons.
Some of these have been modified a bit.
WARNING: characters are known by compiler and runtime system -
do not change the instance layout.
Also, although you can create subclasses of Character, the compiler always
creates instances of Character for literals ...
... and other classes are hard-wired to always return instances of characters
in some cases (i.e. String>>at:, Symbol>>at: etc.).
Therefore, it may not make sense to create a character-subclass.
Case Mapping in Unicode:
There are a number of complications to case mappings that occur once the repertoire
of characters is expanded beyond ASCII.
* Because of the inclusion of certain composite characters for compatibility,
such as U+01F1 'DZ' capital dz, there is a third case, called titlecase,
which is used where the first letter of a word is to be capitalized
(e.g. Titlecase, vs. UPPERCASE, or lowercase).
For example, the title case of the example character is U+01F2 'Dz' capital d with small z.
* Case mappings may produce strings of different length than the original.
For example, the German character U+00DF small letter sharp s expands when uppercased to
the sequence of two characters 'SS'.
This also occurs where there is no precomposed character corresponding to a case mapping.
*** This is not yet implemented (in 5.2) ***
* Characters may also have different case mappings, depending on the context.
For example, U+03A3 capital sigma lowercases to U+03C3 small sigma if it is followed
by another letter, but lowercases to U+03C2 small final sigma if it is not.
*** This is not yet implemented (in 5.2) ***
* Characters may have case mappings that depend on the locale.
For example, in Turkish the letter U+0049 'I' capital letter i lowercases to U+0131 small dotless i.
*** This is not yet implemented (in 5.2) ***
* Case mappings are not, in general, reversible.
For example, once the string 'McGowan' has been uppercased, lowercased or titlecased,
the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
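The complications listed above can be observed in any environment that implements the full Unicode case mappings. The following sketch uses Python's built-in string methods purely as an illustration; it says nothing about what Character or String in ST/X return. The locale-dependent Turkish case is left out because Python's string methods are not locale-aware.

```python
# Titlecase is a third case: U+01F1 (DZ) titlecases to U+01F2 (Dz).
dz_title = "\u01f1".title()

# Length-changing mapping: U+00DF (sharp s) uppercases to "SS".
sharp_s_upper = "\u00df".upper()

# Context-dependent mapping: capital sigma (U+03A3) becomes small sigma
# (U+03C3) inside a word and final sigma (U+03C2) at the end of a word.
sigma_lower = "\u039f\u03a3\u039f\u03a3".lower()

# Case mappings are not reversible: the inner capital G is lost.
round_trip = "McGowan".upper().title()

# ascii() keeps the output readable on terminals without Unicode fonts.
for result in (dz_title, sharp_s_upper, sigma_lower, round_trip):
    print(ascii(result))
```

Note that two of the four results are strings, not single characters, so a character-level method cannot express them.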
Collation Sequence:
*** This is not yet implemented (in 5.2) ***
String
TwoByteString
Unicode16String
Unicode32String
StringCollection
Text
accessing untypeable characters
-
endOfInput
-
Answer the Character representing ctrl-d (Unix-EOF).
-
leftParenthesis
-
Answer the Character representing a left parenthesis.
-
period
-
Answer the Character representing a period.
-
poundSign
-
Answer the Character representing a pound sign (hash).
-
rightParenthesis
-
Answer the Character representing a right parenthesis.
constants
-
backspace
-
return the backspace character
-
bell
-
return the bell character
-
cr
-
return the lineEnd character
- actually (in unix) this is a newline character
-
del
-
return the delete character
-
doubleQuote
-
return the double-quote character
-
esc
-
return the escape character
-
etx
-
return the end-of-text character
-
euro
-
The Euro currency sign (notice: not all fonts support it).
The Unicode encoding is U+20AC
-
excla
-
return the exclamation-mark character
-
ff
-
return the form-feed character
-
lf
-
return the newline/linefeed character
-
linefeed
-
squeak compatibility: return the newline/linefeed character
-
maxImmediateCodePoint
-
return the maximum codePoint until which the characters are shared
-
maxValue
-
return the maximum codePoint a character may have
-
newPage
-
return the form-feed character
-
nl
-
return the newline character
-
null
-
-
quote
-
return the single-quote character
-
return
-
return the return character.
In ST/X, this is different from cr - for Unix reasons.
-
space
-
return the blank character
-
tab
-
return the tabulator character
instance creation
-
basicNew
-
catch new - Characters cannot be created with new
-
codePoint: anInteger
-
return a character with codePoint anInteger
-
digitValue: anInteger
-
return a character that corresponds to anInteger.
0-9 map to $0-$9, 10-35 map to $A-$Z
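The mapping can be written down as a table lookup. This is a Python sketch of the rule stated above, not the ST/X code; the function name `digit_character` is hypothetical, and raising an error for values outside 0..35 is an assumption, since the description does not say what happens there.

```python
import string

# Digit alphabet taken from the description: 0-9, then A-Z.
DIGITS = string.digits + string.ascii_uppercase


def digit_character(value):
    """Return the digit character for a value in 0..35."""
    if not 0 <= value < len(DIGITS):
        # Assumption: out-of-range behaviour is not documented.
        raise ValueError(f"digit value out of range: {value}")
    return DIGITS[value]


print(digit_character(7))   # 7
print(digit_character(10))  # A
print(digit_character(35))  # Z
```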
-
utf8DecodeFrom: aStream
-
read and return a single unicode character from a UTF-8 encoded stream
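Reading exactly one character from a UTF-8 stream means inspecting the lead byte to learn how many continuation bytes follow. The Python sketch below illustrates that idea only; the function name is hypothetical, and returning None at end of stream is an assumption, not documented ST/X behaviour.

```python
import io


def utf8_decode_from(stream):
    """Read one UTF-8 encoded character from a binary stream."""
    first = stream.read(1)
    if not first:
        return None  # assumption: signal end of stream with None
    lead = first[0]
    if lead < 0x80:
        extra = 0            # 0xxxxxxx: single byte
    elif lead >> 5 == 0b110:
        extra = 1            # 110xxxxx: two-byte sequence
    elif lead >> 4 == 0b1110:
        extra = 2            # 1110xxxx: three-byte sequence
    elif lead >> 3 == 0b11110:
        extra = 3            # 11110xxx: four-byte sequence
    else:
        raise ValueError("invalid UTF-8 lead byte")
    # decode() validates the continuation bytes and rejects truncated input.
    return (first + stream.read(extra)).decode("utf-8")


stream = io.BytesIO("a\u20acz".encode("utf-8"))
chars = [utf8_decode_from(stream) for _ in range(4)]
print([ascii(c) for c in chars])
```

The stream position advances by exactly one character per call, so the function can be called repeatedly on the same stream.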
-
value: anInteger
-
return a character with codePoint anInteger - backward compatibility
primitive input
-
fromUser
-
return a character from the keyboard (C's standard input stream)
- this should only be used for emergency evaluators and the like.
queries
-
allCharacters
-
added for squeak compatibility: return a collection of all chars
-
hasSharedInstances
-
return true if this class has shared instances, that is, instances
with the same value are identical.
Although instances are not always shared (e.g. two-byte code point characters), they should be
treated as if they were, so that code stays independent of the underlying implementation
-
isBuiltInClass
-
return true if this class is known by the run-time-system.
Here, true is returned for myself, false for subclasses.
-
isLegalUnicodeCodePoint: anInteger
-
answer true, if anInteger is a valid unicode code point
-
separators
-
added for squeak compatibility: return a collection of separator chars
Compatibility-Dolphin
-
isAlphaNumeric
-
Compatibility method - do not use in new code.
Return true, if I am a letter or a digit
Please use isLetterOrDigit for compatibility reasons (which is ANSI).
-
isAlphabetic
-
Compatibility method - do not use in new code.
Return true, if I am a letter.
Please use isLetter for compatibility reasons (which is ANSI).
-
isControl
-
Compatibility method - do not use in new code.
Return true if I am a control character (i.e. ascii value < 32)
-
isHexDigit
-
return true if I am a valid hexadecimal digit
-
isPunctuation
-
Compatibility method - do not use in new code.
The code below is not unicode aware
accessing
-
codePoint
-
return the codePoint of myself.
Traditionally, this was named 'asciiValue';
however, characters are not limited to 8bit characters.
-
instVarAt: index put: anObject
-
catch instvar access - asciivalue may not be changed
arithmetic
-
+ aMagnitude
-
Return the Character that is <aMagnitude> higher than the receiver.
Wrap if the resulting value is not a legal Character value. (JS)
-
- aMagnitude
-
Return the Character that is <aMagnitude> lower than the receiver.
Wrap if the resulting value is not a legal Character value. (JS)
claus:
modified to return the difference as integer, if the argument
is another character. If the argument is a number, a character is
returned.
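The two behaviours of subtraction described above can be sketched as follows. This is a Python model with hypothetical function names, not ST/X code, and it deliberately omits the wrap-around for illegal values because the text does not say which range is wrapped.

```python
def char_add(ch, n):
    """Return the character n code points above ch (no wrapping modelled)."""
    return chr(ord(ch) + n)


def char_subtract(ch, other):
    """Character minus character gives an integer;
    character minus number gives a character."""
    if isinstance(other, str):
        return ord(ch) - ord(other)
    return chr(ord(ch) - other)


print(char_add("a", 1))          # b
print(char_subtract("z", "a"))   # 25
print(char_subtract("b", 1))     # a
```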
-
// aMagnitude
-
Return the Character whose value is the receiver divided by <aMagnitude>.
Wrap if the resulting value is not a legal Character value. (JS)
-
\\ aMagnitude
-
Return the Character whose value is the receiver modulo <aMagnitude>.
Wrap if the resulting value is not a legal Character value. (JS)
binary storage
-
hasSpecialBinaryRepresentation
-
return true, if the receiver has a special binary representation
-
identityHashForBinaryStore
-
-
storeBinaryOn: stream manager: manager
-
store a binary representation of the receiver on stream;
redefined, since single-byte characters are stored more compact
with a special type-code followed by the asciiValue.
comparing
-
< aMagnitude
-
return true, if the argument's asciiValue is greater than the receiver's
-
<= aMagnitude
-
return true, if the argument's asciiValue is greater or equal to the receiver's
-
= aCharacter
-
return true, if the argument, aCharacter is the same character
Redefined to take care of character sizes > 8bit.
-
> aMagnitude
-
return true, if the argument's asciiValue is less than the receiver's
-
>= aMagnitude
-
return true, if the argument's asciiValue is less or equal to the receiver's
-
hash
-
return an integer useful for hashing
-
identityHash
-
return an integer useful for hashing on identity
-
sameAs: aCharacter
-
return true, if the argument, aCharacter is the same character,
ignoring case differences.
-
~= aCharacter
-
return true, if the argument, aCharacter is not the same character
Redefined to take care of character sizes > 8bit.
converting
-
asCharacter
-
usually sent to integers, but redefined here to allow integers
and characters to be used commonly without a need for a test.
-
asInteger
-
the same as #codePoint.
Use #asInteger, if you need protocol compatibility with Numbers etc..
Use #codePoint in any other case for better stc optimization
-
asLowercase
-
return a character with same letter as the receiver, but in lowercase.
Returns the receiver if it is already lowercase or if there is no lowercase equivalent.
CAVEAT:
for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)
-
asString
-
return a string of len 1 with myself as contents
-
asSymbol
-
Return a unique symbol with the name taken from the receiver's characters.
Here, a single character symbol is returned.
-
asTitlecase
-
return a character with same letter as the receiver, but in titlecase.
Returns the receiver if it is already titlecase or if there is no titlecase equivalent.
-
asUnicodeString
-
return a unicode string of len 1 with myself as contents.
This will vanish, as we now (rel5.2.x) use Unicode as default.
-
asUppercase
-
return a character with same letter as the receiver, but in uppercase.
Returns the receiver if it is already uppercase or if there is no uppercase equivalent.
CAVEAT:
for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)
-
digitValue
-
return my digitValue for any base (up to 37)
-
digitValueRadix: base
-
return my digitValue for base.
Return nil, if it is not a valid character for that base
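A radix-aware digit lookup can be sketched like this. The code is a Python illustration, not the ST/X method: the function name is hypothetical, None stands in for nil, and accepting lowercase letters as digits is an assumption.

```python
import string

# Digit alphabet: 0-9, then A-Z (values 0..35).
DIGITS = string.digits + string.ascii_uppercase


def digit_value_radix(ch, base):
    """Return the value of ch as a digit in base, or None if it is not one."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    # Assumption: lowercase letters count as digits too. Only ASCII is
    # considered, so letters such as dotless i do not map to a digit.
    value = DIGITS.find(ch.upper()) if ch.isascii() else -1
    if value < 0 or value >= base:
        return None
    return value


print(digit_value_radix("7", 10))  # 7
print(digit_value_radix("f", 16))  # 15
print(digit_value_radix("g", 16))  # None
print(digit_value_radix("9", 8))   # None
```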
-
literalArrayEncoding
-
encode myself as an array literal, from which a copy of the receiver
can be reconstructed with #decodeAsLiteralArray.
-
to: aMagnitude
-
Return an Interval over the characters from the receiver to <aMagnitude>.
Wrap <aMagnitude> if it is not a legal Character value. (JS)
-
utf8Encoded
-
convert a character to its UTF-8 encoding.
this returns a String
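For reference, the UTF-8 encoding of a single code point packs its bits into one to four bytes. The sketch below is Python, with a hypothetical function name, and illustrates the standard encoding scheme only, not the ST/X code. It returns a bytes object, whereas the method above returns a String.

```python
def utf8_encode(code_point):
    """Encode one code point (0..0x10FFFF) as UTF-8 bytes."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The Euro sign U+20AC needs three bytes.
print(utf8_encode(0x20AC).hex())  # e282ac
```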
copying
-
copy
-
return a copy of myself
reimplemented since characters are unique
-
deepCopyUsing: aDictionary postCopySelector: postCopySelector
-
return a deep copy of myself
reimplemented since characters are immutable
-
shallowCopy
-
return a shallow copy of myself
reimplemented since characters are immutable
-
simpleDeepCopy
-
return a deep copy of myself
reimplemented since characters are immutable
encoding
-
rot13
-
[Usenet: from `rotate alphabet 13 places']
The simple Caesar-cypher encryption that replaces each English
letter with the one 13 places forward or back along the alphabet,
so that 'The butler did it!' becomes 'Gur ohgyre qvq vg!'
Most Usenet news reading and posting programs include a rot13 feature.
It is used to enclose the text in a sealed wrapper that the reader must choose
to open -- e.g., for posting things that might offend some readers, or spoilers.
A major advantage of rot13 over rot(N) for other N is that it
is self-inverse, so the same code can be used for encoding and decoding.
-
rot: n
-
[Usenet: from `rotate alphabet N places']
The simple Caesar-cypher encryption that replaces each English
letter with the one N places forward or back along the alphabet,
so that 'The butler did it!' becomes 'Gur ohgyre qvq vg!' by rot:13
Most Usenet news reading and posting programs include a rot13 feature.
It is used to enclose the text in a sealed wrapper that the reader must choose
to open -- e.g., for posting things that might offend some readers, or spoilers.
A major advantage of rot13 over rot(N) for other N is that it
is self-inverse, so the same code can be used for encoding and decoding.
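The rotation and its self-inverse property for N = 13 can be checked with a short sketch. It is written in Python with hypothetical helper names and illustrates the cipher as described, not the ST/X code; only the 26 English letters are rotated and everything else is passed through unchanged.

```python
def rot(ch, n):
    """Rotate a single English letter n places; leave other characters alone."""
    if "a" <= ch <= "z":
        base = ord("a")
    elif "A" <= ch <= "Z":
        base = ord("A")
    else:
        return ch
    return chr(base + (ord(ch) - base + n) % 26)


def rot_string(text, n):
    """Apply rot to every character of text."""
    return "".join(rot(c, n) for c in text)


secret = rot_string("The butler did it!", 13)
print(secret)                  # Gur ohgyre qvq vg!
print(rot_string(secret, 13))  # The butler did it!
```

For any other N, decoding needs the complementary rotation 26 - N, which is the convenience that rot13 avoids.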
inspecting
-
inspectorExtraAttributes
-
extra (pseudo instvar) entries to be shown in an inspector.
obsolete
-
asciiValue
-
return the asciivalue of myself.
The name 'asciiValue' is a historic leftover:
characters are not limited to 8bit characters.
So the actual value returned is a codePoint (i.e. full potential for 31bit encoding).
PP removed this method in 4.1 and provides asInteger instead.
ANSI defines #codePoint, please use this method
** This is an obsolete interface - do not use it (it may vanish in future versions) **
printing & storing
-
displayString
-
return a string used when the receiver is to be displayed
in an inspector kind-of-thing
-
isLiteral
-
return true, if the receiver can be used as a literal constant in ST syntax
(i.e. can be used in constant arrays)
-
print
-
print myself on stdout.
This method does NOT (on purpose) use the stream classes and
will therefore work even in case of emergency (but only if Stdout is nil).
-
printOn: aStream
-
print myself on aStream
-
printString
-
return a string to print me
-
storeOn: aStream
-
store myself on aStream
private-accessing
-
setCodePoint: anInteger
-
very private - set the codePoint.
- use this only for newly created characters with codes > MAX_IMMEDIATE_CHARACTER.
DANGER alert:
funny things happen, if this is applied to
one of the shared characters with codePoints 0..MAX_IMMEDIATE_CHARACTER.
queries
-
bitsPerCharacter
-
return the number of bits I require for storage
-
stringSpecies
-
return the type of string that is needed to store me
testing
-
isCharacter
-
return true, if the receiver is some kind of character
-
isControlCharacter
-
return true if I am a control character (i.e. ascii value < 32 or == 16rFF)
-
isDigit
-
return true, if I am a digit (i.e. $0 .. $9)
-
isDigitRadix: r
-
return true, if I am a digit of a base r number
-
isEndOfLineCharacter
-
return true if I am a line delimiting character
-
isImmediate
-
return true if I am an immediate object
i.e. I am represented in the pointer itself and
no real object header/storage is used for me.
For VW compatibility, shared characters (i.e. in the range 0..MAX_IMMEDIATE_CHARACTER)
also return true here
-
isLetter
-
return true, if I am a letter in the 'a'..'z' range.
Use isNationalLetter, if you are interested in national letters.
-
isLetterOrDigit
-
return true, if I am a letter (a..z or A..Z) or a digit (0..9)
Use isNationalAlphaNumeric, if you are interested in national letters and digits.
-
isLowercase
-
return true, if I am a lower-case letter.
This one does care for national characters.
Caveat:
only returns the correct value for codes up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)
-
isPrintable
-
return true, if the receiver is a useful printable character
(see fileBrowsers showFile:-method on how it can be used)
-
isSeparator
-
return true if I am a space, cr, tab, nl, or newPage
-
isUppercase
-
return true, if I am an upper-case letter.
This one does care for national characters.
Caveat:
only returns the correct value for codes up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)
-
isVowel
-
return true, if I am a vowel (lower- or uppercase)
testing - national
-
isNationalAlphaNumeric
-
return true, if the receiver is a letter or digit in the
current language (Language variable)
-
isNationalDigit
-
return true, if the receiver is a digit.
This assumes unicode encoding.
WARNING: this method is not complete.
-
isNationalLetter
-
return true, if the receiver is a letter.
CAVEAT:
for now, this method is only correct for unicode characters up to u+1d6ff (Unicode3.1).
(which is more than mozilla does, btw. ;-)
tracing
-
traceInto: aRequestor level: level from: referrer
-
double dispatch into tracer, passing my type implicitly in the selector
visiting
-
acceptVisitor: aVisitor with: aParameter
-