[up] [next]

Collection classes

Collections are containers for other objects. Smalltalk provides a rich set of collection classes ready to be used.

The protocol offered by these collections is very orthogonal, making good use of polymorphism. For example, most collections respond to messages like #at:, #includes: or #do:, even though the implementations are totally different and the performance may vary greatly between different collections and access methods (*).

To get a rough idea of the common protocol, take a look at the Collection and SequenceableCollection classes. Collection is the superclass of all collections, and SequenceableCollection is inherited by all collections which use a numeric index (among others: Array, String and OrderedCollection).
These implement most of the common operations, so knowledge about those classes' protocol is highly recommended.

The most useful and most used collection classes are:

less frequently used collections:

seldom used or highly specialized collections:

The above lists only a small subset of the collection classes as provided with the system;
Use the SystemBrowser to see what else is available, to have a more detailed look into the implementation and also to see the full set of operations offered by these classes.

Array

Arrays store references to other objects, and use a numeric key (1 .. ) to access these elements. The elements can be of any type. The size of an array is assumed to be constant. (It is possible to change the size, but this is a relatively expensive i.e. slow operation.)
When accessing elements in arrays, the index is always checked for being within the bounds - you will get an indexBounds exception if you try to access an element using an illegal index
(this is not a special behavior of arrays - instead, it is true for any indexed reference in Smalltalk).

The typical uses of arrays are:

The Smalltalk language allows array constants (so called "Array literals"), the syntax is:
    #( element1 element2 ... elementN )
where the individual elements must be literals (numbers, characters, strings or other arrays). Nested literal Arrays can be written as:
    #(
	(element1.1 element1.2 ... element1.M)
	(element2.1 element2.2 ... element2.M)
	...
	(elementN.1 elementN.2 ... elementN.M)
     )
Notice, the above is an Array of 3 Arrays. Not to confuse with an array containing 9 elements (which you get, if the inner parenthesises are ommited).

Smalltalk/X also allows the objects true, false and nil to be treated like constants in literal arrays. This is NOT compatible with some older Smalltalk implementations (but is now part of the Ansi standard), so care should be taken if that code is to be ported to other systems later.
For example,

    #( nil 1 'foo' true #true )
represents a 5-element array literal consisting of nil, the integer 1, the string 'foo', the boolean true and the symbol #true.

More details are found in "Array's class description".

In practice, arrays are very very seldom used except as literal constants "#( ... )".
Beginners, especially with a C/C++ background, often tend to overuse arrays - please look at the description of OrderedCollection, which behaves much like arrays do, but offer functionality for growing and shrinking and are usually a better choice. Also, for searching, counting, mapping etc. please use Set and Dictionary.

String

Strings are special arrays, which are tuned for storage of characters. (actually, they only allow storing characters - an error is reported, if you try to store something else into it).
They provide the same basic protocol as arrays, but offer many more string specific operations. For example, pattern matching, substring searching, padding, upper/lowercase translation and others.

Depending on the number of bits which are required to store a character, strings use a single- or double-byte storage scheme. Unicode is used as internal encoding. However, various converters assist in transcoding to/from different coding schemes used with fonts or in files.

Typical operations on strings are:

Constant strings (string literals) are entered by enclosing the characters in single quotes ('). Single quotes within string constants are to be doubled. String literals can cross line boundaries; in this case, the newline character is part of the string.
Examples:

    'a normal string constant'

    'a string constant with embedded ''quotes'''

    'a string constant
     with mutliple
     lines'
Currently, there exists no standard on how non printable characters are to be represented in strings via escape sequences. For embedded newline characters, Smalltalk-80 provides the #withCRs method, VisualAge calls this #addLineDelimiters.
Both are supported by ST/X and convert embedded '\' (backslash) characters to the newLine character, as in:
    'a string\with newlines\' withCRs
In addition, the (non standard) method #withEscapes converts embedded C-language escape sequences:
    'a string\n\t\twith newline and tabs' withEscapes

More details are found in the class documentation of "String".
and (highly recommended) especially in its superclass "CharacterArray". This is where most of the functionality of String, TwoByteString and Unicode16String is actually implemented.

Symbol

Symbols are much like strings, with the exception that the system keeps track of all existing symbols and will make certain that two symbols are identical, if they contain the same characters.
In other words: if the system is asked to create a symbol from some string, it will search the existing symbols first for an already existing one with the same characters. If its found, that existing object is returned. If not found, a new symbol is created and entered into this table.

Therefore, symbols can be compared using identity compare (==), which is much faster than value compare (=). Identity compare is technically implemented as a pointer compare, while value compare on strings is implementated by comparing each element of the two objects.

Symbols are internally used by the system to assign names to class objects (in the Smalltalk global table), and to assign method names to method objects (in each classes method/selector association). Also, symbols are perfect to use as keys in IdentityDictionaries.

Since symbols are handled just as fast as integers, they are also perfect for flags, states (state machines) or other variables which are represented by enumeration types or define-constants in other programming languages.

Symbols do not allow any change of their character elements; the at:put: method will raise an error. Other than that, all read-accesses are the same as for strings (actually, the Symbol class inherits all those methods from the String class).

Consider symbols as being some kind of readonly string.

Symbols can be created (at runtime) from a string with:

    sym := aString asSymbol
also, symbol constants (i.e. symbol literals) can be written directly by prefixing some identifier with the #-character as in:
    sym := #helloThere.
These symbol literals are created at compile time - NOT at execution time. Therefore, the cost of using them is not more than using integers
Which is why there is no need for enumeration-types in Smalltalk; simply use a symbol, wherever other programming languages would use a numeric define or an enumeration type.

Only alphanumeric identifiers or valid message selectors are allowed in the above example. Examples for valid symbol iterals are:

    #helloThere
    #FooBar
    #+
    #at:
    #at:put:
Use quotes around the characters for other symbol literals:
    sym := #'a symbol with spaces and .... other stuff'
The quotes are not part of the symbols characters; therefore #+ and #'+' represent the same symbol.
Symbol literals can also be present in array literals as in:
    #( #foo #bar #'baz' )
Within such an array literal, the leading #-character can be ommitted (for your convenience). Thus, the above is equivalent to:
    #( foo bar baz )
but, NOT equivalent to:
    #( 'foo' 'bar' 'baz' )
which is an array of strings.

Since true, false and nil are treated as constant objects, you'd have to add a #-prefix if you ever want those symbols in an array literal, as in:

    #( #true #false #nil)
Which - you have guessed it - is not the same as:
    #( true false nil)

More details are found in "Symbol's class description".

Set and IdentitySet

Sets keep references to other objects, but do not allow for duplicates. Thus an individual object will appear at most once in a set.
To check if some object is already contained in the set, the elements are compared using =, which typically compares objects values.
Thus, 1 (the Integer "one") will be considered equal to 1.0 (the Float "one") in set-inclusion tests.
The alternative IdentitySet uses identity compare (==) and treats 1 and 1.0 as different objects.
The typical use of sets and identitySets is:

Because sets keep their elements unordered, the keyed access methods (#at: and #at:put:) are not allowed and trigger an error.

Sets and identitySets use hashing algorithms internally for inclusion tests. Thus, they are usually fast and show a constant access time which is determined by their fill-grade and the quality of the hash key algorithm; not by the absolute size of the set. The hash key generation is implemented by the elements - not by the set.

Sets automatically grow (and shrink) as required when elements are added or removed. To get good hashing performance, they grow larger (by some percentage) than the actual size required for the elements. Expect some 20% storage overhead when using sets.

IdentitySets are sometimes faster than Sets, since both hashKey generation and comparing is implemented more efficiently. If your keys are small integers or symbols, you may use an IdentitySet.
However, for string elements or float numbers, using identitySets is usually a bad choice.

More details are found in "Set's class description" and "IdentitySet's class description".

Dictionary and IdentityDictionary

Dictionaries are probably the most powerful and useful collections around; it is definitely worth knowing and using them.
Dictionaries are somewhat like arrays, in that they allow access to the elements by #at: and #at:put: messages, using an access key. However, in contrast to arrays they allow ANY other object to be used as key. Thus an array can be thought of as a special dictionary using numeric keys only (if you want to do this, it is faster to use an array rightaway - except if your array is filled very sparse).

Like sets, dictionaries come in two flavours: Dictionary uses value compare (=) and IdentityDictionary which uses identity compare (==) for key accesses.
Thus:

    |d value|

    d := Dictionary new.
    d at:1 put:'one'.
     ...
    value := d at:1.0
will return the expected 'one' in the variable "value", because the integer "1" compares equal to the float "1.0" and therefore, the dictionary will find a value under the key "1.0".
In contrast,
    |d value|

    d := IdentityDictionary new.
    d at:1 put:'one'.
     ...
    value := d at:1.0
will raise an error, since there is no element stored under the key "1.0" (remember: "1 = 1.0" returns true, while "1 == 1.0" returns false).

Like with sets, dictionaries use hashing internally, and therefore show a constant access time depending on the fill-rate only. Not on the absolute size. (However, they do depend upon the quality of the hashing algorithm used).

Dictionaries are very powerful collections, and worth a second look. You can put associations between objects into dictionaries. A simple example:

    |nameToAge|

    nameToAge := Dictionary new.
    nameToAge at:'fred' put:17.
    nameToAge at:'peter' put:21.
     ...
    age := nameToAgeAt:name ifAbsent:['I do not know'].
You can also use Dictionaries as a replacement for simple data structures, which are called ``record'' or ``struct'' in other programming languages:
    |peter marry|

    peter := IdentityDictionary new.
    peter at:#age put:23.
    peter at:#firstName put:'peter'.
    peter at:#middleName put:'m'.
    peter at:#name put:'smalltalker'.
    peter at:#salary put:51234.
     ...
    marry := IdentityDictionary new.
    marry at:#age put:21.
    marry at:#firstName put:'marry'.
    marry at:#middleName put:'s'.
    marry at:#name put:'smalltalker'.
    marry at:#husband put:peter.
    ...
    age := peter at:#age.
    firstNameOfHusband := (marry at:#husband) at:#firstName.
Of course, real programmers define a Person class and create an instance of it ... this is both more 'object oriented' and provides faster access than using dictionaries (which are not meant to be used for the above 'constant' key access).
However, the above scheme may be used, if the number of information slots to be held is not constant, or not foreseeable by the programmer. A typical example would be an application, in which the end-user is allowed to add user-defined fields to some entity.

More details are found in "Dictionary's class description" and "IdentityDictionary's class description".

OrderedCollection

OrderedCollections are collections which access their elements via a numeric key - in that, they are much like arrays. However, they are specially tuned for adding/removing elements (i.e. they can easily grow and shrink), which is a very expensive operation if applied to an array.
Therefore, orderedCollections can also be used as stacks and queues - either directly or by subclassing.

typical use:

Beginners may get a bit confused by the new: instance creation method, which uses the numeric argument as a size hint, not to specify the logical size of the resulting collection. It therefore returns a logically empty collection, which has some slots "preallocated" for performance reasons. One may argue about this and many do not like it. However, as it is too late to change it now without affecting thousands of programs, you will have to live with it (just like the others have to).

Notice, that the above #includes: message is usually much slower for large orderedCollections or arrays than it is for sets or dictionaries. The reason is that the former have to do a sequential search over all elements, while the later can use a fast hashed access, typically finding an element after a few compares.

Stacks are implemented by using #addLast: as push operation, and #removeLast as a pop operation.
Queues are implemented using #addLast: and #removeFirst.
Also, Smalltalk provides a specialized Queue class, which is tuned for this particular type of access.

More details are found in "OrderedCollection's class description".

SortedCollection

SortedCollections are like OrderedCollections, but they keep their elements in sorted order. By default, sortedCollections sort in ascending order, by comparing individual elements using '#>' (which implies that the elements must respond to this comparison message, and return a useful truth value).
You can specify your own sorting rule (so called sortBlock).

Sorting is done using quickSort and binary search (when inserting elements). typical use:

Be aware, that the building of a sorted collection by adding individual elements is not necessarily a fast operation. Because the contents must be shifted inside the collection in order to make an empty slot for incoming elements, this may actually be a very slow way of setting up a sorted collection (for the specialists: its an O(nČ) algorithm, unless the elements come in already sorted).
It is usually better to either add all elements to an OrderedCollection and to sort this as a whole, using the sort message. Or else to use another collection, such as a BTree, which is more specialized to insertion/deletion of individual elements.

More details are found in "SortedCollection's class description".

ByteArray

ByteArrays are tuned for space efficient storage of short (byte-valued) integers in the range 0..255. They are used for bitmap storage and executable (byte-) code.

Bytearrays are also valuable for data interchange with other programs via files, sockets, shared memory or pipes.
Protocolwise, they behave much like Arrays.

More details are found in "ByteArray's class description".

FloatArray and DoubleArray

Float- and DoubleArray are specially tuned for space efficient storage of short (typically 32bit) and double float (typically 64bit) Float values.
For normal users, these are less interesting, but offer performance and space advantages when matrices or 3D graphics are involved, where large float number collections are often needed.
Protocolwise, they behave much like Arrays.

More details are found in "FloatArray's class description".

Interval

An interval represents a set of numbers which can be generated from a start value, a final value and a step. For example, the even numbers between 2 and 10 can be represented by an interval as (2 to:10 by:2).

More details are found in "Interval's class description".

LinkedList

A linkedList consists of a chained number of Link or ValueLink instances. They provide reasonable fast methods for insertion at either ends, but show poor performance when any inner element is to be accessed (since the list has to be walked up to the element).

Although often overemphasized in existing computer science literature, linkedLists are not very often used in Smalltalk, due to their poor indexed access performance and their storage overhead.
In most cases, an orderedColllection or a specialized tree implementation will outperform a linkedList.

More details are found in "LinkedList's class description".

Queue and SharedQueue

A queue is a collection where elements are added at one end and removed at the other end. An interesting class to look at is SharedQueue (which is a subclass of Queue) and helps in writing producer/consumer applications with separate writer/reader processes.

Notice, that queue functionality is also provided by OrderedCollection (addLast: / removeFirst messages); however the Queue-class protocol mimics a stream (next and nextPut: messages) and is therefore slightly different.
If you don't care for this and if you don't need the synchronization of a SharedQueue, use whichever you find more appealing.

More details are found in "Queue's class description".

WeakArray, WeakIdentitySet, WeakIdentityDictionary

to be documented

More details are found in "WeakArray's class description" or "WeakIdentitySet's class description" or "WeakIdentityDictionary's class description".

Registry and HandleRegistry

to be documented

More details are found in "Registry's class description" or "HandleRegistry's class description".

CacheDictionary and ResourcePack

A cacheDictionary is a dictionary which only stores a certain number of elements - removing existing entries when full and new elements are to be added. As the name suggests, they can be used to keep a collection of recently used objects.

ResourcePacks are used to keep track of string translations for national language variants. Technically, they are dictionaries. However, the class provides additional protocol for loading resourcePacks from so called resource files.

More details are found in "CacheDictionary's class description" or "ResourcePack's class description".

MappedCollection

A mappedCollection offers indirect indexing; access is via a map which provides the actual access key.

More details are found in "MappedCollection's class description".

Notes:

(*)
Some collections are tuned for growing, while others are not; some use hashing or other fast access methods to support fast search and/or inclusion test, while others do sequential searches.
Please check out and understand the individual collection classes and use the one which best fits your needs to get best performance and memory utilisation. Most collection classes provide detailed information in their #documentation method.


Copyright © 1996-2016 Claus Gittinger Development & Consulting

<info@exept.de>

Doc $Revision: 1.38 $ $Date: 2016/10/13 23:28:46 $