
Binary Object Storage

Introduction

This document will teach you how to store and retrieve objects on/from an external medium.

Smalltalk offers various ways to store and retrieve objects to/from the external world. Besides the well-known #storeOn: method, binary storage is supported by any object.
Binary storage is both more dense (i.e. requires less space) and faster than textual storage in the normal case. In addition, the format used by the binary storage mechanism allows recursive and cyclic objects to be handled correctly, which is not possible with the ASCII representation used by the normal #storeOn: mechanism.

The disadvantages are: (1) the binary storage format is proprietary to each Smalltalk dialect, so communication with other dialects is usually not possible "out of the box"; (2) the binary format "knows" the layout of the stored object, so conversion may be needed if stored objects are loaded after a class has changed its instance layout.

Non-Binary Store

Before we look at binary storage, let's first have a look at the ASCII storage mechanism - this helps us understand the differences and special features of binary storage later.

In Smalltalk, all classes support the #storeOn: message, which asks the object to append a textual (i.e. ascii) representation of itself to a stream, from which a copy of this object can be reconstructed. This scheme works for simple objects, which do NOT contain self references or cycles. Also, this format is compatible among different Smalltalk implementations, if the layout of the instances is the same across them (i.e. the instance's class exists with the same instance layout on the target system).
For example:

    |myObject outStream|

    myObject := Array with:'hello world'
		      with:1.2345
		      with:#(1 2 3 4 5)
		      with:('one' -> #one).

    outStream := 'data' asFilename writeStream.
    myObject storeOn:outStream.
    outStream close.
This stores the array (myObject) in the file named "data".

If you inspect this file, you will notice that it contains a single Smalltalk expression (in textual representation) which when evaluated recreates the original array. From this, the object can be reconstructed by asking the compiler to evaluate that expression:

    |string myObject|

    string := 'data' asFilename readStream contents asString.

    myObject := Compiler evaluate:string.
    myObject inspect.
The above has been wrapped into an easier to use method, which is understood by any class:
    |string myObject|

    string := 'data' asFilename readStream contents asString.

    myObject := Object readFromString:string.
    myObject inspect.
Alternatively, you can read directly from the stream:
    |inStream myObject|

    inStream := 'data' asFilename readStream.

    myObject := Object readFrom:inStream.
    myObject inspect.
Thus, any object can be stored by sending it #storeOn: and retrieved by sending #readFrom: to Object or a class.
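
The round trip can also be tried without a file, by going through the object's storeString. The following is only a sketch for experimenting in a workspace; it uses nothing beyond #storeString and the #readFromString: message shown above. It also illustrates that the textual form records contents, not identity:

    |hello original copy|

    hello := 'hello'.
    original := Array with:hello with:hello.

    "round trip through a string instead of a file"
    copy := Object readFromString:original storeString.

    "the copy is expected to have equal contents"
    Transcript showCR:(copy = original) printString.

    "in the original, both slots refer to the very same string"
    Transcript showCR:((original at:1) == (original at:2)) printString.

    "the text does not record identity, so this is not guaranteed to be true"
    Transcript showCR:((copy at:1) == (copy at:2)) printString.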

Problems with Non-Binary Store

Although simple and straightforward, this mechanism has a few drawbacks:

Storing a literalArrayEncoding

Some of the above problems are fixed if you use a literalArrayEncoding for storage. This format is used internally to store and retrieve window, menu and other specifications which are used as GUI resources.
Actually, the literalArrayEncoding generates an array which represents the original object - however, the array's storeString can be used as an ASCII text representation and is therefore human readable.

The literalArrayEncoding is also very useful to store objects and descriptions in the program itself - that's what the windowSpec methods are actually for - they simply return an array which describes the original (spec-) object.

This store format is independent of the object's instance variable order. When an object is later retrieved, the setter methods are invoked for the values that were present at store time.
This means that even classes which are changed in the future can provide a backward compatibility protocol.

Using a literalArrayEncoding, you can store your objects with:

    anObject literalArrayEncoding storeOn:aStream
Alternatively, you may want to provide a storeLiteralOn: method in your class(es):
    storeLiteralOn:aStream
	self literalArrayEncoding storeOn:aStream

Perfectionists may even use the already present pretty-printer from the GUI framework, to create a nice, indented storeFormat:

    storeLiteralPrettyOn:aStream
	UISpecification
	    prettyPrintSpecArray:self literalArrayEncoding
	    on:aStream
	    indent:0
and:
    storeLiteralPrettyOnFile:aFilename
	|s|

	s := aFilename asFilename writeStream.
	self storeLiteralPrettyOn:s.
	s close.

Retrieval is by:

    fromLiteralFile:aFilename
	|s o|

	s := aFilename asFilename readStream.
	o := Array readFrom:s.
	s close.
	^ o decodeAsLiteralArray

This format does solve some of the storeOn: problems, but still cannot handle recursive or self-referencing object structures. Also, it does not preserve object identity. However, it may be suitable for many simple applications.

Sample code is found in the file: 'doc/coding/StoringObjectsAscii-example.st'. It contains a class named "User" which stores and retrieves instances using literalArray encoding. File this in using the FileBrowser and explore it in the SystemBrowser.

Storing in JSON Format

The JSON format was originally designed to exchange simple, non-recursive data structures between a JavaScript program in a web browser and its server-side counterpart (originally a Java program). This format is now widely supported by many programming languages, and can be used both for data interchange and for persistence of simple data like settings or preferences.

The JSON support classes are provided in a separate package, so you may have to load it first with:

    Smalltalk loadPackage:'stx:goodies/json'
To write objects, use a JSONPrinter; to read them, use a JSONReader:
     |o1 s o2|

     o1 := Dictionary withKeysAndValues:#('one' 1 'two' 2 'three' 3.0 'four' 'vier').
     s := JSONReader toJSON:o1.
     o2 := JSONReader fromJSON:s
For more information, refer to their class documentation and the examples found there. Be aware that the set of supported objects which can be stored/retrieved is very limited: basically, they must be Numbers, Strings, Booleans, Arrays and Dictionaries thereof.

If required, convert the objects into some Dictionary format, and store/retrieve those. Data can only be stored by value - no references (and definitely no recursive references) can be stored or retrieved.

Storing in XML Format

XML, although not a binary format, is able to store and retrieve arbitrary objects. This includes recursive references, and object identity is preserved. XML has the advantage that support for it exists in almost every programming language and Smalltalk dialect, which makes XML documents quite portable. However, XML requires more work from the programmer and is also relatively expensive in terms of memory and processing power. XML documents also tend to be less compact than the other formats described here. The use of XML is described in a separate document.

Using Binary Storage

Binary storage solves the above problems by storing objects in a compact encoded format, by keeping track of object identities, and by providing a migration mechanism which converts old stored objects to a new object layout on the fly.

In contrast to the #storeOn: format described above, this format is not meant to be human readable. Also, since it uses all 8 bits of a byte, it may not be possible to send binary encoded objects directly via some ancient transport mechanisms (e.g. old electronic mail transports which only support 7-bit ASCII, unless the data is uuencoded). This limitation is probably no longer relevant these days.

Binary storage has the disadvantage that it is not compatible between different Smalltalk implementations. Although all Smalltalk dialects do support some form of binary object storage with similar functionality, no common encoding standard exists.

It is used in pretty much the same way as above; simply replace #storeOn: by #storeBinaryOn: and #readFrom: by #readBinaryFrom: as in:

    |original retrieved hello inStream outStream|

    hello := 'hello'.
    original := Array with:hello
		      with:hello.

    outStream := 'data.bos' asFilename writeStream binary.
    original storeBinaryOn:outStream.
    outStream close.

    inStream := 'data.bos' asFilename readStream binary.
    retrieved := Object readBinaryFrom:inStream.
    inStream close.

    Transcript showCR:
	(original at:1) == (original at:2).  "evaluates to true"

    Transcript showCR:
	(retrieved at:1) == (retrieved at:2).  "evaluates to true"
The above can be used on any stream which supports reading/writing of bytes (e.g. a WriteStream on a ByteArray, FileStreams, Sockets, Pipes etc.).
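
As an illustration of the non-file case, here is a minimal sketch which stores into an in-memory ByteArray and reads the object back from those bytes. It assumes only the #storeBinaryOn: / #readBinaryFrom: messages shown above, together with the standard WriteStream and ReadStream classes:

    |original outStream bytes retrieved|

    original := Array with:'hello' with:#(1 2 3).

    "store into a growing ByteArray instead of a file"
    outStream := WriteStream on:ByteArray new.
    original storeBinaryOn:outStream.
    bytes := outStream contents.

    "read the object back from the collected bytes"
    retrieved := Object readBinaryFrom:(ReadStream on:bytes).

    Transcript showCR:(retrieved = original) printString.

This can be handy for experiments, because no file is left behind.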

The binary storage mechanism handles cyclic or self-referencing structures, preserving object identity. It does so by assigning unique object IDs (i.e. integers) to stored objects. It keeps track of previously assigned IDs, and writes only the ID when asked to store an object which was already stored before. In addition to preserving object identity, this also creates a more compact output, as each individual object's contents is only stored once. (The process of converting an arbitrary graph of objects into a flat sequence is also referred to as flattening or marshalling.)

At retrieval time, the reverse is done, keeping track of objectIDs as objects are restored and reconstructing the original references from the ID.
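
The ID bookkeeping can be pictured with a small workspace sketch. This is a conceptual illustration only, not the storage manager's actual code; the names ids, nextID and idFor are made up for this example:

    |ids nextID idFor hello|

    ids := IdentityDictionary new.      "maps object -> ID, compared by identity"
    nextID := 0.

    "answer the ID of an object; assign a fresh one on first encounter"
    idFor := [:obj |
	ids at:obj ifAbsent:[
	    nextID := nextID + 1.
	    ids at:obj put:nextID.
	    nextID
	]
    ].

    hello := 'hello'.
    Transcript showCR:(idFor value:hello) printString.    "1 - first encounter"
    Transcript showCR:(idFor value:'world') printString.  "2 - another object"
    Transcript showCR:(idFor value:hello) printString.    "1 - already known"

A writer built along these lines would emit the full contents only when a fresh ID is assigned, and a short reference otherwise.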

Example (storing the above self-referencing object):

    |original retrieved hello inStream outStream|

    original := Array new:3.
    original at:1 put:'hello'.
    original at:2 put:'world'.
    original at:3 put:original.

    outStream := 'data.bos' asFilename writeStream binary.
    original storeBinaryOn:outStream.
    outStream close.

    inStream := 'data.bos' asFilename readStream binary.
    retrieved := Object readBinaryFrom:inStream.
    inStream close.

    Inspector openOn:original title:'original'.
    Inspector openOn:retrieved title:'retrieved'.
Looking into the retrieved object in the inspector, you will find that the original self reference was correctly reconstructed.

Storing Objects in a Simple Database

For simple object storage, Smalltalk/X provides a class, called PersistencyManager, which implements a dictionary-like protocol and allows storage and retrieval of objects by a key.

The low-level mechanism used by PersistencyManager is based upon the "db-1.6" Berkeley database library, which is a successor of the well-known "dbm/ndbm" library.

Using PersistencyManager, objects can be stored with:

    ...

    manager := PersistencyManager file:'<somefileName>'.
    ...
    manager at:<someKey> put:<someObject>.
    ...
    manager close
and retrieved with:
    ...

    manager := PersistencyManager file:'<somefileName>'.
    ...
    <someObject> := manager at:<someKey>.
    ...
    manager close
The #at: / #at:put: interface is especially convenient, as you can test your application using in-memory dictionaries first, and switch to an external database later. As with ordinary Dictionaries, any object is allowed as key.
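
A sketch of this "dictionary first" approach: the code below only uses #at:put: and #at:, so the store variable (a name chosen for this example) can hold either an in-memory Dictionary or a PersistencyManager:

    |store record|

    "for testing; later replace by: PersistencyManager file:'sampleData'"
    store := Dictionary new.

    record := IdentityDictionary new.
    record at:#lastName put:'Sampleman'.
    record at:#personalID put:123456.

    store at:(record at:#personalID) put:record.

    Transcript showCR:((store at:123456) at:#lastName).

Only the line which creates the store has to change when switching to the database (plus closing the manager at the end).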

Example (storing):
(in a real-world application, you would create a PersonRecord class, and store its instances - instead of dictionaries).

    |manager record|

    manager := PersistencyManager file:'sampleData'.

    record := IdentityDictionary new.
    record at:#firstName put:'Joe'.
    record at:#lastName put:'Sampleman'.
    record at:#age put:35.
    record at:#salary put:75000.
    record at:#personalID put:123456.
    manager at:(record at:#personalID) put:record.

    record := IdentityDictionary new.
    record at:#firstName put:'Boris'.
    record at:#lastName put:'Jelzin'.
    record at:#age put:99.
    record at:#salary put:175000.
    record at:#personalID put:34561.
    manager at:(record at:#personalID) put:record.

    record := IdentityDictionary new.
    record at:#firstName put:'Tony'.
    record at:#lastName put:'Friedman'.
    record at:#age put:25.
    record at:#salary put:35000.
    record at:#personalID put:78905.
    manager at:(record at:#personalID) put:record.

    manager release.
Example (retrieving):
    |manager record|

    manager := PersistencyManager file:'sampleData'.
    record := manager at:78905.
    manager release.

    record inspect
Notes:
PersistencyManager does not provide the functionality of a real database - it is just a goody thrown in, for simple applications.
A particular limitation is that only a single key is supported - to look up objects by other keys, you have to manually add functionality on top of this basic mechanism (e.g. using multiple databases to provide mappings from different keys to the object's real key).

Don't blame us for this - after all, this is a free goody.

The following shows a quick and dirty hack to provide multiple keys; here, we assume that the "key-to-personID" mappings are small enough to fit into memory (and therefore, we retrieve them entirely). The mappings are stored within the same database, under a special key.

First, build the "lastName to personID" mapping (ignoring duplicates, for simplicity):

    |manager nameToIDMapping|

    manager := PersistencyManager file:'sampleData'.

    nameToIDMapping := Dictionary new.
    manager do:[:record |
	"/ ignore non-person objects
	(record includesKey:#personalID) ifTrue:[
	    nameToIDMapping
		at:(record at:#lastName)
		put:(record at:#personalID)
	]
    ].

    "/ store the mapping under a special key

    manager at:#nameToIDMapping put:nameToIDMapping.
    manager release.
Then, retrieve the nameToIDMapping first, and use it to fetch records by name:
    |manager nameToIDMapping record|

    manager := PersistencyManager file:'sampleData'.
    nameToIDMapping := manager at:#nameToIDMapping.

    record := manager at:(nameToIDMapping at:'Friedman' ifAbsent:nil).

    manager release.

    record inspect

Layout of Binary Data

This section describes the logical structure of stored binary data. It may be useful for a deeper understanding of how binary object storage works and why errors (as described below) can occur.
It may also be interesting, if you want to write a loader for binary data in another programming language (such as C/C++); however, to do this, you have to have a look into the source code - the information presented here is not detailed enough, and certainly not meant as a specification.

You may skip this section - and use binary storage while ignoring these internals.

A binary object stream consists of a sequence of typeID bytes and objectID bytes. The typeID specifies how following bytes are to be interpreted.
Basically, there are four major typeIDs:

When writing, objects and their corresponding IDs are remembered. When an object is about to be stored again, only a reference via its ID is written. When retrieving, an "id-to-object" table is maintained, to resolve such references.

As an example, the binary representation of:

    s := 'hello'.
    Array
       with:1
       with:s
       with:s.
looks like:

    classDefinition
	ID: 1
	name:      'String'
	...

    objectDefinition
	ID: 2
	classID:    1
	contents:   'hello'

    classDefinition
	ID: 3
	name:       'Array'
	...

    objectDefinition
	ID: 4
	classID:    3
	contents:
	    specialObject(SmallInteger)  1
	    objectReference  ID: 2
	    objectReference  ID: 2
(the above is a conceptual picture - the real encoding is somewhat different)

The interesting thing is that classes are stored by name, not by contents. This is done to limit the amount of stored data.

If this were not done, and the class structure were treated like any other object instead, a binary store would trace and dump all classes along the object's superclass chain, thereby dumping class variables and metaclasses, and in most cases traversing the full set of existing objects. (This is because it may encounter the list of global variables in the Smalltalk object - from which almost every other object can be reached.)

Obviously, this is not a behavior we want (it is cheaper to save a snapshot image to get this ;-).

Since classes are stored by name, a corresponding class must be available at reconstruction time (see below on how the system behaves if that is not the case). To catch the case of changed class layouts, additional information (a so-called signature) is written with the name in a classDefinition block. This signature is checked against the existing class's signature at reload time, and an exception is raised if they do not match.
The signature contains enough information to reconstruct a dummy container class for the restored object. However, no semantic information (i.e. methods) is stored.

For a full description of typeIDs, read the class documentation of BinaryOutputManager and its subclasses.

Tricks and Hints

It should be clear, that some overhead is involved in managing object IDs during binaryStore/binaryLoad. The storage manager has to keep track of object <-> ID associations during the store and again during the read. This is done in an identityDictionary which is constructed during the process.

Also, every binaryStore and binaryRead operation starts with a new, empty association table and saves class definitions again (assuming that the objects stored in individual #storeBinaryOn: operations are to be reconstructed using individual #readBinaryFrom: later).

Therefore, there is a big difference in the time/space requirements of the following two examples:

    |array element outStream|

    element := 1@1.
    array := Array new:1000 withAll:element.

    outStream := 'data1.bos' asFilename writeStream binary.
    array storeBinaryOn:outStream.
    outStream close.
and:
    |array element outStream|

    element := 1@1.
    array := Array new:1000 withAll:element.

    outStream := 'data2.bos' asFilename writeStream binary.
    array do:[:el |
	el storeBinaryOn:outStream.
    ].
    outStream close.
The first stores the definition of the Point class only once, reusing it for every stored point. The second stores this class definition once for each individual point.
The size of the created files shows this: the first requires 1.9Kb, while the second requires 24.4Kb. Also, the times required to store/load the data are quite different: 130ms vs. 2800ms (stored via NFS to a disk on a remote machine; your actual numbers will be different, but the ratio should be similar).

The second example has the advantage that individual elements can be read from the file (if you remember the file positions). In contrast, the first example's data can only be reconstructed as a whole array.
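
Remembering the file positions could look like the following sketch. It assumes that the stream's #position and #position: messages are used symmetrically (the value answered at store time is passed back unchanged at read time); the positions collection is a name made up for this example:

    |points positions outStream inStream third|

    points := Array with:1@1 with:2@2 with:3@3.
    positions := OrderedCollection new.

    outStream := 'data2.bos' asFilename writeStream binary.
    points do:[:el |
	"remember where this element starts"
	positions add:outStream position.
	el storeBinaryOn:outStream.
    ].
    outStream close.

    "fetch the third element only"
    inStream := 'data2.bos' asFilename readStream binary.
    inStream position:(positions at:3).
    third := Object readBinaryFrom:inStream.
    inStream close.

    third inspect

This works because every individual #storeBinaryOn: writes a self-contained chunk, including its own class definitions. In a real application, the positions would have to be kept somewhere as well (for example in a separate index file).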

In some cases, you may want to avoid the above overhead, AND store data while reusing information about previously stored classes/objects.
This makes sense, if:

If the above preconditions are true, you can reuse the storage manager's collected internal state and incrementally store objects.
To do this, use a lower level interface to the storage manager:
Storing:
    |array element outStream manager|

    element := 1@1.
    array := Array new:1000 withAll:element.

    manager := BinaryOutputManager new.
    outStream := 'data3.bos' asFilename writeStream binary.
    array do:[:el |
	el storeBinaryOn:outStream manager:manager.
    ].
    outStream close.
    manager release.
Loading:
    |array element inStream manager|

    array := Array new:1000.

    inStream := 'data3.bos' asFilename readStream binary.
    manager := BinaryInputManager on:inStream.
    1 to:array size do:[:index |
	array at:index put:(manager nextObject).
    ].
    inStream close.
    array inspect
As a concrete example, consider the case, where you have a tree of person objects, consisting of firstName, lastName and whatever, but you only want to binaryStore the firstName values of each node:
    |tree outStream manager|

    ...

    manager := BinaryOutputManager new.
    outStream := 'namedata.bos' asFilename writeStream binary.
    tree inOrderDo:[:aNode |
	|name|

	name := aNode firstName.
	name storeBinaryOn:outStream manager:manager.
    ].
    outStream close.
    manager release.
and reconstruct the tree with the names only:
    |tree name inStream manager|

    ...
    tree := NameTree new.
    ...

    inStream := 'namedata.bos' asFilename readStream binary.
    manager := BinaryInputManager on:inStream.

    [inStream atEnd] whileFalse:[
	name := manager nextObject.
	tree insertNode:(PersonNode for:name).
    ].
    inStream close.
    ...
Using a little trick, it is also possible to extract individual objects from this data stream; to do this, you have to read and skip all objects before the one to be reconstructed (to let the manager build up its ID information table).
The following example stores 1000 individual points:
Storing:
    |array element outStream manager|

    array := ((1 to:1000) collect:[:i | i @ i]) asArray.

    manager := BinaryOutputManager new.
    outStream := 'data3.bos' asFilename writeStream binary.
    array do:[:el |
	el storeBinaryOn:outStream manager:manager.
    ].
    outStream close.
    manager release.
The following then reads the 400th point:
    |element inStream manager|

    inStream := 'data3.bos' asFilename readStream binary.
    manager := BinaryInputManager on:inStream.

    399 timesRepeat:[manager nextObject].
    element := manager nextObject.

    inStream close.
    element inspect
The input manager offers a (slightly faster) skipObject method for skipping:
    |element inStream manager|

    inStream := 'data3.bos' asFilename readStream binary.
    manager := BinaryInputManager on:inStream.

    399 timesRepeat:[manager skipObject].
    element := manager nextObject.

    inStream close.
    element inspect
Since all class and object definitions still have to be processed, do not expect skipObject to be dramatically faster than nextObject.

Error handling

Binary storage is much more sensitive to changed instance layout (of classes) than textual storage. Consider the following case:
  1. an object is stored somewhere
  2. the object's class is changed to include one more instance variable
  3. you try to (binary-) load the original object
Of course, at retrieval time, the class as it now exists is no longer valid for the object to be reconstructed.
Notice: this is also true with textual storage for most classes, since the default storeOn: as defined in the Object class stores a description which reconstructs the object based on instVarAt:put:. Of course, this also reconstructs a wrong object if the relative offsets of instance variables have changed. (If you want to take precautions against this, reimplement the storeOn: method in your classes so that it does not create instVarAt:put: expressions, but writes expressions which send instance variable access messages instead.)

To avoid this, some classes redefine storeOn: and create an expression based on the instance variables' names.
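
Such a redefinition might look like the following sketch. The Person class, its instance variables firstName and age, and the setters #firstName: and #age: are hypothetical and only serve as an illustration:

    storeOn:aStream
	"append an expression which recreates the receiver via its
	 setter methods (hypothetical Person class with the instance
	 variables firstName and age)"

	aStream nextPutAll:'(Person new firstName:'.
	firstName storeOn:aStream.
	aStream nextPutAll:'; age:'.
	age storeOn:aStream.
	aStream nextPutAll:'; yourself)'

For a person named Joe, aged 35, this writes the text (Person new firstName:'Joe'; age:35; yourself), which no longer depends on the order or offsets of the instance variables.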

Smalltalk/X offers an error handling mechanism to catch situations when an object is restored for which no valid class exists. As usual, the error is signalled using the exception mechanism, by raising some signal (see "Exception handling and signals").

It is possible to handle these signals and either:

Ignoring Errors

All errors are signalled by one of the signals: By defining a handler for these (or for one of the parent signals), an error during the binary read operation will not bring you into the debugger.
Instead, the exception handler can decide what to do:
The handler gets the newly created subclass of ObsoleteObject as parameter; this allows the handler to decide individually, for every detected class, how things are to be handled. That class is named after the original class's name, and has all required meta information at hand; especially, the instance size and the names of the instance variables may be of interest.

After a proceed, the handler will not be called again for the same class; any further retrieved objects of the same class will be silently made instances of the same class (either as obsolete, or whatever the handler returned in the first place).

Examples:
Abort the binary load on any error:

    |inStream data|

    ...
    inStream := .... asFilename readStream binary.
    ...
    BinaryIOManager binaryIOError handle:[:ex |
	"
	 other error (such as corrupted file etc.)
	"
	Transcript showCR:'some other error occurred in binary load'.
	Transcript showCR:'abort the load ...'.
	ex return.
    ] do:[
	BinaryIOManager invalidClassSignal handle:[:ex |
	    |oldClass|

	    oldClass := ex parameter.
	    Transcript showCR:'cannot restore instance of ' , oldClass name.
	    Transcript showCR:'reason: ' , ex signal notifierString.
	    Transcript showCR:'abort the load ...'.
	    ex return.
	] do:[
	    data := Object readBinaryFrom:inStream.
	]
    ].
    ...
    inStream close.
    ...
In the above, the binary read will be aborted, and nil will be left in data.

Ignoring the error to return an obsoleteObject:

    |inStream data|

    ...
    inStream := .... asFilename readStream binary.
    ...
    BinaryIOManager binaryIOError handle:[:ex |
	...
    ] do:[
	BinaryIOManager invalidClassSignal handle:[:ex |
	    |oldClass|

	    oldClass := ex parameter.
	    Transcript showCR:'cannot restore instance of ' , oldClass name.
	    Transcript showCR:'reason: ' , ex signal notifierString.
	    Transcript showCR:'continue with obsolete object...'.
	    ex proceed.
	] do:[
	    data := Object readBinaryFrom:inStream.
	]
    ].
    ...
    inStream close.
    ...
In the above, data may contain an instance of a subclass of ObsoleteObject. This object will not be usable, since it traps on most messages into a messageNotUnderstood exception.
However, it will contain the original values, so manual or programmatic conversion is possible.
(A concrete application could provide some kind of database conversion procedure to convert all obsolete objects into something useful.)

Return a replacement class and retrieve these objects as instances of that:

    |inStream data|

    ...
    inStream := .... asFilename readStream binary.
    ...
    BinaryIOManager binaryIOError handle:[:ex |
	...
    ] do:[
	BinaryIOManager invalidClassSignal handle:[:ex |
	    |oldClass|

	    oldClass := ex parameter.
	    Transcript showCR:'cannot restore instance of ' , oldClass name.
	    Transcript showCR:'reason: ' , ex signal notifierString.
	    Transcript showCR:'return as instance of another class ...'.
	    ex proceedWith:ReplacementClass.
	] do:[
	    data := Object readBinaryFrom:inStream.
	]
    ].
    ...
    inStream close.
    ...
See example code in "doc/coding/BOSS-errors".

Correcting Errors

It is possible to automatically convert obsolete objects to another format or to make them become an instance of another class while reading binary data.

To do so, the binaryLoader will raise the requestConversion exception, passing the existing class and the obsolete object as arguments to the exception handler. The handler should somehow try to convert the obsolete object and proceed with the new object as value.

This conversion signal is only raised by the binary loader if an exception handler is present; therefore, not handling (or ignoring) the conversion signal results in obsolete objects being returned from the binary load (as described above).

Also, since any invalidClass exceptions are raised before any conversion is tried, these must be handled as described above.
The reason is that during binaryStore/binaryRead, classes are written/encountered first, before any instances. Therefore, all class related exceptions will occur first; but only once per class, since classes (like any other object) are only stored once.

Conversion requests are signalled for each individual obsolete object being loaded (in contrast to the above invalidClass signals, which are only signalled once per class).

The existing (new) class can provide a conversion method (#cloneFrom:), which should create and return a new instance of itself based on a template object.
Here, the template object is the obsolete object as retrieved from the binary load.
A default #cloneFrom: method is provided, which creates an object with all named and indexed instance variables preserved. However, for special needs, your class may redefine this method and do whatever is required for conversion (or even decide to return nil ...)

For more details, see example code in "doc/coding/BOSS-errors"

Skipping Instvars

For some objects, it does not make sense to store all of their instance variables; either because those are not needed or can be easily reconstructed, or will not be valid at reload time.
For example, an object which has a reference to a process or view or any other object which may not be valid (or can be reconstructed) at load time, may want to skip these in the store operation, and reconstruct or leave them as nil when doing a binary read.
To do this, the object must implement two methods to respond to #representBinaryOn: and #readBinaryContentsFromData:manager:.

For an example, see the file "doc/coding/BOSS-special"

Limitations & Bugs

Due to the implementation of ST/X, you cannot (currently, and maybe forever) retrieve processes and contexts via binary storage (you cannot retrieve them via normal textual storage either).
If your application requires this, you have to store the state of the process's computation in some other way and create a new process when the object is retrieved.

Since views require a process for proper operation (the windowgroup process), this limitation results in the inability to store and retrieve views.


Copyright © 1995 Claus Gittinger, all rights reserved

<cg at exept.de>

Doc $Revision: 1.28 $ $Date: 2021/03/13 18:24:51 $