It is NOT (I repeat: NOT) meant to start another language war; however, those who like these wars should at least base the discussion upon facts instead of myths ...
Please: Leave me alone in such a language war - it's a waste of time.
Although the following concentrates on performance, there are many other things besides raw speed to consider:
It may be better to compare Smalltalk to other object oriented languages, such as Java, C++, Ada9x, CLOS or Eiffel - to name a few. Or, more recent innovations: Python, Ruby and JavaScript.
When comparing with C programs, a programming technique similar to that used in object oriented languages should be used - for example, simulating message sends by indirect function calls (btw: this is how C++ does it).
For a real comparison, a full application should be written in the various languages, and its performance judged.
Microbenchmarks which concentrate on a single feature or language construct are seldom directly related to the overall speed.
However, without them, reproducible measurements are difficult to acquire; therefore, you will see some microbenchmarks in the following chapters.
The claim that message lookup makes Smalltalk slow is definitely no longer true with modern Smalltalk implementations, which heavily use caching to avoid most lookup operations.
In particular, with the help of inline caching (see Ungar [5]), most message sends are reduced to an indirect function call plus an additional compare instruction.
As shown below, caching reduces Smalltalk's message send speed
to roughly the speed of
(compile-time resolved) virtual function calls in C++ or indirect function calls in C.
We acknowledge the fact that a direct function call is still faster than an indirect one - therefore, the speed of C function calls, C++ non-virtual function calls (or even: inline code) can never be reached in Smalltalk (which is not completely true: read the documentation on the Self language).
However, once we abandon message sends, we are leaving the object oriented arena - even C++ fans have come to the insight that - for flexibility & reuse - most functions should be virtual functions (some must be, for proper operation to be guaranteed - which may make things a bit tricky occasionally).
The Benchmark:
Although originally written to benchmark function calling speed in Lisp systems, the tak benchmark results are also a good indicator of non-polymorphic message send speed.
The tak function (or method) calls itself recursively; except for a bit of integer arithmetic (subtracting the constant `one'), all it does is function calling / message sending.
When executed with the arguments 18, 12 and 6, it is recursively invoked 63609 times and performs 47706 subtract operations.
The following table lists some execution times.
BTW: It is interesting to see that modern systems are far better than the systems we used to work on a few years ago ... which we considered fast in those days ;-).
So, today, interpreted Smalltalk on a low-price PC is more than 3 times faster than compiled C on a much more expensive VAX 11/780.
System Language Time (ms) Notes
VAX 11/750 Franz Lisp (slowest) 47600 generic arithmetic
VAX 11/750 PSL (portable std. Lisp) 7100 generic arithmetic
VAX 11/750 Franz Lisp 3600 fixnum arithmetic
Symbolics LM2 Lisp 2905
VAX 11/750 C 2400
68k C 1900 unknown clock; probably 8Mhz
VAX 11/750 Franz Lisp (best) 1900 fixnum arithmetic (lfc)
VAX 11/750 PSL (portable std. Lisp) 1400 inum arithmetic
VAX 11/780 C 1350
VAX 11/780 Franz Lisp 1100 fixnum arithmetic (lfc)
68k Assembler 700 unknown clock; probably 8Mhz
Dorado Interlisp 526
LMI Lambda microcompiled Lisp 450
Symbolics 3600 Lisp 430
i586/100 ST/X interpreted bytecode 376 (*)
i586/100 ST/X compiled (-O) (Float args) 167 (**)
i586/100 ST/X compiled 55 (*)
Cray-1 PSL (portable std. Lisp) 44
i586/100 ST/X compiled (-O) 32 (*)
i586/100 ST/X JIT compiled bytecode 31 (*)
i586/100 ST/X compiled (-O +optContext) 30 (***)
i586/100 C (gcc2.6.3) 26.9 (*) double version
i586/100 C++ (gcc2.6.3) 26.7 (*)
i586/100 C (gcc2.6.3 -O) 23.8 (*) double version
i586/100 C++ (gcc2.6.3 -O) 21.87 (*)
i586/100 C (gcc2.6.3) 15.5 (*)
i586/100 C (gcc2.6.3 -O) 12 (*)
P3/700 ST/X JIT compiled bytecode 4.6 (*)
I5/2.4Ghz ST/X compiled (-O) 0.6 (*4)
(*) Except for the times marked as (*), the numbers are taken from:
``Performance and Evaluation of Lisp Systems'', by Richard P. Gabriel, MIT Press.
(**) You cannot do this in inum/fixnum arithmetic Lisps; you have to write a separate function (virtual function) in C/C++ in order to support double args.
(***) +optContext generates code where the full debugging information (i.e. context chain) is available as usual, but the machine's full register set is not saved with every method invocation. The consequence is that methods are not restartable in the debugger (they are abortable & interruptable, though).
(*4) Late update 2020, on a 2012 MacBook; a message send takes about 4ns.
Now that is interesting: although this little program does almost nothing else than message sending (there is a tiny bit of arithmetic), Smalltalk is about 2.5 times slower (30ms vs. 12ms) than C (on the i586).
C is almost twice as fast as C++ (which also has to perform an indirect function call), but Smalltalk is only about 30% slower than C++ (for reasons described below).
Times are a-changing: today, ST/X on a Pentium is faster than compiled Lisp on a Cray-1 used to be a few years ago :-))
[big smiley here; I don't know any details about the PSL implementation].
The ST/X compiler does a bunch of optimizations to reduce most of the overhead; the remaining difference in execution time is probably not related to message sending, but to other factors.
For example, try:
"TakBenchmark takX:18.0 Y:12.0 Z:6.0
"
The execution time of this is listed as ``Float args'' in the above list.
Notice that this heavily allocates float objects - the increased execution time is mostly due to their allocation.
The execution time of a corresponding C version (using doubles instead of ints) is listed in the table as ``double version''.
int tak(int x, int y, int z);

int main() {
    int i;

    for (i = 0; i < 1000; i++) {
        tak(18, 12, 6);
    }
    return 0;
}

int tak(int x, int y, int z) {
    if (y >= x) return z;
    return tak(tak(x-1, y, z),
               tak(y-1, z, x),
               tak(z-1, x, y));
}
(execute it with "time a.out"
and divide the resulting time by 1000)
class Tak {
public:
    Tak () { }
    ~Tak () { }
    virtual int tak (int x, int y, int z);
};

int Tak::tak (int x, int y, int z)
{
    if (y >= x) return z;
    return tak(tak(x-1, y, z),
               tak(y-1, z, x),
               tak(z-1, x, y));
}

int main() {
    Tak p;
    for (int i = 0; i < 1000; i++) {
        p.tak(18, 12, 6);
    }
    return 0;
}
(execute it with "time a.out"
and divide the resulting time by 1000)
(defun tak (x y z)
  (if (not (< y x))
      z
      (tak (tak (1- x) y z)
           (tak (1- y) z x)
           (tak (1- z) x y))))
(tak 18 12 6)
!TakBenchmark class methodsFor:'benchmarking'!

takX:x Y:y Z:z
    y < x ifFalse:[
        ^ z
    ] ifTrue:[
        ^ self
            takX:(self takX:(x - 1) Y:y Z:z)
            Y:(self takX:(y - 1) Y:z Z:x)
            Z:(self takX:(z - 1) Y:x Z:y)
    ]

    "
     Time millisecondsToRun:[
         TakBenchmark takX:18 Y:12 Z:6
     ]
    "
"On a fast machine:
(Time millisecondsToRun:[
100 timesRepeat:
[ TakBenchmark takX:18 Y:12 Z:6 ]
]) / 100.0
"
The Benchmark:
The benchmark builds up a linked list consisting of 1000 elements, and frees
them afterwards. The benchmark is executed 1000 times, for a total of 1 million
created elements. Each created element has a size of (at least) 8 bytes.
Notice that in C++ and Smalltalk, the actual size of the allocated objects is bigger: in C++, there is a need to store (at least) a reference to the virtual function table, bringing the overall memory usage to (at least) 12Mb.
Depending on the internals of the underlying malloc() / free() algorithm, additional (invisible) allocation overhead can be expected both in C and C++.
In Smalltalk, every object includes a header containing its size, class and additional information; the header's size is 12 bytes. The header includes all required information (i.e. there is no other hidden allocation overhead). Therefore, an overall of 20Mb is allocated in the ST/X run.
Notice also that no freeing of the objects is required in Smalltalk - the system does this automatically for you. During the run, several garbage collects occur in Smalltalk - the execution time includes those.
This is a very artificial benchmark, but probably the shortest and easiest way to build some dynamic structure and release it afterwards. In other words: probably the shortest way to measure the memory management system alone.
In real world systems, much more would be done - especially, the created elements would be initialized ...
Here are some execution times (all done on the same system, using the same gcc compiler):
(inline) - means: compile with -DINLINE
System Language Time (ms)
i586 ST/X interpreted 10702
i586 C++ (gcc2.6.3) 3500
i586 C++ (gcc2.6.3) virtual 3490
i586 C++ (gcc2.6.3) inline 3005
i586 C++ (gcc2.6.3 -O6) virtual 2850
i586 C++ (gcc2.6.3 -O6) 2840
i586 C++ (gcc2.6.3 -O6) inline 2490 (-O6 slower than -O; why is that?)
i586 ST/X JIT compiled bytecode 2467
i586 C++ (gcc2.6.3 -O) 2460
i586 C++ (gcc2.6.3 -O) inline 2460
i586 ST/X compiled 2224
i586 ST/X compiled (-O6) 2042
i586 C (gcc2.6.3) 1770
i586 C (gcc2.6.3 -O) 1670
i586 C (gcc2.6.3 -O6) 1620
To be fair, I have to admit that these numbers are only valid for short-term memory - memory which survives for a long time (i.e. finds its way into a higher generation of the memory hierarchy) is reclaimed more slowly.
However, in practice, most of the allocated objects are reclaimed shortly after allocation, so the above results should hold for most real world applications.
Most Smalltalk implementations allow further tuning / adjustment of the allocation policy parameters, to give the memory management hints about abnormal allocation behavior. The above numbers were taken with a default setup (the system adjusts its parameters itself).
For example, changing a single parameter in ST/X's memory policy (newSpaceSize)
can further reduce the execution time (down to 1845),
which is almost the C speed.
We do not recommend doing this (because it has effects on other parts of the system,
especially affecting real time response),
but it shows that there is much more to understanding automatic memory management systems.
On the other hand: in real world applications, mechanisms based upon malloc()/free() may suffer from their own problems which are not shown by this simple benchmark - for example, re-entrancy trouble when new() is invoked in a signal handler which interrupted another new().
TODO: measure long term memory behavior
#include <stdlib.h>

struct link {
    int value;
    struct link *next;
};

void benchalloc();

int main() {
    int i;

    for (i = 0; i < 1000; i++) {
        benchalloc();
    }
    return 0;
}

void benchalloc() {
    struct link *link, *next;
    struct link *anchor = (struct link *)0;
    int i;

    for (i = 0; i < 1000; i++) {
        link = (struct link *) malloc(sizeof(struct link));
        link->value = i;
        link->next = anchor;
        anchor = link;
    }
    for (link = anchor; link; link = next) {
        next = link->next;
        free(link);
    }
}
(execute it with "time a.out")
class Link {
int value;
Link *next;
public :
Link () {};
virtual ~Link() {};
#ifdef INLINE
inline void setNext(Link *l) { next = l; };
inline Link *getNext() { return next; };
inline void setValue(int v) { value = v; };
inline int getValue() { return value; };
#else
# ifdef VIRTUAL
virtual void setNext(Link *);
virtual Link *getNext();
virtual void setValue(int);
virtual int getValue();
# else
void setNext(Link *);
Link *getNext();
void setValue(int);
int getValue();
# endif
#endif
};
#ifndef INLINE
void Link::setValue(int v)
{
value = v;
}
int Link::getValue ()
{
return value;
}
void Link::setNext(Link *l)
{
next = l;
}
Link *Link::getNext()
{
return next;
}
#endif /* not INLINE */
void
benchAlloc() {
Link *link, *next;
Link *anchor = (Link *)0;
for (int i=0; i<1000; i++) {
link = new Link;
link->setValue(i);
link->setNext(anchor);
anchor = link;
}
for (link = anchor; link; link = next) {
next = link->getNext();
delete link;
}
}
int main() {
    for (int i = 0; i < 1000; i++) {
        benchAlloc();
    }
    return 0;
}
(execute it with "time a.out")
!LinkedListBenchmark class methodsFor:'benchmarking'!

buildList
    |anchor link|

    anchor := nil.
    1 to:1000 do:[:i |
        link := ValueLink basicNew value:i.
        link nextLink:anchor.
        anchor := link.
    ]

    "
     Time millisecondsToRun:[
         1000 timesRepeat:[LinkedListBenchmark buildList]
     ]
    "
Notice that in Smalltalk, the 1-to-1 translated, hand-coded initialization code is not what an experienced Smalltalker would use; alternative versions are shown and measured as well.
The code is:
Java:
long t1 = System.currentTimeMillis();
for (int j = 0; j < 10; ++j) {
    List<Integer> list = new ArrayList<Integer>();
    for (int i = 0; i < 1000000; ++i) {
        list.add(i);
    }
}
long t2 = System.currentTimeMillis();
System.out.println("Time taken: " + (t2 - t1) / 10.0);
Java preallocated ArrayList:
...
List<Integer> list = new ArrayList<Integer>(1000000);
for(int i = 0; i < 1000000; ++i) {
list.add(i);
}
...
Smalltalk V1 (hand-coded, naive loop):
Transcript showCR:(
    (Time millisecondsToRun:[
        |c|

        10 timesRepeat:[
            c := OrderedCollection new.
            1 to:1000000 do:[:i | c add:i].
        ].
    ]) / 10)
Smalltalk preallocated V1 (hand-coded, naive loop):
...
|c|
10 timesRepeat:[
c := OrderedCollection new:1000000.
1 to:1000000 do:[:i | c add:i].
].
...
Smalltalk V2 (better code):
...
|c|
10 timesRepeat:[
(1 to:1000000) asOrderedCollection
].
...
Timings:
Java | 139 |
Java preallocated | 123 |
Smalltalk v1 | 177 |
Smalltalk v1 preallocated | 171 |
Smalltalk v2 | 60 |
To get a feeling of the amount of time spent in string handling vs. hashtable handling, the code is written both for string and integer keys.
The code is:
Java 1mio string values (variant 1):
Hashtable<String, Integer> h = new Hashtable<String, Integer>();
for(Integer i = 0; i < 1000000; ++i) {
h.put(i.toString(), i);
}
Java 1mio string values (variant 2):
Hashtable<String, Integer> h = new Hashtable<String, Integer>();
for(int i = 0; i < 1000000; ++i) {
h.put(Integer.toString(i), i);
}
Java 10k string values:
Hashtable<String, Integer> h = new Hashtable<String, Integer>();
for(int i = 0; i < 1000000; ++i) {
h.put(Integer.toString(i % 10000), i);
}
Java 1mio integer values:
Hashtable<Integer, Integer> h = new Hashtable<Integer, Integer>();
for(Integer i = 0; i < 1000000; ++i) {
h.put(i, i);
}
Java 10k integer values:
Hashtable<Integer, Integer> h = new Hashtable<Integer, Integer>();
for(int i = 0; i < 1000000; ++i) {
h.put(new Integer(i % 10000), i);
}
Smalltalk 1mio string values:
h := Dictionary new.
1 to:1000000 do:[:i | h at:i printString put:i].
A variant is to use a bulk-initializing method, for example:
h := Dictionary withKeys:(1 to:1000000) andValues:(1 to:1000000).
Smalltalk 10k string values:
h := Dictionary new.
1 to:1000000 do:[:i | h at:(i \\ 10000) printString put:i].
Timings:
Java 1mio string variant1 | 1413 |
Java 1mio string variant2 | 1481 |
Java 10k string | 132 |
Java 1mio integer | 412 |
Java 10k integer | 47 |
Smalltalk 1mio string | 3788 |
Smalltalk 10k string | 619 |
Smalltalk 1mio integer | 926 |
Smalltalk 1mio variant | 568 |
Smalltalk 10k integer | 364 |
However, this is not really surprising, considering the amount of work (man-years) which went into the Java HotSpot compiler, as compared to the amount which went into ST/X (a one man show).
There are good reasons for doing this - the old systems did it because there were no fancy graphic controllers, and pixel graphics was a recent invention.
The modern Smalltalk implementation (i.e.: Squeak) does this for flexibility - who else is able to let a 3D animated bunny hover above a browser window, rotating in the z-dimension around the x-axis?
Smalltalks for commercial use (as opposed to academic testbeds) typically interface to the graphics system via higher level APIs - typically X11 or WinAPI.
The overhead introduced by Smalltalk is marginal, and the graphical display of text, lines or any other graphics primitive which is directly supported by the API is almost as fast as in programs implemented in other languages.
Copyright © 1995 Claus Gittinger, all rights reserved
<cg at exept.de>