It is NOT (I repeat: NOT) meant to start another language war; however, those who like these wars should at least base the discussion upon facts instead of myths ...
Please: Leave me alone in such a language war - it's a waste of time.
Although the following concentrates on performance, there are many other things besides raw speed to consider:
It may be better to compare Smalltalk to other object oriented languages, such as Java, C++, Ada9x, CLOS or Eiffel - to name a few. Or, more recent innovations: Python, Ruby and JavaScript.
When comparing with C programs, a programming technique similar to that used in object oriented languages should be used - for example, simulating message sends by indirect function calls (btw: this is how C++ does it).
For a real comparison, a full application should be written in the various languages, and its performance judged.
Microbenchmarks which concentrate on a single feature or language construct are seldom directly related to the overall speed.
However, without them, reproducible measurements are difficult to acquire; therefore, you will see some microbenchmarks in the following chapters.
The claim that message lookup makes Smalltalk slow is definitely no longer true with modern Smalltalk implementations, which heavily use caching to avoid most lookup operations.
In particular, with the help of inline caching (see Ungar [5]), most message sends are reduced to an indirect function call plus an additional compare instruction.
As shown below, caching reduces Smalltalk's message send speed
to roughly the speed of
(compile-time resolved) virtual function calls in C++ or indirect function calls in C.
We acknowledge the fact that a direct function call is still faster than an indirect one - therefore, the speed of C function calls, C++ non-virtual function calls (or even: inline code) can never be reached in Smalltalk (which is not completely true: read the documentation on the Self language).
However, once we abandon message sends, we are leaving the object oriented arena - even C++ fans have come to the insight that - for flexibility & reuse - most functions should be virtual functions (some must be, for proper operation to be guaranteed - which may make things a bit tricky occasionally).
The Benchmark:
Although originally written to benchmark function calling speed in Lisp systems, the tak benchmark results are also a good indicator of non-polymorphic message send speed.
The tak function (or method) calls itself recursively; except for a bit of integer arithmetic (subtracting the constant `one'), all it does is function calling / message sending.
When executed with the arguments 18, 12 and 6, it is recursively invoked 63609 times and performs 47706 subtract operations.
The following table lists some execution times.
BTW: It is interesting to see that modern systems are far better than the systems we used to work on a few years ago ... which we considered fast in those days ;-).
So, today, interpreted Smalltalk on a low-price PC is more than 3 times faster than compiled C on a much more expensive VAX 11/780.
System Language Time (ms) Notes
VAX 11/750 Franz Lisp (slowest) 47600 generic arithmetic
VAX 11/750 PSL (portable std. Lisp) 7100 generic arithmetic
VAX 11/750 Franz Lisp 3600 fixnum arithmetic
Symbolics LM2 Lisp 2905
VAX 11/750 C 2400
68k C 1900 unknown clock; probably 8Mhz
VAX 11/750 Franz Lisp (best) 1900 fixnum arithmetic (lfc)
VAX 11/750 PSL (portable std. Lisp) 1400 inum arithmetic
VAX 11/780 C 1350
VAX 11/780 Franz Lisp 1100 fixnum arithmetic (lfc)
68k Assembler 700 unknown clock; probably 8Mhz
Dorado Interlisp 526
LMI Lambda microcompiled Lisp 450
Symbolics 3600 Lisp 430
i586/100 ST/X interpreted bytecode 376 (*)
i586/100 ST/X compiled (-O) (Float args) 167 (**)
i586/100 ST/X compiled 55 (*)
Cray-1 PSL (portable std. Lisp) 44
i586/100 ST/X compiled (-O) 32 (*)
i586/100 ST/X JIT compiled bytecode 31 (*)
i586/100 ST/X compiled (-O +optContext) 30 (***)
i586/100 C (gcc2.6.3) 26.9 (*) double version
i586/100 C++ (gcc2.6.3) 26.7 (*)
i586/100 C (gcc2.6.3 -O) 23.8 (*) double version
i586/100 C++ (gcc2.6.3 -O) 21.87 (*)
i586/100 C (gcc2.6.3) 15.5 (*)
i586/100 C (gcc2.6.3 -O) 12 (*)
P3/700 ST/X JIT compiled bytecode 4.6 (*)
I5/2.4Ghz ST/X compiled (-O) 0.6 (*4)
(*) Except for the times marked as (*), the numbers are taken from:
``Performance and Evaluation of Lisp Systems'', by Richard P. Gabriel, MIT Press.
(**) You cannot do this in inum/fixnum arithmetic Lisps; you have to write a separate function (virtual function) in C/C++ in order to support double args.
(***) +optContext generates code where the full debugging information (i.e. context chain) is available as usual, but the machine's full register set is not saved with every method invocation. The consequence is that methods are not restartable in the debugger (they are abortable & interruptable, though).
(*4) Late update 2020, on a 2012 MacBook; a message send takes about 4ns.
Now that is interesting: although this little program does almost nothing else than message sending (there is a tiny bit of arithmetic), Smalltalk is about 2.5 times slower (30ms vs. 12ms) than C (on the i586).
C is almost twice as fast as C++ (which also has to perform an indirect function call), but Smalltalk is only about 30% slower than C++ (for reasons described below).
Times are a-changing: today, ST/X on a Pentium is faster than compiled Lisp on a Cray-1 used to be a few years ago :-))
[big smiley here; I don't know any details about the PSL implementation].
The ST/X compiler does a bunch of optimizations to reduce most of the overhead; the remaining difference in execution time is probably not related to message sending, but to other factors.
For example, try:
"TakBenchmark takX:18.0 Y:12.0 Z:6.0
"
The execution time of this is listed as ``Float args'' in the above list.
Notice that this heavily allocates float objects - the increased execution time is mostly due to their allocation.
The execution time of a corresponding C version (using doubles instead of ints) is listed in the table as ``double version''.
int tak(int x, int y, int z);

int main() {
    int i;

    for (i = 0; i < 1000; i++) {
        tak(18, 12, 6);
    }
    return 0;
}

int tak(int x, int y, int z) {
    if (y >= x) return z;
    return tak(tak(x-1, y, z),
               tak(y-1, z, x),
               tak(z-1, x, y));
}
(execute it with "time a.out"
and divide the resulting time by 1000)
class Tak {
public:
    Tak () { }
    ~Tak () { }
    virtual int tak (int x, int y, int z);
};

int Tak::tak (int x, int y, int z)
{
    if (y >= x) return z;
    return tak(tak(x-1, y, z),
               tak(y-1, z, x),
               tak(z-1, x, y));
}

int main() {
    Tak p;
    for (int i = 0; i < 1000; i++) {
        p.tak(18, 12, 6);
    }
    return 0;
}
(execute it with "time a.out"
and divide the resulting time by 1000)
(defun tak (x y z)
  (if (not (< y x))
      z
      (tak (tak (1- x) y z)
           (tak (1- y) z x)
           (tak (1- z) x y))))
(tak 18 12 6)
!TakBenchmark class methodsFor:'benchmarking'!

takX:x Y:y Z:z
    y < x ifFalse:[
        ^ z
    ] ifTrue:[
        ^ self
            takX:(self takX:(x - 1) Y:y Z:z)
            Y:(self takX:(y - 1) Y:z Z:x)
            Z:(self takX:(z - 1) Y:x Z:y)
    ]

    "
     Time millisecondsToRun:[
         TakBenchmark takX:18 Y:12 Z:6
     ]
    "
"On a fast machine:
(Time millisecondsToRun:[
100 timesRepeat:
[ TakBenchmark takX:18 Y:12 Z:6 ]
]) / 100.0
"
The Benchmark:
The benchmark builds up a linked list consisting of 1000 elements, and frees
them afterwards. The benchmark is executed 1000 times, for a total of 1 million
created elements. Each created element has a size of (at least) 8 bytes.
Notice that in C++ and Smalltalk, the actual size of the allocated objects is bigger: in C++, there is a need to store (at least) a reference to the virtual function table, bringing the overall memory usage to (at least) 12Mb.
Depending on the internals of the underlying malloc() / free() algorithm, additional (invisible) allocation overhead can be expected both in C and C++.
In Smalltalk, every object includes a header containing its size, class and additional information; the header's size is 12 bytes. The header includes all required information (i.e. there is no other hidden allocation overhead). Therefore, an overall of 20Mb is allocated in the ST/X run.
Notice also that no freeing of the objects is required in Smalltalk - the system does this automatically for you. During the run, several garbage collects occur in Smalltalk - the execution time includes those.
This is a very artificial benchmark, but probably the shortest and easiest way to build some dynamic structure and release it afterwards. In other words: probably the shortest way to measure the memory management system alone.
In real world systems, much more would be done - especially, the created elements would be initialized ...
Here are some execution times (all done on the same system, using the same gcc compiler):
(inline) - means: compile with -DINLINE
System Language Time (ms)
i586 ST/X interpreted 10702
i586 C++ (gcc2.6.3) 3500
i586 C++ (gcc2.6.3) virtual 3490
i586 C++ (gcc2.6.3) inline 3005
i586 C++ (gcc2.6.3 -O6) virtual 2850
i586 C++ (gcc2.6.3 -O6) 2840
i586 C++ (gcc2.6.3 -O6) inline 2490 (-O6 slower than -O; why is that?)
i586 ST/X JIT compiled bytecode 2467
i586 C++ (gcc2.6.3 -O) 2460
i586 C++ (gcc2.6.3 -O) inline 2460
i586 ST/X compiled 2224
i586 ST/X compiled (-O6) 2042
i586 C (gcc2.6.3) 1770
i586 C (gcc2.6.3 -O) 1670
i586 C (gcc2.6.3 -O6) 1620
To be fair, I have to admit that these numbers are only valid for short-term memory - memory which survives for a long time (i.e. finds its way into a higher generation of the memory hierarchy) is reclaimed more slowly.
However, in practice, most of the allocated objects are reclaimed shortly after allocation, so the above results should hold for most real world applications.
Most Smalltalk implementations allow further tuning / adjustment of the allocation policy parameters, to give the memory management hints about abnormal allocation behavior. The above numbers were taken with a default setup (the system adjusts its parameters itself).
For example, changing a single parameter in ST/X's memory policy (newSpaceSize)
can further reduce the execution time (down to 1845),
which is almost the C speed.
We do not recommend doing this (because it has effects on other parts of the system,
especially affecting real time response),
but it shows that there is much more to understanding automatic memory management systems.
On the other hand: in real world applications, mechanisms based upon malloc()/free() may suffer from their own problems which are not shown by this simple benchmark - for example, re-entrancy trouble when new() is invoked in a signal handler which interrupted another new().
TODO: measure long term memory behavior
#include <stdlib.h>

struct link {
    int value;
    struct link *next;
};

void benchalloc();

int main() {
    int i;

    for (i = 0; i < 1000; i++) {
        benchalloc();
    }
    return 0;
}

void benchalloc() {
    struct link *link, *next;
    struct link *anchor = (struct link *)0;
    int i;

    for (i = 0; i < 1000; i++) {
        link = (struct link *) malloc(sizeof(struct link));
        link->value = i;
        link->next = anchor;
        anchor = link;
    }
    for (link = anchor; link; link = next) {
        next = link->next;
        free(link);
    }
}
(execute it with "time a.out")
class Link {
int value;
Link *next;
public :
Link () {};
virtual ~Link() {};
#ifdef INLINE
inline void setNext(Link *l) { next = l; };
inline Link *getNext() { return next; };
inline void setValue(int v) { value = v; };
inline int getValue() { return value; };
#else
# ifdef VIRTUAL
virtual void setNext(Link *);
virtual Link *getNext();
virtual void setValue(int);
virtual int getValue();
# else
void setNext(Link *);
Link *getNext();
void setValue(int);
int getValue();
# endif
#endif
};
#ifndef INLINE
void Link::setValue(int v)
{
value = v;
}
int Link::getValue ()
{
return value;
}
void Link::setNext(Link *l)
{
next = l;
}
Link *Link::getNext()
{
return next;
}
#endif /* not INLINE */
void
benchAlloc() {
Link *link, *next;
Link *anchor = (Link *)0;
for (int i=0; i<1000; i++) {
link = new Link;
link->setValue(i);
link->setNext(anchor);
anchor = link;
}
for (link = anchor; link; link = next) {
next = link->getNext();
delete link;
}
}
int main() {
    for (int i = 0; i < 1000; i++) {
        benchAlloc();
    }
    return 0;
}
(execute it with "time a.out")
!LinkedListBenchmark class methodsFor:'benchmarking'!

buildList
    |anchor link|

    anchor := nil.
    1 to:1000 do:[:i |
        link := ValueLink basicNew value:i.
        link nextLink:anchor.
        anchor := link.
    ]

    "
     Time millisecondsToRun:[
         1000 timesRepeat:[LinkedListBenchmark buildList]
     ]
    "
Notice that in Smalltalk, the 1-to-1 translated, hand-coded initialization code is not what an experienced Smalltalker would use; alternative versions are shown and measured as well.
The code is:
Java:
long t1 = System.currentTimeMillis();
for (int j = 0; j < 10; ++j) {
    List<Integer> list = new ArrayList<Integer>();
    for (int i = 0; i < 1000000; ++i) {
        list.add(i);
    }
}
long t2 = System.currentTimeMillis();
System.out.println("Time taken: " + (t2 - t1) / 10.0);
Java preallocated ArrayList:
...
List<Integer> list = new ArrayList<Integer>(1000000);
for(int i = 0; i < 1000000; ++i) {
list.add(i);
}
...
Smalltalk V1 (hand-coded, naive loop):
Transcript showCR:(
    (Time millisecondsToRun:[
        |c|

        10 timesRepeat:[
            c := OrderedCollection new.
            1 to:1000000 do:[:i | c add:i].
        ].
    ]) / 10)
Smalltalk preallocated V1 (hand-coded, naive loop):
...
|c|
10 timesRepeat:[
c := OrderedCollection new:1000000.
1 to:1000000 do:[:i | c add:i].
].
...
Smalltalk V2 (better code):
...
|c|
10 timesRepeat:[
(1 to:1000000) asOrderedCollection
].
...
Timings:
Java | 139 |
Java preallocated | 123 |
Smalltalk v1 | 177 |
Smalltalk v1 preallocated | 171 |
Smalltalk v2 | 60 |
To get a feeling of the amount of time spent in string handling vs. hashtable handling, the code is written both for string and integer keys.
The code is:
Java 1mio string values (variant 1):
Hashtable<String, Integer> h = new Hashtable<String, Integer>();
for(Integer i = 0; i < 1000000; ++i) {
h.put(i.toString(), i);
}
Java 1mio string values (variant 2):
Hashtable<String, Integer> h = new Hashtable<String, Integer>();
for(int i = 0; i < 1000000; ++i) {
h.put(Integer.toString(i), i);
}
Java 10k string values:
Hashtable<String, Integer> h = new Hashtable<String, Integer>();
for(int i = 0; i < 1000000; ++i) {
h.put(Integer.toString(i % 10000), i);
}
Java 1mio integer values:
Hashtable<Integer, Integer> h = new Hashtable<Integer, Integer>();
for(Integer i = 0; i < 1000000; ++i) {
h.put(i, i);
}
Java 10k integer values:
Hashtable<Integer, Integer> h = new Hashtable<Integer, Integer>();
for(int i = 0; i < 1000000; ++i) {
h.put(new Integer(i % 10000), i);
}
Smalltalk 1mio string values:
h := Dictionary new.
1 to:1000000 do:[:i | h at:i printString put:i].
A variant is to use a bulk-initializing method, for example:
h := Dictionary withKeys:(1 to:1000000) andValues:(1 to:1000000).
Smalltalk 10k string values:
h := Dictionary new.
1 to:1000000 do:[:i | h at:(i \\ 10000) printString put:i].
Timings:
Java 1mio string variant1 | 1413 |
Java 1mio string variant2 | 1481 |
Java 10k string | 132 |
Java 1mio integer | 412 |
Java 10k integer | 47 |
Smalltalk 1mio string | 3788 |
Smalltalk 10k string | 619 |
Smalltalk 1mio integer | 926 |
Smalltalk 1mio variant | 568 |
Smalltalk 10k integer | 364 |
However, this is not really surprising, considering the amount of work (man-years) which went into the Java HotSpot compiler, as compared to the amount which went into ST/X (a one man show).
There are good reasons for doing this - the old systems did it because there were no fancy graphic controllers, and pixel graphics was a recent invention.
The modern Smalltalk implementation (i.e.: Squeak) does this for flexibility - who else is able to let a 3D animated bunny hover above a browser window, rotating in the z-dimension around the x-axis?
Smalltalks for commercial use (as opposed to academic testbeds) typically interface to the graphics system via higher level APIs - typically X11 or WinAPI.
The overhead introduced by Smalltalk is marginal, and the graphical display of text, lines or any other graphics primitive which is directly supported by the API is almost as fast as in programs implemented in other languages.
Copyright © 1995 Claus Gittinger, all rights reserved
<cg at exept.de>