				640K ought to be enough for anybody.

				(Bill Gates 1981)

Garbage Collection in General and in Smalltalk


This document is meant for people who are not too familiar with the concept of garbage collection (GC), and to prevent too much discussion with people who think that garbage collection is a slow, useless feature which is not needed (i.e. radical C/C++ fans :-)

I have collected some arguments (not my own) from the literature and from discussions - both on the net and in private. I am not naive enough to believe that this religious war will end soon, but this text may at least give some common ground on which to base discussions.

Also, it may prevent useless questions, flames etc. being sent to me in the future.

For those of you who already know Smalltalk and GC, it may not contain any new information - just skip it & forget about it.

Some of the stuff below may sound a bit drastic, cynical or whatever - think of a big smiley :-) after every such statement ...

Note: this paper was written in 1995, when 32Mb was considered a huge amount of memory, and a Pentium was considered a fast machine... Today, some of the numbers given below are outdated, but the central message of the paper is still valid.

Update note: also notice that now, 20 years later, most of those "specialists" who were against garbage collection in the 90s are happy Java users, and admit that they would not want to live without GC any more. Times are a-changing! (and it shows that the Lisp and Smalltalk fans were simply 20 years ahead of the crowd.)

Introduction: what is GC

Almost every programming system offers the programmer some functionality to dynamically allocate storage in amounts not foreseen at coding time; for example, Pascal offers a "new" operation, the C libraries provide a family of "malloc" functions, C++ provides "new", and Lisp its CONS, to name just a few.
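In Smalltalk, the same dynamic allocation happens through messages like new and new: - for comparison, a trivial workspace snippet:

    |vec list|

    vec := Array new: 1000.           "allocate a 1000-element array"
    list := OrderedCollection new.    "allocate a growable collection"
    list add: vec.                    "the array is now referenced by the collection"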

Except in the rare case that those objects (not in the OO sense) are used and needed for the whole running time of the program, sooner or later the storage used by them must be freed and given back to some management facility - otherwise the program would keep growing, eating up tons of unused storage space.

Although this is a bit of a simplification, programming systems can be roughly divided into two classes:

	those in which the programmer must explicitly free every dynamically allocated object (C, Pascal and traditionally C++), and

	those in which a garbage collector automatically reclaims objects that are no longer reachable (Lisp, Smalltalk and nowadays Java, among others).

What are the Arguments of GC Opponents

GC is too slow and not required by "good programmers"

What are the Dangers of NOT doing GC

Simply put:
many believe that, in a modest-sized programming project, it is almost impossible to tell for sure when an object is no longer needed.

Some even state that it is impossible to create a large system which does not have memory leaks (without GC).

(large meaning something on which many people work for years, creating code measured in megabytes)

Therefore (in practice) one of three situations arises:

	an object is freed too early, while other parts of the program still hold references to it (dangling references),

	an object is freed more than once, typically corrupting the memory management's internal data structures, or

	an object is never freed at all - the program leaks memory and keeps growing.

My personal experience supports these arguments - I have heard of (and seen) systems where more than 100 people created a program which was almost impossible to debug, and where reference errors were almost impossible to find in the end. Typically, such development groups finally throw away big parts of the code, invent some kind of GC, and restrict the use of the language by forbidding everything which originally made the programming language "so efficient". This means: overloading all assignment, pointer and whatever operators, wrapping pointers into envelope objects, making each pointer reference a virtual function call - and eventually slowing down execution to that of interpreted BASIC :-)

To make it clear:

I REALLY do NOT think (and do NOT want to give the impression) that all these programmers are incapable of good programming - every one of us has made (and will make) many errors of this kind - even the very best gurus make them !!!
It is simply the complexity of big systems (especially when created by a big group) which makes these errors appear again and again.

When separate subsystems (libraries, especially binary libraries) are involved, things tend to become even harder, since it is often unclear (from the specification and documentation) who is responsible for freeing objects.

((Just think of a container class which keeps references to some other objects; the question is whether the container should free the referred-to objects when it is itself freed, or leave this to the user of the container. If it is left to the user, how does he know that other parts of the program do not still hold references to those objects? And how does the container know this when freeing them?
One typical 'solution' is to not keep references in those containers, but to keep copies of the objects instead. This may solve the above problems, but adds additional overhead, since objects must now be copied into and out of containers everywhere. Also, the copies in the containers have to be kept in sync with any elements which were copied out of them, which may be hard to do in some cases.
Another 'solution' is to add reference counting or other ad-hoc GC mechanisms. But that's exactly what we are talking about here. BTW: as we will see below, reference counting is not the best GC algorithm, and it is certainly insufficient in many environments.))
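One classical weakness of reference counting is that it cannot reclaim cyclic structures. The following workspace snippet (plain Smalltalk, no special API assumed) builds such a cycle:

    |a b|

    a := OrderedCollection new.
    b := OrderedCollection new.
    a add: b.                   "a references b"
    b add: a.                   "b references a - a cycle"
    a := nil.
    b := nil.
    "both collections are now unreachable from any variable, but each
     still references the other: their reference counts can never drop
     to zero. A pure reference-counting collector would leak them; a
     tracing collector (mark & sweep, scavenging) reclaims them."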

How can GC be Implemented

There are many strategies for doing GC, which will not be covered here in much detail - there is a lot of literature (*) available in this area.

The most well known and most often used strategies are:

	reference counting - every object holds a counter of the references to it; when the counter drops to zero, the object is freed,

	mark & sweep - starting from the roots, all reachable objects are marked; a following sweep over the heap frees everything left unmarked,

	copying collectors - reachable objects are copied into a fresh semispace (e.g. the Baker algorithm), implicitly leaving the garbage behind, and

	generation scavenging - a copying scheme which concentrates on young objects, exploiting the observation that most objects die young.

Variations and mixed algorithms are also possible - for example, a combination of the mark & sweep and the generational collector can be designed which avoids the sweep phase, at the price of some additional pointer slots in every object.
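To illustrate the basic mark & sweep idea, here is a toy workspace sketch; it simulates heap objects as Dictionaries with #marked and #children slots (names invented for this illustration - real collectors work on the VM's object representation, not on Smalltalk-level objects):

    |heap roots markBlock|

    "simulate a tiny heap: each 'object' gets a mark flag and child references"
    heap := (1 to: 5) collect: [:i |
        Dictionary new
            at: #marked put: false;
            at: #children put: OrderedCollection new;
            yourself].
    ((heap at: 1) at: #children) add: (heap at: 2).   "object 1 references object 2"
    ((heap at: 2) at: #children) add: (heap at: 3).   "object 2 references object 3"
    roots := Array with: (heap at: 1).                "objects 4 and 5 are unreachable"

    "mark phase: transitively mark everything reachable from the roots"
    markBlock := [:obj |
        (obj at: #marked) ifFalse: [
            obj at: #marked put: true.
            (obj at: #children) do: [:child | markBlock value: child]]].
    roots do: [:r | markBlock value: r].

    "sweep phase: everything left unmarked is garbage"
    heap := heap select: [:obj | obj at: #marked].    "objects 1, 2 and 3 survive"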

Literature:
For more information on GC, read ``Garbage Collection Techniques'' by Paul R. Wilson, in ``ACM Computing Surveys''; it can also be found in the proceedings of the ``1992 International Workshop on Memory Management'' (Springer Lecture Notes in Computer Science).

Note from the future: this paper was written in the mid-90s - a bunch of new things have come into existence since then - most noteworthy is the train algorithm, which allows for efficient garbage collection of the old space.

Pros & Cons of the Techniques

What GC Enemies Suggest

Interestingly, most GC opponents suggest a device called a "smart pointer" (or some similar buzzword), which is usually nothing other than a reference-counting garbage collector!! As we have seen above, this is NOT the best solution to garbage collection (*).

Others use tools which (more or less successfully) find invalid references to already-freed objects, multiple freeing of the same object, and so on. To my knowledge, none of these is perfect or can guarantee that all such situations are detected:

	analysis of the source code is (theoretically) impossible;

	analysis of the executing process requires that every possible flow of control through the program be executed (at least once) in every possible combination to detect all possible memory bugs. This is practically impossible - it could require thousands of years to run through all combinations, even in small programs.

Also - strictly speaking, these check runs have to be repeated for every change to the program, however small that change was.

I doubt that these tools are very useful for modest-sized applications involving a team of many programmers. (not talking about toy programs here :-)

There is finally a very radical, puristic group of GC enemies which simply suggests the following:

"if you cannot get your memory organized, give up programming - since you are simply too dumb"

(this is no joke; this statement really appeared in the comp.lang.c++ newsgroup) Big smiley here :-)

(*) Notes:

There are other GCs possible; a newer and relatively popular algorithm is the so-called conservative GC, which scans memory for things which "look like pointers" to track reachable objects.
Since in C-like languages objects may not be moved around, these collectors typically use a variant of the mark & sweep mechanism; however, faster schemes are possible and occasionally implemented (treadmill, non-copying generation scavenging etc.)

What Type of GC do Smalltalks Use

Traditionally, the early ST-80 versions used a reference-counting GC (I guess up to V2.x); the newer ST-80 (OWST) versions use a combination of generation scavenging for new objects and an incremental mark & sweep, augmented by a compressor, for the old objects. I am walking on sandy ground here - not knowing for sure; this information is based on what others told me...

I do not know what ST/V and Enfin use.

ST/X uses a modified generation scavenging scheme for new objects (with adaptive aging) and three different algorithms to reclaim old objects: a Baker-style copying compressor, an in-place compressor and a stopping mark & sweep; all are started on demand.
There is also an incremental mark & sweep running over the old objects at idle times or (optionally) as a low-priority background process.
The implementation is prepared for (and it is planned) to add a third, intermediate generation, and to replace the semispace Baker algorithm by some in-place compressor. This compressor will be somewhat slower, but will reduce the virtual memory requirements.

Oldspace collections happen very infrequently - if no special memory needs arise (as when doing image processing/viewing), the system may run forever without one, especially with the incremental background collector running.

For long-running server applications, it may be a good idea to explicitly force a compressing collect at times when the system is known to be idle or not heavily used. For example, a watcher thread may trigger a GC every Sunday night at 2 o'clock in the morning.
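A minimal sketch of such a watcher process - assuming that ObjectMemory garbageCollect forces a full (compressing) collect as in ST/X, and that Processor userBackgroundPriority exists as in most ST-80 descendants; the weekly Delay loop stands in for real calendar-based scheduling:

    [
        [true] whileTrue: [
            (Delay forSeconds: 7 * 24 * 60 * 60) wait.   "sleep for one week"
            ObjectMemory garbageCollect                  "force a full (compressing) collect"
        ]
    ] forkAt: Processor userBackgroundPriority.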

Worst Case Situations for Generational GC's and ST/X's GC in particular

For a fair discussion, some worst-case situations for generational GCs have to be discussed in this paper. In ST/X (and probably other systems using similar algorithms), there are two situations in which the GC gets into trouble, possibly leading to longer-than-average pause times and/or higher GC overhead. The following discusses these cases and describes how ST/X tries to avoid them:

Conclusion

There is no real argument against GC, except for some small percentage of added runtime overhead (true realtime applications will be discussed in another paper). With today's high-performance computers, this is an adequate price to pay for the added security and - not to forget - for the time savings of the programmer, who would otherwise spend a lot of his/her time in the debugger instead of doing productive work.

Of course, there are always special situations in which one algorithm performs better than others - memory allocation patterns vary over different applications. Generation scavenging provides the best average performance over different allocation patterns.

Or, to summarize (and estimate/simplify):

if a programmer spends 80% of his/her time debugging and testing code (which is realistic), and 50% of all errors are dangling pointers, bad frees or other GC-avoidable bugs (which is optimistic), then getting rid of those bugs rewards you with a development cost saving of 0.8 * 0.5 = 40%!
Those savings will pretty soon amortize the cost of the GC's runtime overhead - which is roughly equivalent to needing a 5% to 10% faster CPU.

The exact percentages may vary and may not be correct in every situation, but you see the point, don't you?

Evaluation Code

The following appendix provides some code examples and discusses the results of various source-level "optimizations".
Some of them are the result of internal discussions with friends and opponents of GC.

Example 1:
First we demonstrate a straightforward algorithm to compute the N'th row of the Pascal triangle; this is actually a functional problem, but the implementation given here allocates a lot of memory and is therefore a good example for memory management issues.

The Pascal triangle looks like:

		1
	      1   1
	    1   2   1
	  1   3   3   1
	1   4   6   4   1
	       ...
The N'th row can be computed recursively by first computing the (N-1)'st row, then adding up consecutive pairs of its elements, and prepending and appending a 1 at the front and at the end.
Here is a sample implementation (as a method on Integer):
    pascal
	"return the n'th row of the pascal triangle"

	|pNMinus1 shL shR|

	self == 1 ifTrue:[ ^ #( 1 ) ].

	pNMinus1 := (self - 1) pascal.

	shL := pNMinus1 , #(0).           "/ create:      1 n n n n n n n n 0
	shR := #(0) , pNMinus1.           "/ create:      0 n n n n n n n n 1
	^ shL with:shR
	    collect:[:a :b | a + b].      "/ add them up  1 m m m m m m m m 1
The code above takes the (N-1)'st row and creates two temporary rows (shL and shR) which have a 0 appended and prepended, respectively. Then these two are added column-wise, and the results are collected into the final result.
The recursion ends, of course, with the first row, which is #(1) by definition.

How much memory is allocated along the way?
For the N'th row, 3 new arrays of roughly size N are allocated; that means that the overall memory requirement to calculate pascal(N) is (1+2+3+...+N) * 3, which is roughly N*N*(3/2) cells.
To compute pascal(10), 150 cells are allocated - roughly 600 bytes.
To compute pascal(1000), 1.5 million cells are allocated - roughly 6Mb.

One might tend to think that memory allocation is responsible for the execution time, and optimize the code to get rid of (some of) the allocations.
Here is a slightly rewritten version, which allocates two new arrays per invocation (the comma operation also allocates a new array):

    pascal2
	"return the n'th row of the pascal triangle"

	|pNMinus1 prev|

	self == 1 ifTrue:[ ^ #( 1 ) ].

	pNMinus1 := (self - 1) pascal2.

	"/ add it up manually
	prev := 0.
	^ (pNMinus1 collect:[:el| |sum| sum := prev+el. prev := el. sum]) , #(1) .
This allocates (1+2+3+...+N) * 2 cells, which is roughly N*N;
i.e. to compute pascal(1000), 1.0 million cells are allocated - roughly 4Mb.

We can get rid of the extra allocation due to the comma concatenation by avoiding the collect and doing things manually (things become more and more obfuscated, though).
Here is a version which only allocates one new array per invocation:

pascal3
    "return the n'th row of the pascal triangle"

    |pNMinus1 newRow prev|

    self == 1 ifTrue:[ ^ #( 1 ) ].

    pNMinus1 := (self - 1) pascal3.
    newRow := Array new:(self + 1).

    "/ add it up manually
    prev := 0.
    1 to:self-1 do:[:index |
	|el sum|

	el := pNMinus1 at:index.
	sum := prev + el.
	newRow at:index put:sum.
	prev := el
    ].
    newRow at:self put:1.
    ^ newRow.
This allocates (1+2+3+...+N) cells, which is roughly N*N/2;
i.e. to compute pascal(1000), 0.5 million cells are allocated - roughly 2Mb.

Can we write a version which does not allocate any memory per invocation at all?
Yes, we can. The following code preallocates a single array and computes the result into it (recursive invocations use the same store for all the previous results):

pascalHelperInto:anArray
    "write the n'th row of the pascal triangle into anArray.
     Destructive version (overwrites elements in the array)"

    |prev|

    self == 1 ifTrue:[
	anArray at:1 put:1.
	^ anArray.
    ].

    (self - 1) pascalHelperInto:anArray.

    prev := 0.
    1 to:(self - 1) do:[:idx |
	|el sum|

	el := anArray at:idx.
	sum := prev+el.
	prev := el.
	anArray at:idx put:sum.
    ].
    anArray at:self put:1.
    ^ anArray

pascal4
    "return the n'th row of the pascal triangle"

    ^ self pascalHelperInto:(Array new:self)
This last example allocates only the N cells of the result array.
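As a quick sanity check (not part of the original measurements), all four variants should agree on their results:

    (10 pascal = 10 pascal2)
        & (10 pascal = 10 pascal3)
        & (10 pascal = 10 pascal4)     "-> true"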

Execution times on an 800MHz Celeron notebook:

    program     time in ms (5 runs)
    pascal      1197 1188 1181 1180 1181
    pascal2     1176 1189 1199 1177 1186
    pascal3     1276 1253 1220 1223 1222
    pascal4      826  822  847  833  945
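Numbers like these can be reproduced with the standard millisecondsToRun: facility; the row argument 1000 is an assumption here - the exact original benchmark setup is not documented:

    Time millisecondsToRun: [ 1000 pascal ].
    Time millisecondsToRun: [ 1000 pascal4 ].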

Summary:

The above execution times are very interesting; our "optimized" versions pascal2 and pascal3 are not at all faster - pascal3 is actually slower - due to the fact that the comma and collect operations are apparently highly optimized (even the author was surprised, having expected a few percent of speedup).
The only real speedup comes with pascal4, which does no allocation per invocation at all - but at the price of a much less intuitive program.
This means that all of the memory allocations and garbage collections together accounted for only roughly 20-30% of the execution time in this particular example.

It is left to the reader to decide if this is worth the trouble of writing complicated code like pascal4.

It is also left to the reader to reimplement this example in C, C++, Java or any other system and compare the results.
But please: with correct results! As a quick check, the sum of the 1000th row's elements is

   (1000 pascal) inject:0 into:[:acc :el | acc + el]
which is:
    535754303593133660474212524530000905280702405852766803721875194
    185175525562468061246599189407847929063797336458776573412593572
    642846157021799228878734928740196728388741211549271053730253118
    557093897709107652323749179097063369938377958277197303853145728
    5598238843271083830214915826312193418602834034688
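
Since the row elements are binomial coefficients, the N'th row must sum to 2^(N-1) - which gives a convenient one-line self-test:

    (1000 pascal inject: 0 into: [:acc :el | acc + el]) = (2 raisedTo: 999)
        "-> true"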

And - also look at how long it takes you to write, test and measure the examples. If it takes more than half an hour, you might be using the wrong programming language ;-)


Copyright © 1995 Claus Gittinger, all rights reserved

<cg at exept.de>

Doc $Revision: 1.29 $ $Date: 2021/03/13 18:24:51 $