640K ought to be enough for anybody. (Bill Gates 1981)
I have collected some arguments (not my own) from the literature and from discussions - both on the net and private ones. I am not naive enough to believe that this religious war will end soon, but this may at least give some common ground on which to base discussions.
It should also prevent useless questions/flames etc. being sent to me in the future.
For those of you who know Smalltalk/GC, it may not give you any new information - just skip & forget about it.
Some of the stuff below may sound a bit drastic, cynical or whatever - think of a big smiley :-) after every such statement ...
Note: this paper was written in 1995, when 32Mb was considered a huge amount of memory, and a Pentium was considered a fast machine... Today, some of the numbers given below are outdated, but the central message of the paper is still valid.
Update note: also notice that now, 20 years later, most of those "specialists" who were against garbage collection in the 90s are now happy Java users, and admit that they would not want to live without GC any more. Times are a' changing! (And it shows that the Lisp and Smalltalk fans were simply 20 years ahead of the crowd.)
Except in the rare cases when those objects (not in the OO-sense) are used and needed for the whole running time of the program, sooner or later the storage used by them must be freed and given back to some management facility - otherwise your program would continue to grow, eating up tons of unused storage space.
Although this is a bit of a simplification, programming systems can be roughly divided into two classes:
Many believe that it is almost impossible, in a reasonably large programming project, to tell for sure when an object is no longer needed (large being something where many people work for years, creating code megabytes in size). Some even state that it is impossible to create a large system which does not have memory leaks (without GC).
Therefore (in practice) one of three situations arises:
- this results in so-called "memory leaks" and leads to ever growing storage needs.
Take the X-window server or popular internet browsers (Firefox or IE) as examples, which - even after years of development - are still known to contain some of these leaks. Also, many windowing toolkits (such as the Motif library) suffer from this problem.
"Garbage collection" is typically done by terminating the program and restarting from scratch - this suggestion can even be found in a vendor's user manual (this is not a joke).
- this may result in a variety of errors, ranging from going totally unnoticed (if the data did not get overwritten in the meantime), over invalid data, up to total crashes of the program. It is also possible that a program runs pretty well on one machine but crashes badly on another, due to different reallocation strategies in the memory allocator. Or that a simple change in the program makes a perfectly running program suddenly crash, as a consequence of some storage addresses changing.
To make it clear:
I REALLY do NOT think (and do NOT want to give the impression) that all these programmers are incapable of good programming - every one of us has (and had) many errors of this kind - even the very best gurus make those errors !!! It is simply the complexity of big systems (especially when created by a big group) which makes these errors appear again and again.
When separate subsystems (libraries, especially: binary libraries) are involved, things tend to become even harder, since it is often unclear (from the specification and documentation) who is responsible for freeing objects.
((Just think of a container class which keeps references to some other objects; the question is whether the container should free its referred-to objects when it is freed itself, or leave this to the user of the container. If left to the user, how does he know whether other parts of the program still hold references to those objects? And how would the container know this, if it did the freeing?
One typical 'solution' is to not keep references in those containers, but to keep copies of the objects there. This may solve the above problems, but adds additional overhead, since objects must now be copied into and out of containers everywhere. Also, the copies in the containers have to be kept in sync with any elements which were copied out of the containers, which may be hard to do in some cases.
Another 'solution' is to add reference counting or other ad-hoc GC mechanisms. But that's exactly what we are talking about here. BTW: as we will see below, reference counting is not the best GC algorithm, and is certainly insufficient in many environments.))
The most well known and most often used strategies are:
Then a sweep is done over the whole memory, looking for unmarked storage areas - these are put back onto some free list or equivalent.
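To make the two phases (mark, then sweep) concrete, here is a small, self-contained toy model written as a Smalltalk workspace snippet. It is an illustration only - a real collector lives inside the VM and walks real object memory; here, "objects" are simply Dictionaries whose #refs slot lists the objects they point to, and the roots are given explicitly. Note that the unreachable cycle c <-> d is reclaimed - something reference counting (discussed below) cannot do.

    | allObjects a b c d roots marked mark freeList |

    "build a toy heap of four objects"
    allObjects := OrderedCollection new.
    a := Dictionary new. a at:#refs put:OrderedCollection new. allObjects add:a.
    b := Dictionary new. b at:#refs put:OrderedCollection new. allObjects add:b.
    c := Dictionary new. c at:#refs put:OrderedCollection new. allObjects add:c.
    d := Dictionary new. d at:#refs put:OrderedCollection new. allObjects add:d.
    (a at:#refs) add:b.                            "a -> b"
    (c at:#refs) add:d. (d at:#refs) add:c.        "c <-> d : a cycle, but unreachable"
    roots := OrderedCollection with:a.             "only a is referenced from the roots"

    "mark phase: recursively mark everything reachable from the roots"
    marked := IdentitySet new.
    mark := [:obj |
        (marked includes:obj) ifFalse:[
            marked add:obj.
            (obj at:#refs) do:[:each | mark value:each]]].
    roots do:[:r | mark value:r].

    "sweep phase: everything unmarked goes back onto the free list"
    freeList := allObjects reject:[:obj | marked includes:obj].
    Transcript showCR:'reclaimed ', freeList size printString, ' objects'.    "-> 2 (c and d)"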
Literature:
For more information on GC, read ``Garbage Collection Techniques''
by Paul R. Wilson, in ``ACM Computing Surveys'';
also found in the proceedings of the
``1992 International Workshop on Memory Management'',
Springer Lecture Notes in Computer Science.
Note from the future: This paper was written in 1994 - a bunch of new things have come into existence since then; most noteworthy is the train algorithm, which allows for efficient garbage collection of the old space.
Most of the early GC implementations used the mark & sweep collection scheme, which (I think) is responsible for many people's opposition to GC. The basic mark & sweep algorithm leads to a pause, making interaction with the program impossible for a while. This disruptive pause becomes extreme if paging is involved in virtual memory systems (since all of the virtual memory must be processed during the sweep phase, many page faults may occur in systems where the real memory is smaller than the virtual memory in use).
Considering early Lisp systems (timesharing 20 users on a DEC20, running a megabyte lisp program in 128-256 k real memory) you can imagine how long these pauses could become. There are many references found in the literature - describing pause times of seconds and even minutes.
The cpu-time overhead created by mark & sweep (ignoring paging) is in the range of 5% to 10%. If the time was not spent in one big pause, mark & sweep would not be too bad (especially considering modern workstations, where 32Mb of memory are standard so paging is rare, and a cpu-overhead of some % can be tolerated).
Mark & sweep does not compress the memory - this means that even though enough memory is available, it may not be usable for an allocation due to fragmentation. However, variations of the basic algorithm exist, which compress memory during the sweep phase or which add a third phase for compression.
At first glance, reference counting seems to overcome those problems - distributing the GC overhead over the execution time of the program. However, reference counting too creates some overhead:
Measurements show that the counting alone may create a cpu-time overhead of some 15% (remember: counter updates are required even when passing arguments to functions), and another 5% for recursive freeing. These are average numbers, as found in the literature. The actual overhead depends heavily on the application and can vary from 0 to almost 50%.
The really bad news with reference counting is that it is not sufficient. Self-referencing objects and objects containing reference cycles cannot be freed by a reference counting GC. Thus there is still a need for another GC technique to get rid of those (usually: mark & sweep). Alternatively, cycles could be broken manually by the programmer - but that again leads to possible omissions and therefore hard to find errors. Actually, asking the programmer to manually break those cycles is just as error prone as asking him to manually free objects.
[[ At first, some might argue that cycles are relatively rare, but this is NOT true ! Just consider the data structure for a window on the screen, containing references to its subviews, which contain a reference to their parent view. Voila: a cycle. If modern programming techniques, AI, multiple lightweight processes etc. are involved, even more indirect cycles can be expected - especially in a language such as Smalltalk, which presents even stack frames, processes and executable code as objects ... ]]
[[ Literature notes that reference counting GCs in early Smalltalk systems led to some very obscure errors. These occurred whenever a programmer forgot to break up the MVC-cycle in their view closing code. These cycles had to be broken 'manually', otherwise the counts would never have dropped to zero. ]]
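To make this concrete, here is a toy workspace snippet which mimics what a pure reference counting collector would see when such a view cycle is built. The Dictionaries and the #refCount / #subViews / #superView slots are stand-ins for illustration only; no real ST/X view classes are involved.

    | topView subView |

    topView := Dictionary new.  topView at:#refCount put:1.    "referenced from the window system"
    subView := Dictionary new.  subView at:#refCount put:0.

    topView at:#subViews put:(OrderedCollection with:subView).
    subView at:#refCount put:(subView at:#refCount) + 1.       "topView -> subView"
    subView at:#superView put:topView.
    topView at:#refCount put:(topView at:#refCount) + 1.       "subView -> topView : the cycle"

    "closing the window drops the outside reference ..."
    topView at:#refCount put:(topView at:#refCount) - 1.

    "... but both counts are still 1, so a pure reference counting collector would
     never reclaim either object. The cycle has to be broken by hand, for example:"
    subView removeKey:#superView.
    topView at:#refCount put:(topView at:#refCount) - 1.
    "only now does topView's count drop to 0 (and freeing it would then
     recursively drop subView's count to 0 as well)"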
Copying collectors are among the fastest available GC techniques; especially generation scavenging (which uses a hierarchy of semispaces) is, to my knowledge, to date unbeaten in its low average cpu-time overhead.
In generation scavenging, all memory is divided into separate areas for generations of objects - all objects are first created in a relatively small area called newSpace, Eden or similar. When this area becomes full, a scavenge operation is performed on this area only, which can be done relatively fast. Measurements showed that usually most of these new objects are no longer referenced; since only the living objects are investigated and copied, this leads to short pause times for this space (20ms to 50ms), which can easily be hidden between two key strokes or a file-access operation in timesharing systems (as in UNIX systems, where a sync-operation may produce comparable delays from time to time).
To avoid copying objects around forever, objects which survived these scavenges for some time will be copied to another area (generation) called oldSpace. This action is called tenuring (aging).
Systems may have two or more of these spaces.
Once an object has been moved into a higher generation, it will no longer create any copying overhead in the younger generation. Of course, these generations may fill up too, but this happens much less often. Some specialized collector is still required for the oldest generation; typically, a background collector is used to find unused old objects. (Since this space fills up much more slowly, chances are good that the background collector can keep up with the allocation rate.)
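The following workspace snippet is a (very) simplified model of this scheme: objects are allocated in a small newSpace, a "scavenge" keeps only the root-referenced ones, and survivors which have lived through enough scavenges are tenured into the oldSpace. Reachability is reduced here to direct root membership, and nothing is really copied - a real scavenger traces references transitively and copies the survivors - but the aging/tenuring mechanics are the same in spirit; the #age slot and the numbers (250-object newSpace, tenure age 3) are made up for the illustration.

    | newSpace oldSpace roots tenureAge scavenge |

    newSpace := OrderedCollection new.
    oldSpace := OrderedCollection new.
    roots := IdentitySet new.
    tenureAge := 3.

    "scavenge: keep only the (root-referenced) live newSpace objects, aging them;
     survivors which have seen enough scavenges are tenured into the oldSpace;
     everything else in the newSpace is garbage and the whole area is reused at once"
    scavenge := [
        |survivors|
        survivors := newSpace select:[:o | roots includes:o].
        newSpace := OrderedCollection new.
        survivors do:[:o |
            o at:#age put:(o at:#age) + 1.
            (o at:#age) >= tenureAge
                ifTrue:[oldSpace add:o]
                ifFalse:[newSpace add:o]]].

    "allocate 1000 objects; only every 100th stays referenced from the roots"
    1 to:1000 do:[:i |
        |obj|
        obj := Dictionary new. obj at:#age put:0.
        newSpace add:obj.
        (i \\ 100) = 0 ifTrue:[roots add:obj].
        newSpace size >= 250 ifTrue:[scavenge value]].   "the newSpace is 'full'"
    scavenge value.

    Transcript showCR:'surviving: ',
        (newSpace size + oldSpace size) printString, ' of 1000 allocated objects'.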
Measurements show, that generation scavenging produces an overhead of some 3% to 5% (read the ``worst case situations'' discussion below).
Copying collectors offer the additional benefit that cyclic references are handled correctly, AND that the storage space is always compacted after a GC.
Interestingly, pause times are shorter if lots of garbage is reclaimed, and longer if many objects survive.
The disadvantage of the copying techniques is that they require an additional amount of (unusable) memory, ranging from 50% for the basic Baker algorithm, down to some 20% for Ungar's generation scavenging. This memory needs to be virtual only - it is only needed physically while switching semispaces. During program execution, the other semispace can be paged out.
Of course, the memory could be divided into areas which are collected with different algorithms; for example, it is possible to use a copying collector for young objects, and move objects into another area collected by mark & sweep or reference counting.
It is also possible to use different algorithms for collecting depending on the state of the memory. For example, ST/X's memory management may choose between a mark & sweep or a compressing collect, depending on how fragmented the memory is at collect time.
Finally, it makes sense to base the choice of algorithm on the available amount of real memory. Copying collectors may show poor performance if memory is tight and paging occurs during the collect. In such a situation, it may be preferable to use a mark and sweep (or a shifting mark & sweep, which compacts memory in a third phase).
Both mark & sweep and copying algorithms can be made incremental and run in the background, at idle times or whenever another object gets allocated. There are even algorithms for multi-CPU systems, where background collection is done by a separate processor. These incremental algorithms are useful as an additional strategy, to use the cpu while nothing else can be done - for example, in an interactive application, even with a fast typist, there is a lot of cpu time available even between two key strokes (some 100ms), which is enough time to do a scavenge operation or incrementally mark and sweep some objects.
Incremental algorithms are typically somewhat slower (since additional synchronization and locking is required) than their non-incremental versions. However, since they only run in the background, not disturbing the active foreground process, this does not hurt.
Others use tools which (more or less successfully) find invalid references to already freed objects, multiple freeing of the same object and so on. To my knowledge, none of these is perfect or can guarantee that all such situations are found:
- analysis of the source code alone is (theoretically) impossible,
- analysis of the executing process requires that every possible flow of control through the program be executed (at least once) in every possible combination, to detect all possible memory bugs. This is practically impossible - it may require thousands of years to run through all possible combinations, even in small programs.
Also - strictly speaking, these check-runs have to be repeated for every change in the program, however small that change was.
I doubt that these are very useful for even modest-sized applications involving a team of many programmers (not talking about toy programs here :-).
There is finally a very radical, puristic group of GC enemies which simply suggests the following:
"if you cannot get your memory organized, give up programming - since you are simply too dumb"
(this is no joke; this statement really occurred in the comp.lang.c++ newsgroup) Big smiley here :-)
(*) Notes:
There are other GCs possible; a new and relatively popular algorithm is the so-called conservative GC, which scans memory for things which "look like pointers" to track reachable objects.
Since in C-like languages objects may not be moved around, these collectors typically use a variant of the mark and sweep mechanism; however, faster schemes are possible and occasionally implemented (treadmill, non-copying generation scavenging etc.)
I do not know what ST/V and Enfin use.
ST/X uses a modified generation scavenging scheme for new objects (with
adaptive aging) and three different algorithms to reclaim old objects:
a Baker-style copying compressor, an in-place compressor and a
stopping mark-and-sweep; all are started on demand.
There is also an incremental mark & sweep running over the old objects
at idle times or (optionally) as a low priority background process.
The implementation is prepared for adding a third, intermediate
generation, and it is planned to replace the semispace Baker algorithm by some
in-place compressor. This compressor will be somewhat slower, but reduces the virtual memory needs.
Oldspace collections happen very infrequently - unless special memory needs arise (as when doing image processing/viewing), the system may run forever without one, especially with the incremental background collector running.
For long running server applications, it may be a good idea to explicitly force a compressing collect at times when the system is known to be idle or not heavily used. For example, a watcher thread may trigger a GC every Sunday night at 2 o'clock in the morning, as sketched below.
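A minimal sketch of such a watcher (assuming ST/X; the ObjectMemory selector and the dayOfWeek numbering used here are assumptions - check them against your system's class protocols before relying on them):

    "fork a low priority watcher process which forces a compressing collect
     early on Sunday mornings; ObjectMemory compressingGarbageCollect and the
     dayOfWeek convention (here: 7 is assumed to be Sunday) may differ in your dialect"
    [
        [true] whileTrue:[
            (Delay forSeconds:3600) wait.                      "wake up once per hour"
            (Date today dayOfWeek = 7 and:[Time now hours = 2]) ifTrue:[
                ObjectMemory compressingGarbageCollect
            ]
        ]
    ] forkAt:(Processor userBackgroundPriority).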
If the programmer knows that a permanent data structure is being built, hints can be given to the storage system - either by changing the tenuring age or by directly allocating those objects in the oldSpace.
Often, temporarily increasing the size of the newSpace to some big amount (say 1 or 2 megabytes) gives a big improvement.
Thus, reclamation of this space is now the responsibility of the oldspace
collector, which, for various reasons, produces more cpu-time
overhead. Normally, the incremental GC can free this memory easily; however,
if the incremental GC cannot keep up with the oldspace allocation produced by tenuring,
the oldSpace will eventually fill up, and a noticeable pause cannot be avoided
(when the oldSpace is reclaimed and/or compacted).
The adaptive aging usually prevents this, but whenever big chunks
of data are allocated, there is a danger of those being moved to the
oldspace early or, if the chunk's size exceeds some limit, being
directly allocated in the oldspace.
Temporarily increasing the size of the newSpace helps here as well. For example, in request-response-loop kinds of applications, such as a webserver, it is wise to adjust the newSpace size to provide enough room to handle a single request. Then the next scavenge will find (at almost zero cost) that no objects survive, and reclaim the whole area for free.
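As a sketch only (the resizing selector shown is an assumption, not a documented API - look at the ObjectMemory class protocol of your ST/X version for the real tuning hooks):

    "give the newSpace enough room that a complete request fits into it;
     ObjectMemory newSpaceSize: is assumed here - the actual selector may differ"
    ObjectMemory newSpaceSize:(2 * 1024 * 1024).     "e.g. 2 Mb of newSpace"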
The last only applies to systems which do not provide a reasonable mmap system call (e.g. Ultrix, RealIX and some other older SYSV3 based systems).
Of course, there are always special situations in which one algorithm performs better than others - memory allocation patterns vary over different applications. Generation scavenging provides the best average performance over different allocation patterns.
Or, to summarize (and estimate/simplify):
if a programmer spends 80% of his/her time debugging and testing code (which is realistic), and 50% of all errors are dangling pointers, bad frees or other GC-avoidable bugs (which is optimistic), getting rid of those bugs rewards you with a development cost saving of 40%! Those savings will pretty soon amortize the added cost of a 5% to 10% faster CPU.
The exact percentages may vary and not be correct in every situation, but you see the point; don't you?
Example 1:
First we demonstrate a straight-forward algorithm to compute the N-th row of
the pascal triangle; this is actually a functional problem, but the implementation
given here allocates a lot of memory and is therefore a good example for memory management
issues.
The pascal triangle looks like:

        1
       1 1
      1 2 1
     1 3 3 1
    1 4 6 4 1
       ...

The N'th row can be computed recursively,
by first computing the (N-1)'st row,
then adding up consecutive elements of that row,
and prepending/appending a 1 up front and at the end.
Here is a sample implementation: (Integer methods)
The code below takes the N-1'st row and creates two temporary
rows (shL and shR) which have a 0 up front and at the end, respectively.
Then, these two are added column-wise, and the results are collected for
the final result.
pascal
    "return the n'th row of the pascal triangle"

    |pNMinus1 shL shR|

    self == 1 ifTrue:[ ^ #( 1 ) ].
    pNMinus1 := (self - 1) pascal.
    shL := pNMinus1 , #(0).          "/ create: 1 n n n n n n n n 0
    shR := #(0) , pNMinus1.          "/ create: 0 n n n n n n n n 1
    ^ shL with:shR
           collect:[:a :b | a + b].  "/ add them up 1 m m m m m m m m 1
The recursion ends of course with the first row, which is #(1) by definition.
How much memory is allocated along the way?
For the N'th row, 3 new arrays of roughly N elements each are allocated;
that means that the overall memory requirement to calculate pascal(N)
is (1+2+3+4+...+N) * 3, which is roughly N*N*(3/2).
To compute pascal(10), 150 cells are allocated - roughly 600 bytes.
To compute pascal(1000), 1.5mio cells are allocated - roughly 6Mb.
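These estimates are easy to double-check in a workspace (assuming 4-byte cells and ignoring object headers):

    | n cells |
    n := 1000.
    cells := 3 * (n * (n + 1) / 2).
    Transcript showCR:cells printString, ' cells, about ',
                      (cells * 4 / 1000000) rounded printString, ' Mb'.
    "-> 1501500 cells, about 6 Mb"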
One might tend to think that memory allocation is responsible for the execution
time, and optimize the code to get rid of (some) allocations.
Here is a slightly rewritten version, which allocates two new arrays per invocation
(the comma operation also allocates a new array):
this allocates (1+2+3+4+...+N) * 2 cells which is N*N.
pascal2
    "return the n'th row of the pascal triangle"

    |pNMinus1 prev|

    self == 1 ifTrue:[ ^ #( 1 ) ].
    pNMinus1 := (self - 1) pascal2.
    "/ add it up manually
    prev := 0.
    ^ (pNMinus1 collect:[:el | |sum| sum := prev + el. prev := el. sum]) , #(1).
i.e. to compute pascal(1000), 1.0mio cells are allocated - roughly 4.5Mb.
We can get rid of the extra allocation due to the comma concatenation
by avoiding the collect and doing things manually
(things become more and more obfuscated, though):
Here is a version, which only allocates one new array per invocation:
this allocates (1+2+3+4+...+N) cells which is N*N/2.
pascal3
    "return the n'th row of the pascal triangle"

    |pNMinus1 newRow prev|

    self == 1 ifTrue:[ ^ #( 1 ) ].
    pNMinus1 := (self - 1) pascal3.
    newRow := Array new:self.
    "/ add it up manually
    prev := 0.
    1 to:self-1 do:[:index |
        |el sum|

        el := pNMinus1 at:index.
        sum := prev + el.
        newRow at:index put:sum.
        prev := el
    ].
    newRow at:self put:1.
    ^ newRow.
i.e. to compute pascal(1000), 0.5 mio cells are allocated - roughly 2Mb.
Can we write a version which does not allocate anything at all ?
Yes, we can. The following code preallocates an array and
computes the result into it (recursive invocations use the same store
for all the previous results):
this last example only allocates N cells.
pascalHelperInto:anArray
    "write the n'th row of the pascal triangle into anArray.
     Destructive version (overwrites elements in the array)"

    |prev|

    self == 1 ifTrue:[
        anArray at:1 put:1.
        ^ anArray.
    ].
    (self - 1) pascalHelperInto:anArray.
    prev := 0.
    1 to:(self - 1) do:[:idx |
        |el sum|

        el := anArray at:idx.
        sum := prev + el.
        prev := el.
        anArray at:idx put:sum.
    ].
    anArray at:self put:1.
    ^ anArray

pascal4
    "return the n'th row of the pascal triangle"

    ^ self pascalHelperInto:(Array new:self)
Execution times on an 800MHz Celeron notebook:

    program  |          time in ms (5 runs)
    ---------+--------------------------------------
    pascal   |  1197   1188   1181   1180   1181
    pascal2  |  1176   1189   1199   1177   1186
    pascal3  |  1276   1253   1220   1223   1222
    pascal4  |   826    822    847    833    945
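Such numbers can be taken with Time millisecondsToRun: (standard Smalltalk); a workspace snippet like the following lets you repeat the measurement on whatever machine you have at hand (the exact values will of course differ):

    Transcript showCR:'pascal  : ', (Time millisecondsToRun:[1000 pascal]) printString, ' ms'.
    Transcript showCR:'pascal2 : ', (Time millisecondsToRun:[1000 pascal2]) printString, ' ms'.
    Transcript showCR:'pascal3 : ', (Time millisecondsToRun:[1000 pascal3]) printString, ' ms'.
    Transcript showCR:'pascal4 : ', (Time millisecondsToRun:[1000 pascal4]) printString, ' ms'.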
Summary:
The above execution times are very interesting; our "optimized" versions pascal2 and pascal3
are not at all faster - pascal3 is actually slower, probably because the built-in comma
and collect operations are highly optimized
(even the author was surprised and had expected a few percent of speedup).
The only real speedup comes with pascal4, which does no allocation at all - but at the price
of a much less intuitive program.
This means that all of the memory allocations and garbage collections together only accounted for
roughly 20-30% of the execution time in this particular example.
It is left to the reader to decide if this is worth the trouble of writing complicated code like pascal4.
It is also left to the reader to reimplement this example in C, C++, Java or any other
system and compare the results -
but please: with correct results! As a quick check,
the sum of the 1000'th row's elements, as computed by
    (1000 pascal) inject:0 into:[:acc :el | acc + el]
is:
535754303593133660474212524530000905280702405852766803721875194
185175525562468061246599189407847929063797336458776573412593572
642846157021799228878734928740196728388741211549271053730253118
557093897709107652323749179097063369938377958277197303853145728
5598238843271083830214915826312193418602834034688
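A cheaper cross-check uses the fact that the elements of the N'th row sum up to 2 raised to (N-1):

    ((1000 pascal) inject:0 into:[:acc :el | acc + el]) = (2 raisedTo:999)     "-> true"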
And - also look at how long it takes you to write, test and measure the examples. If it takes more than half an hour, you might be using the wrong programming language ;-)
Copyright © 1995 Claus Gittinger, all rights reserved
<cg at exept.de>