Malloc(3) in modern Virtual Memory environments.
Revised Fri Apr 5 12:50:07 1996
Poul-Henning Kamp
<phk@FreeBSD.org>
Den Andensidste Viking
Valbygaardsvej 8
DK-4200 Slagelse
Denmark
ABSTRACT
Malloc/free is one of the oldest part of the
C language environment and obviously the world has
changed a bit since it was first made. The fact
that most UNIX kernels have changed from a
swap/segment to a virtual memory/page based memory
management has not been sufficiently reflected in
the implementations of the malloc/free API.
A new implementation was designed, written,
tested and bench-marked with an eye on the work-
ings and performance characteristics of modern
Virtual Memory systems.
1. Introduction
Most programs need to allocate storage dynamically in
addition to whatever static storage the compiler reserved at
compile-time. To C programmers this fact is rather obvious,
but for many years this was not an accepted and recognized
fact, and many languages still used today don't support this
notion adequately.
The classic UNIX kernel provides two very simple and
powerful mechanisms for obtaining dynamic storage, the exe-
cution stack and the heap. The stack is usually put at the
far upper end of the address-space, from where it grows down
as far as needed, though this may depend on the CPU design.
The heap starts at the end of the bss segment and grows
upwards as needed.
There isn't really a kernel-interface to the stack as
such. The kernel will allocate some amount of memory for it,
not even telling the process the exact size. If the process
- 2 - Introduction
needs more space than that, it will simply try to access it,
hoping that the kernel will detect that access have been
attempted outside the allocated memory, and try to extend
it. If the kernel fails to extend the stack, this could be
because of lack of resources or permissions or because it
may just be impossible to do in the first place, the process
will usually be shot down by the kernel.
In the C language, there exists a little used interface
to the stack, alloca(3), which will explicitly allocate
space on the stack. This is not a interface to the kernel,
but merely an adjustment done to the stack-pointer such that
space will be available and unharmed by any subroutine calls
yet to be made while the context of the current subroutine
is intact.
Due to the nature of normal use of the stack, there is
no corresponding "free" operator, but instead the space is
returned when the current function returns to its caller and
the stack frame is dismanteled. This is the cause of much
grief, and probably the single most important reason that
alloca(3) is not, and should not be, used widely.
The heap on the other hand has an explicit kernel-
interface in the system call brk(2). The argument to brk(2)
is a pointer to where the process wants the heap to end.
There is also a interface called sbrk(2) taking an increment
to the current end of the heap, but this is merely a libc
front for brk(2).
In addition to these two memory resources, modern vir-
tual memory kernels provide the mmap(2)/munmap(2) interface
which allows almost complete control over any bit of virtual
memory in the process address room.
Because of the generality of the mmap(2) interface and
the way the data structures representing the regions are
laid out, sbrk(2) is actually faster in use than the
equivalent mmap(2) call, simply because the mmap(2) has to
search for information that is implicit in the sbrk(2) call.
2. The kernel and memory
Brk(2) isn't a particularly convenient interface, it
was probably made more to fit the memory model of the
hardware being used, than to fill the needs of the program-
mers.
Before paged and/or virtual memory systems became com-
mon, the most popular memory management facility used for
UNIX was segments. This was also very often the only vehicle
for imposing protection on various parts of memory. Depend-
ing on the hardware, segments can be anything, and conse-
quently how the kernels exploited them varied a lot from
- 3 - The kernel and memory
UNIX to UNIX and from machine to machine.
Typically a process would have one segment for the text
section, one for the data and bss section combined and one
for the stack. On some systems the text shared a segment
with the data and bss, and was consequently just as writable
as them.
In this setup all the brk(2) system call have to do is
to find the right amount of free storage, possibly moving
things around in physical memory, maybe even swapping out a
segment or two to make space, and change the upper limit on
the data segment according to the address given.
In a more modern page based virtual memory implementa-
tion this is still pretty much the situation, except that
the granularity is now pages: The kernel finds the right
number of free pages, possibly paging some pages out to free
them up, and then plug them into the page-table of the pro-
cess.
As such the difference is very small, the real differ-
ence is that in the old world of swapping, either the entire
process was in primary storage (or it wouldn't be selected
to be run) in a modern VM kernel, a process might only have
a subset of its pages in primary memory, the rest will be
paged in, if and when the process tries to access them.
Only very few programs deal with the brk(2) interface
directly, the few that does usually have their own memory
management facilities. LISP or FORTH interpreters are good
examples. Most other programs use the malloc(3) interface
instead, and leave it to the malloc implementation to use
brk(2) to get storage allocated from the kernel.
3. Malloc and free
The job of malloc(3) is to turn the rather simple
brk(2) facility into a service programs can actually use
without getting hurt.
The archetypical malloc(3) implementation keeps track
of the memory between the end of the bss section, as defined
by the _end symbol, and the current brk(2) point using a
linked list of chunks of memory. Each item on the list has a
status as either free or used, a pointer to the next entry
and in most cases to the previous as well, to speed up
inserts and deletes in the list.
When a malloc(3) request comes in, the list is
traversed from the front and if a free chunk big enough to
hold the request is found, it is returned, if the free chunk
is bigger than the size requested, a new free chunk is made
from the excess and put back on the list.
- 4 - Malloc and free
When a chunk is free(3)'ed, the chunk is found in the
list, its status is changed to free and if one or both of
the surrounding chunks are free, they are collapsed to one.
A third kind of request, realloc(3) exists, it will
resize a chunk, trying to avoid copying the contents if pos-
sible. It is seldom used, and has only had a significant
impact on performance in a few special situations. The typi-
cal pattern of use is to malloc(3) a chunk of the maximum
size needed, read in the data and adjust the size of the
chunk to match the size of the data read using realloc(3).
For reasons of efficiency, the original implementation
of malloc(3) put the small structure used to contain the
next and previous pointers plus the state of the chunk right
before the chunk itself.
As a matter of fact, the canonical malloc(3) implemen-
tation can be studied in the ``Old testament'', chapter 8
verse 7 [Kernighan & Ritchie]
Various optimisations can be applied to the above basic
algorithm:
If in freeing a chunk, we end up with the last chunk on
the list being free, we can return that to the kernel
by calling brk(2) with the first address of that chunk
and then make the previous chunk the last on the chain
by terminating its ``next'' pointer.
A best-fit algorithm can be used instead of first-fit
at an expense of memory, because statistically fewer
chances to brk(2) backwards will present themselves.
Splitting the list in two, once for used and one for
free chunks to speed the searching.
Putting free chunks on one of several free-list depend-
ing on the size to speed allocation.
...
4. The problems
Even though malloc(3) is a lot simpler to use than the
raw brk(2)/sbrk(2) interface or maybe exactly because of
that, a lot of problems arise from its use.
Writing to memory outside the allocated chunk. The most
likely result being that the data structure used to
hold the links and flags about this chunk or the next
one gets thrashed.
Freeing a pointer to memory not allocated by malloc.
- 5 - The problems
This is often a pointer that points to an object on the
stack or in the data-section, in newer implementations
of C it may even be in the text- section where it is
likely to be readonly. Some malloc implementations
detect this, some don't.
Freeing a modified pointer. This is a very common mis-
take, freeing not the pointer malloc(3) returned, but
rather some offset from it. Some mallocs will handle
this correctly if the offset is positive.
Freeing the same pointer more than once.
Accessing memory in a chunk after it has been
free(3)'ed.
The handling of these problems have traditionally been
weak. A core-dump was the most common form for "handling",
but in rare cases one could experience the famous "malloc:
corrupt arena." message before the core-dump. Even worse
though, very often the program will just continue, possibly
giving wrong results.
An entirely different form for problem is that the
memory returned by malloc(3) can contain any value. Unfor-
tunately most kernels, correctly, zero out the storage they
provide with brk(2), and thus the storage malloc returns
will be zeroed in many cases as well, so programmers are not
particular apt to notice that their code depend on malloc'ed
storage to be zeroed.
With problems this big and error handling this weak, it
is not surprising that problems are hard and time consuming
to find and fix.
5. Alternative implementations
These problems were actually the inspiration for the
first alternative malloc implementations. Since their main
aim was debugging, they would often use techniques like
allocating a guard zone before and after the chunk, and pos-
sibly fill these guard zones with some pattern, so accesses
outside the allocated chunk can be detected with some decent
probability. Another widely used technique is to use tables
to keep track of what chunks were actually in what state and
so on.
This class of debugging has been taken to its practical
extreme by the product "Purify" which does the entire
memory-colouring exercise and not only keeps track of what
is in use and what isn't, but also detects if the first
reference is a read (which would return undefined values)
and other such violations.
- 6 - Alternative implementations
Later actual complete implementations of malloc
arrived, but many of these still based their workings on the
basic schema mentioned previously, disregarding that in the
meantime virtual memory and paging have become the standard
environment.
The most widely used "alternative" malloc is undoubt-
edly ``gnumalloc'' which have received wide acclaim and cer-
tainly runs faster than most stock mallocs. It does however
tend to fare badly in a cases where paging is the norm
rather than the exception.
The particular malloc that prompted this work basically
didn't bother reusing storage until the kernel forced it to
do so by refusing further allocations with sbrk(2). That may
make sense if you work alone on your own personal mainframe,
but as a general policy it is less than optimal.
6. Performance
Performance for a malloc(3) implementation comes as two
variables:
A: How much time does it use for searching and manipu-
lating data structures. We will refer to this as
``overhead time''.
B: How well does it manage the storage. This rather
vague metric we call ``quality of allocation''.
The overhead time is easy to measure, just to a lot of
malloc/free calls of various kinds and combination, and com-
pare the results.
The quality of allocation is not quite as simple as
that. One measure of quality is the size of the process,
that should obviously be minimized. Another measure is the
execution time of the process. This is not an obvious indi-
cator of quality, but people will generally agree that it
should be minimized as well, and if malloc(3) can do any-
thing to do so, it should. Explanation why it is still a
good metric follows:
In a traditional segment/swap kernel, the desirable
behaviour of a process is to keep the brk(2) as low as pos-
sible, thus minimizing the size of the data/bss/heap seg-
ment, which in turn translates to a smaller process and a
smaller probability of the process being swapped out, qed:
faster execution time as an average.
In a paging environment this is not a bad choice for a
default, but a couple of details needs to be looked at much
more carefully.
- 7 - Performance
First of all, the size of a process becomes a more
vague concept since only the pages that are actually used
needs to be in primary storage for execution to progress,
and they only need to be there when used. That implies that
many more processes can fit in the same amount of primary
storage, since most processes have a high degree of locality
of reference and thus only need some fraction of their pages
to actually do their job.
From this it follows that the interesting size of the
process, is some subset of the total amount of virtual
memory occupied by the process. This number isn't a con-
stant, it varies depending on the whereabouts of the pro-
cess, and it may indeed fluctuate wildly over the lifetime
of the process.
One of the names for this vague concept is ``current
working set''. It has been defined many different ways over
the years, mostly to satisfy and support claims in marketing
or benchmark contexts.
For now we can simply say that it is the number of
pages the process needs in order to run at a sufficiently
low paging rate in a congested primary storage. (If primary
storage isn't congested, this is not really important of
course, but most systems would be better off using the pages
for disk-cache or similar functions, so from that perspec-
tive it will always be congested.) If the number of pages is
too small, the process will wait for its pages to be read
from secondary storage much of the time, if it's too big,
the space could be used better for something else.
From the view of any single process, this number of
pages is "all of my pages", but from the point of view of
the OS it should be tuned to maximise the total throughput
of all the processes on the machine at the time. This is
usually done using various kinds of least-recently-used
replacement algorithms to select page candidates for
replacement.
With this knowledge, can we decide what the performance
goal is for a modern malloc(3) ? Well, it's almost as simple
as it used to be: Minimize the number of pages accessed.
This really is the core of it all. If the number of
accessed pages is small, then locality of reference is
higher, and all kinds of caches (which essentially is what
the primary storage is in a VM system) works better.
It's interesting to notice that the classical malloc
fails on this one because the information about free chunks
are kept with the free chunks themselves. In some of the
benchmarks this came out as all the pages were paged in
every time a malloc were made, because malloc had to
- 8 - Performance
traverse the free-list to find a suitable chunk for the
allocation. If memory is not in use, then you shouldn't
access it.
The secondary goal is more evident: Try to work in
pages.
That makes it easier for the kernel, and wastes less
virtual memory. Most modern implementations does this when
they interact with the kernel, but few try to avoid objects
spanning pages.
If an objects size is less or equal to a page, there is
no reason for it to span two pages. Having objects span
pages means that two pages must be paged in, if that object
is accessed.
With this analysis in the luggage, we can start coding.
7. Implementation
A new malloc(3) implementation was written to meet the
goals, and to the extent possible to address the shortcom-
ings listed previously.
The source is 1218 lines of C code, and can be found in
FreeBSD 2.2 (and probably later versions as well) as
src/lib/libc/stdlib/malloc.c.
The main data structure is the page-directory which
contains a void* for each page we have control over. The
value can be one of:
MALLOC_NOT_MINE Another part of the code may call
brk(2) to get a piece of the cake. Consequently we can-
not rely on the memory we get from the kernel to be one
consecutive piece of memory and therefore we need a way
to mark such pages as "untouchable".
MALLOC_FREE This is a free page.
MALLOC_FIRST This is the first page in a (multi-)page
allocation.
MALLOC_FOLLOW This is a subsequent page in a multi-page
allocation.
struct pginfo* A pointer to a structure describing a
partitioned page.
In addition there exist a linked list of small data
structures that describe the free space as runs of free
pages.
- 9 - Implementation
Notice that these structures are not part of the free
pages themselves, but rather allocated with malloc so that
the free pages themselves are never referenced while they
are free.
When a request for storage comes in, it will be treated
as a ``page'' allocation if it is bigger than half a page.
The freelist will be searched and the first run of free
pages that can satisfy the request is used. The first page
gets set to MALLOC_FIRST status, if more than that one page
is needed the rest of them gets MALLOC_FOLLOW status in the
page-directory.
If there were no pages on the free-list, brk(2) will be
called, and the pages will get added to the page-directory
with status MALLOC_FREE and the search restarts.
Freeing a number of pages is done by changing their
state in the page directory to MALLOC_FREE, and then
traverse the free-pages list to find the right place for
this run of pages, possibly collapsing with the two neigh-
bouring runs into one run and, if it is possible, release
some memory back to the kernel by calling brk(2).
If the request is less than or equal to half of a page,
its size will be rounded up to the nearest power of two
before being processed and if the request is less than some
minimum size, it is rounded up to that size.
These sub-page allocations are served from pages which
are split up into some number of equal size chunks. For each
of these pages a struct pginfo describes the size of the
chunks on this page, how many there are, how many are free
and so on. The description consist of a bitmap of used
chunks, and various counters and numbers used to keep track
of the stuff in the page.
For each size of sub-page allocation, the pginfo struc-
tures for the pages that have free chunks in them form a
list. The head of these lists are stored in predetermined
slots at the beginning of the page directory to make access
fast.
To allocate a chunk of some size, the head of the list
for the corresponding size is examined, and a free chunk
found, the number of free chunks on that page is decreased
by one and if zero the pginfo structure is unlinked from the
list.
To free a chunk, the page is derived from the pointer,
the page table for that page contains a pointer to the
pginfo structure, where the free bit is set for the chunk,
the number of free chunks increased by one, and if equal to
one, the pginfo structure is linked into the proper place on
- 10 - Implementation
the list for this size of chunks. If the count increases to
match the number of chunks on the page, the pginfo structure
is unlinked from the list and free(3)'ed and the actual page
itself is free(3)'ed too.
To be 100% correct performance-wise these lists should
be ordered according to the recent number of accesses to
that page. This information is not available and it would
essentially mean a reordering of the list on every memory
reference to keep it up-to-date. Instead they are ordered
according to the address of the pages. Interestingly enough,
in practice this comes out to almost the same thing perfor-
mance wise.
It's not that surprising after all, it's the difference
between following the crowd or actively directing where it
can go, in both ways you can end up in the middle of it all.
The side effect of this compromise is that it also uses
less storage, and the list never has to be reordered, all
the ordering happens when pages are added or deleted.
It is an interesting twist to the implementation that
the struct pginfo Is allocated with malloc. That is, "as
with malloc" to be painfully correct. The code knows the
special case where the first (couple) of allocations on the
page is actually the pginfo structure and deals with it
accordingly. This avoids some silly "chicken and egg"
issues.
8. Bells and whistles.
brk(2) is actually not a very fast system call when you
ask for storage. This is mainly because of the need by the
kernel to zero the pages before handing them over, so there-
fore this implementation does not release back heap-pages,
until there is a large chunk to release back to the kernel.
Chances are pretty good that we will need it again pretty
soon anyway. Since these pages are not accessed at all, they
will soon be paged out and don't affect anything but swap-
space usage.
The page directory is actually kept in a mmap(2)'ed
piece of anonymous memory. This avoids some rather silly
cases that we would otherwise have to be handled when the
page directory has to be extended.
One particular nice feature is that all pointers passed
to free(3) and realloc(3) can be checked conclusively for
validity: First the pointer is masked to find the page. The
page directory is then examined, it must contain either
MALLOC_FIRST, in which case the pointer must point exactly
at the page, or it can contain a struct pginfo*, in which
case the pointer must point to a one of the chunks described
- 11 - Bells and whistles.
by that structure. Warnings will be printed on stderr and
nothing will be done with the pointer in case it is found to
be invalid.
An environment variable MALLOC_OPTIONS allows the user
some control over the behaviour of malloc. Some of the more
interesting options are:
Abort If malloc fails to allocate storage, core-dump
the process with a message rather than expect it handle
this correctly. It's amazing how few programs actually
handle this condition correctly, and consequently the
havoc they can create is the more creative or destruc-
tive.
Dump Writes malloc statistics to a file called
``malloc.out'' prior to process termination.
Hint Pass a hint to the kernel about pages we no longer
need through the madvise(2) system call. This can help
performance on machines that page heavily by eliminat-
ing unnecessary page-ins and page-outs of unused data.
Realloc Always do a free and malloc when realloc(3) is
called. The default is to leave things alone if the
size of the allocation is still in the same size-class.
For programs doing garbage collect using realloc(3)
this make the heap collapse faster. Since the malloc
will reallocate from the lowest available address.
Junk will explicitly fill the allocated area with a
particular value to try to detect if programs rely on
it being zero.
Zero will explicitly zero out the allocated chunk of
memory, while any space after the allocation in the
chunk will be filled with the junk value to try to
catch out of the chunk references.
9. The road not yet taken.
A couple of avenues were explored that could be
interesting in some set of circumstances.
Using mmap(2) instead of brk(2) was actually slower,
since brk(2) knows a lot of the things that mmap has to find
out first.
In general there is little room for further improvement
of the time-overhead of the malloc, further improvements
will have to be in the area of improving paging behaviour.
It is still under consideration to add a feature such
that if realloc is called with two zero arguments, the
- 12 - The road not taken.
internal allocations will be reallocated to perform a gar-
bage collect. This could be used in certain types of pro-
grams to collapse the memory use, but so far it doesn't seem
to be worth the effort.
Malloc/Free can be a significant point of contention in
multi-threaded programs. Low-grain locking of the data-
structures inside the implementation should be implemented
to avoid excessive spin-waiting.
10. Conclusion and experience.
In general the performance differences between gnumal-
loc and this malloc are not that big. The major difference
comes when primary storage is seriously over-committed, in
which case gnumalloc wastes time paging in pages it's not
going to use. In such cases as much as a factor of five in
wall-clock time has been seen in difference. Apart from that
gnumalloc and this implementation are pretty much head-on
performance wise.
Several legacy programs in the BSD 4.4 Lite distribu-
tion had code that depended on the memory returned from mal-
loc to be zeroed, in a couple of cases free(3) was called
more than once for the same allocation and a few cases even
called free(3) with pointers to objects in the data section
or on the stack.
A couple of users have reported that using this malloc
on other platforms yielded "pretty impressive results", but
no hard benchmarks have been made.
11. Acknowledgements & references.
The first implementation of this algorithm was actually
a file system, done in assembler using 5-hole ``Baudot''
paper tape for a drum storage device attached to a 20 bit
germanium transistor computer with 2000 words of memory, but
that was many years ago.
Peter Wemm <peter@FreeBSD.org> came up with the idea to
store the page-directory in mmap(2)'ed memory instead of in
the heap. This has proven to be a good move.
Lars Fredriksen <fredriks@mcs.com> found and identified
a fence-post bug in the code.
Generated on 2013-04-27 00:20:00 by $MirOS: src/scripts/roff2htm,v 1.77 2013/01/01 20:49:09 tg Exp $
These manual pages and other documentation are copyrighted by their respective writers;
their source is available at our CVSweb,
AnonCVS, and other mirrors. The rest is Copyright © 2002‒2013 The MirOS Project, Germany.
This product includes material
provided by Thorsten Glaser.
This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.