A Fast File System for UNIX*
Marshall Kirk McKusick, William N. Joy-,
Samuel J. Leffler=, Robert S. Fabry
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
ABSTRACT
A reimplementation of the UNIX file system is
described. The reimplementation provides substan-
tially higher throughput rates by using more flex-
ible allocation policies that allow better local-
ity of reference and can be adapted to a wide
range of peripheral and processor characteristics.
The new file system clusters data that is sequen-
tially accessed and provides two block sizes to
allow fast access to large files while not wasting
large amounts of space for small files. File
access rates of up to ten times faster than the
traditional UNIX file system are experienced. Long
needed enhancements to the programmers' interface
are discussed. These include a mechanism to place
advisory locks on files, extensions of the name
space across file systems, the ability to use long
file names, and provisions for administrative con-
trol of resource usage.
Revised February 18, 1984
_________________________
* UNIX is a trademark of Bell Laboratories.
- William N. Joy is currently employed by: Sun Mi-
crosystems, Inc, 2550 Garcia Avenue, Mountain View, CA
94043
= Samuel J. Leffler is currently employed by: Lucasfilm
Ltd., PO Box 2009, San Rafael, CA 94912
This work was done under grants from the National Sci-
ence Foundation under grant MCS80-05144, and the De-
fense Advance Research Projects Agency (DoD) under ARPA
Order No. 4031 monitored by Naval Electronic System
Command under Contract No. N00039-82-C-0235.
SMM:05-2 A Fast File System for UNIX
CR Categories and Subject Descriptors: D.4.3 [Operating Sys-
tems]: File Systems Management - file organization, direc-
tory structures, access methods; D.4.2 [Operating Systems]:
Storage Management - allocation/deallocation strategies,
secondary storage devices; D.4.8 [Operating Systems]: Per-
formance - measurements, operational analysis; H.3.2 [Infor-
mation Systems]: Information Storage - file organization
Additional Keywords and Phrases: UNIX, file system organiza-
tion, file system performance, file system design, applica-
tion program interface.
General Terms: file system, measurement, performance.
A Fast File System for UNIX SMM:05-3
TABLE OF CONTENTS
1. Introduction
2. Old file system
3. New file system organization
3.1. Optimizing storage utilization
3.2. File system parameterization
3.3. Layout policies
4. Performance
5. File system functional enhancements
5.1. Long file names
5.2. File locking
5.3. Symbolic links
5.4. Rename
5.5. Quotas
Acknowledgements
References
1. Introduction
This paper describes the changes from the original 512
byte UNIX file system to the new one released with the 4.2
Berkeley Software Distribution. It presents the motivations
for the changes, the methods used to effect these changes,
the rationale behind the design decisions, and a description
of the new implementation. This discussion is followed by a
summary of the results that have been obtained, directions
for future work, and the additions and changes that have
been made to the facilities that are available to program-
mers.
The original UNIX system that runs on the PDP-11- has
simple and elegant file system facilities. File system
input/output is buffered by the kernel; there are no align-
ment constraints on data transfers and all operations are
made to appear synchronous. All transfers to the disk are in
512 byte blocks, which can be placed arbitrarily within the
data area of the file system. Virtually no constraints
other than available disk space are placed on file growth
[Ritchie74], [Thompson78].*
_________________________
- DEC, PDP, VAX, MASSBUS, and UNIBUS are trademarks of
Digital Equipment Corporation.
* In practice, a file's size is constrained to be less
than about one gigabyte.
SMM:05-4 A Fast File System for UNIX
When used on the VAX-11 together with other UNIX
enhancements, the original 512 byte UNIX file system is
incapable of providing the data throughput rates that many
applications require. For example, applications such as VLSI
design and image processing do a small amount of processing
on a large quantities of data and need to have a high
throughput from the file system. High throughput rates are
also needed by programs that map files from the file system
into large virtual address spaces. Paging data in and out of
the file system is likely to occur frequently [Ferrin82b].
This requires a file system providing higher bandwidth than
the original 512 byte UNIX one that provides only about two
percent of the maximum disk bandwidth or about 20 kilobytes
per second per arm [White80], [Smith81b].
Modifications have been made to the UNIX file system to
improve its performance. Since the UNIX file system inter-
face is well understood and not inherently slow, this
development retained the abstraction and simply changed the
underlying implementation to increase its throughput. Conse-
quently, users of the system have not been faced with mas-
sive software conversion.
Problems with file system performance have been dealt
with extensively in the literature; see [Smith81a] for a
survey. Previous work to improve the UNIX file system per-
formance has been done by [Ferrin82a]. The UNIX operating
system drew many of its ideas from Multics, a large, high
performance operating system [Feiertag71]. Other work
includes Hydra [Almes78], Spice [Thompson80], and a file
system for a LISP environment [Symbolics81]. A good intro-
duction to the physical latencies of disks is described in
[Pechura83].
2. Old File System
In the file system developed at Bell Laboratories (the
``traditional'' file system), each disk drive is divided
into one or more partitions. Each of these disk partitions
may contain one file system. A file system never spans mul-
tiple partitions.- A file system is described by its super-
block, which contains the basic parameters of the file sys-
tem. These include the number of data blocks in the file
_________________________
- By ``partition'' here we refer to the subdivision of
physical space on a disk drive. In the traditional
file system, as in the new file system, file systems
are really located in logical disk partitions that may
overlap. This overlapping is made available, for exam-
ple, to allow programs to copy entire disk drives con-
taining multiple file systems.
A Fast File System for UNIX SMM:05-5
system, a count of the maximum number of files, and a
pointer to the free list, a linked list of all the free
blocks in the file system.
Within the file system are files. Certain files are
distinguished as directories and contain pointers to files
that may themselves be directories. Every file has a
descriptor associated with it called an inode. An inode con-
tains information describing ownership of the file, time
stamps marking last modification and access times for the
file, and an array of indices that point to the data blocks
for the file. For the purposes of this section, we assume
that the first 8 blocks of the file are directly referenced
by values stored in an inode itself*. An inode may also con-
tain references to indirect blocks containing further data
block indices. In a file system with a 512 byte block size,
a singly indirect block contains 128 further block
addresses, a doubly indirect block contains 128 addresses of
further singly indirect blocks, and a triply indirect block
contains 128 addresses of further doubly indirect blocks.
A 150 megabyte traditional UNIX file system consists of
4 megabytes of inodes followed by 146 megabytes of data.
This organization segregates the inode information from the
data; thus accessing a file normally incurs a long seek from
the file's inode to its data. Files in a single directory
are not typically allocated consecutive slots in the 4 mega-
bytes of inodes, causing many non-consecutive blocks of
inodes to be accessed when executing operations on the
inodes of several files in a directory.
The allocation of data blocks to files is also subop-
timum. The traditional file system never transfers more than
512 bytes per disk transaction and often finds that the next
sequential data block is not on the same cylinder, forcing
seeks between 512 byte transfers. The combination of the
small block size, limited read-ahead in the system, and many
seeks severely limits file system throughput.
The first work at Berkeley on the UNIX file system
attempted to improve both reliability and throughput. The
reliability was improved by staging modifications to criti-
cal file system information so that they could either be
completed or repaired cleanly by a program after a crash
[Kowalski78]. The file system performance was improved by a
factor of more than two by changing the basic block size
from 512 to 1024 bytes. The increase was because of two fac-
tors: each disk transfer accessed twice as much data, and
most files could be described without need to access
indirect blocks since the direct blocks contained twice as
_________________________
* The actual number may vary from system to system, but
is usually in the range 5-13.
SMM:05-6 A Fast File System for UNIX
much data. The file system with these changes will hen-
ceforth be referred to as the old file system.
This performance improvement gave a strong indication
that increasing the block size was a good method for improv-
ing throughput. Although the throughput had doubled, the old
file system was still using only about four percent of the
disk bandwidth. The main problem was that although the free
list was initially ordered for optimal access, it quickly
became scrambled as files were created and removed. Eventu-
ally the free list became entirely random, causing files to
have their blocks allocated randomly over the disk. This
forced a seek before every block access. Although old file
systems provided transfer rates of up to 175 kilobytes per
second when they were first created, this rate deteriorated
to 30 kilobytes per second after a few weeks of moderate use
because of this randomization of data block placement. There
was no way of restoring the performance of an old file sys-
tem except to dump, rebuild, and restore the file system.
Another possibility, as suggested by [Maruyama76], would be
to have a process that periodically reorganized the data on
the disk to restore locality.
3. New file system organization
In the new file system organization (as in the old file
system organization), each disk drive contains one or more
file systems. A file system is described by its super-block,
located at the beginning of the file system's disk parti-
tion. Because the super-block contains critical data, it is
replicated to protect against catastrophic loss. This is
done when the file system is created; since the super-block
data does not change, the copies need not be referenced
unless a head crash or other hard disk error causes the
default super-block to be unusable.
To insure that it is possible to create files as large
as 232 bytes with only two levels of indirection, the
minimum size of a file system block is 4096 bytes. The size
of file system blocks can be any power of two greater than
or equal to 4096. The block size of a file system is
recorded in the file system's super-block so it is possible
for file systems with different block sizes to be simultane-
ously accessible on the same system. The block size must be
decided at the time that the file system is created; it can-
not be subsequently changed without rebuilding the file sys-
tem.
The new file system organization divides a disk parti-
tion into one or more areas called cylinder groups. A
cylinder group is comprised of one or more consecutive
cylinders on a disk. Associated with each cylinder group is
A Fast File System for UNIX SMM:05-7
some bookkeeping information that includes a redundant copy
of the super-block, space for inodes, a bit map describing
available blocks in the cylinder group, and summary informa-
tion describing the usage of data blocks within the cylinder
group. The bit map of available blocks in the cylinder group
replaces the traditional file system's free list. For each
cylinder group a static number of inodes is allocated at
file system creation time. The default policy is to allocate
one inode for each 2048 bytes of space in the cylinder
group, expecting this to be far more than will ever be
needed.
All the cylinder group bookkeeping information could be
placed at the beginning of each cylinder group. However if
this approach were used, all the redundant information would
be on the top platter. A single hardware failure that des-
troyed the top platter could cause the loss of all redundant
copies of the super-block. Thus the cylinder group bookkeep-
ing information begins at a varying offset from the begin-
ning of the cylinder group. The offset for each successive
cylinder group is calculated to be about one track further
from the beginning of the cylinder group than the preceding
cylinder group. In this way the redundant information
spirals down into the pack so that any single track,
cylinder, or platter can be lost without losing all copies
of the super-block. Except for the first cylinder group, the
space between the beginning of the cylinder group and the
beginning of the cylinder group information is used for data
blocks.-
3.1. Optimizing storage utilization
Data is laid out so that larger blocks can be
transferred in a single disk transaction, greatly increasing
file system throughput. As an example, consider a file in
the new file system composed of 4096 byte data blocks. In
the old file system this file would be composed of 1024 byte
_________________________
- While it appears that the first cylinder group could
be laid out with its super-block at the ``known'' loca-
tion, this would not work for file systems with blocks
sizes of 16 kilobytes or greater. This is because of a
requirement that the first 8 kilobytes of the disk be
reserved for a bootstrap program and a separate re-
quirement that the cylinder group information begin on
a file system block boundary. To start the cylinder
group on a file system block boundary, file systems
with block sizes larger than 8 kilobytes would have to
leave an empty space between the end of the boot block
and the beginning of the cylinder group. Without know-
ing the size of the file system blocks, the system
would not know what roundup function to use to find the
beginning of the first cylinder group.
SMM:05-8 A Fast File System for UNIX
blocks. By increasing the block size, disk accesses in the
new file system may transfer up to four times as much infor-
mation per disk transaction. In large files, several 4096
byte blocks may be allocated from the same cylinder so that
even larger data transfers are possible before requiring a
seek.
The main problem with larger blocks is that most UNIX
file systems are composed of many small files. A uniformly
large block size wastes space. Table 1 shows the effect of
file system block size on the amount of wasted space in the
file system. The files measured to obtain these figures
reside on one of our time sharing systems that has roughly
1.2 gigabytes of on-line storage. The measurements are based
on the active user file systems containing about 920 mega-
bytes of formatted space.
_________________________________________________________________________
|Space used| % waste| Organization |
|__________|_________|__________________________________________________|
|775.2 Mb | 0.0 | Data only, no separation between files |
|807.8 Mb | 4.2 | Data only, each file starts on 512 byte boundary|
|828.7 Mb | 6.9 | Data + inodes, 512 byte block UNIX file system |
|866.5 Mb | 11.8 | Data + inodes, 1024 byte block UNIX file system |
|948.5 Mb | 22.4 | Data + inodes, 2048 byte block UNIX file system |
|1128.3 Mb | 45.6 | Data + inodes, 4096 byte block UNIX file system |
|__________|_________|__________________________________________________|
Table 1 - Amount of wasted space as a function of block size.
The space wasted is calculated to be the percentage of space
on the disk not containing user data. As the block size on
the disk increases, the waste rises quickly, to an intoler-
able 45.6% waste with 4096 byte file system blocks.
To be able to use large blocks without undue waste,
small files must be stored in a more efficient way. The new
file system accomplishes this goal by allowing the division
of a single file system block into one or more fragments.
The file system fragment size is specified at the time that
the file system is created; each file system block can
optionally be broken into 2, 4, or 8 fragments, each of
which is addressable. The lower bound on the size of these
fragments is constrained by the disk sector size, typically
512 bytes. The block map associated with each cylinder group
records the space available in a cylinder group at the frag-
ment level; to determine if a block is available, aligned
fragments are examined. Figure 1 shows a piece of a map from
a 4096/1024 file system. Each bit in the map records the
status of a fragment; an ``X'' shows that the fragment is in
use, while a ``O'' shows that the fragment is available for
allocation. In this example, fragments 0-5, 10, and 11 are
in use, while fragments 6-9, and 12-15 are free. Fragments
A Fast File System for UNIX SMM:05-9
_______________________________________________
|Bits in map | XXXX XXOO OOXX OOOO |
|Fragment numbers| 0-3 4-7 8-11 12-15|
|Block numbers | 0 1 2 3 |
|________________|____________________________|
Figure 1 - Example layout of blocks and fragments in a 4096/1024 file system.
of adjoining blocks cannot be used as a full block, even if
they are large enough. In this example, fragments 6-9 cannot
be allocated as a full block; only fragments 12-15 can be
coalesced into a full block.
On a file system with a block size of 4096 bytes and a
fragment size of 1024 bytes, a file is represented by zero
or more 4096 byte blocks of data, and possibly a single
fragmented block. If a file system block must be fragmented
to obtain space for a small amount of data, the remaining
fragments of the block are made available for allocation to
other files. As an example consider an 11000 byte file
stored on a 4096/1024 byte file system. This file would uses
two full size blocks and one three fragment portion of
another block. If no block with three aligned fragments is
available at the time the file is created, a full size block
is split yielding the necessary fragments and a single
unused fragment. This remaining fragment can be allocated to
another file as needed.
Space is allocated to a file when a program does a
write system call. Each time data is written to a file, the
system checks to see if the size of the file has increased*.
If the file needs to be expanded to hold the new data, one
of three conditions exists:
1) There is enough space left in an already allocated
block or fragment to hold the new data. The new data is
written into the available space.
2) The file contains no fragmented blocks (and the last
block in the file contains insufficient space to hold
the new data). If space exists in a block already allo-
cated, the space is filled with new data. If the
remainder of the new data contains more than a full
block of data, a full block is allocated and the first
full block of new data is written there. This process
is repeated until less than a full block of new data
remains. If the remaining new data to be written will
_________________________
* A program may be overwriting data in the middle of an
existing file in which case space would already have
been allocated.
SMM:05-10 A Fast File System for UNIX
fit in less than a full block, a block with the neces-
sary fragments is located, otherwise a full block is
located. The remaining new data is written into the
located space.
3) The file contains one or more fragments (and the frag-
ments contain insufficient space to hold the new data).
If the size of the new data plus the size of the data
already in the fragments exceeds the size of a full
block, a new block is allocated. The contents of the
fragments are copied to the beginning of the block and
the remainder of the block is filled with new data. The
process then continues as in (2) above. Otherwise, if
the new data to be written will fit in less than a full
block, a block with the necessary fragments is located,
otherwise a full block is located. The contents of the
existing fragments appended with the new data are writ-
ten into the allocated space.
The problem with expanding a file one fragment at a a
time is that data may be copied many times as a fragmented
block expands to a full block. Fragment reallocation can be
minimized if the user program writes a full block at a time,
except for a partial block at the end of the file. Since
file systems with different block sizes may reside on the
same system, the file system interface has been extended to
provide application programs the optimal size for a read or
write. For files the optimal size is the block size of the
file system on which the file is being accessed. For other
objects, such as pipes and sockets, the optimal size is the
underlying buffer size. This feature is used by the Standard
Input/Output Library, a package used by most user programs.
This feature is also used by certain system utilities such
as archivers and loaders that do their own input and output
management and need the highest possible file system
bandwidth.
The amount of wasted space in the 4096/1024 byte new
file system organization is empirically observed to be about
the same as in the 1024 byte old file system organization. A
file system with 4096 byte blocks and 512 byte fragments has
about the same amount of wasted space as the 512 byte block
UNIX file system. The new file system uses less space than
the 512 byte or 1024 byte file systems for indexing informa-
tion for large files and the same amount of space for small
files. These savings are offset by the need to use more
space for keeping track of available free blocks. The net
result is about the same disk utilization when a new file
system's fragment size equals an old file system's block
size.
In order for the layout policies to be effective, a
file system cannot be kept completely full. For each file
system there is a parameter, termed the free space reserve,
A Fast File System for UNIX SMM:05-11
that gives the minimum acceptable percentage of file system
blocks that should be free. If the number of free blocks
drops below this level only the system administrator can
continue to allocate blocks. The value of this parameter may
be changed at any time, even when the file system is mounted
and active. The transfer rates that appear in section 4 were
measured on file systems kept less than 90% full (a reserve
of 10%). If the number of free blocks falls to zero, the
file system throughput tends to be cut in half, because of
the inability of the file system to localize blocks in a
file. If a file system's performance degrades because of
overfilling, it may be restored by removing files until the
amount of free space once again reaches the minimum accept-
able level. Access rates for files created during periods of
little free space may be restored by moving their data once
enough space is available. The free space reserve must be
added to the percentage of waste when comparing the organi-
zations given in Table 1. Thus, the percentage of waste in
an old 1024 byte UNIX file system is roughly comparable to a
new 4096/512 byte file system with the free space reserve
set at 5%. (Compare 11.8% wasted with the old file system to
6.9% waste + 5% reserved space in the new file system.)
3.2. File system parameterization
Except for the initial creation of the free list, the
old file system ignores the parameters of the underlying
hardware. It has no information about either the physical
characteristics of the mass storage device, or the hardware
that interacts with it. A goal of the new file system is to
parameterize the processor capabilities and mass storage
characteristics so that blocks can be allocated in an
optimum configuration-dependent way. Parameters used include
the speed of the processor, the hardware support for mass
storage transfers, and the characteristics of the mass
storage devices. Disk technology is constantly improving and
a given installation can have several different disk techno-
logies running on a single processor. Each file system is
parameterized so that it can be adapted to the characteris-
tics of the disk on which it is placed.
For mass storage devices such as disks, the new file
system tries to allocate new blocks on the same cylinder as
the previous block in the same file. Optimally, these new
blocks will also be rotationally well positioned. The dis-
tance between ``rotationally optimal'' blocks varies
greatly; it can be a consecutive block or a rotationally
delayed block depending on system characteristics. On a pro-
cessor with an input/output channel that does not require
any processor intervention between mass storage transfer
requests, two consecutive disk blocks can often be accessed
without suffering lost time because of an intervening disk
revolution. For processors without input/output channels,
the main processor must field an interrupt and prepare for a
SMM:05-12 A Fast File System for UNIX
new disk transfer. The expected time to service this inter-
rupt and schedule a new disk transfer depends on the speed
of the main processor.
The physical characteristics of each disk include the
number of blocks per track and the rate at which the disk
spins. The allocation routines use this information to cal-
culate the number of milliseconds required to skip over a
block. The characteristics of the processor include the
expected time to service an interrupt and schedule a new
disk transfer. Given a block allocated to a file, the allo-
cation routines calculate the number of blocks to skip over
so that the next block in the file will come into position
under the disk head in the expected amount of time that it
takes to start a new disk transfer operation. For programs
that sequentially access large amounts of data, this stra-
tegy minimizes the amount of time spent waiting for the disk
to position itself.
To ease the calculation of finding rotationally optimal
blocks, the cylinder group summary information includes a
count of the available blocks in a cylinder group at dif-
ferent rotational positions. Eight rotational positions are
distinguished, so the resolution of the summary information
is 2 milliseconds for a typical 3600 revolution per minute
drive. The super-block contains a vector of lists called
rotational layout tables. The vector is indexed by rota-
tional position. Each component of the vector lists the
index into the block map for every data block contained in
its rotational position. When looking for an allocatable
block, the system first looks through the summary counts for
a rotational position with a non-zero block count. It then
uses the index of the rotational position to find the
appropriate list to use to index through only the relevant
parts of the block map to find a free block.
The parameter that defines the minimum number of mil-
liseconds between the completion of a data transfer and the
initiation of another data transfer on the same cylinder can
be changed at any time, even when the file system is mounted
and active. If a file system is parameterized to lay out
blocks with a rotational separation of 2 milliseconds, and
the disk pack is then moved to a system that has a processor
requiring 4 milliseconds to schedule a disk operation, the
throughput will drop precipitously because of lost disk
revolutions on nearly every block. If the eventual target
machine is known, the file system can be parameterized for
it even though it is initially created on a different pro-
cessor. Even if the move is not known in advance, the rota-
tional layout delay can be reconfigured after the disk is
moved so that all further allocation is done based on the
characteristics of the new host.
A Fast File System for UNIX SMM:05-13
3.3. Layout policies
The file system layout policies are divided into two
distinct parts. At the top level are global policies that
use file system wide summary information to make decisions
regarding the placement of new inodes and data blocks. These
routines are responsible for deciding the placement of new
directories and files. They also calculate rotationally
optimal block layouts, and decide when to force a long seek
to a new cylinder group because there are insufficient
blocks left in the current cylinder group to do reasonable
layouts. Below the global policy routines are the local
allocation routines that use a locally optimal scheme to lay
out data blocks.
Two methods for improving file system performance are
to increase the locality of reference to minimize seek
latency as described by [Trivedi80], and to improve the lay-
out of data to make larger transfers possible as described
by [Nevalainen77]. The global layout policies try to improve
performance by clustering related information. They cannot
attempt to localize all data references, but must also try
to spread unrelated data among different cylinder groups. If
too much localization is attempted, the local cylinder group
may run out of space forcing the data to be scattered to
non-local cylinder groups. Taken to an extreme, total local-
ization can result in a single huge cluster of data resem-
bling the old file system. The global policies try to bal-
ance the two conflicting goals of localizing data that is
concurrently accessed while spreading out unrelated data.
One allocatable resource is inodes. Inodes are used to
describe both files and directories. Inodes of files in the
same directory are frequently accessed together. For exam-
ple, the ``list directory'' command often accesses the inode
for each file in a directory. The layout policy tries to
place all the inodes of files in a directory in the same
cylinder group. To ensure that files are distributed
throughout the disk, a different policy is used for direc-
tory allocation. A new directory is placed in a cylinder
group that has a greater than average number of free inodes,
and the smallest number of directories already in it. The
intent of this policy is to allow the inode clustering pol-
icy to succeed most of the time. The allocation of inodes
within a cylinder group is done using a next free strategy.
Although this allocates the inodes randomly within a
cylinder group, all the inodes for a particular cylinder
group can be read with 8 to 16 disk transfers. (At most 16
disk transfers are required because a cylinder group may
have no more than 2048 inodes.) This puts a small and con-
stant upper bound on the number of disk transfers required
to access the inodes for all the files in a directory. In
contrast, the old file system typically requires one disk
transfer to fetch the inode for each file in a directory.
SMM:05-14 A Fast File System for UNIX
The other major resource is data blocks. Since data
blocks for a file are typically accessed together, the pol-
icy routines try to place all data blocks for a file in the
same cylinder group, preferably at rotationally optimal
positions in the same cylinder. The problem with allocating
all the data blocks in the same cylinder group is that large
files will quickly use up available space in the cylinder
group, forcing a spill over to other areas. Further, using
all the space in a cylinder group causes future allocations
for any file in the cylinder group to also spill to other
areas. Ideally none of the cylinder groups should ever
become completely full. The heuristic solution chosen is to
redirect block allocation to a different cylinder group when
a file exceeds 48 kilobytes, and at every megabyte
thereafter.* The newly chosen cylinder group is selected
from those cylinder groups that have a greater than average
number of free blocks left. Although big files tend to be
spread out over the disk, a megabyte of data is typically
accessible before a long seek must be performed, and the
cost of one long seek per megabyte is small.
The global policy routines call local allocation rou-
tines with requests for specific blocks. The local alloca-
tion routines will always allocate the requested block if it
is free, otherwise it allocates a free block of the
requested size that is rotationally closest to the requested
block. If the global layout policies had complete informa-
tion, they could always request unused blocks and the allo-
cation routines would be reduced to simple bookkeeping. How-
ever, maintaining complete information is costly; thus the
implementation of the global layout policy uses heuristics
that employ only partial information.
If a requested block is not available, the local allo-
cator uses a four level allocation strategy:
1) Use the next available block rotationally closest to
the requested block on the same cylinder. It is
assumed here that head switching time is zero. On disk
controllers where this is not the case, it may be pos-
sible to incorporate the time required to switch
_________________________
* The first spill over point at 48 kilobytes is the
point at which a file on a 4096 byte block file system
first requires a single indirect block. This appears
to be a natural first point at which to redirect block
allocation. The other spillover points are chosen with
the intent of forcing block allocation to be redirected
when a file has used about 25% of the data blocks in a
cylinder group. In observing the new file system in day
to day use, the heuristics appear to work well in
minimizing the number of completely filled cylinder
groups.
A Fast File System for UNIX SMM:05-15
between disk platters when constructing the rotational
layout tables. This, however, has not yet been tried.
2) If there are no blocks available on the same cylinder,
use a block within the same cylinder group.
3) If that cylinder group is entirely full, quadratically
hash the cylinder group number to choose another
cylinder group to look for a free block.
4) Finally if the hash fails, apply an exhaustive search
to all cylinder groups.
Quadratic hash is used because of its speed in finding
unused slots in nearly full hash tables [Knuth75]. File sys-
tems that are parameterized to maintain at least 10% free
space rarely use this strategy. File systems that are run
without maintaining any free space typically have so few
free blocks that almost any allocation is random; the most
important characteristic of the strategy used under such
conditions is that the strategy be fast.
4. Performance
Ultimately, the proof of the effectiveness of the algo-
rithms described in the previous section is the long term
performance of the new file system.
Our empirical studies have shown that the inode layout
policy has been effective. When running the ``list direc-
tory'' command on a large directory that itself contains
many directories (to force the system to access inodes in
multiple cylinder groups), the number of disk accesses for
inodes is cut by a factor of two. The improvements are even
more dramatic for large directories containing only files,
disk accesses for inodes being cut by a factor of eight.
This is most encouraging for programs such as spooling dae-
mons that access many small files, since these programs tend
to flood the disk request queue on the old file system.
Table 2 summarizes the measured throughput of the new
file system. Several comments need to be made about the con-
ditions under which these tests were run. The test programs
measure the rate at which user programs can transfer data to
or from a file without performing any processing on it.
These programs must read and write enough data to insure
that buffering in the operating system does not affect the
results. They are also run at least three times in succes-
sion; the first to get the system into a known state and the
second two to insure that the experiment has stabilized and
is repeatable. The tests used and their results are dis-
cussed in detail in [Kridle83]-. The systems were running
_________________________
- A UNIX command that is similar to the reading test
SMM:05-16 A Fast File System for UNIX
multi-user but were otherwise quiescent. There was no con-
tention for either the CPU or the disk arm. The only differ-
ence between the UNIBUS and MASSBUS tests was the con-
troller. All tests used an AMPEX Capricorn 330 megabyte Win-
chester disk. As Table 2 shows, all file system test runs
were on a VAX 11/750. All file systems had been in produc-
tion use for at least a month before being measured. The
same number of system calls were performed in all tests; the
basic system call overhead was a negligible portion of the
total running time of the tests.
______________________________________________________________________
| Type of Processor and| Read |
| File System Bus Measured | Speed Bandwidth % CPU|
|_____________________________|______________________________________|
| old 1024 750/UNIBUS | 29 Kbytes/sec 29/983 3% 11% |
|new 4096/1024 750/UNIBUS | 221 Kbytes/sec 221/983 22% 43% |
|new 8192/1024 750/UNIBUS | 233 Kbytes/sec 233/983 24% 29% |
|new 4096/1024 750/MASSBUS | 466 Kbytes/sec 466/983 47% 73% |
|new 8192/1024 750/MASSBUS | 466 Kbytes/sec 466/983 47% 54% |
|_____________________________|______________________________________|
Table 2a - Reading rates of the old and new UNIX file systems.
______________________________________________________________________
| Type of Processor and| Write |
| File System Bus Measured | Speed Bandwidth % CPU|
|_____________________________|______________________________________|
| old 1024 750/UNIBUS | 48 Kbytes/sec 48/983 5% 29% |
|new 4096/1024 750/UNIBUS | 142 Kbytes/sec 142/983 14% 43% |
|new 8192/1024 750/UNIBUS | 215 Kbytes/sec 215/983 22% 46% |
|new 4096/1024 750/MASSBUS | 323 Kbytes/sec 323/983 33% 94% |
|new 8192/1024 750/MASSBUS | 466 Kbytes/sec 466/983 47% 95% |
|_____________________________|______________________________________|
Table 2b - Writing rates of the old and new UNIX file systems.
Unlike the old file system, the transfer rates for the
new file system do not appear to change over time. The
throughput rate is tied much more strongly to the amount of
free space that is maintained. The measurements in Table 2
were based on a file system with a 10% free space reserve.
Synthetic work loads suggest that throughput deteriorates to
about half the rates given in Table 2 when the file systems
are full.
The percentage of bandwidth given in Table 2 is a meas-
ure of the effective utilization of the disk by the file
_________________________
that we used is ``cp file /dev/null'', where ``file''
is eight megabytes long.
A Fast File System for UNIX SMM:05-17
system. An upper bound on the transfer rate from the disk is
calculated by multiplying the number of bytes on a track by
the number of revolutions of the disk per second. The
bandwidth is calculated by comparing the data rates the file
system is able to achieve as a percentage of this rate.
Using this metric, the old file system is only able to use
about 3-5% of the disk bandwidth, while the new file system
uses up to 47% of the bandwidth.
Both reads and writes are faster in the new system than
in the old system. The biggest factor in this speedup is
because of the larger block size used by the new file sys-
tem. The overhead of allocating blocks in the new system is
greater than the overhead of allocating blocks in the old
system, however fewer blocks need to be allocated in the new
system because they are bigger. The net effect is that the
cost per byte allocated is about the same for both systems.
In the new file system, the reading rate is always at
least as fast as the writing rate. This is to be expected
since the kernel must do more work when allocating blocks
than when simply reading them. Note that the write rates are
about the same as the read rates in the 8192 byte block file
system; the write rates are slower than the read rates in
the 4096 byte block file system. The slower write rates
occur because the kernel has to do twice as many disk allo-
cations per second, making the processor unable to keep up
with the disk transfer rate.
In contrast the old file system is about 50% faster at
writing files than reading them. This is because the write
system call is asynchronous and the kernel can generate disk
transfer requests much faster than they can be serviced,
hence disk transfers queue up in the disk buffer cache.
Because the disk buffer cache is sorted by minimum seek dis-
tance, the average seek between the scheduled disk writes is
much less than it would be if the data blocks were written
out in the random disk order in which they are generated.
However when the file is read, the read system call is pro-
cessed synchronously so the disk blocks must be retrieved
from the disk in the non-optimal seek order in which they
are requested. This forces the disk scheduler to do long
seeks resulting in a lower throughput rate.
In the new system the blocks of a file are more
optimally ordered on the disk. Even though reads are still
synchronous, the requests are presented to the disk in a
much better order. Even though the writes are still asyn-
chronous, they are already presented to the disk in minimum
seek order so there is no gain to be had by reordering them.
Hence the disk seek latencies that limited the old file sys-
tem have little effect in the new file system. The cost of
allocation is the factor in the new system that causes
writes to be slower than reads.
SMM:05-18 A Fast File System for UNIX
The performance of the new file system is currently
limited by memory to memory copy operations required to move
data from disk buffers in the system's address space to data
buffers in the user's address space. These copy operations
account for about 40% of the time spent performing an
input/output operation. If the buffers in both address
spaces were properly aligned, this transfer could be per-
formed without copying by using the VAX virtual memory
management hardware. This would be especially desirable when
transferring large amounts of data. We did not implement
this because it would change the user interface to the file
system in two major ways: user programs would be required to
allocate buffers on page boundaries, and data would disap-
pear from buffers after being written.
Greater disk throughput could be achieved by rewriting
the disk drivers to chain together kernel buffers. This
would allow contiguous disk blocks to be read in a single
disk transaction. Many disks used with UNIX systems contain
either 32 or 48 512 byte sectors per track. Each track holds
exactly two or three 8192 byte file system blocks, or four
or six 4096 byte file system blocks. The inability to use
contiguous disk blocks effectively limits the performance on
these disks to less than 50% of the available bandwidth. If
the next block for a file cannot be laid out contiguously,
then the minimum spacing to the next allocatable block on
any platter is between a sixth and a half a revolution. The
implication of this is that the best possible layout without
contiguous blocks uses only half of the bandwidth of any
given track. If each track contains an odd number of sec-
tors, then it is possible to resolve the rotational delay to
any number of sectors by finding a block that begins at the
desired rotational position on another track. The reason
that block chaining has not been implemented is because it
would require rewriting all the disk drivers in the system,
and the current throughput rates are already limited by the
speed of the available processors.
Currently only one block is allocated to a file at a
time. A technique used by the DEMOS file system when it
finds that a file is growing rapidly, is to preallocate
several blocks at once, releasing them when the file is
closed if they remain unused. By batching up allocations,
the system can reduce the overhead of allocating at each
write, and it can cut down on the number of disk writes
needed to keep the block pointers on the disk synchronized
with the block allocation [Powell79]. This technique was not
included because block allocation currently accounts for
less than 10% of the time spent in a write system call and,
once again, the current throughput rates are already limited
by the speed of the available processors.
A Fast File System for UNIX SMM:05-19
5. File system functional enhancements
The performance enhancements to the UNIX file system
did not require any changes to the semantics or data struc-
tures visible to application programs. However, several
changes had been generally desired for some time but had not
been introduced because they would require users to dump and
restore all their file systems. Since the new file system
already required all existing file systems to be dumped and
restored, these functional enhancements were introduced at
this time.
5.1. Long file names
File names can now be of nearly arbitrary length. Only
programs that read directories are affected by this change.
To promote portability to UNIX systems that are not running
the new file system, a set of directory access routines have
been introduced to provide a consistent interface to direc-
tories on both old and new systems.
Directories are allocated in 512 byte units called
chunks. This size is chosen so that each allocation can be
transferred to disk in a single operation. Chunks are broken
up into variable length records termed directory entries. A
directory entry contains the information necessary to map
the name of a file to its associated inode. No directory
entry is allowed to span multiple chunks. The first three
fields of a directory entry are fixed length and contain: an
inode number, the size of the entry, and the length of the
file name contained in the entry. The remainder of an entry
is variable length and contains a null terminated file name,
padded to a 4 byte boundary. The maximum length of a file
name in a directory is currently 255 characters.
Available space in a directory is recorded by having
one or more entries accumulate the free space in their entry
size fields. This results in directory entries that are
larger than required to hold the entry name plus fixed
length fields. Space allocated to a directory should always
be completely accounted for by totaling up the sizes of its
entries. When an entry is deleted from a directory, its
space is returned to a previous entry in the same directory
chunk by increasing the size of the previous entry by the
size of the deleted entry. If the first entry of a directory
chunk is free, then the entry's inode number is set to zero
to indicate that it is unallocated.
5.2. File locking
The old file system had no provision for locking files.
Processes that needed to synchronize the updates of a file
had to use a separate ``lock'' file. A process would try to
create a ``lock'' file. If the creation succeeded, then the
SMM:05-20 A Fast File System for UNIX
process could proceed with its update; if the creation
failed, then the process would wait and try again. This
mechanism had three drawbacks. Processes consumed CPU time
by looping over attempts to create locks. Locks left lying
around because of system crashes had to be manually removed
(normally in a system startup command script). Finally,
processes running as system administrator are always permit-
ted to create files, so were forced to use a different
mechanism. While it is possible to get around all these
problems, the solutions are not straight forward, so a
mechanism for locking files has been added.
The most general schemes allow multiple processes to
concurrently update a file. Several of these techniques are
discussed in [Peterson83]. A simpler technique is to serial-
ize access to a file with locks. To attain reasonable effi-
ciency, certain applications require the ability to lock
pieces of a file. Locking down to the byte level has been
implemented in the Onyx file system by [Bass81]. However,
for the standard system applications, a mechanism that locks
at the granularity of a file is sufficient.
Locking schemes fall into two classes, those using hard
locks and those using advisory locks. The primary difference
between advisory locks and hard locks is the extent of
enforcement. A hard lock is always enforced when a program
tries to access a file; an advisory lock is only applied
when it is requested by a program. Thus advisory locks are
only effective when all programs accessing a file use the
locking scheme. With hard locks there must be some override
policy implemented in the kernel. With advisory locks the
policy is left to the user programs. In the UNIX system,
programs with system administrator privilege are allowed
override any protection scheme. Because many of the programs
that need to use locks must also run as the system adminis-
trator, we chose to implement advisory locks rather than
create an additional protection scheme that was inconsistent
with the UNIX philosophy or could not be used by system
administration programs.
The file locking facilities allow cooperating programs
to apply advisory shared or exclusive locks on files. Only
one process may have an exclusive lock on a file while mul-
tiple shared locks may be present. Both shared and exclusive
locks cannot be present on a file at the same time. If any
lock is requested when another process holds an exclusive
lock, or an exclusive lock is requested when another process
holds any lock, the lock request will block until the lock
can be obtained. Because shared and exclusive locks are
advisory only, even if a process has obtained a lock on a
file, another process may access the file.
Locks are applied or removed only on open files. This
means that locks can be manipulated without needing to close
A Fast File System for UNIX SMM:05-21
and reopen a file. This is useful, for example, when a pro-
cess wishes to apply a shared lock, read some information
and determine whether an update is required, then apply an
exclusive lock and update the file.
A request for a lock will cause a process to block if
the lock can not be immediately obtained. In certain
instances this is unsatisfactory. For example, a process
that wants only to check if a lock is present would require
a separate mechanism to find out this information. Conse-
quently, a process may specify that its locking request
should return with an error if a lock can not be immediately
obtained. Being able to conditionally request a lock is use-
ful to ``daemon'' processes that wish to service a spooling
area. If the first instance of the daemon locks the direc-
tory where spooling takes place, later daemon processes can
easily check to see if an active daemon exists. Since locks
exist only while the locking processes exist, lock files can
never be left active after the processes exit or if the sys-
tem crashes.
Almost no deadlock detection is attempted. The only
deadlock detection done by the system is that the file to
which a lock is applied must not already have a lock of the
same type (i.e. the second of two successive calls to apply
a lock of the same type will fail).
5.3. Symbolic links
The traditional UNIX file system allows multiple direc-
tory entries in the same file system to reference a single
file. Each directory entry ``links'' a file's name to an
inode and its contents. The link concept is fundamental;
inodes do not reside in directories, but exist separately
and are referenced by links. When all the links to an inode
are removed, the inode is deallocated. This style of
referencing an inode does not allow references across physi-
cal file systems, nor does it support inter-machine linkage.
To avoid these limitations symbolic links similar to the
scheme used by Multics [Feiertag71] have been added.
A symbolic link is implemented as a file that contains
a pathname. When the system encounters a symbolic link while
interpreting a component of a pathname, the contents of the
symbolic link is prepended to the rest of the pathname, and
this name is interpreted to yield the resulting pathname. In
UNIX, pathnames are specified relative to the root of the
file system hierarchy, or relative to a process's current
working directory. Pathnames specified relative to the root
are called absolute pathnames. Pathnames specified relative
to the current working directory are termed relative path-
names. If a symbolic link contains an absolute pathname, the
absolute pathname is used, otherwise the contents of the
symbolic link is evaluated relative to the location of the
SMM:05-22 A Fast File System for UNIX
link in the file hierarchy.
Normally programs do not want to be aware that there is
a symbolic link in a pathname that they are using. However
certain system utilities must be able to detect and manipu-
late symbolic links. Three new system calls provide the
ability to detect, read, and write symbolic links; seven
system utilities required changes to use these calls.
In future Berkeley software distributions it may be
possible to reference file systems located on remote
machines using pathnames. When this occurs, it will be pos-
sible to create symbolic links that span machines.
5.4. Rename
Programs that create a new version of an existing file
typically create the new version as a temporary file and
then rename the temporary file with the name of the target
file. In the old UNIX file system renaming required three
calls to the system. If a program were interrupted or the
system crashed between these calls, the target file could be
left with only its temporary name. To eliminate this possi-
bility the rename system call has been added. The rename
call does the rename operation in a fashion that guarantees
the existence of the target name.
Rename works both on data files and directories. When
renaming directories, the system must do special validation
checks to insure that the directory tree structure is not
corrupted by the creation of loops or inaccessible direc-
tories. Such corruption would occur if a parent directory
were moved into one of its descendants. The validation check
requires tracing the descendents of the target directory to
insure that it does not include the directory being moved.
5.5. Quotas
The UNIX system has traditionally attempted to share
all available resources to the greatest extent possible.
Thus any single user can allocate all the available space in
the file system. In certain environments this is unaccept-
able. Consequently, a quota mechanism has been added for
restricting the amount of file system resources that a user
can obtain. The quota mechanism sets limits on both the
number of inodes and the number of disk blocks that a user
may allocate. A separate quota can be set for each user on
each file system. Resources are given both a hard and a soft
limit. When a program exceeds a soft limit, a warning is
printed on the users terminal; the offending program is not
terminated unless it exceeds its hard limit. The idea is
that users should stay below their soft limit between login
sessions, but they may use more resources while they are
actively working. To encourage this behavior, users are
A Fast File System for UNIX SMM:05-23
warned when logging in if they are over any of their soft
limits. If users fails to correct the problem for too many
login sessions, they are eventually reprimanded by having
their soft limit enforced as their hard limit.
Acknowledgements
We thank Robert Elz for his ongoing interest in the new
file system, and for adding disk quotas in a rational and
efficient manner. We also acknowledge Dennis Ritchie for his
suggestions on the appropriate modifications to the user
interface. We appreciate Michael Powell's explanations on
how the DEMOS file system worked; many of his ideas were
used in this implementation. Special commendation goes to
Peter Kessler and Robert Henry for acting like real users
during the early debugging stage when file systems were less
stable than they should have been. The criticisms and
suggestions by the reviews contributed significantly to the
coherence of the paper. Finally we thank our sponsors, the
National Science Foundation under grant MCS80-05144, and the
Defense Advance Research Projects Agency (DoD) under ARPA
Order No. 4031 monitored by Naval Electronic System Command
under Contract No. N00039-82-C-0235.
References
[Almes78] Almes, G., and Robertson, G. "An Exten-
sible File System for Hydra" Proceedings
of the Third International Conference on
Software Engineering, IEEE, May 1978.
[Bass81] Bass, J. "Implementation Description for
File Locking", Onyx Systems Inc, 73 E.
Trimble Rd, San Jose, CA 95131 Jan 1981.
[Feiertag71] Feiertag, R. J. and Organick, E. I.,
"The Multics Input-Output System",
Proceedings of the Third Symposium on
Operating Systems Principles, ACM, Oct
1971. pp 35-41
[Ferrin82a] Ferrin, T.E., "Performance and Robust-
ness Improvements in Version 7 UNIX",
Computer Graphics Laboratory Technical
Report 2, School of Pharmacy, University
of California, San Francisco, January
1982. Presented at the 1982 Winter
Usenix Conference, Santa Monica, Cali-
fornia.
SMM:05-24 A Fast File System for UNIX
[Ferrin82b] Ferrin, T.E., "Performance Issuses of
VMUNIX Revisited", ;login: (The Usenix
Association Newsletter), Vol 7, #5,
November 1982. pp 3-6
[Kridle83] Kridle, R., and McKusick, M., "Perfor-
mance Effects of Disk Subsystem Choices
for VAX Systems Running 4.2BSD UNIX",
Computer Systems Research Group, Dept of
EECS, Berkeley, CA 94720, Technical
Report #8.
[Kowalski78] Kowalski, T. "FSCK - The UNIX System
Check Program", Bell Laboratory, Murray
Hill, NJ 07974. March 1978
[Knuth75] Kunth, D. "The Art of Computer Program-
ming", Volume 3 - Sorting and Searching,
Addison-Wesley Publishing Company Inc,
Reading, Mass, 1975. pp 506-549
[Maruyama76] Maruyama, K., and Smith, S. "Optimal
reorganization of Distributed Space Disk
Files", CACM, 19, 11. Nov 1976. pp 634-
642
[Nevalainen77] Nevalainen, O., Vesterinen, M. "Deter-
mining Blocking Factors for Sequential
Files by Heuristic Methods", The Com-
puter Journal, 20, 3. Aug 1977. pp 245-
247
[Pechura83] Pechura, M., and Schoeffler, J.
"Estimating File Access Time of Floppy
Disks", CACM, 26, 10. Oct 1983. pp 754-
763
[Peterson83] Peterson, G. "Concurrent Reading While
Writing", ACM Transactions on Program-
ming Languages and Systems, ACM, 5, 1.
Jan 1983. pp 46-55
[Powell79] Powell, M. "The DEMOS File System",
Proceedings of the Sixth Symposium on
Operating Systems Principles, ACM, Nov
1977. pp 33-42
[Ritchie74] Ritchie, D. M. and Thompson, K., "The
UNIX Time-Sharing System", CACM 17, 7.
July 1974. pp 365-375
[Smith81a] Smith, A. "Input/Output Optimization and
Disk Architectures: A Survey", Perfor-
mance and Evaluation 1. Jan 1981. pp
A Fast File System for UNIX SMM:05-25
104-117
[Smith81b] Smith, A. "Bibliography on File and I/O
System Optimization and Related Topics",
Operating Systems Review, 15, 4. Oct
1981. pp 39-54
[Symbolics81] "Symbolics File System", Symbolics Inc,
9600 DeSoto Ave, Chatsworth, CA 91311
Aug 1981.
[Thompson78] Thompson, K. "UNIX Implementation", Bell
System Technical Journal, 57, 6, part 2.
pp 1931-1946 July-August 1978.
[Thompson80] Thompson, M. "Spice File System",
Carnegie-Mellon University, Department
of Computer Science, Pittsburg, PA 15213
#CMU-CS-80, Sept 1980.
[Trivedi80] Trivedi, K. "Optimal Selection of CPU
Speed, Device Capabilities, and File
Assignments", Journal of the ACM, 27, 3.
July 1980. pp 457-473
[White80] White, R. M. "Disk Storage Technology",
Scientific American, 243(2), August
1980.
Generated on 2013-04-27 00:20:00 by $MirOS: src/scripts/roff2htm,v 1.77 2013/01/01 20:49:09 tg Exp $
These manual pages and other documentation are copyrighted by their respective writers;
their source is available at our CVSweb,
AnonCVS, and other mirrors. The rest is Copyright © 2002‒2013 The MirOS Project, Germany.
This product includes material
provided by Thorsten Glaser.
This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.