UNIX Implementation PS2:4-1
UNIX Implementation
K. Thompson
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
This paper describes in high-level terms the
implementation of the resident UNIX- kernel. This dis-
cussion is broken into three parts. The first part
describes how the UNIX system views processes, users,
and programs. The second part describes the I/O system.
The last part describes the UNIX file system.
1. INTRODUCTION
The UNIX kernel consists of about 10,000 lines of C code and
about 1,000 lines of assembly code. The assembly code can be
further broken down into 200 lines included for the sake of effi-
ciency (they could have been written in C) and 800 lines to per-
form hardware functions not possible in C.
This code represents 5 to 10 percent of what has been lumped
into the broad expression ``the UNIX operating system.'' The ker-
nel is the only UNIX code that cannot be substituted by a user to
his own liking. For this reason, the kernel should make as few
real decisions as possible. This does not mean to allow the user
a million options to do the same thing. Rather, it means to allow
only one way to do one thing, but have that way be the least-
common divisor of all the options that might have been provided.
What is or is not implemented in the kernel represents both
a great responsibility and a great power. It is a soap-box plat-
form on ``the way things should be done.'' Even so, if ``the
way'' is too radical, no one will follow it. Every important
decision was weighed carefully. Throughout, simplicity has been
substituted for efficiency. Complex algorithms are used only if
their complexity can be localized.
_________________________
- UNIX is a registered trademark of AT&T Bell Labora-
tories in the USA and other countries.
December 26, 2008
PS2:4-2 UNIX Implementation
2. PROCESS CONTROL
In the UNIX system, a user executes programs in an environ-
ment called a user process. When a system function is required,
the user process calls the system as a subroutine. At some point
in this call, there is a distinct switch of environments. After
this, the process is said to be a system process. In the normal
definition of processes, the user and system processes are dif-
ferent phases of the same process (they never execute simultane-
ously). For protection, each system process has its own stack.
The user process may execute from a read-only text segment,
which is shared by all processes executing the same code. There
is no functional benefit from shared-text segments. An efficiency
benefit comes from the fact that there is no need to swap read-
only segments out because the original copy on secondary memory
is still current. This is a great benefit to interactive programs
that tend to be swapped while waiting for terminal input. Furth-
ermore, if two processes are executing simultaneously from the
same copy of a read-only segment, only one copy needs to reside
in primary memory. This is a secondary effect, because simultane-
ous execution of a program is not common. It is ironic that this
effect, which reduces the use of primary memory, only comes into
play when there is an overabundance of primary memory, that is,
when there is enough memory to keep waiting processes loaded.
All current read-only text segments in the system are main-
tained from the text table. A text table entry holds the location
of the text segment on secondary memory. If the segment is
loaded, that table also holds the primary memory location and the
count of the number of processes sharing this entry. When this
count is reduced to zero, the entry is freed along with any pri-
mary and secondary memory holding the segment. When a process
first executes a shared-text segment, a text table entry is allo-
cated and the segment is loaded onto secondary memory. If a
second process executes a text segment that is already allocated,
the entry reference count is simply incremented.
A user process has some strictly private read-write data
contained in its data segment. As far as possible, the system
does not use the user's data segment to hold system data. In par-
ticular, there are no I/O buffers in the user address space.
The user data segment has two growing boundaries. One,
increased automatically by the system as a result of memory
faults, is used for a stack. The second boundary is only grown
(or shrunk) by explicit requests. The contents of newly allocated
primary memory is initialized to zero.
Also associated and swapped with a process is a small
fixed-size system data segment. This segment contains all the
data about the process that the system needs only when the pro-
cess is active. Examples of the kind of data contained in the
system data segment are: saved central processor registers, open
December 26, 2008
UNIX Implementation PS2:4-3
file descriptors, accounting information, scratch data area, and
the stack for the system phase of the process. The system data
segment is not addressable from the user process and is therefore
protected.
Last, there is a process table with one entry per process.
This entry contains all the data needed by the system when the
process is not active. Examples are the process's name, the loca-
tion of the other segments, and scheduling information. The pro-
cess table entry is allocated when the process is created, and
freed when the process terminates. This process entry is always
directly addressable by the kernel.
Figure 1 shows the relationships between the various process
control data. In a sense, the process table is the definition of
all processes, because all the data associated with a process may
be accessed starting from the process table entry.
2.1. Process creation and program execution
Processes are created by the system primitive fork. The
newly created process (child) is a copy of the original process
(parent). There is no detectable sharing of primary memory
between the two processes. (Of course, if the parent process was
executing from a read-only text segment, the child will share the
text segment.) Copies of all writable data segments are made for
the child process. Files that were open before the fork are truly
shared after the fork. The processes are informed as to their
part in the relationship to allow them to select their own (usu-
ally non-identical) destiny. The parent may wait for the termina-
tion of any of its children.
A process may exec a file. This consists of exchanging the
current text and data segments of the process for new text and
data segments specified in the file. The old segments are lost.
Doing an exec does not change processes; the process that did the
exec persists, but after the exec it is executing a different
program. Files that were open before the exec remain open after
the exec.
If a program, say the first pass of a compiler, wishes to
overlay itself with another program, say the second pass, then it
simply execs the second program. This is analogous to a ``goto.''
If a program wishes to regain control after execing a second pro-
gram, it should fork a child process, have the child exec the
second program, and have the parent wait for the child. This is
analogous to a ``call.'' Breaking up the call into a binding fol-
lowed by a transfer is similar to the subroutine linkage in SL-5.
griswold hanson sl5 overview
2.2. Swapping
The major data associated with a process (the user data seg-
ment, the system data segment, and the text segment) are swapped
December 26, 2008
PS2:4-4 UNIX Implementation
to and from secondary memory, as needed. The user data segment
and the system data segment are kept in contiguous primary memory
to reduce swapping latency. (When low-latency devices, such as
bubbles, CCDs, or scatter/gather devices, are used, this decision
will have to be reconsidered.) Allocation of both primary and
secondary memory is performed by the same simple first-fit algo-
rithm. When a process grows, a new piece of primary memory is
allocated. The contents of the old memory is copied to the new
memory. The old memory is freed and the tables are updated. If
there is not enough primary memory, secondary memory is allocated
instead. The process is swapped out onto the secondary memory,
ready to be swapped in with its new size.
One separate process in the kernel, the swapping process,
simply swaps the other processes in and out of primary memory. It
examines the process table looking for a process that is swapped
out and is ready to run. It allocates primary memory for that
process and reads its segments into primary memory, where that
process competes for the central processor with other loaded
processes. If no primary memory is available, the swapping pro-
cess makes memory available by examining the process table for
processes that can be swapped out. It selects a process to swap
out, writes it to secondary memory, frees the primary memory, and
then goes back to look for a process to swap in.
Thus there are two specific algorithms to the swapping pro-
cess. Which of the possibly many processes that are swapped out
is to be swapped in? This is decided by secondary storage
residence time. The one with the longest time out is swapped in
first. There is a slight penalty for larger processes. Which of
the possibly many processes that are loaded is to be swapped out?
Processes that are waiting for slow events (i.e., not currently
running or waiting for disk I/O) are picked first, by age in pri-
mary memory, again with size penalties. The other processes are
examined by the same age algorithm, but are not taken out unless
they are at least of some age. This adds hysteresis to the swap-
ping and prevents total thrashing.
These swapping algorithms are the most suspect in the sys-
tem. With limited primary memory, these algorithms cause total
swapping. This is not bad in itself, because the swapping does
not impact the execution of the resident processes. However, if
the swapping device must also be used for file storage, the swap-
ping traffic severely impacts the file system traffic. It is
exactly these small systems that tend to double usage of limited
disk resources.
2.3. Synchronization and scheduling
Process synchronization is accomplished by having processes
wait for events. Events are represented by arbitrary integers. By
convention, events are chosen to be addresses of tables associ-
ated with those events. For example, a process that is waiting
for any of its children to terminate will wait for an event that
December 26, 2008
UNIX Implementation PS2:4-5
[
PT: [ T: box invis ht
.2i "Process Table"; move down
.125i A: box ht .25i;
down PTE: box "Process"
"Table" "Entry"; down C:
box ht .25i
]
move right 1.5i
TT: [ T: box invis ht
.2i "Text Table"; move down .125i
A: box ht .25i; down
TTE: box "Text" "Table"
"Entry"; down C: box ht
.25i
]
move down 1i from TT.C.s
move right 0.5i
UTS: [ box ht 0.75i
wid 0.75i "User" "Text" "Segment"
]
move left 1.5i from UTS.w
DS: [
SDS: box "System" "Data"
"Segment" ; move down .5i from
SDS.n ;
UDS: box ht 0.75i "User"
"Data" "Segment"
]
move left 1i from DS.UDS.w
move down 0.25i
UAS: [ box invis
"User" "Address" "Space"
]
line from UAS.ne to UAS.se
line from UAS.nw to UAS.sw
line right 0.15i from UAS.nw
line right 0.15i from UAS.sw
line left 0.15i from UAS.ne
line left 0.15i from UAS.se
arrow from 1/4 of the way
between PT.PTE.ne and PT.PTE.se
right 1.875i
arrow from TT.TTE.e right .5i
then down to UTS.n
arrow from PT.PTE.e right
.875i then down to DS.SDS.n
arrow from 3/4 of the way
between PT.PTE.ne and PT.PTE.se
right .25i then down 1.5i then
right .25i
arrow from 1/4 of the way
between UAS.ne and UAS.se right
.375i then up .25i then right
.25i
arrow from 3/4 of the way
between UAS.ne and UAS.se right
December 26, 2008
2.375i then up .875i then right
.5i
PS2:4-6 UNIX Implementation
move up 1.3175i from UAS.nw
move left .75i
line right 5.625i
move left 5.25i
move up .3125i
RS: [
box invis ht 0.2i
"Resident"
]
move down .8i
SW: [
box invis ht 0.2i
"Swapped"
]
arrow <-> from RS.s to SW.n ]
Fig. 1-Process control data structure.
is the address of its own process table entry. When a process
terminates, it signals the event represented by its parent's pro-
cess table entry. Signaling an event on which no process is wait-
ing has no effect. Similarly, signaling an event on which many
processes are waiting will wake all of them up. This differs con-
siderably from Dijkstra's P and V synchronization operations,
dijkstra sequential processes 1968 in that no memory is associ-
ated with events. Thus there need be no allocation of events
prior to their use. Events exist simply by being used.
On the negative side, because there is no memory associated
with events, no notion of ``how much'' can be signaled via the
event mechanism. For example, processes that want memory might
wait on an event associated with memory allocation. When any
amount of memory becomes available, the event would be signaled.
All the competing processes would then wake up to fight over the
new memory. (In reality, the swapping process is the only process
that waits for primary memory to become available.)
If an event occurs between the time a process decides to
wait for that event and the time that process enters the wait
state, then the process will wait on an event that has already
happened (and may never happen again). This race condition hap-
pens because there is no memory associated with the event to
indicate that the event has occurred; the only action of an event
is to change a set of processes from wait state to run state.
This problem is relieved largely by the fact that process switch-
ing can only occur in the kernel by explicit calls to the event-
wait mechanism. If the event in question is signaled by another
process, then there is no problem. But if the event is signaled
by a hardware interrupt, then special care must be taken. These
synchronization races pose the biggest problem when UNIX is
adapted to multiple-processor configurations. hawley meyer mul-
tiprocessing unix
The event-wait code in the kernel is like a co-routine
December 26, 2008
UNIX Implementation PS2:4-7
linkage. At any time, all but one of the processes has called
event-wait. The remaining process is the one currently executing.
When it calls event-wait, a process whose event has been signaled
is selected and that process returns from its call to event-wait.
Which of the runable processes is to run next? Associated
with each process is a priority. The priority of a system process
is assigned by the code issuing the wait on an event. This is
roughly equivalent to the response that one would expect on such
an event. Disk events have high priority, teletype events are
low, and time-of-day events are very low. (From observation, the
difference in system process priorities has little or no perfor-
mance impact.) All user-process priorities are lower than the
lowest system priority. User-process priorities are assigned by
an algorithm based on the recent ratio of the amount of compute
time to real time consumed by the process. A process that has
used a lot of compute time in the last real-time unit is assigned
a low user priority. Because interactive processes are character-
ized by low ratios of compute to real time, interactive response
is maintained without any special arrangements.
The scheduling algorithm simply picks the process with the
highest priority, thus picking all system processes first and
user processes second. The compute-to-real-time ratio is updated
every second. Thus, all other things being equal, looping user
processes will be scheduled round-robin with a 1-second quantum.
A high-priority process waking up will preempt a running, low-
priority process. The scheduling algorithm has a very desirable
negative feedback character. If a process uses its high priority
to hog the computer, its priority will drop. At the same time, if
a low-priority process is ignored for a long time, its priority
will rise.
3. I/O SYSTEM
The I/O system is broken into two completely separate sys-
tems: the block I/O system and the character I/O system. In
retrospect, the names should have been ``structured I/O'' and
``unstructured I/O,'' respectively; while the term ``block I/O''
has some meaning, ``character I/O'' is a complete misnomer.
Devices are characterized by a major device number, a minor
device number, and a class (block or character). For each class,
there is an array of entry points into the device drivers. The
major device number is used to index the array when calling the
code for a particular device driver. The minor device number is
passed to the device driver as an argument. The minor number has
no significance other than that attributed to it by the driver.
Usually, the driver uses the minor number to access one of
several identical physical devices.
The use of the array of entry points (configuration table)
as the only connection between the system code and the device
drivers is very important. Early versions of the system had a
December 26, 2008
PS2:4-8 UNIX Implementation
much less formal connection with the drivers, so that it was
extremely hard to handcraft differently configured systems. Now
it is possible to create new device drivers in an average of a
few hours. The configuration table in most cases is created
automatically by a program that reads the system's parts list.
3.1. Block I/O system
The model block I/O device consists of randomly addressed,
secondary memory blocks of 512 bytes each. The blocks are uni-
formly addressed 0, 1, ... up to the size of the device. The
block device driver has the job of emulating this model on a phy-
sical device.
The block I/O devices are accessed through a layer of
buffering software. The system maintains a list of buffers (typi-
cally between 10 and 70) each assigned a device name and a device
address. This buffer pool constitutes a data cache for the block
devices. On a read request, the cache is searched for the desired
block. If the block is found, the data are made available to the
requester without any physical I/O. If the block is not in the
cache, the least recently used block in the cache is renamed, the
correct device driver is called to fill up the renamed buffer,
and then the data are made available. Write requests are handled
in an analogous manner. The correct buffer is found and relabeled
if necessary. The write is performed simply by marking the buffer
as ``dirty.'' The physical I/O is then deferred until the buffer
is renamed.
The benefits in reduction of physical I/O of this scheme are
substantial, especially considering the file system implementa-
tion. There are, however, some drawbacks. The asynchronous nature
of the algorithm makes error reporting and meaningful user error
handling almost impossible. The cavalier approach to I/O error
handling in the UNIX system is partly due to the asynchronous
nature of the block I/O system. A second problem is in the
delayed writes. If the system stops unexpectedly, it is almost
certain that there is a lot of logically complete, but physically
incomplete, I/O in the buffers. There is a system primitive to
flush all outstanding I/O activity from the buffers. Periodic use
of this primitive helps, but does not solve, the problem.
Finally, the associativity in the buffers can alter the physical
I/O sequence from that of the logical I/O sequence. This means
that there are times when data structures on disk are incon-
sistent, even though the software is careful to perform I/O in
the correct order. On non-random devices, notably magnetic tape,
the inversions of writes can be disastrous. The problem with mag-
netic tapes is ``cured'' by allowing only one outstanding write
request per drive.
3.2. Character I/O system
The character I/O system consists of all devices that do not
fall into the block I/O model. This includes the ``classical''
December 26, 2008
UNIX Implementation PS2:4-9
character devices such as communications lines, paper tape, and
line printers. It also includes magnetic tape and disks when they
are not used in a stereotyped way, for example, 80-byte physical
records on tape and track-at-a-time disk copies. In short, the
character I/O interface means ``everything other than block.''
I/O requests from the user are sent to the device driver essen-
tially unaltered. The implementation of these requests is, of
course, up to the device driver. There are guidelines and conven-
tions to help the implementation of certain types of device
drivers.
3.2.1. Disk drivers
Disk drivers are implemented with a queue of transaction
records. Each record holds a read/write flag, a primary memory
address, a secondary memory address, and a transfer byte count.
Swapping is accomplished by passing such a record to the swapping
device driver. The block I/O interface is implemented by passing
such records with requests to fill and empty system buffers. The
character I/O interface to the disk drivers create a transaction
record that points directly into the user area. The routine that
creates this record also insures that the user is not swapped
during this I/O transaction. Thus by implementing the general
disk driver, it is possible to use the disk as a block device, a
character device, and a swap device. The only really disk-
specific code in normal disk drivers is the pre-sort of transac-
tions to minimize latency for a particular device, and the actual
issuing of the I/O request.
3.2.2. Character lists
Real character-oriented devices may be implemented using the
common code to handle character lists. A character list is a
queue of characters. One routine puts a character on a queue.
Another gets a character from a queue. It is also possible to ask
how many characters are currently on a queue. Storage for all
queues in the system comes from a single common pool. Putting a
character on a queue will allocate space from the common pool and
link the character onto the data structure defining the queue.
Getting a character from a queue returns the corresponding space
to the pool.
A typical character-output device (paper tape punch, for
example) is implemented by passing characters from the user onto
a character queue until some maximum number of characters is on
the queue. The I/O is prodded to start as soon as there is any-
thing on the queue and, once started, it is sustained by hardware
completion interrupts. Each time there is a completion interrupt,
the driver gets the next character from the queue and sends it to
the hardware. The number of characters on the queue is checked
and, as the count falls through some intermediate level, an event
(the queue address) is signaled. The process that is passing
characters from the user to the queue can be waiting on the
event, and refill the queue to its maximum when the event occurs.
December 26, 2008
PS2:4-10 UNIX Implementation
A typical character input device (for example, a paper tape
reader) is handled in a very similar manner.
Another class of character devices is the terminals. A ter-
minal is represented by three character queues. There are two
input queues (raw and canonical) and an output queue. Characters
going to the output of a terminal are handled by common code
exactly as described above. The main difference is that there is
also code to interpret the output stream as ASCII characters and
to perform some translations, e.g., escapes for deficient termi-
nals. Another common aspect of terminals is code to insert real-
time delay after certain control characters.
Input on terminals is a little different. Characters are
collected from the terminal and placed on a raw input queue. Some
device-dependent code conversion and escape interpretation is
handled here. When a line is complete in the raw queue, an event
is signaled. The code catching this signal then copies a line
from the raw queue to a canonical queue performing the character
erase and line kill editing. User read requests on terminals can
be directed at either the raw or canonical queues.
3.2.3. Other character devices
Finally, there are devices that fit no general category.
These devices are set up as character I/O drivers. An example is
a driver that reads and writes unmapped primary memory as an I/O
device. Some devices are too fast to be treated a character at
time, but do not fit the disk I/O mold. Examples are fast commun-
ications lines and fast line printers. These devices either have
their own buffers or ``borrow'' block I/O buffers for a while and
then give them back.
4. THE FILE SYSTEM
In the UNIX system, a file is a (one-dimensional) array of
bytes. No other structure of files is implied by the system.
Files are attached anywhere (and possibly multiply) onto a
hierarchy of directories. Directories are simply files that users
cannot write. For a further discussion of the external view of
files and directories, see Ref. ritchie thompson unix bstj 1978
%Q This issue
The UNIX file system is a disk data structure accessed com-
pletely through the block I/O system. As stated before, the
canonical view of a ``disk'' is a randomly addressable array of
512-byte blocks. A file system breaks the disk into four self-
identifying regions. The first block (address 0) is unused by the
file system. It is left aside for booting procedures. The second
block (address 1) contains the so-called ``super-block.'' This
block, among other things, contains the size of the disk and the
boundaries of the other regions. Next comes the i-list, a list of
file definitions. Each file definition is a 64-byte structure,
called an i-node. The offset of a particular i-node within the
December 26, 2008
UNIX Implementation PS2:4-11
i-list is called its i-number. The combination of device name
(major and minor numbers) and i-number serves to uniquely name a
particular file. After the i-list, and to the end of the disk,
come free storage blocks that are available for the contents of
files.
The free space on a disk is maintained by a linked list of
available disk blocks. Every block in this chain contains a disk
address of the next block in the chain. The remaining space con-
tains the address of up to 50 disk blocks that are also free.
Thus with one I/O operation, the system obtains 50 free blocks
and a pointer where to find more. The disk allocation algorithms
are very straightforward. Since all allocation is in fixed-size
blocks and there is strict accounting of space, there is no need
to compact or garbage collect. However, as disk space becomes
dispersed, latency gradually increases. Some installations choose
to occasionally compact disk space to reduce latency.
An i-node contains 13 disk addresses. The first 10 of these
addresses point directly at the first 10 blocks of a file. If a
file is larger than 10 blocks (5,120 bytes), then the eleventh
address points at a block that contains the addresses of the next
128 blocks of the file. If the file is still larger than this
(70,656 bytes), then the twelfth block points at up to 128
blocks, each pointing to 128 blocks of the file. Files yet larger
(8,459,264 bytes) use the thirteenth address for a ``triple
indirect'' address. The algorithm ends here with the maximum file
size of 1,082,201,087 bytes.
A logical directory hierarchy is added to this flat physical
structure simply by adding a new type of file, the directory. A
directory is accessed exactly as an ordinary file. It contains
16-byte entries consisting of a 14-byte name and an i-number. The
root of the hierarchy is at a known i-number (viz., 2). The file
system structure allows an arbitrary, directed graph of direc-
tories with regular files linked in at arbitrary places in this
graph. In fact, very early UNIX systems used such a structure.
Administration of such a structure became so chaotic that later
systems were restricted to a directory tree. Even now, with regu-
lar files linked multiply into arbitrary places in the tree,
accounting for space has become a problem. It may become neces-
sary to restrict the entire structure to a tree, and allow a new
form of linking that is subservient to the tree structure.
The file system allows easy creation, easy removal, easy
random accessing, and very easy space allocation. With most phy-
sical addresses confined to a small contiguous section of disk,
it is also easy to dump, restore, and check the consistency of
the file system. Large files suffer from indirect addressing, but
the cache prevents most of the implied physical I/O without
adding much execution. The space overhead properties of this
scheme are quite good. For example, on one particular file sys-
tem, there are 25,000 files containing 130M bytes of data-file
content. The overhead (i-node, indirect blocks, and last block
December 26, 2008
PS2:4-12 UNIX Implementation
breakage) is about 11.5M bytes. The directory structure to sup-
port these files has about 1,500 directories containing 0.6M
bytes of directory content and about 0.5M bytes of overhead in
accessing the directories. Added up any way, this comes out to
less than a 10 percent overhead for actual stored data. Most sys-
tems have this much overhead in padded trailing blanks alone.
4.1. File system implementation
Because the i-node defines a file, the implementation of the
file system centers around access to the i-node. The system main-
tains a table of all active i-nodes. As a new file is accessed,
the system locates the corresponding i-node, allocates an i-node
table entry, and reads the i-node into primary memory. As in the
buffer cache, the table entry is considered to be the current
version of the i-node. Modifications to the i-node are made to
the table entry. When the last access to the i-node goes away,
the table entry is copied back to the secondary store i-list and
the table entry is freed.
All I/O operations on files are carried out with the aid of
the corresponding i-node table entry. The accessing of a file is
a straightforward implementation of the algorithms mentioned pre-
viously. The user is not aware of i-nodes and i-numbers. Refer-
ences to the file system are made in terms of path names of the
directory tree. Converting a path name into an i-node table entry
is also straightforward. Starting at some known i-node (the root
or the current directory of some process), the next component of
the path name is searched by reading the directory. This gives an
i-number and an implied device (that of the directory). Thus the
next i-node table entry can be accessed. If that was the last
component of the path name, then this i-node is the result. If
not, this i-node is the directory needed to look up the next com-
ponent of the path name, and the algorithm is repeated.
The user process accesses the file system with certain prim-
itives. The most common of these are open, create, read, write,
seek, and close. The data structures maintained are shown in Fig.
2. In the system data segment associated with a user, there is
room for some (usually between 10 and 50) open files. This open
file table consists of pointers that can be used to access
corresponding i-node table entries. Associated with each of these
open files is a current I/O pointer. This is a byte offset of the
next read/write operation on the file. The system treats each
read/write request as random with an implied seek to the I/O
pointer. The user usually thinks of the file as sequential with
the I/O pointer automatically counting the number of bytes that
have been read/written from the file. The user may, of course,
perform random I/O by setting the I/O pointer before
reads/writes.
With file sharing, it is necessary to allow related
processes to share a common I/O pointer and yet have separate I/O
pointers for independent processes that access the same file.
December 26, 2008
UNIX Implementation PS2:4-13
With these two conditions, the I/O pointer cannot reside in the
i-node table nor can it reside in the list of open files for the
process. A new table (the open file table) was invented for the
sole purpose of holding the I/O pointer. Processes that share the
same open file (the result of forks) share a common open file
table entry. A separate open of the same file will only share the
i-node table entry, but will have distinct open file table
entries.
The main file system primitives are implemented as follows.
open converts a file system path name into an i-node table entry.
A pointer to the i-node table entry is placed in a newly created
open file table entry. A pointer to the file table entry is
placed in the system data segment for the process. create first
creates a new i-node entry, writes the i-number into a directory,
and then builds the same structure as for an open. read and write
just access the i-node entry as described above. seek simply
manipulates the I/O pointer. No physical seeking is done. close
just frees the structures built by open and create. Reference
counts are kept on the open file table entries and the i-node
table entries to free these structures after the last reference
goes away. unlink simply decrements the count of the number of
directories pointing at the given i-node. When the last reference
to an i-node table entry goes away, if the i-node has no direc-
tories pointing to it, then the file is removed and the i-node is
freed. This delayed removal of files prevents problems arising
from removing active files. A file may be removed while still
open. The resulting unnamed file vanishes when the file is
closed. This is a method of obtaining temporary files.
There is a type of unnamed FIFO file called a pipe. Imple-
mentation of pipes consists of implied seeks before each read or
write in order to implement first-in-first-out. There are also
checks and synchronization to prevent the writer from grossly
outproducing the reader and to prevent the reader from overtaking
the writer.
4.2. Mounted file systems
The file system of a UNIX system starts with some designated
block device formatted as described above to contain a hierarchy.
The root of this structure is the root of the UNIX file system. A
second formatted block device may be mounted at any leaf of the
current hierarchy. This logically extends the current hierarchy.
The implementation of mounting is trivial. A mount table is main-
tained containing pairs of designated leaf i-nodes and block
devices. When converting a path name into an i-node, a check is
made to see if the new i-node is a designated leaf. If it is, the
i-node of the root of the block device replaces it.
Allocation of space for a file is taken from the free pool
on the device on which the file lives. Thus a file system con-
sisting of many mounted devices does not have a common pool of
free secondary storage space. This separation of space on
December 26, 2008
PS2:4-14 UNIX Implementation
[ PUOFT: [
A: box invis ht
.4i wid 1i "Per-User Open" "File
Table" B: box ht
.25i with .n at A.s
C: box with .n at
B.s D: box ht
.25i with .n at C.s ]
move down 1.0625i left
1.25i from PUOFT.D.s OFT:
[ A: box invis ht
.4i wid 1i "Open File" "Table"
B: box ht .25i
with .n at A.s C:
box with .n at B.s
D: box ht .25i
with .n at C.s ]
move down 1.0625i right
1.25i from PUOFT.D.s AIT:
[ A: box invis ht
.4i wid 1i "Active I-node"
"Table" B: box ht
.25i with .n at A.s
C: box with .n at
B.s D: box ht
.25i with .n at C.s ]
move down 2.5i from
PUOFT.D.s IF: [
A: box ht .25i
B: box ht .25i
"I-node" with .n at A.s
C: box ht .25i
with .n at B.s D:
box ht .25i "File" with .n at C.s
E: box ht .25i
with .n at D.s ]
move right 1.5i from
IF.D.w FMA: [
box invis "File"
"Mapping" "Algorithms" ]
line from FMA.ne to
FMA.se line from FMA.nw
to FMA.sw line left .15i
from FMA.se line left
.15i from FMA.ne line
right .15i from FMA.nw
line right .15i from
FMA.sw
arrow from FMA.w to
IF.D.e arrow from AIT.C.e
right .25i then down 2.125i then
left .5i arrow from
OFT.C.e to AIT.C.w arrow
from PUOFT.C.w left .5i then down
1.625i then left .5i ar-
row <-> from IF.B.e right .5i
December 26, 2008
then up 1.5i then right .5i
UNIX Implementation PS2:4-15
move up .1875i from
OFT.A.nw line right 5i
move left 5i down 1.9375i
line right 5i
move up 1.63475i right
2.75i from PUOFT.D.s line
right .1i down .1i then down .6i
then right .1i down .1i then left
.1i down .1i then down .6i then
left .1i down .1i move
down .34375i right 2.75i from
PUOFT.D.s line right .1i
down .1i then down .6i then right
.1i down .1i then left .1i down
.1i then down .6i then left .1i
down .1i move down
2.34375i right 2.75i from
PUOFT.D.s line right .1i
down .1i then down .6i then right
.1i down .1i then left .1i down
.1i then down .6i then left .1i
down .1i
move up 0.817375i right
2.9i from PUOFT.D.s box
invis "Swapped" "Per User"
move down 1.15625i right
2.9i from PUOFT.D.s box
invis wid 1i "Resident" "Per Sys-
tem" move down 3.15675i
right 2.9i from PUOFT.D.s
box invis ht 1i wid 1i
"Secondary" "Storage" "Per" "File
System" ]
Fig. 2-File system data structure.
different devices is necessary to allow easy unmounting of a
device.
4.3. Other system functions
There are some other things that the system does for the
user-a little accounting, a little tracing/debugging, and a lit-
tle access protection. Most of these things are not very well
developed because our use of the system in computing science
research does not need them. There are some features that are
missed in some applications, for example, better inter-process
communication.
The UNIX kernel is an I/O multiplexer more than a complete
operating system. This is as it should be. Because of this
December 26, 2008
PS2:4-16 UNIX Implementation
outlook, many features are found in most other operating systems
that are missing from the UNIX kernel. For example, the UNIX ker-
nel does not support file access methods, file disposition, file
formats, file maximum size, spooling, command language, logical
records, physical records, assignment of logical file names, log-
ical file names, more than one character set, an operator's con-
sole, an operator, log-in, or log-out. Many of these things are
symptoms rather than features. Many of these things are imple-
mented in user software using the kernel as a tool. A good exam-
ple of this is the command language. bourne shell 1978 bstj %Q
This issue Each user may have his own command language. Mainte-
nance of such code is as easy as maintaining user code. The idea
of implementing ``system'' code with general user primitives
comes directly from MULTICS. organick multics 1972
$LIST$
December 26, 2008
Generated on 2008-12-26 21:13:42 by $MirOS: src/scripts/roff2htm,v 1.57 2008/12/09 22:04:51 tg Exp $
These manual pages are copyrighted
by their respective writers; their source is available at our CVSweb, AnonCVS, and other mirrors.
The rest is Copyright © 2002-2008 The
MirOS Project, Germany.
This product includes material provided by Thorsten Glaser.
This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.