Reprinted with permission from the "Proceedings of the
Winter 1994 Usenix Conference", January 1994, San Francisco,
CA, Copyright The Usenix Association.
Not Quite NFS, Soft Cache Consistency for NFS
Rick Macklem
University of Guelph
Abstract
There are some constraints inherent in the NFS protocol
that result in performance limitations for high performance
workstation environments. This paper discusses an
NFS-like protocol named Not Quite NFS (NQNFS), designed to
address some of these limitations. This protocol provides
full cache consistency during normal operation, while per-
mitting more effective client-side caching in an effort to
improve performance. There are also a variety of minor pro-
tocol changes, in order to resolve various NFS issues. The
emphasis is on observed performance of a preliminary imple-
mentation of the protocol, in order to show how well this
design works and to suggest possible areas for further
improvement.
1. Introduction
It has been observed that overall workstation perfor-
mance has not been scaling with processor speed and that
file system I/O is a limiting factor [Ousterhout90].
Ousterhout notes that a principal challenge for operating
system developers is the decoupling of system calls from
their underlying I/O operations, in order to improve average
system call response times. For distributed file systems,
every synchronous Remote Procedure Call (RPC) takes a
minimum of a few milliseconds and, as such, is analogous to
an underlying I/O operation. This suggests that client cach-
ing with a very good hit ratio for read type operations,
along with asynchronous writing, is required in order to
avoid delays waiting for RPC replies. However, the NFS pro-
tocol requires that the server be stateless[1] and does not
provide any explicit mechanism for client cache consistency,
putting constraints on how the client may cache data. This
paper describes an NFS-like protocol that includes a cache
consistency component designed to enhance client caching
performance. It does provide full consistency under normal
operation, but without requiring that hard state information
be maintained on the server. Design tradeoffs were made
towards simplicity and high performance over cache consistency
under abnormal conditions. The protocol design uses
a variation of Leases [Gray89] to provide state on the
server that does not need to be recovered after a crash.
____________________
[1] The server must not require any state that may be lost
due to a crash, to function correctly.
The protocol also includes changes designed to address
other limitations of NFS in a modern workstation environ-
ment. The use of TCP transport is optionally available to
avoid the pitfalls of Sun RPC over UDP transport when run-
ning across an internetwork [Nowicki89]. Kerberos
[Steiner88] support is available to do proper user authenti-
cation, in order to provide improved security and arbitrary
client to server user ID mappings. There are also a variety
of other changes to accommodate large file systems, such as
64-bit file sizes and offsets, as well as lifting the 8 Kbyte
I/O size limit. The remainder of this paper gives an over-
view of the protocol, highlighting performance related com-
ponents, followed by an evaluation of resultant performance
for the 4.4BSD implementation.
2. Distributed File Systems and Caching
Clients using distributed file systems cache recently-
used data in order to reduce the number of synchronous
server operations, and therefore improve average response
times for system calls. Unfortunately, maintaining con-
sistency between these caches is a problem whenever write
sharing occurs; that is, when a process on a client writes
to a file and one or more processes on other client(s) read
the file. If the writer closes the file before any reader(s)
open the file for reading, this is called sequential write
sharing. Both the Andrew ITC file system [Howard88] and NFS
[Sandberg85] maintain consistency for sequential write shar-
ing by requiring the writer to push all the writes through
to the server on close and having readers check to see if
the file has been modified upon open. If the file has been
modified, the client throws away all cached data for that
file, as it is now stale. NFS implementations typically
detect file modification by checking a cached copy of the
file's modification time; since this cached value is often
several seconds out of date and only has a resolution of one
second, an NFS client often uses stale cached data for some
time after the file has been updated on the server.
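The open-time check described above can be sketched as follows. This is a minimal illustration, not code from any NFS implementation; the names CachedFile and revalidate_on_open are hypothetical, and modification times are modelled as whole seconds to mirror the one-second resolution mentioned in the text.

```python
from dataclasses import dataclass, field


@dataclass
class CachedFile:
    """Client-side cache entry for one file (hypothetical structure)."""
    mtime: int                                   # server mtime seen at last check, whole seconds
    blocks: dict = field(default_factory=dict)   # block number -> cached bytes


def revalidate_on_open(cached, server_mtime):
    """Return True if the cached blocks were kept, False if they were discarded."""
    if server_mtime != cached.mtime:
        # The file changed on the server, so every cached block is stale.
        cached.blocks.clear()
        cached.mtime = server_mtime
        return False
    # Same mtime: the cache is assumed current. A second update within the
    # same second is invisible to this check, which is one source of staleness.
    return True
```

Because the comparison only sees whole seconds, two server-side updates that land in the same second leave the client holding the older data until the modification time next changes.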
A more difficult case is concurrent write sharing,
where write operations are intermixed with read operations.
Consistency for this case, often referred to as "full cache
consistency," requires that a reader always receive the
most recently written data. Neither NFS nor the Andrew ITC
file system maintains consistency for this case. The simplest
mechanism for maintaining full cache consistency is the one
used by Sprite [Nelson88], which disables all client caching
of the file whenever concurrent write sharing might occur.
There are other mechanisms described in the literature
[Kent87a, Burrows88], but they appeared to be too elaborate
for incorporation into NQNFS (for example, Kent's requires
specialized hardware). NQNFS differs from Sprite in the way
it detects write sharing. The Sprite server maintains a list
of files currently open by the various clients and detects
write sharing when a file open request for writing is
received and the file is already open for reading (or vice
versa). This list of open files is hard state information
that must be recovered after a server crash, which is a sig-
nificant problem in its own right [Mogul93, Welch90].
The approach used by NQNFS is a variant of the Leases
mechanism [Gray89]. In this model, the server issues to a
client a promise, referred to as a "lease," that the client
may cache a specific object without fear of conflict. A
lease has a limited duration and must be renewed by the
client if it wishes to continue to cache the object. In
NQNFS, clients hold short-term (up to one minute) leases on
files for reading or writing. The leases are analogous to
entries in the open file list, except that they expire after
the lease term unless renewed by the client. As such, one
minute after issuing the last lease there are no current
leases and therefore no lease records to be recovered after
a crash, hence the term "soft server state."
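A rough sketch of why lease records count as soft state is given below. The LeaseTable class and its method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow; the thirty second term used in the usage note is only an example within the one minute bound stated above.

```python
class LeaseTable:
    """Soft-state lease records: (client, file) -> expiry time in seconds."""

    def __init__(self, term=30.0, max_term=60.0):
        # The text bounds lease terms at one minute.
        self.term = min(term, max_term)
        self.expiry = {}

    def grant(self, client, file_id, now):
        """Issue or renew a lease and return its new expiry time."""
        expires = now + self.term
        self.expiry[(client, file_id)] = expires
        return expires

    def current(self, now):
        """Drop expired records and return the leases that are still valid."""
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)
```

With a thirty second term, a lease granted at time 0 and renewed at time 20 is gone by time 50 unless renewed again, so a server that stays silent for one full term after a restart has no records left to reconstruct.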
A related design consideration is the way client writ-
ing is done. Synchronous writing requires that all writes be
pushed through to the server during the write system call.
This is the simplest variant, from a consistency point of
view, since the server always has the most recently written
data. It also permits any write errors, such as "file system
out of space" to be propagated back to the client's process
via the write system call return. Unfortunately this
approach limits the client write rate, based on server write
performance and client/server RPC round trip time (RTT).
An alternative to this is delayed writing, where the
write system call returns as soon as the data is cached on
the client and the data is written to the server sometime
later. This permits client writing to occur at the rate of
local storage access up to the size of the local cache.
Also, for cases where file truncation/deletion occurs
shortly after writing, the write to the server may be
avoided since the data has already been deleted, reducing
server write load. There are some obvious drawbacks to this
approach. For any Sprite-like system to maintain full con-
sistency, the server must "callback" to the client to cause
the delayed writes to be written back to the server when
write sharing is about to occur. There are also problems
with the propagation of errors back to the client process
that issued the write system call. The reason for this is
that the system call has already returned without reporting
an error and the process may also have already terminated.
As well, there is a risk of the loss of recently written
data if the client crashes before the data is written back
to the server.
A compromise between these two alternatives is asyn-
chronous writing, where the write to the server is initiated
during the write system call but the write system call
returns before the write completes. This approach minimizes
the risk of data loss due to a client crash, but negates the
possibility of reducing server write load by throwing writes
away when a file is truncated or deleted.
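The effect of the three writing policies on server write load can be illustrated with a toy model. The function below is a sketch under simplifying assumptions: one file, one write RPC per write system call, and a delete that discards any writes still held on the client. The function name and the operation strings are hypothetical.

```python
def server_write_rpcs(policy, ops):
    """Count the write RPCs that reach the server for one file.

    policy: "synchronous", "asynchronous" or "delayed".
    ops: sequence of "write" and "delete" operations, in order.
    """
    if policy not in ("synchronous", "asynchronous", "delayed"):
        raise ValueError(policy)
    sent = 0      # RPCs already issued to the server
    pending = 0   # delayed writes still held in the client cache
    for op in ops:
        if op == "write":
            if policy == "delayed":
                pending += 1
            else:
                # Both policies issue the RPC during the system call;
                # they differ only in whether the call waits for the reply.
                sent += 1
        elif op == "delete":
            pending = 0   # cached writes for a deleted file are thrown away
        else:
            raise ValueError(op)
    # Any delayed writes that survive are eventually written back.
    return sent + pending
```

In this model synchronous and asynchronous writing generate the same RPC count and differ only in system call latency, whereas delayed writing sends nothing for a file that is written and then deleted.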
NFS implementations usually do a mix of asynchronous
and delayed writing but push all writes to the server upon
close, in order to maintain open/close consistency. Pushing
the delayed writes on close negates much of the performance
advantage of delayed writing, since the delays that were
avoided in the write system calls are observed in the close
system call. Akin to Sprite, the NQNFS protocol does delayed
writing in an effort to achieve good client performance and
uses a callback mechanism to maintain full cache con-
sistency.
3. Related Work
There has been a great deal of effort put into improv-
ing the performance and consistency of the NFS protocol.
This work falls into two categories. The first category
consists of implementation enhancements for the NFS protocol
and the second involves modifications to the protocol.
The work done on implementation enhancements has
attacked two problem areas: NFS server write performance and
RPC transport problems. Server write performance is a major
problem for NFS, in part due to the requirement to push all
writes to the server upon close and in part due to the fact
that, for writes, all data and meta-data must be committed
to non-volatile storage before the server replies to the
write RPC. The Prestoserve [Moran90] system uses
non-volatile RAM as a buffer for recently written data on the
server, so that the write RPC replies can be returned to the
client before the data is written to the disk surface. Write
gathering [Juszczak94] is a software technique used on the
server where a write RPC request is delayed for a short time
in the hope that another contiguous write request will
arrive, so that they can be merged into one write operation.
Since the replies to all of the merged writes are not
returned to the client until the write operation is com-
pleted, this delay does not violate the protocol. When write
operations are merged, the number of disk writes can be
reduced, improving server write performance. Although either
of the above reduces write RPC response time for the server,
it cannot be reduced to zero, and so, any client side cach-
ing mechanism that reduces write RPC load or client depen-
dence on server RPC response time should still improve
overall performance. Good client side caching should be com-
plementary to these server techniques, although client per-
formance improvements as a result of caching may be less
dramatic when these techniques are used.
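The merging step of write gathering can be sketched as below. This is only an illustration of coalescing contiguous requests, not the algorithm from [Juszczak94]; the function name gather and the (offset, data) representation are assumptions, and the short delay the server uses to collect requests is not modelled.

```python
def gather(writes):
    """Merge contiguous write requests into fewer, larger disk writes.

    writes: list of (offset, data) pairs, where data is a bytes object.
    Returns a list of merged (offset, data) pairs in offset order.
    """
    merged = []
    for offset, data in sorted(writes, key=lambda w: w[0]):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            # This request starts exactly where the previous one ends.
            prev_offset, prev_data = merged[-1]
            merged[-1] = (prev_offset, prev_data + data)
        else:
            merged.append((offset, data))
    return merged
```

For example, requests at offsets 0 and 2 of two bytes each collapse into one four byte write, while a request at offset 8 stays separate.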
In NFS, each Sun RPC request is packaged in a UDP
datagram for transmission to the server. A timer is started,
and if a timeout occurs before the corresponding RPC reply
is received, the RPC request is retransmitted. There are two
problems with this model. First, when a retransmit timeout
occurs, the RPC may be redone, instead of simply retransmit-
ting the RPC request message to the server. A recent-request
cache can be used on the server to minimize the negative
impact of redoing RPCs [Juszczak89]. The second problem is
that a large UDP datagram, such as a read reply or write
request, must be fragmented by IP and if any one IP fragment
is lost in transit, the entire UDP datagram is lost
[Kent87]. Since entire requests and replies are packaged in
a single UDP datagram, this puts an upper bound on the
read/write data size (8 kbytes).
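The cost of fragmentation can be estimated with a short calculation. The sketch below assumes an Ethernet MTU of 1500 bytes, a 20 byte IP header without options, an 8 byte UDP header, and independent fragment losses; none of these figures come from the paper, and RPC header overhead is ignored.

```python
import math


def fragments(payload_bytes, mtu=1500, ip_header=20, udp_header=8):
    """Number of IP fragments needed for one UDP datagram (assumed sizes)."""
    # Fragment data must be a multiple of 8 bytes, except in the last fragment.
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((payload_bytes + udp_header) / per_fragment)


def datagram_loss(fragment_loss, n):
    """Probability that a datagram is lost when any of its n fragments is lost."""
    return 1 - (1 - fragment_loss) ** n
```

Under these assumptions an 8 Kbyte transfer needs six fragments, so a 1% fragment loss rate becomes a datagram loss rate of a little under 6%, and every such loss costs a full RPC retransmission.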
Adjusting the retransmit timeout (RTO) interval dynamically
and applying a congestion window on outstanding
requests has been shown to be of some help [Nowicki89] with
the retransmission problem. An alternative to this is to use
TCP transport to deliver the RPC messages reliably
[Macklem90], and one of the performance results in this paper
shows the effects of this further.
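One way to adjust a retransmit timeout dynamically is a smoothed estimator of the kind used by TCP. The sketch below is an assumption for illustration: the gains of 1/8 and 1/4 and the factor of 4 follow the usual TCP retransmission timer computation, and it is not claimed that [Nowicki89] uses these exact values. The class name RtoEstimator is hypothetical.

```python
class RtoEstimator:
    """Smoothed round trip estimate with a variance term (TCP-style)."""

    def __init__(self, first_sample):
        self.srtt = first_sample        # smoothed round trip time, seconds
        self.rttvar = first_sample / 2  # smoothed mean deviation, seconds

    def update(self, sample):
        """Fold in one measured round trip time and return the new timeout."""
        err = sample - self.srtt
        self.srtt += err / 8
        self.rttvar += (abs(err) - self.rttvar) / 4
        return self.rto()

    def rto(self):
        """Current retransmit timeout: the mean plus four deviations."""
        return self.srtt + 4 * self.rttvar
```

With steady 10 millisecond replies the timeout settles close to 10 milliseconds, and a single slow reply pushes it above the slow sample so that the request is not retransmitted prematurely.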
Srinivasan and Mogul [Srinivasan89] enhanced the NFS
protocol to use the Sprite cache consistency algorithm in an
effort to improve performance and to provide full client
cache consistency. This experimental implementation demon-
strated significantly better performance than NFS, but suf-
fered from a lack of crash recovery support. The NQNFS pro-
tocol design borrowed heavily from this work, but differed
from the Sprite algorithm by using Leases instead of file
open state to detect write sharing. The decision to use
Leases was made primarily to avoid the crash recovery problem.
More recent work by the Sprite group [Baker91] and
Mogul [Mogul93] has addressed the crash recovery problem,
making this design tradeoff more questionable now.
Sun has recently updated the NFS protocol to Version 3
[SUN93], using some changes similar to NQNFS to address
various issues. The Version 3 protocol uses 64-bit file sizes
and offsets, provides a Readdir_and_Lookup RPC and an access
RPC. It also provides cache hints, to permit a client to be
able to determine whether a file modification is the result
of that client's write or some other client's write. It
would be possible to add either Spritely NFS or NQNFS sup-
port for cache consistency to the NFS Version 3 protocol.
4. NQNFS Consistency Protocol and Recovery
The NQNFS cache consistency protocol uses a somewhat
Sprite-like [Nelson88] mechanism, but is based on Leases
[Gray89] instead of hard server state information about open
files. The basic principle is that the server disables
client caching of files whenever concurrent write sharing
could occur, by performing a server-to-client callback,
forcing the client to flush its caches and to do all subse-
quent I/O on the file with synchronous RPCs. A Sprite server
maintains a record of the open state of files for all
clients and uses this to determine when concurrent write
sharing might occur. This open state information might also
be referred to as an infinite-term lease for the file, with
explicit lease cancellation. NQNFS, on the other hand, uses
a short-term lease that expires due to timeout after a max-
imum of one minute, unless explicitly renewed by the client.
The fundamental difference is that an NQNFS client must keep
renewing a lease to use cached data whereas a Sprite client
assumes the data is valid until canceled by the server or
the file is closed. Using leases permits the server to
remain "stateless," since the soft state information, which
consists of the set of current leases, is moot after one
minute, when all the leases expire.
Whenever a client wishes to access a file's data it
must hold one of three types of lease: read-caching, write-
caching or non-caching. The latter type requires that all
file operations be done synchronously with the server via
the appropriate RPCs.
A read-caching lease allows for client data caching but
no modifications may be done. It may, however, be shared
between multiple clients. Diagram 1 shows a typical read-
caching scenario. The vertical solid black lines depict the
lease records. Note that the time lines are nowhere near to
scale, since a client/server interaction will normally take
less than one hundred milliseconds, whereas the normal lease
duration is thirty seconds. Every lease includes a modrev
value, which changes upon every modification of the file. It
may be used to check to see if data cached on the client is
still current.
A write-caching lease permits delayed write caching,
but requires that all data be pushed to the server when the
lease expires or is terminated by an eviction callback. When
a write-caching lease has almost expired, the client will
attempt to extend the lease if the file is still open, but
is required to push the delayed writes to the server if
renewal fails (as depicted by diagram 2). The writes may not
arrive at the server until after the write lease has expired
on the client, but this does not result in a consistency
problem, so long as the write lease is still valid on the
server. Note that, in diagram 2, the lease record on the
server remains current after the expiry time, due to the
conditions mentioned in section 5. If a write RPC is done on
the server after the write lease has expired on the server,
this could be considered an error since consistency could be
lost, but it is not handled as such by NQNFS.
Diagram 3 depicts how read and write leases are
replaced by a non-caching lease when there is the potential
for write sharing. A write-caching lease is not used in the
Stanford V Distributed System [Gray89], since synchronous
writing is always used. A side effect of this change is that
the five to ten second lease duration recommended by Gray
was found to be insufficient to achieve good performance for
the write-caching lease. Experimentation showed that thirty
seconds was about optimal for cases where the client and
server are connected to the same local area network, so
thirty seconds is the default lease duration for NQNFS. A
maximum of twice that value is permitted, since Gray showed
that for some network topologies, a larger lease duration
functions better. Although there is an explicit get_lease
RPC defined for the protocol, most lease requests are pig-
gybacked onto the other RPCs to minimize the additional
overhead introduced by leasing.
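The choice among the three lease types can be sketched as a small decision function. This is a simplified reading of the description above, not the 4.4BSD server code; the function name choose_lease and its arguments are hypothetical, and the eviction callbacks themselves are represented only by the list of clients that would have to be notified.

```python
def choose_lease(requested, other_holders):
    """Pick a lease type for a request and list the clients to evict.

    requested: "read" or "write".
    other_holders: dict mapping each other client that holds a current
        caching lease on the file to "read" or "write".
    Returns (lease_type, clients_to_evict).
    """
    if requested not in ("read", "write"):
        raise ValueError(requested)
    if requested == "read":
        writers = sorted(c for c, kind in other_holders.items() if kind == "write")
        if writers:
            # A writer is caching: evict it and fall back to synchronous I/O.
            return "non-caching", writers
        # Read-caching leases may be shared between clients.
        return "read-caching", []
    if other_holders:
        # Any other holder plus a new writer means potential write sharing.
        return "non-caching", sorted(other_holders)
    return "write-caching", []
```

A reader arriving while another client holds a write-caching lease therefore receives a non-caching lease once the writer has been evicted, which matches the write sharing case described for diagram 3.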
4.1. Rationale
Leasing was chosen over hard server state information
for the following reasons:
1. The server must maintain state information about all
current client leases. Since at most one lease is allo-
cated for each RPC and the leases expire after their
lease term, the upper bound on the number of current
leases is the product of the lease term and the server
RPC rate. In practice, it has been observed that less
than 10% of RPCs request new leases and since most
leases have a term of thirty seconds, the following
rule of thumb should estimate the number of server
lease records:
Number of Server Lease Records = 0.1 * 30 * RPC rate
Since each lease record occupies 64 bytes of server
memory, storing the lease records should not be a seri-
ous problem. If a server has exhausted lease storage,
it can simply wait a few seconds for a lease to expire
and free up a record. On the other hand, a Sprite-like
server must store records for all files currently open
by all clients, which can require significant storage
for a large, heavily loaded server. In [Mogul93], it is
proposed that a mechanism vaguely similar to paging
could be used to deal with this for Spritely NFS, but
this appears to introduce a fair amount of complexity
and may limit the usefulness of open records for storing
other state information, such as file locks.
[Diagram #1: Read Caching Leases. Time lines for Client A, Server and Client B; solid vertical lines mark "Lease valid on machine".]
[Diagram #2: Write Caching Lease. Time lines for Client A and Server; solid vertical lines mark "Lease valid on machine".]
[Diagram #3: Write sharing case. Time lines for Client A, Server and Client B; solid vertical lines mark "Lease valid on machine".]
2. After a crash, a server would ordinarily have to recover
the lease records for all outstanding leases. If it instead
waits until every lease has expired, there is no state to
recover. The server must therefore wait for the maximum lease
duration of one minute, and it must serve all outstanding
write requests resulting from terminated write-caching leases
before issuing new leases. The one minute delay can be
overlapped with file system consistency checking (e.g. fsck).
Because no state must be recovered, a lease-based server,
like an NFS server, avoids the problem of state recovery
after a crash.
There can, however, be problems during crash recovery
because of a potentially large number of write backs
due to terminated write-caching leases. One of these
problems is a "recovery storm" [Baker91], which could
occur when the server is overloaded by the number of
write RPC requests. The NQNFS protocol deals with this
by replying with a return status code called
try_again_later to all RPC requests (except write)
until the write requests subside. At this time, there
has not been sufficient testing of server crash
recovery while under heavy server load to determine if
the try_again_later reply is a sufficient solution to
the problem. The other problem is that consistency will
be lost if other RPCs are performed before all of the
write backs for terminated write-caching leases have
completed. This is handled by only performing write
RPCs until no write RPC requests arrive for write_slack
seconds, where write_slack is set to several times the
client timeout retransmit interval, at which time it is
assumed all clients have had an opportunity to send
their writes to the server.
3. Another advantage of leasing is that, since leases are
required at times when other I/O operations occur,
lease requests can almost always be piggybacked on
other RPCs, avoiding some of the overhead associated
with the explicit open and close RPCs required by a
Sprite-like system. Compared with Sprite cache con-
sistency, this can result in a significantly lower RPC
load (see table #1).
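The rule of thumb from item 1 can be turned into a quick estimate of server memory use. The 10% fraction, thirty second term and 64 byte record size are the figures given above; the RPC rate of 1000 per second in the usage note is a hypothetical load chosen only for illustration.

```python
LEASE_PERCENT = 10        # at most about 10% of RPCs request a new lease
LEASE_TERM_SECONDS = 30   # typical lease term
RECORD_BYTES = 64         # size of one server lease record


def lease_records(rpc_rate):
    """Estimated number of current lease records for a given RPC rate."""
    return rpc_rate * LEASE_TERM_SECONDS * LEASE_PERCENT // 100


def lease_memory_bytes(rpc_rate):
    """Estimated server memory consumed by those lease records."""
    return lease_records(rpc_rate) * RECORD_BYTES
```

A server handling 1000 RPCs per second would hold about 3000 lease records, or 192,000 bytes, which supports the claim that lease storage is not a serious problem.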
5. Limitations of the NQNFS Protocol
There is a serious risk when leasing is used for
delayed write caching. If the server is simply too busy to
service a lease renewal before a write-caching lease ter-
minates, the client will not be able to push the write data
to the server before the lease has terminated, resulting in
inconsistency. Note that the danger of inconsistency occurs
when the server assumes that a write-caching lease has ter-
minated before the client has had the opportunity to write
the data back to the server. In an effort to avoid this
problem, the NQNFS server does not assume that a write-
caching lease has terminated until three conditions are met:
1 - clock time > (expiry time + clock_skew)
2 - there is at least one server daemon (nfsd) waiting for an RPC request
3 - no write RPCs received for the leased file within write_slack seconds after the corrected expiry time
The first condition ensures that the lease has expired on
the client. The clock_skew, by default three seconds, must
be set to a value larger than the maximum time-of-day clock
error that is likely to occur during the maximum lease dura-
tion. The second condition attempts to ensure that the
client is not waiting for replies to any writes that are
still queued for service by an nfsd. The third condition
tries to guarantee that the client has transmitted all write
requests to the server, since write_slack is set to several
times the client's timeout retransmit interval.
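The three conditions can be combined into a single predicate, sketched below. This is one possible reading rather than the actual server logic: condition 3 is interpreted as requiring write_slack seconds of quiet measured from the later of the corrected expiry time and the last write RPC for the file. The function and parameter names are hypothetical, and the six second write_slack in the usage note is an assumed value.

```python
def write_lease_terminated(now, expiry, clock_skew, idle_nfsds,
                           last_write, write_slack):
    """Decide whether the server may treat a write-caching lease as terminated.

    now, expiry, last_write: times in seconds (last_write may be None).
    idle_nfsds: number of server daemons currently waiting for an RPC.
    """
    corrected_expiry = expiry + clock_skew
    # Condition 1: the lease must have expired on the client as well.
    if now <= corrected_expiry:
        return False
    # Condition 2: an idle nfsd implies no write is still queued for service.
    if idle_nfsds < 1:
        return False
    # Condition 3: no write RPC for this file during the last write_slack seconds.
    quiet_since = corrected_expiry
    if last_write is not None and last_write > quiet_since:
        quiet_since = last_write
    return now - quiet_since >= write_slack
```

With an expiry time of 100, the default clock_skew of three seconds and a write_slack of six seconds, a write arriving at time 106 keeps the lease alive on the server until time 112.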
There are also certain file system semantics that are
problematic for both NFS and NQNFS, due to the lack of state
information maintained by the server. If a file is unlinked
on one client while open on another it will be removed from
the file server, resulting in failed file accesses on the
client that has the file open. If the file system on the
server is out of space or the client user's disk quota has
been exceeded, a delayed write can fail long after the write
system call was successfully completed. With NFS this error
will be detected by the close system call, since the delayed
writes are pushed upon close. With NQNFS however, the
delayed write RPC may not occur until after the close system
call, possibly even after the process has exited. Therefore,
if a process must check for write errors, a system call such
as fsync must be used.
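A process that needs to learn of such delayed write errors can call fsync before closing the file. The sketch below shows the pattern in Python using the standard os module; the function name and its arguments are illustrative, not part of NQNFS.

```python
import os


def write_and_verify(path, data):
    """Write data to path and force it out before closing.

    If the write-back fails (for example because the server is out of
    space or a quota is exceeded), os.fsync raises OSError here, while
    the caller can still react, instead of the error surfacing after
    close or after the process has exited.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = memoryview(data)
        while remaining:
            written = os.write(fd, remaining)  # may be a partial write
            remaining = remaining[written:]
        os.fsync(fd)  # blocks until the delayed writes have been pushed
    finally:
        os.close(fd)
    return len(data)
```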
Another problem occurs when a process on one client is
running an executable file and a process on another client
starts to write to the file. The read lease on the first
client is terminated by the server, but the client has no
recourse but to terminate the process, since the process is
already in progress on the old executable.
The NQNFS protocol does not support file locking, since
a file lock would require hard state information, that is,
state that must be recovered after a crash.
6. Other NQNFS Protocol Features
NQNFS also includes a variety of minor modifications to
the NFS protocol, in an attempt to address various
limitations. The protocol uses 64-bit file sizes and offsets in
order to handle large files. TCP transport may be used as an
alternative to UDP for cases where UDP does not perform
well. Transport mechanisms such as TCP also permit the use
of much larger read/write data sizes, which might improve
performance in certain environments.
The NQNFS protocol replaces the Readdir RPC with a
Readdir_and_Lookup RPC that returns the file handle and
attributes for each file in the directory as well as name
and file id number. This additional information may then be
loaded into the lookup and file-attribute caches on the
client. Thus, for cases such as "ls -l", the stat system
calls can be performed locally without doing any lookup or
getattr RPCs. Another additional RPC is the Access RPC that
checks for file accessibility against the server. This is
necessary since in some cases the client user ID is mapped
to a different user on the server and doing the access check
locally on the client using file attributes and client
credentials is not correct. One case where this becomes
necessary is when the NQNFS mount point is using Kerberos
authentication, where the Kerberos authentication ticket is
translated to credentials on the server that are mapped to
the client side user id. For further details on the proto-
col, see [Macklem93].
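The effect of Readdir_and_Lookup on a command such as "ls -l" can be pictured with a toy model of the client caches. Everything in this Python sketch is hypothetical (the class names, the fields and the rpc_count counter); it ignores leases and attribute timeouts and only shows how one directory reply can prime both the lookup cache and the file-attribute cache.

```python
from dataclasses import dataclass


@dataclass
class DirEntry:
    """One entry of a (hypothetical) Readdir_and_Lookup reply."""
    name: str
    fileid: int
    fhandle: bytes
    attrs: dict


class ClientCaches:
    def __init__(self):
        self.lookup = {}    # (directory file handle, name) -> file handle
        self.attrs = {}     # file handle -> file attributes
        self.rpc_count = 0  # lookups that missed and would need RPCs

    def prime(self, dir_fh, entries):
        """Load every entry of a directory reply into both caches."""
        for entry in entries:
            self.lookup[(dir_fh, entry.name)] = entry.fhandle
            self.attrs[entry.fhandle] = entry.attrs

    def stat(self, dir_fh, name):
        """Serve stat locally when possible; count a miss otherwise."""
        fh = self.lookup.get((dir_fh, name))
        if fh is None or fh not in self.attrs:
            self.rpc_count += 1  # would require lookup/getattr RPCs
            return None
        return self.attrs[fh]
```

After a single prime call, every stat on a listed name is answered from the caches, which is the saving the text describes.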
7. Performance
In order to evaluate the effectiveness of the NQNFS
protocol, a benchmark was used that was designed to typify
real work on the client workstation. Benchmarks, such as
Laddis [Wittle93], that perform server load characterization
are not appropriate for this work, since it is primarily
client caching efficiency that needs to be evaluated. Since
these tests are measuring overall client system performance
and not just the performance of the file system, each
sequence of runs was performed on identical hardware and
operating system in order to factor out the system
components affecting performance other than the file system
protocol.
The equipment used for all the benchmarks consisted of
members of the DECstation family of workstations using the
MIPS RISC architecture. The operating system running on
these systems was a pre-release version of 4.4BSD Unix. For
all benchmarks, the file server was a DECstation 2100 (10
MIPS) with 8Mbytes of memory and a local RZ23 SCSI disk
(27msec average access time). The clients range in speed
from DECstation 2100s to a DECstation 5000/25, and always
run with six block I/O daemons and a 4Mbyte buffer cache,
except for the test runs where the buffer cache size was the
independent variable. In all cases /tmp is mounted on the
local SCSI disk[2], all machines were attached to the same
uncongested Ethernet, and ran in single user mode during the
benchmarks. Unless noted otherwise, test runs used UDP RPC
transport and the results given are the average values of
four runs.
The benchmark used is the Modified Andrew Benchmark
(MAB) [Ousterhout90], which is a slightly modified version
of the benchmark used to characterize performance of the
Andrew ITC file system [Howard88]. The MAB was set up with
the executable binaries in the remote mounted file system
and the final load step was commented out, due to a linkage
problem during testing under 4.4BSD. Therefore, these
results are not directly comparable to other reported MAB
results. The MAB is made up of five distinct phases:
1. Make five directories (no significant cost)
2. Copy a file system subtree to a working directory
3. Get file attributes (stat) of all the working
files
4. Search for strings (grep) in the files
5. Compile a library of C sources and archive them
Of the five phases, the fifth is by far the largest and is
the one affected most by client caching mechanisms. The
results for phase #1 are invariant over all the caching
mechanisms.
____________________
[2]Testing using the 4.4BSD MFS [McKusick90] resulted in
slightly degraded performance, probably since the machines
only had 16Mbytes of memory, and so paging increased.
7.1. Buffer Cache Size Tests
The first experiment was done to see what effect chang-
ing the size of the buffer cache would have on client per-
formance. A single DECstation 5000/25 was used to do a
series of runs of MAB with different buffer cache sizes for
four variations of the file system protocol. The four varia-
tions are as follows:
Case 1: NFS - The NFS protocol as implemented in 4.4BSD
Case 2: Leases - The NQNFS protocol using leases for cache
consistency
Case 3: Leases, Rdirlookup - The NQNFS protocol using
leases for cache consistency and with the readdir
RPC replaced by Readdir_and_Lookup
Case 4: Leases, Attrib leases, Rdirlookup - The NQNFS pro-
tocol using leases for cache consistency, with the
readdir RPC replaced by the Readdir_and_Lookup,
and requiring a valid lease not only for file-data
access, but also for file-attribute access.
As can be seen in figure 1, the buffer cache achieves about
optimal performance for the range of two to ten megabytes in
size. At eleven megabytes in size, the system pages heavily
and the runs did not complete in a reasonable time. Even at
64Kbytes, the buffer cache improves performance over no
buffer cache by a significant margin of 136-148 seconds
versus 239 seconds. This may be due, in part, to the fact
that the Compile Phase of the MAB uses a rather small work-
ing set of file data. All variants of NQNFS achieve about
the same performance, running around 30% faster than NFS,
with a slightly larger difference for large buffer cache
sizes. Based on these results, all remaining tests were run
with the buffer cache size set to 4Mbytes. Although I do not
know what causes the local peak in the curves between 0.5
and 2 megabytes, there is some indication that contention
for buffer cache blocks, between the update process (which
pushes delayed writes to the server every thirty seconds)
and the I/O system calls, may be involved.
[Figure 1: MAB Phase 5 (compile). Time (sec) versus Buffer
Cache Size (MBytes) for NFS; Leases; Leases, Rdirlookup; and
Leases, Attrib leases, Rdirlookup.]
7.2. Multiple Client Load Tests
During preliminary runs of the MAB, it was observed
that the server RPC counts were reduced significantly by
NQNFS as compared to NFS (table 1). (Spritely NFS and
Ultrix4.3/NFS numbers were taken from [Mogul93] and are not
directly comparable, due to numerous differences in the
experimental setup including deletion of the load step from
phase 5.) This suggests that the NQNFS protocol might scale
better with respect to the number of clients accessing the
server. The experiment described in this section ran the MAB
on one to ten clients concurrently, to observe the
effects of heavier server load. The clients were started at
roughly the same time by pressing all the <return> keys
together and, although not synchronized beyond that point,
all clients would finish the test run within about two
seconds of each other. This was not a realistic load of N
active clients, but it did result in a reproducible increas-
ing client load on the server. The results for the four
variants are plotted in figures 2-5.
For the MAB benchmark, the NQNFS protocol reduces the
RPC counts significantly, but with a minimum of extra over-
head (the GetLease/Open-Close count).
In figure 2, where a subtree of seventy small files is
copied, the difference between the protocol variants is
minimal, with the NQNFS variants performing slightly better.
For this case, the Readdir_and_Lookup RPC is a slight hin-
drance under heavy load, possibly because it results in
larger directory blocks in the buffer cache.
In figure 3, for the phase that gets file attributes
for a large number of files, the leasing variants take about
50% longer, indicating that there are performance problems
in this area. For the case where valid current leases are
required for every file when attributes are returned, the
performance is significantly worse than when the attributes
are allowed to be stale by a few seconds on the client. I
have not been able to explain the oscillation in the curves
for the Lease cases.
___________________________________________________________________________________
|                            Table #1: MAB RPC Counts                             |
| RPC           | Getattr  Read  Write  Lookup  Other  GetLease/Open-Close  Total |
|_______________|_________________________________________________________________|
| BSD/NQNFS     |     277   139    306     575    294                  127   1718 |
| BSD/NFS       |    1210   506    451     489    238                    0   2894 |
| Spritely NFS  |     259   836    192     535    306                 1467   3595 |
| Ultrix4.3/NFS |    1225  1186    476     810    305                    0   4002 |
|_______________|_________________________________________________________________|
[Figure 2: MAB Phase 2 (copying). Time (sec) versus Number of
Clients for NFS; Leases; Leases, Rdirlookup; and Leases,
Attrib leases, Rdirlookup.]
[Figure 3: MAB Phase 3 (stat/find). Time (sec) versus Number
of Clients for the same four variants.]
[Figure 4: MAB Phase 4 (grep/wc/find). Time (sec) versus
Number of Clients for the same four variants.]
[Figure 5: MAB Phase 5 (compile). Time (sec) versus Number of
Clients for the same four variants.]
For the string searching phase depicted in figure 4,
the leasing variants that do not require valid leases for
files when attributes are returned appear to scale better
with server load than NFS. However, the effect appears to be
negligible until the server load is fairly heavy.
Most of the time in the MAB benchmark is spent in the
compilation phase and this is where the differences between
caching methods are most pronounced. In figure 5 it can be
seen that any protocol variant using Leases performs about a
factor of two better than NFS at a load of ten clients. This
indicates that the use of NQNFS may allow servers to handle
significantly more clients for this type of workload.
Table 2 summarizes the MAB run times for all phases for
the single client DECstation 5000/25. The Leases case refers
to using leases, whereas the Leases, Rdirl case uses the
Readdir_and_Lookup RPC as well and the BCache Only case uses
leases, but only the buffer cache and not the attribute or
name caches. The No Caching case does not do any
client-side caching, performing all system calls via
synchronous RPCs to the server.
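The "% Improvement" column of table 2 appears to be the reduction in total elapsed time relative to the NFS case. That reading is an assumption, but the short Python check below reproduces the published column from the published totals; the function name is illustrative.

```python
# Total elapsed times (sec) copied from table 2.
TOTALS = {
    "No Caching": 380,
    "NFS": 197,
    "BCache Only": 188,
    "Leases, Rdirl": 171,
    "Leases": 165,
}


def improvement_over_nfs(case):
    """Percent reduction in total elapsed time relative to NFS."""
    baseline = TOTALS["NFS"]
    return round(100 * (baseline - TOTALS[case]) / baseline)


for case in TOTALS:
    print(f"{case:14s} {improvement_over_nfs(case):4d}%")
```

A negative value, as for the No Caching case, means the run took longer than the NFS baseline.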
7.3. Processor Speed Tests
An important goal of client-side file system caching is
to decouple the I/O system calls from the underlying distri-
buted file system, so that the client's system performance
might scale with processor speed. In order to test this, a
series of MAB runs were performed on three DECstations that
are similar except for processor speed. In addition to the
four protocol variants used for the above tests, runs were
done with the client caches turned off, for worst case per-
formance numbers for caching mechanisms with a 100% miss
rate. The CPU utilization was measured, as an indicator of
how much the processor was blocking for I/O system calls.
Note that since the systems were running in single user mode
and otherwise quiescent, almost all CPU activity was
directly related to the MAB run. The results are presented
in table 3. The CPU time is simply the product of the CPU
utilization and elapsed running time and, as such, is the
optimistic bound on performance achievable with an ideal
client caching scheme that never blocks for I/O. As can be
seen in the table, any caching mechanism achieves signifi-
cantly better performance than when caching is disabled,
roughly doubling the CPU utilization with a corresponding
reduction in run time. For NFS, the CPU utilization is drop-
ping with increase in CPU speed, which would suggest that it
is not scaling with CPU speed. For the NQNFS variants, the
CPU utilization remains at just below 90%, which suggests
that the caching mechanism is working well and scaling
within this CPU range. Note that for this benchmark, the
ratio of CPU times for the DECstation 3100 and DECstation
5000/25 is quite different from what the Dhrystone MIPS
ratings would suggest.
_____________________________________________________________________
|   Table #2: Single DECstation 5000/25 Client Elapsed Times (sec)  |
| Phase          |    1    2    3    4    5   Total   % Improvement |
|________________|__________________________________________________|
| No Caching     |    6   35   41   40  258     380             -93 |
| NFS            |    5   24   15   20  133     197               0 |
| BCache Only    |    5   20   24   23  116     188               5 |
| Leases, Rdirl  |    5   20   21   20  105     171              13 |
| Leases         |    5   19   21   21   99     165              16 |
|________________|__________________________________________________|
_____________________________________________________________________________________________
|                              Table #3: MAB Phase 5 (compile)                              |
|                |   DS2100 (10.5 MIPS)   |   DS3100 (14.0 MIPS)   | DS5000/25 (26.7 MIPS)  |
|                | Elapsed    CPU     CPU | Elapsed    CPU     CPU | Elapsed    CPU     CPU |
|                |    time  Util(%)  time |    time  Util(%)  time |    time  Util(%)  time |
|________________|________________________|________________________|________________________|
| Leases         |     143       89   127 |     113       87    98 |      99       89    88 |
| Leases, Rdirl  |     150       89   134 |     110       91   100 |     105       88    92 |
| BCache Only    |     169       85   144 |     129       78   101 |     116       75    87 |
| NFS            |     172       77   132 |     135       74   100 |     133       71    94 |
| No Caching     |     330       47   155 |     256       41   105 |     258       39   101 |
|________________|________________________|________________________|________________________|
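The CPU time columns of table 3 follow from the definition given above (CPU utilization multiplied by elapsed time). The Python sketch below recomputes them from the published elapsed times and utilizations; rounding half up to whole seconds is an assumption, and the names are illustrative. It also compares the DECstation 3100 to DECstation 5000/25 CPU time ratio with the ratio of their MIPS ratings.

```python
# (elapsed sec, CPU utilization %) for DS2100, DS3100 and DS5000/25,
# copied from table 3.
TABLE3 = {
    "Leases":        [(143, 89), (113, 87), (99, 89)],
    "Leases, Rdirl": [(150, 89), (110, 91), (105, 88)],
    "BCache Only":   [(169, 85), (129, 78), (116, 75)],
    "NFS":           [(172, 77), (135, 74), (133, 71)],
    "No Caching":    [(330, 47), (256, 41), (258, 39)],
}


def cpu_time(elapsed_sec, util_percent):
    """CPU time in whole seconds (integer arithmetic, rounded half up)."""
    return (elapsed_sec * util_percent + 50) // 100


def cpu_times(case):
    return [cpu_time(e, u) for e, u in TABLE3[case]]


# Ratio of DS3100 to DS5000/25 CPU time for the Leases case, against
# the ratio suggested by the MIPS ratings in the table header.
_, ds3100, ds5000 = cpu_times("Leases")
cpu_ratio = ds3100 / ds5000
mips_ratio = 26.7 / 14.0
```

For the Leases case the CPU time ratio is roughly 1.1, while the MIPS ratings would suggest about 1.9.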
Overall, the results seem encouraging, although it
remains to be seen whether or not the caching provided by
NQNFS can continue to scale with CPU performance. There is a
good indication that NQNFS permits a server to scale to more
clients than does NFS, at least for workloads akin to the
MAB compile phase. A more difficult question is "What if the
server is much faster doing write RPCs?" as a result of some
technology such as Prestoserve or write gathering. Since a
significant part of the difference between NFS and NQNFS is
the synchronous writing, it is difficult to predict how much
a server capable of fast write RPCs will negate the perfor-
mance improvements of NQNFS. At the very least, table 1
indicates that the write RPC load on the server has
decreased by approximately 30%, and this reduced write load
should still result in some improvement.
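The reduction in server RPC load can be checked directly against table 1. The following Python sketch uses only the BSD/NQNFS and BSD/NFS rows, since the other two rows are not directly comparable; the function names are illustrative.

```python
# RPC counts copied from the BSD/NQNFS and BSD/NFS rows of table 1.
RPC_COUNTS = {
    "BSD/NQNFS": {"Getattr": 277, "Read": 139, "Write": 306,
                  "Lookup": 575, "Other": 294,
                  "GetLease/Open-Close": 127},
    "BSD/NFS":   {"Getattr": 1210, "Read": 506, "Write": 451,
                  "Lookup": 489, "Other": 238,
                  "GetLease/Open-Close": 0},
}


def total(protocol):
    """Sum of all RPC categories for one protocol."""
    return sum(RPC_COUNTS[protocol].values())


def percent_reduction(category):
    """Percent fewer RPCs under NQNFS than NFS (negative = more RPCs).

    Only meaningful for categories where the NFS count is nonzero.
    """
    nfs = RPC_COUNTS["BSD/NFS"][category]
    nqnfs = RPC_COUNTS["BSD/NQNFS"][category]
    return 100.0 * (nfs - nqnfs) / nfs
```

With these counts the write RPC load falls by about 32%, consistent with the "approximately 30%" cited above, while the Lookup count is somewhat higher under NQNFS.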
Indications are that the Readdir_and_Lookup RPC has not
improved performance for these tests and may in fact be
degrading performance slightly. The results in figure 3
indicate some problems, possibly with handling of the
attribute cache. It seems logical that the Readdir_and_Lookup
RPC should permit priming of the attribute cache, improving
the hit rate, but the results are counter to that.
7.4. Internetwork Delay Tests
This experimental setup was used to explore how the
different protocol variants might perform over internetworks
with larger RPC RTTs. The server was moved to a separate
Ethernet, using a MicroVAXII as an IP router to the other
Ethernet. The 4.3Reno BSD Unix system running on the Micro-
VAXII was modified to delay IP packets being forwarded by a
tunable N millisecond delay. The implementation was rather
crude and did not try to simulate a distribution of delay
times nor was it programmed to drop packets at a given rate,
but it served as a simple emulation of a long, fat net-
work[3] [Jacobson88]. The MAB was run using both UDP and TCP
RPC transports for a variety of RTT delays from five to two
hundred milliseconds, to observe the effects of RTT delay on
RPC transport. It was found that, due to high variability
between runs, four runs did not suffice, so eight runs were
done at each value. The results in figure 6 and table 4 are
the averages for the eight runs.
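The fixed-delay forwarding described above can be pictured with a
short sketch. The following Python fragment is only an illustration,
not the kernel change made to the 4.3Reno router: the class name
DelayQueue, its methods, and the use of an explicit clock value in
milliseconds are all assumptions made for the example. Like the
router modification, it applies a single tunable delay to every
packet and neither varies the delay nor drops packets.

```python
from collections import deque


class DelayQueue:
    """Hold forwarded packets for a fixed delay (hypothetical illustration)."""

    def __init__(self, delay_ms):
        self.delay_ms = delay_ms      # the tunable N millisecond delay
        self._pending = deque()       # (release_time_ms, packet) in arrival order

    def enqueue(self, packet, now_ms):
        # Every packet gets the same delay, so arrival order is also
        # release order as long as the clock never runs backwards.
        self._pending.append((now_ms + self.delay_ms, packet))

    def release(self, now_ms):
        """Return, in arrival order, the packets whose delay has elapsed."""
        ready = []
        while self._pending and self._pending[0][0] <= now_ms:
            ready.append(self._pending.popleft()[1])
        return ready


if __name__ == "__main__":
    q = DelayQueue(delay_ms=40)
    q.enqueue("request", now_ms=0)
    q.enqueue("reply", now_ms=10)
    print(q.release(now_ms=39))   # nothing is due yet
    print(q.release(now_ms=45))   # only the first packet is due
    print(q.release(now_ms=50))   # the second packet is now due
```

A more faithful emulator would draw each delay from a distribution
and discard a configurable fraction of packets, which is the
enhancement the fixed-delay approach lacks.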
I found these results somewhat surprising, since I had
assumed that stability across an internetwork connection
would be a function of RPC transport protocol. Looking at
the standard deviations observed between the eight runs,
there is an indication that the NQNFS protocol plays a
larger role in maintaining stability than the underlying RPC
transport protocol. It appears that NFS over TCP transport
is the least stable variant tested. It should be noted that
the TCP implementation used was roughly at the 4.3BSD Tahoe
release level and that the 4.4BSD TCP implementation was far less
stable and would fail intermittently, due to a bug I was not
able to isolate. It would appear that some of the recent
enhancements to the 4.4BSD TCP implementation have a detri-
mental effect on the performance of RPC-type traffic loads,
which intermix small and large data transfers in both direc-
tions. It is obvious that more exploration of this area is
needed before any conclusions can be made beyond the fact
that over a local area network, TCP transport provides per-
formance comparable to UDP.
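One way to compare run-to-run variability across the variants is to
divide each standard deviation in table 4 by the corresponding mean
elapsed time. The sketch below does this for the 200 msec row only.
The choice of this ratio (the coefficient of variation) as a
stability measure and the names used in the code are assumptions
made for illustration; they are not part of the original analysis.

```python
# Mean elapsed time (sec) and standard deviation for the 200 msec row of table 4.
ROW_200_MSEC = {
    "NFS,UDP": (372, 35.0),
    "NFS,TCP": (506, 235.1),
    "Leases,UDP": (338, 25.2),
    "Leases,TCP": (379, 69.2),
}


def relative_spread(mean, std_dev):
    """Standard deviation as a fraction of the mean (coefficient of variation)."""
    return std_dev / mean


def rank_by_spread(row):
    """Return variant names ordered from least to most variable."""
    return sorted(row, key=lambda name: relative_spread(*row[name]))


if __name__ == "__main__":
    for name in rank_by_spread(ROW_200_MSEC):
        mean, std_dev = ROW_200_MSEC[name]
        print(f"{name:<11} {100 * relative_spread(mean, std_dev):5.1f}%")
```

For this row the ratio is smallest for Leases,UDP and largest for
NFS,TCP, which is consistent with the observation that NFS over TCP
was the least stable variant, although a single row is far from
conclusive.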
8. Lessons Learned
Evaluating the performance of a distributed file system
is fraught with difficulties, due to the many software and
hardware factors involved. The limited benchmarking
presented here took a considerable amount of time and the
results gained by the exercise only give indications of what
the performance might be for a few scenarios.
The IP router with delay introduction proved to be a
valuable tool for protocol debugging[4], and may be useful
for a more extensive study of performance over internetworks
____________________
[3]Long fat networks refer to network interconnections
with a Bandwidth X RTT product > 10^5 bits.
[4]It exposed two bugs in the 4.4BSD networking, one a
problem in the Lance chip driver for the DECstation and the
other a TCP window sizing problem that I was not able to
isolate.
[Figure #6: MAB Phase 5 (compile). Plot of Time (sec) against
Round Trip Delay (msec), with one curve each for Leases,UDP,
Leases,TCP, NFS,UDP and NFS,TCP.]
__________________________________________________________________________________________
|                Table #4: MAB Phase 5 (compile) for Internetwork Delays                 |
|        |      NFS,UDP      |      NFS,TCP      |    Leases,UDP     |    Leases,TCP     |
| Delay  | Elapsed  Standard | Elapsed  Standard | Elapsed  Standard | Elapsed  Standard |
| (msec) |  (sec)   Deviation|  (sec)   Deviation|  (sec)   Deviation|  (sec)   Deviation|
|________|___________________|___________________|___________________|___________________|
|     5  |   139       2.9   |   139       2.4   |   112       7.0   |   108       6.0   |
|    40  |   175       5.1   |   208      44.5   |   150      23.8   |   139       4.3   |
|    80  |   207       3.9   |   213       4.7   |   180       7.7   |   210      52.9   |
|   120  |   276      29.3   |   273      17.1   |   221       7.7   |   238       5.8   |
|   160  |   304       7.2   |   328      77.1   |   275      21.5   |   274      10.1   |
|   200  |   372      35.0   |   506     235.1   |   338      25.2   |   379      69.2   |
|________|___________________|___________________|___________________|___________________|
if enhanced to do a better job of simulating internetwork
delay and packet loss.
The Leases mechanism provided a simple model for the
provision of cache consistency and did seem to improve per-
formance for various scenarios. Unfortunately, it does not
provide the server state information that is required for
file system semantics, such as locking, that many software
systems demand. In production environments on my campus, the
need for file locking and the correct generation of the
ETXTBSY error code are far more important than full cache
consistency, and leasing does not satisfy these needs.
Another file system semantic that requires hard server state
is the delay of file removal until the last close system
call. Although Spritely NFS did not support this semantic
either, it is logical that the open file state maintained by
that system would facilitate the implementation of this
semantic more easily than would the Leases mechanism.
9. Further Work
The current implementation uses a fixed, moderately sized
buffer cache designed for the local UFS [McKusick84] file
system. The results in figure 1 suggest that this is ade-
quate so long as the cache is of an appropriate size. How-
ever, a mechanism permitting the cache to vary in size has
been shown to outperform fixed sized buffer caches [Nel-
son90], and could be beneficial. It could also be useful to
allow the buffer cache to grow very large by making use of
local backing store for cases where server performance is
limited. A very large buffer cache size would in turn permit
experimentation with much larger read/write data sizes,
facilitating bulk data transfers across long fat networks,
such as will characterize the Internet of the near future. A
careful redesign of the buffer cache mechanism to provide
support for these features would probably be the next imple-
mentation step.
The results in figure 3 indicate that the mechanics of
caching file attributes and maintaining the attribute
cache's consistency need to be examined further. There
also needs to be more work done on the interaction between a
Readdir_and_Lookup RPC and the name and attribute caches, in
an effort to reduce Getattr and Lookup RPC loads.
The NQNFS protocol has never been used in a production
environment and doing so would provide needed insight into
how well the protocol satisfies the needs of real workstation
environments. It is hoped that the distribution of the
implementation in 4.4BSD will facilitate use of the protocol
in production environments elsewhere.
The big question that needs to be resolved is whether
Leases are an adequate mechanism for cache consistency or
whether hard server state is required. Given the work
presented here and in the papers related to Sprite and
Spritely NFS, there are clear indications that a cache con-
sistency algorithm can improve both performance and file
system semantics. As yet, however, it is unclear what the
best approach to maintaining consistency is. It would appear
that hard state information is required for file locking and
other mechanisms and, if so, it seems appropriate to use it
for cache consistency as well.
10. Acknowledgements
I would like to thank the members of the CSRG at the
University of California, Berkeley for their continued sup-
port over the years. Without their encouragement and assis-
tance this software would never have been implemented. Prof.
Jim Linders and Prof. Tom Wilson here at the University of
Guelph helped proofread this paper and Jeffrey Mogul pro-
vided a great deal of assistance, helping to turn my gibber-
ish into something at least moderately readable.
11. References
[Baker91] Mary Baker and John Ousterhout, Availability
in the Sprite Distributed File System, In
Operating Systems Review, (25)2, pg. 95-98,
April 1991.
[Baker91a] Mary Baker, private communication, May 1991.
[Burrows88] Michael Burrows, Efficient Data Sharing,
Technical Report #153, Computer Laboratory,
University of Cambridge, Dec. 1988.
[Gray89] Cary G. Gray and David R. Cheriton, Leases:
An Efficient Fault-Tolerant Mechanism for
Distributed File Cache Consistency, In Proc.
of the Twelfth ACM Symposium on Operating
Systems Principles, Litchfield Park, AZ, Dec.
1989.
[Howard88] John H. Howard, Michael L. Kazar, Sherri G.
Menees, David A. Nichols, M. Satyanarayanan,
Robert N. Sidebotham and Michael J. West,
Scale and Performance in a Distributed File
System, ACM Trans. on Computer Systems, (6)1,
pg 51-81, Feb. 1988.
[Jacobson88] Van Jacobson and R. Braden, TCP Extensions
for Long-Delay Paths, ARPANET Working Group
Requests for Comment, DDN Network Information
Center, SRI International, Menlo Park, CA,
October 1988, RFC-1072.
[Jacobson89] Van Jacobson, Sun NFS Performance Problems,
Private Communication, November, 1989.
[Juszczak89] Chet Juszczak, Improving the Performance and
Correctness of an NFS Server, In Proc. Winter
1989 USENIX Conference, pg. 53-63, San Diego,
CA, January 1989.
[Juszczak94] Chet Juszczak, Improving the Write Perfor-
mance of an NFS Server, to appear in Proc.
Winter 1994 USENIX Conference, San Francisco,
CA, January 1994.
[Kazar88] Michael L. Kazar, Synchronization and Caching
Issues in the Andrew File System, In Proc.
Winter 1988 USENIX Conference, pg. 27-36,
Dallas, TX, February 1988.
[Kent87] Christopher A. Kent and Jeffrey C. Mogul,
Fragmentation Considered Harmful, Research
Report 87/3, Digital Equipment Corporation
Western Research Laboratory, Dec. 1987.
[Kent87a] Christopher A. Kent, Cache Coherence in Distributed
Systems, Research Report 87/4, Digital Equipment
Corporation Western Research Laboratory, April 1987.
[Macklem90] Rick Macklem, Lessons Learned Tuning the
4.3BSD Reno Implementation of the NFS Proto-
col, In Proc. Winter 1991 USENIX Conference,
pg. 53-64, Dallas, TX, January 1991.
[Macklem93] Rick Macklem, The 4.4BSD NFS Implementation,
In The System Manager's Manual, 4.4 Berkeley
Software Distribution, University of Califor-
nia, Berkeley, June 1993.
[McKusick84] Marshall K. McKusick, William N. Joy, Samuel
J. Leffler and Robert S. Fabry, A Fast File
System for UNIX, ACM Transactions on Computer
Systems, Vol. 2, Number 3, pg. 181-197,
August 1984.
[McKusick90] Marshall K. McKusick, Michael J. Karels and
Keith Bostic, A Pageable Memory Based
Filesystem, In Proc. Summer 1990 USENIX
Conference, pg. 137-143, Anaheim, CA, June
1990.
[Mogul93] Jeffrey C. Mogul, Recovery in Spritely NFS,
Research Report 93/2, Digital Equipment Cor-
poration Western Research Laboratory, June
1993.
[Moran90] Joseph Moran, Russel Sandberg, Don Coleman,
Jonathan Kepecs and Bob Lyon, Breaking
Through the NFS Performance Barrier, In Proc.
Spring 1990 EUUG Conference, pg. 199-206,
Munich, FRG, April 1990.
[Nelson88] Michael N. Nelson, Brent B. Welch, and John
K. Ousterhout, Caching in the Sprite Network
File System, ACM Transactions on Computer
Systems (6)1 pg. 134-154, February 1988.
[Nelson90] Michael N. Nelson, Virtual Memory vs. The
File System, Research Report 90/4, Digital
Equipment Corporation Western Research
Laboratory, March 1990.
[Nowicki89] Bill Nowicki, Transport Issues in the Network
File System, In Computer Communication
Review, pg. 16-20, March 1989.
[Ousterhout90] John K. Ousterhout, Why Aren't Operating Sys-
tems Getting Faster As Fast as Hardware? In
Proc. Summer 1990 USENIX Conference, pg.
247-256, Anaheim, CA, June 1990.
[Sandberg85] Russel Sandberg, David Goldberg, Steve Klei-
man, Dan Walsh, and Bob Lyon, Design and
Implementation of the Sun Network filesystem,
In Proc. Summer 1985 USENIX Conference, pages
119-130, Portland, OR, June 1985.
[Srinivasan89] V. Srinivasan and Jeffrey C. Mogul, Spritely
NFS: Experiments with Cache-Consistency Protocols,
In Proc. of the Twelfth ACM Symposium on Operating
Systems Principles, Litchfield Park, AZ, Dec. 1989.
[Steiner88] J. G. Steiner, B. C. Neuman and J. I.
Schiller, Kerberos: An Authentication Service
for Open Network Systems, In Proc. Winter
1988 USENIX Conference, pg. 191-202, Dallas,
TX, February 1988.
[SUN89] Sun Microsystems Inc., NFS: Network File Sys-
tem Protocol Specification, ARPANET Working
Group Requests for Comment, DDN Network
Information Center, SRI International, Menlo
Park, CA, March 1989, RFC-1094.
[SUN93] Sun Microsystems Inc., NFS: Network File Sys-
tem Version 3 Protocol Specification, Sun
Microsystems Inc., Mountain View, CA, June
1993.
[Wittle93] Mark Wittle and Bruce E. Keith, LADDIS: The
Next Generation in NFS File Server Benchmark-
ing, In Proc. Summer 1993 USENIX Conference,
pg. 111-128, Cincinnati, OH, June 1993.
____________________
- NFS is believed to be a trademark of Sun Microsystems,
Inc.
- Prestoserve is a trademark of Legato Systems, Inc.
S MIPS is a trademark of Silicon Graphics, Inc.
- DECstation, MicroVAXII and Ultrix are trademarks of Di-
gital Equipment Corp.
= Unix is a trademark of Novell, Inc.