MirOS Manual: nqnfs(PAPERS)


          Reprinted with  permission  from  the  "Proceedings  of  the
          Winter 1994 Usenix Conference", January 1994, San Francisco,
          CA, Copyright The Usenix Association.

                 Not Quite NFS, Soft Cache Consistency for NFS

                                  Rick Macklem
                              University of Guelph

                                    Abstract

               There are some constraints inherent in the  NFS  proto-
          col  that result in performance limitations for high perfor-
          mance workstation  environments.  This  paper  discusses  an
          NFS-like  protocol  named Not Quite NFS (NQNFS), designed to
          address some of these limitations.  This  protocol  provides
          full  cache  consistency during normal operation, while per-
          mitting more effective client-side caching in an  effort  to
          improve  performance. There are also a variety of minor pro-
          tocol changes, in order to resolve various NFS  issues.  The
          emphasis  is on observed performance of a preliminary imple-
          mentation of the protocol, in order to show  how  well  this
          design  works  and  to  suggest  possible  areas for further
          improvement.

          1. Introduction

               It has been observed that overall  workstation  perfor-
          mance  has  not  been  scaling with processor speed and that
          file  system  I/O  is  a  limiting  factor   [Ousterhout90].
          Ousterhout  notes  that  a principal challenge for operating
          system developers is the decoupling  of  system  calls  from
          their underlying I/O operations, in order to improve average
          system call response times. For  distributed  file  systems,
          every  synchronous  Remote  Procedure  Call  (RPC)  takes  a
          minimum of a few milliseconds and, as such, is analogous  to
          an underlying I/O operation. This suggests that client cach-
          ing with a very good hit ratio  for  read  type  operations,
          along  with  asynchronous  writing,  is required in order to
          avoid delays waiting for RPC replies. However, the NFS  pro-
          tocol requires that the server be stateless[1] and does  not
          provide any explicit mechanism for client cache consistency,
          putting constraints on how the client may cache  data.  This
          paper  describes  an NFS-like protocol that includes a cache
          consistency component designed  to  enhance  client  caching
          performance.  It  does provide full consistency under normal
          operation, but without requiring that hard state information
          be  maintained  on  the  server.  Design tradeoffs were made
          ____________________
             [1]To function correctly, the server must not require any
          state that may be lost due to a crash.

          towards simplicity and  high  performance  over  cache  con-
          sistency under abnormal conditions. The protocol design uses
          a variation of Leases  [Gray89]  to  provide  state  on  the
          server that does not need to be recovered after a crash.

               The protocol also includes changes designed to  address
          other  limitations  of  NFS in a modern workstation environ-
          ment. The use of TCP transport is  optionally  available  to
          avoid  the  pitfalls of Sun RPC over UDP transport when run-
          ning   across   an   internetwork   [Nowicki89].    Kerberos
          [Steiner88] support is available to do proper user authenti-
          cation, in order to provide improved security and  arbitrary
          client  to server user ID mappings. There are also a variety
          of other changes to accommodate large file systems, such  as
          64bit  file sizes and offsets, as well as lifting the 8Kbyte
          I/O size limit. The remainder of this paper gives  an  over-
          view  of the protocol, highlighting performance related com-
          ponents, followed by an evaluation of resultant  performance
          for the 4.4BSD implementation.

          2. Distributed File Systems and Caching

               Clients using distributed file systems cache  recently-
          used  data  in  order  to  reduce  the number of synchronous
          server operations, and therefore  improve  average  response
          times  for  system  calls.  Unfortunately,  maintaining con-
          sistency between these caches is a  problem  whenever  write
          sharing  occurs;  that is, when a process on a client writes
          to a file and one or more processes on other client(s)  read
          the file. If the writer closes the file before any reader(s)
          open the file for reading, this is called  sequential  write
          sharing.  Both the Andrew ITC file system [Howard88] and NFS
          [Sandberg85] maintain consistency for sequential write shar-
          ing  by  requiring the writer to push all the writes through
          to the server on close and having readers check  to  see  if
          the  file  has been modified upon open. If the file has been
          modified, the client throws away all cached  data  for  that
          file,  as  it  is  now  stale. NFS implementations typically
          detect file modification by checking a cached  copy  of  the
          file's  modification  time; since this cached value is often
          several seconds out of date and only has a resolution of one
          second,  an NFS client often uses stale cached data for some
          time after the file has been updated on the server.

               A more difficult  case  is  concurrent  write  sharing,
          where  write operations are intermixed with read operations.
          Consistency for this case, often referred to as "full  cache
          consistency,"  requires  that  a  reader always receives the
          most recently written data. Neither NFS nor the  Andrew  ITC
          file system maintain consistency for this case. The simplest
          mechanism for maintaining full cache consistency is the  one
          used by Sprite [Nelson88], which disables all client caching
          of the file whenever concurrent write sharing  might  occur.

          There  are  other  mechanisms  described  in  the literature
          [Kent87a, Burrows88], but they appeared to be too  elaborate
          for  incorporation  into NQNFS (for example, Kent's requires
          specialized hardware). NQNFS differs from Sprite in the  way
          it detects write sharing. The Sprite server maintains a list
          of files currently open by the various clients  and  detects
          write  sharing  when  a  file  open  request  for writing is
          received and the file is already open for reading  (or  vice
          versa).  This  list  of open files is hard state information
          that must be recovered after a server crash, which is a sig-
          nificant problem in its own right [Mogul93, Welch90].

               The approach used by NQNFS is a variant of  the  Leases
          mechanism  [Gray89].  In  this model, the server issues to a
          client a promise, referred to as a "lease," that the  client
          may  cache  a  specific  object  without fear of conflict. A
          lease has a limited duration and  must  be  renewed  by  the
          client  if  it  wishes  to  continue to cache the object. In
          NQNFS, clients hold short-term (up to one minute) leases  on
          files  for  reading  or writing. The leases are analogous to
          entries in the open file list, except that they expire after
          the  lease  term  unless renewed by the client. As such, one
          minute after issuing the last lease  there  are  no  current
          leases  and therefore no lease records to be recovered after
          a crash, hence the term "soft server state."
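               The soft-state property can be sketched as  follows  (a
          toy  Python  model with hypothetical names, not the 4.4BSD
          code): a server-side lease table whose records simply  lapse
          after the lease term.

```python
import time

LEASE_TERM = 30.0  # default NQNFS lease duration, in seconds

class LeaseTable:
    """Toy model of soft server state (hypothetical names, not the
    4.4BSD structures): lease records that lapse after their term."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.expiry = {}            # file handle -> expiry time

    def issue(self, fhandle, term=LEASE_TERM):
        # Issuing and renewing both just push the expiry time forward.
        self.expiry[fhandle] = self.clock() + term

    renew = issue

    def current(self):
        # Drop lapsed records; what remains is the only lease state.
        now = self.clock()
        self.expiry = {f: t for f, t in self.expiry.items() if t > now}
        return set(self.expiry)
```

          One lease term after the last issue or renewal, current() is
          empty, which is why a rebooted server has no lease records
          to recover.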

               A related design consideration is the way client  writ-
          ing is done. Synchronous writing requires that all writes be
          pushed through to the server during the write  system  call.
          This  is  the  simplest variant, from a consistency point of
          view, since the server always has the most recently  written
          data. It also permits any write errors, such as "file system
          out of space" to be propagated back to the client's  process
          via   the  write  system  call  return.  Unfortunately  this
          approach limits the client write rate, based on server write
          performance and client/server RPC round trip time (RTT).

               An alternative to this is delayed  writing,  where  the
          write  system  call returns as soon as the data is cached on
          the client and the data is written to  the  server  sometime
          later.  This  permits client writing to occur at the rate of
          local storage access up to the  size  of  the  local  cache.
          Also,   for  cases  where  file  truncation/deletion  occurs
          shortly after writing,  the  write  to  the  server  may  be
          avoided  since  the  data has already been deleted, reducing
          server write load. There are some obvious drawbacks to  this
          approach.  For  any Sprite-like system to maintain full con-
          sistency, the server must "callback" to the client to  cause
          the  delayed  writes  to  be written back to the server when
          write sharing is about to occur.  There  are  also  problems
          with  the  propagation  of errors back to the client process
          that issued the write system call. The reason  for  this  is
          that  the system call has already returned without reporting

          an error and the process may also have  already  terminated.
          As  well,  there  is  a risk of the loss of recently written
          data if the client crashes before the data is  written  back
          to the server.

               A compromise between these two  alternatives  is  asyn-
          chronous writing, where the write to the server is initiated
          during the write system  call  but  the  write  system  call
          returns  before the write completes. This approach minimizes
          the risk of data loss due to a client crash, but negates the
          possibility of reducing server write load by throwing writes
          away when a file is truncated or deleted.

               NFS implementations usually do a  mix  of  asynchronous
          and  delayed  writing but push all writes to the server upon
          close, in order to maintain open/close consistency.  Pushing
          the  delayed writes on close negates much of the performance
          advantage of delayed writing, since  the  delays  that  were
          avoided  in the write system calls are observed in the close
          system call. Akin to Sprite, the NQNFS protocol does delayed
          writing  in an effort to achieve good client performance and
          uses a  callback  mechanism  to  maintain  full  cache  con-
          sistency.
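               The delayed-writing tradeoffs above  can  be  sketched
          with  a  small  illustrative model (hypothetical names, not
          the 4.4BSD buffer cache): write() returns as  soon  as  the
          data  is  cached locally, deletion before writeback discards
          the pending writes, and a flush pushes the rest, as  on  an
          eviction callback or at close.

```python
class DelayedWriteCache:
    """Sketch of client-side delayed writing: writes are buffered
    locally and pushed to the server later (illustrative only)."""

    def __init__(self, server_write):
        self.server_write = server_write  # issues the write RPC
        self.dirty = {}                   # (fhandle, offset) -> data

    def write(self, fhandle, offset, data):
        # Returns immediately; no RPC is issued yet.
        self.dirty[(fhandle, offset)] = data

    def remove(self, fhandle):
        # File deleted before writeback: the writes are avoided
        # entirely, reducing server write load.
        self.dirty = {k: v for k, v in self.dirty.items()
                      if k[0] != fhandle}

    def flush(self, fhandle=None):
        # Push delayed writes, e.g. on a callback or lease expiry.
        for (fh, off), data in sorted(self.dirty.items()):
            if fhandle is None or fh == fhandle:
                self.server_write(fh, off, data)
        self.dirty = {k: v for k, v in self.dirty.items()
                      if fhandle is not None and k[0] != fhandle}
```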

          3. Related Work

               There has been a great deal of effort put into  improv-
          ing  the  performance  and  consistency of the NFS protocol.
          This work can be put in two categories: the  first  consists
          of  implementation enhancements for the NFS protocol and the
          second involves modifications to the protocol.

               The work done on  implementation  enhancements  has
          attacked two problem areas: NFS server write performance and
          RPC transport problems. Server write performance is a  major
          problem  for NFS, in part due to the requirement to push all
          writes to the server upon close and in part due to the  fact
          that,  for  writes, all data and meta-data must be committed
          to non-volatile storage before the  server  replies  to  the
          write  RPC.  The  Prestoserve  [Moran90]  system  uses  non-
          volatile RAM as a buffer for recently written  data  on  the
          server, so that the write RPC replies can be returned to the
          client before the data is written to the disk surface. Write
          gathering  [Juszczak94]  is a software technique used on the
          server where a write RPC request is delayed for a short time
          in  the  hope  that  another  contiguous  write request will
          arrive, so that they can be merged into one write operation.
          Since  the  replies  to  all  of  the  merged writes are not
          returned to the client until the  write  operation  is  com-
          pleted, this delay does not violate the protocol. When write
          operations are merged, the number  of  disk  writes  can  be
          reduced, improving server write performance. Although either
          of the above reduces write RPC response time for the server,

          it  cannot be reduced to zero, and so, any client side cach-
          ing mechanism that reduces write RPC load or  client  depen-
          dence  on  server  RPC  response  time  should still improve
          overall performance. Good client side caching should be com-
          plementary  to these server techniques, although client per-
          formance improvements as a result of  caching  may  be  less
          dramatic when these techniques are used.

               In NFS, each Sun RPC  request  is  packaged  in  a  UDP
          datagram for transmission to the server. A timer is started,
          and if a timeout occurs before the corresponding  RPC  reply
          is received, the RPC request is retransmitted. There are two
          problems with this model. First, when a  retransmit  timeout
          occurs, the RPC may be redone, instead of simply retransmit-
          ting the RPC request message to the server. A recent-request
          cache  can  be  used  on the server to minimize the negative
          impact of redoing RPCs [Juszczak89]. The second  problem  is
          that  a  large UDP datagram, such as a read request or write
          reply, must be fragmented by IP and if any one  IP  fragment
          is  lost  in  transit,  the  entire  UDP  datagram  is  lost
          [Kent87]. Since entire requests and replies are packaged  in
          a  single  UDP  datagram,  this  puts  an upper bound on the
          read/write data size (8 kbytes).
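               A recent-request cache of the sort described  in  [Jus-
          zczak89]  can  be  sketched as follows (a simplified illus-
          tration; the real cache must also handle in-progress requests
          and  distinguish non-idempotent operations): the server keys
          recent replies by client and transaction ID, so  a  retrans-
          mitted request is replayed from the cache rather than redone.

```python
class RecentRequestCache:
    """Simplified server-side cache of recent RPC replies, keyed by
    (client, xid), so a retransmitted request is answered from the
    cache instead of being re-executed."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.cache = {}     # (client, xid) -> reply
        self.order = []     # FIFO eviction order

    def handle(self, client, xid, execute):
        key = (client, xid)
        if key in self.cache:
            return self.cache[key]   # retransmission: replay reply
        reply = execute()            # first arrival: do the RPC
        self.cache[key] = reply
        self.order.append(key)
        if len(self.order) > self.capacity:
            del self.cache[self.order.pop(0)]
        return reply
```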

               Adjusting the retransmission timeout  interval  dynami-
          cally  and  applying  a  congestion  window  on  outstanding
          requests has been shown to be of some help [Nowicki89]  with
          the retransmission problem. An alternative to this is to use
          TCP transport to deliver the RPC  messages  reliably  [Mack-
          lem90];  one  of  the performance results in this paper shows
          the effects of this.
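               One common way to adapt the retransmission  timeout  is
          a  Jacobson-style  smoothed estimate of the round trip time;
          the sketch below is illustrative and is not necessarily  the
          algorithm used in [Nowicki89].

```python
class RetransmitTimer:
    """Jacobson-style smoothed RTT estimator: the timeout tracks the
    mean RTT plus a multiple of its mean deviation (illustrative)."""

    def __init__(self, srtt=1.0, rttvar=0.5):
        self.srtt = srtt        # smoothed round trip time
        self.rttvar = rttvar    # smoothed mean deviation

    def sample(self, rtt):
        # Standard gains of 1/4 for the deviation and 1/8 for the mean.
        self.rttvar += 0.25 * (abs(rtt - self.srtt) - self.rttvar)
        self.srtt += 0.125 * (rtt - self.srtt)

    def timeout(self):
        return self.srtt + 4.0 * self.rttvar
```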

               Srinivasan and Mogul [Srinivasan89]  enhanced  the  NFS
          protocol to use the Sprite cache consistency algorithm in an
          effort to improve performance and  to  provide  full  client
          cache  consistency.  This experimental implementation demon-
          strated significantly better performance than NFS, but  suf-
          fered  from a lack of crash recovery support. The NQNFS pro-
          tocol design borrowed heavily from this work,  but  differed
          from  the  Sprite  algorithm by using Leases instead of file
          open state to detect write  sharing.  The  decision  to  use
          Leases  was made primarily to avoid the crash recovery prob-
          lem. More recent work by  the  Sprite  group  [Baker91]  and
          Mogul  [Mogul93]  has  addressed the crash recovery problem,
          making this design tradeoff more questionable now.

               Sun has recently updated the NFS protocol to Version  3
          [SUN93],  using  some  changes  similar  to NQNFS to address
          various issues. The Version 3 protocol uses 64bit file sizes
          and offsets, provides a Readdir_and_Lookup RPC and an access
          RPC. It also provides cache hints, to permit a client to  be
          able  to determine whether a file modification is the result
          of that client's write or  some  other  client's  write.  It

          would  be  possible to add either Spritely NFS or NQNFS sup-
          port for cache consistency to the NFS Version 3 protocol.

          4. NQNFS Consistency Protocol and Recovery

               The NQNFS cache consistency protocol  uses  a  somewhat
          Sprite-like  [Nelson88]  mechanism,  but  is based on Leases
          [Gray89] instead of hard server state information about open
          files.  The  basic  principle  is  that  the server disables
          client caching of files whenever  concurrent  write  sharing
          could  occur,  by  performing  a  server-to-client callback,
          forcing the client to flush its caches and to do all  subse-
          quent I/O on the file with synchronous RPCs. A Sprite server
          maintains a record of  the  open  state  of  files  for  all
          clients  and  uses  this  to determine when concurrent write
          sharing might occur. This open state information might  also
          be  referred to as an infinite-term lease for the file, with
          explicit lease cancellation. NQNFS, on the other hand,  uses
          a  short-term lease that expires due to timeout after a max-
          imum of one minute, unless explicitly renewed by the client.
          The fundamental difference is that an NQNFS client must keep
          renewing a lease to use cached data whereas a Sprite  client
          assumes  the  data  is valid until canceled by the server or
          the file is closed.  Using  leases  permits  the  server  to
          remain  "stateless," since the soft state information, which
          consists of the set of current leases,  is  moot  after  one
          minute, when all the leases expire.

               Whenever a client wishes to access  a  file's  data  it
          must  hold one of three types of lease: read-caching, write-
          caching or non-caching. The latter type  requires  that  all
          file  operations  be  done synchronously with the server via
          the appropriate RPCs.

               A read-caching lease allows for client data caching but
          no  modifications  may  be  done. It may, however, be shared
          between multiple clients. Diagram 1 shows  a  typical  read-
          caching  scenario. The vertical solid black lines depict the
          lease records. Note that the time lines  are  not  drawn  to
          scale,  since a client/server interaction will normally take
          less than one hundred milliseconds, whereas the normal lease
          duration  is  thirty  seconds. Every lease includes a modrev
          value, which changes upon every modification of the file. It
          may  be used to check to see if data cached on the client is
          still current.
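               The modrev check can be sketched as  follows  (hypothet-
          ical  names):  the client remembers the modrev that was cur-
          rent when the data was cached and compares  it  against  the
          modrev  in  a  newly granted lease; an unchanged modrev means
          the cache is still good.

```python
class CachedFile:
    """Client cache entry revalidated by modrev (illustrative names).
    The modrev changes upon every modification of the file, so an
    unchanged modrev means the cached data is still current."""

    def __init__(self, data, modrev):
        self.data = data
        self.modrev = modrev

    def revalidate(self, lease_modrev, fetch):
        if lease_modrev != self.modrev:
            # File changed on the server: cached data is stale.
            self.data = fetch()
            self.modrev = lease_modrev
        return self.data
```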

               A write-caching lease permits  delayed  write  caching,
          but  requires that all data be pushed to the server when the
          lease expires or is terminated by an eviction callback. When
          a  write-caching  lease  has almost expired, the client will
          attempt to extend the lease if the file is still  open,  but
          is  required  to  push  the  delayed writes to the server if
          renewal fails (as depicted by diagram 2). The writes may not

          arrive at the server until after the write lease has expired
          on the client, but this does not  result  in  a  consistency
          problem,  so  long  as the write lease is still valid on the
          server. Note that, in diagram 2, the  lease  record  on  the
          server  remains  current  after  the expiry time, due to the
          conditions mentioned in section 5. If a write RPC is done on
          the  server after the write lease has expired on the server,
          this could be considered an error since consistency could be
          lost, but it is not handled as such by NQNFS.

               Diagram  3  depicts  how  read  and  write  leases  are
          replaced  by a non-caching lease when there is the potential
          for write sharing. A write-caching lease is not used in  the
          Stanford  V  Distributed  System [Gray89], since synchronous
          writing is always used. A side effect of this change is that
          the  five  to  ten second lease duration recommended by Gray
          was found to be insufficient to achieve good performance for
          the  write-caching lease. Experimentation showed that thirty
          seconds was about optimal for cases  where  the  client  and
          server  are  connected  to  the  same local area network, so
          thirty seconds is the default lease duration  for  NQNFS.  A
          maximum  of twice that value is permitted, since Gray showed
          that for some network topologies, a  larger  lease  duration
          functions  better.  Although  there is an explicit get_lease
          RPC defined for the protocol, most lease requests  are  pig-
          gybacked  onto  the  other  RPCs  to minimize the additional
          overhead introduced by leasing.
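               The server's decision, as depicted in diagram 3, can be
          sketched  as  a  small  state  machine (an illustration with
          hypothetical names, not the 4.4BSD code): read-caching leases
          may  be  shared,  while a read/write conflict triggers evic-
          tion callbacks and leaves all holders with non-caching leases.

```python
READ, WRITE, NONCACHING = "read", "write", "noncaching"

class FileLeaseState:
    """Per-file lease decision on the server (illustrative): shared
    read leases; on write sharing, evict the other holders and fall
    back to non-caching leases, forcing synchronous RPCs."""

    def __init__(self, evict):
        self.evict = evict      # server-to-client eviction callback
        self.holders = {}       # client -> lease type

    def request(self, client, kind):
        others = {c: k for c, k in self.holders.items() if c != client}
        conflict = any(k == WRITE or kind == WRITE
                       for k in others.values())
        if conflict:
            for c in others:
                self.evict(c)   # holder flushes its caches, then
                self.holders[c] = NONCACHING
            kind = NONCACHING
        self.holders[client] = kind
        return kind
```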

          4.1. Rationale

               Leasing was chosen over hard server  state  information
          for the following reasons:

          1.   The server must maintain state  information  about  all
               current client leases. Since at most one lease is allo-
               cated for each RPC and the leases  expire  after  their
               lease  term,  the  upper bound on the number of current
               leases is the product of the lease term and the  server
               RPC  rate.  In practice, it has been observed that less
               than 10% of RPCs request  new  leases  and  since  most
               leases  have  a  term  of thirty seconds, the following
               rule of thumb should  estimate  the  number  of  server
               lease records:

                       Number of Server Lease Records = 0.1 * 30 * RPC rate

               Since each lease record occupies  64  bytes  of  server
               memory, storing the lease records should not be a seri-
               ous problem. If a server has exhausted  lease  storage,
               it  can simply wait a few seconds for a lease to expire
               and free up a record. On the other hand, a  Sprite-like
               server  must store records for all files currently open
               by all clients, which can require  significant  storage

                    [Diagram #1: Read Caching Leases.  Time lines (not
          to scale) for Client A, the Server and Client B: Client A's
          read + lease request establishes a read-caching lease, after
          which read syscalls are serviced from the cache; when the
          lease times out, a get-lease request finds the modrev the
          same, so the cache is still usable; Client B is added to the
          same lease and services its cache misses with read requests.]

                    [Diagram #2: Write Caching Lease.  Time lines for
          Client A and the Server: a write-caching lease is granted,
          delayed write syscalls proceed from the client cache, and
          the lease is renewed before expiry; on close, the delayed
          writes are pushed to the server.  The server's lease record
          expires write_slack seconds after the last write, so its
          expiry is delayed by write activity.]

                                        line from 0.613,2.638 to 1.238,2.638
                                        line from 1.488,4.075 to 1.488,3.638
                                        line from 2.987,4.013 to 2.987,3.575
                                        line from 4.487,4.013 to 4.487,3.575
                                        line from 2.987,3.888 to 4.487,3.700
                                        line from 4.385,3.688 to 4.487,3.700 to 4.391,3.737
                                        line from 4.487,4.138 to 2.987,3.950
                                        line from 3.084,3.987 to 2.987,3.950 to 3.090,3.938
                                        line from 2.987,4.763 to 4.487,4.450
                                        line from 4.385,4.446 to 4.487,4.450 to 4.395,4.495
                                        line from 4.487,4.438 to 4.487,4.013
                                        line from 4.487,5.138 to 2.987,4.888
                                        line from 3.082,4.929 to 2.987,4.888 to 3.090,4.879
                                        line from 4.487,6.513 to 4.487,5.513
                                        line from 4.487,6.513 to 4.487,6.513 to 4.487,5.513
                                        line from 2.987,5.450 to 2.987,5.200
                                        line from 1.488,5.075 to 1.488,4.075
                                        line from 2.987,5.263 to 2.987,4.013
                                        line from 2.987,7.700 to 2.987,5.325
                                        line from 4.487,7.575 to 4.487,6.513
                                        line from 1.488,8.512 to 1.488,8.075
                                        line from 2.987,8.637 to 2.987,8.075
                                        line from 2.987,9.637 to 2.987,8.825
                                        line from 1.488,9.450 to 1.488,8.950
                                        line from 2.987,4.450 to 1.488,4.263
                                        line from 1.584,4.300 to 1.488,4.263 to 1.590,4.250
                                        line from 1.488,4.888 to 2.987,4.575
                                        line from 2.885,4.571 to 2.987,4.575 to 2.895,4.620
                                        line from 2.987,5.263 to 1.488,5.075
                                        line from 1.584,5.112 to 1.488,5.075 to 1.590,5.063
                                        line from 4.487,5.513 to 2.987,5.325
                                        line from 3.084,5.362 to 2.987,5.325 to 3.090,5.313
                                        line from 2.987,5.700 to 4.487,5.575
                                        line from 4.386,5.558 to 4.487,5.575 to 4.390,5.608
                                        line from 4.487,6.013 to 2.987,5.825
                                        line from 3.084,5.862 to 2.987,5.825 to 3.090,5.813
                                        line from 2.987,6.200 to 4.487,6.075
                                        line from 4.386,6.058 to 4.487,6.075 to 4.390,6.108
                                        line from 4.487,6.450 to 2.987,6.263
                                        line from 3.084,6.300 to 2.987,6.263 to 3.090,6.250
                                        line from 2.987,6.700 to 4.487,6.513
                                        line from 4.385,6.500 to 4.487,6.513 to 4.391,6.550
                                        line from 1.488,6.950 to 2.987,6.763
                                        line from 2.885,6.750 to 2.987,6.763 to 2.891,6.800
                                        line from 2.987,7.700 to 4.487,7.575
                                        line from 4.386,7.558 to 4.487,7.575 to 4.390,7.608
                                        line from 4.487,7.950 to 2.987,7.763
                                        line from 3.084,7.800 to 2.987,7.763 to 3.090,7.750
                                        line from 2.987,8.637 to 1.488,8.512
                                        line from 1.585,8.546 to 1.488,8.512 to 1.589,8.496
                                        line from 1.488,8.887 to 2.987,8.700
                                        line from 2.885,8.688 to 2.987,8.700 to 2.891,8.737
                                        line from 2.987,9.637 to 1.488,9.450

                                        line from 1.584,9.487 to 1.488,9.450 to 1.590,9.438
                                        line from 1.488,9.950 to 2.987,9.762
                                        line from 2.885,9.750 to 2.987,9.762 to 2.891,9.800
                                        dashwid = 0.050i
                                        line dashed from 4.487,10.137 to 4.487,2.825
                                        line dashed from 2.987,10.137 to 2.987,2.825
                                        line dashed from 1.488,10.137 to 1.488,2.825
                                        "(not cached)" at 4.612,3.858 ljust
                                        "Diagram #3: Write sharing case" at 0.613,2.239 ljust
                                        "Write syscall" at 4.675,7.546 ljust
                                        "Read syscall" at 0.550,9.921 ljust
                                        "Lease valid on machine" at 1.363,2.551 ljust
                                        "(can still cache)" at 1.675,8.171 ljust
                                        "Reply" at 3.800,3.858 ljust
                                        "Write" at 3.175,4.046 ljust
                                        "writes" at 4.612,4.046 ljust
                                        "synchronous" at 4.612,4.233 ljust
                                        "write syscall" at 4.675,5.108 ljust
                                        "non-caching lease" at 3.175,4.296 ljust
                                        "Reply " at 3.175,4.483 ljust
                                        "req" at 3.175,4.983 ljust
                                        "Get write lease" at 3.175,5.108 ljust
                                        "Vacated msg" at 3.175,5.483 ljust
                                        "to the server" at 4.675,5.858 ljust
                                        "being flushed to" at 4.675,6.046 ljust
                                        "Delayed writes" at 4.675,6.233 ljust
                                        "Server" at 2.675,10.182 ljust
                                        "Client B" at 3.925,10.182 ljust
                                        "Client A" at 0.863,10.182 ljust
                                        "(not cached)" at 0.550,4.733 ljust
                                        "Read data" at 0.550,4.921 ljust
                                        "Reply  data" at 1.675,4.421 ljust
                                        "Read request" at 1.675,4.921 ljust
                                        "lease" at 1.675,5.233 ljust
                                        "Reply non-caching" at 1.675,5.421 ljust
                                        "Reply" at 3.737,5.733 ljust
                                        "Write" at 3.175,5.983 ljust
                                        "Reply" at 3.737,6.171 ljust
                                        "Write" at 3.175,6.421 ljust
                                        "Eviction Notice" at 3.175,6.796 ljust
                                        "Get read lease" at 1.675,7.046 ljust
                                        "Read syscall" at 0.550,6.983 ljust
                                        "being cached" at 4.675,7.171 ljust
                                        "Delayed writes" at 4.675,7.358 ljust
                                        "lease" at 3.175,7.233 ljust
                                        "Reply write caching" at 3.175,7.421 ljust
                                        "Get  write lease" at 3.175,7.983 ljust
                                        "Write syscall" at 4.675,7.983 ljust
                                        "with same modrev" at 1.675,8.358 ljust
                                        "Lease" at 0.550,8.171 ljust
                                        "Renewed" at 0.550,8.358 ljust
                                        "Reply" at 1.675,8.608 ljust
                                        "Get Lease Request" at 1.675,8.983 ljust

                                        "Read syscall" at 0.550,8.733 ljust
                                        "from cache" at 0.550,9.108 ljust
                                        "Read syscall" at 0.550,9.296 ljust
                                        "Reply " at 1.675,9.671 ljust
                                        "plus lease" at 2.050,9.983 ljust
                                        "Read Request" at 1.675,10.108 ljust

               for a large, heavily loaded server. In [Mogul93], it is
               proposed  that  a  mechanism  vaguely similar to paging
               could be used to deal with this for Spritely  NFS,  but
               this  appears  to introduce a fair amount of complexity
               and may limit the usefulness of open records for  stor-
               ing other state information, such as file locks.

          2.   After a server crashes it must  recover  lease  records
               for  the  current  outstanding  leases,  which actually
               implies that if it waits until all leases have expired,
               there  is no state to recover. The server must wait for
               the maximum lease duration of one minute, and  it  must
               serve  all  outstanding  write  requests resulting from
               terminated  write-caching  leases  before  issuing  new
               leases.  The  one  minute  delay can be overlapped with
          file system consistency checking (e.g., fsck). Because  no
               state  must be recovered, a lease-based server, like an
               NFS server, avoids the problem of state recovery  after
               a crash.

               There can, however, be problems during  crash  recovery
               because  of  a  potentially large number of write backs
               due to terminated write-caching leases.  One  of  these
               problems  is  a "recovery storm" [Baker91], which could
               occur when the server is overloaded by  the  number  of
               write  RPC requests. The NQNFS protocol deals with this
               by  replying  with  a   return   status   code   called
               try_again_later  to  all  RPC  requests  (except write)
               until the write requests subside. At this  time,  there
               has   not  been  sufficient  testing  of  server  crash
               recovery while under heavy server load to determine  if
               the  try_again_later  reply is a sufficient solution to
               the problem. The other problem is that consistency will
               be  lost  if other RPCs are performed before all of the
               write backs for terminated  write-caching  leases  have
               completed. This is handled by having the server serve
               only write RPCs until no write RPC requests have
               arrived for write_slack seconds, where write_slack is
               set to several times the client's timeout retransmit
               interval; by then it is assumed that all clients have
               had an opportunity to send their writes to the server.

          3.   Another advantage of leasing is that, since leases  are
               required  at  times  when  other  I/O operations occur,
               lease requests can  almost  always  be  piggybacked  on

               other  RPCs,  avoiding  some of the overhead associated
               with the explicit open and close  RPCs  required  by  a
               Sprite-like  system.  Compared  with  Sprite cache con-
               sistency, this can result in a significantly lower  RPC
               load (see table #1).
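          The recovery-storm gating described in point 2 above can be
          sketched as follows. This is only an illustration under
          assumed names; the function, the state variables, and the
          write_slack value of 12 seconds are ours, not taken from
          the 4.4BSD sources. Writes are always served and
          timestamped; every other RPC gets try_again_later until no
          write has arrived for write_slack seconds:

```c
/* Hypothetical sketch of NQNFS server crash-recovery gating: until
 * write RPCs have subsided for write_slack seconds, every non-write
 * RPC is answered with try_again_later. All names are illustrative
 * and times are in seconds. */
enum rpc_kind  { RPC_WRITE, RPC_OTHER };
enum rpc_reply { RPC_ACCEPT, RPC_TRY_AGAIN_LATER };

static long last_write_seen;        /* time of last write RPC */
static long write_slack = 12;       /* assumed: several times the
                                     * client retransmit interval */

enum rpc_reply
recovery_gate(enum rpc_kind kind, long now)
{
    if (kind == RPC_WRITE) {        /* writes are always served */
        last_write_seen = now;
        return RPC_ACCEPT;
    }
    /* other RPCs are deferred until the write backs have subsided */
    if (now - last_write_seen < write_slack)
        return RPC_TRY_AGAIN_LATER;
    return RPC_ACCEPT;
}
```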

          5. Limitations of the NQNFS Protocol

               There is a  serious  risk  when  leasing  is  used  for
          delayed  write  caching. If the server is simply too busy to
          service a lease renewal before a  write-caching  lease  ter-
          minates,  the client will not be able to push the write data
          to the server before the lease has terminated, resulting  in
          inconsistency.  Note that the danger of inconsistency occurs
          when the server assumes that a write-caching lease has  ter-
          minated  before  the client has had the opportunity to write
          the data back to the server. In  an  effort  to  avoid  this
          problem,  the  NQNFS  server  does  not assume that a write-
          caching lease has terminated until three conditions are met:

              1 - clock time > (expiry time + clock skew)
              2 - there is at least one server daemon (nfsd) waiting for an RPC request
              3 - no write RPCs received for leased file within write_slack after the corrected expiry time

          The first condition ensures that the lease  has  expired  on
          the  client.  The clock_skew, by default three seconds, must
          be set to a value larger than the maximum time-of-day  clock
          error that is likely to occur during the maximum lease dura-
          tion. The second  condition  attempts  to  ensure  that  the
          client  is  not  waiting  for replies to any writes that are
          still queued for service by an  nfsd.  The  third  condition
          tries to guarantee that the client has transmitted all write
          requests to the server, since write_slack is set to  several
          times the client's timeout retransmit interval.
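          A minimal sketch of this three-condition test follows. The
          function name, parameter names, and the interpretation that
          a late write restarts the write_slack quiet period are our
          assumptions for illustration; all times are in seconds:

```c
/* Hypothetical sketch of the server-side test for write-caching
 * lease termination; names are illustrative, times in seconds. */
static int
lease_terminated(long now, long expiry, long clock_skew,
    int nfsds_waiting, long last_write_rcvd, long write_slack)
{
    long corrected = expiry + clock_skew;
    long quiet_since;

    /* 1 - clock time > (expiry time + clock skew): the lease has
     *     certainly expired on the client */
    if (now <= corrected)
        return 0;
    /* 2 - at least one nfsd is waiting for an RPC request, so no
     *     queued write is still awaiting service */
    if (nfsds_waiting < 1)
        return 0;
    /* 3 - no write RPC for the leased file within write_slack
     *     seconds after the corrected expiry time (a late write
     *     restarts the quiet period) */
    quiet_since = last_write_rcvd > corrected ? last_write_rcvd
                                              : corrected;
    if (now - quiet_since < write_slack)
        return 0;
    return 1;
}
```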

               There are also certain file system semantics  that  are
          problematic for both NFS and NQNFS, due to the lack of state
          information maintained by the server. If a file is  unlinked
          on  one client while open on another it will be removed from
          the file server, resulting in failed file  accesses  on  the
          client  that  has  the  file open. If the file system on the
          server is out of space or the client user's disk  quota  has
          been exceeded, a delayed write can fail long after the write
          system call was successfully completed. With NFS this  error
          will be detected by the close system call, since the delayed
          writes are  pushed  upon  close.  With  NQNFS  however,  the
          delayed write RPC may not occur until after the close system
          call, possibly even after the process has exited. Therefore,
          if a process must check for write errors, a system call such
          as fsync must be used.
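          A client program that must see such errors can be
          structured like this sketch, using only standard POSIX
          calls; the helper name is our own:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch: on NQNFS a successful write() or even close() does not
 * guarantee a delayed write reached the server, so a program that
 * must detect write errors (e.g. over-quota) calls fsync() and
 * checks its result before closing. Helper name is illustrative. */
int
write_checked(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    /* force the delayed writes to the server now, so any failure
     * is reported here instead of being lost after exit */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```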

               Another problem occurs when a process on one client  is
          running  an  executable file and a process on another client
          starts to write to the file. The read  lease  on  the  first
          client  is  terminated  by the server, but the client has no
          recourse but to terminate the process, since the process  is
          already in progress on the old executable.

               The NQNFS protocol does not support file locking,
          since a file lock would have to involve hard state
          information, that is, state that must be recovered after
          a crash.

          6. Other NQNFS Protocol Features

               NQNFS also includes a variety of minor modifications to
          the  NFS  protocol, in an attempt to address various limita-
          tions. The protocol uses 64-bit file sizes and offsets  in
          order to handle large files. TCP transport may be used as an
          alternative to UDP for cases  where  UDP  does  not  perform
          well.  Transport  mechanisms such as TCP also permit the use
          of much larger read/write data sizes,  which  might  improve
          performance in certain environments.

               The NQNFS protocol replaces  the  Readdir  RPC  with  a
          Readdir_and_Lookup  RPC  that  returns  the  file handle and
          attributes for each file in the directory as  well  as  name
          and  file id number. This additional information may then be
          loaded into the lookup  and  file-attribute  caches  on  the
          client.  Thus,  for  cases  such as "ls -l", the stat system
          calls can be performed locally without doing any  lookup  or
          getattr  RPCs. Another additional RPC is the Access RPC that
          checks for file accessibility against the  server.  This  is
          necessary  since  in some cases the client user ID is mapped
          to a different user on the server and doing the access check
          locally  on  the  client  using  file  attributes and client
          credentials is not correct.  One  case  where  this  becomes
          necessary  is  when  the NQNFS mount point is using Kerberos
          authentication, where the Kerberos authentication ticket  is
          translated  to  credentials on the server that are mapped to
          the client side user id. For further details on  the  proto-
          col, see [Macklem93].
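          The way Readdir_and_Lookup data lets the client answer
          later stat calls locally can be sketched as below. The
          structure layout, cache, and function names are purely our
          illustration, not the protocol's wire format or the 4.4BSD
          client code:

```c
#include <string.h>

/* Illustrative only: each Readdir_and_Lookup reply entry carries
 * the name, file id, file handle and attributes, so the client can
 * prime its caches in one pass and serve stat() locally. */
#define NAMEMAX 64
#define CACHESZ 32

struct rdl_entry {            /* one reply entry (hypothetical) */
    char name[NAMEMAX];
    long fileid;
    long fhandle;             /* stand-in for an opaque file handle */
    long size;                /* a representative attribute */
};

static struct rdl_entry attrcache[CACHESZ];
static int ncached;

/* Prime the client caches from a Readdir_and_Lookup reply. */
void
rdl_prime(const struct rdl_entry *ents, int n)
{
    for (int i = 0; i < n && ncached < CACHESZ; i++)
        attrcache[ncached++] = ents[i];
}

/* Local stat: served from the cache, no lookup or getattr RPC. */
const struct rdl_entry *
rdl_stat(const char *name)
{
    for (int i = 0; i < ncached; i++)
        if (strcmp(attrcache[i].name, name) == 0)
            return &attrcache[i];
    return 0;                 /* miss: would fall back to RPCs */
}
```

          With a cache primed this way, an "ls -l" of a directory
          touches the server once rather than once per file.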

          7. Performance

               In order to evaluate the  effectiveness  of  the  NQNFS
          protocol,  a  benchmark was used that was designed to typify
          real work on the client  workstation.  Benchmarks,  such  as
          Laddis [Wittle93], that perform server load characterization
          are not appropriate for this work,  since  it  is  primarily
          client  caching efficiency that needs to be evaluated. Since
          these tests are measuring overall client system  performance
          and  not  just  the  performance  of  the  file system, each
          sequence of runs was performed  on  identical  hardware  and
          operating   system   in  order  to  factor  out  the  system

          components affecting performance other than the file  system
          protocol.

               The equipment used for all the benchmarks consisted
          of members of the DECstation family of workstations using
          the MIPS RISC architecture. The operating system running
          on these systems was a pre-release version of 4.4BSD
          Unix. For
          all benchmarks, the file server was a  DECstation  2100  (10
          MIPS)  with  8Mbytes  of  memory  and a local RZ23 SCSI disk
          (27msec average access time). The  clients  range  in  speed
          from  DECstation  2100s  to a DECstation 5000/25, and always
          run with six block I/O daemons and a  4Mbyte  buffer  cache,
          except for the test runs where the buffer cache size was the
          independent variable. In all cases /tmp is  mounted  on  the
          local  SCSI  disk[2], all machines were attached to the same
          uncongested Ethernet, and ran in single user mode during the
          benchmarks. Unless noted otherwise, test runs used  UDP  RPC
          transport  and  the  results given are the average values of
          four runs.

               The benchmark used is  the  Modified  Andrew  Benchmark
          (MAB)  [Ousterhout90],  which is a slightly modified version
          of the benchmark used to  characterize  performance  of  the
          Andrew  ITC  file system [Howard88]. The MAB was set up with
          the executable binaries in the remote  mounted  file  system
          and  the final load step was commented out, due to a linkage
          problem  during  testing  under  4.4BSD.  Therefore,   these
          results  are  not  directly comparable to other reported MAB
          results. The MAB is made up of five distinct phases:

          1.        Make five directories (no significant cost)

          2.        Copy a file system subtree to a working directory

          3.        Get file attributes  (stat)  of  all  the  working
                    files

          4.        Search for strings (grep) in the files

          5.        Compile a library of C sources and archive them

          Of the five phases, the fifth is by far the largest  and  is
          the  one  affected  most  by  client caching mechanisms. The
          results for phase #1 are  invariant  over  all  the  caching
          mechanisms.

          ____________________
             [2]Testing using the 4.4BSD MFS [McKusick90] resulted  in
          slightly  degraded  performance, probably since the machines
          only had 16Mbytes of memory, and so paging increased.

          7.1. Buffer Cache Size Tests

               The first experiment was done to see what effect chang-
          ing  the  size of the buffer cache would have on client per-
          formance. A single DECstation  5000/25  was  used  to  do  a
          series  of runs of MAB with different buffer cache sizes for
          four variations of the file system protocol. The four varia-
          tions are as follows:

          Case 1:   NFS - The NFS protocol as implemented in 4.4BSD

          Case 2:   Leases - The NQNFS protocol using leases for cache
                    consistency

          Case 3:   Leases, Rdirlookup  -  The  NQNFS  protocol  using
                    leases  for cache consistency and with the readdir
                    RPC replaced by Readdir_and_Lookup

          Case 4:   Leases, Attrib leases, Rdirlookup - The NQNFS pro-
                    tocol using leases for cache consistency, with the
                    readdir RPC replaced  by  the  Readdir_and_Lookup,
                    and requiring a valid lease not only for file-data
                    access, but also for file-attribute access.

          As can be seen in figure 1, the buffer cache achieves  about
          optimal performance for the range of two to ten megabytes in
          size. At eleven megabytes in size, the system pages  heavily
          and  the runs did not complete in a reasonable time. Even at
          64Kbytes, the buffer  cache  improves  performance  over  no
          buffer  cache  by  a  significant  margin of 136-148 seconds
          versus 239 seconds. This may be due, in part,  to  the  fact
          that  the Compile Phase of the MAB uses a rather small work-
          ing set of file data. All variants of  NQNFS  achieve  about
          the  same  performance,  running around 30% faster than NFS,
          with a slightly larger difference  for  large  buffer  cache
          sizes.  Based on these results, all remaining tests were run
          with the buffer cache size set to 4Mbytes. Although I do not
          know  what  causes  the local peak in the curves between 0.5
          and 2 megabytes, there is some  indication  that  contention
          for  buffer  cache blocks, between the update process (which
          pushes delayed writes to the server  every  thirty  seconds)
          and the I/O system calls, may be involved.

          7.2. Multiple Client Load Tests

               During preliminary runs of the  MAB,  it  was  observed
          that  the  server  RPC  counts were reduced significantly by
          NQNFS as compared  to  NFS  (table  1).  (Spritely  NFS  and
          Ultrix4.3/NFS  numbers were taken from [Mogul93] and are not
          directly comparable, due  to  numerous  differences  in  the
          experimental  setup including deletion of the load step from
          phase 5.) This suggests that the NQNFS protocol might  scale
          better  with  respect to the number of clients accessing the

          [Figure #1: MAB Phase 5 (compile). Elapsed time in seconds
          (0-160) versus buffer cache size in MBytes (0-10) for four
          curves: NFS; Leases; Leases, Rdirlookup; and Leases,
          Attrib leases, Rdirlookup.]
          server. The experiment described in this section ran the MAB
          on one to ten clients concurrently, to observe  the  effects
          of  heavier  server  load. The clients were started at rough-
          ly the same time by pressing all the <return> keys together
          and, although not synchronized beyond that  point,  all  the
          clients  would  finish  the  test  run  within about two
          seconds of each other. This was not a realistic load  of  N
          active clients, but it did result in a reproducible,
          increasing client load on the server. The results  for  the
          four variants are plotted in figures 2-5.

               For the MAB benchmark, the NQNFS protocol  reduces  the
          RPC  counts significantly, but with a minimum of extra over-
          head (the GetLease/Open-Close count).

               In figure 2, where a subtree of seventy small files  is
          copied,  the  difference  between  the  protocol variants is
          minimal, with the NQNFS variants performing slightly better.
          For  this  case, the Readdir_and_Lookup RPC is a slight hin-
          drance under heavy load,  possibly  because  it  results  in
          larger directory blocks in the buffer cache.

               In figure 3, for the phase that  gets  file  attributes
          for a large number of files, the leasing variants take about
          50% longer, indicating that there are  performance  problems
          in  this  area.  For the case where valid current leases are
          required for every file when attributes  are  returned,  the
          performance  is significantly worse than when the attributes
          are allowed to be stale by a few seconds on  the  client.  I
          have  not been able to explain the oscillation in the curves
          for the Lease cases.

          _______________________________________________________________________________________
         |                               Table #1: MAB RPC Counts                               |
         |      RPC        Getattr   Read   Write   Lookup   Other   GetLease/Open-Close   Total|
         |______________|_______________________________________________________________________|
         | BSD/NQNFS    |    277      139    306     575      294            127           1718 |
         | BSD/NFS      |   1210      506    451     489      238              0           2894 |
         | Spritely NFS |    259      836    192     535      306           1467           3595 |
         | Ultrix4.3/NFS|   1225     1186    476     810      305              0           4002 |
         |______________|_______________________________________________________________________|
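          The claim that NQNFS cuts the RPC count sharply at a small
          leasing cost can be checked directly from the totals in
          Table #1. The following short Python sketch is illustrative
          only (the numbers are copied from the table; the script is
          not part of the original paper):

```python
# RPC totals copied from Table #1 (MAB RPC counts).
totals = {
    "BSD/NQNFS": 1718,
    "BSD/NFS": 2894,
    "Spritely NFS": 3595,
    "Ultrix4.3/NFS": 4002,
}
lease_overhead = 127  # GetLease/Open-Close RPCs for BSD/NQNFS

# Fraction of RPCs eliminated by NQNFS relative to BSD/NFS.
reduction = (totals["BSD/NFS"] - totals["BSD/NQNFS"]) / totals["BSD/NFS"]
# Share of NQNFS traffic that is leasing overhead.
overhead_share = lease_overhead / totals["BSD/NQNFS"]

print(f"RPC reduction vs. BSD/NFS: {reduction:.0%}")          # 41%
print(f"GetLease/Open-Close overhead share: {overhead_share:.0%}")  # 7%
```

          That is, NQNFS eliminates roughly two fifths of the RPCs
          while the extra leasing traffic is well under a tenth of
          what remains.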

                    [Figure #2: MAB Phase 2 (copying): Time (sec)
                    vs. Number of Clients, for NFS; Leases; Leases,
                    Rdirlookup; and Leases, Attrib leases,
                    Rdirlookup.]

                    [Figure #3: MAB Phase 3 (stat/find): Time (sec)
                    vs. Number of Clients, for NFS; Leases; Leases,
                    Rdirlookup; and Leases, Attrib leases,
                    Rdirlookup.]

                    [Figure #4: MAB Phase 4 (grep/wc/find): Time
                    (sec) vs. Number of Clients, for NFS; Leases;
                    Leases, Rdirlookup; and Leases, Attrib leases,
                    Rdirlookup.]

                    [Figure #5: MAB Phase 5 (compile): Time (sec)
                    vs. Number of Clients, for NFS; Leases; Leases,
                    Rdirlookup; and Leases, Attrib leases,
                    Rdirlookup.]

               For the string searching phase depicted  in  figure  4,
          the  leasing  variants  that do not require valid leases for
          files when attributes are returned appear  to  scale  better
          with server load than NFS. However, the effect appears to be
          negligible until the server load is fairly heavy.

               Most of the time in the MAB benchmark is spent  in  the
          compilation  phase and this is where the differences between
          caching methods are most pronounced. In figure 5 it  can  be
          seen that any protocol variant using Leases performs about a
          factor of two better than NFS at a load of ten clients. This
          indicates  that the use of NQNFS may allow servers to handle
          significantly more clients for this type of workload.

               Table 2 summarizes the MAB run times for all phases for
          the single client DECstation 5000/25. The Leases case refers
          to using leases, whereas the Leases,  Rdirl  case  uses  the
          Readdir_and_Lookup RPC as well and the BCache Only case uses
          leases, but only the buffer cache and not the  attribute  or
          name caches. The No Caching case does not do any client
          side caching, performing all system calls via synchronous
          RPCs to the server.

          7.3. Processor Speed Tests

               An important goal of client-side file system caching is
          to decouple the I/O system calls from the underlying distri-
          buted file system, so that the client's  system  performance
          might  scale  with processor speed. In order to test this, a
          series of MAB runs were performed on three DECstations  that
          are  similar  except for processor speed. In addition to the
          four protocol variants used for the above tests,  runs  were
          done  with the client caches turned off, for worst case per-
          formance numbers for caching mechanisms  with  a  100%  miss
          rate.  The  CPU utilization was measured, as an indicator of
          how much the processor was blocking for  I/O  system  calls.
          Note that since the systems were running in single user mode
          and  otherwise  quiescent,  almost  all  CPU  activity   was
          directly  related  to the MAB run. The results are presented
          in table 3. The CPU time is simply the product  of  the  CPU
          utilization  and  elapsed  running time and, as such, is the
          optimistic bound on performance  achievable  with  an  ideal
          client caching scheme that never blocks for I/O. As  can  be
          seen  in  the table, any caching mechanism achieves signifi-
          cantly better performance than  when  caching  is  disabled,
          roughly  doubling  the  CPU utilization with a corresponding
          reduction in run time. For NFS, the CPU utilization drops
          as CPU speed increases, which suggests that NFS is not
          scaling with CPU speed. For the NQNFS variants, the
          CPU  utilization  remains  at just below 90%, which suggests
          that the caching  mechanism  is  working  well  and  scaling

          ________________________________________________________________________________
         | Table #2: Single DECstation 5000/25 Client Elapsed Times (sec)                |
         |     Phase        1      2       3       4       5      Total     % Improvement|
         |______________|________________________________________________________________|
         | No Caching   |   6      35      41     40      258       380          -93     |
         | NFS          |   5      24      15     20      133       197            0     |
         | BCache Only  |   5      20      24     23      116       188            5     |
         | Leases, Rdirl|   5      20      21     20      105       171           13     |
         | Leases       |   5      19      21     21       99       165           16     |
         |______________|________________________________________________________________|

          ________________________________________________________________________________________________
         |                                Table #3: MAB Phase 5 (compile)                                |
         |                    DS2100 (10.5 MIPS)         DS3100 (14.0 MIPS)       DS5000/25 (26.7 MIPS)  |
         |                 Elapsed     CPU     CPU    Elapsed     CPU     CPU    Elapsed     CPU     CPU |
         |                  time     Util(%)   time    time     Util(%)   time    time     Util(%)   time|
         |______________|________________________________________________________________________________|
         | Leases       |    143       89      127      113       87       98       99       89       88 |
         | Leases, Rdirl|    150       89      134      110       91      100      105       88       92 |
         | BCache Only  |    169       85      144      129       78      101      116       75       87 |
         | NFS          |    172       77      132      135       74      100      133       71       94 |
         | No Caching   |    330       47      155      256       41      105      258       39      101 |
         |______________|________________________________________________________________________________|

          within this CPU range. Note that  for  this  benchmark,  the
          ratio  of  CPU  times for the DECstation 3100 and DECstation
          5000/25 are quite different than the Dhrystone MIPS  ratings
          would suggest.

               Overall, the  results  seem  encouraging,  although  it
          remains  to  be  seen whether or not the caching provided by
          NQNFS can continue to scale with CPU performance. There is a
          good indication that NQNFS permits a server to scale to more
          clients than does NFS, at least for workloads  akin  to  the
          MAB compile phase. A more difficult question is "What if the
          server is much faster doing write RPCs?" as a result of some
          technology  such  as Prestoserve or write gathering. Since a
          significant part of the difference between NFS and NQNFS  is
          the synchronous writing, it is difficult to predict how much
          a server capable of fast write RPCs will negate the  perfor-
          mance  improvements  of  NQNFS.  At  the very least, table 1
          indicates  that  the  write  RPC  load  on  the  server  has
          decreased  by approximately 30%, and this reduced write load
          should still result in some improvement.

               Indications are that the Readdir_and_Lookup RPC has not
          improved  performance  for  these  tests  and may in fact be
          degrading performance slightly.  The  results  in  figure  3
          indicate some problems, possibly with handling of the attri-
          bute cache. It seems logical that the Readdir_and_Lookup
          RPC should permit priming of the attribute cache, improving
          its hit rate, but the results are counter to that.

          7.4. Internetwork Delay Tests

               This experimental setup was used  to  explore  how  the
          different protocol variants might perform over internetworks
          with larger RPC RTTs. The server was  moved  to  a  separate
          Ethernet,  using  a  MicroVAXII as an IP router to the other
          Ethernet. The 4.3Reno BSD Unix system running on the  Micro-
          VAXII  was modified to delay IP packets being forwarded by a

          tunable N millisecond delay. The implementation  was  rather
          crude  and  did  not try to simulate a distribution of delay
          times nor was it programmed to drop packets at a given rate,
          but  it  served  as  a  simple emulation of a long, fat net-
          work[3] [Jacobson88]. The MAB was run using both UDP and TCP
          RPC  transports for a variety of RTT delays from five to two
          hundred milliseconds, to observe the effects of RTT delay on
          RPC transport. It was found that, due to high variability
          between runs, four runs did not suffice, so eight runs were
          done at each value. The results in figure 6 and table 4 are
          the averages over the eight runs.

               I found these results somewhat surprising, since I  had
          assumed  that  stability  across  an internetwork connection
          would be a function of RPC transport  protocol.  Looking  at
          the  standard  deviations  observed  between the eight runs,
          there is an indication  that  the  NQNFS  protocol  plays  a
          larger role in maintaining stability than the underlying RPC
          transport protocol. It appears that NFS over  TCP  transport
          is  the least stable variant tested. It should be noted that
          the TCP implementation used  was  roughly  at  4.3BSD  Tahoe
          release  and that the 4.4BSD TCP implementation was far less
          stable and would fail intermittently, due to a bug I was not
          able  to  isolate.  It  would appear that some of the recent
          enhancements to the 4.4BSD TCP implementation have a  detri-
          mental  effect on the performance of RPC-type traffic loads,
          which intermix small and large data transfers in both direc-
          tions.  It  is obvious that more exploration of this area is
          needed before any conclusions can be made  beyond  the  fact
          that  over a local area network, TCP transport provides per-
          formance comparable to UDP.

          8. Lessons Learned

               Evaluating the performance of a distributed file system
          is  fraught  with difficulties, due to the many software and
          hardware  factors   involved.   The   limited   benchmarking
          presented  here  took  a considerable amount of time and the
          results gained by the exercise only give indications of what
          the performance might be for a few scenarios.

               The IP router with delay introduction proved  to  be  a
          valuable tool for protocol debugging[4], and may  be  useful
          for a more extensive study of performance over internetworks
          ____________________
             [3]Long fat networks refer to network interconnections
          with a bandwidth X RTT product > 10^5 bits.
             [4]It  exposed  two  bugs in the 4.4BSD networking, one a
          problem in the Lance chip driver for the DECstation and  the
          other  a  TCP  window  sizing problem that I was not able to
          isolate.

          [Figure #6: MAB Phase 5 (compile). Elapsed time (sec)
          versus round trip delay (msec), for Leases,UDP;
          Leases,TCP; NFS,UDP; and NFS,TCP.]

          ____________________________________________________________________________________________________________
         |                          Table #4: MAB Phase 5 (compile) for Internetwork Delays                          |
         |                 NFS,UDP                  NFS,TCP                 Leases,UDP               Leases,TCP      |
         | Delay     Elapsed     Standard     Elapsed     Standard     Elapsed     Standard     Elapsed     Standard |
         | (msec)   time (sec)   Deviation   time (sec)   Deviation   time (sec)   Deviation   time (sec)   Deviation|
         |_______|___________________________________________________________________________________________________|
         | 5     |     139          2.9         139           2.4        112          7.0         108          6.0   |
         | 40    |     175          5.1         208          44.5        150         23.8         139          4.3   |
         | 80    |     207          3.9         213           4.7        180          7.7         210         52.9   |
         | 120   |     276         29.3         273          17.1        221          7.7         238          5.8   |
         | 160   |     304          7.2         328          77.1        275         21.5         274         10.1   |
         | 200   |     372         35.0         506         235.1        338         25.2         379         69.2   |
         |_______|___________________________________________________________________________________________________|

          if enhanced to do a better job  of  simulating  internetwork
          delay and packet loss.

               The Leases mechanism provided a simple  model  for  the
          provision  of cache consistency and did seem to improve per-
          formance for various scenarios. Unfortunately, it  does  not
          provide  the  server  state information that is required for
          file system semantics, such as locking, that  many  software
          systems demand. In production environments on my campus, the
          need for file locking and  the  correct  generation  of  the
          ETXTBSY error code are far more important than full cache
          consistency, and leasing does not satisfy these needs.
          Another file system semantic that requires hard server state
          is the delay of file removal until  the  last  close  system
          call.  Although  Spritely  NFS did not support this semantic
          either, it is logical that the open file state maintained by
          that  system  would  facilitate  the  implementation of this
          semantic more easily than would the Leases mechanism.

          9. Further Work

               The current implementation uses a fixed, moderate sized
          buffer  cache  designed  for the local UFS [McKusick84] file
          system. The results in figure 1 suggest that  this  is  ade-
          quate  so  long as the cache is of an appropriate size. How-
          ever, a mechanism permitting the cache to vary in  size  has
          been  shown  to  outperform  fixed sized buffer caches [Nel-
          son90], and could be beneficial. It could also be useful  to
          allow  the  buffer cache to grow very large by making use of
          local backing store for cases where  server  performance  is
          limited. A very large buffer cache size would in turn permit
          experimentation with  much  larger  read/write  data  sizes,
          facilitating  bulk  data transfers across long fat networks,
          such as will characterize the Internet of the near future. A
          careful  redesign  of  the buffer cache mechanism to provide
          support for these features would probably be the next imple-
          mentation step.

               The results in figure 3 indicate that the mechanics of
          caching file attributes and maintaining the attribute
          cache's consistency need to be examined further. There
          also needs to be more work done on the interaction between a
          Readdir_and_Lookup RPC and the name and attribute caches, in
          an effort to reduce Getattr and Lookup RPC loads.

               The NQNFS protocol has never been used in a  production
          environment  and  doing so would provide needed insight into
          how well the protocol satisfies the needs of real workstation
          environments.  It  is  hoped  that  the  distribution of the
          implementation in 4.4BSD will facilitate use of the protocol
          in production environments elsewhere.

               The big question that needs to be resolved  is  whether
          Leases  are  an  adequate mechanism for cache consistency or
          whether hard  server  state  is  required.  Given  the  work
          presented  here  and  in  the  papers  related to Sprite and
          Spritely NFS, there are clear indications that a cache  con-
          sistency  algorithm  can  improve  both performance and file
          system semantics. As yet, however, it is  unclear  what  the
          best  approach  to  maintain consistency is. It would appear
          that hard state information is required for file locking and
          other  mechanisms and, if so, it seems appropriate to use it
          for cache consistency as well.

          10. Acknowledgements

               I would like to thank the members of the  CSRG  at  the
          University  of California, Berkeley for their continued sup-
          port over the years. Without their encouragement and  assis-
          tance this software would never have been implemented. Prof.
          Jim Linders and Prof. Tom Wilson here at the  University  of
          Guelph  helped  proofread  this paper and Jeffrey Mogul pro-
          vided a great deal of assistance, helping to turn my gibber-
          ish into something at least moderately readable.

          11. References

          [Baker91]      Mary Baker and John Ousterhout,  Availability
                         in  the  Sprite  Distributed  File System, In
                         Operating System Review,  (25)2,  pg.  95-98,
                         April 1991.

          [Baker91a]     Mary Baker, private communication, May 1991.

          [Burrows88]    Michael  Burrows,  Efficient  Data   Sharing,
                         Technical  Report  #153, Computer Laboratory,
                         University of Cambridge, Dec. 1988.

          [Gray89]       Cary G. Gray and David R.  Cheriton,  Leases:
                         An  Efficient  Fault-Tolerant  Mechanism  for
                         Distributed File Cache Consistency, In  Proc.
                          of  the  Twelfth  ACM  Symposium on Operating
                          Systems Principles, Litchfield Park, AZ, Dec.
                         1989.

          [Howard88]     John H. Howard, Michael L. Kazar,  Sherri  G.
                         Menees,  David A. Nichols, M. Satyanarayanan,
                         Robert N. Sidebotham  and  Michael  J.  West,
                         Scale  and  Performance in a Distributed File
                         System, ACM Trans. on Computer Systems, (6)1,
                         pg 51-81, Feb. 1988.

          [Jacobson88]   Van Jacobson and R.  Braden,  TCP  Extensions
                         for  Long-Delay  Paths, ARPANET Working Group
                         Requests for Comment, DDN Network Information
                         Center,  SRI  International,  Menlo Park, CA,
                         October 1988, RFC-1072.

          [Jacobson89]   Van Jacobson, Sun NFS  Performance  Problems,
                         Private Communication, November, 1989.

          [Juszczak89]   Chet Juszczak, Improving the Performance  and
                         Correctness of an NFS Server, In Proc. Winter
                         1989 USENIX Conference, pg. 53-63, San Diego,
                         CA, January 1989.

          [Juszczak94]   Chet Juszczak, Improving  the  Write  Perfor-
                         mance  of  an  NFS Server, to appear in Proc.
                         Winter 1994 USENIX Conference, San Francisco,
                         CA, January 1994.

          [Kazar88]      Michael L. Kazar, Synchronization and Caching
                         Issues  in  the  Andrew File System, In Proc.
                         Winter 1988  USENIX  Conference,  pg.  27-36,
                         Dallas, TX, February 1988.

          [Kent87]       Christopher A. Kent and  Jeffrey  C.  Mogul,
                         Fragmentation  Considered  Harmful,  Research
                         Report 87/3,  Digital  Equipment  Corporation
                         Western Research Laboratory, Dec. 1987.

          [Kent87a]      Christopher A. Kent, Cache Coherence in Dis-
                         tributed Systems, Research Report 87/4, Digi-
                         tal Equipment  Corporation  Western  Research
                         Laboratory, April 1987.

          [Macklem90]    Rick  Macklem,  Lessons  Learned  Tuning  the
                         4.3BSD  Reno Implementation of the NFS Proto-
                         col, In Proc. Winter 1991 USENIX  Conference,
                         pg. 53-64, Dallas, TX, January 1991.

          [Macklem93]    Rick Macklem, The 4.4BSD NFS  Implementation,
                         In  The System Manager's Manual, 4.4 Berkeley
                         Software Distribution, University of Califor-
                         nia, Berkeley, June 1993.

          [McKusick84]   Marshall K. McKusick, William N. Joy,  Samuel
                         J.  Leffler  and Robert S. Fabry, A Fast File
                         System for UNIX, ACM Transactions on Computer
                         Systems,  Vol.  2,  Number  3,  pg.  181-197,
                         August 1984.

          [McKusick90]   Marshall K. McKusick, Michael J.  Karels  and
                         Keith   Bostic,   A   Pageable  Memory  Based
                         Filesystem,  In  Proc.  Summer  1990   USENIX
                         Conference,  pg.  137-143,  Anaheim, CA, June
                         1990.

          [Mogul93]      Jeffrey C. Mogul, Recovery in  Spritely  NFS,
                         Research  Report 93/2, Digital Equipment Cor-
                         poration Western  Research  Laboratory,  June
                         1993.

          [Moran90]      Joseph Moran, Russel Sandberg,  Don  Coleman,
                         Jonathan   Kepecs   and  Bob  Lyon,  Breaking
                         Through the NFS Performance Barrier, In Proc.
                         Spring  1990  EUUG  Conference,  pg. 199-206,
                         Munich, FRG, April 1990.

          [Nelson88]     Michael N. Nelson, Brent B. Welch,  and  John
                         K.  Ousterhout, Caching in the Sprite Network
                         File System,  ACM  Transactions  on  Computer
                         Systems (6)1 pg. 134-154, February 1988.

          [Nelson90]     Michael N. Nelson,  Virtual  Memory  vs.  The
                         File  System,  Research  Report 90/4, Digital
                         Equipment   Corporation   Western    Research
                         Laboratory, March 1990.

          [Nowicki89]    Bill Nowicki, Transport Issues in the Network
                         File   System,   In   Computer  Communication
                         Review, pg. 16-20, March 1989.

          [Ousterhout90] John K. Ousterhout, Why Aren't Operating Sys-
                         tems  Getting  Faster As Fast as Hardware? In
                         Proc.  Summer  1990  USENIX  Conference,  pg.
                         247-256, Anaheim, CA, June 1990.

          [Sandberg85]   Russel Sandberg, David Goldberg, Steve  Klei-
                         man,  Dan  Walsh,  and  Bob  Lyon, Design and
                         Implementation of the Sun Network filesystem,
                         In Proc. Summer 1985 USENIX Conference, pages
                         119-130, Portland, OR, June 1985.

          [Srinivasan89] V. Srinivasan and Jeffrey. C. Mogul, Spritely
                         NFS:  Experiments with Cache-Consistency Pro-
                          tocols, In Proc. of the Twelfth ACM Symposium
                          on  Operating  Systems Principles, Litchfield
                         Park, AZ, Dec. 1989.

          [Steiner88]    J.  G.  Steiner,  B.  C.  Neuman  and  J.  I.
                         Schiller, Kerberos: An Authentication Service
                         for Open Network  Systems,  In  Proc.  Winter
                         1988  USENIX Conference, pg. 191-202, Dallas,
                         TX, February 1988.

          [SUN89]        Sun Microsystems Inc., NFS: Network File Sys-
                         tem  Protocol  Specification, ARPANET Working
                         Group  Requests  for  Comment,  DDN   Network
                         Information  Center, SRI International, Menlo
                         Park, CA, March 1989, RFC-1094.

          [SUN93]        Sun Microsystems Inc., NFS: Network File Sys-
                         tem  Version  3  Protocol  Specification, Sun
                         Microsystems Inc., Mountain  View,  CA,  June
                         1993.

          [Wittle93]     Mark Wittle and Bruce E. Keith,  LADDIS:  The
                         Next Generation in NFS File Server Benchmark-
                         ing, In Proc. Summer 1993 USENIX  Conference,

                         pg. 111-128, Cincinnati, OH, June 1993.

          ____________________
             - NFS is believed to be a trademark of Sun  Microsystems,
          Inc.
             - Prestoserve is a trademark of Legato Systems, Inc.
             - MIPS is a trademark of Silicon Graphics, Inc.
             - DECstation, MicroVAXII and Ultrix are trademarks of Di-
          gital Equipment Corp.
             - Unix is a trademark of Novell, Inc.
