MirOS Manual: 06.nfs(SMM)

                         The 4.4BSD NFS Implementation

                                  Rick Macklem
                              University of Guelph


               The 4.4BSD implementation of the  Network  File  System
          (NFS)[1] is intended to interoperate with other NFS  Version
          2  Protocol (RFC1094) implementations but also allows use of
          an alternate protocol that is hoped to provide  better  per-
          formance in certain environments. This paper will informally
          discuss these various protocol features and their use. There
          is  a  brief  overview  of  the  implementation  followed by
          several sections on various problem areas related to NFS and
          some hints on how to deal with them.

               Not Quite NFS (NQNFS) is an NFS like protocol  designed
          to  maintain  full  cache  consistency  between clients in a
          crash tolerant manner. It is an adaptation of the NFS proto-
          col such that the server supports both NFS and NQNFS clients
          while maintaining full consistency between  the  server  and
          NQNFS   clients.  It  borrows  heavily  from  work  done  on
          Spritely-NFS [Srinivasan89], but  uses  Leases  [Gray89]  to
          avoid the need to recover server state information after a
          server crash.

          1. NFS Implementation

               The 4.4BSD implementation of NFS and the alternate pro-
          tocol  nicknamed  Not Quite NFS (NQNFS) are kernel resident,
          but make use of a few system daemons. The kernel implementa-
          tion  does  not use an RPC library, handling the RPC request
          and reply messages directly in mbuf data areas. NFS inter-
          faces to the network using sockets via the kernel interface
          available in sys/kern/uipc_syscalls.c as sosend(),
          soreceive(), etc. There are connection management routines for
          support of sockets for  connection  oriented  protocols  and
          timeout/retransmit  support  for  datagram  sockets  on  the
          client side. For connection  oriented  transport  protocols,
          such  as  TCP/IP, there is one connection for each client to
          server mount point that is maintained until  an  umount.  If
          the  connection  breaks, the client will attempt a reconnect
          with a new socket. The client side can operate  without  any
          daemons running, but performance will be improved by running
             [1]Network File System (NFS) is  believed  to  be  a  re-
          gistered trademark of Sun Microsystems Inc.


          nfsiod daemons that perform read-aheads  and  write-behinds.
          For the server side to function, the daemons portmap, mountd
          and nfsd must be running. The  mountd  daemon  performs  two
          important functions.

          1)   Upon startup and after a hangup  signal,  mountd  reads
               the  exports file and pushes the export information for
                each local file system down into the kernel via the
               mount system call.

          2)   Mountd handles remote mount protocol (RFC1094, Appendix
               A) requests.

          The nfsd master daemon forks off  children  that  enter  the
          kernel via the nfssvc system call. The children normally
          remain kernel resident, providing a process context for  the
          NFS  RPC  servers. The only exception to this is when a Ker-
          beros [Steiner88] ticket is received and at  that  time  the
          nfsd exits the kernel temporarily to verify the ticket via
          the Kerberos libraries and then returns to the  kernel  with
          the results. (This only happens for Kerberos mount points as
          described further under  Security.)  Meanwhile,  the  master
          nfsd waits to accept new connections from clients using con-
          nection oriented transport  protocols  and  passes  the  new
          sockets  down  into  the  kernel.  The client side mount_nfs
          along with portmap and mountd are the only parts of the  NFS
          subsystem that make any use of the Sun RPC library.

          2. Mount Problems

               There are several problems that can be  encountered  at
          the time of an NFS mount, ranging from an unresponsive NFS
          server (crashed, network partitioned from the client, etc.) to
          various interoperability problems between different NFS
          implementations.

               On the server side, if the 4.4BSD NFS  server  will  be
          handling  any  PC clients, mountd will require the -n option
          to enable non-root mount request  servicing.  Running  of  a
          pcnfsd[2] daemon will also be  necessary.  The  server  side
          requires  that  the  daemons  mountd and nfsd be running and
          that they be registered with portmap properly.  If  problems
          are  encountered,  the safest fix is to kill all the daemons
          and then restart them in the order portmap, mountd and nfsd.
          Other  server  side problems are normally caused by problems
          with the format of the exports file, which is covered  under
          Security and in the exports man page.
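          That recovery sequence can be sketched as a small script
          (the daemon paths and the use of pgrep are assumptions;
          adjust both for the system at hand):

```shell
#!/bin/sh
# Sketch of the recovery described above: kill the NFS daemons, then
# restart them in the order portmap, mountd, nfsd, since the latter two
# register themselves with portmap.  Daemon paths and the use of pgrep
# are assumptions; adjust both for the local system.
restart_nfs_daemons() {
    run=${1:-eval}                  # pass "echo" for a dry run
    for d in nfsd mountd portmap; do
        $run "kill \`pgrep $d\` 2>/dev/null"
    done
    for d in portmap mountd nfsd; do
        $run "/usr/sbin/$d"
    done
}

restart_nfs_daemons echo            # dry run: print the commands in order
```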

             [2] Pcnfsd is available in source form from Sun Microsys-
          tems and many anonymous ftp sites.


               On the client side, there  are  several  mount  options
          useful  for  dealing  with server problems. In cases where a
          file system is not critical for  system  operation,  the  -b
          mount option may be specified so that mount_nfs will go into
          the background  for  a  mount  attempt  on  an  unresponsive
          server. This is useful for mounts specified in fstab(5), so
          that the system will not hang during the boot-time mount -a
          when a file server is unresponsive. On the other
          hand, if the file system is critical  to  system  operation,
          this  option should not be used so that the client will wait
          for the server to come up before  completing  bootstrapping.
          There are also three mount options to help deal with intero-
          perability issues with various non-BSD NFS servers.  The  -P
          option  specifies that the NFS client use a reserved IP port
          number to satisfy some  servers'  security  requirements.[3]
          The  -c  option stops the NFS client from doing a connect on
          the UDP socket, so that the mount works  with  servers  that
          send  NFS  replies from port numbers other than the standard
          2049.[4] Finally, the -g=num option sets the maximum size of
          the group list in the credentials passed to an NFS server in
          every RPC request. Although RFC1057 specifies a maximum size
          of  16  for  the  group list, some servers can't handle that
          many. If a user, particularly root doing a mount, keeps get-
          ting  access  denied  from  a  file  server, try temporarily
          reducing the number of groups that user is in to less than 5
          by  editing /etc/group. If the user can then access the file
          system, slowly increase the number of groups for  that  user
          until  the  limit is found and then peg the limit there with
          the -g=num option. This implies that the  server  will  only
          see  the  first  num  groups  that the user is in, which can
          cause some accessibility problems.
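          For illustration, the interoperability options above might be
          used as follows (the server names and paths are hypothetical):

```
mount_nfs -P sun:/export/home /host/sun      # use a reserved port number
mount_nfs -c encore:/src /host/encore        # don't connect the UDP socket
mount_nfs -g=8 oldsrv:/data /host/data       # pass at most 8 groups per RPC
```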

               For sites that have many NFS servers, amd [Pendry93] is
          a  useful administration tool. It also reduces the number of
          actual NFS mount points, alleviating problems with  commands
          such as df(1) that hang when any of the NFS servers is
          unresponsive.

          3. Dealing with Hung Servers

               There are several mount options  available  to  help  a
          client  deal  with  being  hung  waiting for response from a
          crashed or unreachable[5] server. By default, a  hard  mount
          will  continue  to  try to contact the server ``forever'' to
          complete the system call. This type of mount is  appropriate
             [3]Any security benefit of this  is  highly  questionable
          and  as such the BSD server does not require a client to use
          a reserved port number.
             [4]The Encore Multimax is known to require this.
             [5]Due to a network partitioning or similar.


          when processes on the client that access files in  the  file
          system do not tolerate file I/O system calls that return -1
          with errno == EINTR and/or access  to  the  file  system  is
          critical for normal system operation.

          There are two other alternatives:

          1)   A soft mount (-s option) retries an  RPC  n  times  and
               then  the  corresponding  system  call  returns -1 with
               errno set to EINTR. For TCP transport, the  actual  RPC
               request is not retransmitted, but the timeout intervals
               waiting for a reply from the server  are  done  in  the
               same  manner  as UDP for this purpose. The problem with
               this type of mount is that  most  applications  do  not
               expect an EINTR error return from file I/O system calls
               (since it never occurs for a local file system) and get
               confused  by the error return from the I/O system call.
               The option -x=num is used to set the  RPC  retry  limit
               and if set too low, the error returns will start occur-
               ring whenever the NFS server is slow due to heavy load.
               Alternately,  a  large retry limit can result in a pro-
               cess hung for a long time, due to a crashed  server  or
               network partitioning.

          2)   An interruptible mount (-i option) checks to see  if  a
               termination  signal  is  pending  for  the process when
               waiting for server response and if it is, the I/O  sys-
               tem  call  posts an EINTR. Normally this results in the
               process being terminated by the signal  when  returning
               from the system call. This feature allows you to ``^C''
               out of processes that  are  hung  due  to  unresponsive
               servers. The problem with this approach is that signals
               that are caught by a process are not recognized as ter-
               mination signals and the process will remain hung.[6]
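          For example (hypothetical server and paths), the two
          alternatives might be mounted as:

```
mount_nfs -s -x=3 server:/scratch /scratch   # soft: give up after 3 RPC retries
mount_nfs -i server:/home /home              # interruptible: ``^C'' can escape
```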

          4. RPC Transport Issues

               The NFS Version 2 protocol runs over  UDP/IP  transport
          by   sending   each  Sun  Remote  Procedure  Call  (RFC1057)
          request/reply message in a single UDP  datagram.  Since  UDP
          does  not  guarantee datagram delivery, the Remote Procedure
          Call (RPC) layer times out and retransmits an RPC request if
          no  RPC  reply  has  been  received.  Since  this round trip
          timeout (RTO) value is for the entire RPC operation, includ-
          ing  RPC  message transmission to the server, queuing at the
          server for an nfsd, performing the RPC and sending  the  RPC
          reply  message back to the client, it can be highly variable
             [6]Unfortunately, there are also some resource allocation
          situations  in  the  BSD kernel where the termination signal
          will be ignored and the process will not terminate.


          for even a moderately loaded NFS server. As  a  result,  the
          RTO interval must be a conservative (large) estimate, in
          order to avoid extraneous RPC request retransmits.[7]  Also,
          with  an  8Kbyte  read/write  data  size  (the default), the
          read/write reply/request will be  an  8+Kbyte  UDP  datagram
          that  must  normally  be  fragmented  at  the  IP  layer for
          transmission.[8]  For  IP  fragments  to   be   successfully
          reassembled  into  the  IP  datagram at the receive end, all
          fragments must be received within a fairly short  ``time  to
          live''.  If  one  fragment  is  lost/damaged in transit, the
          entire RPC must be retransmitted and  redone.  This  problem
          can  be  exaggerated  by a network interface on the receiver
          that cannot handle the reception of  back  to  back  network
          packets. [Kent87a]

               There are several tuning mount options  on  the  client
          side  that can prove useful when trying to alleviate perfor-
          mance problems related to UDP  RPC  transport.  The  options
          -r=num  and  -w=num  specify  the maximum read or write data
          size respectively. The size num should be a power of 2  (4K,
          2K,  1K)  and  adjusted downward from the maximum of 8Kbytes
          whenever IP fragmentation  is  causing  problems.  The  best
          indicator  of  IP  fragmentation  problems  is a significant
          number of fragments dropped after timeout  reported  by  the
          ip:  section of a netstat -s command on either the client or
          server. Of course, if the fragments are being dropped at the
          server,  it  can  be  fun  figuring  out which client(s) are
          involved. The most likely candidates are  clients  that  are
          not  on  the  same  local area network as the server or have
          network interfaces that do not receive several back to  back
          network packets properly.
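          The check for fragment drops can be scripted; this sketch
          parses sample netstat -s style text (the counters below are
          made up, and the exact wording varies between systems):

```shell
#!/bin/sh
# Pull the "fragments dropped after timeout" counter out of netstat -s
# style output.  The sample below is made up; in real use, feed it the
# output of `netstat -s` on the client or the server.
sample='ip:
        184214 total packets received
        27 fragments dropped after timeout'

dropped=$(echo "$sample" | awk '/fragments dropped after timeout/ {print $1}')
echo "fragments dropped after timeout: $dropped"
```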

               By default, the 4.4BSD NFS client dynamically estimates
          the retransmit timeout interval for the RPC and this appears
          to work reasonably well for many environments. However,  the
          -d  flag can be specified to turn off the dynamic estimation
          of retransmit timeout, so that the client will use a  static
          initial timeout interval.[9] The -t=num option can  be  used
          with  -d  to  set the initial timeout interval to other than
          the default of 2 seconds. The best  indicator  that  dynamic
          estimation  should  be  turned  off  would  be a significant

             [7]At best, an  extraneous  RPC  request  retransmit  in-
          creases  the  load  on the server and at worst can result in
          damaged files on the server  when  non-idempotent  RPCs  are
          redone [Juszczak89].
             [8]6 IP fragments for an Ethernet, which has a maximum
          transmission unit of 1500 bytes.
             [9]After the first retransmit timeout, the initial inter-
          val is backed off exponentially.


          number[10] in the X Replies field and a large number in  the
          Retries  field  in  the Rpc Info: section as reported by the
          nfsstat command. On the server, there would be significant
          numbers of Inprog recent request cache hits in the Server
          Cache Stats: section, as reported by nfsstat run on the
          server.

               The tradeoff is that a smaller timeout interval results
          in  a  better  average  RPC response time, but increases the
          risk of extraneous retries that in turn increase server load
          and  the  possibility  of damaged files on the server. It is
          probably best to err on the safe side and use  a  large  (>=
          2sec)  fixed timeout if the dynamic retransmit timeout esti-
          mation seems to be causing problems.

               An alternative to all this fiddling is to run NFS  over
          TCP transport instead of UDP. Since the 4.4BSD TCP implemen-
          tation provides reliable delivery with  congestion  control,
          it avoids all of the above problems. It also permits the use
          of read and write data sizes greater than the  8Kbyte  limit
          for UDP transport.[11] NFS over TCP usually delivers
          performance comparable to, or significantly better than,
          NFS over UDP unless the client or server processor runs at less than 5-
          10MIPS.  For  a  slow  processor,  the extra CPU overhead of
          using TCP transport will become significant  and  TCP  tran-
          sport may only be useful when the client to server intercon-
          nect traverses congested gateways.  The  main  problem  with
          using TCP transport is that it is only supported between BSD
          clients and servers.[12]
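          As a summary example (hypothetical names), the transport
          tuning options discussed above look like:

```
mount_nfs -r=4096 -w=4096 server:/home /home  # halve the UDP read/write size
mount_nfs -d -t=4 server:/home /home          # static initial timeout, no dynamic RTO
mount_nfs -T server:/home /home               # NFS over TCP transport
```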

          5. Other Tuning Tricks

               Another mount option that may improve performance  over
          certain  network  interconnects  is  -a=num  which  sets the
          number of blocks that the system will attempt to  read-ahead
          during  sequential reading of a file. The default value of 1
          seems to be appropriate for most situations,  but  a  larger
          value  might  achieve  better  performance for some environ-
          ments, such as a mount to a server across a ``high bandwidth
          * round trip delay'' interconnect.
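          For example (hypothetical names), a mount across such an
          interconnect might use:

```
mount_nfs -a=4 -T farsrv:/archive /archive    # 4 blocks of read-ahead
```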

               For the adventurous,  playing  with  the  size  of  the
          buffer   cache   can   also  improve  performance  for  some
             [10]Even 0.1% of the total RPCs is probably significant.
             [11]Read/write  data  sizes greater than 8Kbytes will not
          normally improve  performance  unless  the  kernel  constant
          MAXBSIZE  is increased and the file system on the server has
          a block size greater than 8Kbytes.
             [12]There are rumors of commercial NFS over TCP implemen-
          tations on the horizon and these may well be worth exploring.


          environments that use NFS heavily. Under some  workloads,  a
          buffer  cache of 4-6Mbytes can result in significant perfor-
          mance improvements over 1-2Mbytes, both in client side  sys-
          tem  call  response  time  and  reduced server RPC load. The
          buffer cache size defaults to 10% of  physical  memory,  but
          this  can be overridden by specifying the BUFPAGES option in
          the machine's config file.[13] When increasing the  size  of
          BUFPAGES,  it  is  also  advisable to increase the number of
          buffers NBUF by a corresponding amount. Note that there is a
          tradeoff  of  memory  allocated  to  the buffer cache versus
          available for paging, which implies that making  the  buffer
          cache larger will increase paging rate, with possibly disas-
          trous results.
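          The footnoted arithmetic can be worked as follows (the 4096
          byte NBPG is an assumed page size for the machine at hand):

```shell
#!/bin/sh
# Footnote arithmetic: BUFPAGES * NBPG = buffer cache size in bytes.
# NBPG (the machine page size) is assumed to be 4096 bytes here.
NBPG=4096
BUFPAGES=$((6 * 1024 * 1024 / NBPG))    # pages for a 6Mbyte buffer cache
echo "options BUFPAGES=$BUFPAGES"       # line to add to the config file
```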

          6. Security Issues

               When a machine is running an NFS server it opens  up  a
          great  big  security  hole.  For  ordinary  NFS,  the server
          receives client credentials in the RPC request as a user  id
          and a list of group ids and trusts them to be authentic! The
          only tool available for restricting remote access to file
          systems is the exports(5) file, so file systems should be
          exported with great care. The exports file is read by mountd
          upon  startup and after a hangup signal is posted for it and
          then as much of the access specifications  as  possible  are
          pushed  down  into  the  kernel  for use by the nfsd(s). The
          trick here is that the kernel information is stored on a per
          local  file system mount point and client host address basis
          and cannot refer to individual directories within the  local
          server  file system. It is best to think of the exports file
          as referring to the various local file systems and not  just
          directory  paths as mount points. A local file system may be
          exported to a specific host, all hosts that match  a  subnet
          mask  or  all  other  hosts  (the world). The latter is very
          dangerous and should only be used for public information. It
          is  also  strongly recommended that file systems exported to
          ``the world'' be exported read-only. For each host or  group
          of  hosts,  the  file  system  can  be exported read-only or
          read/write. You can also define one of three client user  id
          to  server  credential mappings to help control access. Root
          (user id == 0) can be mapped  to  some  default  credentials
          while  all  other  user  ids  are  accepted as given. If the
          default credentials for user id equal zero  are  root,  then
          there is essentially no remapping. Most NFS file systems are
          exported this way, most commonly mapping user id == 0 to the
          credentials  for  the  user nobody. Since the client user id
          and group id list is used unchanged on  the  server  (except
             [13]BUFPAGES is the number of physical machine pages
          allocated to the buffer cache, i.e. BUFPAGES * NBPG = buffer
          cache size in bytes.


          for root), this also implies that the user id and  group  id
          space must be common between the client and server (i.e.
          user id N on the client must refer to the same user on the
          server). All user ids can be mapped to a default set of
          credentials, typically that of the user nobody. This essen-
          tially gives world access to all users on the corresponding
          file systems.
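          A hypothetical exports(5) file illustrating these export
          styles (all host names, networks and paths are made up):

```
# /etc/exports (hypothetical)
/usr        -ro                                   # world, read-only: public data
/home       -maproot=nobody  clienta clientb      # root mapped to nobody
/projects   -network 131.104.48 -mask 255.255.248.0   # one subnet, read/write
```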

               There is also a non-standard BSD  -kerb  export  option
          that  requires  the client provide a KerberosIV rcmd service
          ticket to authenticate the user on the server.  If  success-
          ful,  the  Kerberos  principal  is looked up in the server's
          password and group databases to get a set of credentials and
          a  map of client userid to these credentials is then cached.
          The use of TCP transport is strongly recommended, since  the
          scheme  depends  on  the  TCP  connection  to  avert  replay
          attempts. Unfortunately, this option is only usable  between
          BSD  clients  and  servers  since  it is not compatible with
          other known ``kerberized'' NFS systems.  To  enable  use  of
          this  Kerberos option, both mount_nfs on the client and nfsd
          on the server must be rebuilt with the -DKERBEROS option and
          linked  to  KerberosIV  libraries.  The  file system is then
          exported to the client(s)  with  the  -kerb  option  in  the
          exports  file  on  the server and the client mount specifies
          the -K and -T options. The -m=realm mount option may be used
          to  specify  a Kerberos Realm for the ticket (it must be the
          Kerberos Realm  of  the  server)  that  is  other  than  the
          client's  local  Realm.  To  access  files  in a -kerb mount
          point, the user must have  a  valid  TGT  for  the  server's
          Realm, as provided by kinit or similar.
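          Putting the Kerberos pieces together (the host, path and
          Realm names here are hypothetical):

```
# In /etc/exports on the server:
/home  -kerb  clienta clientb

# On the client (TCP transport recommended to avert replays):
mount_nfs -K -T -m=SERVER.REALM server:/home /home
```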

               As  well  as  the  standard  NFS  Version  2   protocol
          (RFC1094)  implementation,  BSD systems can use a variant of
          the protocol called Not Quite NFS (NQNFS)  that  supports  a
          variety  of  protocol  extensions.  This protocol uses 64bit
          file offsets and sizes, an access rpc, an append  option  on
          the write rpc and extended file attributes to support 4.4BSD
          file system functionality more fully. It also makes use of a
          variant  of  short  term  leases [Gray89] with delayed write
          client caching, in an effort  to  provide  full  cache  con-
          sistency  and better performance. This protocol is available
          between 4.4BSD systems only and is used when  the  -q  mount
          option  is  specified.  It  can  be  used  with  any  of the
          aforementioned options for NFS, such as TCP  transport  (-T)
          and  KerberosIV  authentication (-K). Although this protocol
          is experimental, it  is  recommended  over  NFS  for  mounts
          between 4.4BSD systems.[14]
             [14]I would appreciate email from anyone who can  provide
          NFS  vs.  NQNFS  performance measurements, particularly fast
          clients, many clients or  over  an  internetwork  connection
          with a large ``bandwidth * RTT'' product.
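          An NQNFS mount between two 4.4BSD systems might then be done
          as (hypothetical names):

```
mount_nfs -q -T bsdsrv:/usr/src /usr/src      # NQNFS, over TCP transport
```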


          7. Monitoring NFS Activity

               The  basic  command  for  monitoring  NFS  activity  on
          clients  and  servers  is  nfsstat.  It  reports  cumulative
          statistics of various NFS activities, such as counts of  the
          various different RPCs and cache hit rates on the client and
          server. Of particular interest on the server are the  fields
          in  the Server Cache Stats: section, which gives numbers for
          RPC retries received in the first  three  fields  and  total
          RPCs  in  the fourth. The first three fields should remain a
          very small percentage of the total. If not, it  would  indi-
          cate  one or more clients doing retries too aggressively and
          the fix would be  to  isolate  these  clients,  disable  the
          dynamic  RTO  estimation  on  them  and  make  their initial
          timeout interval a conservative (ie. large) value.
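          That rule of thumb can be computed mechanically; the sketch
          below uses made-up Server Cache Stats: numbers in place of
          real nfsstat output, and the field layout is an assumption to
          be checked against nfsstat on the local system:

```shell
#!/bin/sh
# Fraction of received RPCs that were retries, from the four fields of
# the Server Cache Stats: section.  The numbers are made up; substitute
# real nfsstat output, and check the field layout on the local system.
sample='Server Cache Stats:
Inprog     Idem       Non-idem   Misses
12         3          1          150000'

rate=$(echo "$sample" | awk 'NR == 3 { printf "%.3f", 100 * ($1 + $2 + $3) / $4 }')
echo "retry rate: ${rate}%"             # even 0.1% is probably significant
```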

               On the client side, the fields in the Rpc Info: section
          are  of particular interest, as they give an overall picture
          of NFS activity. The TimedOut field is  the  number  of  I/O
          system calls that returned -1 for ``soft'' mounts and can be
          reduced by increasing the retry limit or changing the  mount
          type  to  ``intr'' or ``hard''. The Invalid field is a count
          of trashed RPC replies that are received and  should  remain
          zero.[15] The X Replies field counts the number of  repeated
          RPC  replies received from the server and is a clear indica-
          tion of a too aggressive RTO estimate. Unfortunately, a good
          NFS  server  implementation  will  use  a  ``recent  request
          cache''  [Juszczak89]  that  will  suppress  the  extraneous
          replies.  A large value for Retries indicates a problem, but
          it could be any of:

          +    a too aggressive RTO estimate

          +    an overloaded NFS server

          +    IP fragments being dropped (gateway, client or server)

          and requires further investigation. The  Requests  field  is
          the total count of RPCs done on all servers.

               The netstat -s command comes in useful during investigation of
          RPC  transport  problems.  The field fragments dropped after
          timeout in the ip: section indicates IP fragments are  being
          lost  and  a significant number of these occurring indicates
          that the use of TCP transport or a smaller  read/write  data
          size  is  in  order.  A  significant number of bad checksums
          reported in the udp: section would suggest network problems
          of a more generic sort (cabling, transceiver or network
             [15]Some NFS implementations run with UDP checksums  dis-
          abled, so garbage RPC messages can be received.


          hardware interface problems or similar).

               There is a RPC activity logging facility for  both  the
          client  and  server  side  in  the  kernel.  When logging is
          enabled by setting the kernel variable nfsrtton to one,  the
          logs  in  the kernel structures nfsrtt (for the client side)
          and nfsdrt (for the server side) are updated upon  the  com-
          pletion of each RPC in a circular manner. The pos element of
          the structure is the index of the next element  of  the  log
          array  to  be  updated.  In other words, elements of the log
          array from log[pos] to log[pos -  1]  are  in  chronological
          order.  The  include file <sys/nfsrtt.h> should be consulted
          for details on the fields in the two log structures.[16]
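          The circular read-out order can be illustrated with toy
          values (the log size and pos below are made up, not the real
          kernel constants):

```shell
#!/bin/sh
# Toy illustration of the circular log: with 8 slots and pos == 3, the
# chronological order runs log[3]..log[7], then log[0]..log[2].  The
# real sizes live in <sys/nfsrtt.h>.
order=$(awk 'BEGIN {
    n = 8; pos = 3
    for (i = 0; i < n; i++)
        out = out (i ? " " : "") ((pos + i) % n)
    print out
}')
echo "chronological slot order: $order"
```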

          8. Diskless Client Support

               The  NFS  client  does  include  kernel   support   for
          diskless/dataless operation where the root file system and
          optionally the swap area are remote NFS mounted. A
          diskless/dataless  client  is  configured using a version of
          the ``swapvmunix.c''  file  as  provided  in  the  directory
          contrib/diskless.nfs. If the swap device == NODEV, it speci-
          fies an NFS mounted swap area and should be  configured  the
          same  size  as  set  up  by  diskless_setup  when run on the
          server.    This    file    must    be     put     in     the
          sys/compile/<machine_name>  kernel build directory after the
          config command has been run,  since  config  does  not  know
          about  specifying  NFS root and swap areas. The kernel vari-
          able mountroot must  be  set  to  nfs_mountroot  instead  of
          ffs_mountroot  and the kernel structure nfs_diskless must be
          filled in properly. There are some primitive system adminis-
          tration  tools  in  the  contrib/diskless.nfs  directory  to
          assist in filling in the nfs_diskless structure and in  set-
          ting  up  an  NFS  server for diskless/dataless clients. The
          tools were designed to provide a bare bones  capability,  to
          allow maximum flexibility when setting up different servers.

          The tools are as follows:

          +    diskless_offset.c  -  This  little  program   reads   a
               ``vmunix''  object file and writes the file byte offset
               of the nfs_diskless structure in it to standard out. It
               was  kept  separate  because  it  sometimes  has  to be
               compiled/linked in funny ways depending on  the  client
               architecture. (See the comment at the beginning of it.)

          +    diskless_setup.c - This program is run  on  the  server
               and  sets  up  files for a given client. It mostly just
             [16]Unfortunately, a monitoring tool that uses these logs
          is still in the planning (dreaming) stage.


                fills in an nfs_diskless structure and writes it out to
                either the "vmunix" file or a separate file called
                "setup.<official-hostname>".

          +    diskless_boot.c - There are two functions in here  that
               may be used by a bootstrap server such as tftpd to per-
               mit sharing of the ``vmunix'' object file  for  similar
               clients.  This saves disk space on the bootstrap server
                and simplifies organization, but is not critical for
                correct operation. They read the ``vmunix'' file, but
               optionally fill in the nfs_diskless  structure  from  a
               separate "setup.<official-hostname>" file so that there
               is only one copy of "vmunix" for all similar (same arch
               etc.)  clients.  These functions use a text file called
                /var/diskless/boot.<official-hostname> to control the
                bootstrap.

          The basic setup steps are:

          +    make a "vmunix" for the client(s) with  mountroot()  ==
               nfs_mountroot()  and swdevt[0].sw_dev == NODEV if it is
               to do nfs swapping as well (see the sample swapvmunix.c)

          +    run diskless_offset on the vmunix file to find out  the
               byte offset of the nfs_diskless structure

          +    Run diskless_setup on the server to set up  the  server
               and fill in the nfs_diskless structure for that client.
               The nfs_diskless structure can either be  written  into
               the vmunix file (the -x option) or saved in a  separate
               "setup.<official-hostname>" file.

          +    Set up the bootstrap server. If the nfs_diskless struc-
               ture  was written into the ``vmunix'' file, any vanilla
               bootstrap protocol such as bootp/tftp can be  used.  If
               the bootstrap server has been modified to use the func-
               tions  in   diskless_boot.c,   then   a   file   called
               /var/diskless/boot.<official-hostname> must be created.
               It is simply a two line text file, where the first line
               is  the pathname of the correct ``vmunix'' file and the
               second line has the pathname of the nfs_diskless struc-
               ture file and its byte offset in it.

          +    Create a /var subtree for each client in an appropriate
               place       on       the      server,      such      as
               /var/diskless/var/<client-hostname>/...  By  using  the
               <client-hostname>  to differentiate /var for each host,
               /etc/rc can be modified to mount the correct /var  from
               the server.


          9. Not Quite NFS, Crash Tolerant Cache Consistency for NFS

               Not Quite NFS (NQNFS) is an NFS like protocol  designed
          to  maintain  full  cache  consistency  between clients in a
          crash tolerant manner. It is an adaptation of the NFS proto-
          col such that the server supports both NFS and NQNFS clients
          while maintaining full consistency between  the  server  and
          NQNFS  clients.  This section borrows heavily from work done
          on Spritely-NFS [Srinivasan89], but uses Leases [Gray89]  to
          avoid  the  need to recover server state information after a
          crash. The reader  is  strongly  encouraged  to  read  these
          references before trying to grasp the material presented.

          9.1. Overview

               The protocol maintains cache  consistency  by  using  a
          somewhat  Sprite  [Nelson88]  like protocol, but is based on
          short term leases[17]  instead  of  hard  state  information
          about open files. The basic principle is that  the  protocol
          will  disable client caching of a file whenever that file is
          write shared[18]. Whenever a client wishes to cache data for
          a  file it must hold a valid lease. There are three types of
          leases: read caching, write  caching  and  non-caching.  The
          latter  type  requires that all file operations be done syn-
          chronously with the server via RPCs. A  read  caching  lease
          allows  for  client  data caching, but no file modifications
          may be done. A write caching lease allows for client caching
          of  writes,  but  requires  that all writes be pushed to the
          server when  the  lease  expires.  If  a  client  has  dirty
          buffers[19] when a write cache lease has almost expired,  it
          will attempt to extend the lease but is required to push the
          dirty buffers if extension fails. A client  gets  leases  by
          either  doing  a  GetLease RPC or by piggybacking a GetLease
          Request onto another RPC. Piggybacking is supported for  the
          frequent  RPCs  Getattr,  Setattr,  Lookup,  Readlink, Read,
          Write and Readdir in an effort to  minimize  the  number  of
          GetLease RPCs required. All leases are at the granularity of
          a file, since all NFS RPCs operate on individual  files  and
          NFS  has  no  intrinsic  notion  of a file hierarchy. Direc-
          tories, symbolic links  and  file  attributes  may  be  read
          cached  but  are not write cached. The exception here is the
          attribute file_size, which is updated during cached  writing
          on the client to reflect a growing file.
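               As a rough sketch of this caching rule (the names below
          are illustrative, not the protocol's actual identifiers):

```c
#include <stdbool.h>

/* Illustrative lease kinds; not the protocol's cachetype constants. */
enum lease_kind { LEASE_NONCACHING, LEASE_READCACHE, LEASE_WRITECACHE };

/*
 * May the client satisfy this operation from its cache?  A write
 * caching lease covers cached reads and writes, a read caching lease
 * covers cached reads only, and a non-caching lease forces all file
 * operations to be done synchronously with the server via RPCs.
 */
bool
may_cache(enum lease_kind lease, bool writing)
{
	switch (lease) {
	case LEASE_WRITECACHE:
		return true;		/* cached reads and writes */
	case LEASE_READCACHE:
		return !writing;	/* cached reads only */
	default:
		return false;		/* all operations synchronous */
	}
}
```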
             [17] A lease is a ticket permitting an activity  that  is
          valid until some expiry time.
             [18] Write sharing occurs when at  least  one  client  is
          modifying a file while other client(s) are reading the file.
             [19] Cached write data is not yet pushed (written) to the
          server.

               It is the server's responsibility to ensure  that  con-
          sistency  is maintained among the NQNFS clients by disabling
          client caching whenever a server file operation would  cause
          inconsistencies.  The  possibility of inconsistencies occurs
          whenever a client has a write caching lease  and  any  other
          client,  or  local operations on the server, tries to access
          the file or when a modify operation is attempted on  a  file
          being  read  cached  by  client(s). At this time, the server
          sends an eviction notice to all clients  holding  the  lease
          and  then  waits  for  lease  termination. Lease termination
          occurs when a vacated the premises message has been received
          from  all the clients that have signed the lease or when the
          lease expires via timeout. The message pair eviction  notice
          and  vacated  the  premises  roughly  correspond to a Sprite
          server->client callback,  but  are  not  implemented  as  an
          actual  RPC,  to avoid the server waiting indefinitely for a
          reply from a dead client.

               Server consistency checking can be  viewed  as  issuing
          intrinsic  leases  for  a file operation for the duration of
          the operation only. For example, the Create RPC will get  an
          intrinsic  write lease on the directory in which the file is
          being created, disabling client read caches for that  direc-
          tory.

               By relegating this responsibility to the  server,  con-
          sistency  between the server and NQNFS clients is maintained
          when NFS clients are modifying the file system as well.[20]

               The leases are issued as time intervals  to  avoid  the
          requirement  of time of day clock synchronization. There are
          three important time constants  known  to  the  server.  The
          maximum_lease_term  sets  an  upper bound on lease duration.
          The clock_skew is added to all lease terms on the server  to
          correct  for  differing  clock speeds between the client and
          server and write_slack is the number of seconds  the  server
          is  willing to wait for a client with an expired write cach-
          ing lease to push dirty writes.
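               The interplay of these three constants can be  sketched
          as  follows;  the  names  mirror  the text but the code is a
          sketch, not the actual implementation:

```c
#include <stdbool.h>

/*
 * Server-side lease expiry: the requested duration is bounded above
 * by maximum_lease_term, and clock_skew is added to all lease terms
 * to correct for differing clock speeds between client and server.
 */
long
lease_expiry(long issued, unsigned duration, unsigned clock_skew,
    unsigned maximum_lease_term)
{
	if (duration > maximum_lease_term)
		duration = maximum_lease_term;
	return issued + duration + clock_skew;
}

/*
 * The server waits a further write_slack seconds after a write
 * caching lease expires for the client to push its dirty writes.
 */
bool
write_lease_reclaimable(long now, long expiry, unsigned write_slack)
{
	return now > expiry + (long)write_slack;
}
```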

               The server maintains a modify_revision number for  each
          file. It is defined as an unsigned quadword integer that  is
          never zero and that must increase whenever the corresponding
          file  is modified on the server. It is used by the client to
          determine whether or not cached data for the file is  stale.
          Generating  this value is easier said than done. The current
          implementation  uses  the  following  technique,  which   is
          believed  to  be adequate. The high order longword is stored
          in the ufs inode and is initialized to one when an inode  is
          first  allocated. The low order longword is stored  in  main
          memory only and is initialized to zero when an inode is read
          in from disk. When the file is modified for the  first  time
          within  a  given  second  of wall clock time, the high order
          longword is incremented by one and the  low  order  longword
          reset  to zero. For subsequent modifications within the same
          second of wall clock time, the low order longword is  incre-
          mented.  If the low order longword wraps around to zero, the
          high order longword is incremented  again.  Since  the  high
          order longword only increments once per second and the inode
          is pushed to disk frequently during file modification,  this
          implies 0 ≤ Current-Disk ≤ 5. When the inode is read in from
          disk, 10 is added to the high order longword, which  ensures
          that  the  quadword  is greater than any value it could have
          had before a crash. This introduces  apparent  modifications
          every  time  the inode falls out of the LRU inode cache, but
          this should only reduce the client caching performance by  a
          (hopefully) small margin.
             [20] The NFS clients will continue  to  be  approximately
          consistent with the server.
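               The revision generation technique can be sketched in  C
          roughly as follows; the structure and function names are il-
          lustrative, not the actual 4.4BSD kernel code:

```c
#include <stdint.h>

/* Illustrative sketch of the modify_revision scheme described above. */
struct modrev {
	uint32_t hi;		/* stored in the ufs inode */
	uint32_t lo;		/* in main memory only */
	long     lastsec;	/* wall clock second of last modification */
};

/* Called when the inode is read in from disk. */
void
modrev_init(struct modrev *mr, uint32_t disk_hi)
{
	mr->hi = disk_hi + 10;	/* ensure > any pre-crash value */
	mr->lo = 0;		/* low longword zeroed on read-in */
	mr->lastsec = -1;
}

/* Called on every file modification; "now" is wall clock seconds. */
void
modrev_bump(struct modrev *mr, long now)
{
	if (now != mr->lastsec) {
		/* first modification within this second */
		mr->hi++;
		mr->lo = 0;
		mr->lastsec = now;
	} else if (++mr->lo == 0) {
		/* low order longword wrapped around to zero */
		mr->hi++;
	}
}

/* The never-zero, monotonically increasing quadword revision. */
uint64_t
modrev_value(const struct modrev *mr)
{
	return ((uint64_t)mr->hi << 32) | mr->lo;
}
```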

          9.2. Crash Recovery and other Failure Scenarios

               The server must maintain the state of all  the  current
          leases  held  by  clients.  The  nice thing about short term
          leases is that maximum_lease_term seconds after  the  server
          stops  issuing  leases, there are no current leases left. As
          such, server crash  recovery  does  not  require  any  state
          recovery. After rebooting, the server refuses to service any
          RPCs except for writes until write_slack seconds  after  the
          last lease would have expired[21]. By then, the server would
          not  have any outstanding leases to recover the state of and
          the clients have had at least write_slack  seconds  to  push
          dirty  writes  to the server and get the server sync'd up to
          date. After this, the server simply services requests  in  a
          manner  similar  to NFS. In an effort to minimize the effect
          of  "recovery  storms"   [Baker91],   the   server   replies
          try_again_later to the RPCs it is not yet ready to service.
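               Combined with the estimate from  footnote  [21],  the
          reboot rule amounts to something like the following  sketch
          (illustrative names):

```c
#include <stdbool.h>

/*
 * After rebooting, the server refuses non-write RPCs (replying
 * try_again_later) until write_slack seconds after the last lease
 * could have expired.  The last expiry may be safely estimated as
 * boottime + maximum_lease_term + clock_skew.
 */
bool
server_ready(long now, long boottime, unsigned maximum_lease_term,
    unsigned clock_skew, unsigned write_slack)
{
	long last_expiry = boottime + maximum_lease_term + clock_skew;

	return now >= last_expiry + (long)write_slack;
}
```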

               After a client crashes, the server may have to wait for
          a lease to timeout before servicing a request if write shar-
          ing of a file with a cachable lease on the client  is  about
          to occur. As for the client, it simply starts up getting any
          leases it now needs. Any outstanding leases for that  client
          on  the  server prior to the crash will either be renewed or
          expire via timeout.

             [21] The last lease expiry time may be  safely  estimated
          as   "boottime+maximum_lease_term+clock_skew"  for  machines
          that cannot store it in nonvolatile RAM.

               Certain network partitioning failures  are  more  prob-
          lematic. If a client to server network connection is severed
          just before a write caching lease expires, the client cannot
          push the dirty writes to the server. After the lease expires
          on the server, the server permits other  clients  to  access
          the file with the potential of getting stale data.  Unfortu-
          nately, I believe this failure scenario is intrinsic to  any
          delayed write caching scheme unless the server is  required
          to
          wait forever for a client to regain contact[22].  Since  the
          write  caching lease has expired on the client, it will sync
          up with the server as soon as  the  network  connection  has
          been re-established.

               There is another failure condition that can occur  when
          the  server is congested. The worst case scenario would have
          the client pushing dirty writes to the server  but  a  large
          request  queue  on  the  server delays these writes for more
          than write_slack seconds. It is hoped that a congestion con-
          trol  scheme using the try_again_later RPC reply after boot-
          ing combined with the following lease termination  rule  for
          write   caching   leases  can  minimize  the  risk  of  this
          occurrence. A write caching lease is only terminated on  the
          server  when  there  have been no writes to the file and the
          server has  not  been  overloaded  during  the  previous
          write_slack seconds. Whether the server has been overloaded
          is approximated by a test for sleeping nfsd(s) at the end of
          the write_slack period.
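               This termination rule amounts to roughly the following
          predicate (illustrative names):

```c
#include <stdbool.h>

/*
 * A write caching lease may be terminated only when there have been
 * no writes to the file and the server has not been overloaded during
 * the previous write_slack seconds.  "Not overloaded" is approximated
 * by finding nfsd(s) asleep at the end of the write_slack period.
 */
bool
can_terminate_write_lease(long now, long last_write,
    bool nfsds_sleeping, unsigned write_slack)
{
	return now - last_write >= (long)write_slack && nfsds_sleeping;
}
```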

          9.3. Server Disk Full

               There is a serious unresolved problem for delayed write
          caching  with  respect to server disk space allocation. When
          the disk on the file server is full, delayed write RPCs  can
          fail due to "out of space". For NFS, this occurrence results
          in an error return from the close system call on  the  file,
          since  the dirty blocks are pushed on close. Processes writ-
          ing important files can  check  for  this  error  return  to
          ensure  that  the  file was written successfully. For NQNFS,
          the dirty blocks are not pushed on close  and  as  such  the
          client may not attempt the write RPC until after the process
          has done the close, which implies no error return  from  the
          close.  For  the  current prototype, the only solution is to
          modify programs writing important file(s) to call fsync  and
          check for an error return from it instead of close.

          9.4. Protocol Details

               The protocol specification is identical to that of  NFS
          [Sun89] except for the following changes.

             [22] Gray and Cheriton avoid  this  problem  by  using  a
          write through policy.


          +    RPC Information

                           Program Number 300105
                           Version Number 1

          +    Readdir_and_Lookup RPC

                           struct readdirlookargs {
                                   fhandle file;
                                   nfscookie cookie;
                                   unsigned count;
                                    unsigned duration;
                            };

                           struct entry {
                                   unsigned cachable;
                                   unsigned duration;
                                   modifyrev rev;
                                   fhandle entry_fh;
                                   nqnfs_fattr entry_attrib;
                                   unsigned fileid;
                                   filename name;
                                   nfscookie cookie;
                                    entry *nextentry;
                            };

                           union readdirlookres switch (stat status) {
                           case NFS_OK:
                                   struct {
                                           entry *entries;
                                           bool eof;
                                    } readdirlookok;
                            };

                           NQNFSPROC_READDIRLOOK(readdirlookargs) = 18;

               Reads entries in a directory in a manner  analogous  to
               the  NFSPROC_READDIR  RPC  in NFS, but returns the file
               handle and attributes  of  each  entry  as  well.  This
               allows the attribute and lookup caches to be primed.

          +    Get Lease RPC

                           struct getleaseargs {
                                   fhandle file;
                                   cachetype readwrite;
                                    unsigned duration;
                            };


                           union getleaseres switch (stat status) {
                           case NFS_OK:
                                   bool cachable;
                                   unsigned duration;
                                   modifyrev rev;
                                    nqnfs_fattr attributes;
                            };

                           NQNFSPROC_GETLEASE(getleaseargs) = 19;

               Gets a lease for "file" valid  for  "duration"  seconds
               from when the lease was issued on the  server[23].  The
               lease permits client caching if "cachable" is true. The
               modify revision level and attributes for the  file  are
               also returned.

          +    Eviction Message

                           NQNFSPROC_EVICTED (fhandle) = 21;

               This message is sent from the  server  to  the  client.
               When  the  client receives the message, it should flush
               data associated with the file represented by  "fhandle"
               from  its caches and then send the Vacated Message back
               to the server.  Flushing  includes  pushing  any  dirty
                writes via write RPCs.

          +    Vacated Message

                           NQNFSPROC_VACATED (fhandle) = 20;

               This message is sent from the client to the  server  in
               response to the Eviction Message. See above.

          +    Access RPC

                           struct accessargs {
                                   fhandle file;
                                   bool read_access;
                                   bool write_access;
                                    bool exec_access;
                            };
             [23] To be safe, the client  may  only  assume  that  the
          lease  is  valid  for ``duration'' seconds from when the RPC
          request was sent to the server.

                           NQNFSPROC_ACCESS(accessargs) = 22;

               The access RPC does permission checking on  the  server
               for the given type of access required by the client for
               the file. Use of this RPC avoids accessibility problems
               caused by client->server uid mapping.

          +    Piggybacked Get Lease Request

               The  piggybacked  get  lease  request  is  functionally
          equivalent to the Get Lease RPC except that it  is  attached
          to  one  of  the  other  NQNFS  RPC  requests as follows.  A
          getleaserequest is prepended to all of the request arguments
          for NQNFS and a getleaserequestres is inserted  in  all  NFS
          result  structures just after the "stat" field only if "stat
          == NFS_OK".

                      union getleaserequest switch (cachetype type) {
                      case NQLREAD:
                      case NQLWRITE:
                               unsigned duration;
                       };

                      union getleaserequestres switch (cachetype type) {
                      case NQLREAD:
                      case NQLWRITE:
                              bool cachable;
                              unsigned duration;
                               modifyrev rev;
                       };

          The get lease request applies to the file that the  attached
          RPC  operates  on and the file attributes remain in the same
          location as for the NFS RPC reply structure.

          +    Three additional "stat" values

               Three  additional  values  have  been  added   to   the
          enumerated type "stat".


          The "expired" value indicates that a lease has expired.  The
          "try  later"  value is returned by the server when it wishes
          the client to retry the RPC request after a short delay.  It
          is used during crash recovery (Section 9.2)  and  may  also
          be useful for server congestion control. The "authentication
          error"  value  is  returned  for  kerberized mount points to
          indicate that there is no cached authentication mapping  and
          a Kerberos ticket for the principal is required.

          9.5. Data Types

          +    cachetype

                           enum cachetype {
                                   NQLNONE = 0,
                                   NQLREAD = 1,
                                    NQLWRITE = 2
                            };

               Type of lease requested. NQLNONE is used to indicate no
               piggybacked lease request.

          +    modifyrev

                           typedef unsigned hyper modifyrev;

                The "modifyrev" is an unsigned quadword integer  value
               that  is  never  zero  and  increases  every  time  the
               corresponding file is modified on the server.

          +    nqnfs_time

                           struct nqnfs_time {
                                   unsigned seconds;
                                    unsigned nano_seconds;
                            };

                For NQNFS, times are handled at nanosecond  resolution
                instead of the microsecond resolution used by NFS.

          +    nqnfs_fattr

                           struct nqnfs_fattr {
                                   ftype type;
                                   unsigned mode;
                                   unsigned nlink;
                                   unsigned uid;
                                   unsigned gid;
                                   unsigned hyper size;
                                   unsigned blocksize;
                                   unsigned rdev;
                                   unsigned hyper bytes;
                                   unsigned fsid;
                                   unsigned fileid;
                                   nqnfs_time atime;
                                   nqnfs_time mtime;
                                    nqnfs_time ctime;
                                    unsigned flags;
                                    unsigned generation;
                                    modifyrev rev;
                            };

               The nqnfs_fattr structure  is  modified  from  the  NFS
               fattr  so that it stores the file size as a 64bit quan-
               tity and the storage occupied  as  a  64bit  number  of
               bytes. It also has fields added for the 4.4BSD va_flags
                and va_gen fields as well as the file's modify rev.

          +    nqnfs_sattr

                           struct nqnfs_sattr {
                                   unsigned mode;
                                   unsigned uid;
                                   unsigned gid;
                                   unsigned hyper size;
                                   nqnfs_time atime;
                                   nqnfs_time mtime;
                                   unsigned flags;
                                    unsigned rdev;
                            };

               The nqnfs_sattr structure  is  modified  from  the  NFS
               sattr structure in the same manner as fattr.

          The arguments to several of the NFS RPCs have been  modified
          as  well.  Mostly, these are minor changes to use 64bit file
          offsets or similar. The modified argument structures follow.

          +    Lookup RPC

                           struct lookup_diropargs {
                                   unsigned duration;
                                   fhandle dir;
                                    filename name;
                            };

                           union lookup_diropres switch (stat status) {
                           case NFS_OK:
                                   struct {
                                           union getleaserequestres lookup_lease;
                                           fhandle file;
                                           nqnfs_fattr attributes;
                                    } lookup_diropok;
                            };

                The additional "duration" argument tells the server  to
                get a lease for the name being looked up if it is  non-
                zero and the lease is specified in "lookup_lease".

          +    Read RPC

                           struct nqnfs_readargs {
                                   fhandle file;
                                   unsigned hyper offset;
                                    unsigned count;
                            };

          +    Write RPC

                           struct nqnfs_writeargs {
                                   fhandle file;
                                   unsigned hyper offset;
                                   bool append;
                                    nfsdata data;
                            };

                The "append" argument is true  for  append-only  write
                operations.

          +    Get Filesystem Attributes RPC

                            union nqnfs_statfsres switch (stat status) {
                           case NFS_OK:
                                   struct {
                                           unsigned tsize;
                                           unsigned bsize;
                                           unsigned blocks;
                                           unsigned bfree;
                                           unsigned bavail;
                                           unsigned files;
                                           unsigned files_free;
                                    } info;
                            };

               The "files" field is the number of files  in  the  file
               system and the "files_free" is the number of additional
               files that can be created.

          10. Summary

               The configuration and  tuning  of  an  NFS  environment
          tends  to be a bit of a mystic art, but hopefully this paper
          along with the man pages and other reading will be  helpful.
          Good Luck.


          11. Bibliography

          [Baker91]       Mary Baker and John Ousterhout, Availability
                          in  the  Sprite  Distributed File System, In
                          Operating System Review, (25)2,  pg.  95-98,
                          April 1991.

          [Baker91a]      Mary Baker, Private Email Communication,  May
                          1991.

          [Burrows88]     Michael  Burrows,  Efficient  Data  Sharing,
                          Technical  Report #153, Computer Laboratory,
                          University of Cambridge, Dec. 1988.

          [Gray89]        Cary G. Gray and David R. Cheriton,  Leases:
                          An  Efficient  Fault-Tolerant  Mechanism for
                          Distributed File Cache Consistency, In Proc.
                          of  the  Twelfth  ACM Symposium on Operating
                          Systems  Principles,  Litchfield  Park,  AZ,
                          Dec. 1989.

          [Howard88]      John H. Howard, Michael L. Kazar, Sherri  G.
                          Menees, David A. Nichols, M. Satyanarayanan,
                          Robert N. Sidebotham and  Michael  J.  West,
                          Scale  and Performance in a Distributed File
                          System,  ACM  Trans.  on  Computer  Systems,
                          (6)1, pg 51-81, Feb. 1988.

          [Juszczak89]    Chet Juszczak, Improving the Performance and
                          Correctness  of  an  NFS  Server,  In  Proc.
                          Winter 1989 USENIX  Conference,  pg.  53-63,
                          San Diego, CA, January 1989.

          [Keith90]       Bruce E. Keith,  Perspectives  on  NFS  File
                          Server   Performance   Characterization,  In
                          Proc. Summer  1990  USENIX  Conference,  pg.
                          267-277, Anaheim, CA, June 1990.

          [Kent87]        Christopher A.  Kent,  Cache  Coherence  in
                          Distributed  Systems,  Research Report 87/4,
                          Digital   Equipment   Corporation    Western
                          Research Laboratory, April 1987.

          [Kent87a]       Christopher A. Kent and  Jeffrey  C.  Mogul,
                          Fragmentation  Considered  Harmful, Research
                          Report 87/3, Digital  Equipment  Corporation
                          Western Research Laboratory, Dec. 1987.

          [Macklem91]     Rick Macklem,  Lessons  Learned  Tuning  the
                          4.3BSD Reno Implementation of the NFS Proto-
                          col, In Proc. Winter USENIX Conference,  pg.
                          53-64, Dallas, TX, January 1991.


          [Nelson88]      Michael N. Nelson, Brent B. Welch, and  John
                          K. Ousterhout, Caching in the Sprite Network
                          File System, ACM  Transactions  on  Computer
                          Systems (6)1 pg. 134-154, February 1988.

          [Nowicki89]     Bill Nowicki, Transport Issues in  the  Net-
                          work  File System, In Computer Communication
                          Review, pg. 16-20, Vol. 19, Number 2,  April
                          1989.

          [Ousterhout90]  John K.  Ousterhout,  Why  Aren't  Operating
                          Systems  Getting Faster As Fast as Hardware?
                          In Proc. Summer 1990 USENIX Conference,  pg.
                          247-256, Anaheim, CA, June 1990.

          [Pendry93]      Jan-Simon Pendry, 4.4 BSD Automounter Refer-
                          ence  Manual, In src/usr.sbin/amd/doc direc-
                          tory of 4.4 BSD distribution tape.

          [Reid90]        Jim Reid, N(e)FS: the Protocol is the  Prob-
                          lem,  In Proc. Summer 1990 UKUUG Conference,
                          London, England, July 1990.

          [Sandberg85]    Russel Sandberg, David Goldberg, Steve Klei-
                          man,  Dan  Walsh,  and  Bob Lyon, Design and
                          Implementation of the Sun  Network  filesys-
                          tem, In Proc. Summer 1985 USENIX Conference,
                          pages 119-130, Portland, OR, June 1985.

          [Schroeder85]   Michael D. Schroeder, David K.  Gifford  and
                          Roger  M. Needham, A Caching File System For
                          A Programmer's Workstation, In Proc. of  the
                          Tenth  ACM  Symposium  on  Operating Systems
                          Principles, pg.  25-34,  Orcas  Island,  WA,
                          Dec. 1985.

          [Srinivasan89]  V.  Srinivasan  and  Jeffrey   C.    Mogul,
                          Spritely NFS: Implementation and Performance
                          of  Cache-Consistency  Protocols,   Research
                          Report  89/5,  Digital Equipment Corporation
                          Western Research Laboratory, May 1989.

          [Steiner88]     Jennifer G.  Steiner,  Clifford  Neuman  and
                          Jeffrey  I. Schiller, Kerberos: An Authenti-
                          cation Service for Open Network Systems,  In
                          Proc. Winter 1988 USENIX Conference, Dallas,
                          TX, February 1988.

          [Stern]         Hal Stern, Managing NFS  and  NIS,  O'Reilly
                          and Associates, ISBN 0-937175-75-7.

          [Sun87]         Sun Microsystems Inc.,  XDR:  External  Data
                          Representation  Standard,  RFC1014,  Network
                          Information Center, SRI International,  June
                          1987.

          [Sun88]         Sun Microsystems Inc., RPC: Remote Procedure
                          Call   Protocol   Specification  Version  2,
                          RFC1057,  Network  Information  Center,  SRI
                          International, June 1988.

          [Sun89]         Sun Microsystems  Inc.,  NFS:  Network  File
                          System Protocol Specification, ARPANET Work-
                          ing Group Requests for Comment, DDN  Network
                          Information Center, SRI International, Menlo
                          Park, CA, March 1989, RFC-1094.
