Networking Implementation Notes
4.4BSD Edition
Samuel J. Leffler, William N. Joy, Robert S. Fabry, and Michael J. Karels
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
ABSTRACT
This report describes the internal structure
of the networking facilities developed for the
4.4BSD version of the UNIX* operating system for
the VAX†. These facilities are based on several
central abstractions which structure the external
(user) view of network communication as well as
the internal (system) implementation.
The report documents the internal structure
of the networking system. The ``Berkeley Software
Architecture Manual, 4.4BSD Edition'' (PSD:5)
provides a description of the user interface to the
networking facilities.
Revised June 10, 1993
_________________________
* UNIX is a trademark of Bell Laboratories.
† DEC, VAX, DECnet, and UNIBUS are trademarks of
Digital Equipment Corporation.
SMM:18-2 Networking Implementation Notes
TABLE OF CONTENTS
1. Introduction
2. Overview
3. Goals
4. Internal address representation
5. Memory management
6. Internal layering
6.1. Socket layer
6.1.1. Socket state
6.1.2. Socket data queues
6.1.3. Socket connection queuing
6.2. Protocol layer(s)
6.3. Network-interface layer
6.3.1. UNIBUS interfaces
7. Socket/protocol interface
8. Protocol/protocol interface
8.1. pr_output
8.2. pr_input
8.3. pr_ctlinput
8.4. pr_ctloutput
9. Protocol/network-interface interface
9.1. Packet transmission
9.2. Packet reception
10. Gateways and routing issues
10.1. Routing tables
10.2. Routing table interface
10.3. User level routing policies
11. Raw sockets
11.1. Control blocks
11.2. Input processing
11.3. Output processing
12. Buffering and congestion control
12.1. Memory management
12.2. Protocol buffering policies
12.3. Queue limiting
12.4. Packet forwarding
13. Out of band data
14. Trailer protocols
Acknowledgements
References
1. Introduction
This report describes the internal structure of facili-
ties added to the 4.2BSD version of the UNIX operating sys-
tem for the VAX, as modified in the 4.4BSD release. The sys-
tem facilities provide a uniform user interface to network-
ing within UNIX. In addition, the implementation introduces
a structure for network communications which may be used by
system implementors in adding new networking facilities.
The internal structure is not visible to the user; rather, it
is intended to aid implementors of communication protocols
and network services by providing a framework which promotes
code sharing and minimizes implementation effort.
The reader is expected to be familiar with the C pro-
gramming language and system interface, as described in the
Berkeley Software Architecture Manual, 4.4BSD Edition
[Joy86]. Basic understanding of network communication con-
cepts is assumed; where required any additional ideas are
introduced.
The remainder of this document provides a description
of the system internals, avoiding, when possible, those por-
tions which are utilized only by the interprocess communica-
tion facilities.
2. Overview
If we consider the International Standards
Organization's (ISO) Open System Interconnection (OSI) model
of network communication [ISO81] [Zimmermann80], the net-
working facilities described here correspond to a portion of
the session layer (layer 5) and all of the transport and
network layers (layers 4 and 3, respectively).
The network layer provides possibly imperfect data
transport services with minimal addressing structure.
Addressing at this level is normally host to host, with
implicit or explicit routing optionally supported by the
communicating agents.
At the transport layer the notions of reliable
transfer, data sequencing, flow control, and service
addressing are normally included. Reliability is usually
managed by explicit acknowledgement of data delivered.
Failure to acknowledge a transfer results in retransmission
of the data. Sequencing may be handled by tagging each mes-
sage handed to the network layer by a sequence number and
maintaining state at the endpoints of communication to util-
ize received sequence numbers in reordering data which
arrives out of order.
The session layer facilities may provide forms of
addressing which are mapped into formats required by the
transport layer, service authentication and client authenti-
cation, etc. Various systems also provide services such as
data encryption and address and protocol translation.
The following sections begin by describing some of the
common data structures and utility routines, then examine
the internal layering. The contents of each layer and its
interface are considered. Certain of the interfaces are
protocol implementation specific. For these cases examples
have been drawn from the Internet [Cerf78] protocol family.
Later sections cover routing issues, the design of the raw
socket interface and other miscellaneous topics.
3. Goals
The networking system was designed with the goal of
supporting multiple protocol families and addressing styles.
This required information to be ``hidden'' in common data
structures which could be manipulated by all the pieces of
the system, but which required interpretation only by the
protocols which ``controlled'' it. The system described
here attempts to restrict the use of shared data structures
to those kept by a suite of protocols (a protocol family),
and those used for rendezvous between ``synchronous'' and
``asynchronous'' portions of the system (e.g. queues of data
packets are filled at interrupt time and emptied based on
user requests).
A major goal of the system was to provide a framework
within which new protocols and hardware could easily be
supported. To this end, a great deal of effort has been
expended to create utility routines which hide many of the
more complex and/or hardware dependent chores of networking.
Later sections describe the utility routines and the under-
lying data structures they manipulate.
4. Internal address representation
Common to all portions of the system are two data
structures. These structures are used to represent addresses
and various data objects. Internally, addresses are
described by the sockaddr structure,
struct sockaddr {
    short sa_family; /* data format identifier */
    char sa_data[14]; /* address */
};
All addresses belong to one or more address families which
define their format and interpretation. The sa_family field
indicates the address family to which the address belongs,
and the sa_data field contains the actual data value. The
size of the data field, 14 bytes, was selected based on a
study of current address formats.* Specific address formats
use private structure definitions that define the format of
the data field. The system interface supports larger address
structures, although address-family-independent support
facilities, for example routing and raw socket interfaces,
provide only 14 bytes for address storage. Protocols that do
not use those facilities (e.g., the current UNIX domain) may
use larger data areas.
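The relationship between the generic structure and a private, family-specific format can be sketched as follows. This is only an illustration: the address family XF_DEMO, its field layout, and every name beginning with x_ are assumptions made for the example and are not part of the system.

```c
#include <assert.h>
#include <string.h>

/* Generic address, laid out as in the text (renamed so that it
 * cannot clash with system headers). */
struct x_sockaddr {
    short sa_family;            /* data format identifier */
    char  sa_data[14];          /* address */
};

#define XF_DEMO 99              /* hypothetical address family */

/* Hypothetical private format for XF_DEMO; it must fit within
 * the 16 bytes of the generic structure. */
struct x_sockaddr_demo {
    short          sd_family;   /* always XF_DEMO */
    unsigned short sd_port;
    unsigned char  sd_host[4];
    char           sd_zero[8];  /* pad to the generic size */
};

/* Store a family-specific address in generic form. */
static void
x_demo_to_generic(const struct x_sockaddr_demo *sd, struct x_sockaddr *sa)
{
    assert(sizeof(*sd) <= sizeof(*sa));
    memset(sa, 0, sizeof(*sa));
    memcpy(sa, sd, sizeof(*sd));
}

/* Interpret a generic address; returns 0 if the family differs. */
static int
x_generic_to_demo(const struct x_sockaddr *sa, struct x_sockaddr_demo *sd)
{
    if (sa->sa_family != XF_DEMO)
        return 0;
    memcpy(sd, sa, sizeof(*sd));
    return 1;
}
```

Only code belonging to the address family needs to know the private layout; family-independent code can carry the 16-byte generic form and inspect nothing but sa_family.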
5. Memory management
A single mechanism is used for data storage: memory
buffers, or mbufs. An mbuf is a structure of the form:
struct mbuf {
    struct mbuf *m_next; /* next buffer in chain */
    u_long m_off; /* offset of data */
    short m_len; /* amount of data in this mbuf */
    short m_type; /* mbuf type (accounting) */
    u_char m_dat[MLEN]; /* data storage */
    struct mbuf *m_act; /* link in higher-level mbuf list */
};
The m_next field is used to chain mbufs together on linked
lists, while the m_act field allows lists of mbuf chains to
be accumulated. By convention, the mbufs common to a single
object (for example, a packet) are chained together with the
m_next field, while groups of objects are linked via the
m_act field (possibly when in a queue).
Each mbuf has a small data area for storing informa-
tion, m_dat. The m_len field indicates the amount of data,
while the m_off field is an offset to the beginning of the
data from the base of the mbuf. Thus, for example, the
macro mtod, which converts a pointer to an mbuf to a pointer
to the data stored in the mbuf, has the form
#define mtod(x,t) ((t)((int)(x) + (x)->m_off))
(note the t parameter, a C type cast, which is used to cast
the resultant pointer for proper assignment).
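A user-level sketch of the offset convention may make the macro easier to follow. The structure below imitates the mbuf layout given above; the x_ names, the helper x_mbuf_fill, and the use of uintptr_t in place of int (so that the arithmetic remains valid where pointers are wider than int) are assumptions of this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X_MLEN 112                  /* size of the internal data area */

struct x_mbuf {
    struct x_mbuf *m_next;          /* next buffer in chain */
    unsigned long  m_off;           /* offset of data from base of mbuf */
    short          m_len;           /* amount of data in this mbuf */
    short          m_type;          /* mbuf type (unused here) */
    unsigned char  m_dat[X_MLEN];   /* data storage */
    struct x_mbuf *m_act;           /* link in higher-level mbuf list */
};

/* Same idea as mtod: base address of the mbuf plus m_off. */
#define x_mtod(x, t) ((t)((uintptr_t)(x) + (x)->m_off))

/* Place len bytes in the mbuf, leaving lead bytes unused at the
 * front of m_dat (for example, room for a header added later). */
static void
x_mbuf_fill(struct x_mbuf *m, const void *src, short len, short lead)
{
    assert(lead >= 0 && len >= 0 && lead + len <= X_MLEN);
    memset(m, 0, sizeof(*m));
    m->m_off = offsetof(struct x_mbuf, m_dat) + (unsigned long)lead;
    m->m_len = len;
    memcpy(x_mtod(m, char *), src, (size_t)len);
}
```

Because the data pointer is always derived from m_off, data can be removed from the front of an mbuf simply by increasing m_off and decreasing m_len.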
In addition to storing data directly in the mbuf's data
area, data of page size may also be stored in a separate
area of memory. The mbuf utility routines maintain a pool of
pages for this purpose and manipulate a private page map for
such pages. An mbuf with an external data area may be recog-
nized by the larger offset to the data area; this is formal-
ized by the macro M_HASCL(m), which is true if the mbuf
whose address is m has an external page cluster. An array of
reference counts on pages is also maintained so that copies
of pages may be made without core to core copying (copies
are created simply by duplicating the reference to the data
and incrementing the associated reference counts for the
pages). Separate data pages are currently used only when
copying data from a user process into the kernel, and when
bringing data in at the hardware level. Routines which
manipulate mbufs are not normally aware whether data is
stored directly in the mbuf data array, or if it is kept in
separate pages.
_________________________
* Later versions of the system may support variable
length addresses.
The following may be used to allocate and free mbufs:
m = m_get(wait, type);
MGET(m, wait, type);
The subroutine m_get and the macro MGET each allocate
an mbuf, placing its address in m. The argument wait is
either M_WAIT or M_DONTWAIT according to whether allo-
cation should block or fail if no mbuf is available.
The type is one of the predefined mbuf types for use in
accounting of mbuf allocation.
MCLGET(m);
This macro attempts to allocate an mbuf page cluster to
associate with the mbuf m. If successful, the length of
the mbuf is set to CLSIZE, the size of the page clus-
ter.
n = m_free(m);
MFREE(m,n);
The routine m_free and the macro MFREE each free a sin-
gle mbuf, m, and any associated external storage area,
placing a pointer to its successor in the chain it
heads, if any, in n.
m_freem(m);
This routine frees an mbuf chain headed by m.
The following utility routines are available for mani-
pulating mbuf chains:
m = m_copy(m0, off, len);
The m_copy routine creates a copy of all, or part, of a
list of the mbufs in m0. Len bytes of data, starting
off bytes from the front of the chain, are copied.
Where possible, reference counts on pages are used
instead of core to core copies. The original mbuf
chain must have at least off + len bytes of data. If
len is specified as M_COPYALL, all the data present,
offset as before, is copied.
m_cat(m, n);
The mbuf chain, n, is appended to the end of m. Where
possible, compaction is performed.
m_adj(m, diff);
The mbuf chain, m, is adjusted in size by diff bytes.
If diff is non-negative, diff bytes are shaved off the
front of the mbuf chain. If diff is negative, the
alteration is performed from back to front. No space is
reclaimed in this operation; alterations are accom-
plished by changing the m_len and m_off fields of
mbufs.
m = m_pullup(m0, size);
After a successful call to m_pullup, the mbuf at the
head of the returned list, m, is guaranteed to have at
least size bytes of data in contiguous memory within
the data area of the mbuf (allowing access via a
pointer, obtained using the mtod macro, and allowing
the mbuf to be located from a pointer to the data area
using dtom, defined below). If the original data was
less than size bytes long, size was greater than the
size of an mbuf data area (112 bytes), or required
resources were unavailable, m is 0 and the original
mbuf chain is deallocated.
This routine is particularly useful when verifying
packet header lengths on reception. For example, if a
packet is received and only 8 of the necessary 16 bytes
required for a valid packet header are present at the
head of the list of mbufs representing the packet, the
remaining 8 bytes may be ``pulled up'' with a single
m_pullup call. If the call fails the invalid packet
will have been discarded.
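The behaviour described for m_adj can be sketched at user level as follows. This is an illustrative model only: the reduced x_mbuf structure and the functions x_chain_len and x_m_adj are assumptions of the example, and the trailing trim is computed by walking the chain once from the front rather than by imitating the kernel routine's traversal.

```c
#include <assert.h>
#include <stddef.h>

struct x_mbuf {
    struct x_mbuf *m_next;      /* next buffer in chain */
    unsigned long  m_off;       /* offset of data */
    short          m_len;       /* amount of data in this mbuf */
};

/* Total bytes of data in a chain. */
static int
x_chain_len(const struct x_mbuf *m)
{
    int n = 0;

    for (; m != NULL; m = m->m_next) {
        assert(m->m_len >= 0);
        n += m->m_len;
    }
    return n;
}

/* Trim diff bytes from the front if diff >= 0, else -diff bytes
 * from the back. Only m_len and m_off change; nothing is freed. */
static void
x_m_adj(struct x_mbuf *m, int diff)
{
    if (diff >= 0) {
        for (; m != NULL && diff > 0; m = m->m_next) {
            int n = diff < m->m_len ? diff : m->m_len;

            m->m_off += (unsigned long)n;
            m->m_len -= (short)n;
            diff -= n;
        }
    } else {
        int keep = x_chain_len(m) + diff;   /* bytes that survive */

        if (keep < 0)
            keep = 0;
        for (; m != NULL; m = m->m_next) {
            if (m->m_len > keep)
                m->m_len = (short)keep;
            keep -= m->m_len;
        }
    }
}
```

A protocol might use the positive form to strip a header it has finished examining, and the negative form to drop trailing padding or a checksum field.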
By ensuring that mbufs always reside on 128 byte boundaries,
it is always possible to locate the mbuf associated
with a data area by masking off the low bits of the virtual
address. This allows modules to store data structures in
mbufs and pass them around without concern for locating the
original mbuf when it comes time to free the structure. Note
that this works only with objects stored in the internal
data buffer of the mbuf. The dtom macro is used to convert a
pointer into an mbuf's data area to a pointer to the mbuf,
#define dtom(x) ((struct mbuf *)((int)x & ~(MSIZE-1)))
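The masking technique can be tried outside the kernel, provided the allocator returns suitably aligned storage. The sketch below assumes a C11 library (for aligned_alloc) and a structure padded to exactly X_MSIZE bytes; all x_ names are hypothetical, and uintptr_t replaces the int cast of the original macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define X_MSIZE 128     /* size of an mbuf; must be a power of two */
#define X_MLEN  (X_MSIZE - 2 * sizeof(void *) \
                 - sizeof(unsigned long) - 2 * sizeof(short))

struct x_mbuf {
    struct x_mbuf *m_next;
    unsigned long  m_off;
    short          m_len;
    short          m_type;
    unsigned char  m_dat[X_MLEN];
    struct x_mbuf *m_act;
};

/* Mask off the low bits of an address inside the mbuf. */
#define x_dtom(x) \
    ((struct x_mbuf *)((uintptr_t)(x) & ~(uintptr_t)(X_MSIZE - 1)))

/* Allocate one mbuf on an X_MSIZE boundary; NULL on failure. */
static struct x_mbuf *
x_mbuf_alloc(void)
{
    struct x_mbuf *m;

    assert(sizeof(struct x_mbuf) == X_MSIZE);
    m = aligned_alloc(X_MSIZE, X_MSIZE);
    if (m != NULL) {
        m->m_next = m->m_act = NULL;
        m->m_off = offsetof(struct x_mbuf, m_dat);
        m->m_len = 0;
        m->m_type = 0;
    }
    return m;
}
```

As the text notes, the mask recovers the mbuf only for pointers into its internal data area; a pointer into a separate page cluster would yield an unrelated address.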
Mbufs are used for dynamically allocated data struc-
tures such as sockets as well as memory allocated for pack-
ets and headers. Statistics are maintained on mbuf usage
and can be viewed by users using the netstat(1) program.
6. Internal layering
The internal structure of the network system is divided
into three layers. These layers correspond to the services
provided by the socket abstraction, those provided by the
communication protocols, and those provided by the hardware
interfaces. The communication protocols are normally lay-
ered into two or more individual cooperating layers, though
they are collectively viewed in the system as one layer pro-
viding services supportive of the appropriate socket
abstraction.
The following sections describe the properties of each
layer in the system and the interfaces to which each must
conform.
6.1. Socket layer
The socket layer deals with the interprocess communica-
tion facilities provided by the system. A socket is a
bidirectional endpoint of communication which is ``typed''
by the semantics of communication it supports. The system
calls described in the Berkeley Software Architecture Manual
[Joy86] are used to manipulate sockets.
A socket consists of the following data structure:
struct socket {
    short so_type; /* generic type */
    short so_options; /* from socket call */
    short so_linger; /* time to linger while closing */
    short so_state; /* internal state flags */
    caddr_t so_pcb; /* protocol control block */
    struct protosw *so_proto; /* protocol handle */
    struct socket *so_head; /* back pointer to accept socket */
    struct socket *so_q0; /* queue of partial connections */
    short so_q0len; /* partials on so_q0 */
    struct socket *so_q; /* queue of incoming connections */
    short so_qlen; /* number of connections on so_q */
    short so_qlimit; /* max number queued connections */
    struct sockbuf so_rcv; /* receive queue */
    struct sockbuf so_snd; /* send queue */
    short so_timeo; /* connection timeout */
    u_short so_error; /* error affecting connection */
    u_short so_oobmark; /* chars to oob mark */
    short so_pgrp; /* pgrp for signals */
};
Each socket contains two data queues, so_rcv and
so_snd, and a pointer to routines which provide supporting
services. The type of the socket, so_type, is defined at
socket creation time and used in selecting those services
which are appropriate to support it. The supporting
protocol is selected at socket creation time and recorded in
the socket data structure for later use. Protocols are
defined by a table of procedures, the protosw structure,
which will be described in detail later. A pointer to a
protocol-specific data structure, the ``protocol control
block,'' is also present in the socket structure. Protocols
control this data structure, which normally includes a back
pointer to the parent socket structure to allow easy lookup
when returning information to a user (for example, placing
an error number in the so_error field). The other entries
in the socket structure are used in queuing connection
requests, validating user requests, storing socket charac-
teristics (e.g. options supplied at the time a socket is
created), and maintaining a socket's state.
Processes ``rendezvous at a socket'' in many instances.
For instance, when a process wishes to extract data from a
socket's receive queue and it is empty, or lacks sufficient
data to satisfy the request, the process blocks, supplying
the address of the receive queue as a ``wait channel'' to be
used in notification. When data arrives for the process and
is placed in the socket's queue, the blocked process is
identified by the fact it is waiting ``on the queue.''
6.1.1. Socket state
A socket's state is defined from the following:
#define SS_NOFDREF 0x001 /* no file table ref any more */
#define SS_ISCONNECTED 0x002 /* socket connected to a peer */
#define SS_ISCONNECTING 0x004 /* in process of connecting to peer */
#define SS_ISDISCONNECTING 0x008 /* in process of disconnecting */
#define SS_CANTSENDMORE 0x010 /* can't send more data to peer */
#define SS_CANTRCVMORE 0x020 /* can't receive more data from peer */
#define SS_RCVATMARK 0x040 /* at mark on input */
#define SS_PRIV 0x080 /* privileged */
#define SS_NBIO 0x100 /* non-blocking ops */
#define SS_ASYNC 0x200 /* async i/o notify */
The state of a socket is manipulated both by the proto-
cols and the user (through system calls). When a socket is
created, the state is defined based on the type of socket.
It may change as control actions are performed, for example
connection establishment. It may also change according to
the type of input/output the user wishes to perform, as
indicated by options set with fcntl. ``Non-blocking'' I/O
implies that a process should never be blocked to await
resources. Instead, any call which would block returns
prematurely with the error EWOULDBLOCK, or the service
request may be partially fulfilled, e.g. a request for more
data than is present.
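The effect of the state bits on a receive request can be illustrated with a small decision function. This is a simplified sketch, not the logic of the socket routines themselves: the enumeration, the function x_rcv_decide, and the order of the tests are assumptions of the example; only the flag values are taken from the list above.

```c
#include <assert.h>
#include <errno.h>

#define SS_ISCONNECTED  0x002   /* socket connected to a peer */
#define SS_CANTRCVMORE  0x020   /* can't receive more data from peer */
#define SS_NBIO         0x100   /* non-blocking ops */

enum x_rcv_action { X_RCV_COPY, X_RCV_EOF, X_RCV_SLEEP, X_RCV_FAIL };

/* Decide what a receive request should do given the socket state
 * and the number of characters queued. On X_RCV_FAIL, *errp holds
 * the error to return to the caller. */
static enum x_rcv_action
x_rcv_decide(int so_state, unsigned queued, int *errp)
{
    *errp = 0;
    if (queued > 0)
        return X_RCV_COPY;      /* request may be partially fulfilled */
    if (so_state & SS_CANTRCVMORE)
        return X_RCV_EOF;       /* nothing more will ever arrive */
    if (so_state & SS_NBIO) {
        *errp = EWOULDBLOCK;    /* never block a non-blocking socket */
        return X_RCV_FAIL;
    }
    return X_RCV_SLEEP;         /* block until data arrives */
}
```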
If a process requested ``asynchronous'' notification of
events related to the socket, the SIGIO signal is posted to
the process when such events occur. An event is a change in
the socket's state; examples of such occurrences are: space
becoming available in the send queue, new data available in
the receive queue, connection establishment or disestablish-
ment, etc.
A socket may be marked ``privileged'' if it was created
by the super-user. Only privileged sockets may bind
addresses in privileged portions of an address space or use
``raw'' sockets to access lower levels of the network.
6.1.2. Socket data queues
A socket's data queue contains a pointer to the data
stored in the queue and other entries related to the manage-
ment of the data. The following structure defines a data
queue:
struct sockbuf {
    u_short sb_cc; /* actual chars in buffer */
    u_short sb_hiwat; /* max actual char count */
    u_short sb_mbcnt; /* chars of mbufs used */
    u_short sb_mbmax; /* max chars of mbufs to use */
    u_short sb_lowat; /* low water mark */
    short sb_timeo; /* timeout */
    struct mbuf *sb_mb; /* the mbuf chain */
    struct proc *sb_sel; /* process selecting read/write */
    short sb_flags; /* flags, see below */
};
Data is stored in a queue as a chain of mbufs. The
actual count of data characters as well as high and low
water marks are used by the protocols in controlling the
flow of data. The amount of buffer space (characters of
mbufs and associated data pages) is also recorded along with
the limit on buffer allocation. The socket routines
cooperate in implementing the flow control policy by block-
ing a process when it requests to send data and the high
water mark has been reached, or when it requests to receive
data and less than the low water mark is present (assuming
non-blocking I/O has not been specified).*
When a socket is created, the supporting protocol
``reserves'' space for the send and receive queues of the
socket. The limit on buffer allocation is set somewhat
higher than the limit on data characters to account for the
granularity of buffer allocation. The actual storage
associated with a socket queue may fluctuate during a
socket's lifetime, but it is assumed that this reservation
will always allow a protocol to acquire enough memory to
satisfy the high water marks.
_________________________
* The low-water mark is always presumed to be 0 in the
current implementation.
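The two limits kept in a data queue suggest a simple computation of the space still available. The sketch below assumes that the usable space is the smaller of the remaining data-character allowance and the remaining buffer allowance; the structure is a reduced copy of sockbuf, and the names x_sbspace and x_send_must_wait are hypothetical.

```c
#include <assert.h>

struct x_sockbuf {
    unsigned short sb_cc;       /* actual chars in buffer */
    unsigned short sb_hiwat;    /* max actual char count */
    unsigned short sb_mbcnt;    /* chars of mbufs used */
    unsigned short sb_mbmax;    /* max chars of mbufs to use */
};

/* Space left in the queue, bounded by both the high water mark
 * and the limit on buffer allocation. The result may be negative
 * if a limit has already been exceeded. */
static int
x_sbspace(const struct x_sockbuf *sb)
{
    int by_data = (int)sb->sb_hiwat - (int)sb->sb_cc;
    int by_mbuf = (int)sb->sb_mbmax - (int)sb->sb_mbcnt;

    return by_data < by_mbuf ? by_data : by_mbuf;
}

/* A blocking sender waits when no space remains. */
static int
x_send_must_wait(const struct x_sockbuf *sb)
{
    return x_sbspace(sb) <= 0;
}
```

Tracking sb_mbcnt separately matters because many small messages can consume far more buffer storage than their character count suggests.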
The timeout and select values are manipulated by the
socket routines in implementing various portions of the
interprocess communications facilities and will not be
described here.
Data queued at a socket is stored in one of two styles.
Stream-oriented sockets queue data with no addresses,
headers or record boundaries. The data are in mbufs linked
through the m_next field. Buffers containing access rights
may be present within the chain if the underlying protocol
supports passage of access rights. Record-oriented sockets,
including datagram sockets, queue data as a list of packets;
the sections of packets are distinguished by the types of
the mbufs containing them. The mbufs which comprise a record
are linked through the m_next field; records are linked from
the m_act field of the first mbuf of one packet to the first
mbuf of the next. Each packet begins with an mbuf containing
the ``from'' address if the protocol provides it, then any
buffers containing access rights, and finally any buffers
containing data. If a record contains no data, no data
buffers are required unless neither address nor access
rights are present.
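The two-level linkage of a record-oriented queue can be traversed as sketched below: the outer loop follows m_act from record to record, and the inner loop follows m_next within a record. The reduced structure and the function x_count_records are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

struct x_mbuf {
    struct x_mbuf *m_next;      /* next buffer in the same record */
    struct x_mbuf *m_act;       /* first buffer of the next record */
    short          m_len;       /* amount of data in this mbuf */
};

/* Count the records on a queue and report the byte count of the
 * largest record in *maxp. */
static int
x_count_records(const struct x_mbuf *q, int *maxp)
{
    int nrec = 0;

    assert(maxp != NULL);
    *maxp = 0;
    for (; q != NULL; q = q->m_act) {
        const struct x_mbuf *m;
        int bytes = 0;

        for (m = q; m != NULL; m = m->m_next)
            bytes += m->m_len;
        if (bytes > *maxp)
            *maxp = bytes;
        nrec++;
    }
    return nrec;
}
```

Only the first mbuf of each record carries a meaningful m_act link; the count above includes address and access-rights buffers along with data, since the sketch does not examine mbuf types.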
A socket queue has a number of flags used in synchron-
izing access to the data and in acquiring resources:
#define SB_LOCK 0x01 /* lock on data queue (so_rcv only) */
#define SB_WANT 0x02 /* someone is waiting to lock */
#define SB_WAIT 0x04 /* someone is waiting for data/space */
#define SB_SEL 0x08 /* buffer is selected */
#define SB_COLL 0x10 /* collision selecting */
The last two flags are manipulated by the system in imple-
menting the select mechanism.
6.1.3. Socket connection queuing
In dealing with connection oriented sockets (e.g.
SOCK_STREAM) the two ends are considered distinct. One end
is termed active, and generates connection requests. The
other end is called passive and accepts connection requests.
From the passive side, a socket is marked with
SO_ACCEPTCONN when a listen call is made, creating two
queues of sockets: so_q0 for connections in progress and
so_q for connections already made and awaiting user accep-
tance. As a protocol is preparing incoming connections, it
creates a socket structure queued on so_q0 by calling the
routine sonewconn(). When the connection is established,
the socket structure is then transferred to so_q, making it
available for an accept.
If an SO_ACCEPTCONN socket is closed with sockets on
either so_q0 or so_q, these sockets are dropped, with notif-
ication to the peers as appropriate.
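The movement of a socket from so_q0 to so_q can be modelled with two singly linked lists threaded through the queued sockets. The following is a sketch under stated assumptions: the functions x_newconn and x_connected are hypothetical stand-ins for the protocol-side operations, and the queue-full test (total queued sockets compared with so_qlimit) is chosen for simplicity rather than taken from the system.

```c
#include <assert.h>
#include <stddef.h>

struct x_socket {
    struct x_socket *so_head;   /* back pointer to accept socket */
    struct x_socket *so_q0;     /* queue of partial connections */
    struct x_socket *so_q;      /* queue of incoming connections */
    short so_q0len, so_qlen, so_qlimit;
};

/* Append so to the tail of one of head's queues. */
static void
x_qinsert(struct x_socket *head, struct x_socket *so, int partial)
{
    struct x_socket **pp = partial ? &head->so_q0 : &head->so_q;

    while (*pp != NULL)
        pp = partial ? &(*pp)->so_q0 : &(*pp)->so_q;
    *pp = so;
    so->so_head = head;
    if (partial)
        head->so_q0len++;
    else
        head->so_qlen++;
}

/* Unlink so from the partial-connection queue; 0 if not found. */
static int
x_q0remove(struct x_socket *head, struct x_socket *so)
{
    struct x_socket **pp;

    for (pp = &head->so_q0; *pp != NULL; pp = &(*pp)->so_q0)
        if (*pp == so) {
            *pp = so->so_q0;
            so->so_q0 = NULL;
            so->so_head = NULL;
            head->so_q0len--;
            return 1;
        }
    return 0;
}

/* A connection request arrived; 0 means the queues are full. */
static int
x_newconn(struct x_socket *head, struct x_socket *so)
{
    if (head->so_qlen + head->so_q0len >= head->so_qlimit)
        return 0;
    so->so_q0 = so->so_q = NULL;
    x_qinsert(head, so, 1);
    return 1;
}

/* The connection is established; make it available to accept. */
static int
x_connected(struct x_socket *so)
{
    struct x_socket *head = so->so_head;

    if (head == NULL || !x_q0remove(head, so))
        return 0;
    x_qinsert(head, so, 0);
    return 1;
}
```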
6.2. Protocol layer(s)
Each socket is created in a communications domain,
which usually implies both an addressing structure (address
family) and a set of protocols which implement various
socket types within the domain (protocol family). Each
domain is defined by the following structure:
struct domain {
    int dom_family; /* PF_xxx */
    char *dom_name;
    int (*dom_init)(); /* initialize domain data structures */
    int (*dom_externalize)(); /* externalize access rights */
    int (*dom_dispose)(); /* dispose of internalized rights */
    struct protosw *dom_protosw, *dom_protoswNPROTOSW;
    struct domain *dom_next;
};
At boot time, each domain configured into the kernel is
added to a linked list of domains. The initialization procedure
of each domain is then called. After that time, the
domain structure is used to locate protocols within the pro-
tocol family. It may also contain procedure references for
externalization of access rights at the receiving socket and
the disposal of access rights that are not received.
Protocols are described by a set of entry points and
certain socket-visible characteristics, some of which are
used in deciding which socket type(s) they may support.
An entry in the ``protocol switch'' table exists for
each protocol module configured into the system. It has the
following form:
struct protosw {
    short pr_type; /* socket type used for */
    struct domain *pr_domain; /* domain protocol a member of */
    short pr_protocol; /* protocol number */
    short pr_flags; /* socket visible attributes */
    /* protocol-protocol hooks */
    int (*pr_input)(); /* input to protocol (from below) */
    int (*pr_output)(); /* output to protocol (from above) */
    int (*pr_ctlinput)(); /* control input (from below) */
    int (*pr_ctloutput)(); /* control output (from above) */
    /* user-protocol hook */
    int (*pr_usrreq)(); /* user request */
    /* utility hooks */
    int (*pr_init)(); /* initialization routine */
    int (*pr_fasttimo)(); /* fast timeout (200ms) */
    int (*pr_slowtimo)(); /* slow timeout (500ms) */
    int (*pr_drain)(); /* flush any excess space possible */
};
A protocol is called through the pr_init entry before
any other. Thereafter it is called every 200 milliseconds
through the pr_fasttimo entry and every 500 milliseconds
through the pr_slowtimo entry for timer based actions. The system
will call the pr_drain entry if it is low on space; the protocol
should then discard any non-critical data.
Protocols pass data between themselves as chains of
mbufs using the pr_input and pr_output routines. Pr_input
passes data up (towards the user) and pr_output passes it
down (towards the network); control information passes up
and down on pr_ctlinput and pr_ctloutput. The protocol is
responsible for the space occupied by any of the arguments
to these entries and must either pass it onward or dispose
of it. (On output, the lowest level reached must free
buffers storing the arguments; on input, the highest level
is responsible for freeing buffers.)
The pr_usrreq routine interfaces protocols to the
socket code and is described below.
The pr_flags field is constructed from the following
values:
#define PR_ATOMIC 0x01 /* exchange atomic messages only */
#define PR_ADDR 0x02 /* addresses given with messages */
#define PR_CONNREQUIRED 0x04 /* connection required by protocol */
#define PR_WANTRCVD 0x08 /* want PRU_RCVD calls */
#define PR_RIGHTS 0x10 /* passes capabilities */
Protocols which are connection-based specify the
PR_CONNREQUIRED flag so that the socket routines will never
attempt to send data before a connection has been
established. If the PR_WANTRCVD flag is set, the socket
routines will notify the protocol when the user has removed
data from the socket's receive queue. This allows the pro-
tocol to implement acknowledgement on user receipt, and also
update windowing information based on the amount of space
available in the receive queue. The PR_ADDR flag indicates
that any data placed in the socket's receive queue will be
preceded by the address of the sender. The PR_ATOMIC flag
specifies that each user request to send data must be per-
formed in a single protocol send request; it is the
protocol's responsibility to maintain record boundaries on
data to be sent. The PR_RIGHTS flag indicates that the pro-
tocol supports the passing of capabilities; this is
currently used only by the protocols in the UNIX protocol
family.
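The way the socket routines might consult pr_flags before passing a send request to a protocol can be sketched as follows. The flag values come from the list above; the function x_send_check, its parameters, and the particular error numbers returned are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

#define PR_ATOMIC        0x01   /* exchange atomic messages only */
#define PR_ADDR          0x02   /* addresses given with messages */
#define PR_CONNREQUIRED  0x04   /* connection required by protocol */
#define SS_ISCONNECTED   0x002  /* socket connected to a peer */

/* Validate a send request against the protocol's attributes.
 * Returns 0 if the request may proceed, otherwise an errno value. */
static int
x_send_check(int pr_flags, int so_state, int have_addr,
    unsigned len, unsigned hiwat)
{
    if (pr_flags & PR_CONNREQUIRED) {
        /* Never send before a connection has been established. */
        if (!(so_state & SS_ISCONNECTED))
            return ENOTCONN;
    } else if (!(so_state & SS_ISCONNECTED) && !have_addr) {
        /* Unconnected and no destination supplied with the data. */
        return EDESTADDRREQ;
    }
    /* An atomic message must fit in a single send request. */
    if ((pr_flags & PR_ATOMIC) && len > hiwat)
        return EMSGSIZE;
    return 0;
}
```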
When a socket is created, the socket routines scan the
protocol table for the domain looking for an appropriate
protocol to support the type of socket being created. The
pr_type field contains one of the possible socket types
(e.g. SOCK_STREAM), while the pr_domain is a back pointer to
the domain structure. The pr_protocol field contains the
protocol number of the protocol, normally a well-known
value.
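The scan of the protocol table at socket creation can be sketched as a search bounded by the dom_protosw and dom_protoswNPROTOSW pointers of the domain. The reduced structures and the function x_find_proto are assumptions of this example; in particular, treating a protocol argument of zero as ``any protocol of this type'' is a convention adopted here for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct x_domain;

struct x_protosw {
    short pr_type;                  /* socket type used for */
    struct x_domain *pr_domain;     /* domain protocol a member of */
    short pr_protocol;              /* protocol number */
};

struct x_domain {
    int dom_family;                 /* PF_xxx */
    struct x_protosw *dom_protosw, *dom_protoswNPROTOSW;
    struct x_domain *dom_next;
};

/* Find the first protocol in the given family that supports the
 * requested socket type. A non-zero protocol must match as well. */
static struct x_protosw *
x_find_proto(struct x_domain *domains, int family, int type, int protocol)
{
    struct x_domain *dp;
    struct x_protosw *pr;

    for (dp = domains; dp != NULL; dp = dp->dom_next)
        if (dp->dom_family == family)
            break;
    if (dp == NULL)
        return NULL;
    for (pr = dp->dom_protosw; pr < dp->dom_protoswNPROTOSW; pr++) {
        if (pr->pr_type != type)
            continue;
        if (protocol == 0 || pr->pr_protocol == protocol)
            return pr;
    }
    return NULL;
}
```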
6.3. Network-interface layer
Each network-interface configured into a system defines
a path through which packets may be sent and received. Nor-
mally a hardware device is associated with this interface,
though there is no requirement for this (for example, all
systems have a software ``loopback'' interface used for
debugging and performance analysis). In addition to manipu-
lating the hardware device, an interface module is responsi-
ble for encapsulation and decapsulation of any link-layer
header information required to deliver a message to its des-
tination. The selection of which interface to use in
delivering packets is a routing decision carried out at a
higher level than the network-interface layer. An interface
may have addresses in one or more address families. The
address is set at boot time using an ioctl on a socket in
the appropriate domain; this operation is implemented by the
protocol family, after verifying the operation through the
device ioctl entry.
An interface is defined by the following structure,
struct ifnet {
    char *if_name; /* name, e.g. ``en'' or ``lo'' */
    short if_unit; /* sub-unit for lower level driver */
    short if_mtu; /* maximum transmission unit */
    short if_flags; /* up/down, broadcast, etc. */
    short if_timer; /* time 'til if_watchdog called */
    struct ifaddr *if_addrlist; /* list of addresses of interface */
    struct ifqueue if_snd; /* output queue */
    int (*if_init)(); /* init routine */
    int (*if_output)(); /* output routine */
    int (*if_ioctl)(); /* ioctl routine */
    int (*if_reset)(); /* bus reset routine */
    int (*if_watchdog)(); /* timer routine */
    int if_ipackets; /* packets received on interface */
    int if_ierrors; /* input errors on interface */
    int if_opackets; /* packets sent on interface */
    int if_oerrors; /* output errors on interface */
    int if_collisions; /* collisions on csma interfaces */
    struct ifnet *if_next;
};
Each interface address has the following form:
struct ifaddr {
    struct sockaddr ifa_addr; /* address of interface */
    union {
        struct sockaddr ifu_broadaddr;
        struct sockaddr ifu_dstaddr;
    } ifa_ifu;
    struct ifnet *ifa_ifp; /* back-pointer to interface */
    struct ifaddr *ifa_next; /* next address for interface */
};
#define ifa_broadaddr ifa_ifu.ifu_broadaddr /* broadcast address */
#define ifa_dstaddr ifa_ifu.ifu_dstaddr /* other end of p-to-p link */
The protocol generally maintains this structure as part of a
larger structure containing additional information concern-
ing the address.
Each interface has a send queue and routines used for
initialization, if_init, and output, if_output. If the
interface resides on a system bus, the routine if_reset will
be called after a bus reset has been performed. An interface
may also specify a timer routine, if_watchdog; if if_timer
is non-zero, it is decremented once per second until it
reaches zero, at which time the watchdog routine is called.
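The watchdog mechanism amounts to a once-per-second pass over the list of interfaces. The sketch below shows one way such a pass could be written; the reduced x_ifnet structure, the function x_if_slowtimo, the choice of the unit number as the watchdog argument, and the demonstration watchdog are all assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>

struct x_ifnet {
    short if_unit;                  /* sub-unit for lower level driver */
    short if_timer;                 /* time 'til if_watchdog called */
    int (*if_watchdog)(int unit);   /* timer routine */
    struct x_ifnet *if_next;
};

/* Called once per second: age each running timer and call the
 * watchdog of any interface whose timer has just reached zero.
 * A timer that is already zero is idle and is left alone. */
static void
x_if_slowtimo(struct x_ifnet *list)
{
    struct x_ifnet *ifp;

    for (ifp = list; ifp != NULL; ifp = ifp->if_next) {
        if (ifp->if_timer == 0 || --ifp->if_timer != 0)
            continue;
        if (ifp->if_watchdog != NULL)
            (*ifp->if_watchdog)(ifp->if_unit);
    }
}

/* Hypothetical driver watchdog that counts its invocations. */
static int x_watchdog_calls;

static int
x_demo_watchdog(int unit)
{
    (void)unit;
    return ++x_watchdog_calls;
}
```

A driver would typically set if_timer when it starts an operation that might hang and clear it on completion, so that the watchdog runs only if the device fails to respond in time.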
The state of an interface and certain characteristics
are stored in the if_flags field. The following values are
possible:
#define IFF_UP 0x1 /* interface is up */
#define IFF_BROADCAST 0x2 /* broadcast is possible */
#define IFF_DEBUG 0x4 /* turn on debugging */
#define IFF_LOOPBACK 0x8 /* is a loopback net */
#define IFF_POINTOPOINT 0x10 /* interface is point-to-point link */
#define IFF_NOTRAILERS 0x20 /* avoid use of trailers */
#define IFF_RUNNING 0x40 /* resources allocated */
#define IFF_NOARP 0x80 /* no address resolution protocol */
If the interface is connected to a network which supports
transmission of broadcast packets, the IFF_BROADCAST flag
will be set and the ifa_broadaddr field will contain the
address to be used in sending or accepting a broadcast
packet. If the interface is associated with a point-to-
point hardware link (for example, a DEC DMR-11), the
IFF_POINTOPOINT flag will be set and ifa_dstaddr will con-
tain the address of the host on the other side of the con-
nection. These addresses and the local address of the
interface, ifa_addr, are used in filtering incoming packets.
The interface sets IFF_RUNNING after it has allocated system
resources and posted an initial read on the device it
manages. This state bit is used to avoid multiple alloca-
tion requests when an interface's address is changed. The
IFF_NOTRAILERS flag indicates the interface should refrain
from using a trailer encapsulation on outgoing packets, or
(where per-host negotiation of trailers is possible) that
trailer encapsulations should not be requested; trailer pro-
tocols are described in section 14. The IFF_NOARP flag
indicates the interface should not use an ``address resolu-
tion protocol'' in mapping internetwork addresses to local
network addresses.
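As an illustration of how the flags and addresses combine when filtering input, the following sketch accepts a packet if it is addressed to the interface's local address, or to the broadcast address when IFF_BROADCAST is set. The structure toy_ifaddr and the routine toy_accept are hypothetical, addresses are modelled as plain integers, and the point-to-point case is not modelled; actual drivers and protocols apply their own, more detailed tests.

```c
#include <assert.h>
#include <stddef.h>

#define IFF_BROADCAST   0x2     /* broadcast is possible */

/* Hypothetical, flattened view of one interface address. */
struct toy_ifaddr {
    short         flags;        /* copy of if_flags */
    unsigned long addr;         /* local address */
    unsigned long broadaddr;    /* meaningful only with IFF_BROADCAST */
};

/* Return 1 if a packet addressed to dst should be accepted. */
static int
toy_accept(const struct toy_ifaddr *ia, unsigned long dst)
{
    assert(ia != NULL);
    if (dst == ia->addr)
        return 1;               /* addressed to this host */
    if ((ia->flags & IFF_BROADCAST) && dst == ia->broadaddr)
        return 1;               /* broadcast on a broadcast-capable net */
    return 0;
}
```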
Various statistics are also stored in the interface
structure. These may be viewed by users using the
netstat(1) program.
The interface address and flags may be set with the
SIOCSIFADDR and SIOCSIFFLAGS ioctls. SIOCSIFADDR is used
initially to define each interface's address; SIOCSIFFLAGS
can be used to mark an interface down and perform site-
specific configuration. The destination address of a point-
to-point link is set with SIOCSIFDSTADDR. Corresponding
operations exist to read each value. Protocol families may
also support operations to set and read the broadcast
address. In addition, the SIOCGIFCONF ioctl retrieves a list
of interface names and addresses for all interfaces and pro-
tocols on the host.
6.3.1. UNIBUS interfaces
All hardware related interfaces currently reside on the
UNIBUS. Consequently a common set of utility routines for
dealing with the UNIBUS has been developed. Each UNIBUS
interface utilizes a structure of the following form:
struct ifubinfo {
    short iff_uban; /* uba number */
    short iff_hlen; /* local net header length */
    struct uba_regs *iff_uba; /* uba regs, in vm */
    short iff_flags; /* used during uballoc's */
};
Additional structures are associated with each receive and
transmit buffer, normally one each per interface; for read,
struct ifrw {
    caddr_t ifrw_addr; /* virt addr of header */
    short ifrw_bdp; /* unibus bdp */
    short ifrw_flags; /* type, etc. */
#define IFRW_W 0x01 /* is a transmit buffer */
    int ifrw_info; /* value from ubaalloc */
    int ifrw_proto; /* map register prototype */
    struct pte *ifrw_mr; /* base of map registers */
};
and for write,
struct ifxmt {
    struct ifrw ifrw;
    caddr_t ifw_base; /* virt addr of buffer */
    struct pte ifw_wmap[IF_MAXNUBAMR]; /* base pages for output */
    struct mbuf *ifw_xtofree; /* pages being dma'd out */
    short ifw_xswapd; /* mask of clusters swapped */
    short ifw_nmr; /* number of entries in wmap */
};
#define ifw_addr ifrw.ifrw_addr
#define ifw_bdp ifrw.ifrw_bdp
#define ifw_flags ifrw.ifrw_flags
#define ifw_info ifrw.ifrw_info
#define ifw_proto ifrw.ifrw_proto
#define ifw_mr ifrw.ifrw_mr
One of each of these structures is conveniently packaged for
interfaces with single buffers for each direction, as fol-
lows:
struct ifuba {
    struct ifubinfo ifu_info;
    struct ifrw ifu_r;
    struct ifxmt ifu_xmt;
};
#define ifu_uban ifu_info.iff_uban
#define ifu_hlen ifu_info.iff_hlen
#define ifu_uba ifu_info.iff_uba
#define ifu_flags ifu_info.iff_flags
#define ifu_w ifu_xmt.ifrw
#define ifu_xtofree ifu_xmt.ifw_xtofree
The ifubinfo structure contains the general information
needed to characterize the I/O-mapped buffers for the
device. In addition, there is a structure describing each
buffer, including UNIBUS resources held by the interface.
Sufficient memory pages and bus map registers are allocated
to each buffer upon initialization according to the maximum
packet size and header length. The kernel virtual address of
the buffer is held in ifrw_addr, and the map registers begin
at ifrw_mr. UNIBUS map register ifrw_mr[-1] maps the local
network header ending on a page boundary. UNIBUS data paths
are reserved for read and for write, given by ifrw_bdp. The
prototype of the map registers for read and for write is
saved in ifrw_proto.
When write transfers are not at least half-full pages
on page boundaries, the data are just copied into the pages
mapped on the UNIBUS and the transfer is started. If a write
transfer is at least half a page long and on a page boun-
dary, UNIBUS page table entries are swapped to reference the
pages, and then the initial pages are remapped from ifw_wmap
when the transfer completes. The mbufs containing the mapped
pages are placed on the ifw_xtofree queue to be freed after
transmission.
When read transfers give at least half a page of data
to be input, page frames are allocated from a network page
list and traded with the pages already containing the data,
mapping the allocated pages to replace the input pages for
the next UNIBUS data input.
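The choice between copying and remapping on output reduces to a test on alignment and length. The sketch below states that test under explicit assumptions: TOY_PAGE (512 bytes, one VAX page) is an assumed unit, and toy_write_policy is a hypothetical name; the driver support routines apply the test to each mbuf in a chain and use their own constants.

```c
#include <assert.h>

#define TOY_PAGE 512    /* assumed page size in bytes (one VAX page) */

enum toy_xfer { TOY_COPY, TOY_REMAP };

/*
 * Decide how one piece of output data is presented to the bus:
 * remap when it starts on a page boundary and fills at least half
 * a page, otherwise copy it into the pages already mapped.
 */
static enum toy_xfer
toy_write_policy(unsigned long addr, unsigned long len)
{
    assert(len > 0);
    if (addr % TOY_PAGE == 0 && len >= TOY_PAGE / 2)
        return TOY_REMAP;
    return TOY_COPY;
}
```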
The following utility routines are available for use in
writing network interface drivers; all use the structures
described above.
if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx);
if_ubainit(ifuba, uban, hlen, nmr);
if_ubaminit allocates resources on UNIBUS adapter uban,
storing the information in the ifubinfo, ifrw and ifxmt
structures referenced. The ifr and ifx parameters are
pointers to arrays of ifrw and ifxmt structures whose
dimensions are nr and nx, respectively. if_ubainit is a
simpler, backwards-compatible interface used for
hardware with single buffers of each type. They are
called only at boot time or after a UNIBUS reset. One
data path (buffered or unbuffered, depending on the
ifu_flags field) is allocated for each buffer. The nmr
parameter indicates the number of UNIBUS mapping regis-
ters required to map a maximal sized packet onto the
UNIBUS, while hlen specifies the size of a local net-
work header, if any, which should be mapped separately
from the data (see the description of trailer protocols
in section 14). Sufficient UNIBUS mapping registers and
pages of memory are allocated to initialize the input
data path for an initial read. For the output data
path, mapping registers and pages of memory are also
allocated and mapped onto the UNIBUS. The pages asso-
ciated with the output data path are held in reserve in
the event a write requires copying non-page-aligned
data (see if_wubaput below). If if_ubainit is called
with memory pages already allocated, they will be used
instead of allocating new ones (this normally occurs
after a UNIBUS reset). A 1 is returned when allocation
and initialization are successful, 0 otherwise.
m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp);
m = if_rubaget(ifuba, totlen, off0, ifp);
if_ubaget and if_rubaget pull input data out of an
interface receive buffer and into an mbuf chain. The
first interface passes pointers to the ifubinfo struc-
ture for the interface and the ifrw structure for the
receive buffer; the second call may be used for
single-buffered devices. totlen specifies the length of
data to be obtained, not counting the local network
header. If off0 is non-zero, it indicates a byte
offset to a trailing local network header which should
be copied into a separate mbuf and prepended to the
front of the resultant mbuf chain. When the data
amount to at least a half a page, the previously mapped
data pages are remapped into the mbufs and swapped with
fresh pages, thus avoiding any copy. The receiving
interface is recorded as ifp, a pointer to an ifnet
structure, for the use of the receiving network proto-
col. A 0 return value indicates a failure to allocate
resources.
if_ubaput(ifubinfo, ifx, m);
if_wubaput(ifuba, m);
if_ubaput and if_wubaput map a chain of mbufs onto a
network interface in preparation for output. The first
interface is used by devices with multiple transmit
buffers. The chain includes any local network header,
which is copied so that it resides in the mapped and
aligned I/O space. Page-sized data that are page-aligned
in the output buffer are mapped to the UNIBUS
in place of the normal buffer page, and the correspond-
ing mbuf is placed on a queue to be freed after
transmission. Any other mbufs which contained non-
page-sized data portions are copied to the I/O space
and then freed. Pages mapped from a previous output
operation (no longer needed) are unmapped.
7. Socket/protocol interface
The interface between the socket routines and the com-
munication protocols is through the pr_usrreq routine
defined in the protocol switch table. The following
requests to a protocol module are possible:
#define PRU_ATTACH 0 /* attach protocol */
#define PRU_DETACH 1 /* detach protocol */
#define PRU_BIND 2 /* bind socket to address */
#define PRU_LISTEN 3 /* listen for connection */
#define PRU_CONNECT 4 /* establish connection to peer */
#define PRU_ACCEPT 5 /* accept connection from peer */
#define PRU_DISCONNECT 6 /* disconnect from peer */
#define PRU_SHUTDOWN 7 /* won't send any more data */
#define PRU_RCVD 8 /* have taken data; more room now */
#define PRU_SEND 9 /* send this data */
#define PRU_ABORT 10 /* abort (fast DISCONNECT, DETACH) */
#define PRU_CONTROL 11 /* control operations on protocol */
#define PRU_SENSE 12 /* return status into m */
#define PRU_RCVOOB 13 /* retrieve out of band data */
#define PRU_SENDOOB 14 /* send out of band data */
#define PRU_SOCKADDR 15 /* fetch socket's address */
#define PRU_PEERADDR 16 /* fetch peer's address */
#define PRU_CONNECT2 17 /* connect two sockets */
/* begin for protocols internal use */
#define PRU_FASTTIMO 18 /* 200ms timeout */
#define PRU_SLOWTIMO 19 /* 500ms timeout */
#define PRU_PROTORCV 20 /* receive from below */
#define PRU_PROTOSEND 21 /* send to below */
A call on the user request routine is of the form,
error = (*protosw[].pr_usrreq)(so, req, m, addr, rights);
int error; struct socket *so; int req; struct mbuf *m, *addr, *rights;
The mbuf data chain m is supplied for output operations and
for certain other operations where it is to receive a
result. The address addr is supplied for address-oriented
requests such as PRU_BIND and PRU_CONNECT. The rights param-
eter is an optional pointer to an mbuf chain containing
user-specified capabilities (see the sendmsg and recvmsg
system calls). The protocol is responsible for disposal of
the data mbuf chains on output operations. A non-zero return
value gives a UNIX error number which should be passed to
higher level software. The following paragraphs describe
each of the requests possible.
PRU_ATTACH
When a protocol is bound to a socket (with the socket
system call) the protocol module is called with this
request. It is the responsibility of the protocol
module to allocate any resources necessary. The
``attach'' request will always precede any of the other
requests, and should not occur more than once.
PRU_DETACH
This is the antithesis of the attach request, and is
used at the time a socket is deleted. The protocol
module may deallocate any resources assigned to the
socket.
PRU_BIND
When a socket is initially created it has no address
bound to it. This request indicates that an address
should be bound to an existing socket. The protocol
module must verify that the requested address is valid
and available for use.
PRU_LISTEN
The ``listen'' request indicates the user wishes to
listen for incoming connection requests on the associ-
ated socket. The protocol module should perform any
state changes needed to carry out this request (if pos-
sible). A ``listen'' request always precedes a request
to accept a connection.
PRU_CONNECT
The ``connect'' request indicates the user wants to
establish an association. The addr parameter supplied
describes the peer to be connected to. The effect of a
connect request may vary depending on the protocol.
Virtual circuit protocols, such as TCP [Postel81b], use
this request to initiate establishment of a TCP connec-
tion. Datagram protocols, such as UDP [Postel80], sim-
ply record the peer's address in a private data struc-
ture and use it to tag all outgoing packets. There are
no restrictions on how many times a connect request may
be used after an attach. If a protocol supports the
notion of multi-casting, it is possible to use multiple
connects to establish a multi-cast group. Alterna-
tively, an association may be broken by a
PRU_DISCONNECT request, and a new association created
with a subsequent connect request; all without destroy-
ing and creating a new socket.
PRU_ACCEPT
Following a successful PRU_LISTEN request and the
arrival of one or more connections, this request is
made to indicate the user has accepted the first con-
nection on the queue of pending connections. The pro-
tocol module should fill in the supplied address buffer
with the address of the connected party.
PRU_DISCONNECT
Eliminate an association created with a PRU_CONNECT
request.
PRU_SHUTDOWN
This call is used to indicate no more data will be sent
and/or received (the addr parameter indicates the
direction of the shutdown, as encoded in the shutdown
system call). The protocol may, at its discretion,
deallocate any data structures related to the shutdown
and/or notify a connected peer of the shutdown.
PRU_RCVD
This request is made only if the protocol entry in the
protocol switch table includes the PR_WANTRCVD flag.
When a user removes data from the receive queue this
request will be sent to the protocol module. It may be
used to trigger acknowledgements, refresh windowing
information, initiate data transfer, etc.
PRU_SEND
Each user request to send data is translated into one
or more PRU_SEND requests (a protocol may indicate that
a single user send request must be translated into a
single PRU_SEND request by specifying the PR_ATOMIC
flag in its protocol description). The data to be sent
is presented to the protocol as a list of mbufs and an
address is, optionally, supplied in the addr parameter.
The protocol is responsible for preserving the data in
the socket's send queue if it is not able to send it
immediately, or if it may need it at some later time
(e.g. for retransmission).
PRU_ABORT
This request indicates an abnormal termination of ser-
vice. The protocol should delete any existing
association(s).
PRU_CONTROL
The ``control'' request is generated when a user per-
forms a UNIX ioctl system call on a socket (and the
ioctl is not intercepted by the socket routines). It
allows protocol-specific operations to be provided out-
side the scope of the common socket interface. The
addr parameter contains a pointer to a static kernel
data area where relevant information may be obtained or
returned. The m parameter contains the actual ioctl
request code (note the non-standard calling conven-
tion). The rights parameter contains a pointer to an
ifnet structure if the ioctl operation pertains to a
particular network interface.
PRU_SENSE
The ``sense'' request is generated when the user makes
an fstat system call on a socket; it requests status of
the associated socket. This currently returns a stan-
dard stat structure. It typically contains only the
optimal transfer size for the connection (based on
buffer size, windowing information and maximum packet
size). The m parameter contains a pointer to a static
kernel data area where the status buffer should be
placed.
PRU_RCVOOB
Any ``out-of-band'' data presently available is to be
returned. An mbuf is passed to the protocol module,
and the protocol should either place data in the mbuf
or attach new mbufs to the one supplied if there is
insufficient space in the single mbuf. An error may be
returned if out-of-band data is not (yet) available or
has already been consumed. The addr parameter contains
any options such as MSG_PEEK to examine data without
consuming it.
PRU_SENDOOB
Like PRU_SEND, but for out-of-band data.
PRU_SOCKADDR
The local address of the socket is returned, if any is
currently bound to it. The address (with protocol
specific format) is returned in the addr parameter.
PRU_PEERADDR
The address of the peer to which the socket is con-
nected is returned. The socket must be in a
SS_ISCONNECTED state for this request to be made to the
protocol. The address format (protocol specific) is
returned in the addr parameter.
PRU_CONNECT2
The protocol module is supplied two sockets and
requested to establish a connection between the two
without binding any addresses, if possible. This call
is used in implementing the socketpair(2) system call.
The following requests are used internally by the pro-
tocol modules and are never generated by the socket rou-
tines. In certain instances, they are handed to the
pr_usrreq routine solely for convenience in tracing a
protocol's operation (e.g. PRU_SLOWTIMO).
PRU_FASTTIMO
A ``fast timeout'' has occurred. This request is made
when a timeout occurs in the protocol's pr_fasttimo routine.
The addr parameter indicates which timer
expired.
PRU_SLOWTIMO
A ``slow timeout'' has occurred. This request is made
when a timeout occurs in the protocol's pr_slowtimo
routine. The addr parameter indicates which timer
expired.
PRU_PROTORCV
This request is used in the protocol-protocol interface,
not by the socket routines. It requests reception of
data destined for the protocol and not the user. No
protocols currently use this facility.
PRU_PROTOSEND
This request allows a protocol to send data destined
for another protocol module, not a user. The details
of how data is marked ``addressed to protocol'' instead
of ``addressed to user'' are left to the protocol
modules. No protocols currently use this facility.
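The request descriptions above can be made concrete with a small dispatch routine in the style of a datagram protocol. This is a sketch only: toy_pcb and toy_usrreq are hypothetical, addresses are plain integers rather than mbufs, the argument list is shortened from the five-argument pr_usrreq convention, and the particular error numbers returned are assumptions chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PRU_ATTACH      0   /* attach protocol */
#define PRU_DETACH      1   /* detach protocol */
#define PRU_BIND        2   /* bind socket to address */
#define PRU_CONNECT     4   /* establish connection to peer */
#define PRU_DISCONNECT  6   /* disconnect from peer */

/* Hypothetical per-socket protocol state. */
struct toy_pcb {
    int           attached; /* set by PRU_ATTACH */
    unsigned long laddr;    /* local address, 0 = unbound */
    unsigned long faddr;    /* peer address, 0 = no association */
};

/* Returns 0 on success or a UNIX error number. */
static int
toy_usrreq(struct toy_pcb *pcb, int req, unsigned long addr)
{
    assert(pcb != NULL);
    if (req != PRU_ATTACH && !pcb->attached)
        return EINVAL;          /* attach precedes every other request */

    switch (req) {
    case PRU_ATTACH:
        if (pcb->attached)
            return EINVAL;      /* attach may occur only once */
        pcb->attached = 1;
        return 0;
    case PRU_DETACH:
        pcb->attached = 0;      /* release everything held for the socket */
        pcb->laddr = pcb->faddr = 0;
        return 0;
    case PRU_BIND:
        if (addr == 0 || pcb->laddr != 0)
            return EINVAL;      /* invalid address, or already bound */
        pcb->laddr = addr;
        return 0;
    case PRU_CONNECT:
        if (addr == 0)
            return EINVAL;
        pcb->faddr = addr;      /* a repeated connect replaces the peer */
        return 0;
    case PRU_DISCONNECT:
        if (pcb->faddr == 0)
            return ENOTCONN;
        pcb->faddr = 0;
        return 0;
    default:
        return EOPNOTSUPP;      /* request not handled by this sketch */
    }
}
```

The sequence attach, bind, connect, disconnect, connect, detach exercises the property noted under PRU_CONNECT: an association may be replaced or broken without destroying the socket.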
8. Protocol/protocol interface
The interface between protocol modules is through the
pr_usrreq, pr_input, pr_output, pr_ctlinput, and
pr_ctloutput routines. The calling conventions for all but
the pr_usrreq routine are expected to be specific to the
protocol modules and are not guaranteed to be consistent
across protocol families. We will examine the conventions
used for some of the Internet protocols in this section as
an example.
8.1. pr_output
The Internet protocol UDP uses the convention,
error = udp_output(inp, m);
int error; struct inpcb *inp; struct mbuf *m;
where the inp, ``internet protocol control block'', passed
between modules conveys per connection state information,
and the mbuf chain contains the data to be sent. UDP per-
forms consistency checks, appends its header, calculates a
checksum, etc. before passing the packet on. UDP is based on
the Internet Protocol, IP [Postel81a], as its transport. UDP
passes a packet to the IP module for output as follows:
error = ip_output(m, opt, ro, flags);
int error; struct mbuf *m, *opt; struct route *ro; int flags;
The call to IP's output routine is more complicated
than that for UDP, as befits the additional work the IP
module must do. The m parameter is the data to be sent, and
the opt parameter is an optional list of IP options which
should be placed in the IP packet header. The ro parameter
is used in making routing decisions (and passing them
back to the caller for use in subsequent calls). The final
parameter, flags, contains flags indicating whether the user
is allowed to transmit a broadcast packet and if routing is
to be performed. The broadcast flag may be inconsequential
if the underlying hardware does not support the notion of
broadcasting.
All output routines return 0 on success and a UNIX
error number if a failure occurred which could be detected
immediately (no buffer space available, no route to destina-
tion, etc.).
8.2. pr_input
Both UDP and TCP use the following calling convention,
(void) (*protosw[].pr_input)(m, ifp);
struct mbuf *m; struct ifnet *ifp;
Each mbuf list passed is a single packet to be processed by
the protocol module. The interface from which the packet was
received is passed as the second parameter.
The IP input routine is a VAX software interrupt level
routine, and so is not called with any parameters. It
instead communicates with network interfaces through a
queue, ipintrq, which is identical in structure to the
queues used by the network interfaces for storing packets
awaiting transmission. The software interrupt is enabled by
the network interfaces when they place input data on the
input queue.
8.3. pr_ctlinput
This routine is used to convey ``control'' information
to a protocol module (i.e. information which might be passed
to the user, but is not data).
The common calling convention for this routine is,
(void) (*protosw[].pr_ctlinput)(req, addr);
int req; struct sockaddr *addr;
The req parameter is one of the following,
#define PRC_IFDOWN 0 /* interface transition */
#define PRC_ROUTEDEAD 1 /* select new route if possible */
#define PRC_QUENCH 4 /* someone said to slow down */
#define PRC_MSGSIZE 5 /* message size forced drop */
#define PRC_HOSTDEAD 6 /* normally from IMP */
#define PRC_HOSTUNREACH 7 /* ditto */
#define PRC_UNREACH_NET 8 /* no route to network */
#define PRC_UNREACH_HOST 9 /* no route to host */
#define PRC_UNREACH_PROTOCOL 10 /* dst says bad protocol */
#define PRC_UNREACH_PORT 11 /* bad port # */
#define PRC_UNREACH_NEEDFRAG 12 /* IP_DF caused drop */
#define PRC_UNREACH_SRCFAIL 13 /* source route failed */
#define PRC_REDIRECT_NET 14 /* net routing redirect */
#define PRC_REDIRECT_HOST 15 /* host routing redirect */
#define PRC_REDIRECT_TOSNET 16 /* redirect for type of service & net */
#define PRC_REDIRECT_TOSHOST 17 /* redirect for tos & host */
#define PRC_TIMXCEED_INTRANS 18 /* packet lifetime expired in transit */
#define PRC_TIMXCEED_REASS 19 /* lifetime expired on reass q */
#define PRC_PARAMPROB 20 /* header incorrect */
while the addr parameter is the address to which the condi-
tion applies. Many of the requests have obviously been
derived from ICMP (the Internet Control Message Protocol
[Postel81c]), and from error messages defined in the 1822
host/IMP convention [BBN78]. Mapping tables exist to con-
vert control requests to UNIX error codes which are
delivered to a user.
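A mapping table of the kind mentioned above might look as follows. The table contents are illustrative assumptions, not a copy of any protocol family's table: toy_ctlerrmap, toy_ctl_to_errno and TOY_NCMDS are hypothetical names, and the error number chosen for each request is only a plausible example. A zero entry is taken to mean that nothing is reported to the user.

```c
#include <assert.h>
#include <errno.h>

#define PRC_QUENCH            4
#define PRC_MSGSIZE           5
#define PRC_HOSTDEAD          6
#define PRC_HOSTUNREACH       7
#define PRC_UNREACH_NET       8
#define PRC_UNREACH_HOST      9
#define PRC_UNREACH_PROTOCOL 10
#define PRC_UNREACH_PORT     11
#define TOY_NCMDS            21     /* one more than the largest request */

/* Illustrative mapping; unlisted requests map to 0 (not reported). */
static const int toy_ctlerrmap[TOY_NCMDS] = {
    [PRC_MSGSIZE]          = EMSGSIZE,
    [PRC_HOSTDEAD]         = EHOSTDOWN,
    [PRC_HOSTUNREACH]      = EHOSTUNREACH,
    [PRC_UNREACH_NET]      = ENETUNREACH,
    [PRC_UNREACH_HOST]     = EHOSTUNREACH,
    [PRC_UNREACH_PROTOCOL] = ECONNREFUSED,
    [PRC_UNREACH_PORT]     = ECONNREFUSED,
};

/* Convert a control request into a UNIX error number, or 0. */
static int
toy_ctl_to_errno(int req)
{
    assert(req >= 0 && req < TOY_NCMDS);
    return toy_ctlerrmap[req];
}
```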
8.4. pr_ctloutput
This is the routine that implements per-socket options
at the protocol level for getsockopt and setsockopt. The
calling convention is,
error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp);
int op; struct socket *so; int level, optname; struct mbuf **mp;
where op is one of PRCO_SETOPT or PRCO_GETOPT, so is the
socket from whence the call originated, and level and
optname are the protocol level and option name supplied by
the user. The results of a PRCO_GETOPT call are returned in
an mbuf whose address is placed in mp before return. On a
PRCO_SETOPT call, mp contains the address of an mbuf con-
taining the option data; the mbuf should be freed before
return.
9. Protocol/network-interface interface
The lowest layer in the set of protocols which comprise
a protocol family must interface itself to one or more net-
work interfaces in order to transmit and receive packets.
It is assumed that any routing decisions have been made
before handing a packet to a network interface; in fact this
is absolutely necessary in order to locate any interface at
all (unless, of course, one uses a single ``hardwired''
interface). There are two cases with which to be concerned,
transmission of a packet and receipt of a packet; each will
be considered separately.
9.1. Packet transmission
Assuming a protocol has a handle on an interface, ifp,
a (struct ifnet *), it transmits a fully formatted packet
with the following call,
error = (*ifp->if_output)(ifp, m, dst);
int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;
The output routine for the network interface transmits the
packet m to the dst address, or returns an error indication
(a UNIX error number). In reality transmission may not be
immediate or successful; normally the output routine simply
queues the packet on its send queue and primes an interrupt
driven routine to actually transmit the packet. For unreli-
able media, such as the Ethernet, ``successful'' transmis-
sion simply means that the packet has been placed on the
cable without a collision. On the other hand, an 1822
interface guarantees proper delivery or an error indication
for each message transmitted. The model employed in the net-
working system attaches no promises of delivery to the pack-
ets handed to a network interface, and thus corresponds more
closely to the Ethernet. Errors returned by the output rou-
tine are only those that can be detected immediately, and
are normally trivial in nature (no buffer space, address
format not handled, etc.). No indication is received if
errors are detected after the call has returned.
9.2. Packet reception
Each protocol family must have one or more ``lowest
level'' protocols. These protocols deal with internetwork
addressing and are responsible for the delivery of incoming
packets to the proper protocol processing modules. In the
PUP model [Boggs78] these protocols are termed Level 1 pro-
tocols, in the ISO model, network layer protocols. In this
system each such protocol module has an input packet queue
assigned to it. Incoming packets received by a network
interface are queued for the protocol module, and a VAX
software interrupt is posted to initiate processing.
Four macros are available for queuing and dequeuing
packets:
IF_ENQUEUE(ifq, m)
This places the packet m at the tail of the queue ifq.
IF_DEQUEUE(ifq, m)
This places a pointer to the packet at the head of
queue ifq in m and removes the packet from the queue. A
zero value will be returned in m if the queue is empty.
IF_DEQUEUEIF(ifq, m, ifp)
Like IF_DEQUEUE, this removes the next packet from the
head of a queue and returns it in m. A pointer to the
interface on which the packet was received is placed in
ifp, a (struct ifnet *).
IF_PREPEND(ifq, m)
This places the packet m at the head of the queue ifq.
Each queue has a maximum length associated with it as a
simple form of congestion control. The macro IF_QFULL(ifq)
returns 1 if the queue is filled, in which case the macro
IF_DROP(ifq) should be used to increment the count of the
number of packets dropped, and the offending packet is
dropped. For example, the following code fragment is com-
monly found in a network interface's input routine,
if (IF_QFULL(inq)) {
    IF_DROP(inq);
    m_freem(m);
} else
    IF_ENQUEUE(inq, m);
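The behaviour of these queues can be tried outside the kernel with a small model. The sketch below is not the system's definition of the macros: toy_pkt stands in for an mbuf chain, the queue operations are written as functions, and all toy_ and TOY_ names are hypothetical. The routine toy_input follows the drop-when-full policy of the fragment above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet (an mbuf chain in the kernel). */
struct toy_pkt {
    struct toy_pkt *next;   /* next packet on the queue */
    int id;
};

struct toy_ifqueue {
    struct toy_pkt *head, *tail;
    int len;                /* packets currently queued */
    int maxlen;             /* limit used for congestion control */
    int drops;              /* packets dropped because queue was full */
};

#define TOY_QFULL(q)    ((q)->len >= (q)->maxlen)
#define TOY_DROP(q)     ((q)->drops++)

/* Append p at the tail; the caller must have checked TOY_QFULL. */
static void
toy_enqueue(struct toy_ifqueue *q, struct toy_pkt *p)
{
    assert(!TOY_QFULL(q));
    p->next = NULL;
    if (q->tail == NULL)
        q->head = p;
    else
        q->tail->next = p;
    q->tail = p;
    q->len++;
}

/* Remove and return the packet at the head, or NULL if empty. */
static struct toy_pkt *
toy_dequeue(struct toy_ifqueue *q)
{
    struct toy_pkt *p = q->head;

    if (p != NULL) {
        if ((q->head = p->next) == NULL)
            q->tail = NULL;
        p->next = NULL;
        q->len--;
    }
    return p;
}

/* Input-side policy: count and drop when full. Returns 1 if queued. */
static int
toy_input(struct toy_ifqueue *q, struct toy_pkt *p)
{
    if (TOY_QFULL(q)) {
        TOY_DROP(q);
        return 0;           /* the caller would free the packet here */
    }
    toy_enqueue(q, p);
    return 1;
}
```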
10. Gateways and routing issues
The system has been designed with the expectation that
it will be used in an internetwork environment. The
``canonical'' environment was envisioned to be a collection
of local area networks connected at one or more points
through hosts with multiple network interfaces (one on each
local area network), and possibly a connection to a long
haul network (for example, the ARPANET). In such an
environment, issues of gatewaying and packet routing become
very important. Certain of these issues, such as congestion
control, have been handled in a simplistic manner or specif-
ically not addressed. Instead, where possible, the network
system attempts to provide simple mechanisms upon which more
involved policies may be implemented. As some of these
problems become better understood, the solutions developed
will be incorporated into the system.
This section will describe the facilities provided for
packet routing. The simplistic mechanisms provided for
congestion control are described in section 12.
10.1. Routing tables
The network system maintains a set of routing tables
for selecting a network interface to use in delivering a
packet to its destination. These tables are of the form:
struct rtentry {
    u_long rt_hash; /* hash key for lookups */
    struct sockaddr rt_dst; /* destination net or host */
    struct sockaddr rt_gateway; /* forwarding agent */
    short rt_flags; /* see below */
    short rt_refcnt; /* no. of references to structure */
    u_long rt_use; /* packets sent using route */
    struct ifnet *rt_ifp; /* interface to give packet to */
};
The routing information is organized in two separate
tables, one for routes to a host and one for routes to a
network. The distinction between hosts and networks is
necessary so that a single mechanism may be used for both
broadcast and multi-drop type networks, and also for net-
works built from point-to-point links (e.g. DECnet [DEC80]).
Each table is organized as a hashed set of linked
lists. Two 32-bit hash values are calculated by routines
defined for each address family; one based on the destina-
tion being a host, and one assuming the target is the net-
work portion of the address. Each hash value is used to
locate a hash chain to search (by taking the value modulo
the hash table size) and the entire 32-bit value is then
used as a key in scanning the list of routes. Lookups are
applied first to the routing table for hosts, then to the
routing table for networks. If both lookups fail, a final
lookup is made for a ``wildcard'' route (by convention, net-
work 0). The first appropriate route discovered is used. By
doing this, routes to a specific host on a network may be
present as well as routes to the network. This also allows
a ``fall back'' network route to be defined to a ``smart''
gateway which may then perform more intelligent routing.
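The lookup order (host table, then network table, then wildcard) can be sketched as follows. Everything here is a simplified model under stated assumptions: addresses are 32-bit integers whose upper 24 bits are taken as the network number, the hash routines are trivial, the table size TOY_HASHSIZE is arbitrary, the wildcard is stored in the network table with hash and destination 0, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HASHSIZE 8          /* assumed hash table size */

struct toy_rtentry {
    unsigned long rt_hash;      /* full hash value, used as key */
    unsigned long rt_dst;       /* host address or network number */
    struct toy_rtentry *rt_next;/* next entry on the hash chain */
};

static struct toy_rtentry *toy_rthost[TOY_HASHSIZE];
static struct toy_rtentry *toy_rtnet[TOY_HASHSIZE];

/* Hypothetical address-family routines. */
static unsigned long toy_netof(unsigned long a)    { return a >> 8; }
static unsigned long toy_hashhost(unsigned long a) { return a; }
static unsigned long toy_hashnet(unsigned long a)  { return toy_netof(a); }

/* Add rt to the chain selected by its hash value. */
static void
toy_insert(struct toy_rtentry **table, struct toy_rtentry *rt)
{
    unsigned long slot;

    assert(rt != NULL);
    slot = rt->rt_hash % TOY_HASHSIZE;
    rt->rt_next = table[slot];
    table[slot] = rt;
}

/* Scan one hash chain, comparing the full hash and the destination. */
static struct toy_rtentry *
toy_search(struct toy_rtentry **table, unsigned long hash, unsigned long key)
{
    struct toy_rtentry *rt;

    for (rt = table[hash % TOY_HASHSIZE]; rt != NULL; rt = rt->rt_next)
        if (rt->rt_hash == hash && rt->rt_dst == key)
            return rt;
    return NULL;
}

/* Host route first, then network route, then the wildcard (network 0). */
static struct toy_rtentry *
toy_lookup(unsigned long dst)
{
    struct toy_rtentry *rt;

    if ((rt = toy_search(toy_rthost, toy_hashhost(dst), dst)) != NULL)
        return rt;
    if ((rt = toy_search(toy_rtnet, toy_hashnet(dst), toy_netof(dst))) != NULL)
        return rt;
    return toy_search(toy_rtnet, 0, 0);
}
```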
Each routing table entry contains a destination (the
desired final destination), a gateway to which to send the
packet, and various flags which indicate the route's status
and type (host or network). A count of the number of pack-
ets sent using the route is kept, along with a count of
``held references'' to the dynamically allocated structure
to ensure that memory reclamation occurs only when the route
is not in use. Finally, a pointer to a network interface
is kept; packets sent using the route should be handed
to this interface.
Routes are typed in two ways: either as host or net-
work, and as ``direct'' or ``indirect''. The host/network
distinction determines how to compare the rt_dst field dur-
ing lookup. If the route is to a network, only a packet's
destination network is compared to the rt_dst entry stored
in the table. If the route is to a host, the addresses must
match bit for bit.
The distinction between ``direct'' and ``indirect''
routes indicates whether the destination is directly con-
nected to the source. This is needed when performing local
network encapsulation. If a packet is destined for a peer
at a host or network which is not directly connected to the
source, the internetwork packet header will contain the
address of the eventual destination, while the local network
header will address the intervening gateway. Should the
destination be directly connected, these addresses are
likely to be identical, or a mapping between the two exists.
The RTF_GATEWAY flag indicates that the route is to an
``indirect'' gateway agent, and that the local network
header should be filled in from the rt_gateway field instead
of from the final internetwork destination address.
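In code, the effect of RTF_GATEWAY on local network encapsulation is a single selection. The sketch below assumes integer addresses and a cut-down route structure; toy_route and toy_linkdst are hypothetical names, and the flag value 0x2 is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define RTF_GATEWAY 0x2     /* assumed value; destination is via a gateway */

/* Hypothetical, cut-down routing entry. */
struct toy_route {
    short         rt_flags;
    unsigned long rt_gateway;   /* forwarding agent */
};

/*
 * Choose the address placed in the local network header.  The
 * internetwork header always carries final_dst; only the local
 * network destination changes for an indirect route.
 */
static unsigned long
toy_linkdst(const struct toy_route *rt, unsigned long final_dst)
{
    assert(rt != NULL);
    if (rt->rt_flags & RTF_GATEWAY)
        return rt->rt_gateway;  /* indirect: address the gateway */
    return final_dst;           /* direct: address the destination */
}
```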
It is assumed that multiple routes to the same destina-
tion will not be present; only one of multiple routes, that
most recently installed, will be used.
Routing redirect control messages are used to dynami-
cally modify existing routing table entries as well as
dynamically create new routing table entries. On hosts
where exhaustive routing information is too expensive to
maintain (e.g. work stations), the combination of wildcard
routing entries and routing redirect messages can be used to
provide a simple routing management scheme without the use
of a higher level policy process. Current connections may be
rerouted after notification of the protocols by means of
their pr_ctlinput entries. Statistics are kept by the rout-
ing table routines on the use of routing redirect messages
and their effect on the routing tables. These statistics
may be viewed using netstat(1).
Status information other than routing redirect control
messages may be used in the future, but at present they are
ignored. Likewise, more intelligent ``metrics'' may be used
to describe routes in the future, possibly based on
bandwidth and monetary costs.
10.2. Routing table interface
A protocol accesses the routing tables through three
routines, one to allocate a route, one to free a route, and
one to process a routing redirect control message. The rou-
tine rtalloc performs route allocation; it is called with a
pointer to the following structure containing the desired
destination:
struct route {
struct rtentry *ro_rt;
struct sockaddr ro_dst;
};
The route returned is assumed ``held'' by the caller until
released with an rtfree call. Protocols which implement
virtual circuits, such as TCP, hold onto routes for the
duration of the circuit's lifetime, while connection-less
protocols, such as UDP, allocate and free routes whenever
their destination address changes.
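The allocate-and-free pattern attributed to connection-less
protocols might look roughly like the sketch below. The
routing table is reduced to a single reference-counted entry,
and sk_rtalloc, sk_rtfree and sk_route_for are hypothetical
stand-ins for illustration; the real rtalloc performs a table
lookup on ro_dst.

```c
#include <stddef.h>
#include <string.h>

struct sk_addr {
    unsigned short sa_family;
    char sa_data[14];
};

struct sk_rtentry {
    int rt_refcnt;              /* number of holders (assumed field) */
};

struct sk_route {
    struct sk_rtentry *ro_rt;   /* held route, or NULL */
    struct sk_addr ro_dst;      /* destination the route is for */
};

/* Stand-in routing table containing exactly one entry. */
static struct sk_rtentry sk_table_entry;

/* Stand-in for rtalloc: "look up" ro_dst and hold the result. */
void
sk_rtalloc(struct sk_route *ro)
{
    ro->ro_rt = &sk_table_entry;
    sk_table_entry.rt_refcnt++;
}

/* Stand-in for rtfree: release a held route. */
void
sk_rtfree(struct sk_rtentry *rt)
{
    rt->rt_refcnt--;
}

/*
 * Datagram-style use: keep the held route while the destination
 * is unchanged; free it and allocate again when it changes.
 */
void
sk_route_for(struct sk_route *ro, const struct sk_addr *dst)
{
    if (ro->ro_rt != NULL &&
        memcmp(&ro->ro_dst, dst, sizeof(*dst)) != 0) {
        sk_rtfree(ro->ro_rt);
        ro->ro_rt = NULL;
    }
    if (ro->ro_rt == NULL) {
        ro->ro_dst = *dst;
        sk_rtalloc(ro);
    }
}
```

A virtual circuit protocol would instead call the allocation
routine once when the circuit is set up and the free routine
once when it is torn down.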
The routine rtredirect is called to process a routing
redirect control message. It is called with a destination
address, the new gateway to that destination, and the source
of the redirect. Redirects are accepted only from the
current router for the destination. If a non-wildcard route
exists to the destination, the gateway entry in the route is
modified to point at the new gateway supplied. Otherwise, a
new routing table entry is inserted reflecting the informa-
tion supplied. Routes to interfaces and routes to gateways
which are not directly accessible from the host are ignored.
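The acceptance rules for a redirect can be summarized as a
small decision function. This is only a sketch of the rules
as stated above; the structure, its fields and the function
name are hypothetical, and the real rtredirect works on the
routing table directly rather than on precomputed flags.

```c
/* Possible outcomes of processing a routing redirect. */
enum sk_redir_action {
    SK_REDIR_IGNORE,    /* redirect is not acted upon */
    SK_REDIR_MODIFY,    /* change gateway of the existing route */
    SK_REDIR_CREATE     /* insert a new routing table entry */
};

/* Facts about the redirect, assumed to be determined by the caller. */
struct sk_redirect {
    int from_current_router;         /* sender is the router now in use */
    int gateway_directly_accessible; /* new gateway is on an attached net */
    int route_is_to_interface;       /* existing route is a direct route */
    int route_is_wildcard;           /* only a wildcard route matched */
};

enum sk_redir_action
sk_redirect_action(const struct sk_redirect *r)
{
    if (!r->from_current_router)
        return SK_REDIR_IGNORE;
    if (!r->gateway_directly_accessible)
        return SK_REDIR_IGNORE;
    if (r->route_is_to_interface)
        return SK_REDIR_IGNORE;
    if (r->route_is_wildcard)
        return SK_REDIR_CREATE;
    return SK_REDIR_MODIFY;
}
```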
10.3. User level routing policies
Routing policies implemented in user processes manipu-
late the kernel routing tables through two ioctl calls. The
commands SIOCADDRT and SIOCDELRT add and delete routing
entries, respectively; the tables are read through the
/dev/kmem device. The decision to place policy decisions in
a user process implies that routing table updates may lag a
bit behind the identification of new routes, or the failure
of existing routes, but this period of instability is nor-
mally very small with proper implementation of the routing
process. Advisory information, such as ICMP error messages
and IMP diagnostic messages, may be read from raw sockets
(described in the next section).
Several routing policy processes have already been
implemented. The system standard ``routing daemon'' uses a
variant of the Xerox NS Routing Information Protocol
[Xerox81] to maintain up-to-date routing tables in our local
environment. Interaction with other existing routing proto-
cols, such as the Internet EGP (Exterior Gateway Protocol),
has been accomplished using a similar process.
11. Raw sockets
A raw socket is an object which allows users direct
access to a lower-level protocol. Raw sockets are intended
for knowledgeable processes which wish to take advantage of
some protocol feature not directly accessible through the
normal interface, or for the development of new protocols
built atop existing lower level protocols. For example, a
new version of TCP might be developed at the user level by
utilizing a raw IP socket for delivery of packets. The raw
IP socket interface attempts to provide an identical inter-
face to the one a protocol would have if it were resident in
the kernel.
The raw socket support is built around a generic raw
socket interface, (possibly) augmented by protocol-specific
processing routines. This section will describe the core of
the raw socket interface.
11.1. Control blocks
Every raw socket has a protocol control block of the
following form:
struct rawcb {
struct rawcb *rcb_next; /* doubly linked list */
struct rawcb *rcb_prev;
struct socket *rcb_socket; /* back pointer to socket */
struct sockaddr rcb_faddr; /* destination address */
struct sockaddr rcb_laddr; /* socket's address */
struct sockproto rcb_proto; /* protocol family, protocol */
caddr_t rcb_pcb; /* protocol specific stuff */
struct mbuf *rcb_options; /* protocol specific options */
struct route rcb_route; /* routing information */
short rcb_flags;
};
All the control blocks are kept on a doubly linked list for
performing lookups during packet dispatch. Associations may
be recorded in the control block and used by the output rou-
tine in preparing packets for transmission. The rcb_proto
structure contains the protocol family and protocol number
with which the raw socket is associated. The protocol, fam-
ily and addresses are used to filter packets on input; this
will be described in more detail shortly. If any protocol-
specific information is required, it may be attached to the
control block using the rcb_pcb field. Protocol-specific
options for transmission in outgoing packets may be stored
in rcb_options.
A raw socket interface is datagram oriented. That is,
each send or receive on the socket requires a destination
address. This address may be supplied by the user or stored
in the control block and automatically installed in the
outgoing packet by the output routine. Since it is not pos-
sible to determine whether an address is present or not in
the control block, two flags, RAW_LADDR and RAW_FADDR, indi-
cate if a local and foreign address are present. Routing is
expected to be performed by the underlying protocol if
necessary.
11.2. Input processing
Input packets are ``assigned'' to raw sockets based on
a simple pattern matching scheme. Each network interface or
protocol gives unassigned packets to the raw input routine
with the call:
raw_input(m, proto, src, dst)
struct mbuf *m; struct sockproto *proto; struct sockaddr *src, *dst;
The data packet then has a generic header prepended to it of
the form
struct raw_header {
struct sockproto raw_proto;
struct sockaddr raw_dst;
struct sockaddr raw_src;
};
and it is placed in a packet queue for the ``raw input pro-
tocol'' module. Packets taken from this queue are copied
into any raw sockets that match the header according to the
following rules:
1) The protocol family of the socket and header agree.
2) If the protocol number in the socket is non-zero, then
it agrees with that found in the packet header.
3) If a local address is defined for the socket, the
address format of the local address is the same as the
destination address's and the two addresses agree bit
for bit.
4) The rules of 3) are applied to the socket's foreign
address and the packet's source address.
A basic assumption is that addresses present in the control
block and packet header (as constructed by the network
interface and any raw input protocol module) are in a canon-
ical form which may be ``block compared''.
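The four matching rules can be expressed as a single
predicate. The following is a sketch under stated
assumptions: the sk_ prefixed types are cut-down stand-ins
for rawcb, raw_header, sockproto and sockaddr, the flag
values are placeholders for RAW_LADDR and RAW_FADDR, and
sk_raw_match is a hypothetical name.

```c
#include <string.h>

#define SK_RAW_LADDR 01     /* stand-in: local address present */
#define SK_RAW_FADDR 02     /* stand-in: foreign address present */

struct sk_addr {
    unsigned short sa_family;
    char sa_data[14];
};

struct sk_proto {
    unsigned short sp_family;   /* protocol family */
    unsigned short sp_protocol; /* protocol number, 0 = any */
};

struct sk_rawcb {
    struct sk_addr rcb_faddr;   /* destination (foreign) address */
    struct sk_addr rcb_laddr;   /* socket's (local) address */
    struct sk_proto rcb_proto;
    short rcb_flags;
};

struct sk_raw_header {
    struct sk_proto raw_proto;
    struct sk_addr raw_dst;
    struct sk_addr raw_src;
};

/* Same address format and identical address bytes ("block compare"). */
static int
sk_addr_eq(const struct sk_addr *a, const struct sk_addr *b)
{
    return a->sa_family == b->sa_family &&
        memcmp(a->sa_data, b->sa_data, sizeof(a->sa_data)) == 0;
}

/* Return 1 if the packet described by rh should be copied to rp. */
int
sk_raw_match(const struct sk_rawcb *rp, const struct sk_raw_header *rh)
{
    /* Rule 1: protocol families agree. */
    if (rp->rcb_proto.sp_family != rh->raw_proto.sp_family)
        return 0;
    /* Rule 2: a non-zero protocol number must agree. */
    if (rp->rcb_proto.sp_protocol != 0 &&
        rp->rcb_proto.sp_protocol != rh->raw_proto.sp_protocol)
        return 0;
    /* Rule 3: local address, if set, matches the destination. */
    if ((rp->rcb_flags & SK_RAW_LADDR) &&
        !sk_addr_eq(&rp->rcb_laddr, &rh->raw_dst))
        return 0;
    /* Rule 4: foreign address, if set, matches the source. */
    if ((rp->rcb_flags & SK_RAW_FADDR) &&
        !sk_addr_eq(&rp->rcb_faddr, &rh->raw_src))
        return 0;
    return 1;
}
```

A socket with no addresses bound and a protocol number of
zero therefore receives every packet of its protocol family.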
11.3. Output processing
On output the raw pr_usrreq routine passes the packet
and a pointer to the raw control block to the raw protocol
output routine for any processing required before it is
delivered to the appropriate network interface. The output
routine is normally the only code required to implement a
raw socket interface.
12. Buffering and congestion control
One of the major factors in the performance of a proto-
col is the buffering policy used. Lack of a proper buffer-
ing policy can force packets to be dropped, cause falsified
windowing information to be emitted by protocols, fragment
host memory, degrade the overall host performance, etc. Due
to problems such as these, most systems allocate a fixed
pool of memory to the networking system and impose a policy
optimized for ``normal'' network operation.
The networking system developed for UNIX is little dif-
ferent in this respect. At boot time a fixed amount of
memory is allocated by the networking system. At later
times more system memory may be requested as the need
arises, but at no time is memory ever returned to the sys-
tem. It is possible to garbage collect memory from the net-
work, but difficult. In order to perform this garbage col-
lection properly, some portion of the network will have to
be ``turned off'' as data structures are updated. The
interval over which this occurs must be kept small compared to
the average inter-packet arrival time, or too much traffic
may be lost, impacting other hosts on the network, as well
as increasing load on the interconnecting media. In our
environment we have not experienced a need for such compac-
tion, and thus have left the problem unresolved.
The mbuf structure was introduced in section 5. In
this section a brief description will be given of the allo-
cation mechanisms, and policies used by the protocols in
performing connection level buffering.
12.1. Memory management
The basic memory allocation routines manage a private
page map, the size of which determines the maximum amount of
memory that may be allocated by the network. A small amount
of memory is allocated at boot time to initialize the mbuf
and mbuf page cluster free lists. When the free lists are
exhausted, more memory is requested from the system memory
allocator if space remains in the map. If memory cannot be
allocated, callers may block awaiting free memory, or the
failure may be reflected to the caller immediately. The
allocator will not block awaiting free map entries, however,
as exhaustion of the page map usually indicates that buffers
have been lost due to a ``leak.'' The private page table is
used by the network buffer management routines in remapping
pages to be logically contiguous as the need arises. In
addition, an array of reference counts parallels the page
table and is used when multiple references to a page are
present.
Mbufs are 128 byte structures, 8 fitting in a 1Kbyte
page of memory. When data is placed in mbufs, it is copied
or remapped into logically contiguous pages of memory from
the network page pool if possible. Data smaller than half of
the size of a page is copied into one or more 112 byte mbuf
data areas.
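The sizes quoted above lead to some simple arithmetic, shown
here as a sketch. The constants repeat the figures in the
text (128 byte mbufs, 112 byte data areas, 1Kbyte pages); the
macro and function names are hypothetical.

```c
#define SK_MSIZE 128    /* size of an mbuf in bytes */
#define SK_MLEN  112    /* data bytes available in a small mbuf */
#define SK_PAGE  1024   /* size of a page of network memory */

/* Number of mbufs that fit in one page. */
int
sk_mbufs_per_page(void)
{
    return SK_PAGE / SK_MSIZE;
}

/* Data of at least half a page is a candidate for a page of its own. */
int
sk_uses_page(int len)
{
    return len >= SK_PAGE / 2;
}

/* Number of small mbufs needed to hold len bytes of copied data. */
int
sk_small_mbufs_needed(int len)
{
    return (len + SK_MLEN - 1) / SK_MLEN;
}
```

For example, 511 bytes falls just under the half-page
threshold and would occupy five small mbufs, while 512 bytes
would be placed in a page.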
12.2. Protocol buffering policies
Protocols reserve fixed amounts of buffering for send
and receive queues at socket creation time. These amounts
define the high and low water marks used by the socket rou-
tines in deciding when to block and unblock a process. The
reservation of space does not currently result in any action
by the memory management routines.
Protocols which provide connection level flow control
do this based on the amount of space in the associated
socket queues. That is, send windows are calculated based
on the amount of free space in the socket's receive queue,
while receive windows are adjusted based on the amount of
data awaiting transmission in the send queue. Care has been
taken to avoid the ``silly window syndrome'' described in
[Clark82] at both the sending and receiving ends.
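The relationship between a socket queue's reservation and the
window offered to a peer can be sketched as below. This is an
illustrative assumption about the calculation, not the
protocol code: the structure is a cut-down stand-in for a
socket queue, and the function names are hypothetical. Silly
window avoidance is deliberately left out.

```c
/* Cut-down stand-in for a socket data queue. */
struct sk_sockbuf {
    long sb_cc;     /* bytes currently queued */
    long sb_hiwat;  /* high water mark: bytes reserved for the queue */
};

/* Free space remaining in the queue, never negative. */
long
sk_sbspace(const struct sk_sockbuf *sb)
{
    long space = sb->sb_hiwat - sb->sb_cc;

    return space > 0 ? space : 0;
}

/*
 * Window a flow-controlled protocol could offer to its peer:
 * the free space left in the socket's receive queue.
 */
long
sk_offered_window(const struct sk_sockbuf *rcv)
{
    return sk_sbspace(rcv);
}
```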
12.3. Queue limiting
Incoming packets from the network are always received
unless memory allocation fails. However, each Level 1 pro-
tocol input queue has an upper bound on the queue's length,
and any packets exceeding that bound are discarded. It is
possible for a host to be overwhelmed by excessive network
traffic (for instance a host acting as a gateway from a high
bandwidth network to a low bandwidth network). As a
``defensive'' mechanism the queue limits may be adjusted to
throttle network traffic load on a host. Consider a host
willing to devote some percentage of its machine to handling
network traffic. If the cost of handling an incoming packet
can be calculated so that an acceptable ``packet handling
rate'' can be determined, then input queue lengths may be
dynamically adjusted based on a host's network load and the
number of packets awaiting processing. Obviously, discard-
ing packets is not a satisfactory solution to a problem such
as this (simply dropping packets is likely to increase the
load on a network); the queue lengths were incorporated
mainly as a safeguard mechanism.
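A bounded input queue of the kind described can be modelled
with a length, a limit and a drop counter. The sketch below
tracks counts only and holds no packets; the structure and
function names are hypothetical, and adjusting the limit at
run time stands in for the dynamic throttling discussed
above.

```c
/* Counters for a length-limited protocol input queue. */
struct sk_ifqueue {
    int ifq_len;     /* packets currently queued */
    int ifq_maxlen;  /* upper bound on queue length */
    int ifq_drops;   /* packets discarded because the queue was full */
};

/*
 * Account for an arriving packet. Returns 1 if it may be
 * queued, or 0 if it must be discarded because the bound
 * has been reached.
 */
int
sk_enqueue(struct sk_ifqueue *q)
{
    if (q->ifq_len >= q->ifq_maxlen) {
        q->ifq_drops++;
        return 0;
    }
    q->ifq_len++;
    return 1;
}

/* Account for a packet removed for processing. */
void
sk_dequeue(struct sk_ifqueue *q)
{
    if (q->ifq_len > 0)
        q->ifq_len--;
}

/* Throttle or relax the queue by changing its bound. */
void
sk_set_limit(struct sk_ifqueue *q, int maxlen)
{
    q->ifq_maxlen = maxlen;
}
```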
12.4. Packet forwarding
When packets cannot be forwarded because of memory
limitations, the system attempts to generate a ``source
quench'' message. In addition, any other problems
encountered during packet forwarding are also reflected back
to the sender in the form of ICMP packets. This helps hosts
avoid unneeded retransmissions.
Broadcast packets are never forwarded due to possible
dire consequences. In an early stage of network develop-
ment, broadcast packets were forwarded and a ``routing
loop'' resulted in network saturation and every host on the
network crashing.
13. Out of band data
Out of band data is a facility peculiar to the stream
socket abstraction defined. Little agreement appears to
exist as to what its semantics should be. TCP defines the
notion of ``urgent data'' as in-line, while the NBS proto-
cols [Burruss81] and numerous others provide a fully
independent logical transmission channel along which out of
band data is to be sent. In addition, the amount of the data
which may be sent as an out of band message varies from pro-
tocol to protocol; everything from 1 bit to 16 bytes or
more.
A stream socket's notion of out of band data has been
defined as the lowest reasonable common denominator (at
least reasonable in our minds); clearly this is subject to
debate. Out of band data is expected to be transmitted out
of the normal sequencing and flow control constraints of the
data stream. A minimum of 1 byte of out of band data and
one outstanding out of band message are expected to be sup-
ported by the protocol supporting a stream socket. It is a
protocol's prerogative to support larger-sized messages, or
more than one outstanding out of band message at a time.
Out of band data is maintained by the protocol and is
usually not stored in the socket's receive queue. A socket-
level option, SO_OOBINLINE, is provided to force out-of-band
data to be placed in the normal receive queue when urgent
data is received; this sometimes ameliorates problems due
to loss of data when multiple out-of-band segments are
received before the first has been passed to the user. The
PRU_SENDOOB and PRU_RCVOOB requests to the pr_usrreq routine
are used in sending and receiving data.
14. Trailer protocols
Core to core copies can be expensive. Consequently, a
great deal of effort was spent in minimizing such opera-
tions. The VAX architecture provides virtual memory
hardware organized in page units. To cut down on copy
operations, data is kept in page-sized units on page-aligned
boundaries whenever possible. This allows data to be moved
in memory simply by remapping the page instead of copying.
The mbuf and network interface routines perform page table
manipulations where needed, hiding the complexities of the
VAX virtual memory hardware from higher level code.
Data enters the system in two ways: from the user, or
from the network (hardware interface). When data is copied
from the user's address space into the system it is depo-
sited in pages (if sufficient data is present). This
encourages the user to transmit information in messages
which are a multiple of the system page size.
Unfortunately, performing a similar operation when tak-
ing data from the network is very difficult. Consider the
format of an incoming packet. A packet usually contains a
local network header followed by one or more headers used by
the high level protocols. Finally, the data, if any, follows
these headers. Since the header information may be variable
length, DMA'ing the eventual data for the user into a page
aligned area of memory is impossible without a priori
knowledge of the format (e.g., by supporting only a single
protocol header format).
To allow variable length header information to be
present and still ensure page alignment of data, a special
local network encapsulation may be used. This encapsulation,
termed a trailer protocol [Leffler84], places the variable
length header information after the data. A fixed size
local network header is then prepended to the resultant
packet. The local network header contains the size of the
data portion (in units of 512 bytes), and a new trailer pro-
tocol header, inserted before the variable length informa-
tion, contains the size of the variable length header infor-
mation. The following trailer protocol header is used to
store information regarding the variable length protocol
header:
struct {
short protocol; /* original protocol no. */
short length; /* length of trailer */
};
The processing of the trailer protocol is very simple.
On output, the local network header indicates that a trailer
encapsulation is being used. The header also includes an
indication of the number of data pages present before the
trailer protocol header. The trailer protocol header is
initialized to contain the actual protocol identifier and
the variable length header size, and is appended to the data
along with the variable length header information.
On input, the interface routines identify the trailer
encapsulation by the protocol type stored in the local net-
work header, then calculate the number of pages of data to
find the beginning of the trailer. The trailing information
is copied into a separate mbuf and linked to the front of
the resultant packet.
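Locating the trailer on input amounts to multiplying the
block count from the local network header by 512 and reading
the two-field trailer protocol header found there. The sketch
below assumes the buffer begins immediately after the local
network header and that both fields are stored most
significant byte first; those assumptions, and all names
used, are illustrative rather than taken from the interface
drivers.

```c
#include <stddef.h>

#define SK_TRAILER_UNIT 512 /* data size unit used by the encapsulation */

/* Decoded trailer protocol header. */
struct sk_trailer {
    unsigned short protocol;    /* original protocol number */
    unsigned short length;      /* length of trailer */
};

/* Byte offset of the trailer protocol header within the packet body. */
size_t
sk_trailer_offset(unsigned nblocks)
{
    return (size_t)nblocks * SK_TRAILER_UNIT;
}

/*
 * Decode the trailer protocol header that follows nblocks
 * blocks of data. Returns 0 on success, or -1 if the packet
 * is too short to contain the header.
 */
int
sk_parse_trailer(const unsigned char *pkt, size_t pktlen,
    unsigned nblocks, struct sk_trailer *out)
{
    size_t off = sk_trailer_offset(nblocks);

    if (pktlen < off + 4)
        return -1;
    out->protocol = (unsigned short)((pkt[off] << 8) | pkt[off + 1]);
    out->length = (unsigned short)((pkt[off + 2] << 8) | pkt[off + 3]);
    return 0;
}
```

Because the data occupies a whole number of blocks starting
at a known offset, it can be received directly into
page-aligned memory, and only the short trailer needs to be
copied.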
Clearly, trailer protocols require cooperation between
source and destination. In addition, they are normally cost
effective only when sizable packets are used. The current
scheme works because the local network encapsulation header
is a fixed size, allowing DMA operations to be performed at
a known offset from the first data page being received.
Should the local network header be variable length this
scheme fails.
Statistics collected indicate that as much as 200Kb/s
can be gained by using a trailer protocol with 1Kbyte pack-
ets. The average size of the variable length header was 40
bytes (the size of a minimal TCP/IP packet header). If
hardware supports larger sized packets, even greater gains
may be realized.
Acknowledgements
The internal structure of the system is patterned after
the Xerox PUP architecture [Boggs79], while in certain
places the Internet protocol family has had a great deal of
influence in the design. The use of software interrupts for
process invocation is based on similar facilities found in
the VMS operating system. Many of the ideas related to pro-
tocol modularity, memory management, and network interfaces
are based on Rob Gurwitz's TCP/IP implementation for the
4.1BSD version of UNIX on the VAX [Gurwitz81]. Greg Chesson
explained his use of trailer encapsulations in Datakit,
instigating their use in our system.
References
[Boggs79] Boggs, D. R., J. F. Shoch, E. A. Taft,
and R. M. Metcalfe; PUP: An Internetwork
Architecture. Report CSL-79-10. XEROX
Palo Alto Research Center, July 1979.
[BBN78] Bolt Beranek and Newman; Specification
for the Interconnection of Host and IMP.
BBN Technical Report 1822. May 1978.
[Cerf78] Cerf, V. G.; The Catenet Model for
Internetworking. Internet Working Group,
IEN 48. July 1978.
[Clark82] Clark, D. D.; Window and Acknowledge-
ment Strategy in TCP, RFC-813. Network
Information Center, SRI International.
July 1982.
[DEC80] Digital Equipment Corporation; DECnet
DIGITAL Network Architecture - General
Description. Order No. AA-K179A-TK.
October 1980.
[Gurwitz81] Gurwitz, R. F.; VAX-UNIX Networking
Support Project - Implementation
Description. Internetwork Working
Group, IEN 168. January 1981.
[ISO81] International Organization for Standard-
ization. ISO Open Systems Interconnec-
tion - Basic Reference Model. ISO/TC
97/SC 16 N 719. August 1981.
[Joy86] Joy, W.; Fabry, R.; Leffler, S.;
McKusick, M.; and Karels, M.; Berkeley
Software Architecture Manual, 4.4BSD
Edition. UNIX Programmer's Supplementary
Documents, Vol. 1 (PSD:5). Computer Sys-
tems Research Group, University of Cali-
fornia, Berkeley. May, 1986.
[Leffler84] Leffler, S.J. and Karels, M.J.; Trailer
Encapsulations, RFC-893. Network Infor-
mation Center, SRI International. April
1984.
[Postel80] Postel, J. User Datagram Protocol,
RFC-768. Network Information Center, SRI
International. May 1980.
[Postel81a] Postel, J., ed. Internet Protocol,
RFC-791. Network Information Center, SRI
International. September 1981.
[Postel81b] Postel, J., ed. Transmission Control
Protocol, RFC-793. Network Information
Center, SRI International. September
1981.
[Postel81c] Postel, J. Internet Control Message
Protocol, RFC-792. Network Information
Center, SRI International. September
1981.
[Xerox81] Xerox Corporation. Internet Transport
Protocols. Xerox System Integration
Standard 028112. December 1981.
[Zimmermann80] Zimmermann, H. OSI Reference Model -
The ISO Model of Architecture for Open
Systems Interconnection. IEEE Transac-
tions on Communications. Com-28(4);
425-432. April 1980.