Some Applications of Inverted Indexes on the UNIX System USD:30-1
Some Applications of Inverted Indexes on the UNIX
System
M. E. Lesk
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
1. Introduction.
The UNIX- system has many utilities (e.g. grep, awk, lex,
egrep, fgrep, ...) to search through files of text, but most of
them are based on a linear scan through the entire file, using
some deterministic automaton. This memorandum discusses a program
which uses inverted indexes[1] and can thus be used on much
larger data bases.
As with any indexing system, of course, there are some
disadvantages; once an index is made, the files that have been
indexed can not be changed without remaking the index. Thus
applications are restricted to those making many searches of
relatively stable data. Furthermore, these programs depend on
hashing, and can only search for exact matches of whole keywords.
It is not possible to look for arithmetic or logical expressions
(e.g. ``date greater than 1970'') or for regular expression
searching such as that in lex.[2]
Currently there are two uses of this software, the refer
preprocessor to format references, and the lookall command to
search through all text files on the UNIX system.=
The remaining sections of this memorandum discuss the
searching programs and their uses. Section 2 explains the opera-
tion of the searching algorithm and describes the data collected
_________________________
- UNIX is a registered trademark of AT&T Bell Labora-
tories in the USA and other countries.
[1] D. Knuth, The Art of Computer Programming: Vol.
3, Sorting and Searching, Addison-Wesley, Reading,
Mass., 1977. See section 6.5.
[2] M. E. Lesk, "Lex - A Lexical Analyzer Generator,"
Comp. Sci. Tech. Rep. No. 39, Bell Laboratories, Murray
Hill, New Jersey, October 1975.
= lookall is not part of the Berkeley UNIX distribu-
tion.
USD:30-2 Some Applications of Inverted Indexes on the UNIX System
for use with the lookall command. The more important application,
refer has a user's description in section 3. Section 4 goes into
more detail on reference files for the benefit of those who wish
to add references to data bases or write new troff macros for use
with refer. The options to make refer collect identical cita-
tions, or otherwise relocate and adjust references, are described
in section 5.
2. Searching.
The indexing and searching process is divided into two
phases, each made of two parts. These are shown below.
A. Construct the index.
(1) Find keys -- turn the input files into a sequence of
tags and keys, where each tag identifies a distinct
item in the input and the keys for each such item are
the strings under which it is to be indexed.
(2) Hash and sort -- prepare a set of inverted indexes from
which, given a set of keys, the appropriate item tags
can be found quickly.
B. Retrieve an item in response to a query.
(3) Search -- Given some keys, look through the files
prepared by the hashing and sorting facility and derive
the appropriate tags.
(4) Deliver -- Given the tags, find the original items.
This completes the searching process.
The first phase, making the index, is presumably done relatively
infrequently. It should, of course, be done whenever the data
being indexed change. In contrast, the second phase, retrieving
items, is presumably done often, and must be rapid.
An effort is made to separate code which depends on the data
being handled from code which depends on the searching procedure.
The search algorithm is involved only in programs (2) and (3),
while knowledge of the actual data files is needed only by pro-
grams (1) and (4). Thus it is easy to adapt to different data
files or different search algorithms.
To start with, it is necessary to have some way of selecting
or generating keys from input files. For dealing with files that
are basically English, we have a key-making program which
automatically selects words and passes them to the hashing and
sorting program (step 2). The format used has one line for each
input item, arranged as follows:
name:start,length (tab) key1 key2 key3 ...
Some Applications of Inverted Indexes on the UNIX System USD:30-3
where name is the file name, start is the starting byte number,
and length is the number of bytes in the entry.
These lines are the only input used to make the index. The
first field (the file name, byte position, and byte count) is the
tag of the item and can be used to retrieve it quickly. Normally,
an item is either a whole file or a section of a file delimited
by blank lines. After the tab, the second field contains the
keys. The keys, if selected by the automatic program, are any
alphanumeric strings which are not among the 100 most frequent
words in English and which are not entirely numeric (except for
four-digit numbers beginning 19, which are accepted as dates).
Keys are truncated to six characters and converted to lower case.
Some selection is needed if the original items are very large. We
normally just take the first n keys, with n less than 100 or so;
this replaces any attempt at intelligent selection. One file in
our system is a complete English dictionary; it would presumably
be retrieved for all queries.
To generate an inverted index to the list of record tags and
keys, the keys are hashed and sorted to produce an index. What is
wanted, ideally, is a series of lists showing the tags associated
with each key. To condense this, what is actually produced is a
list showing the tags associated with each hash code, and thus
with some set of keys. To speed up access and further save space,
a set of three or possibly four files is produced. These files
are:
File Contents
entry Pointers to posting file
for each hash code
posting Lists of tag pointers for
each hash code
tag Tags for each item
key Keys for each item
(optional)
The posting file comprises the real data: it contains a sequence
of lists of items posted under each hash code. To speed up
searching, the entry file is an array of pointers into the post-
ing file, one per potential hash code. Furthermore, the items in
the lists in the posting file are not referred to by their com-
plete tag, but just by an address in the tag file, which gives
the complete tags. The key file is optional and contains a copy
of the keys used in the indexing.
The searching process starts with a query, containing
several keys. The goal is to obtain all items which were indexed
under these keys. The query keys are hashed, and the pointers in
the entry file used to access the lists in the posting file.
These lists are addresses in the tag file of documents posted
under the hash codes derived from the query. The common items
from all lists are determined; this must include the items
indexed by every key, but may also contain some items which are
USD:30-4 Some Applications of Inverted Indexes on the UNIX System
false drops, since items referenced by the correct hash codes
need not actually have contained the correct keys. Normally, if
there are several keys in the query, there are not likely to be
many false drops in the final combined list even though each hash
code is somewhat ambiguous. The actual tags are then obtained
from the tag file, and to guard against the possibility that an
item has false-dropped on some hash code in the query, the origi-
nal items are normally obtained from the delivery program (4) and
the query keys checked against them by string comparison.
Usually, therefore, the check for bad drops is made against
the original file. However, if the key derivation procedure is
complex, it may be preferable to check against the keys fed to
program (2). In this case the optional key file which contains
the keys associated with each item is generated, and the item tag
is supplemented by a string
;start,length
which indicates the starting byte number in the key file and the
length of the string of keys for each item. This file is not usu-
ally necessary with the present key-selection program, since the
keys always appear in the original document.
There is also an option (-Cn) for coordination level search-
ing. This retrieves items which match all but n of the query
keys. The items are retrieved in the order of the number of keys
that they match. Of course, n must be less than the number of
query keys (nothing is retrieved unless it matches at least one
key).
As an example, consider one set of 4377 references, compris-
ing 660,000 bytes. This included 51,000 keys, of which 5,900 were
distinct keys. The hash table is kept full to save space (at the
expense of time); 995 of 997 possible hash codes were used. The
total set of index files (no key file) included 171,000 bytes,
about 26% of the original file size. It took 8 minutes of proces-
sor time to hash, sort, and write the index. To search for a sin-
gle query with the resulting index took 1.9 seconds of processor
time, while to find the same paper with a sequential linear
search using grep (reading all of the tags and keys) took 12.3
seconds of processor time.
We have also used this software to index all of the English
stored on our UNIX system. This is the index searched by the
lookall command. On a typical day there were 29,000 files in our
user file system, containing about 152,000,000 bytes. Of these
5,300 files, containing 32,000,000 bytes (about 21%) were English
text. The total number of `words' (determined mechanically) was
5,100,000. Of these 227,000 were selected as keys; 19,000 were
distinct, hashing to 4,900 (of 5,000 possible) different hash
codes. The resulting inverted file indexes used 845,000 bytes, or
about 2.6% of the size of the original files. The particularly
small indexes are caused by the fact that keys are taken from
Some Applications of Inverted Indexes on the UNIX System USD:30-5
only the first 50 non-common words of some very long input files.
Even this large lookall index can be searched quickly. For
example, to find this document by looking for the keys ``lesk
inverted indexes'' required 1.7 seconds of processor time and
system time. By comparison, just to search the 800,000 byte dic-
tionary (smaller than even the inverted indexes, let alone the
27,000,000 bytes of text files) with grep takes 29 seconds of
processor time. The lookall program is thus useful when looking
for a document which you believe is stored on-line, but do not
know where. For example, many memos from our center are in the
file system, but it is often difficult to guess where a particu-
lar memo might be (it might have several authors, each with many
directories, and have been worked on by a secretary with yet more
directories). Instructions for the use of the lookall command are
given in the manual section, shown in the appendix to this
memorandum.
The only indexes maintained routinely are those of publica-
tion lists and all English files. To make other indexes, the pro-
grams for making keys, sorting them, searching the indexes, and
delivering answers must be used. Since they are usually invoked
as parts of higher-level commands, they are not in the default
command directory, but are available to any user in the directory
/usr/lib/refer. Three programs are of interest: mkey, which iso-
lates keys from input files; inv, which makes an index from a set
of keys; and hunt, which searches the index and delivers the
items. Note that the two parts of the retrieval phase are com-
bined into one program, to avoid the excessive system work and
delay which would result from running these as separate
processes.
These three commands have a large number of options to adapt
to different kinds of input. The user not interested in the
detailed description that now follows may skip to section 3,
which describes the refer program, a packaged-up version of these
tools specifically oriented towards formatting references.
Make Keys. The program mkey is the key-making program
corresponding to step (1) in phase A. Normally, it reads its
input from the file names given as arguments, and if there are no
arguments it reads from the standard input. It assumes that blank
lines in the input delimit separate items, for each of which a
different line of keys should be generated. The lines of keys are
written on the standard output. Keys are any alphanumeric string
in the input not among the most frequent words in English and not
entirely numeric (except that all-numeric strings are acceptable
if they are between 1900 and 1999). In the output, keys are
translated to lower case, and truncated to six characters in
length; any associated punctuation is removed. The following flag
arguments are recognized by mkey:
USD:30-6 Some Applications of Inverted Indexes on the UNIX System
-c name Name of file of common words; default is
/usr/lib/eign.
-f name Read a list of files from name and take
each as an input argument.
-i chars Ignore all lines which begin with `%'
followed by any character in chars.
-kn Use at most n keys per input item.
-ln Ignore items shorter than n letters
long.
-nm Ignore as a key any word in the first m
words of the list of common English
words. The default is 100.
-s Remove the labels (file:start,length)
from the output; just give the keys.
Used when searching rather than index-
ing.
-w Each whole file is a separate item;
blank lines in files are irrelevant.
The normal arguments for indexing references are the
defaults, which are -c /usr/lib/eign, -n100, and -l3. For search-
ing, the -s option is also needed. When the big lookall index of
all English files is run, the options are -w, -k50, and -f
(filelist). When running on textual input, the mkey program
processes about 1000 English words per processor second. Unless
the -k option is used (and the input files are long enough for it
to take effect) the output of mkey is comparable in size to its
input.
Hash and invert. The inv program computes the hash codes and
writes the inverted files. It reads the output of mkey and writes
the set of files described earlier in this section. It expects
one argument, which is used as the base name for the three (or
four) files to be written. Assuming an argument of Index (the
default) the entry file is named Index.ia, the posting file
Index.ib, the tag file Index.ic, and the key file (if present)
Index.id. The inv program recognizes the following options:
-a Append the new keys to a previous set of
inverted files, making new files if
there is no old set using the same base
name.
-d Write the optional key file. This is
needed when you can not check for false
drops by looking for the keys in the
original inputs, i.e. when the key
derivation procedure is complicated and
the output keys are not words from the
input files.
-hn The hash table size is n (default 997);
n should be prime. Making n bigger saves
search time and spends disk space.
Some Applications of Inverted Indexes on the UNIX System USD:30-7
-i[u] name Take input from file name, instead of
the standard input; if u is present name
is unlinked when the sort is started.
Using this option permits the sort
scratch space to overlap the disk space
used for input keys.
-n Make a completely new set of inverted
files, ignoring previous files.
-p Pipe into the sort program, rather than
writing a temporary input file. This
saves disk space and spends processor
time.
-v Verbose mode; print a summary of the
number of keys which finished indexing.
About half the time used in inv is in the contained sort.
Assuming the sort is roughly linear, however, a guess at the
total timing for inv is 250 keys per second. The space used is
usually of more importance: the entry file uses four bytes per
possible hash (note the -h option), and the tag file around 15-20
bytes per item indexed. Roughly, the posting file contains one
item for each key instance and one item for each possible hash
code; the items are two bytes long if the tag file is less than
65336 bytes long, and the items are four bytes wide if the tag
file is greater than 65536 bytes long. Note that to minimize
storage, the hash tables should be over-full; for most of the
files indexed in this way, there is no other real choice, since
the entry file must fit in memory.
Searching and Retrieving. The hunt program retrieves items
from an index. It combines, as mentioned above, the two parts of
phase (B): search and delivery. The reason why it is efficient to
combine delivery and search is partly to avoid starting unneces-
sary processes, and partly because the delivery operation must be
a part of the search operation in any case. Because of the hash-
ing, the search part takes place in two stages: first items are
retrieved which have the right hash codes associated with them,
and then the actual items are inspected to determine false drops,
i.e. to determine if anything with the right hash codes doesn't
really have the right keys. Since the original item is retrieved
to check on false drops, it is efficient to present it immedi-
ately, rather than only giving the tag as output and later
retrieving the item again. If there were a separate key file,
this argument would not apply, but separate key files are not
common.
Input to hunt is taken from the standard input, one query
per line. Each query should be in mkey -s output format; all
lower case, no punctuation. The hunt program takes one argument
which specifies the base name of the index files to be searched.
Only one set of index files can be searched at a time, although
many text files may be indexed as a group, of course. If one of
the text files has been changed since the index, that file is
USD:30-8 Some Applications of Inverted Indexes on the UNIX System
searched with fgrep; this may occasionally slow down the search-
ing, and care should be taken to avoid having many out of date
files. The following option arguments are recognized by hunt:
-a Give all output; ignore checking for
false drops.
-Cn Coordination level n; retrieve items
with not more than n terms of the input
missing; default C0, implying that each
search term must be in the output items.
-F[ynd] ``-Fy'' gives the text of all the items
found; ``-Fn'' suppresses them. ``-Fd''
where d is an integer gives the text of
the first d items. The default is -Fy.
-g Do not use fgrep to search files changed
since the index was made; print an error
comment instead.
-i string Take string as input, instead of reading
the standard input.
-l n The maximum length of internal lists of
candidate items is n; default 1000.
-o string Put text output (``-Fy'') in string; of
use only when invoked from another pro-
gram.
-p Print hash code frequencies; mostly for
use in optimizing hash table sizes.
-T[ynd] ``-Ty'' gives the tags of the items
found; ``-Tn'' suppresses them. ``-Td''
where d is an integer gives the first d
tags. The default is -Tn.
-t string Put tag output (``-Ty'') in string; of
use only when invoked from another pro-
gram.
The timing of hunt is complex. Normally the hash table is
overfull, so that there will be many false drops on any single
term; but a multi-term query will have few false drops on all
terms. Thus if a query is underspecified (one search term) many
potential items will be examined and discarded as false drops,
wasting time. If the query is overspecified (a dozen search
terms) many keys will be examined only to verify that the single
item under consideration has that key posted. The variation of
search time with number of keys is shown in the table below.
Queries of varying length were constructed to retrieve a particu-
lar document from the file of references. In the sequence to the
left, search terms were chosen so as to select the desired paper
as quickly as possible. In the sequence on the right, terms were
chosen inefficiently, so that the query did not uniquely select
the desired document until four keys had been used. The same
document was the target in each case, and the final set of eight
keys are also identical; the differences at five, six and seven
keys are produced by measurement error, not by the slightly dif-
ferent key lists.
Some Applications of Inverted Indexes on the UNIX System USD:30-9
Efficient Keys | Inefficient Keys
No. keys Total drops Retrieved Search time | No. keys Total drops Retrieved Search time
(incl. false) Documents (seconds) | (incl. false) Documents (seconds)
1 15 3 1.27 | 1 68 55 5.96
2 1 1 0.11 | 2 29 29 2.72
3 1 1 0.14 | 3 8 8 0.95
4 1 1 0.17 | 4 1 1 0.18
5 1 1 0.19 | 5 1 1 0.21
6 1 1 0.23 | 6 1 1 0.22
7 1 1 0.27 | 7 1 1 0.26
8 1 1 0.29 | 8 1 1 0.29
As would be expected, the optimal search is achieved when the
query just specifies the answer; however, overspecification is
quite cheap. Roughly, the time required by hunt can be approxi-
mated as 30 milliseconds per search key plus 75 milliseconds per
dropped document (whether it is a false drop or a real answer).
In general, overspecification can be recommended; it protects the
user against additions to the data base which turn previously
uniquely-answered queries into ambiguous queries.
The careful reader will have noted an enormous discrepancy
between these times and the earlier quoted time of around 1.9
seconds for a search. The times here are purely for the search
and retrieval: they are measured by running many searches through
a single invocation of the hunt program alone. The normal
retrieval operation involves using the shell to set up a pipeline
through mkey to hunt and starting both processes; this adds a
fixed overhead of about 1.7 seconds of processor time to any sin-
gle search. Furthermore, remember that all these times are pro-
cessor times: on a typical morning on our PDP 11/70 system, with
about one dozen people logged on, to obtain 1 second of processor
time for the search program took between 2 and 12 seconds of real
time, with a median of 3.9 seconds and a mean of 4.8 seconds.
Thus, although the work involved in a single search may be only
200 milliseconds, after you add the 1.7 seconds of startup pro-
cessor time and then assume a 4:1 elapsed/processor time ratio,
it will be 8 seconds before any response is printed.
3. Selecting and Formatting References for TROFF
The major application of the retrieval software is refer,
which is a troff preprocessor like eqn.[3] It scans its input
looking for items of the form
.[
imprecise citation
.]
_________________________
[3] B. W. Kernighan and L. L. Cherry, "A System for
Typesetting Mathematics," Comm. Assoc. Comp. Mach.,
vol. 18, pp. 151-157, Bell Laboratories, Murray Hill,
New Jersey, March 1975.
USD:30-10Some Applications of Inverted Indexes on the UNIX System
where an imprecise citation is merely a string of words found in
the relevant bibliographic citation. This is translated into a
properly formatted reference. If the imprecise citation does not
correctly identify a single paper (either selecting no papers or
too many) a message is given. The data base of citations searched
may be tailored to each system, and individual users may specify
their own citation files. On our system, the default data base is
accumulated from the publication lists of the members of our
organization, plus about half a dozen personal bibliographies
that were collected. The present total is about 4300 citations,
but this increases steadily. Even now, the data base covers a
large fraction of local citations.
For example, the reference for the eqn paper above was
specified as
...
preprocessor like
.I eqn.
.[
kernighan cherry acm 1975
.]
It scans its input looking for items
...
This paper was itself printed using refer. The above input text
was processed by refer as well as tbl and troff by the command
refer memo-file | tbl | troff -ms
and the reference was automatically translated into a correct
citation to the ACM paper on mathematical typesetting.
The procedure to use to place a reference in a paper using
refer is as follows. First, use the lookbib command to check that
the paper is in the data base and to find out what keys are
necessary to retrieve it. This is done by typing lookbib and then
typing some potential queries until a suitable query is found.
For example, had one started to find the eqn paper shown above by
presenting the query
$ lookbib
kernighan cherry
(EOT)
lookbib would have found several items; experimentation would
quickly have shown that the query given above is adequate. Over-
specifying the query is of course harmless. A particularly care-
ful reader may have noticed that ``acm'' does not appear in the
printed citation; we have supplemented some of the data base
items with common extra keywords, such as common abbreviations
for journals or other sources, to aid in searching.
If the reference is in the data base, the query that
Some Applications of Inverted Indexes on the UNIX SystemUSD:30-11
retrieved it can be inserted in the text, between .[ and .]
brackets. If it is not in the data base, it can be typed into a
private file of references, using the format discussed in the
next section, and then the -p option used to search this private
file. Such a command might read (if the private references are
called myfile)
refer -p myfile document | tbl | eqn | troff -ms . . .
where tbl and/or eqn could be omitted if not needed. The use of
the -ms macros[4] or some other macro package, however, is essen-
tial. Refer only generates the data for the references; exact
formatting is done by some macro package, and if none is supplied
the references will not be printed.
By default, the references are numbered sequentially, and
the -ms macros format references as footnotes at the bottom of
the page. This memorandum is an example of that style. Other pos-
sibilities are discussed in section 5 below.
4. Reference Files.
A reference file is a set of bibliographic references usable
with refer. It can be indexed using the software described in
section 2 for fast searching. What refer does is to read the
input document stream, looking for imprecise citation references.
It then searches through reference files to find the full cita-
tions, and inserts them into the document. The format of the full
citation is arranged to make it convenient for a macro package,
such as the -ms macros, to format the reference for printing.
Since the format of the final reference is determined by the
desired style of output, which is determined by the macros used,
refer avoids forcing any kind of reference appearance. All it
does is define a set of string registers which contain the basic
information about the reference; and provide a macro call which
is expanded by the macro package to format the reference. It is
the responsibility of the final macro package to see that the
reference is actually printed; if no macros are used, and the
output of refer fed untranslated to troff, nothing at all will be
printed.
The strings defined by refer are taken directly from the
files of references, which are in the following format. The
references should be separated by blank lines. Each reference is
a sequence of lines beginning with % and followed by a key-
letter. The remainder of that line, and successive lines until
the next line beginning with %, contain the information specified
by the key-letter. In general, refer does not interpret the
information, but merely presents it to the macro package for
final formatting. A user with a separate macro package, for
_________________________
[4] M. E. Lesk, Typing Documents on UNIX and GCOS:
The -ms Macros for Troff, 1977.
USD:30-12Some Applications of Inverted Indexes on the UNIX System
example, can add new key-letters or use the existing ones for
other purposes without bothering refer.
The meaning of the key-letters given below, in particular,
is that assigned by the -ms macros. Not all information, obvi-
ously, is used with each citation. For example, if a document is
both an internal memorandum and a journal article, the macros
ignore the memorandum version and cite only the journal article.
Some kinds of information are not used at all in printing the
reference; if a user does not like finding references by specify-
ing title or author keywords, and prefers to add specific key-
words to the citation, a field is available which is searched but
not printed (K).
The key letters currently recognized by refer and -ms, with
the kind of information implied, are:
Key Information specified Key Information specified
A Author's name N Issue number
B Title of book containing item O Other information
C City of publication P Page(s) of article
D Date R Technical report reference
E Editor of book containing item T Title
G Government (NTIS) ordering number V Volume number
I Issuer (publisher)
J Journal name
K Keys (for searching) X or
L Label Y or
M Memorandum label Z Information not used by refer
For example, a sample reference could be typed as:
%T Bounds on the Complexity of the Maximal
Common Subsequence Problem
%Z ctr127
%A A. V. Aho
%A D. S. Hirschberg
%A J. D. Ullman
%J J. ACM
%V 23
%N 1
%P 1-12
%M abcd-78
%D Jan. 1976
Order is irrelevant, except that authors are shown in the order
given. The output of refer is a stream of string definitions,
one for each of the fields of each reference, as shown below.
Some Applications of Inverted Indexes on the UNIX SystemUSD:30-13
.]-
.ds [A authors' names ...
.ds [T title ...
.ds [J journal ...
...
.][ type-number
The special macro .]- precedes the string definitions and the
special macro .][ follows. These are changed from the input .[
and .] so that running the same file through refer again is harm-
less. The .]- macro can be used by the macro package to initial-
ize. The .][ macro, which should be used to print the reference,
is given an argument type-number to indicate the kind of refer-
ence, as follows:
Value Kind of reference
1 Journal article
2 Book
3 Article within book
4 Technical report
5 Bell Labs technical memorandum
0 Other
The reference is flagged in the text with the sequence
\*([.number\*(.]
where number is the footnote number. The strings [. and .] should
be used by the macro package to format the reference flag in the
text. These strings can be replaced for a particular footnote, as
described in section 5. The footnote number (or other signal) is
available to the reference macro .][ as the string register [F.
In some cases users wish to suspend the searching, and
merely use the reference macro formatting. That is, the user
doesn't want to provide a search key between .[ and .] brackets,
but merely the reference lines for the appropriate document.
Alternatively, the user can wish to add a few fields to those in
the reference as in the standard file, or override some fields.
Altering or replacing fields, or supplying whole references, is
easily done by inserting lines beginning with %; any such line is
taken as direct input to the reference processor rather than keys
to be searched. Thus
.[
key1 key2 key3 ...
%Q New format item
%R Override report name
.]
makes the indicated changes to the result of searching for the
keys. All of the search keys must be given before the first %
line.
USD:30-14Some Applications of Inverted Indexes on the UNIX System
If no search keys are provided, an entire citation can be
provided in-line in the text. For example, if the eqn paper cita-
tion were to be inserted in this way, rather than by searching
for it in the data base, the input would read
...
preprocessor like
.I eqn.
.[
%A B. W. Kernighan
%A L. L. Cherry
%T A System for Typesetting Mathematics
%J Comm. ACM
%V 18
%N 3
%P 151-157
%D March 1975
.]
It scans its input looking for items
...
This would produce a citation of the same appearance as that
resulting from the file search.
As shown, fields are normally turned into troff strings.
Sometimes users would rather have them defined as macros, so that
other troff commands can be placed into the data. When this is
necessary, simply double the control character % in the data.
Thus the input
.[
%V 23
%%M
Bell Laboratories,
Murray Hill, N.J. 07974
.]
is processed by refer into
.ds [V 23
.de [M
Bell Laboratories,
Murray Hill, N.J. 07974
..
The information after %%M is defined as a macro to be invoked by
.[M while the information after %V is turned into a string to be
invoked by \*([V. At present -ms expects all information as
strings.
5. Collecting References and other Refer Options
Normally, the combination of refer and -ms formats output as
troff footnotes which are consecutively numbered and placed at
Some Applications of Inverted Indexes on the UNIX SystemUSD:30-15
the bottom of the page. However, options exist to place the
references at the end; to arrange references alphabetically by
senior author; and to indicate references by strings in the text
of the form [Name1975a] rather than by number. Whenever refer-
ences are not placed at the bottom of a page identical references
are coalesced.
For example, the -e option to refer specifies that refer-
ences are to be collected; in this case they are output whenever
the sequence
.[
$LIST$
.]
is encountered. Thus, to place references at the end of a paper,
the user would run refer with the -e option and place the above
$LIST$ commands after the last line of the text. Refer will then
move all the references to that point. To aid in formatting the
collected references, refer writes the references preceded by the
line
.]<
and followed by the line
.]>
to invoke special macros before and after the references.
Another possible option to refer is the -s option to specify
sorting of references. The default, of course, is to list refer-
ences in the order presented. The -s option implies the -e
option, and thus requires a
.[
$LIST$
.]
entry to call out the reference list. The -s option may be fol-
lowed by a string of letters, numbers, and `+' signs indicating
how the references are to be sorted. The sort is done using the
fields whose key-letters are in the string as sorting keys; the
numbers indicate how many of the fields are to be considered,
with `+' taken as a large number. Thus the default is -sAD mean-
ing ``Sort on senior author, then date.'' To sort on all authors
and then title, specify -sA+T. And to sort on two authors and
then the journal, write -sA2J.
Other options to refer change the signal or label inserted
in the text for each reference. Normally these are just sequen-
tial numbers, and their exact placement (within brackets, as
superscripts, etc.) is determined by the macro package. The -l
option replaces reference numbers by strings composed of the
USD:30-16Some Applications of Inverted Indexes on the UNIX System
senior author's last name, the date, and a disambiguating letter.
If a number follows the l as in -l3 only that many letters of the
last name are used in the label string. To abbreviate the date as
well the form -lm,n shortens the last name to the first m letters
and the date to the last n digits. For example, the option -l3,2
would refer to the eqn paper (reference 3) by the signal Ker75a,
since it is the first cited reference by Kernighan in 1975.
A user wishing to specify particular labels for a private
bibliography may use the -k option. Specifying -kx causes the
field x to be used as a label. The default is L. If this field
ends in -, that character is replaced by a sequence letter; oth-
erwise the field is used exactly as given.
If none of the refer-produced signals are desired, the -b
option entirely suppresses automatic text signals.
If the user wishes to override the -ms treatment of the
reference signal (which is normally to enclose the number in
brackets in nroff and make it a superscript in troff) this can be
done easily. If the lines .[ or .] contain anything following
these characters, the remainders of these lines are used to sur-
round the reference signal, instead of the default. Thus, for
example, to say ``See reference (2).'' and avoid ``See refer-
ence.2'' the input might appear
See reference
.[ (
imprecise citation ...
.]).
Note that blanks are significant in this construction. If a per-
manent change is desired in the style of reference signals, how-
ever, it is probably easier to redefine the strings [. and .]
(which are used to bracket each signal) than to change each cita-
tion.
Although normally refer limits itself to retrieving the data
for the reference, and leaves to a macro package the job of
arranging that data as required by the local format, there are
two special options for rearrangements that can not be done by
macro packages. The -c option puts fields into all upper case
(CAPS-SMALL CAPS in troff output). The key-letters indicated what
information is to be translated to upper case follow the c, so
that -cAJ means that authors' names and journals are to be in
caps. The -a option writes the names of authors last name first,
that is A. D. Hall, Jr. is written as Hall, A. D. Jr. The cita-
tion form of the Journal of the ACM, for example, would require
both -cA and -a options. This produces authors' names in the
style KERNIGHAN, B. W. AND CHERRY, L. L. for the previous exam-
ple. The -a option may be followed by a number to indicate how
many author names should be reversed; -a1 (without any -c option)
would produce Kernighan, B. W. and L. L. Cherry, for example.
Some Applications of Inverted Indexes on the UNIX SystemUSD:30-17
Finally, there is also the previously-mentioned -p option to
let the user specify a private file of references to be searched
before the public files. Note that refer does not insist on a
previously made index for these files. If a file is named which
contains reference data but is not indexed, it will be searched
(more slowly) by refer using fgrep. In this way it is easy for
users to keep small files of new references, which can later be
added to the public data bases.
Generated on 2013-04-27 00:20:00 by $MirOS: src/scripts/roff2htm,v 1.77 2013/01/01 20:49:09 tg Exp $
These manual pages and other documentation are copyrighted by their respective writers;
their source is available at our CVSweb,
AnonCVS, and other mirrors. The rest is Copyright © 2002‒2013 The MirOS Project, Germany.
This product includes material
provided by Thorsten Glaser.
This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.