Lex - A Lexical Analyzer Generator PS1:16-1
Lex - A Lexical Analyzer Generator
M. E. Lesk and E. Schmidt
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
Lex helps write programs whose control flow is
directed by instances of regular expressions in the
input stream. It is well suited for editor-script type
transformations and for segmenting input in preparation
for a parsing routine.
Lex source is a table of regular expressions and
corresponding program fragments. The table is
translated to a program which reads an input stream,
copying it to an output stream and partitioning the
input into strings which match the given expressions.
As each such string is recognized the corresponding
program fragment is executed. The recognition of the
expressions is performed by a deterministic finite
automaton generated by Lex. The program fragments writ-
ten by the user are executed in the order in which the
corresponding regular expressions occur in the input
stream.
The lexical analysis programs written with Lex
accept ambiguous specifications and choose the longest
match possible at each input point. If necessary, sub-
stantial lookahead is performed on the input, but the
input stream will be backed up to the end of the
current partition, so that the user has general freedom
to manipulate it.
Lex can generate analyzers in either C or Ratfor,
a language which can be translated automatically to
portable Fortran. It is available on the PDP-11 UNIX,
Honeywell GCOS, and IBM OS systems. This manual, how-
ever, will only discuss generating analyzers in C on
the UNIX system, which is the only supported form of
Lex under UNIX Version 7. Lex is designed to simplify
interfacing with Yacc, for those with access to this
compiler-compiler system.
PS1:16-2 Lex - A Lexical Analyzer Generator
1. Introduction. Lex is not a complete
language, but rather a genera-
Lex is a program genera- tor representing a new
tor designed for lexical pro- language feature which can be
cessing of character input added to different programming
streams. It accepts a high- languages, called ``host
level, problem oriented languages.'' Just as general
specification for character purpose languages can produce
string matching, and produces code to run on different com-
a program in a general purpose puter hardware, Lex can write
language which recognizes reg- code in different host
ular expressions. The regular languages. The host language
expressions are specified by is used for the output code
the user in the source specif- generated by Lex and also for
ications given to Lex. The Lex the program fragments added by
written code recognizes these the user. Compatible run-time
expressions in an input stream libraries for the different
and partitions the input host languages are also pro-
stream into strings matching vided. This makes Lex adapt-
the expressions. At the able to different environments
boundaries between strings and different users. Each
program sections provided by application may be directed to
the user are executed. The Lex the combination of hardware
source file associates the and host language appropriate
regular expressions and the to the task, the user's back-
program fragments. As each ground, and the properties of
expression appears in the local implementations. At
input to the program written present, the only supported
by Lex, the corresponding host language is C, although
fragment is executed. Fortran (in the form of Ratfor
[2] has been available in the
The user supplies the past. Lex itself exists on
additional code beyond expres- UNIX, GCOS, and OS/370; but
sion matching needed to com- the code generated by Lex may
plete his tasks, possibly be taken anywhere the
including code written by appropriate compilers exist.
other generators. The program
that recognizes the expres- Lex turns the user's
sions is generated in the gen- expressions and actions
eral purpose programming (called source in this memo)
language employed for the into the host general-purpose
user's program fragments. language; the generated pro-
Thus, a high level expression gram is named yylex. The yylex
language is provided to write program will recognize expres-
the string expressions to be sions in a stream (called
matched while the user's free- input in this memo) and per-
dom to write actions is unim- form the specified actions for
paired. This avoids forcing each expression as it is
the user who wishes to use a detected. See Figure 1.
string manipulation language _______
for input analysis to write Source ->| Lex | -> yylex
processing programs in the _______
same and often inappropriate
string handling language.
Lex - A Lexical Analyzer Generator PS1:16-3
_______ so the program
Input -> | yylex| -> Output generated by
_______ Lex (yylex)
will ignore
An overview of Lex these charac-
Figure 1 ters. Every-
thing else will
For a be copied. To
trivial exam- change any
ple, consider a remaining
program to string of
delete from the blanks or tabs
input all to a single
blanks or tabs blank, add
at the ends of another rule:
lines. %%
%% [ \t]+$ ;
[ \t]+$ ; [ \t]+ printf(" ");
is all that is The finite
required. The automaton gen-
program con- erated for this
tains a %% del- source will
imiter to mark scan for both
the beginning rules at once,
of the rules, observing at
and one rule. the termination
This rule con- of the string
tains a regular of blanks or
expression tabs whether or
which matches not there is a
one or more newline charac-
instances of ter, and exe-
the characters cuting the
blank or tab desired rule
(written \t for action. The
visibility, in first rule
accordance with matches all
the C language strings of
convention) blanks or tabs
just prior to at the end of
the end of a lines, and the
line. The second rule all
brackets indi- remaining
cate the char- strings of
acter class blanks or tabs.
made of blank
and tab; the + Lex can be
indicates ``one used alone for
or more ...''; simple
and the $ indi- transforma-
cates ``end of tions, or for
line,'' as in analysis and
QED. No action statistics
is specified, gathering on a
PS1:16-4 Lex - A Lexical Analyzer Generator
lexical level. programs writ-
Lex can also be ten by Lex.
used with a lexical grammar
parser genera- rules rules
tor to perform v v
the lexical _________ _________
analysis phase; | Lex | | Yacc |
it is particu- _________ _________
larly easy to v v
interface Lex _________ _________
and Yacc [3]. Input ->| yylex | ->| yyparse| -> Parsed input
Lex programs _________ _________
recognize only
regular expres- Lex with Yacc
sions; Yacc Figure 2
writes parsers Yacc users will
that accept a realize that
large class of the name yylex
context free is what Yacc
grammars, but expects its
require a lower lexical
level analyzer analyzer to be
to recognize named, so that
input tokens. the use of this
Thus, a combi- name by Lex
nation of Lex simplifies
and Yacc is interfacing.
often appropri-
ate. When used Lex gen-
as a preproces- erates a deter-
sor for a later ministic finite
parser genera- automaton from
tor, Lex is the regular
used to parti- expressions in
tion the input the source [4].
stream, and the The automaton
parser genera- is interpreted,
tor assigns rather than
structure to compiled, in
the resulting order to save
pieces. The space. The
flow of control result is still
in such a case a fast
(which might be analyzer. In
the first half particular, the
of a compiler, time taken by a
for example) is Lex program to
shown in Figure recognize and
2. Additional partition an
programs, writ- input stream is
ten by other proportional to
generators or the length of
by hand, can be the input. The
added easily to number of Lex
Lex - A Lexical Analyzer Generator PS1:16-5
rules or the action routine.
complexity of
the rules is Lex is not
not important limited to
in determining source which
speed, unless can be inter-
rules which preted on the
include forward basis of one
context require character look-
a significant ahead. For
amount of re- example, if
scanning. What there are two
does increase rules, one
with the number looking for ab
and complexity and another for
of rules is the abcdefg, and
size of the the input
finite automa- stream is
ton, and there- abcdefh, Lex
fore the size will recognize
of the program ab and leave
generated by the input
Lex. pointer just
before cd. . .
In the Such backup is
program written more costly
by Lex, the than the pro-
user's frag- cessing of
ments simpler
(representing languages.
the actions to
be performed as 2. Lex Source.
each regular
expression is The gen-
found) are eral format of
gathered as Lex source is:
cases of a {definitions}
switch. The %%
automaton {rules}
interpreter %%
directs the {user subroutines}
control flow. where the
Opportunity is definitions and
provided for the user sub-
the user to routines are
insert either often omitted.
declarations or The second %%
additional is optional,
statements in but the first
the routine is required to
containing the mark the begin-
actions, or to ning of the
add subroutines rules. The
outside this absolute
PS1:16-6 Lex - A Lexical Analyzer Generator
minimum Lex used to print
program is thus the string. The
%% end of the
(no defini- expression is
tions, no indicated by
rules) which the first blank
translates into or tab charac-
a program which ter. If the
copies the action is
input to the merely a single
output C expression,
unchanged. it can just be
given on the
In the right side of
outline of Lex the line; if it
programs shown is compound, or
above, the takes more than
rules represent a line, it
the user's con- should be
trol decisions; enclosed in
they are a braces. As a
table, in which slightly more
the left column useful example,
contains regu- suppose it is
lar expressions desired to
(see section 3) change a number
and the right of words from
column contains British to
actions, pro- American spel-
gram fragments ling. Lex rules
to be executed such as
when the colour printf("color");
expressions are mechanise printf("mechanize");
recognized. petrol printf("gas");
Thus an indivi- would be a
dual rule might start. These
appear rules are not
integer printf("found keyword INT"); quite enough,
to look for the since the word
string integer petroleum would
in the input become gaseum;
stream and a way of deal-
print the mes- ing with this
sage ``found will be
keyword INT'' described
whenever it later.
appears. In
this example 3. Lex Regular
the host pro- Expressions.
cedural
language is C The defin-
and the C itions of regu-
library func- lar expressions
tion printf is are very simi-
Lex - A Lexical Analyzer Generator PS1:16-7
lar to those in text charac-
QED [5]. A reg- ters. Thus
ular expression xyz"++"
specifies a set matches the
of strings to string xyz++
be matched. It when it
contains text appears. Note
characters that a part of
(which match a string may be
the correspond- quoted. It is
ing characters harmless but
in the strings unnecessary to
being compared) quote an ordi-
and operator nary text char-
characters acter; the
(which specify expression
repetitions, "xyz++"
choices, and is the same as
other the one above.
features). The Thus by quoting
letters of the every non-
alphabet and alphanumeric
the digits are character being
always text used as a text
characters; character, the
thus the regu- user can avoid
lar expression remembering the
integer list above of
matches the current opera-
string integer tor characters,
wherever it and is safe
appears and the should further
expression extensions to
a57D Lex lengthen
looks for the the list.
string a57D.
An opera-
Operators. tor character
The operator may also be
characters are turned into a
" \ [ ] ^ - ? . * + | ( ) $ / { } % < > text character
and if they are by preceding it
to be used as with \ as in
text charac- xyz\+\+
ters, an escape which is
should be used. another, less
The quotation readable,
mark operator equivalent of
(") indicates the above
that whatever expressions.
is contained Another use of
between a pair the quoting
of quotes is to mechanism is to
be taken as get a blank
PS1:16-8 Lex - A Lexical Analyzer Generator
into an expres- cial: these are
sion; normally, \ - and ^. The
as explained - character
above, blanks indicates
or tabs end a ranges. For
rule. Any blank example,
character not [a-z0-9<>_]
contained indicates the
within [] (see character class
below) must be containing all
quoted. Several the lower case
normal C letters, the
escapes with \ digits, the
are recognized: angle brackets,
\n is newline, and underline.
\t is tab, and Ranges may be
\b is back- given in either
space. To enter order. Using -
\ itself, use between any
\\. Since new- pair of charac-
line is illegal ters which are
in an expres- not both upper
sion, \n must case letters,
be used; it is both lower case
not required to letters, or
escape tab and both digits is
backspace. implementation
Every character dependent and
but blank, tab, will get a
newline and the warning mes-
list above is sage. (E.g.,
always a text [0-z] in ASCII
character. is many more
characters than
Character it is in
classes. EBCDIC). If it
Classes of is desired to
characters can include the
be specified character - in
using the a character
operator pair class, it
[]. The con- should be first
struction [abc] or last; thus
matches a sin- [-+0-9]
gle character, matches all the
which may be a, digits and the
b, or c. Within two signs.
square brack-
ets, most In charac-
operator mean- ter classes,
ings are the ^ operator
ignored. Only must appear as
three charac- the first char-
ters are spe- acter after the
Lex - A Lexical Analyzer Generator PS1:16-9
left bracket; expressions.
it indicates The operator ?
that the indicates an
resulting optional ele-
string is to be ment of an
complemented expression.
with respect to Thus
the computer ab?c
character set. matches either
Thus ac or abc.
[^abc]
matches all Repeated
characters expressions.
except a, b, or Repetitions of
c, including classes are
all special or indicated by
control charac- the operators *
ters; or and +.
[^a-zA-Z] a*
is any charac- is any number
ter which is of consecutive
not a letter. a characters,
The \ character including zero;
provides the while
usual escapes a+
within charac- is one or more
ter class instances of a.
brackets. For example,
[a-z]+
Arbitrary is all strings
character. To of lower case
match almost letters. And
any character, [A-Za-z][A-Za-z0-9]*
the operator indicates all
character alphanumeric
. strings with a
is the class of leading alpha-
all characters betic charac-
except newline. ter. This is a
Escaping into typical expres-
octal is possi- sion for recog-
ble although nizing identif-
non-portable: iers in com-
[\40-\176] puter
matches all languages.
printable char-
acters in the Alterna-
ASCII character tion and Group-
set, from octal ing. The opera-
40 (blank) to tor | indicates
octal 176 alternation:
(tilde). (ab|cd)
matches either
Optional ab or cd. Note
PS1:16-10 Lex - A Lexical Analyzer Generator
that classes, since
parentheses are that only
used for group- applies within
ing, although the [] opera-
they are not tors. If the
necessary on very last char-
the outside acter is $, the
level; expression will
ab|cd only be matched
would have suf- at the end of a
ficed. line (when
Parentheses can immediately
be used for followed by
more complex newline). The
expressions: latter operator
(ab|cd+)?(ef)* is a special
matches such case of the /
strings as operator char-
abefef, efefef, acter, which
cdef, or cddd; indicates
but not abc, trailing con-
abcd, or text. The
abcdef. expression
ab/cd
Context matches the
sensitivity. string ab, but
Lex will recog- only if fol-
nize a small lowed by cd.
amount of sur- Thus
rounding con- ab$
text. The two is the same as
simplest opera- ab/\n
tors for this Left context is
are ^ and $. If handled in Lex
the first char- by start condi-
acter of an tions as
expression is explained in
^, the expres- section 10. If
sion will only a rule is only
be matched at to be executed
the beginning when the Lex
of a line automaton
(after a new- interpreter is
line character, in start condi-
or at the tion x, the
beginning of rule should be
the input prefixed by
stream). This <x>
can never con- using the angle
flict with the bracket opera-
other meaning tor characters.
of ^, comple- If we con-
mentation of sidered ``being
character at the begin-
Lex - A Lexical Analyzer Generator PS1:16-11
ning of a 4. Lex Actions.
line'' to be
start condition When an
ONE, then the ^ expression
operator would written as
be equivalent above is
to matched, Lex
<ONE> executes the
Start condi- corresponding
tions are action. This
explained more section
fully later. describes some
features of Lex
Repeti- which aid in
tions and writing
Definitions. actions. Note
The operators that there is a
{} specify default action,
either repeti- which consists
tions (if they of copying the
enclose input to the
numbers) or output. This
definition is performed on
expansion (if all strings not
they enclose a otherwise
name). For matched. Thus
example the Lex user
{digit} who wishes to
looks for a absorb the
predefined entire input,
string named without produc-
digit and ing any output,
inserts it at must provide
that point in rules to match
the expression. everything.
The definitions When Lex is
are given in being used with
the first part Yacc, this is
of the Lex the normal
input, before situation. One
the rules. In may consider
contrast, that actions
a{1,5} are what is
looks for 1 to done instead of
5 occurrences copying the
of a. input to the
output; thus,
Finally, in general, a
initial % is rule which
special, being merely copies
the separator can be omitted.
for Lex source Also, a charac-
segments. ter combination
which is omit-
PS1:16-12 Lex - A Lexical Analyzer Generator
ted from the quotes around
rules and which \n and \t are
appears as not required.
input is likely
to be printed In more
on the output, complex
thus calling actions, the
attention to user will often
the gap in the want to know
rules. the actual text
that matched
One of the some expression
simplest things like [a-z]+.
that can be Lex leaves this
done is to text in an
ignore the external char-
input. Speci- acter array
fying a C null named yytext.
statement, ; as Thus, to print
an action the name found,
causes this a rule like
result. A fre- [a-z]+ printf("%s", yytext);
quent rule is will print the
[ \t\n] ; string in
which causes yytext. The C
the three spac- function printf
ing characters accepts a for-
(blank, tab, mat argument
and newline) to and data to be
be ignored. printed; in
this case, the
Another format is
easy way to ``print
avoid writing string'' (%
actions is the indicating data
action charac- conversion, and
ter |, which s indicating
indicates that string type),
the action for and the data
this rule is are the charac-
the action for ters in yytext.
the next rule. So this just
The previous places the
example could matched string
also have been on the output.
written This action is
" " so common that
"\t" it may be writ-
"\n" ten as ECHO:
with the same [a-z]+ ECHO;
result, is the same as
although in the above.
different Since the
style. The default action
Lex - A Lexical Analyzer Generator PS1:16-13
is just to characters in
print the char- the words
acters found, recognized. The
one might ask last character
why give a in the string
rule, like this matched can be
one, which accessed by
merely speci- yytext[yyleng-1]
fies the
default action? Occasion-
Such rules are ally, a Lex
often required action may
to avoid match- decide that a
ing some other rule has not
rule which is recognized the
not desired. correct span of
For example, if characters. Two
there is a rule routines are
which matches provided to aid
read it will with this
normally match situation.
the instances First, yymore()
of read con- can be called
tained in bread to indicate
or readjust; to that the next
avoid this, a input expres-
rule of the sion recognized
form [a-z]+ is is to be tacked
needed. This is on to the end
explained of this input.
further below. Normally, the
next input
Sometimes string would
it is more con- overwrite the
venient to know current entry
the end of what in yytext.
has been found; Second, yyless
hence Lex also (n) may be
provides a called to indi-
count yyleng of cate that not
the number of all the charac-
characters ters matched by
matched. To the currently
count both the successful
number of words expression are
and the number wanted right
of characters now. The argu-
in words in the ment n indi-
input, the user cates the
might write number of char-
[a-zA-Z]+ {words++; chars += yyleng;} acters in
which accumu- yytext to be
lates in chars retained.
the number of Further charac-
PS1:16-14 Lex - A Lexical Analyzer Generator
ters previously that the final
matched are quote terminat-
returned to the ing the string
input. This should be
provides the picked up in
same sort of the code
lookahead labeled ``nor-
offered by the mal process-
/ operator, but ing''.
in a different
form. The func-
tion yyless()
Example: might be used
Consider a to reprocess
language which text in various
defines a circumstances.
string as a set Consider the C
of characters problem of dis-
between quota- tinguishing the
tion (") marks, ambiguity of
and provides ``=-a''. Sup-
that to include pose it is
a " in a string desired to
it must be pre- treat this as
ceded by a \. ``=- a'' but
The regular print a mes-
expression sage. A rule
which matches might be
that is some- =-[a-zA-Z] {
what confusing, printf("Op (=-) ambiguous\n");
so that it yyless(yyleng-1);
might be ... action for =- ...
preferable to }
write which prints a
\"[^"]* { message,
if (yytext[yyleng-1] == '\\') returns the
yymore(); letter after
else the operator to
... normal user processing the input
} stream, and
which will, treats the
when faced with operator as
a string such ``=-''. Alter-
as "abc\"def" natively it
first match the might be
five characters desired to
"abc\; then the treat this as
call to ``= -a''. To
yymore() will do this, just
cause the next return the
part of the minus sign as
string, "def, well as the
to be tacked on letter to the
the end. Note input:
Lex - A Lexical Analyzer Generator PS1:16-15
=-[a-zA-Z] { character
printf("Op (=-) ambiguous\n"); c on the
yyless(yyleng-2); output;
... action for = ... and
}
will perform 3) unput(c)
the other pushes the
interpretation. character
Note that the c back
expressions for onto the
the two cases input
might more stream to
easily be writ- be read
ten later by
=-/[A-Za-z] input().
in the first
case and By default
=/-[A-Za-z] these routines
in the second; are provided as
no backup would macro defini-
be required in tions, but the
the rule user can over-
action. It is ride them and
not necessary supply private
to recognize versions. These
the whole iden- routines define
tifier to the relation-
observe the ship between
ambiguity. The external files
possibility of and internal
``=-3'', how- characters, and
ever, makes must all be
=-/[^ \t\n] retained or
a still better modified con-
rule. sistently. They
may be rede-
In addi- fined, to cause
tion to these input or output
routines, Lex to be transmit-
also permits ted to or from
access to the strange places,
I/O routines it including other
uses. They are: programs or
internal
1) input() memory; but the
which character set
returns used must be
the next consistent in
input all routines; a
character; value of zero
returned by
2) output(c) input must mean
which end of file;
writes the and the rela-
PS1:16-16 Lex - A Lexical Analyzer Generator
tionship more input to
between unput arrive from a
and input must new source. In
be retained or this case, the
the Lex look- user should
ahead will not provide a
work. Lex does yywrap which
not look ahead arranges for
at all if it new input and
does not have returns 0.
to, but every This instructs
rule ending in Lex to continue
+ * ? or $ or processing. The
containing / default yywrap
implies look- always returns
ahead. Look- 1.
ahead is also
necessary to This rou-
match an tine is also a
expression that convenient
is a prefix of place to print
another expres- tables, sum-
sion. See below maries, etc. at
for a discus- the end of a
sion of the program. Note
character set that it is not
used by Lex. possible to
The standard write a normal
Lex library rule which
imposes a 100 recognizes
character limit end-of-file;
on backup. the only access
to this condi-
Another tion is through
Lex library yywrap. In
routine that fact, unless a
the user will private version
sometimes want of input() is
to redefine is supplied a file
yywrap() which containing
is called when- nulls cannot be
ever Lex handled, since
reaches an a value of 0
end-of-file. If returned by
yywrap returns input is taken
a 1, Lex con- to be end-of-
tinues with the file.
normal wrapup
on end of 5. Ambiguous
input. Some- Source Rules.
times, however,
it is con- Lex can
venient to handle ambigu-
arrange for ous specifica-
Lex - A Lexical Analyzer Generator PS1:16-17
tions. When integer and so
more than one the identifier
expression can interpretation
match the is used.
current input,
Lex chooses as The prin-
follows: ciple of
preferring the
1) The long- longest match
est match makes rules
is pre- containing
ferred. expressions
like .*
2) Among dangerous. For
rules example,
which '.*'
matched might seem a
the same good way of
number of recognizing a
charac- string in sin-
ters, the gle quotes. But
rule given it is an invi-
first is tation for the
preferred. program to read
far ahead,
Thus, suppose looking for a
the rules distant single
integer keyword action ...; quote.
[a-z]+ identifier action ...; Presented with
to be given in the input
that order. If 'first' quoted string here, 'second' here
the input is the above
integers, it is expression will
taken as an match
identifier, 'first' quoted string here, 'second'
because [a-z]+ which is prob-
matches 8 char- ably not what
acters while was wanted. A
integer matches better rule is
only 7. If the of the form
input is '[^'\n]*'
integer, both which, on the
rules match 7 above input,
characters, and will stop after
the keyword 'first'. The
rule is consequences of
selected errors like
because it was this are miti-
given first. gated by the
Anything fact that the .
shorter (e.g. operator will
int) will not not match new-
match the line. Thus
expression expressions
PS1:16-18 Lex - A Lexical Analyzer Generator
like .* stop on includes he,
the current Lex will nor-
line. Don't try mally not
to defeat this recognize the
with expres- instances of he
sions like included in
(.|\n)+ or she, since once
equivalents; it has passed a
the Lex gen- she those char-
erated program acters are
will try to gone.
read the entire
input file, Sometimes
causing inter- the user would
nal buffer like to over-
overflows. ride this
choice. The
Note that action REJECT
Lex is normally means ``go do
partitioning the next alter-
the input native.'' It
stream, not causes whatever
searching for rule was second
all possible choice after
matches of each the current
expression. rule to be exe-
This means that cuted. The
each character position of the
is accounted input pointer
for once and is adjusted
only once. For accordingly.
example, sup- Suppose the
pose it is user really
desired to wants to count
count the included
occurrences of instances of
both she and he he:
in an input she {s++; REJECT;}
text. Some Lex he {h++; REJECT;}
rules to do \n |
this might be . ;
she s++; these rules are
he h++; one way of
\n | changing the
. ; previous exam-
where the last ple to do just
two rules that. After
ignore every- counting each
thing besides expression, it
he and she. is rejected;
Remember that . whenever
does not appropriate,
include new- the other
line. Since she expression will
Lex - A Lexical Analyzer Generator PS1:16-19
then be input stream
counted. In but to detect
this example, all examples of
of course, the some items in
user could note the input, and
that she the instances
includes he but of these items
not vice versa, may overlap or
and omit the include each
REJECT action other. Suppose
on he; in other a digram table
cases, however, of the input is
it would not be desired; nor-
possible a mally the
priori to tell digrams over-
which input lap, that is
characters were the word the is
in both considered to
classes. contain both th
and he. Assum-
Consider ing a two-
the two rules dimensional
a[bc]+ { ... ; REJECT;} array named
a[cd]+ { ... ; REJECT;} digram to be
If the input is incremented,
ab, only the the appropriate
first rule source is
matches, and on %%
ad only the [a-z][a-z] {
second matches. digram[yytext[0]][yytext[1]]++;
The input REJECT;
string accb }
matches the . ;
first rule for \n ;
four characters where the
and then the REJECT is
second rule for necessary to
three charac- pick up a
ters. In con- letter pair
trast, the beginning at
input accd every charac-
agrees with the ter, rather
second rule for than at every
four characters other charac-
and then the ter.
first rule for
three. 6. Lex Source
Definitions.
In gen-
eral, REJECT is Remember
useful whenever the format of
the purpose of the Lex source:
Lex is not to {definitions}
partition the %%
PS1:16-20 Lex - A Lexical Analyzer Generator
{rules} prior to
%% the first
{user routines} %% delim-
So far only the iter will
rules have been be exter-
described. The nal to any
user needs function
additional in the
options, code; if
though, to it appears
define vari- immedi-
ables for use ately
in his program after the
and for use by first %%,
Lex. These can it appears
go either in in an
the definitions appropri-
section or in ate place
the rules sec- for
tion. declara-
tions in
Remember the func-
that Lex is tion writ-
turning the ten by Lex
rules into a which con-
program. Any tains the
source not actions.
intercepted by This
Lex is copied material
into the gen- must look
erated program. like pro-
There are three gram frag-
classes of such ments, and
things. should
precede
1) Any line the first
which is Lex rule.
not part
of a Lex As a side
rule or effect of
action the above,
which lines
begins which
with a begin with
blank or a blank or
tab is tab, and
copied which con-
into the tain a
Lex gen- comment,
erated are passed
program. through to
Such the gen-
source erated
input program.
Lex - A Lexical Analyzer Generator PS1:16-21
This can etc., is
be used to copied out
include after the
comments Lex out-
in either put.
the Lex
source or Defini-
the gen- tions intended
erated for Lex are
code. The given before
comments the first %%
should delimiter. Any
follow the line in this
host section not
language contained
conven- between %{ and
tion. %}, and begin-
ing in column
2) Anything 1, is assumed
included to define Lex
between substitution
lines con- strings. The
taining format of such
only %{ lines is
and %} is name translation
copied out and it causes
as above. the string
The delim- given as a
iters are translation to
discarded. be associated
This for- with the name.
mat per- The name and
mits translation
entering must be
text like separated by at
preproces- least one blank
sor state- or tab, and the
ments that name must begin
must begin with a letter.
in column The translation
1, or can then be
copying called out by
lines that the {name} syn-
do not tax in a rule.
look like Using {D} for
programs. the digits and
{E} for an
3) Anything exponent field,
after the for example,
third %% might abbrevi-
delimiter, ate rules to
regardless recognize
of for- numbers:
mats, D [0-9]
PS1:16-22 Lex - A Lexical Analyzer Generator
E [DEde][-+]?{D}+ adjustments to
%% the default
{D}+ printf("integer"); size of arrays
{D}+"."{D}*({E})? | within Lex
{D}*"."{D}+({E})? | itself for
{D}+{E} larger source
Note the first programs. These
two rules for possibilities
real numbers; are discussed
both require a below under
decimal point ``Summary of
and contain an Source For-
optional mat,'' section
exponent field, 12.
but the first
requires at 7. Usage.
least one digit
before the There are
decimal point two steps in
and the second compiling a Lex
requires at source program.
least one digit First, the Lex
after the source must be
decimal point. turned into a
To correctly generated pro-
handle the gram in the
problem posed host general
by a Fortran purpose
expression such language. Then
as 35.EQ.I, this program
which does not must be com-
contain a real piled and
number, a loaded, usually
context- with a library
sensitive rule of Lex subrou-
such as tines. The gen-
[0-9]+/"."EQ printf("integer"); erated program
could be used is on a file
in addition to named lex.yy.c.
the normal rule The I/O library
for integers. is defined in
terms of the C
The defin- standard
itions section library [6].
may also con-
tain other com- The C pro-
mands, includ- grams generated
ing the selec- by Lex are
tion of a host slightly dif-
language, a ferent on
character set OS/370, because
table, a list the OS compiler
of start condi- is less power-
tions, or ful than the
Lex - A Lexical Analyzer Generator PS1:16-23
UNIX or GCOS analyzer. Nor-
compilers, and mally, the
does less at default main
compile time. C program on the
programs gen- Lex library
erated on GCOS calls this rou-
and UNIX are tine, but if
the same. Yacc is loaded,
and its main
UNIX. The program is
library is used, Yacc will
accessed by the call yylex().
loader flag In this case
-ll. So an each Lex rule
appropriate set should end with
of commands is return(token);
lex source cc lex.yy.c where the
-ll appropriate
The resulting token value is
program is returned. An
placed on the easy way to get
usual file access to
a.out for later Yacc's names
execution. To for tokens is
use Lex with to compile the
Yacc see below. Lex output file
Although the as part of the
default Lex I/O Yacc output
routines use file by placing
the C standard the line
library, the # include "lex.yy.c"
Lex automata in the last
themselves do section of Yacc
not do so; if input. Suppos-
private ver- ing the grammar
sions of input, to be named
output and ``good'' and
unput are the lexical
given, the rules to be
library can be named
avoided. ``better'' the
UNIX command
8. Lex and sequence can
Yacc. just be:
yacc good
If you lex better
want to use Lex cc y.tab.c -ly -ll
with Yacc, note The Yacc
that what Lex library (-ly)
writes is a should be
program named loaded before
yylex(), the the Lex
name required library, to
by Yacc for its obtain a main
PS1:16-24 Lex - A Lexical Analyzer Generator
program which such input
invokes the items as 49.63
Yacc parser. or X7. Further-
The generations more, it incre-
of Lex and Yacc ments the abso-
programs can be lute value of
done in either all negative
order. numbers divisi-
ble by 7. To
9. Examples. avoid this,
just add a few
As a more rules
trivial prob- after the
lem, consider active one, as
copying an here:
input file %%
while adding 3 int k;
to every posi- -?[0-9]+ {
tive number k = atoi(yytext);
divisible by 7. printf("%d",
Here is a suit- k%7 == 0 ? k+3 : k);
able Lex source }
program -?[0-9.]+ ECHO;
%% [A-Za-z][A-Za-z0-9]+ ECHO;
int k; Numerical
[0-9]+ { strings con-
k = atoi(yytext); taining a ``.''
if (k%7 == 0) or preceded by
printf("%d", k+3); a letter will
else be picked up by
printf("%d",k); one of the last
} two rules, and
to do just not changed.
that. The rule The if-else has
[0-9]+ recog- been replaced
nizes strings by a C condi-
of digits; atoi tional expres-
converts the sion to save
digits to space; the form
binary and a?b:c means
stores the ``if a then b
result in k. else c''.
The operator %
(remainder) is For an
used to check example of
whether k is statistics
divisible by 7; gathering, here
if it is, it is is a program
incremented by which histo-
3 as it is grams the
written out. It lengths of
may be objected words, where a
that this pro- word is defined
gram will alter as a string of
Lex - A Lexical Analyzer Generator PS1:16-25
letters. gle precision
int lengs[100]; Fortran.
%% Because Fortran
[a-z]+ lengs[yyleng]++; does not dis-
. | tinguish upper
\n ; and lower case
%% letters, this
yywrap() routine begins
{ by defining a
int i; set of classes
printf("Length No. words\n"); including both
for(i=0; i<100; i++) cases of each
if (lengs[i] > 0) letter:
printf("%5d%10d\n",i,lengs[i]); a [aA]
return(1); b [bB]
} c [cC]
This program ...
accumulates the z [zZ]
histogram, An additional
while producing class recog-
no output. At nizes white
the end of the space:
input it prints W [ \t]*
the table. The The first rule
final statement changes ``dou-
return(1); ble precision''
indicates that to ``real'', or
Lex is to per- ``DOUBLE PRECI-
form wrapup. SION'' to
If yywrap ``REAL''.
returns zero {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
(false) it printf(yytext[0]=='d'? "real" : "REAL");
implies that }
further input Care is taken
is available throughout this
and the program program to
is to continue preserve the
reading and case (upper or
processing. To lower) of the
provide a original pro-
yywrap that gram. The con-
never returns ditional opera-
true causes an tor is used to
infinite loop. select the
proper form of
As a the keyword.
larger example, The next rule
here are some copies con-
parts of a pro- tinuation card
gram written by indications to
N. L. Schryer avoid confusing
to convert dou- them with con-
ble precision stants:
Fortran to sin- ^" "[^ 0] ECHO;
PS1:16-26 Lex - A Lexical Analyzer Generator
In the regular respelled to
expression, the remove their
quotes surround initial d. By
the blanks. It using the array
is interpreted yytext the same
as ``beginning action suffices
of line, then for all the
five blanks, names (only a
then anything sample of a
but blank or rather long
zero.'' Note list is given
the two dif- here).
ferent meanings {d}{s}{i}{n} |
of ^. There {d}{c}{o}{s} |
follow some {d}{s}{q}{r}{t} |
rules to change {d}{a}{t}{a}{n} |
double preci- ...
sion constants {d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
to ordinary Another list of
floating con- names must have
stants. initial d
[0-9]+{W}{d}{W}[+-]?{W}[0-9]+ | changed to ini-
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+ | tial a:
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+{d}{l}{o}{g} |
/* convert constants */ {d}{l}{o}{g}10 |
for(p=yytext; *p != 0; p++) {d}{m}{i}{n}1 |
{ {d}{m}{a}{x}1 {
if (*p == 'd' || *p == 'D') yytext[0] =+ 'a' - 'd';
*p=+ 'e'- 'd'; ECHO;
ECHO; }
} And one routine
After the must have ini-
floating point tial d changed
constant is to initial r:
recognized, it {d}1{m}{a}{c}{h} {yytext[0] =+ 'r' - 'd';
is scanned by
the for loop to
find the letter To avoid such
d or D. The names as dsinx
program than being detected
adds 'e'-'d', as instances of
which converts dsin, some
it to the next final rules
letter of the pick up longer
alphabet. The words as iden-
modified con- tifiers and
stant, now copy some sur-
single- viving charac-
precision, is ters:
written out [A-Za-z][A-Za-z0-9]* |
again. There [0-9]+ |
follow a series \n |
of names which . ECHO;
must be Note that this
Lex - A Lexical Analyzer Generator PS1:16-27
program is not facility simi-
complete; it lar to that for
does not deal adjacent right
with the spac- context, but it
ing problems in is unlikely to
Fortran or with be as useful,
the use of key- since often the
words as iden- relevant left
tifiers. context
10. Left Con- appeared some
text Sensi- time earlier,
tivity. such as at the
beginning of a
Sometimes line.
it is desirable
to have several This sec-
sets of lexical tion describes
rules to be three means of
applied at dif- dealing with
ferent times in different
the input. For environments: a
example, a com- simple use of
piler prepro- flags, when
cessor might only a few
distinguish rules change
preprocessor from one
statements and environment to
analyze them another, the
differently use of start
from ordinary conditions on
statements. rules, and the
This requires possibility of
sensitivity to making multiple
prior context, lexical
and there are analyzers all
several ways of run together.
handling such In each case,
problems. The ^ there are rules
operator, for which recognize
example, is a the need to
prior context change the
operator, environment in
recognizing which the fol-
immediately lowing input
preceding left text is
context just as analyzed, and
$ recognizes set some param-
immediately eter to reflect
following right the change.
context. Adja- This may be a
cent left con- flag explicitly
text could be tested by the
extended, to user's action
produce a code; such a
PS1:16-28 Lex - A Lexical Analyzer Generator
flag is the letter a,
simplest way of changing magic
dealing with to second on
the problem, every line
since Lex is which began
not involved at with the letter
all. It may be b, and changing
more con- magic to third
venient, how- on every line
ever, to have which began
Lex remember with the letter
the flags as c. All other
initial condi- words and all
tions on the other lines are
rules. Any rule left unchanged.
may be associ-
ated with a These
start condi- rules are so
tion. It will simple that the
only be recog- easiest way to
nized when Lex do this job is
is in that with a flag:
start condi- int flag;
tion. The %%
current start ^a {flag = 'a'; ECHO;}
condition may ^b {flag = 'b'; ECHO;}
be changed at ^c {flag = 'c'; ECHO;}
any time. \n {flag = 0 ; ECHO;}
Finally, if the magic {
sets of rules switch (flag)
for the dif- {
ferent environ- case 'a': printf("first"); break;
ments are very case 'b': printf("second"); break;
dissimilar, case 'c': printf("third"); break;
clarity may be default: ECHO; break;
best achieved }
by writing }
several dis- should be ade-
tinct lexical quate.
analyzers, and
switching from To handle
one to another the same prob-
as desired. lem with start
conditions,
Consider each start con-
the following dition must be
problem: copy introduced to
the input to Lex in the
the output, definitions
changing the section with a
word magic to line reading
first on every %Start name1 name2 ...
line which where the con-
began with the ditions may be
Lex - A Lexical Analyzer Generator PS1:16-29
named in any ^b {ECHO; BEGIN BB;}
order. The word ^c {ECHO; BEGIN CC;}
Start may be \n {ECHO; BEGIN 0;}
abbreviated to <AA>magic printf("first");
s or S. The <BB>magic printf("second");
conditions may <CC>magic printf("third");
be referenced where the logic
at the head of is exactly the
a rule with the same as in the
<> brackets: previous method
<name1>expression of handling the
is a rule which problem, but
is only recog- Lex does the
nized when Lex work rather
is in the start than the user's
condition code.
name1. To enter
a start condi- 11. Character
tion, execute Set.
the action
statement The pro-
BEGIN name1; grams generated
which changes by Lex handle
the start con- character I/O
dition to only through
name1. To the routines
resume the nor- input, output,
mal state, and unput. Thus
BEGIN 0; the character
resets the ini- representation
tial condition provided in
of the Lex these routines
automaton is accepted by
interpreter. A Lex and
rule may be employed to
active in return values
several start in yytext. For
conditions: internal use a
<name1,name2,name3> character is
is a legal pre- represented as
fix. Any rule a small integer
not beginning which, if the
with the <> standard
prefix operator library is
is always used, has a
active. value equal to
the integer
The same value of the
example as bit pattern
before can be representing
written: the character
%START AA BB CC on the host
%% computer. Nor-
^a {ECHO; BEGIN AA;} mally, the
PS1:16-30 Lex - A Lexical Analyzer Generator
letter a is through 26,
represented as newline into
the same form 27, + and -
as the charac- into 28 and 29,
ter constant and the digits
'a'. If this into 30 through
interpretation 39. Note the
is changed, by escape for new-
providing I/O line. If a
routines which table is sup-
translate the plied, every
characters, Lex character that
must be told is to appear
about it, by either in the
giving a trans- rules or in any
lation table. valid input
This table must must be
be in the included in the
definitions table. No char-
section, and acter may be
must be brack- assigned the
eted by lines number 0, and
containing no character
only ``%T''. may be assigned
The table con- a bigger number
tains lines of than the size
the form of the hardware
{integer} {character string} character set.
which indicate
the value asso- 12. Summary of
ciated with Source Format.
each character.
Thus the next The gen-
example eral form of a
%T Lex source file
1 Aa is:
2 Bb {definitions}
... %%
26 Zz {rules}
27 \n %%
28 + {user subroutines}
29 - The definitions
30 0 section con-
31 1 tains a combi-
... nation of
39 9
%T 1) Defini-
tions, in
Sample character table. the form
maps the lower ``name
and upper case space
letters transla-
together into tion''.
the integers 1
Lex - A Lexical Analyzer Generator PS1:16-31
2) Included ``expression
code, in action'' where
the form the action may
``space be continued on
code''. succeeding
lines by using
3) Included braces to del-
code, in imit it.
the form
%{ Regular
code expressions in
%} Lex use the
4) Start con- following
ditions, operators:
given in x the character "x"
the form "x" an "x", even if x is an operator.
%S name1 name2 ... \x an "x", even if x is an operator.
5) Character [xy] the character x or y.
set [x-z] the characters x, y or z.
tables, in [^x] any character but x.
the form . any character but newline.
%T ^x an x at the beginning of a line.
number space character-st<y>x an x when Lex is in start condition y.
... x$ an x at the end of a line.
%T x? an optional x.
6) Changes to x* 0,1,2, ... instances of x.
internal x+ 1,2,3, ... instances of x.
array x|y an x or a y.
sizes, in (x) an x.
the form x/y an x but only if followed by y.
%x nnn {xx} the translation of xx from the
where nnn definitions section.
is a x{m,n} m through n occurrences of x
decimal
integer 13. Caveats and
represent- Bugs.
ing an
array size There are
and x pathological
selects expressions
the param- which produce
eter as exponential
follows: growth of the
Letter Parameter tables when
p positions converted to
n states deterministic
e tree nodes machines; for-
a transitions tunately, they
k packed character classes are rare.
o output array size
REJECT
Lines in the does not rescan
rules section the input;
have the form instead it
PS1:16-32 Lex - A Lexical Analyzer Generator
remembers the Eric Schmidt.
results of the
previous scan. 15. References.
This means that
if a rule with 1. B. W. Ker-
trailing con- nighan and
text is found, D. M.
and REJECT exe- Ritchie,
cuted, the user The C Pro-
must not have gramming
used unput to Language,
change the Prentice-
characters Hall, N.
forthcoming J. (1978).
from the input
stream. This is 2. B. W. Ker-
the only res- nighan,
triction on the Ratfor: A
user's ability Preproces-
to manipulate sor for a
the not-yet- Rational
processed Fortran,
input. Software -
Practice
14. Acknowledg- and
ments. Experi-
ence, 5,
As should pp. 395-
be obvious from 496
the above, the (1975).
outside of Lex
is patterned on 3. S. C.
Yacc and the Johnson,
inside on Aho's Yacc: Yet
string matching Another
routines. Compiler
Therefore, both Compiler,
S. C. Johnson Computing
and A. V. Aho Science
are really ori- Technical
ginators of Report No.
much of Lex, as 32, 1975,
well as Bell
debuggers of Labora-
it. Many thanks tories,
are due to Murray
both. Hill, NJ
07974.
The code
of the current 4. A. V. Aho
version of Lex and M. J.
was designed, Corasick,
written, and Efficient
debugged by String
Lex - A Lexical Analyzer Generator PS1:16-33
Matching:
An Aid to
Biblio-
graphic
Search,
Comm. ACM
18, 333-
340
(1975).
5. B. W. Ker-
nighan, D.
M. Ritchie
and K. L.
Thompson,
QED Text
Editor,
Computing
Science
Technical
Report No.
5, 1972,
Bell
Labora-
tories,
Murray
Hill, NJ
07974.
6. D. M.
Ritchie,
private
communica-
tion. See
also M. E.
Lesk, The
Portable C
Library,
Computing
Science
Technical
Report No.
31, Bell
Labora-
tories,
Murray
Hill, NJ
07974.
Generated on 2013-04-27 00:20:00 by $MirOS: src/scripts/roff2htm,v 1.77 2013/01/01 20:49:09 tg Exp $
These manual pages and other documentation are copyrighted by their respective writers;
their source is available at our CVSweb,
AnonCVS, and other mirrors. The rest is Copyright © 2002‒2013 The MirOS Project, Germany.
This product includes material
provided by Thorsten Glaser.
This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.