MirOS Manual: 16.lex(PSD)

Lex - A Lexical Analyzer Generator                       PS1:16-1

               Lex - A Lexical Analyzer Generator

                    M. E. Lesk and E. Schmidt

                     AT&T Bell Laboratories
                  Murray Hill, New Jersey 07974


          Lex helps write programs  whose  control  flow  is
     directed  by  instances  of  regular expressions in the
     input stream. It is well suited for editor-script  type
     transformations and for segmenting input in preparation
     for a parsing routine.

          Lex source is a table of regular  expressions  and
     corresponding   program   fragments.   The   table   is
     translated to a program which reads  an  input  stream,
     copying  it  to  an  output stream and partitioning the
     input into strings which match the  given  expressions.
     As  each  such  string  is recognized the corresponding
     program fragment is executed. The  recognition  of  the
     expressions  is  performed  by  a  deterministic finite
     automaton generated by Lex. The program fragments writ-
     ten  by the user are executed in the order in which the
     corresponding regular expressions occur  in  the  input

          The lexical analysis  programs  written  with  Lex
     accept  ambiguous specifications and choose the longest
     match possible at each input point. If necessary,  sub-
     stantial  lookahead  is performed on the input, but the
     input stream will be  backed  up  to  the  end  of  the
     current partition, so that the user has general freedom
     to manipulate it.

          Lex can generate analyzers in either C or  Ratfor,
     a  language  which  can  be translated automatically to
     portable Fortran. It is available on the  PDP-11  UNIX,
     Honeywell  GCOS,  and IBM OS systems. This manual, how-
     ever, will only discuss generating analyzers  in  C  on
     the  UNIX  system,  which is the only supported form of
     Lex under UNIX Version 7. Lex is designed  to  simplify
     interfacing  with  Yacc,  for those with access to this
     compiler-compiler system.

PS1:16-2                       Lex - A Lexical Analyzer Generator

1. Introduction.                        Lex  is  not  a  complete
                                   language, but rather a genera-
     Lex is a program  genera-     tor   representing    a    new
tor  designed for lexical pro-     language  feature which can be
cessing  of  character   input     added to different programming
streams.  It  accepts  a high-     languages,    called    ``host
level,    problem     oriented     languages.'' Just  as  general
specification   for  character     purpose  languages can produce
string matching, and  produces     code to run on different  com-
a program in a general purpose     puter  hardware, Lex can write
language which recognizes reg-     code   in    different    host
ular  expressions. The regular     languages.  The  host language
expressions are  specified  by     is used for  the  output  code
the user in the source specif-     generated  by Lex and also for
ications given to Lex. The Lex     the program fragments added by
written  code recognizes these     the  user. Compatible run-time
expressions in an input stream     libraries  for  the  different
and   partitions   the   input     host  languages  are also pro-
stream into  strings  matching     vided. This makes  Lex  adapt-
the   expressions.    At   the     able to different environments
boundaries   between   strings     and  different   users.   Each
program  sections  provided by     application may be directed to
the user are executed. The Lex     the  combination  of  hardware
source   file  associates  the     and  host language appropriate
regular  expressions  and  the     to the task, the user's  back-
program   fragments.  As  each     ground,  and the properties of
expression  appears   in   the     local   implementations.    At
input  to  the program written     present,  the  only  supported
by  Lex,   the   corresponding     host language is  C,  although
fragment is executed.              Fortran (in the form of Ratfor
                                   [2] has been available in  the
     The  user  supplies   the     past.  Lex  itself  exists  on
additional code beyond expres-     UNIX, GCOS,  and  OS/370;  but
sion matching needed  to  com-     the  code generated by Lex may
plete   his   tasks,  possibly     be    taken    anywhere    the
including  code   written   by     appropriate compilers exist.
other  generators. The program
that  recognizes  the  expres-          Lex  turns   the   user's
sions is generated in the gen-     expressions     and    actions
eral    purpose    programming     (called source in  this  memo)
language   employed   for  the     into  the host general-purpose
user's   program    fragments.     language; the  generated  pro-
Thus,  a high level expression     gram is named yylex. The yylex
language is provided to  write     program will recognize expres-
the  string  expressions to be     sions   in  a  stream  (called
matched while the user's free-     input in this memo)  and  per-
dom  to write actions is unim-     form the specified actions for
paired.  This  avoids  forcing     each  expression  as   it   is
the  user  who wishes to use a     detected. See Figure 1.
string  manipulation  language                        _______
for  input  analysis  to write             Source ->|   Lex |   -> yylex
processing  programs  in   the                        _______
same  and  often inappropriate
string handling language.

Lex - A Lexical Analyzer Generator                       PS1:16-3

                   _______                        so the  program
        Input -> |  yylex|  -> Output             generated    by
                   _______                        Lex     (yylex)
                                                  will     ignore
             An overview of Lex                   these   charac-
                  Figure 1                        ters.    Every-
                                                  thing else will
                    For      a                    be  copied.  To
               trivial   exam-                    change      any
               ple, consider a                    remaining
               program      to                    string       of
               delete from the                    blanks  or tabs
               input       all                    to   a   single
               blanks or  tabs                    blank,      add
               at  the ends of                    another rule:
               lines.                         %%
                 %%                           [ \t]+$   ;
                 [ \t]+$   ;                  [ \t]+    printf(" ");
               is all that  is                    The      finite
               required.   The                    automaton  gen-
               program    con-                    erated for this
               tains a %% del-                    source     will
               imiter to  mark                    scan  for  both
               the   beginning                    rules  at once,
               of  the  rules,                    observing    at
               and  one  rule.                    the termination
               This rule  con-                    of  the  string
               tains a regular                    of   blanks  or
               expression                         tabs whether or
               which   matches                    not  there is a
               one   or   more                    newline charac-
               instances    of                    ter,  and  exe-
               the  characters                    cuting      the
               blank   or  tab                    desired    rule
               (written \t for                    action.     The
               visibility,  in                    first      rule
               accordance with                    matches     all
               the  C language                    strings      of
               convention)                        blanks or  tabs
               just  prior  to                    at  the  end of
               the  end  of  a                    lines, and  the
               line.       The                    second rule all
               brackets  indi-                    remaining
               cate  the char-                    strings      of
               acter     class                    blanks or tabs.
               made  of  blank
               and tab; the  +                         Lex can be
               indicates ``one                    used  alone for
               or more  ...'';                    simple
               and the $ indi-                    transforma-
               cates ``end  of                    tions,  or  for
               line,''  as  in                    analysis    and
               QED. No  action                    statistics
               is   specified,                    gathering  on a

PS1:16-4                       Lex - A Lexical Analyzer Generator

               lexical  level.                    programs  writ-
               Lex can also be                    ten by Lex.
               used   with   a                lexical        grammar
               parser  genera-                 rules          rules
               tor to  perform                   v              v
               the     lexical               _________      _________
               analysis phase;             |    Lex  |    |   Yacc  |
               it  is particu-               _________      _________
               larly  easy  to                   v              v
               interface   Lex               _________      _________
               and  Yacc  [3].     Input ->|   yylex |  ->|  yyparse|  -> Parsed input
               Lex    programs               _________      _________
               recognize  only
               regular expres-                    Lex with Yacc
               sions;     Yacc                       Figure 2
               writes  parsers                    Yacc users will
               that  accept  a                    realize    that
               large  class of                    the name  yylex
               context    free                    is   what  Yacc
               grammars,   but                    expects     its
               require a lower                    lexical
               level  analyzer                    analyzer to  be
               to    recognize                    named,  so that
               input   tokens.                    the use of this
               Thus, a  combi-                    name   by   Lex
               nation  of  Lex                    simplifies
               and   Yacc   is                    interfacing.
               often appropri-
               ate. When  used                         Lex   gen-
               as a preproces-                    erates a deter-
               sor for a later                    ministic finite
               parser  genera-                    automaton  from
               tor,   Lex   is                    the     regular
               used  to parti-                    expressions  in
               tion the  input                    the source [4].
               stream, and the                    The   automaton
               parser  genera-                    is interpreted,
               tor     assigns                    rather     than
               structure    to                    compiled,    in
               the   resulting                    order  to  save
               pieces.     The                    space.      The
               flow of control                    result is still
               in such a  case                    a          fast
               (which might be                    analyzer.    In
               the first  half                    particular, the
               of  a compiler,                    time taken by a
               for example) is                    Lex program  to
               shown in Figure                    recognize   and
               2.   Additional                    partition    an
               programs, writ-                    input stream is
               ten  by   other                    proportional to
               generators   or                    the  length  of
               by hand, can be                    the input.  The
               added easily to                    number  of  Lex

Lex - A Lexical Analyzer Generator                       PS1:16-5

               rules  or   the                    action routine.
               complexity   of
               the  rules   is                         Lex is not
               not   important                    limited      to
               in  determining                    source    which
               speed,   unless                    can  be  inter-
               rules     which                    preted  on  the
               include forward                    basis   of  one
               context require                    character look-
               a   significant                    ahead.      For
               amount  of  re-                    example,     if
               scanning.  What                    there  are  two
               does   increase                    rules,      one
               with the number                    looking  for ab
               and  complexity                    and another for
               of rules is the                    abcdefg,    and
               size   of   the                    the       input
               finite  automa-                    stream       is
               ton, and there-                    abcdefh,    Lex
               fore  the  size                    will  recognize
               of the  program                    ab  and   leave
               generated    by                    the       input
               Lex.                               pointer    just
                                                  before  cd. . .
                    In     the                    Such backup  is
               program written                    more     costly
               by   Lex,   the                    than  the  pro-
               user's    frag-                    cessing      of
               ments                              simpler
               (representing                      languages.
               the actions  to
               be performed as                    2. Lex Source.
               each    regular
               expression   is                         The   gen-
               found)      are                    eral  format of
               gathered     as                    Lex source is:
               cases   of    a                  {definitions}
               switch.     The                  %%
               automaton                        {rules}
               interpreter                      %%
               directs     the                  {user subroutines}
               control   flow.                    where       the
               Opportunity  is                    definitions and
               provided    for                    the  user  sub-
               the   user   to                    routines    are
               insert   either                    often  omitted.
               declarations or                    The  second  %%
               additional                         is    optional,
               statements   in                    but  the  first
               the     routine                    is required  to
               containing  the                    mark the begin-
               actions,  or to                    ning   of   the
               add subroutines                    rules.      The
               outside    this                    absolute

PS1:16-6                       Lex - A Lexical Analyzer Generator

               minimum     Lex                    used  to  print
               program is thus                    the string. The
                     %%                           end   of    the
               (no     defini-                    expression   is
               tions,       no                    indicated    by
               rules)    which                    the first blank
               translates into                    or tab  charac-
               a program which                    ter.   If   the
               copies      the                    action       is
               input  to   the                    merely a single
               output                             C   expression,
               unchanged.                         it  can just be
                                                  given  on   the
                    In     the                    right  side  of
               outline  of Lex                    the line; if it
               programs  shown                    is compound, or
               above,      the                    takes more than
               rules represent                    a    line,   it
               the user's con-                    should       be
               trol decisions;                    enclosed     in
               they    are   a                    braces.  As   a
               table, in which                    slightly   more
               the left column                    useful example,
               contains  regu-                    suppose  it  is
               lar expressions                    desired      to
               (see section 3)                    change a number
               and  the  right                    of  words  from
               column contains                    British      to
               actions,   pro-                    American  spel-
               gram  fragments                    ling. Lex rules
               to  be executed                    such as
               when        the           colour      printf("color");
               expressions are           mechanise   printf("mechanize");
               recognized.               petrol      printf("gas");
               Thus an indivi-                    would   be    a
               dual rule might                    start.    These
               appear                             rules  are  not
   integer   printf("found keyword INT");         quite   enough,
               to look for the                    since the  word
               string  integer                    petroleum would
               in  the   input                    become  gaseum;
               stream      and                    a  way of deal-
               print the  mes-                    ing  with  this
               sage    ``found                    will         be
               keyword   INT''                    described
               whenever     it                    later.
               appears.     In
               this    example                    3. Lex  Regular
               the  host  pro-                    Expressions.
               language  is  C                         The defin-
               and    the    C                    itions of regu-
               library   func-                    lar expressions
               tion  printf is                    are  very simi-

Lex - A Lexical Analyzer Generator                       PS1:16-7

               lar to those in                    text    charac-
               QED [5]. A reg-                    ters. Thus
               ular expression                        xyz"++"
               specifies a set                    matches     the
               of  strings  to                    string    xyz++
               be  matched. It                    when         it
               contains   text                    appears.   Note
               characters                         that a part  of
               (which    match                    a string may be
               the correspond-                    quoted.  It  is
               ing  characters                    harmless    but
               in  the strings                    unnecessary  to
               being compared)                    quote  an ordi-
               and    operator                    nary text char-
               characters                         acter;      the
               (which  specify                    expression
               repetitions,                           "xyz++"
               choices,    and                    is the same  as
               other                              the  one above.
               features).  The                    Thus by quoting
               letters of  the                    every      non-
               alphabet    and                    alphanumeric
               the digits  are                    character being
               always     text                    used as a  text
               characters;                        character,  the
               thus  the regu-                    user can  avoid
               lar expression                     remembering the
                 integer                          list  above  of
               matches     the                    current  opera-
               string  integer                    tor characters,
               wherever     it                    and   is   safe
               appears and the                    should  further
               expression                         extensions   to
                    a57D                          Lex    lengthen
               looks  for  the                    the list.
               string a57D.
                                                       An  opera-
                    Operators.                    tor   character
               The    operator                    may   also   be
               characters are                     turned  into  a
   " \ [ ] ^ - ? . * + | ( ) $ / { } % < >        text  character
               and if they are                    by preceding it
               to  be  used as                    with \ as in
               text    charac-                        xyz\+\+
               ters, an escape                    which        is
               should be used.                    another,   less
               The   quotation                    readable,
               mark   operator                    equivalent   of
               (")   indicates                    the       above
               that   whatever                    expressions.
               is    contained                    Another use  of
               between a  pair                    the     quoting
               of quotes is to                    mechanism is to
               be   taken   as                    get   a   blank

PS1:16-8                       Lex - A Lexical Analyzer Generator

               into an expres-                    cial: these are
               sion; normally,                    \ - and ^.  The
               as    explained                    -     character
               above,   blanks                    indicates
               or tabs  end  a                    ranges.     For
               rule. Any blank                    example,
               character   not                      [a-z0-9<>_]
               contained                          indicates   the
               within []  (see                    character class
               below)  must be                    containing  all
               quoted. Several                    the  lower case
               normal        C                    letters,    the
               escapes with  \                    digits,     the
               are recognized:                    angle brackets,
               \n is  newline,                    and  underline.
               \t  is tab, and                    Ranges  may  be
               \b   is   back-                    given in either
               space. To enter                    order. Using  -
               \  itself,  use                    between     any
               \\.  Since new-                    pair of charac-
               line is illegal                    ters  which are
               in  an  expres-                    not both  upper
               sion,  \n  must                    case   letters,
               be  used; it is                    both lower case
               not required to                    letters,     or
               escape  tab and                    both digits  is
               backspace.                         implementation
               Every character                    dependent   and
               but blank, tab,                    will    get   a
               newline and the                    warning    mes-
               list  above  is                    sage.    (E.g.,
               always  a  text                    [0-z] in  ASCII
               character.                         is   many  more
                                                  characters than
                    Character                     it     is    in
               classes.                           EBCDIC). If  it
               Classes      of                    is  desired  to
               characters  can                    include     the
               be    specified                    character  - in
               using       the                    a     character
               operator   pair                    class,       it
               [].   The  con-                    should be first
               struction [abc]                    or last; thus
               matches  a sin-                        [-+0-9]
               gle  character,                    matches all the
               which may be a,                    digits  and the
               b, or c. Within                    two signs.
               square   brack-
               ets,       most                         In charac-
               operator  mean-                    ter    classes,
               ings        are                    the ^  operator
               ignored.   Only                    must  appear as
               three   charac-                    the first char-
               ters  are  spe-                    acter after the

Lex - A Lexical Analyzer Generator                       PS1:16-9

               left   bracket;                    expressions.
               it    indicates                    The operator  ?
               that        the                    indicates    an
               resulting                          optional   ele-
               string is to be                    ment    of   an
               complemented                       expression.
               with respect to                    Thus
               the    computer                         ab?c
               character  set.                    matches  either
               Thus                               ac or abc.
               matches     all                         Repeated
               characters                         expressions.
               except a, b, or                    Repetitions  of
               c,    including                    classes     are
               all special  or                    indicated    by
               control charac-                    the operators *
               ters; or                           and +.
                  [^a-zA-Z]                             a*
               is any  charac-                    is  any  number
               ter   which  is                    of  consecutive
               not  a  letter.                    a   characters,
               The \ character                    including zero;
               provides    the                    while
               usual   escapes                          a+
               within  charac-                    is one or  more
               ter       class                    instances of a.
               brackets.                          For example,
                    Arbitrary                     is all  strings
               character.   To                    of  lower  case
               match    almost                    letters. And
               any  character,                 [A-Za-z][A-Za-z0-9]*
               the    operator                    indicates   all
               character                          alphanumeric
                      .                           strings with  a
               is the class of                    leading  alpha-
               all  characters                    betic   charac-
               except newline.                    ter.  This is a
               Escaping   into                    typical expres-
               octal is possi-                    sion for recog-
               ble    although                    nizing identif-
               non-portable:                      iers   in  com-
                 [\40-\176]                       puter
               matches     all                    languages.
               printable char-
               acters  in  the                         Alterna-
               ASCII character                    tion and Group-
               set, from octal                    ing. The opera-
               40  (blank)  to                    tor | indicates
               octal       176                    alternation:
               (tilde).                               (ab|cd)
                                                  matches  either
                    Optional                      ab  or cd. Note

PS1:16-10                      Lex - A Lexical Analyzer Generator

               that                               classes,  since
               parentheses are                    that       only
               used for group-                    applies  within
               ing,   although                    the  []  opera-
               they  are   not                    tors.  If   the
               necessary    on                    very last char-
               the     outside                    acter is $, the
               level;                             expression will
                    ab|cd                         only be matched
               would have suf-                    at the end of a
               ficed.                             line      (when
               Parentheses can                    immediately
               be   used   for                    followed     by
               more    complex                    newline).   The
               expressions:                       latter operator
               (ab|cd+)?(ef)*                     is   a  special
               matches    such                    case of  the  /
               strings      as                    operator  char-
               abefef, efefef,                    acter,    which
               cdef,  or cddd;                    indicates
               but  not   abc,                    trailing   con-
               abcd,        or                    text.       The
               abcdef.                            expression
                    Context                       matches     the
               sensitivity.                       string  ab, but
               Lex will recog-                    only  if   fol-
               nize   a  small                    lowed   by  cd.
               amount of  sur-                    Thus
               rounding   con-                          ab$
               text.  The  two                    is the same as
               simplest opera-                         ab/\n
               tors  for  this                    Left context is
               are ^ and $. If                    handled  in Lex
               the first char-                    by start condi-
               acter   of   an                    tions        as
               expression   is                    explained    in
               ^,  the expres-                    section 10.  If
               sion will  only                    a rule is  only
               be  matched  at                    to  be executed
               the   beginning                    when  the   Lex
               of    a    line                    automaton
               (after  a  new-                    interpreter  is
               line character,                    in start condi-
               or    at    the                    tion   x,   the
               beginning    of                    rule  should be
               the       input                    prefixed by
               stream).   This                          <x>
               can never  con-                    using the angle
               flict  with the                    bracket  opera-
               other   meaning                    tor characters.
               of  ^,  comple-                    If    we   con-
               mentation    of                    sidered ``being
               character                          at  the  begin-

Lex - A Lexical Analyzer Generator                      PS1:16-11

               ning    of    a                    4. Lex Actions.
               line''   to  be
               start condition                         When    an
               ONE, then the ^                    expression
               operator  would                    written      as
               be   equivalent                    above        is
               to                                 matched,    Lex
                    <ONE>                         executes    the
               Start    condi-                    corresponding
               tions       are                    action.    This
               explained  more                    section
               fully later.                       describes  some
                                                  features of Lex
                    Repeti-                       which   aid  in
               tions       and                    writing
               Definitions.                       actions.   Note
               The   operators                    that there is a
               {}      specify                    default action,
               either  repeti-                    which  consists
               tions (if  they                    of  copying the
               enclose                            input  to   the
               numbers)     or                    output.    This
               definition                         is performed on
               expansion   (if                    all strings not
               they  enclose a                    otherwise
               name).      For                    matched.   Thus
               example                            the  Lex   user
                   {digit}                        who  wishes  to
               looks   for   a                    absorb      the
               predefined                         entire   input,
               string    named                    without produc-
               digit       and                    ing any output,
               inserts  it  at                    must    provide
               that  point  in                    rules  to match
               the expression.                    everything.
               The definitions                    When   Lex   is
               are  given   in                    being used with
               the  first part                    Yacc,  this  is
               of   the    Lex                    the      normal
               input,   before                    situation.  One
               the  rules.  In                    may    consider
               contrast,                          that    actions
                   a{1,5}                         are   what   is
               looks for 1  to                    done instead of
               5   occurrences                    copying     the
               of a.                              input   to  the
                                                  output;   thus,
                    Finally,                      in  general,  a
               initial   %  is                    rule      which
               special,  being                    merely   copies
               the   separator                    can be omitted.
               for Lex  source                    Also, a charac-
               segments.                          ter combination
                                                  which  is omit-

PS1:16-12                      Lex - A Lexical Analyzer Generator

               ted  from   the                    quotes   around
               rules and which                    \n and  \t  are
               appears      as                    not required.
               input is likely
               to  be  printed                         In    more
               on  the output,                    complex
               thus    calling                    actions,    the
               attention    to                    user will often
               the gap in  the                    want  to   know
               rules.                             the actual text
                                                  that    matched
                    One of the                    some expression
               simplest things                    like    [a-z]+.
               that   can   be                    Lex leaves this
               done    is   to                    text   in    an
               ignore      the                    external  char-
               input.   Speci-                    acter     array
               fying a C  null                    named   yytext.
               statement, ; as                    Thus, to  print
               an       action                    the name found,
               causes     this                    a rule like
               result.  A fre-            [a-z]+   printf("%s", yytext);
               quent rule is                      will print  the
                 [ \t\n]   ;                      string       in
               which    causes                    yytext.  The  C
               the three spac-                    function printf
               ing  characters                    accepts a  for-
               (blank,    tab,                    mat    argument
               and newline) to                    and data to  be
               be ignored.                        printed;     in
                                                  this case,  the
                    Another                       format       is
               easy   way   to                    ``print
               avoid   writing                    string''     (%
               actions  is the                    indicating data
               action  charac-                    conversion, and
               ter   |,  which                    s    indicating
               indicates  that                    string   type),
               the  action for                    and  the   data
               this  rule   is                    are the charac-
               the  action for                    ters in yytext.
               the next  rule.                    So   this  just
               The    previous                    places      the
               example   could                    matched  string
               also  have been                    on the  output.
               written                            This  action is
                   " "                            so common  that
                   "\t"                           it may be writ-
                   "\n"                           ten as ECHO:
               with  the  same                    [a-z]+   ECHO;
               result,                            is the same  as
               although     in                    the      above.
               different                          Since       the
               style.      The                    default  action

Lex - A Lexical Analyzer Generator                      PS1:16-13

               is   just    to                    characters   in
               print the char-                    the       words
               acters   found,                    recognized. The
               one  might  ask                    last  character
               why   give    a                    in  the  string
               rule, like this                    matched can  be
               one,      which                    accessed by
               merely   speci-                   yytext[yyleng-1]
               fies        the
               default action?                         Occasion-
               Such rules  are                    ally,   a   Lex
               often  required                    action      may
               to avoid match-                    decide  that  a
               ing  some other                    rule  has   not
               rule  which  is                    recognized  the
               not    desired.                    correct span of
               For example, if                    characters. Two
               there is a rule                    routines    are
               which   matches                    provided to aid
               read   it  will                    with       this
               normally  match                    situation.
               the   instances                    First, yymore()
               of  read   con-                    can  be  called
               tained in bread                    to     indicate
               or readjust; to                    that  the  next
               avoid  this,  a                    input   expres-
               rule   of   the                    sion recognized
               form  [a-z]+ is                    is to be tacked
               needed. This is                    on  to  the end
               explained                          of this  input.
               further below.                     Normally,   the
                                                  next      input
                    Sometimes                     string    would
               it is more con-                    overwrite   the
               venient to know                    current   entry
               the end of what                    in      yytext.
               has been found;                    Second,  yyless
               hence  Lex also                    (n)   may    be
               provides      a                    called to indi-
               count yyleng of                    cate  that  not
               the  number  of                    all the charac-
               characters                         ters matched by
               matched.     To                    the   currently
               count  both the                    successful
               number of words                    expression  are
               and  the number                    wanted    right
               of   characters                    now.  The argu-
               in words in the                    ment  n   indi-
               input, the user                    cates       the
               might write                        number of char-
   [a-zA-Z]+   {words++; chars += yyleng;}        acters       in
               which   accumu-                    yytext  to   be
               lates  in chars                    retained.
               the  number  of                    Further charac-

PS1:16-14                      Lex - A Lexical Analyzer Generator

               ters previously                    that the  final
               matched     are                    quote terminat-
               returned to the                    ing the  string
               input.     This                    should       be
               provides    the                    picked  up   in
               same  sort   of                    the        code
               lookahead                          labeled  ``nor-
               offered by  the                    mal    process-
               / operator, but                    ing''.
               in a  different
               form.                                   The  func-
                                                  tion   yyless()
                    Example:                      might  be  used
               Consider      a                    to    reprocess
               language  which                    text in various
               defines       a                    circumstances.
               string as a set                    Consider the  C
               of   characters                    problem of dis-
               between  quota-                    tinguishing the
               tion (") marks,                    ambiguity    of
               and    provides                    ``=-a''.   Sup-
               that to include                    pose    it   is
               a " in a string                    desired      to
               it must be pre-                    treat  this  as
               ceded by  a  \.                    ``=-  a''   but
               The     regular                    print   a  mes-
               expression                         sage.   A  rule
               which   matches                    might be
               that  is  some-      =-[a-zA-Z]   {
               what confusing,                   printf("Op (=-) ambiguous\n");
               so   that    it                   yyless(yyleng-1);
               might        be                   ... action for =- ...
               preferable   to                   }
               write                              which prints  a
  \"[^"]*   {                                     message,
            if (yytext[yyleng-1] == '\\')         returns     the
                 yymore();                        letter    after
            else                                  the operator to
                 ... normal user processing       the       input
            }                                     stream,     and
               which     will,                    treats      the
               when faced with                    operator     as
               a  string  such                    ``=-''.  Alter-
               as   "abc\"def"                    natively     it
               first match the                    might        be
               five characters                    desired      to
               "abc\; then the                    treat  this  as
               call         to                    ``=   -a''.  To
               yymore()   will                    do  this,  just
               cause  the next                    return      the
               part   of   the                    minus  sign  as
               string,   "def,                    well   as   the
               to be tacked on                    letter  to  the
               the  end.  Note                    input:

Lex - A Lexical Analyzer Generator                      PS1:16-15

 =-[a-zA-Z]   {                                        character
              printf("Op (=-) ambiguous\n");           c  on  the
              yyless(yyleng-2);                        output;
              ... action for = ...                     and
               will    perform                    3)   unput(c)
               the       other                         pushes the
               interpretation.                         character
               Note  that  the                         c     back
               expressions for                         onto   the
               the  two  cases                         input
               might      more                         stream  to
               easily be writ-                         be    read
               ten                                     later   by
               =-/[A-Za-z]                             input().
               in  the   first
               case and                           By      default
                 =/-[A-Za-z]                      these  routines
               in the  second;                    are provided as
               no backup would                    macro   defini-
               be required  in                    tions, but  the
               the        rule                    user  can over-
               action.  It  is                    ride  them  and
               not   necessary                    supply  private
               to    recognize                    versions. These
               the whole iden-                    routines define
               tifier       to                    the   relation-
               observe     the                    ship    between
               ambiguity.  The                    external  files
               possibility  of                    and    internal
               ``=-3'',   how-                    characters, and
               ever, makes                        must   all   be
                 =-/[^ \t\n]                      retained     or
               a still  better                    modified   con-
               rule.                              sistently. They
                                                  may   be  rede-
                    In   addi-                    fined, to cause
               tion  to  these                    input or output
               routines,   Lex                    to be transmit-
               also    permits                    ted  to or from
               access  to  the                    strange places,
               I/O routines it                    including other
               uses. They are:                    programs     or
               1)   input()                       memory; but the
                    which                         character   set
                    returns                       used  must   be
                    the   next                    consistent   in
                    input                         all routines; a
                    character;                    value  of  zero
                                                  returned     by
               2)   output(c)                     input must mean
                    which                         end  of   file;
                    writes the                    and  the  rela-

PS1:16-16                      Lex - A Lexical Analyzer Generator

               tionship                           more  input  to
               between   unput                    arrive  from  a
               and input  must                    new  source. In
               be  retained or                    this case,  the
               the  Lex  look-                    user     should
               ahead  will not                    provide       a
               work. Lex  does                    yywrap    which
               not  look ahead                    arranges    for
               at  all  if  it                    new  input  and
               does  not  have                    returns      0.
               to,  but  every                    This  instructs
               rule  ending in                    Lex to continue
               + * ? or  $  or                    processing. The
               containing    /                    default  yywrap
               implies   look-                    always  returns
               ahead.    Look-                    1.
               ahead  is  also
               necessary    to                         This  rou-
               match        an                    tine  is also a
               expression that                    convenient
               is a prefix  of                    place  to print
               another expres-                    tables,    sum-
               sion. See below                    maries, etc. at
               for  a  discus-                    the  end  of  a
               sion   of   the                    program.   Note
               character   set                    that it is  not
               used  by   Lex.                    possible     to
               The    standard                    write a  normal
               Lex     library                    rule      which
               imposes  a  100                    recognizes
               character limit                    end-of-file;
               on backup.                         the only access
                                                  to  this condi-
                    Another                       tion is through
               Lex     library                    yywrap.      In
               routine    that                    fact, unless  a
               the  user  will                    private version
               sometimes  want                    of  input()  is
               to  redefine is                    supplied a file
               yywrap()  which                    containing
               is called when-                    nulls cannot be
               ever        Lex                    handled,  since
               reaches      an                    a  value  of  0
               end-of-file. If                    returned     by
               yywrap  returns                    input  is taken
               a 1,  Lex  con-                    to  be  end-of-
               tinues with the                    file.
               normal   wrapup
               on    end    of                    5.    Ambiguous
               input.    Some-                    Source Rules.
               times, however,
               it   is    con-                         Lex    can
               venient      to                    handle  ambigu-
               arrange     for                    ous  specifica-

Lex - A Lexical Analyzer Generator                      PS1:16-17

               tions.     When                    integer  and so
               more  than  one                    the  identifier
               expression  can                    interpretation
               match       the                    is used.
               current  input,
               Lex chooses  as                         The  prin-
               follows:                           ciple        of
                                                  preferring  the
               1)   The  long-                    longest   match
                    est  match                    makes     rules
                    is    pre-                    containing
                    ferred.                       expressions
                                                  like         .*
               2)   Among                         dangerous.  For
                    rules                         example,
                    which                              '.*'
                    matched                       might  seem   a
                    the   same                    good   way   of
                    number  of                    recognizing   a
                    charac-                       string  in sin-
                    ters,  the                    gle quotes. But
                    rule given                    it  is an invi-
                    first   is                    tation for  the
                    preferred.                    program to read
                                                  far      ahead,
               Thus,   suppose                    looking  for  a
               the rules                          distant  single
      integer   keyword action ...;               quote.
      [a-z]+    identifier action ...;            Presented  with
               to be given  in                    the input
               that order.  If     'first' quoted string here, 'second' here
               the  input   is                    the       above
               integers, it is                    expression will
               taken   as   an                    match
               identifier,            'first' quoted string here, 'second'
               because  [a-z]+                    which is  prob-
               matches 8 char-                    ably  not  what
               acters    while                    was  wanted.  A
               integer matches                    better  rule is
               only 7. If  the                    of the form
               input        is                       '[^'\n]*'
               integer,   both                    which,  on  the
               rules  match  7                    above    input,
               characters, and                    will stop after
               the     keyword                    'first'.    The
               rule         is                    consequences of
               selected                           errors     like
               because it  was                    this are  miti-
               given    first.                    gated   by  the
               Anything                           fact that the .
               shorter   (e.g.                    operator   will
               int)  will  not                    not match  new-
               match       the                    line.      Thus
               expression                         expressions

PS1:16-18                      Lex - A Lexical Analyzer Generator

               like .* stop on                    includes    he,
               the     current                    Lex  will  nor-
               line. Don't try                    mally       not
               to defeat  this                    recognize   the
               with    expres-                    instances of he
               sions      like                    included     in
               (.|\n)+      or                    she, since once
               equivalents;                       it has passed a
               the   Lex  gen-                    she those char-
               erated  program                    acters      are
               will   try   to                    gone.
               read the entire
               input     file,                         Sometimes
               causing  inter-                    the  user would
               nal      buffer                    like  to  over-
               overflows.                         ride       this
                                                  choice.     The
                    Note  that                    action   REJECT
               Lex is normally                    means  ``go  do
               partitioning                       the next alter-
               the       input                    native.''    It
               stream,     not                    causes whatever
               searching   for                    rule was second
               all    possible                    choice    after
               matches of each                    the     current
               expression.                        rule to be exe-
               This means that                    cuted.      The
               each  character                    position of the
               is    accounted                    input   pointer
               for  once   and                    is     adjusted
               only  once. For                    accordingly.
               example,   sup-                    Suppose     the
               pose    it   is                    user     really
               desired      to                    wants  to count
               count                              the    included
               occurrences  of                    instances    of
               both she and he                    he:
               in   an   input                 she   {s++; REJECT;}
               text.  Some Lex                 he    {h++; REJECT;}
               rules   to   do                 \n    |
               this might be                   .     ;
                 she   s++;                       these rules are
                 he    h++;                       one    way   of
                 \n    |                          changing    the
                 .     ;                          previous  exam-
               where the  last                    ple to do  just
               two       rules                    that.     After
               ignore   every-                    counting   each
               thing   besides                    expression,  it
               he   and   she.                    is    rejected;
               Remember that .                    whenever
               does        not                    appropriate,
               include    new-                    the       other
               line. Since she                    expression will

Lex - A Lexical Analyzer Generator                      PS1:16-19

               then         be                    input    stream
               counted.     In                    but  to  detect
               this   example,                    all examples of
               of course,  the                    some  items  in
               user could note                    the input,  and
               that        she                    the   instances
               includes he but                    of these  items
               not vice versa,                    may  overlap or
               and   omit  the                    include    each
               REJECT   action                    other.  Suppose
               on he; in other                    a digram  table
               cases, however,                    of the input is
               it would not be                    desired;   nor-
               possible      a                    mally       the
               priori  to tell                    digrams   over-
               which     input                    lap,   that  is
               characters were                    the word the is
               in         both                    considered   to
               classes.                           contain both th
                                                  and  he. Assum-
                    Consider                      ing   a    two-
               the two rules                      dimensional
          a[bc]+   { ... ; REJECT;}               array     named
          a[cd]+   { ... ; REJECT;}               digram   to  be
               If the input is                    incremented,
               ab,   only  the                    the appropriate
               first      rule                    source is
               matches, and on     %%
               ad   only   the     [a-z][a-z]   {
               second matches.                  digram[yytext[0]][yytext[1]]++;
               The       input                  REJECT;
               string     accb                  }
               matches     the     .            ;
               first  rule for     \n           ;
               four characters                    where       the
               and   then  the                    REJECT       is
               second rule for                    necessary    to
               three   charac-                    pick    up    a
               ters.  In  con-                    letter     pair
               trast,      the                    beginning    at
               input      accd                    every   charac-
               agrees with the                    ter,     rather
               second rule for                    than  at  every
               four characters                    other   charac-
               and  then   the                    ter.
               first  rule for
               three.                             6.  Lex  Source
                    In    gen-
               eral, REJECT is                         Remember
               useful whenever                    the  format  of
               the  purpose of                    the Lex source:
               Lex is  not  to                    {definitions}
               partition   the                    %%

PS1:16-20                      Lex - A Lexical Analyzer Generator

               {rules}                                 prior   to
               %%                                      the  first
               {user routines}                         %%  delim-
               So far only the                         iter  will
               rules have been                         be  exter-
               described.  The                         nal to any
               user      needs                         function
               additional                              in     the
               options,                                code;   if
               though,      to                         it appears
               define    vari-                         immedi-
               ables  for  use                         ately
               in  his program                         after  the
               and for use  by                         first  %%,
               Lex.  These can                         it appears
               go  either   in                         in      an
               the definitions                         appropri-
               section  or  in                         ate  place
               the  rules sec-                         for
               tion.                                   declara-
                                                       tions   in
                    Remember                           the  func-
               that   Lex   is                         tion writ-
               turning     the                         ten by Lex
               rules   into  a                         which con-
               program.    Any                         tains  the
               source      not                         actions.
               intercepted  by                         This
               Lex  is  copied                         material
               into  the  gen-                         must  look
               erated program.                         like  pro-
               There are three                         gram frag-
               classes of such                         ments, and
               things.                                 should
               1)   Any   line                         the  first
                    which   is                         Lex rule.
                    not   part
                    of  a  Lex                         As a  side
                    rule    or                         effect  of
                    action                             the above,
                    which                              lines
                    begins                             which
                    with     a                         begin with
                    blank   or                         a blank or
                    tab     is                         tab,   and
                    copied                             which con-
                    into   the                         tain     a
                    Lex   gen-                         comment,
                    erated                             are passed
                    program.                           through to
                    Such                               the   gen-
                    source                             erated
                    input                              program.

Lex - A Lexical Analyzer Generator                      PS1:16-21

                    This   can                         etc.,   is
                    be used to                         copied out
                    include                            after  the
                    comments                           Lex   out-
                    in  either                         put.
                    the    Lex
                    source  or                         Defini-
                    the   gen-                    tions  intended
                    erated                        for   Lex   are
                    code.  The                    given    before
                    comments                      the  first   %%
                    should                        delimiter.  Any
                    follow the                    line  in   this
                    host                          section     not
                    language                      contained
                    conven-                       between  %{ and
                    tion.                         %}, and  begin-
                                                  ing  in  column
               2)   Anything                      1,  is  assumed
                    included                      to  define  Lex
                    between                       substitution
                    lines con-                    strings.    The
                    taining                       format of  such
                    only    %{                    lines is
                    and %}  is                  name translation
                    copied out                    and  it  causes
                    as  above.                    the      string
                    The delim-                    given   as    a
                    iters  are                    translation  to
                    discarded.                    be   associated
                    This  for-                    with  the name.
                    mat   per-                    The  name   and
                    mits                          translation
                    entering                      must         be
                    text  like                    separated by at
                    preproces-                    least one blank
                    sor state-                    or tab, and the
                    ments that                    name must begin
                    must begin                    with  a letter.
                    in  column                    The translation
                    1,      or                    can   then   be
                    copying                       called  out  by
                    lines that                    the {name} syn-
                    do     not                    tax in a  rule.
                    look  like                    Using  {D}  for
                    programs.                     the digits  and
                                                  {E}    for   an
               3)   Anything                      exponent field,
                    after  the                    for    example,
                    third   %%                    might  abbrevi-
                    delimiter,                    ate   rules  to
                    regardless                    recognize
                    of    for-                    numbers:
                    mats,             D                   [0-9]

PS1:16-22                      Lex - A Lexical Analyzer Generator

   E                   [DEde][-+]?{D}+            adjustments  to
   %%                                             the     default
   {D}+                printf("integer");         size of  arrays
   {D}+"."{D}*({E})?   |                          within      Lex
   {D}*"."{D}+({E})?   |                          itself      for
   {D}+{E}                                        larger   source
               Note the  first                    programs. These
               two  rules  for                    possibilities
               real   numbers;                    are   discussed
               both  require a                    below     under
               decimal   point                    ``Summary    of
               and  contain an                    Source     For-
               optional                           mat,''  section
               exponent field,                    12.
               but  the  first
               requires     at                    7. Usage.
               least one digit
               before      the                         There  are
               decimal   point                    two   steps  in
               and  the second                    compiling a Lex
               requires     at                    source program.
               least one digit                    First, the  Lex
               after       the                    source  must be
               decimal  point.                    turned  into  a
               To    correctly                    generated  pro-
               handle      the                    gram   in   the
               problem   posed                    host    general
               by   a  Fortran                    purpose
               expression such                    language.  Then
               as     35.EQ.I,                    this    program
               which does  not                    must   be  com-
               contain  a real                    piled       and
               number,       a                    loaded, usually
               context-                           with a  library
               sensitive  rule                    of  Lex subrou-
               such as                            tines. The gen-
      [0-9]+/"."EQ   printf("integer");           erated  program
               could  be  used                    is  on  a  file
               in  addition to                    named lex.yy.c.
               the normal rule                    The I/O library
               for integers.                      is  defined  in
                                                  terms of the  C
                    The defin-                    standard
               itions  section                    library [6].
               may  also  con-
               tain other com-                         The C pro-
               mands,  includ-                    grams generated
               ing  the selec-                    by   Lex    are
               tion of a  host                    slightly   dif-
               language,     a                    ferent       on
               character   set                    OS/370, because
               table,  a  list                    the OS compiler
               of start condi-                    is  less power-
               tions,       or                    ful  than   the

Lex - A Lexical Analyzer Generator                      PS1:16-23

               UNIX   or  GCOS                    analyzer.  Nor-
               compilers,  and                    mally,      the
               does   less  at                    default    main
               compile time. C                    program on  the
               programs   gen-                    Lex     library
               erated on  GCOS                    calls this rou-
               and   UNIX  are                    tine,   but  if
               the same.                          Yacc is loaded,
                                                  and   its  main
                    UNIX.  The                    program      is
               library      is                    used, Yacc will
               accessed by the                    call   yylex().
               loader     flag                    In   this  case
               -ll.   So    an                    each  Lex  rule
               appropriate set                    should end with
               of commands is                     return(token);
     lex  source  cc  lex.yy.c                    where       the
     -ll                                          appropriate
               The   resulting                    token value  is
               program      is                    returned.    An
               placed  on  the                    easy way to get
               usual      file                    access       to
               a.out for later                    Yacc's    names
               execution.   To                    for  tokens  is
               use  Lex   with                    to compile  the
               Yacc see below.                    Lex output file
               Although    the                    as part of  the
               default Lex I/O                    Yacc     output
               routines    use                    file by placing
               the  C standard                    the line
               library,    the                 # include "lex.yy.c"
               Lex    automata                    in   the   last
               themselves   do                    section of Yacc
               not  do  so; if                    input.  Suppos-
               private    ver-                    ing the grammar
               sions of input,                    to   be   named
               output      and                    ``good''    and
               unput       are                    the     lexical
               given,      the                    rules   to   be
               library  can be                    named
               avoided.                           ``better''  the
                                                  UNIX    command
               8.   Lex    and                    sequence    can
               Yacc.                              just be:
                                                yacc good
                    If     you                  lex better
               want to use Lex                  cc y.tab.c -ly -ll
               with Yacc, note                    The        Yacc
               that  what  Lex                    library   (-ly)
               writes   is   a                    should       be
               program   named                    loaded   before
               yylex(),    the                    the         Lex
               name   required                    library,     to
               by Yacc for its                    obtain  a  main

PS1:16-24                      Lex - A Lexical Analyzer Generator

               program   which                    such      input
               invokes     the                    items  as 49.63
               Yacc    parser.                    or X7. Further-
               The generations                    more, it incre-
               of Lex and Yacc                    ments the abso-
               programs can be                    lute  value  of
               done  in either                    all    negative
               order.                             numbers divisi-
                                                  ble  by  7.  To
               9. Examples.                       avoid     this,
                                                  just add a  few
                    As       a                    more      rules
               trivial   prob-                    after       the
               lem,   consider                    active  one, as
               copying      an                    here:
               input      file     %%
               while  adding 3                            int k;
               to every  posi-     -?[0-9]+               {
               tive     number                            k = atoi(yytext);
               divisible by 7.                            printf("%d",
               Here is a suit-                              k%7 == 0 ? k+3 : k);
               able Lex source                            }
               program             -?[0-9.]+              ECHO;
      %%                           [A-Za-z][A-Za-z0-9]+   ECHO;
               int k;                             Numerical
      [0-9]+   {                                  strings    con-
               k = atoi(yytext);                  taining a ``.''
               if (k%7 == 0)                      or  preceded by
                    printf("%d", k+3);            a  letter  will
               else                               be picked up by
                    printf("%d",k);               one of the last
               }                                  two  rules, and
               to   do    just                    not    changed.
               that.  The rule                    The if-else has
               [0-9]+   recog-                    been   replaced
               nizes   strings                    by  a  C condi-
               of digits; atoi                    tional  expres-
               converts    the                    sion   to  save
               digits       to                    space; the form
               binary      and                    a?b:c     means
               stores      the                    ``if a  then  b
               result   in  k.                    else c''.
               The operator  %
               (remainder)  is                         For     an
               used  to  check                    example      of
               whether   k  is                    statistics
               divisible by 7;                    gathering, here
               if it is, it is                    is  a   program
               incremented  by                    which    histo-
               3   as   it  is                    grams       the
               written out. It                    lengths      of
               may be objected                    words, where  a
               that this  pro-                    word is defined
               gram will alter                    as a string  of

Lex - A Lexical Analyzer Generator                      PS1:16-25

               letters.                           gle   precision
           int lengs[100];                        Fortran.
  %%                                              Because Fortran
  [a-z]+   lengs[yyleng]++;                       does  not  dis-
  .        |                                      tinguish  upper
  \n       ;                                      and lower  case
  %%                                              letters,   this
  yywrap()                                        routine  begins
  {                                               by  defining  a
  int i;                                          set of  classes
  printf("Length  No. words\n");                  including  both
  for(i=0; i<100; i++)                            cases  of  each
       if (lengs[i] > 0)                          letter:
            printf("%5d%10d\n",i,lengs[i]);         a     [aA]
  return(1);                                        b     [bB]
  }                                                 c     [cC]
               This    program                      ...
               accumulates the                      z     [zZ]
               histogram,                         An   additional
               while producing                    class    recog-
               no output.   At                    nizes     white
               the  end of the                    space:
               input it prints                      W   [ \t]*
               the  table. The                    The first  rule
               final statement                    changes  ``dou-
               return(1);                         ble precision''
               indicates  that                    to ``real'', or
               Lex  is to per-                    ``DOUBLE PRECI-
               form    wrapup.                    SION''       to
               If       yywrap                    ``REAL''.
               returns    zero     {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
               (false)      it          printf(yytext[0]=='d'? "real" : "REAL");
               implies    that          }
               further   input                    Care  is  taken
               is    available                    throughout this
               and the program                    program      to
               is to  continue                    preserve    the
               reading     and                    case (upper  or
               processing.  To                    lower)  of  the
               provide       a                    original   pro-
               yywrap     that                    gram.  The con-
               never   returns                    ditional opera-
               true causes  an                    tor  is used to
               infinite loop.                     select      the
                                                  proper  form of
                    As       a                    the    keyword.
               larger example,                    The  next  rule
               here  are  some                    copies     con-
               parts of a pro-                    tinuation  card
               gram written by                    indications  to
               N.  L.  Schryer                    avoid confusing
               to convert dou-                    them with  con-
               ble   precision                    stants:
               Fortran to sin-                 ^"     "[^ 0]   ECHO;

PS1:16-26                      Lex - A Lexical Analyzer Generator

               In the  regular                    respelled    to
               expression, the                    remove    their
               quotes surround                    initial  d.  By
               the  blanks. It                    using the array
               is  interpreted                    yytext the same
               as  ``beginning                    action suffices
               of  line,  then                    for   all   the
               five    blanks,                    names  (only  a
               then   anything                    sample   of   a
               but   blank  or                    rather     long
               zero.''    Note                    list  is  given
               the   two  dif-                    here).
               ferent meanings      {d}{s}{i}{n}         |
               of   ^.   There      {d}{c}{o}{s}         |
               follow     some      {d}{s}{q}{r}{t}      |
               rules to change      {d}{a}{t}{a}{n}      |
               double   preci-      ...
               sion  constants      {d}{f}{l}{o}{a}{t}   printf("%s",yytext+1);
               to     ordinary                    Another list of
               floating   con-                    names must have
               stants.                            initial       d
  [0-9]+{W}{d}{W}[+-]?{W}[0-9]+     |             changed to ini-
  [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |       tial a:
  "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+{d}{l}{o}{g}     |
       /* convert constants */       {d}{l}{o}{g}10   |
       for(p=yytext; *p != 0; p++)   {d}{m}{i}{n}1    |
            {                        {d}{m}{a}{x}1    {
            if (*p == 'd' || *p == 'D')               yytext[0] =+ 'a' - 'd';
                 *p=+ 'e'- 'd';                       ECHO;
            ECHO;                                     }
            }                                     And one routine
               After       the                    must  have ini-
               floating  point                    tial d  changed
               constant     is                    to initial r:
               recognized,  it     {d}1{m}{a}{c}{h}   {yytext[0] =+ 'r'  - 'd';
               is  scanned  by
               the for loop to
               find the letter                    To  avoid  such
               d   or  D.  The                    names  as dsinx
               program    than                    being  detected
               adds   'e'-'d',                    as instances of
               which  converts                    dsin,      some
               it  to the next                    final     rules
               letter  of  the                    pick up  longer
               alphabet.   The                    words  as iden-
               modified   con-                    tifiers     and
               stant,      now                    copy  some sur-
               single-                            viving  charac-
               precision,   is                    ters:
               written     out             [A-Za-z][A-Za-z0-9]*   |
               again.    There             [0-9]+                 |
               follow a series             \n                     |
               of  names which             .                      ECHO;
               must         be                    Note that  this

Lex - A Lexical Analyzer Generator                      PS1:16-27

               program  is not                    facility  simi-
               complete;    it                    lar to that for
               does  not  deal                    adjacent  right
               with the  spac-                    context, but it
               ing problems in                    is  unlikely to
               Fortran or with                    be  as  useful,
               the use of key-                    since often the
               words as  iden-                    relevant   left
               tifiers.                           context
               10.  Left  Con-                    appeared   some
               text     Sensi-                    time   earlier,
               tivity.                            such as at  the
                                                  beginning  of a
                    Sometimes                     line.
               it is desirable
               to have several                         This  sec-
               sets of lexical                    tion  describes
               rules   to   be                    three means  of
               applied at dif-                    dealing    with
               ferent times in                    different
               the  input. For                    environments: a
               example, a com-                    simple  use  of
               piler   prepro-                    flags,     when
               cessor    might                    only   a    few
               distinguish                        rules    change
               preprocessor                       from        one
               statements  and                    environment  to
               analyze    them                    another,    the
               differently                        use   of  start
               from   ordinary                    conditions   on
               statements.                        rules,  and the
               This   requires                    possibility  of
               sensitivity  to                    making multiple
               prior  context,                    lexical
               and  there  are                    analyzers   all
               several ways of                    run   together.
               handling   such                    In  each  case,
               problems. The ^                    there are rules
               operator,   for                    which recognize
               example,  is  a                    the   need   to
               prior   context                    change      the
               operator,                          environment  in
               recognizing                        which  the fol-
               immediately                        lowing    input
               preceding  left                    text         is
               context just as                    analyzed,   and
               $    recognizes                    set some param-
               immediately                        eter to reflect
               following right                    the     change.
               context.  Adja-                    This may  be  a
               cent  left con-                    flag explicitly
               text  could  be                    tested  by  the
               extended,    to                    user's   action
               produce       a                    code;  such   a

PS1:16-28                      Lex - A Lexical Analyzer Generator

               flag   is   the                    letter       a,
               simplest way of                    changing  magic
               dealing    with                    to  second   on
               the    problem,                    every      line
               since   Lex  is                    which     began
               not involved at                    with the letter
               all.  It may be                    b, and changing
               more       con-                    magic  to third
               venient,   how-                    on  every  line
               ever,  to  have                    which     began
               Lex    remember                    with the letter
               the  flags   as                    c.   All  other
               initial  condi-                    words  and  all
               tions  on   the                    other lines are
               rules. Any rule                    left unchanged.
               may be  associ-
               ated   with   a                         These
               start    condi-                    rules   are  so
               tion.   It will                    simple that the
               only be  recog-                    easiest  way to
               nized  when Lex                    do this job  is
               is   in    that                    with a flag:
               start    condi-              int flag;
               tion.       The      %%
               current   start      ^a      {flag = 'a'; ECHO;}
               condition   may      ^b      {flag = 'b'; ECHO;}
               be  changed  at      ^c      {flag = 'c'; ECHO;}
               any       time.      \n      {flag =  0 ; ECHO;}
               Finally, if the      magic   {
               sets  of  rules              switch (flag)
               for   the  dif-              {
               ferent environ-              case 'a': printf("first"); break;
               ments  are very              case 'b': printf("second"); break;
               dissimilar,                  case 'c': printf("third"); break;
               clarity  may be              default: ECHO; break;
               best   achieved              }
               by      writing              }
               several    dis-                    should be  ade-
               tinct   lexical                    quate.
               analyzers,  and
               switching  from                         To  handle
               one to  another                    the  same prob-
               as desired.                        lem with  start
                    Consider                      each start con-
               the   following                    dition  must be
               problem:   copy                    introduced   to
               the   input  to                    Lex    in   the
               the     output,                    definitions
               changing    the                    section  with a
               word  magic  to                    line reading
               first  on every               %Start   name1 name2 ...
               line      which                    where the  con-
               began  with the                    ditions  may be

Lex - A Lexical Analyzer Generator                      PS1:16-29

               named  in   any          ^b                {ECHO; BEGIN BB;}
               order. The word          ^c                {ECHO; BEGIN CC;}
               Start  may   be          \n                {ECHO; BEGIN 0;}
               abbreviated  to          <AA>magic         printf("first");
               s  or  S.   The          <BB>magic         printf("second");
               conditions  may          <CC>magic         printf("third");
               be   referenced                    where the logic
               at  the head of                    is  exactly the
               a rule with the                    same as in  the
               <> brackets:                       previous method
              <name1>expression                   of handling the
               is a rule which                    problem,    but
               is  only recog-                    Lex  does   the
               nized when  Lex                    work     rather
               is in the start                    than the user's
               condition                          code.
               name1. To enter
               a start  condi-                    11.   Character
               tion,   execute                    Set.
               the      action
               statement                               The   pro-
                BEGIN name1;                      grams generated
               which   changes                    by  Lex  handle
               the  start con-                    character   I/O
               dition       to                    only    through
               name1.       To                    the    routines
               resume the nor-                    input,  output,
               mal state,                         and unput. Thus
                  BEGIN 0;                        the   character
               resets the ini-                    representation
               tial  condition                    provided     in
               of   the    Lex                    these  routines
               automaton                          is accepted  by
               interpreter.  A                    Lex         and
               rule   may   be                    employed     to
               active       in                    return   values
               several   start                    in yytext.  For
               conditions:                        internal  use a
             <name1,name2,name3>                  character    is
               is a legal pre-                    represented  as
               fix.   Any rule                    a small integer
               not   beginning                    which,  if  the
               with   the   <>                    standard
               prefix operator                    library      is
               is       always                    used,   has   a
               active.                            value  equal to
                                                  the     integer
                    The   same                    value   of  the
               example      as                    bit     pattern
               before  can  be                    representing
               written:                           the   character
     %START AA BB CC                              on   the   host
     %%                                           computer.  Nor-
     ^a                {ECHO; BEGIN AA;}          mally,      the

PS1:16-30                      Lex - A Lexical Analyzer Generator

               letter   a   is                    through     26,
               represented  as                    newline    into
               the  same  form                    27,   +  and  -
               as  the charac-                    into 28 and 29,
               ter    constant                    and  the digits
               'a'.   If  this                    into 30 through
               interpretation                     39.   Note  the
               is  changed, by                    escape for new-
               providing   I/O                    line.    If   a
               routines  which                    table  is  sup-
               translate   the                    plied,    every
               characters, Lex                    character  that
               must  be   told                    is   to  appear
               about   it,  by                    either  in  the
               giving a trans-                    rules or in any
               lation   table.                    valid     input
               This table must                    must         be
               be    in    the                    included in the
               definitions                        table. No char-
               section,    and                    acter  may   be
               must be  brack-                    assigned    the
               eted  by  lines                    number  0,  and
               containing                         no    character
               only    ``%T''.                    may be assigned
               The table  con-                    a bigger number
               tains  lines of                    than  the  size
               the form                           of the hardware
        {integer} {character string}              character set.
               which  indicate
               the value asso-                    12. Summary  of
               ciated     with                    Source Format.
               each character.
               Thus  the  next                         The   gen-
               example                            eral  form of a
                  %T                              Lex source file
                   1    Aa                        is:
                   2    Bb                      {definitions}
                  ...                           %%
                  26    Zz                      {rules}
                  27    \n                      %%
                  28    +                       {user subroutines}
                  29    -                         The definitions
                  30    0                         section    con-
                  31    1                         tains a  combi-
                  ...                             nation of
                  39    9
                  %T                              1)   Defini-
                                                       tions,  in
               Sample character table.                 the   form
               maps the  lower                         ``name
               and  upper case                         space
               letters                                 transla-
               together   into                         tion''.
               the integers  1

Lex - A Lexical Analyzer Generator                      PS1:16-31

               2)   Included                      ``expression
                    code,   in                    action''  where
                    the   form                    the  action may
                    ``space                       be continued on
                    code''.                       succeeding
                                                  lines by  using
               3)   Included                      braces  to del-
                    code,   in                    imit it.
                    the form
                       %{                              Regular
                       code                       expressions  in
                       %}                         Lex   use   the
               4)   Start con-                    following
                    ditions,                      operators:
                    given   in     x        the character "x"
                    the form       "x"      an "x", even if x is an operator.
                %S name1 name2 ... \x       an "x", even if x is an operator.
               5)   Character      [xy]     the character x or y.
                    set            [x-z]    the characters x, y or z.
                    tables, in     [^x]     any character but x.
                    the form       .        any character but newline.
          %T                       ^x       an x at the beginning of a line.
          number space character-st<y>x     an x when Lex is in start condition y.
          ...                      x$       an x at the end of a line.
          %T                       x?       an optional x.
               6)   Changes to     x*       0,1,2, ... instances of x.
                    internal       x+       1,2,3, ... instances of x.
                    array          x|y      an x or a y.
                    sizes,  in     (x)      an x.
                    the form       x/y      an x but only if followed by y.
                     %x  nnn       {xx}     the translation of xx from the
                    where  nnn              definitions section.
                    is       a     x{m,n}   m through n occurrences of x
                    integer                       13. Caveats and
                    represent-                    Bugs.
                    ing     an
                    array size                         There  are
                    and      x                    pathological
                    selects                       expressions
                    the param-                    which   produce
                    eter    as                    exponential
                    follows:                      growth  of  the
        Letter          Parameter                 tables     when
          p      positions                        converted    to
          n      states                           deterministic
          e      tree nodes                       machines;  for-
          a      transitions                      tunately,  they
          k      packed character classes         are rare.
          o      output array size
               Lines  in   the                    does not rescan
               rules   section                    the      input;
               have  the  form                    instead      it

PS1:16-32                      Lex - A Lexical Analyzer Generator

               remembers   the                    Eric Schmidt.
               results  of the
               previous  scan.                    15. References.
               This means that
               if a rule  with                    1.   B. W. Ker-
               trailing   con-                         nighan and
               text is  found,                         D.      M.
               and REJECT exe-                         Ritchie,
               cuted, the user                         The C Pro-
               must  not  have                         gramming
               used  unput  to                         Language,
               change      the                         Prentice-
               characters                              Hall,   N.
               forthcoming                             J. (1978).
               from the  input
               stream. This is                    2.   B. W. Ker-
               the  only  res-                         nighan,
               triction on the                         Ratfor:  A
               user's  ability                         Preproces-
               to   manipulate                         sor for  a
               the    not-yet-                         Rational
               processed                               Fortran,
               input.                                  Software -
               14. Acknowledg-                         and
               ments.                                  Experi-
                                                       ence,   5,
                    As  should                         pp.   395-
               be obvious from                         496
               the above,  the                         (1975).
               outside  of Lex
               is patterned on                    3.   S.      C.
               Yacc   and  the                         Johnson,
               inside on Aho's                         Yacc:  Yet
               string matching                         Another
               routines.                               Compiler
               Therefore, both                         Compiler,
               S.  C.  Johnson                         Computing
               and  A.  V. Aho                         Science
               are really ori-                         Technical
               ginators     of                         Report No.
               much of Lex, as                         32,  1975,
               well         as                         Bell
               debuggers    of                         Labora-
               it. Many thanks                         tories,
               are   due    to                         Murray
               both.                                   Hill,   NJ
                    The   code
               of  the current                    4.   A. V.  Aho
               version of  Lex                         and  M. J.
               was   designed,                         Corasick,
               written,    and                         Efficient
               debugged     by                         String

Lex - A Lexical Analyzer Generator                      PS1:16-33

                    An Aid  to
                    Comm.  ACM
                    18,   333-

               5.   B. W. Ker-
                    nighan, D.
                    M. Ritchie
                    and  K. L.
                    QED   Text
                    Report No.
                    5,   1972,
                    Hill,   NJ

               6.   D.      M.
                    tion.  See
                    also M. E.
                    Lesk,  The
                    Portable C
                    Report No.
                    31,   Bell
                    Hill,   NJ

Generated on 2014-07-04 21:17:45 by $MirOS: src/scripts/roff2htm,v 1.79 2014/02/10 00:36:11 tg Exp $

These manual pages and other documentation are copyrighted by their respective writers; their source is available at our CVSweb, AnonCVS, and other mirrors. The rest is Copyright © 2002‒2014 The MirOS Project, Germany.
This product includes material provided by Thorsten Glaser.

This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.