MirOS Manual: 29.diction(USD)


Writing Tools - the STYLE and DICTION Programs           USD:32-1

         Writing Tools - The STYLE and DICTION Programs

                          L. L. Cherry

                     AT&T Bell Laboratories
                  Murray Hill, New Jersey 07974

                          W. Vesterman

                       Livingston College
                       Rutgers University

                            ABSTRACT

          Text processing systems are now in  heavy  use  in
     many companies to format documents. With many documents
     stored on line, it has become possible to use computers
     to  study writing style itself and to help writers pro-
     duce better written and more readable prose. The system
     of  programs  described  here is an initial step toward
     such  help.  It  includes  programs  and  a  data  base
     designed  to  produce a stylistic profile of writing at
     the word and sentence level. The system measures  read-
     ability,  sentence and word length, sentence type, word
     usage, and sentence openers.  It  also  locates  common
     examples  of wordy phrasing and bad diction. The system
     is useful for evaluating a document's  style,  locating
     sentences  that may be difficult to read or excessively
     wordy, and determining a particular writer's style over
     several documents.

1. Introduction

     Computers have become important in the document  preparation
process, with programs to check for spelling errors and to format
documents. As the amount of text stored  on  line  increases,  it
becomes  feasible  and  attractive  to study writing style and to
attempt to help the writer in producing readable  documents.  The
system  of  writing  tools  described here is a first step toward
such help. The system  includes  programs  and  a  data  base  to
analyze  writing style at the word and sentence level. We use the
term ``style'' in  this  paper  to  describe  the  results  of  a
writer's  particular  choices among individual words and sentence
forms.  Although  many  judgements  of  style   are   subjective,

                          July 4, 2014

particularly those of word choice, there are some objective meas-
ures that experts agree lead to good style. Three  programs  have
been written to measure some of the objectively definable charac-
teristics of writing style and to identify some commonly  misused
or  unnecessary phrases. Although a document that conforms to the
stylistic rules is not guaranteed to be  coherent  and  readable,
one  that  violates all of the rules is likely to be difficult or
tedious to read. The program STYLE calculates  readability,  sen-
tence  length variability, sentence type, word usage and sentence
openers at a rate of about 400 words per  second  on  a  PDP11/70
running the UNIX* Operating System. It assumes that the sentences
are well-formed, i.e., that each sentence has a verb and that the
subject and verb agree in number. DICTION identifies phrases that
are  either  bad  usage or unnecessarily wordy. EXPLAIN acts as a
thesaurus for the phrases found by DICTION. Sections 2, 3, and  4
describe  the  programs;  Section 5 gives the results on a cross-
section of technical documents; Section 6 discusses accuracy  and
problems; Section 7 gives implementation details.

2. STYLE

     The program STYLE reads a document and prints a  summary  of
readability  indices,  sentence  length and type, word usage, and
sentence openers. It may also be used to locate all sentences  in
a  document  longer  than  a  given  length, of readability index
higher than a given number, those containing a passive  verb,  or
those  beginning  with an expletive. STYLE is based on the system
for finding English word classes or parts of speech,  PARTS  [1].
PARTS  is  a  set of programs that uses a small dictionary (about
350 words) and suffix rules to partially assign word  classes  to
English  text.  It then uses experimentally derived rules of word
order to assign word classes to all words in  the  text  with  an
accuracy of about 95%. Because PARTS uses only a small dictionary
and general rules, it works on text about any subject, from  phy-
sics  to psychology. Style measures have been built into the out-
put phase of the programs that make up PARTS. Some of  the  meas-
ures are simple counters of the word classes found by PARTS; many
are more complicated. For example, the verb count  is  the  total
number of verb phrases. This includes phrases like:

        has been going
        was only going
        to go

each of which counts as one verb. Figure 1 shows the  output
of  STYLE  run  on a paper by Kernighan and Mashey about the UNIX
programming environment [2]. As the example shows,  STYLE  output
is  in five parts. After a brief discussion of sentences, we will
describe the parts in order.

_________________________
* UNIX is a registered trademark of AT&T Bell Laboratories
in the USA and other countries.

________________________________________________________________________________________________
|programming environment                                                                       |
|readability grades:                                                                           |
|                        (Kincaid) 12.3  (auto) 12.8  (Coleman-Liau) 11.8  (Flesch) 13.5 (46.3)|
|sentence info:                                                                                |
|                        no. sent 335 no. wds 7419                                             |
|                        av sent leng 22.1 av word leng 4.91                                   |
|                        no. questions 0 no. imperatives 0                                     |
|                        no. nonfunc wds 4362  58.8%   av leng 6.38                            |
|                        short sent (<17) 35% (118) long sent (>32)  16% (55)                  |
|                        longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117      |
|sentence types:                                                                               |
|                        simple  34% (114) complex  32% (108)                                  |
|                        compound  12% (41) compound-complex  21% (72)                         |
|word usage:                                                                                   |
|                        verb types as % of total verbs                                        |
|                        tobe  45% (373) aux  16% (133) inf  14% (114)                         |
|                        passives as % of non-inf verbs  20% (144)                             |
|                        types as % of total                                                   |
|                        prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)                       |
|                        noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)                    |
|                        nominalizations   2 % (155)                                           |
|sentence beginnings:                                                                          |
|                        subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot  67%|
|                        prep  12% (39) adv   9% (31)                                          |
|                        verb   0% (1)  sub_conj   6% (20) conj   1% (5)                       |
|                        expletives   4% (13)                                                  |
|______________________________________________________________________________________________|

                            Figure 1

2.1. What is a sentence?

     Readers of documents have little trouble deciding where  the
sentences  end.  People  don't  even have to stop and think about
uses of the character ``.'' in constructions  like  1.25,  A.  J.
Jones,  Ph.D., i. e., or etc. . When a computer reads a document,
finding the end of sentences is not as easy. First we must  throw
away  the printer's marks and formatting commands that litter the
text in computer form. Then STYLE defines a sentence as a  string
of words ending in one of:

         . ! ? /.

The end marker ``/.'' may be used to indicate an imperative sentence.
Imperative sentences that are not so marked are not identified
as imperative. STYLE properly handles numbers with embedded
decimal points and commas, strings of letters and numbers with
embedded decimal points used as computer file names,
and  the  common abbreviations listed in Appendix 1. Numbers that
end sentences, like the  preceding  sentence,  cause  a  sentence
break  if  the  next  word begins with a capital letter. Initials
only cause a sentence break if the next word begins with a  capi-
tal  and  is  found  in  the dictionary of function words used by
PARTS. So the string

        J. D. JONES

does not cause a break, but the string

         ... system H.  The ...

does. With these rules most sentences are broken  at  the  proper
place,  although occasionally either two sentences are called one
or a fragment is called a sentence. More on this later.
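
     The rules above can be sketched in a few lines of Python. This
is an illustration only, not the STYLE implementation: the sets
ABBREVIATIONS and FUNCTION_WORDS are small hypothetical stand-ins for
Appendix 1 and the PARTS dictionary, and the function names are
invented for the example.

```python
import re

# Hypothetical stand-ins; the real lists are in Appendix 1 and in PARTS.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "ph.d.", "dr.", "mr."}
FUNCTION_WORDS = {"the", "a", "an", "it", "this", "in", "of"}


def ends_sentence(tok, nxt):
    """Decide whether token `tok` closes a sentence; `nxt` is the next token."""
    if tok.endswith(("!", "?")):
        return True
    if not tok.endswith("."):          # 1.25 and prog.c never end in "."
        return False
    if tok.lower() in ABBREVIATIONS:
        return False
    next_capitalized = nxt is None or nxt[:1].isupper()
    if re.fullmatch(r"[A-Z]\.", tok):  # an initial such as "J."
        return nxt is None or (next_capitalized and nxt.lower() in FUNCTION_WORDS)
    if re.fullmatch(r"[\d.,]*\d\.", tok):  # a number that ends a sentence
        return next_capitalized
    return True                        # ordinary word, or the "/." marker


def split_sentences(text):
    """Split whitespace-separated text into sentences using the rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if ends_sentence(tok, nxt):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Because the text is split on white space first, embedded periods
never reach the end-of-sentence test; only a token that ends in a
period needs the abbreviation, initial, and number checks.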

2.2. Readability Grades

     The first section of STYLE output consists of four readability
indices. As Klare points out in [3], readability indices may
be used to estimate the reading skills needed by  the  reader  to
understand  a document. The readability indices reported by STYLE
are based on measures of sentence and word lengths. Although  the
indices may not measure whether the document is coherent and well
organized, experience has shown that  high  indices  seem  to  be
indicators  of  stylistic  difficulty.  Documents with short sen-
tences and short words have low scores; those with long sentences
and  many  polysyllabic  words  have  high scores. The 4 formulae
reported are Kincaid Formula  [4],  Automated  Readability  Index
[5],  Coleman-Liau Formula [6] and a normalized version of Flesch
Reading Ease Score [7]. The formulae differ  because  they   were
experimentally  derived using different texts and subject groups.
We will discuss each of the formulae briefly; for a more detailed
discussion the reader should see [3].

     The Kincaid Formula, given by:

        Reading_Grade = 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59

was based on Navy training manuals that ranged in difficulty from
5.5  to  16.3  in reading grade level. The score reported by this
formula tends to be in the mid-range of the 4 scores. Because  it
is  based on adult training manuals rather than school book text,
this formula is probably the best one to apply to technical docu-
ments.

     The Automated Readability Index (ARI), based  on  text  from
grades  0  to  7, was derived to be easy to automate. The formula
is:

        Reading_Grade = 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43

ARI tends to produce scores that  are  higher  than  Kincaid  and
Coleman-Liau but are usually slightly lower than Flesch.

     The  Coleman-Liau  Formula,  based  on   text   ranging   in
difficulty from .4 to 16.3, is:

        Reading_Grade = 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8

Of the four formulae this one usually gives the lowest grade when
applied to technical documents.

     The last formula, the Flesch Reading Ease Score, is based on
grade school text covering grades 3 to 12. The formula, given by:

        Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent

is usually reported in the range 0 (very difficult) to 100  (very
easy).  The score reported by STYLE is scaled to be comparable to
the other formulas, except that the maximum grade level  reported
is  set  to  17. The Flesch score is usually the highest of the 4
scores on technical documents.
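
     As a rough check, the four indices can be written as small
Python functions. The coefficients below are the ones commonly
published for these formulas and are assumed, not confirmed, to be
the ones STYLE uses; the function and parameter names are invented
for the example. The scaling that STYLE applies to turn the Flesch
score into a grade is not reproduced here.

```python
def kincaid(syl_per_wd, wds_per_sent):
    """Kincaid grade level from syllables per word and words per sentence."""
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


def ari(let_per_wd, wds_per_sent):
    """Automated Readability Index from letters per word and words per sentence."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


def coleman_liau(let_per_wd, wds_per_sent):
    """Coleman-Liau grade; sentences per 100 words is derived from sentence length."""
    sent_per_100_wds = 100.0 / wds_per_sent
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Raw Flesch Reading Ease score, 0 (very difficult) to 100 (very easy)."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


# Averages taken from Figure 1: 22.1 words per sentence, 4.91 letters per word.
print(round(ari(4.91, 22.1), 1))           # 12.7
print(round(coleman_liau(4.91, 22.1), 1))  # 11.8
```

With the averages printed in Figure 1 the sketch gives about 12.7 for
ARI and 11.8 for Coleman-Liau, against the 12.8 and 11.8 reported by
STYLE; the small ARI difference is presumably due to rounding in the
printed averages.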

     Coke [8] found that the Kincaid Formula is probably the best
predictor  for  technical  documents; both ARI and Flesch tend to
overestimate the difficulty; Coleman-Liau tends to underestimate.
On  text  in the range of grades 7 to 9 the four formulas tend to
be about the same. On easy text the Coleman-Liau formula is prob-
ably  preferred  since  it  is  reasonably  accurate at the lower
grades and it is safer to present text that is a little too  easy
than a little too hard.

     If a document has particularly difficult technical  content,
especially  if  it  includes a lot of mathematics, it is probably
best to make the text very easy to read, i.e., to lower the readability
index by shortening the sentences and words. This will allow the
reader to concentrate on the technical content and not  the  long
sentences.  The user should remember that these indices are esti-
mators; they should not  be  taken  as  absolute  numbers.  STYLE
called  with  ``-r  number''  will  print  all  sentences with an
Automated Readability Index equal to or greater than ``number''.

2.3. Sentence length and structure

     The next two sections of STYLE  output  deal  with  sentence
length and structure. Almost all books on writing style or effec-
tive writing emphasize the  importance  of  variety  in  sentence
length and structure for good writing. Ewing's first rule in dis-
cussing style in the book Writing for Results [9] is:

        ``Vary the sentence structure and length of your sentences.''

Leggett, Mead and Charvat break this rule into 3 in Prentice-Hall
Handbook for Writers [10] as follows:

        ``34a. Avoid the overuse of short simple sentences.''
        ``34b. Avoid the overuse of long compound sentences.''
        ``34c. Use various sentence structures to avoid monotony and increase effectiveness.''

Although experts agree that these rules are  important,  not  all
writers  follow  them. Sample technical documents have been found
with almost no sentence length or type variability. One  document
had  90%  of  its sentences about the same length as the average;
another was made up almost entirely of simple sentences (80%).

     The output sections labeled ``sentence info'' and ``sentence
types'' give both length and structure measures. STYLE reports on
the number and average length of both sentences  and  words,  and
number  of  questions  and  imperative sentences (those ending in
``/.''). The measures of non-function words  are  an  attempt  to
look  at  the  content  words  in  the  document. In English non-
function words are nouns, adjectives, adverbs, and  non-auxiliary
verbs;  function  words are prepositions, conjunctions, articles,
and auxiliary verbs. Since most function words  are  short,  they
tend  to  lower  the  average  word length. The average length of
non-function words may be a more  useful  measure  for  comparing
word  choice  of  different  writers  than the total average word
length. The percentages of short and long sentences measure sentence
length variability. Short sentences are those at least 5
words shorter than the average; long sentences are those at least 10
words  longer  than the average. Last in the sentence information
section is the length and location of the  longest  and  shortest
sentences.  If  the  flag ``-l number'' is used, STYLE will print
all sentences longer than ``number''.
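
     The short and long sentence measures follow directly from the
definitions above. The sketch below is illustrative only: it takes a
list of sentence lengths in words, and the names length_profile and
longer_than are invented for the example. The thresholds are applied
exactly as worded here, which may differ by a word from the rounding
STYLE uses.

```python
def length_profile(lengths):
    """Summarize sentence-length variability from a list of word counts."""
    average = sum(lengths) / len(lengths)
    short = [n for n in lengths if n <= average - 5]    # at least 5 words shorter
    long_ = [n for n in lengths if n >= average + 10]   # at least 10 words longer
    return {
        "average": average,
        "short_pct": 100.0 * len(short) / len(lengths),
        "long_pct": 100.0 * len(long_) / len(lengths),
    }


def longer_than(sentences, number):
    """Imitate the effect of ``-l number``: keep sentences with more words."""
    return [s for s in sentences if len(s.split()) > number]
```

For the lengths 8, 20, 22, 24, 26, and 50 the average is 25, so the
two sentences of 20 words or fewer are short and the one of 50 words
is long.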

     Because of the difficulties in dealing with the many uses of
commas  and  conjunctions  in  English, sentence type definitions
vary slightly from those of standard textbooks, but still measure
the same constructional activity.

1.   A simple sentence has one verb and no dependent clause.

2.   A complex sentence has one independent clause and one depen-
     dent clause, each with one verb. Complex sentences are found
     by identifying sentences that contain either  a  subordinate
     conjunction  or  a clause beginning with words like ``that''
     or ``who''. The preceding sentence has such a clause.

3.   A compound sentence has more than one verb and no  dependent
     clause.  Sentences  joined by ``;'' are also counted as com-
     pound.

4.   A compound-complex sentence  has  either  several  dependent
     clauses  or  one  dependent  clause  and  a compound verb in
     either the dependent or independent clause.

     Even using these broader definitions, simple sentences  dom-
inate  many of the technical documents that have been tested, but
the example in Figure 1 shows variety in both sentence  structure
and sentence length.
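
     The four definitions reduce to a decision on two counts. The
sketch below assumes that the number of verb phrases and the number
of dependent clauses in a sentence are already known (STYLE derives
them from the PARTS output); the function name is invented for the
example, and sentences joined by ``;'' are not handled.

```python
def sentence_type(verbs, dependent_clauses):
    """Classify a sentence from its verb-phrase and dependent-clause counts."""
    if dependent_clauses == 0:
        # No subordination: one verb is simple, more than one is compound.
        return "simple" if verbs <= 1 else "compound"
    if dependent_clauses == 1 and verbs <= 2:
        # One independent and one dependent clause, each with one verb.
        return "complex"
    # Several dependent clauses, or one dependent clause plus a compound verb.
    return "compound-complex"
```

For example, sentence_type(2, 0) gives ``compound'' and
sentence_type(3, 1) gives ``compound-complex''.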

2.4. Word Usage

     The word usage measures are  an  attempt  to  identify  some
other  constructional  features  of writing style. There are many
different ways in English to say the same  thing.  The  construc-
tions  differ from one another in the form of the words used. The
following sentences all convey approximately the same meaning but
differ in word usage:

        The cxio program is used to perform all communication between the systems.
        The cxio program performs all communications between the systems.
        The cxio program is used to communicate between the systems.
        The cxio program communicates between the systems.
        All communication between the systems is performed by the cxio program.

The  distribution of the parts of speech and  verb  constructions
helps  identify overuse of particular constructions. Although the
measures used by STYLE are  crude,  they  do  point  out  problem
areas.  For  each  category, STYLE reports a percentage and a raw
count. In addition to looking at the  percentage,  the  user  may
find  it  useful to compare the raw count with the number of sen-
tences. If, for example, the  number  of  infinitives  is  almost
equal  to  the number of sentences, then many of the sentences in
the document are constructed like the  first  and  third  in  the
preceding  example.  The user may want to transform some of these
sentences into another form. Some of the implications of the word
usage measures are discussed below.

Verbs
     are measured in several different ways to try to determine
     what  types  of  verb constructions are most frequent in the
     document. Technical writing tends to  contain  many  passive
     verb  constructions  and  other usage of the verb ``to be''.
     The category of verbs labeled ``tobe''  measures  both  pas-
     sives and sentences of the form:

             subject tobe predicate

     In counting verbs, whole verb phrases  are  counted  as  one
     verb. Verb phrases containing auxiliary verbs are counted in
     the category ``aux''. The  verb  phrases  counted  here  are
     those  whose  tense is not simple present or simple past. It
     might eventually be useful to do more detailed  measures  of
     verb  tense  or mood. Infinitives are listed as ``inf''. The
     percentages reported for these three categories are based on
     the total number of verb phrases found. These categories are
     not mutually exclusive; they cannot  be  added,  since,  for
     example,  ``to  be  going''  counts  as  both  ``tobe''  and
     ``inf''. Use of these  three  types  of  verb  constructions
     varies significantly among authors.

     STYLE reports passive verbs as a percentage  of  the  finite
     verbs  in  the  document.  Most style books warn against the
     overuse of passive verbs. Coleman [11] has shown  that  sen-
     tences with active verbs are easier to learn than those with
     passive verbs. Although the inverted object-subject order of
     the  passive  voice seems to emphasize the object, Coleman's
     experiments  showed  that  there  is  little  difference  in
     retention  by  word position. He also showed that the direct
     object of an active verb is retained better than the subject
     of  a  passive verb. These experiments support the advice of
     the style books suggesting that writers should  try  to  use
     active verbs wherever possible. The flag ``-p'' causes STYLE
     to print all sentences containing passive verbs.

Pronouns
     add cohesiveness and connectivity to a document by providing
     back-reference.  They  are  often  a short-hand notation for
     something previously mentioned, and  therefore  connect  the
     sentence  containing  the pronoun with the word to which the
     pronoun refers. Although there are other mechanisms for such
     connections, documents with no pronouns tend to be wordy and
     to have little connectivity.

Adverbs
     can provide transition between sentences and order  in  time
     and space. In performing these functions, adverbs, like pro-
     nouns, provide connectivity and cohesiveness.

Conjunctions
     provide parallelism in a document by connecting two or  more
     equal  units.  These  units  may  be  whole  sentences, verb
     phrases, nouns, adjectives, or  prepositional  phrases.  The
     compound  and compound-complex sentences reported under sen-
     tence type are parallel structures. Other uses  of  parallel
     structures  are  indicated  by the degree that the number of
     conjunctions reported under word usage exceeds the  compound
     sentence measures.

Nouns and Adjectives
     A ratio of nouns to adjectives near unity may  indicate  the
     over-use  of modifiers. Some technical writers qualify every
     noun with one or more adjectives. Qualifiers in phrases like
     ``simple  linear single-link network model'' often lend more
     obscurity than precision to a text.

Nominalizations
     are verbs that are changed to nouns by  adding  one  of  the
     suffixes  ``ment'', ``ance'', ``ence'', or ``ion''. Examples
     are accomplishment, admittance, adherence, and abbreviation.
     When  a  writer  transforms a nominalized sentence to a non-
     nominalized sentence, she/he increases the effectiveness  of
     the  sentence  in  several  ways. The noun becomes an active
     verb and  frequently  one  complicated  clause  becomes  two
     shorter clauses. For example,

             Their inclusion of this provision is admission of the importance of the system.
             When they included this provision, they admitted the importance of the system.

     Coleman found that the transformed sentences were easier  to
     learn,  even when the transformation produced sentences that
     were slightly longer, provided the transformation broke  one
     clause  into  two.  Writers who find their document contains
     many nominalizations may want to transform some of the  sen-
     tences to use active verbs.
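
     The suffix test for nominalizations is easy to imitate. The
sketch below is a crude approximation with an invented function name:
it looks only at the four suffixes named above, so it also reports
words such as ``station'' that are not derived from verbs, whereas
STYLE works from the word classes assigned by PARTS.

```python
import re

NOMINAL_SUFFIXES = ("ment", "ance", "ence", "ion")


def nominalization_candidates(sentence):
    """Return the words in a sentence that end in a nominalizing suffix."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w.endswith(NOMINAL_SUFFIXES)]
```

Applied to the pair of example sentences above, it reports four
candidates in the nominalized version (inclusion, provision,
admission, importance) and two in the transformed version (provision,
importance).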

2.5. Sentence openers

     Another agreed-upon principle of style is variety in sentence
openers. Because STYLE determines the type of sentence
opener by looking at the part of speech of the first word in  the
sentence,  the  sentences  counted  under  the  heading ``subject
opener'' may not all really begin with the  subject.  However,  a
large  percentage  of  sentences in this category still indicates
lack of variety in sentence openers. Other sentence opener  meas-
ures  help  the  user  determine if there are transitions between
sentences and where the subordination occurs.  Adverbs  and  con-
junctions  at the beginning of sentences are mechanisms for tran-
sition between sentences. A pronoun at the beginning shows a link
to something previously mentioned and indicates connectivity.

     The location of subordination can be determined by comparing
the  number  of sentences that begin with a subordinator with the
number of sentences with complex clauses. If few sentences  start
with  subordinate conjunctions then the subordination is embedded
or at the end of the complex sentences. For  variety  the  writer
may  want  to transform some sentences to have leading subordina-
tion.

     The  last  category  of  openers,  expletives,  is  commonly
overworked  in technical writing. Expletives are the words ``it''
and ``there'', usually with the verb ``to be'', in  constructions
where the subject follows the verb. For example,

        There are three streets used by the traffic.
        There are too many users on this system.

This construction tends to emphasize the object rather  than  the
subject  of  the  sentence.  The  flag ``-e'' will cause STYLE to
print all sentences that begin with an expletive.
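
     A rough imitation of the expletive test is shown below. It is a
sketch under simplifying assumptions: it requires a form of ``to be''
immediately after ``it'' or ``there'', it does not check that the
subject follows the verb, and the names are invented for the example.
It will therefore flag some sentences in which ``it'' is an ordinary
pronoun.

```python
import re

EXPLETIVES = {"it", "there"}
TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being"}


def begins_with_expletive(sentence):
    """True if the sentence opens with it/there followed by a form of to be."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return len(words) >= 2 and words[0] in EXPLETIVES and words[1] in TO_BE
```

Both example sentences above are flagged, while a sentence such as
``It works.'' is not.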

3. DICTION

     The program DICTION prints all sentences in a document  con-
taining  phrases  that  are either frequently misused or indicate
wordiness. The program, an extension of Aho's FGREP  [12]  string
matching program, takes as input a file of phrases or patterns to
be matched and a file of text to be  searched.  A  data  base  of
about 450 phrases has been compiled as a default pattern file for
DICTION. Before attempting to locate phrases,  the  program  maps
upper case letters to lower case and substitutes blanks for punc-
tuation. Sentence boundaries were deemed less critical in DICTION
than  in  STYLE, so abbreviations and other uses of the character
``.'' are not treated specially.  DICTION  brackets  all  pattern
matches  in a sentence with the characters ``['' and ``]''. Although
many of the phrases in the default data base are correct in  some
contexts, in others they indicate wordiness. Some examples of the
phrases and suggested alternatives are:

               Phrase           Alternative
        a large number of       many
        arrive at a decision    decide
        collect together        collect
        for this reason         so
        pertaining to           about
        through the use of      by or with
        utilize                 use
        with the exception of   except

Appendix 2 contains a complete list of the default file. Some  of
the  entries are short forms of problem phrases. For example, the
phrase ``the fact'' is found in all of the following and is  suf-
ficient to point out the wordiness to the user:

                      Phrase                  Alternative
        accounted for by the fact that        caused by
        an example of this is the fact that   thus
        based on the fact that                because
        despite the fact that                 although
        due to the fact that                  because
        in light of the fact that             because
        in view of the fact that              since
        notwithstanding the fact that         although

Entries in Appendix 2 preceded by ``~'' are not matched. See Sec-
tion 7 for details on the use of ``~''.

     The user may supply her/his own pattern file with  the  flag
``-f  patfile''.  In  this  case  the default file will be loaded
first, followed by the user file. This mechanism allows users  to
suppress  patterns  contained  in  the default file or to include
their own pet peeves that are not in the default file.  The  flag
``-n''  will exclude the default file altogether. In constructing
a pattern file, blanks should  be  used  before  and  after  each
phrase  to  avoid  matching  substrings in words. For example, to
find all occurrences of the word ``the'', the pattern ``  the  ''
should  be  used.  The  blanks  cause only the word ``the'' to be
matched and not the string ``the'' in words  like  there,  other,
and  therefore.  One  side  effect  of surrounding the words with
blanks is that when two phrases occur without intervening  words,
only the first will be matched.
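
     The matching rules described in this section, including the
side effect of the surrounding blanks, can be reproduced with a short
sketch. It is not the FGREP-based implementation: it uses a plain
left-to-right string search, returns the matched phrases instead of
bracketing them, and the function names are invented for the example.

```python
import string

# Map every punctuation character to a blank, as DICTION does before matching.
PUNCT_TO_BLANK = str.maketrans({c: " " for c in string.punctuation})


def normalize(text):
    """Lower-case the text, blank out punctuation, and pad both ends."""
    return " " + text.lower().translate(PUNCT_TO_BLANK) + " "


def find_phrases(text, patterns):
    """Return the blank-delimited patterns found, scanning left to right."""
    norm = normalize(text)
    found, pos = [], 0
    while True:
        # Pick the earliest pattern occurrence at or after the current position.
        hits = [(norm.find(p, pos), p) for p in patterns]
        hits = [(i, p) for i, p in hits if i != -1]
        if not hits:
            return found
        i, pat = min(hits)
        found.append(pat.strip())
        pos = i + len(pat)  # the trailing blank is consumed by this match
```

With the patterns `` collect together '' and `` utilize '', the text
``collect together utilize'' yields only ``collect together'': the
single blank between the two phrases is consumed by the first match,
so the second pattern no longer finds its leading blank.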

4. EXPLAIN

     The last program, EXPLAIN, is an interactive  thesaurus  for
phrases  found  by  DICTION.  The  user  types one of the phrases
bracketed by DICTION and EXPLAIN responds with suggested  substi-
tutions for the phrase that will improve the diction of the docu-
ment.

                             Table 1
            Text Statistics on 20 Technical Documents

                         variable          minimum   maximum   mean    standard deviation
_________________________________________________________________________________________
Readability        Kincaid                   9.5      16.9     13.3           2.2
                   automated                 9.0      17.4     13.3           2.5
                   Coleman-Liau             10.0      16.0     12.7           1.8
                   Flesch                    8.9      17.0     14.4           2.2
_________________________________________________________________________________________
sentence info.     av sent length           15.5      30.3     21.6           4.0
                   av word length            4.61      5.63     5.08           .29
                   av nonfunction length     5.72      7.30     6.52           .45
                   short sent               23%       46%      33%            5.9
                   long sent                 7%       20%      14%            2.9
_________________________________________________________________________________________
sentence types     simple                   31%       71%      49%           11.4
                   complex                  19%       50%      33%            8.3
                   compound                  2%       14%       7%            3.3
                   compound-complex          2%       19%      10%            4.8
_________________________________________________________________________________________
verb types         tobe                     26%       64%      44.7%         10.3
                   auxiliary                10%       40%      21%            8.7
                   infinitives               8%       24%      15.1%          4.8
                   passives                 12%       50%      29%            9.3
_________________________________________________________________________________________
word usage         prepositions             10.1%     15.0%    12.3%          1.6
                   conjunction               1.8%      4.8%     3.4%           .9
                   adverbs                   1.2%      5.0%     3.4%          1.0
                   nouns                    23.6%     31.6%    27.8%          1.7
                   adjectives               15.4%     27.1%    21.1%          3.4
                   pronouns                  1.2%      8.4%     2.5%          1.1
                   nominalizations           2%        5%       3.3%           .8
_________________________________________________________________________________________
sentence openers   prepositions              6%       19%      12%            3.4
                   adverbs                   0%       20%       9%            4.6
                   subject                  56%       85%      70%            8.0
                   verbs                     0%        4%       1%            1.0
                   subordinating conj        1%       12%       5%            2.7
                   conjunctions              0%        4%       0%            1.5
                   expletives                0%        6%       2%            1.7

5. Results

5.1. STYLE

     To get baseline statistics and check the program's accuracy,
we  ran  STYLE  on  20 technical documents. There were a total of
3287 sentences in the sample. The shortest document was  67  sen-
tences  long;  the longest 339 sentences. The documents covered a
wide range of subject matter,  including  theoretical  computing,
physics, psychology, engineering, and affirmative action. Table 1
gives the range, median, and standard deviation  of  the  various
style  measures. As you will note, most of the measurements have a
fairly wide range of values across the sample documents.

     As a comparison, Table 2 gives the median  results  for  two
different  technical authors, a sample of instructional material,
and a sample of the Federalist Papers. The two authors show simi-
lar styles, although author 2 uses somewhat shorter sentences and
longer words than author 1. Author 1 uses all types of sentences,
while  author  2  prefers simple and complex sentences, using few
compound or compound-complex sentences. The other  major  differ-
ence in the styles of these authors is the location of subordina-
tion. Author 1 seems to prefer embedded  or  trailing  subordina-
tion,  while  author 2 begins many sentences with the subordinate
clause. The documents tested for both authors 1 and 2 were techn-
ical  documents,  written  for a technical audience. The instruc-
tional  documents,  which  are  written  for  craftspeople,  vary
surprisingly little from the two technical samples. The sentences
and words are a little longer, and they contain many passive  and
auxiliary  verbs,  few  adverbs,  and  almost  no  pronouns.  The
instructional documents contain  many  imperative  sentences,  so
there are many sentences with verb openers. The sample of
Federalist Papers contrasts with the other samples in almost every
way.

5.2. DICTION

     In the few weeks that DICTION has been available to users,
about 35,000 sentences have been run, with about 5,000 string
matches. The authors using the program seem to make the suggested
changes  about 50-75% of the time. To date, almost 200 of the 450
strings in the default file have been matched. Although  most  of
these  phrases are valid and correct in some contexts, the 50-75%
change rate seems to show that the phrases  are  used  much  more
often than concise diction warrants.

6. Accuracy

6.1. Sentence Identification

     The correctness of the STYLE output on the 20 document  sam-
ple was checked in detail. STYLE misidentified 129 sentence frag-
ments as sentences and incorrectly joined two or  more  sentences
75  times  in the 3287 sentence sample. The problems were usually
caused by nonstandard formatting commands, unknown abbreviations,
or lists of non-sentences. An impossibly long sentence
found as the longest sentence in  the  document  usually  is  the
result of a long list of non-sentences.
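
     The effect of an unknown abbreviation can be illustrated with
a minimal sketch. It is not the algorithm STYLE uses; the function
name split_sentences and the splitting rule are assumptions made
for illustration, and only a few entries from the abbreviation
list in Appendix 1 are included.

```python
# Hypothetical sketch: end a sentence at '.', '?' or '!' unless the
# token is a known abbreviation. The set below holds a few entries
# from Appendix 1; the rule itself is an assumption, not STYLE's logic.
ABBREVIATIONS = {"dr.", "fig.", "eq.", "etc.", "vs.", "no.", "inc."}


def split_sentences(text):
    """Return a list of sentences found in whitespace-separated text."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".?!"
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing fragment without an end marker
        sentences.append(" ".join(current))
    return sentences


# "Fig." and "Dr." are known, so the boundaries are found correctly.
print(split_sentences("See Fig. 3 for details. Dr. Smith agrees."))
# "approx." is not in the list, so a fragment is reported as a sentence.
print(split_sentences("The gain is approx. 3 dB higher."))
```

     In this sketch a known abbreviation that happens to end a
sentence would join it to the next one, which is the second kind
of error counted above.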

                             Table 2
                Text Statistics on Single Authors

                         variable          author 1   author 2   inst.    FED
______________________________________________________________________________
readability        Kincaid                  11.0       10.3      10.8    16.3
                   automated                11.0       10.3      11.9    17.8
                   Coleman-Liau              9.3       10.1      10.2    12.3
                   Flesch                   10.3       10.7      10.1    15.0
______________________________________________________________________________
sentence info      av sent length           22.64      19.61     22.78   31.85
                   av word length            4.47       4.66      4.65    4.95
                   av nonfunction length     5.64       5.92      6.04    6.87
                   short sent               35%        43%       35%     40%
                   long sent                18%        15%       16%     21%
______________________________________________________________________________
sentence types     simple                   36%        43%       40%     31%
                   complex                  34%        41%       37%     34%
                   compound                 13%         7%        4%     10%
                   compound-complex         16%         8%       14%     25%
______________________________________________________________________________
verb type          tobe                     42%        43%       45%     37%
                   auxiliary                17%        19%       32%     32%
                   infinitives              17%        15%       12%     21%
                   passives                 20%        19%       36%     20%
______________________________________________________________________________
word usage         prepositions             10.0%      10.8%     12.3%   15.9%
                   conjunctions              3.2%       2.4%      3.9%    3.4%
                   adverbs                   5.05%      4.6%      3.5%    3.7%
                   nouns                    27.7%      26.5%     29.1%   24.9%
                   adjectives               17.0%      19.0%     15.4%   12.4%
                   pronouns                  5.3%       4.3%      2.1%    6.5%
                   nominalizations           1%         2%        2%      3%
______________________________________________________________________________
sentence openers   prepositions             11%        14%        6%      5%
                   adverbs                   9%         9%        6%      4%
                   subject                  65%        59%       54%     66%
                   verb                      3%         2%       14%      2%
                   subordinating conj        8%        14%       11%      3%
                   conjunction               1%         0%        0%      3%
                   expletives                3%         3%        0%      3%

6.2. Sentence Types

     STYLE correctly identified sentence type on 86.5% of the
sentences  in  the sample. The type distribution of the sentences
was 52.5% simple, 29.9% complex, 8.5% compound and  9%  compound-
complex.  The  program  reported  49.5% simple, 31.9% complex, 8%
compound and 10.4% compound-complex. Looking at the errors on the
individual  documents,  the number of simple sentences was under-
reported by about 4% and the complex  and  compound-complex  were
over-reported  by  3%  and 2%, respectively. The following matrix
shows the program's output vs. the actual sentence type.

                          Program Results
                          simple   complex   compound   comp-complex
  Actual    simple          1566      132        49           17
 Sentence   complex           47      892         6           65
   Type     compound          40        6       207           23
            comp-complex       0       52         5          249
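
     The summary figures can be recomputed from the matrix. The
short sketch below is only a check on the arithmetic; the variable
names are illustrative. Computed this way, the overall agreement
and the per-type shares come out close to, though not identical
with, the percentages quoted in the text.

```python
# Rows are the actual sentence type, columns the type reported by the
# program, in the order simple, complex, compound, compound-complex.
TYPES = ["simple", "complex", "compound", "comp-complex"]
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]

total = sum(sum(row) for row in MATRIX)
# Sentences on the diagonal were assigned their actual type.
correct = sum(MATRIX[i][i] for i in range(len(TYPES)))
accuracy = correct / total

# Row sums give the actual distribution, column sums the reported one.
actual = {t: sum(MATRIX[i]) for i, t in enumerate(TYPES)}
reported = {t: sum(row[j] for row in MATRIX) for j, t in enumerate(TYPES)}

for t in TYPES:
    print(f"{t:13s} actual {100 * actual[t] / total:5.1f}%  "
          f"reported {100 * reported[t] / total:5.1f}%")
print(f"agreement on the diagonal: {100 * accuracy:.1f}%")
```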

     The system's inability to find imperative sentences seems to
have  little  effect  on most of the style statistics. A document
with half of its sentences imperative was run, with  and  without
the  imperative end marker. The results were identical except for
the expected errors of not finding verbs as sentence openers, not
counting  the  imperative sentences, and a slight difference (1%)
in the number of nouns and adjectives reported.

6.3. Word Usage

     The accuracy of identifying  word  types  reflects  that  of
PARTS,  which  is about 95% correct. The largest source of confu-
sion is between  nouns  and  adjectives.  The  verb  counts  were
checked  on about 20 sentences from each document and found to be
about 98% correct.

7. Technical Details

7.1. Finding Sentences

     The formatting commands embedded in the  text  increase  the
difficulty of finding sentences. Not all text in a document is in
sentence form; there are headings, tables, equations  and  lists,
for  example. Headings like ``Finding Sentences'' above should be
discarded, not attached to the next sentence. However, since many
of  the  documents  are formatted to be phototypeset, and contain
font changes, which usually operate on the most  important  words
in  the  document,  discarding  all  formatting  commands  is not
correct. To improve the programs' ability to find sentence  boun-
daries,  the  deformatting  program,  DEROFF [13], has been given
some knowledge of  the  formatting  packages  used  on  the  UNIX
operating system. DEROFF will now do the following:

1.   Suppress all formatting macros that  are  used  for  titles,
     headings, author's name, etc.

2.   Suppress the arguments to the macros for  titles,  headings,
     author's name, etc.

3.   Suppress displays, tables, footnotes and text that  is  cen-
     tered or in no-fill mode.

4.   Substitute a place holder for equations and check for hidden
     end markers. The place holder is necessary because many typ-
     ists and authors use the equation setter to change fonts  on
     important  words.  For  this reason, header files containing
     the definition of the EQN delimiters must also  be  included
     as  input  to  STYLE.  End  markers are often hidden when an
     equation ends a sentence and the period is typed inside  the
     EQN delimiters.

5.   Add a "." after lists. If the flag -ml  is  also  used,  all
     lists are suppressed. This is a separate flag because of the
     variety of ways the list macros are used. Often,  lists  are
     sentences  that should be included in the analysis. The user
     must determine how lists are used  in  the  document  to  be
     analyzed.

     Both STYLE and DICTION call DEROFF before they look  at  the
text.  The  user  should supply the -ml flag if the document con-
tains many lists of non-sentences that should be skipped.

7.2. Details of DICTION

     The program DICTION is based on the string matching  program
FGREP.  FGREP takes as input a file of patterns to be matched and
a file to be searched and outputs each line that contains any  of
the patterns with no indication of which pattern was matched. The
following changes have been added to FGREP:

1.   The basic unit that DICTION operates on is a sentence rather
     than a line. Each sentence that contains one of the patterns
     is output.

2.   Upper case letters are mapped to lower case.

3.   Punctuation is replaced by blanks.

4.   All pattern matches in the sentence are found and surrounded
     with ``['' and ``]''.

5.   A method for suppressing a string match has been added.  Any
     pattern  that begins with ``~'' will not be matched. Because
     the matching algorithm  finds  the  longest  substring,  the
     suppression of a match allows words in some correct contexts
     not to be matched while allowing the word in another context
     to  be  found.  For  example,  the  word  ``which'' is often
     incorrectly used instead of ``that'' in restrictive clauses.
     However,  ``which''  is  usually  correct when preceded by a
     preposition or ``,''. The default  pattern  file  suppresses
     the  match of the common prepositions or a double blank fol-
     lowed by ``which'' and therefore matches  only  the  suspect
     uses. The double blank accounts for the replaced comma.
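
     The five changes can be sketched in a few lines. This is an
illustration under stated assumptions, not the DICTION source: the
real program is built on FGREP's matcher [12], while the sketch
scans naively; the function names normalize and mark are
hypothetical; and the patterns are assumed to carry a leading
blank so that they match at word boundaries.

```python
import string

# Punctuation is replaced by blanks (change 3).
PUNCT_TO_BLANK = str.maketrans({c: " " for c in string.punctuation})


def normalize(text):
    """Map upper case to lower case and punctuation to blanks."""
    return text.lower().translate(PUNCT_TO_BLANK)


def body_of(pattern):
    """Return the text to match, without the suppression marker."""
    return pattern[1:] if pattern.startswith("~") else pattern


def mark(sentence, patterns):
    """Bracket flagged phrases; patterns starting with '~' suppress."""
    raw = " " + sentence + " "      # pad so leading-blank patterns fit
    text = normalize(raw)           # same length as raw for ASCII input
    out, i = [], 0
    while i < len(text):
        hits = [p for p in patterns if text.startswith(body_of(p), i)]
        if not hits:
            out.append(raw[i])
            i += 1
            continue
        best = max(hits, key=lambda p: len(body_of(p)))  # longest match
        span = raw[i:i + len(body_of(best))]
        if best.startswith("~"):
            out.append(span)        # suppressed: copy unchanged
        else:
            word = span.lstrip()
            out.append(span[:len(span) - len(word)] + "[" + word + "]")
        i += len(span)
    return "".join(out).strip()


PATTERNS = [" which", "~ in which", "~  which", " utilize"]
print(mark("The file which we utilize is large.", PATTERNS))
print(mark("The file in which we store data, which is large.", PATTERNS))
```

     In the second call nothing is flagged: ``in which'' is
suppressed by the preposition pattern, and the comma before the
second ``which'' becomes a blank, so the double-blank pattern
suppresses that match as well.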

8. Conclusions

     A system of writing tools that measure some of the objective
characteristics  of  writing  style has been developed. The tools
are sufficiently general that they may be applied to documents on
any  subject  with  equal accuracy. Although the measurements are
only of the surface structure of the  text,  they  do  point  out
problem  areas.  In  addition  to  helping writers produce better
documents, these programs may be useful for studying the  writing
process and finding other formulae for measuring readability.

References

1.   L. L. Cherry, ``PARTS - A System for Assigning Word  Classes
     to English Text,'' submitted Communications of the ACM.

2.   B. W. Kernighan and J. R.  Mashey,  ``The  UNIX  Programming
     Environment,''  Software  -  Practice & Experience, 9, 1-15
     (1979).

3.   G. R. Klare,  ``Assessing  Readability,''  Reading  Research
     Quarterly, 1974-1975, 10, 62-102.

4.   E. A. Smith and P. Kincaid, ``Derivation and  validation  of
     the  automated  readability  index  for  use  with technical
     materials,'' Human Factors, 1970, 12, 457-464.

5.   J. P. Kincaid, R. P. Fishburne, R.  L.  Rogers,  and  B.  S.
     Chissom, ``Derivation of new readability formulas (Automated
     Readability Index, Fog count, and Flesch Reading  Ease  For-
     mula)  for  Navy enlisted personnel,'' Navy Training Command
     Research Branch Report 8-75, Feb., 1975.

6.   M. Coleman and T. L. Liau, ``A Computer Readability  Formula
     Designed  for Machine Scoring,'' Journal of Applied Psychol-
     ogy, 1975, 60, 283-284.

7.   R. Flesch,  ``A  New  Readability  Yardstick,''  Journal  of
     Applied Psychology, 1948, 32, 221-233.

8.   E. U. Coke, private communication.

9.   D. W. Ewing, Writing for Results, John Wiley &  Sons,  Inc.,
     New York, N. Y. (1974).

10.  G. Leggett, C. D. Mead and W. Charvat,  Prentice-Hall  Hand-
     book  for  Writers,  Seventh  Edition,  Prentice-Hall  Inc.,
     Englewood Cliffs, N. J. (1978).

11.  E. B. Coleman, ``Learning of Prose Written in Four Grammati-
     cal  Transformations,'' Journal of Applied Psychology, 1965,
     vol. 49, no. 5, pp. 332-341.

12.  A. V. Aho and M. J. Corasick, ``Efficient  String  Matching:
     an aid to Bibliographic Search,'' Communications of the ACM,
     18, (6), 333-340, June 1975.

13.  Bell  Laboratories,   ``UNIX   TIME-SHARING   SYSTEM:   UNIX
     PROGRAMMER'S  MANUAL,''  Seventh  Edition,  Vol.  1 (January
     1979).

                              Appendix 1

                         STYLE Abbreviations

     a. d.
     A. M.
     a. m.
     b. c.
     Ch.
     ch.
     ckts.
     dB.
     Dept.
     dept.
     Depts.
     depts.
     Dr.
     Drs.
     e. g.
     Eq.
     eq.
     et al.
     etc.
     Fig.
     fig.
     Figs.
     figs.
     ft.
     i. e.
     in.
     Inc.
     Jr.
     jr.
     mi.
     Mr.
     Mrs.
     Ms.
     No.
     no.
     Nos.
     nos.
     P. M.
     p. m.
     Ph. D.
     Ph. d.
     Ref.
     ref.
     Refs.
     refs.
     St.
     vs.
     yr.

                              Appendix 2

                       Default DICTION Patterns

      a great deal of
      a large number of
      a lot of
      a majority of
      a need for
      a number of
      a particular preference for
      a preference for
      a small number of
      a tendency to
      abovementioned
      absolutely complete
      absolutely essential
      accomplished
      accordingly
      activate
      actual
      added increments
      adequate enough
      advent
      afford an opportunity
      aggregate
      all of
      all throughout
      along the line
      an indication of
      analyzation
      and etc
      and or
      another additional
      any and all
      arrive at a
      as a matter of fact
      as a method of
      as good or better than
      as of now
      as per
      as regards
      as related to
      as to
      assistance
      assistance to
      assistance to
      assuming that
      at a later date
      at about
      at above
      at all times
      at an early date
      at below
      at the present
      at the time whe
      at this point i
      at this time
      at which time
      at your earlies
      authorization
      awful
      basic fundament
      basically
      be cognizant of
      being as
      being that
      brief in durati
      bring to a conc
      but that
      but what
      by means of
      by the use of
      carry out exper
      center about
      center around
      center portion
      check into
      check on
      check up on
      circle around
      close proximity
      collaborate together
      collect together
      combine together
      come to an end
      commence
      common accord
      compensation
      completely eliminated
      comprise
      concerning
      conduct an investigation of
      conjecture
      connect up
      consensus of opinion
      consequent result
      consolidate together
      construct
      contemplate
      continue on
      continue to remain
      could of
      count up
      couple together
      debate about
      decide on
      deleterious effect
      demean
      demonstrate
      depreciate in value
      deserving of
      desirable benefits
      desirous of
      different than
      discontinue
      disutility
      divide up
      doubt but
      due to
      duly noted
      during the time that
      each and every
      early beginnings
      effectuate
      emotional feelings
      empty out
      enclosed herein
      enclosed herewith
      end result
      end up
      endeavor
      enter in
      enter into
      enthused
      entirely comple
      equally good as
      essentially
      eventuate
      every now and t
      exactly identic
      experiencing di
      fabricate
      face up to
      facilitate
      facts and figur
      fast in action
      fearful of
      fearful that
      few in number
      file away
      final completion
      final ending
      final outcome
      final result
      finalize
      find it interesting to know
      first and foremost
      first beginnings
      first initiated
      firstly
      follow after
      following after
      for the purpose of
      for the reason that
      for the simple reason that
      for this reason
      for your information
      from the point of view of
      full and complete
      generally agreed
      good and
      got to
      gratuitous
      greatly minimize
      head up
      help but
      helps in the production of
      hopeful
      if and when
      if at all possible
      impact
      implement
      important essentials
      importantly
      in a large measure
      in a position to
      in accordance
      in advance of
      in agreement with
      in all cases
      in back of
      in behalf of
      in behind
      in between
      in case
      in close proximity
      in conflict with
      in conjunction with
      in connection with
      in fact
      in large measure
      in many cases
      in most cases
      in my opinion I
      in order to
      in rare cases
      in reference to
      in regard to
      in regards to
      in relation wit
      in short supply
      in size
      in terms of
      in the amount o
      in the case of
      in the course o
      in the event
      in the field of
      in the form of
      in the instance of
      in the interim
      in the last analysis
      in the matter of
      in the near future
      in the neighborhood of
      in the not too distant future
      in the proximity of
      in the range of
      in the same way as described
      in the shape of
      in the vicinity of
      in this case
      in view of the
      in violation of
      inasmuch as
      indicate
      indicative of
      initialize
      initiate
      injurious to
      inquire
      inside of
      institute a
      intents and purposes
      intermingle
      irregardless
      is defined as
      is used to control
      is when
      is where
      it is incumbent
      it stands to reason
      it was noted that if
      joint cooperation
      joint partnership
      just exactly
      kind of
      know about
      last but not least
      later on
      leaving out of consideration
      liable
      link up
      literally
      little doubt that
      lose out on
      lots of
      main essentials
      make a
      make adjustments to
      make an
      make application to
      make contact with
      make mention of
      make out a list of
      make the acquaintance of
      make the adjustment
      manner
      maximum possible
      meaningful
      meet up with
      melt down
      melt up
      methodology
      might of
      minimize as far as possible
      minor importance
      miss out on
      modification
      more preferable
      most unique
      must of
      mutual cooperation
      necessary requisite
      necessitate
      need for
      nice
      not be un
      not in a position to
      not of a high order of accuracy
      not un
      notwithstanding
      of considerable magnitude
      of that
      of the opinion that
      off of
      on a few occasions
      on account of
      on behalf of
      on the grounds that
      on the occasion
      on the part of
      one of the
      open up
      operates to correct
      outside of
      over with
      overall
      past history
      perceptive of
      perform a measurement
      perform the measurement
      permits the reduction of
      personalize
      pertaining to
      physical size
      plan ahead
      plan for the future
      plan in advance
      plan on
      present a conclusion
      present a report
      presently
      prior to
      prioritize
      proceed to
      procure
      productive of
      prolong the duration
      protrude out from
      provided that
      pursuant to
      put to use in
      range all the w
      reason is becau
      reason why
      recur again
      reduce down
      refer back
      reference to th
      reflective of
      regarding
      regretful
      reinitiate
      relative to
      repeat again
      representative
      resultant effec
      resume again
      retreat back
      return again
      return back
      revert back
      seal off
      seems apparent
      send a communication
      short space of time
      should of
      single unit
      situation
      so as to
      sort of
      spell out
      still continue
      still remain
      subsequent
      substantially in agreement
      succeed in
      suggestive of
      superior than
      surrounding circumstances
      take appropriate
      take cognizance of
      take into consideration
      termed as
      terminate
      termination
      the author
      the authors
      the case that
      the fact
      the foregoing
      the foreseeable future
      the fullest possible extent
      the majority of
      the nature
      the necessity of
      the only difference being that
      the order of
      the point that
      the truth is
      there are not many
      through the medium of
      through the use of
      throughout the entire
      time interval
      to summarize the above
      total effect of all this
      totality
      transpire
      true facts
      try and
      ultimate end
      under a separate cover
      under date of
      under separate cover
      under the necessity to
      underlying purpose
      undertake a stu
      uniformly consi
      unique
      until such tim
      up to this time
      upshot
      utilize
      very
      very complete
      very unique
      vital
      which
      with a view to
      with reference t
      with regard to
      with the exception
      with the object
      with the result
      with this in mi
      within the realm of possibility
      without further
      worth while
      would of
      ing behavior
      wise
      ~  which
      ~ about which
      ~ after which
      ~ at which
      ~ between which
      ~ by which
      ~ for which
      ~ from which
      ~ in which
      ~ into which
      ~ of which
      ~ on which
      ~ on which
      ~ over which
      ~ through which
      ~ to which
      ~ under which
      ~ upon which
      ~ with which
      ~ without which
      ~clockwise
      ~likewise
      ~otherwise
