MirOS Manual: Locale::Maketext::TPJ13(3p)


Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

NAME

     Locale::Maketext::TPJ13 -- article about software localiza-
     tion

SYNOPSIS

       # This an article, not a module.

DESCRIPTION

     The following article by Sean M. Burke and Jordan Lachler
     first appeared in The Perl Journal #13 and is copyright 1999
     The Perl Journal. It appears courtesy of Jon Orwant and The
     Perl Journal.  This document may be distributed under the
     same terms as Perl itself.

Localization and Perl: gettext breaks, Maketext fixes
     by Sean M. Burke and Jordan Lachler

     This article points out cases where gettext (a common system
     for localizing software interfaces -- i.e., making them work
     in the user's language of choice) fails because of basic
     differences between human languages.  This article then
     describes Maketext, a new system capable of correctly treat-
     ing these differences.

     A Localization Horror Story: It Could Happen To You

         "There are a number of languages spoken by human beings
         in this world."

         -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
         Identification of Languages"

     Imagine that your task for the day is to localize a piece of
     software -- and luckily for you, the only output the program
     emits is two messages, like this:

       I scanned 12 directories.

       Your query matched 10 files in 4 directories.

     So how hard could that be?  You look at the code that pro-
     duces the first item, and it reads:

       printf("I scanned %g directories.",
              $directory_count);

     You think about that, and realize that it doesn't even work
     right for English, as it can produce this output:

       I scanned 1 directories.

perl v5.8.8                2005-02-05                           1

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     So you rewrite it to read:

       printf("I scanned %g %s.",
              $directory_count,
              $directory_count == 1 ?
                "directory" : "directories",
       );

     ...which does the Right Thing.  (In case you don't recall,
     "%g" is for locale-specific number interpolation, and "%s"
     is for string interpolation.)

     But you still have to localize it for all the languages
     you're producing this software for, so you pull
     Locale::gettext off of CPAN so you can access the "gettext"
     C functions you've heard are standard for localization
     tasks.

     And you write:

       printf(gettext("I scanned %g %s."),
              $dir_scan_count,
              $dir_scan_count == 1 ?
                gettext("directory") : gettext("directories"),
       );

     But you then read in the gettext manual (Drepper, Miller,
     and Pinard 1995) that this is not a good idea, since how a
     single word like "directory" or "directories" is translated
     may depend on context -- and this is true, since in a case
     language like German or Russian, you'd may need these words
     with a different case ending in the first instance (where
     the word is the object of a verb) than in the second
     instance, which you haven't even gotten to yet (where the
     word is the object of a preposition, "in %g directories") --
     assuming these keep the same syntax when translated into
     those languages.

     So, on the advice of the gettext manual, you rewrite:

       printf( $dir_scan_count == 1 ?
                gettext("I scanned %g directory.") :
                gettext("I scanned %g directories."),
              $dir_scan_count );

     So, you email your various translators (the boss decides
     that the languages du jour are Chinese, Arabic, Russian, and
     Italian, so you have one translator for each), asking for
     translations for "I scanned %g directory." and "I scanned %g
     directories.".  When they reply, you'll put that in the lex-
     icons for gettext to use when it localizes your software, so
     that when the user is running under the "zh" (Chinese)

perl v5.8.8                2005-02-05                           2

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     locale, gettext("I scanned %g directory.") will return the
     appropriate Chinese text, with a "%g" in there where printf
     can then interpolate $dir_scan.

     Your Chinese translator emails right back -- he says both of
     these phrases translate to the same thing in Chinese,
     because, in linguistic jargon, Chinese "doesn't have number
     as a grammatical category" -- whereas English does.  That
     is, English has grammatical rules that refer to "number",
     i.e., whether something is grammatically singular or plural;
     and one of these rules is the one that forces nouns to take
     a plural suffix (generally "s") when in a plural context, as
     they are when they follow a number other than "one" (includ-
     ing, oddly enough, "zero"). Chinese has no such rules, and
     so has just the one phrase where English has two.  But, no
     problem, you can have this one Chinese phrase appear as the
     translation for the two English phrases in the "zh" gettext
     lexicon for your program.

     Emboldened by this, you dive into the second phrase that
     your software needs to output: "Your query matched 10 files
     in 4 directories.".  You notice that if you want to treat
     phrases as indivisible, as the gettext manual wisely
     advises, you need four cases now, instead of two, to cover
     the permutations of singular and plural on the two items,
     $dir_count and $file_count.  So you try this:

       printf( $file_count == 1 ?
         ( $directory_count == 1 ?
          gettext("Your query matched %g file in %g directory.") :
          gettext("Your query matched %g file in %g directories.") ) :
         ( $directory_count == 1 ?
          gettext("Your query matched %g files in %g directory.") :
          gettext("Your query matched %g files in %g directories.") ),
        $file_count, $directory_count,
       );

     (The case of "1 file in 2 [or more] directories" could, I
     suppose, occur in the case of symlinking or something of the
     sort.)

     It occurs to you that this is not the prettiest code you've
     ever written, but this seems the way to go.  You mail off to
     the translators asking for translations for these four
     cases.  The Chinese guy replies with the one phrase that
     these all translate to in Chinese, and that phrase has two
     "%g"s in it, as it should -- but there's a problem.  He
     translates it word-for-word back: "In %g directories con-
     tains %g files match your query."  The %g slots are in an
     order reverse to what they are in English.  You wonder how
     you'll get gettext to handle that.

perl v5.8.8                2005-02-05                           3

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     But you put it aside for the moment, and optimistically hope
     that the other translators won't have this problem, and that
     their languages will be better behaved -- i.e., that they
     will be just like English.

     But the Arabic translator is the next to write back.  First
     off, your code for "I scanned %g directory." or "I scanned
     %g directories." assumes there's only singular or plural.
     But, to use linguistic jargon again, Arabic has grammatical
     number, like English (but unlike Chinese), but it's a
     three-term category: singular, dual, and plural. In other
     words, the way you say "directory" depends on whether
     there's one directory, or two of them, or more than two of
     them.  Your test of "($directory == 1)" no longer does the
     job.  And it means that where English's grammatical category
     of number necessitates only the two permutations of the
     first sentence based on "directory [singular]" and "direc-
     tories [plural]", Arabic has three -- and, worse, in the
     second sentence ("Your query matched %g file in %g direc-
     tory."), where English has four, Arabic has nine.  You sense
     an unwelcome, exponential trend taking shape.

     Your Italian translator emails you back and says that "I
     searched 0 directories" (a possible English output of your
     program) is stilted, and if you think that's fine English,
     that's your problem, but that just will not do in the
     language of Dante.  He insists that where $directory_count
     is 0, your program should produce the Italian text for "I
     didn't scan any directories.".  And ditto for "I didn't
     match any files in any directories", although he says the
     last part about "in any directories" should probably just be
     left off.

     You wonder how you'll get gettext to handle this; to accomo-
     date the ways Arabic, Chinese, and Italian deal with numbers
     in just these few very simple phrases, you need to write
     code that will ask gettext for different queries depending
     on whether the numerical values in question are 1, 2, more
     than 2, or in some cases 0, and you still haven't figured
     out the problem with the different word order in Chinese.

     Then your Russian translator calls on the phone, to person-
     ally tell you the bad news about how really unpleasant your
     life is about to become:

     Russian, like German or Latin, is an inflectional language;
     that is, nouns and adjectives have to take endings that
     depend on their case (i.e., nominative, accusative, geni-
     tive, etc...) -- which is roughly a matter of what role they
     have in syntax of the sentence -- as well as on the grammat-
     ical gender (i.e., masculine, feminine, neuter) and number
     (i.e., singular or plural) of the noun, as well as on the

perl v5.8.8                2005-02-05                           4

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     declension class of the noun.  But unlike with most other
     inflected languages, putting a number-phrase (like "ten" or
     "forty-three", or their Arabic numeral equivalents) in front
     of noun in Russian can change the case and number that noun
     is, and therefore the endings you have to put on it.

     He elaborates:  In "I scanned %g directories", you'd expect
     "directories" to be in the accusative case (since it is the
     direct object in the sentnce) and the plural number, except
     where $directory_count is 1, then you'd expect the singular,
     of course.  Just like Latin or German.  But!  Where
     $directory_count % 10 is 1 ("%" for modulo, remember),
     assuming $directory count is an integer, and except where
     $directory_count % 100 is 11, "directories" is forced to
     become grammatically singular, which means it gets the end-
     ing for the accusative singular...  You begin to visualize
     the code it'd take to test for the problem so far, and still
     work for Chinese and Arabic and Italian, and how many get-
     text items that'd take, but he keeps going...  But where
     $directory_count % 10 is 2, 3, or 4 (except where
     $directory_count % 100 is 12, 13, or 14), the word for
     "directories" is forced to be genitive singular -- which
     means another ending... The room begins to spin around you,
     slowly at first...  But with all other integer values, since
     "directory" is an inanimate noun, when preceded by a number
     and in the nominative or accusative cases (as it is here,
     just your luck!), it does stay plural, but it is forced into
     the genitive case -- yet another ending...  And you never
     hear him get to the part about how you're going to run into
     similar (but maybe subtly different) problems with other
     Slavic languages like Polish, because the floor comes up to
     meet you, and you fade into unconsciousness.

     The above cautionary tale relates how an attempt at locali-
     zation can lead from programmer consternation, to program
     obfuscation, to a need for sedation.  But careful evaluation
     shows that your choice of tools merely needed further con-
     sideration.

     The Linguistic View

         "It is more complicated than you think."

         -- The Eighth Networking Truth, from RFC 1925

     The field of Linguistics has expended a great deal of effort
     over the past century trying to find grammatical patterns
     which hold across languages; it's been a constant process of
     people making generalizations that should apply to all
     languages, only to find out that, all too often, these gen-
     eralizations fail -- sometimes failing for just a few
     languages, sometimes whole classes of languages, and

perl v5.8.8                2005-02-05                           5

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     sometimes nearly every language in the world except English.
     Broad statistical trends are evident in what the "average
     language" is like as far as what its rules can look like,
     must look like, and cannot look like.  But the "average
     language" is just as unreal a concept as the "average per-
     son" -- it runs up against the fact no language (or person)
     is, in fact, average.  The wisdom of past experience leads
     us to believe that any given language can do whatever it
     wants, in any order, with appeal to any kind of grammatical
     categories wants -- case, number, tense, real or metaphoric
     characteristics of the things that words refer to, arbitrary
     or predictable classifications of words based on what end-
     ings or prefixes they can take, degree or means of certainty
     about the truth of statements expressed, and so on, ad
     infinitum.

     Mercifully, most localization tasks are a matter of finding
     ways to translate whole phrases, generally sentences, where
     the context is relatively set, and where the only variation
     in content is usually in a number being expressed -- as in
     the example sentences above. Translating specific, fully-
     formed sentences is, in practice, fairly foolproof -- which
     is good, because that's what's in the phrasebooks that so
     many tourists rely on.  Now, a given phrase (whether in a
     phrasebook or in a gettext lexicon) in one language might
     have a greater or lesser applicability than that phrase's
     translation into another language -- for example, strictly
     speaking, in Arabic, the "your" in "Your query matched..."
     would take a different form depending on whether the user is
     male or female; so the Arabic translation "your[feminine]
     query" is applicable in fewer cases than the corresponding
     English phrase, which doesn't distinguish the user's gender.
     (In practice, it's not feasable to have a program know the
     user's gender, so the masculine "you" in Arabic is usually
     used, by default.)

     But in general, such surprises are rare when entire sen-
     tences are being translated, especially when the functional
     context is restricted to that of a computer interacting with
     a user either to convey a fact or to prompt for a piece of
     information.  So, for purposes of localization, translation
     by phrase (generally by sentence) is both the simplest and
     the least problematic.

     Breaking gettext

         "It Has To Work."

         -- First Networking Truth, RFC 1925

     Consider that sentences in a tourist phrasebook are of two
     types: ones like "How do I get to the marketplace?" that

perl v5.8.8                2005-02-05                           6

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     don't have any blanks to fill in, and ones like "How much do
     these ___ cost?", where there's one or more blanks to fill
     in (and these are usually linked to a list of words that you
     can put in that blank: "fish", "potatoes", "tomatoes", etc.)
     The ones with no blanks are no problem, but the fill-in-
     the-blank ones may not be really straightforward. If it's a
     Swahili phrasebook, for example, the authors probably didn't
     bother to tell you the complicated ways that the verb "cost"
     changes its inflectional prefix depending on the noun you're
     putting in the blank. The trader in the marketplace will
     still understand what you're saying if you say "how much do
     these potatoes cost?" with the wrong inflectional prefix on
     "cost".  After all, you can't speak proper Swahili, you're
     just a tourist.  But while tourists can be stupid, computers
     are supposed to be smart; the computer should be able to
     fill in the blank, and still have the results be grammati-
     cal.

     In other words, a phrasebook entry takes some values as
     parameters (the things that you fill in the blank or
     blanks), and provides a value based on these parameters,
     where the way you get that final value from the given values
     can, properly speaking, involve an arbitrarily complex
     series of operations.  (In the case of Chinese, it'd be not
     at all complex, at least in cases like the examples at the
     beginning of this article; whereas in the case of Russian
     it'd be a rather complex series of operations.  And in some
     languages, the complexity could be spread around dif-
     ferently: while the act of putting a number-expression in
     front of a noun phrase might not be complex by itself, it
     may change how you have to, for example, inflect a verb
     elsewhere in the sentence.  This is what in syntax is called
     "long-distance dependencies".)

     This talk of parameters and arbitrary complexity is just
     another way to say that an entry in a phrasebook is what in
     a programming language would be called a "function".  Just
     so you don't miss it, this is the crux of this article: A
     phrase is a function; a phrasebook is a bunch of functions.

     The reason that using gettext runs into walls (as in the
     above second-person horror story) is that you're trying to
     use a string (or worse, a choice among a bunch of strings)
     to do what you really need a function for -- which is
     futile.  Preforming (s)printf interpolation on the strings
     which you get back from gettext does allow you to do some
     common things passably well... sometimes... sort of; but, to
     paraphrase what some people say about "csh" script program-
     ming, "it fools you into thinking you can use it for real
     things, but you can't, and you don't discover this until
     you've already spent too much time trying, and by then it's
     too late."

perl v5.8.8                2005-02-05                           7

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     Replacing gettext

     So, what needs to replace gettext is a system that supports
     lexicons of functions instead of lexicons of strings.  An
     entry in a lexicon from such a system should not look like
     this:

       "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

     [\xE9 is e-acute in Latin-1.  Some pod renderers would
     scream if I used the actual character here. -- SB]

     but instead like this, bearing in mind that this is just a
     first stab:

       sub I_found_X1_files_in_X2_directories {
         my( $files, $dirs ) = @_[0,1];
         $files = sprintf("%g %s", $files,
           $files == 1 ? 'fichier' : 'fichiers');
         $dirs = sprintf("%g %s", $dirs,
           $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
         return "J'ai trouv\xE9 $files dans $dirs.";
       }

     Now, there's no particularly obvious way to store anything
     but strings in a gettext lexicon; so it looks like we just
     have to start over and make something better, from scratch.
     I call my shot at a gettext-replacement system "Maketext",
     or, in CPAN terms, Locale::Maketext.

     When designing Maketext, I chose to plan its main features
     in terms of "buzzword compliance".  And here are the buzz-
     words:

     Buzzwords: Abstraction and Encapsulation

     The complexity of the language you're trying to output a
     phrase in is entirely abstracted inside (and encapsulated
     within) the Maketext module for that interface.  When you
     call:

       print $lang->maketext("You have [quant,_1,piece] of new mail.",
                            scalar(@messages));

     you don't know (and in fact can't easily find out) whether
     this will involve lots of figuring, as in Russian (if $lang
     is a handle to the Russian module), or relatively little, as
     in Chinese.  That kind of abstraction and encapsulation may
     encourage other pleasant buzzwords like modularization and
     stratification, depending on what design decisions you make.

perl v5.8.8                2005-02-05                           8

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     Buzzword: Isomorphism

     "Isomorphism" means "having the same structure or form"; in
     discussions of program design, the word takes on the spe-
     cial, specific meaning that your implementation of a solu-
     tion to a problem has the same structure as, say, an infor-
     mal verbal description of the solution, or maybe of the
     problem itself.  Isomorphism is, all things considered, a
     good thing -- it's what problem-solving (and
     solution-implementing) should look like.

     What's wrong the with gettext-using code like this...

       printf( $file_count == 1 ?
         ( $directory_count == 1 ?
          "Your query matched %g file in %g directory." :
          "Your query matched %g file in %g directories." ) :
         ( $directory_count == 1 ?
          "Your query matched %g files in %g directory." :
          "Your query matched %g files in %g directories." ),
        $file_count, $directory_count,
       );

     is first off that it's not well abstracted -- these ways of
     testing for grammatical number (as in the expressions like
     "foo == 1 ? singular_form : plural_form") should be
     abstracted to each language module, since how you get gram-
     matical number is language-specific.

     But second off, it's not isomorphic -- the "solution" (i.e.,
     the phrasebook entries) for Chinese maps from these four
     English phrases to the one Chinese phrase that fits for all
     of them.  In other words, the informal solution would be
     "The way to say what you want in Chinese is with the one
     phrase 'For your question, in Y directories you would find X
     files'" -- and so the implemented solution should be, iso-
     morphically, just a straightforward way to spit out that one
     phrase, with numerals properly interpolated.  It shouldn't
     have to map from the complexity of other languages to the
     simplicity of this one.

     Buzzword: Inheritance

     There's a great deal of reuse possible for sharing of
     phrases between modules for related dialects, or for sharing
     of auxiliary functions between related languages.  (By "aux-
     iliary functions", I mean functions that don't produce
     phrase-text, but which, say, return an answer to "does this
     number require a plural noun after it?".  Such auxiliary
     functions would be used in the internal logic of functions
     that actually do produce phrase-text.)

perl v5.8.8                2005-02-05                           9

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     In the case of sharing phrases, consider that you have an
     interface already localized for American English (probably
     by having been written with that as the native locale, but
     that's incidental). Localizing it for UK English should, in
     practical terms, be just a matter of running it past a Brit-
     ish person with the instructions to indicate what few
     phrases would benefit from a change in spelling or possibly
     minor rewording.  In that case, you should be able to put in
     the UK English localization module only those phrases that
     are UK-specific, and for all the rest, inherit from the
     American English module.  (And I expect this same situation
     would apply with Brazilian and Continental Portugese, poss-
     bily with some very closely related languages like Czech and
     Slovak, and possibly with the slightly different "versions"
     of written Mandarin Chinese, as I hear exist in Taiwan and
     mainland China.)

     As to sharing of auxiliary functions, consider the problem
     of Russian numbers from the beginning of this article; obvi-
     ously, you'd want to write only once the hairy code that,
     given a numeric value, would return some specification of
     which case and number a given quanitified noun should use.
     But suppose that you discover, while localizing an interface
     for, say, Ukranian (a Slavic language related to Russian,
     spoken by several million people, many of whom would be
     relieved to find that your Web site's or software's inter-
     face is available in their language), that the rules in
     Ukranian are the same as in Russian for quantification, and
     probably for many other grammatical functions. While there
     may well be no phrases in common between Russian and
     Ukranian, you could still choose to have the Ukranian module
     inherit from the Russian module, just for the sake of inher-
     iting all the various grammatical methods.  Or, probably
     better organizationally, you could move those functions to a
     module called "_E_Slavic" or something, which Russian and
     Ukranian could inherit useful functions from, but which
     would (presumably) provide no lexicon.

     Buzzword: Concision

     Okay, concision isn't a buzzword.  But it should be, so I
     decree that as a new buzzword, "concision" means that simple
     common things should be expressible in very few lines (or
     maybe even just a few characters) of code -- call it a spe-
     cial case of "making simple things easy and hard things pos-
     sible", and see also the role it played in the MIDI::Simple
     language, discussed elsewhere in this issue [TPJ#13].

     Consider our first stab at an entry in our "phrasebook of
     functions":

perl v5.8.8                2005-02-05                          10

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

       sub I_found_X1_files_in_X2_directories {
         my( $files, $dirs ) = @_[0,1];
         $files = sprintf("%g %s", $files,
           $files == 1 ? 'fichier' : 'fichiers');
         $dirs = sprintf("%g %s", $dirs,
           $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
         return "J'ai trouv\xE9 $files dans $dirs.";
       }

     You may sense that a lexicon (to use a non-committal catch-
     all term for a collection of things you know how to say,
     regardless of whether they're phrases or words) consisting
     of functions expressed as above would make for rather long-
     winded and repetitive code -- even if you wisely rewrote
     this to have quantification (as we call adding a number
     expression to a noun phrase) be a function called like:

       sub I_found_X1_files_in_X2_directories {
         my( $files, $dirs ) = @_[0,1];
         $files = quant($files, "fichier");
         $dirs =  quant($dirs,  "r\xE9pertoire");
         return "J'ai trouv\xE9 $files dans $dirs.";
       }

     And you may also sense that you do not want to bother your
     translators with having to write Perl code -- you'd much
     rather that they spend their very costly time on just trans-
     lation.  And this is to say nothing of the near impossibil-
     ity of finding a commercial translator who would know even
     simple Perl.

     In a first-hack implementation of Maketext, each
     language-module's lexicon looked like this:

      %Lexicon = (
        "I found %g files in %g directories"
        => sub {
           my( $files, $dirs ) = @_[0,1];
           $files = quant($files, "fichier");
           $dirs =  quant($dirs,  "r\xE9pertoire");
           return "J'ai trouv\xE9 $files dans $dirs.";
         },
       ... and so on with other phrase => sub mappings ...
      );

     but I immediately went looking for some more concise way to
     basically denote the same phrase-function -- a way that
     would also serve to concisely denote most phrase-functions
     in the lexicon for most languages.  After much time and even
     some actual thought, I decided on this system:

perl v5.8.8                2005-02-05                          11

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     * Where a value in a %Lexicon hash is a contentful string
     instead of an anonymous sub (or, conceivably, a coderef), it
     would be interpreted as a sort of shorthand expression of
     what the sub does.  When accessed for the first time in a
     session, it is parsed, turned into Perl code, and then
     eval'd into an anonymous sub; then that sub replaces the
     original string in that lexicon.  (That way, the work of
     parsing and evaling the shorthand form for a given phrase is
     done no more than once per session.)

     * Calls to "maketext" (as Maketext's main function is
     called) happen thru a "language session handle", notionally
     very much like an IO handle, in that you open one at the
     start of the session, and use it for "sending signals" to an
     object in order to have it return the text you want.

     So, this:

       $lang->maketext("You have [quant,_1,piece] of new mail.",
                      scalar(@messages));

     basically means this: look in the lexicon for $lang (which
     may inherit from any number of other lexicons), and find the
     function that we happen to associate with the string "You
     have [quant,_1,piece] of new mail" (which is, and should be,
     a functioning "shorthand" for this function in the native
     locale -- English in this case).  If you find such a func-
     tion, call it with $lang as its first parameter (as if it
     were a method), and then a copy of scalar(@messages) as its
     second, and then return that value.  If that function was
     found, but was in string shorthand instead of being a fully
     specified function, parse it and make it into a function
     before calling it the first time.

     * The shorthand uses code in brackets to indicate method
     calls that should be performed.  A full explanation is not
     in order here, but a few examples will suffice:

       "You have [quant,_1,piece] of new mail."

     The above code is shorthand for, and will be interpreted as,
     this:

       sub {
         my $handle = $_[0];
         my(@params) = @_;
         return join '',
           "You have ",
           $handle->quant($params[1], 'piece'),
           "of new mail.";
       }

perl v5.8.8                2005-02-05                          12

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     where "quant" is the name of a method you're using to quan-
     tify the noun "piece" with the number $params[0].

     A string with no brackety calls, like this:

       "Your search expression was malformed."

     is somewhat of a degerate case, and just gets turned into:

       sub { return "Your search expression was malformed." }

     However, not everything you can write in Perl code can be
     written in the above shorthand system -- not by a long shot.
     For example, consider the Italian translator from the begin-
     ning of this article, who wanted the Italian for "I didn't
     find any files" as a special case, instead of "I found 0
     files".  That couldn't be specified (at least not easily or
     simply) in our shorthand system, and it would have to be
     written out in full, like this:

       sub {  # pretend the English strings are in Italian
         my($handle, $files, $dirs) = @_[0,1,2];
         return "I didn't find any files" unless $files;
         return join '',
           "I found ",
           $handle->quant($files, 'file'),
           " in ",
           $handle->quant($dirs,  'directory'),
           ".";
       }

     Next to a lexicon full of shorthand code, that sort of
     sticks out like a sore thumb -- but this is a special case,
     after all; and at least it's possible, if not as concise as
     usual.

     As to how you'd implement the Russian example from the
     beginning of the article, well, There's More Than One Way To
     Do It, but it could be something like this (using English
     words for Russian, just so you know what's going on):

       "I [quant,_1,directory,accusative] scanned."

     This shifts the burden of complexity off to the quant
     method.  That method's parameters are: the numeric value
     it's going to use to quantify something; the Russian word
     it's going to quantify; and the parameter "accusative",
     which you're using to mean that this sentence's syntax wants
     a noun in the accusative case there, although that quantifi-
     cation method may have to overrule, for grammatical reasons
     you may recall from the beginning of this article.

perl v5.8.8                2005-02-05                          13

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     Now, the Russian quant method here is responsible not only
     for implementing the strange logic necessary for figuring
     out how Russian number-phrases impose case and number on
     their noun-phrases, but also for inflecting the Russian word
     for "directory".  How that inflection is to be carried out
     is no small issue, and among the solutions I've seen, some
     (like variations on a simple lookup in a hash where all pos-
     sible forms are provided for all necessary words) are
     straightforward but can become cumbersome when you need to
     inflect more than a few dozen words; and other solutions
     (like using algorithms to model the inflections, storing
     only root forms and irregularities) can involve more over-
     head than is justifiable for all but the largest lexicons.

     Mercifully, this design decision becomes crucial only in the
     hairiest of inflected languages, of which Russian is by no
     means the worst case scenario, but is worse than most.  Most
     languages have simpler inflection systems; for example, in
     English or Swahili, there are generally no more than two
     possible inflected forms for a given noun ("error/errors";
     "kosa/makosa"), and the rules for producing these forms are
     fairly simple -- or at least, simple rules can be formulated
     that work for most words, and you can then treat the excep-
     tions as just "irregular", at least relative to your ad hoc
     rules.  A simpler inflection system (simpler rules, fewer
     forms) means that design decisions are less crucial to main-
     taining sanity, whereas the same decisions could incur
     overhead-versus-scalability problems in languages like Rus-
     sian.  It may also be likely that code (possibly in Perl, as
     with Lingua::EN::Inflect, for English nouns) has already
     been written for the language in question, whether simple or
     complex.

     Moreover, a third possibility may even be simpler than any-
     thing discussed above: "Just require that all possible (or
     at least applicable) forms be provided in the call to the
     given language's quant method, as in:"

       "I found [quant,_1,file,files]."

     That way, quant just has to chose which form it needs,
     without having to look up or generate anything.  While pos-
     sibly not optimal for Russian, this should work well for
     most other languages, where quantification is not as compli-
     cated an operation.

     The Devil in the Details

     There's plenty more to Maketext than described above -- for
     example, there's the details of how language tags ("en-US",
     "i-pwn", "fi", etc.) or locale IDs ("en_US") interact with
     actual module naming ("BogoQuery/Locale/en_us.pm"), and what

perl v5.8.8                2005-02-05                          14

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     magic can ensue; there's the details of how to record (and
     possibly negotiate) what character encoding Maketext will
     return text in (UTF8? Latin-1? KOI8?).  There's the
     interesting fact that Maketext is for localization, but
     nowhere actually has a ""use locale;"" anywhere in it.  For
     the curious, there's the somewhat frightening details of how
     I actually implement something like data inheritance so that
     searches across modules' %Lexicon hashes can parallel how
     Perl implements method inheritance.

     And, most importantly, there's all the practical details of
     how to actually go about deriving from Maketext so you can
     use it for your interfaces, and the various tools and con-
     ventions for starting out and maintaining individual
     language modules.

     That is all covered in the documentation for
     Locale::Maketext and the modules that come with it, avail-
     able in CPAN.  After having read this article, which covers
     the why's of Maketext, the documentation, which covers the
     how's of it, should be quite straightfoward.

     The Proof in the Pudding: Localizing Web Sites

     Maketext and gettext have a notable difference: gettext is
     in C, accessible thru C library calls, whereas Maketext is
     in Perl, and really can't work without a Perl interpreter
     (although I suppose something like it could be written for
     C).  Accidents of history (and not necessarily lucky ones)
     have made C++ the most common language for the implementa-
     tion of applications like word processors, Web browsers, and
     even many in-house applications like custom query systems.
     Current conditions make it somewhat unlikely that the next
     one of any of these kinds of applications will be written in
     Perl, albeit clearly more for reasons of custom and inertia
     than out of consideration of what is the right tool for the
     job.

     However, other accidents of history have made Perl a well-
     accepted language for design of server-side programs (gen-
     erally in CGI form) for Web site interfaces.  Localization
     of static pages in Web sites is trivial, feasable either
     with simple language-negotiation features in servers like
     Apache, or with some kind of server-side inclusions of
     language-appropriate text into layout templates.  However, I
     think that the localization of Perl-based search systems (or
     other kinds of dynamic content) in Web sites, be they public
     or access-restricted, is where Maketext will see the
     greatest use.

     I presume that it would be only the exceptional Web site
     that gets localized for English and Chinese and Italian and

perl v5.8.8                2005-02-05                          15

Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)

     Arabic and Russian, to recall the languages from the begin-
     ning of this article -- to say nothing of German, Spanish,
     French, Japanese, Finnish, and Hindi, to name a few
     languages that benefit from large numbers of programmers or
     Web viewers or both.

     However, the ever-increasing internationalization of the Web
     (whether measured in terms of amount of content, of numbers
     of content writers or programmers, or of size of content
     audiences) makes it increasingly likely that the interface
     to the average Web-based dynamic content service will be
     localized for two or maybe three languages.  It is my hope
     that Maketext will make that task as simple as possible, and
     will remove previous barriers to localization for languages
     dissimilar to English.

      __END__

     Sean M. Burke (sburke@cpan.org) has a Master's in linguis-
     tics from Northwestern University; he specializes in
     language technology. Jordan Lachler (lachler@unm.edu) is a
     PhD student in the Department of Linguistics at the Univer-
     sity of New Mexico; he specializes in morphology and
     pedagogy of North American native languages.

     References

     Alvestrand, Harald Tveit.  1995.  RFC 1766: Tags for the
     Identification of Languages.
     "ftp://ftp.isi.edu/in-notes/rfc1766.txt" [Now see RFC 3066.]

     Callon, Ross, editor.  1996.  RFC 1925: The Twelve Network-
     ing Truths. "ftp://ftp.isi.edu/in-notes/rfc1925.txt"

     Drepper, Ulrich, Peter Miller, and Francois Pinard.
     1995-2001.  GNU "gettext".  Available in
     "ftp://prep.ai.mit.edu/pub/gnu/", with extensive docs in the
     distribution tarball.  [Since I wrote this article in 1998,
     I now see that the gettext docs are now trying more to come
     to terms with plurality.  Whether useful conclusions have
     come from it is another question altogether. -- SMB, May
     2001]

     Forbes, Nevill.  1964.  Russian Grammar.  Third Edition,
     revised by J. C. Dumbreck.  Oxford University Press.

perl v5.8.8                2005-02-05                          16

Generated on 2014-07-04 21:17:45 by $MirOS: src/scripts/roff2htm,v 1.79 2014/02/10 00:36:11 tg Exp $

These manual pages and other documentation are copyrighted by their respective writers; their source is available at our CVSweb, AnonCVS, and other mirrors. The rest is Copyright © 2002‒2014 The MirOS Project, Germany.
This product includes material provided by Thorsten Glaser.

This manual page’s HTML representation is supposed to be valid XHTML/1.1; if not, please send a bug report – diffs preferred.