Wednesday, June 10, 2009

Internationalization of GAE applications

Costs of badly planned internationalization

In my experience, internationalizing an application is very expensive if it is not planned upfront.

The first source of costs is developers who are used to lazily hard-coding labels. Extracting them at the end of the development process is always error prone because developers may have made assumptions about the exact labels. Good regression test suites can help detect such situations, but they cannot avoid the cost of the required fix and its corresponding test runs. With late label extraction, developers without enough context tend to add extra dictionary entries. When the extraction happens near the release milestone, developers in a rush and without the initial context also tend to produce undocumented dictionaries, which limits their re-usability and makes later defect fixes harder.

Non localized Java code
protected String formulatePageIndex(int index, int total) {
    String out = "Page " + index;
    if(0 < total) {
        out += " of " + total;
    }
    return out;
}

In the example above, I have seen the labels Page and of extracted instead of Page %0 and Page %0 of %1. Such a quick extraction makes it impossible to invert the two arguments! Think about naming conventions for people: in many countries the last name is displayed before the first name, in others the first name precedes the last name (%last %first compared to %first %last), and in Japan both names are printed without a separator in between (%last%first).
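The name-ordering point can be sketched with java.text.MessageFormat, which uses {0} and {1} where this post writes %0 and %1. The class name, the pattern constants, and the sample names below are hypothetical and only illustrate that the order of the arguments lives in the label, not in the code.

```java
import java.text.MessageFormat;

class NameOrderDemo {
    // Hypothetical patterns, one per convention; {0} is the first name, {1} the last name
    static final String FIRST_LAST = "{0} {1}";
    static final String LAST_FIRST = "{1} {0}";
    static final String LAST_FIRST_NO_SEPARATOR = "{1}{0}";

    // The calling code never changes: only the pattern decides the order
    static String formatName(String pattern, String first, String last) {
        return MessageFormat.format(pattern, first, last);
    }

    public static void main(String[] args) {
        System.out.println(formatName(FIRST_LAST, "Taro", "Yamada"));              // Taro Yamada
        System.out.println(formatName(LAST_FIRST, "Taro", "Yamada"));              // Yamada Taro
        System.out.println(formatName(LAST_FIRST_NO_SEPARATOR, "Taro", "Yamada")); // YamadaTaro
    }
}
```

With labels extracted as Page and of, no pattern could express the second and third lines.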

Localized Java code
// Requires: java.text.MessageFormat, java.util.ResourceBundle
protected String formulatePageIndex(int index, int total) {
    // Get the resource bundle already initialized for the correct locale
    ResourceBundle rb = getCurrentResourceBundle();
    // Prepare values
    Object[] values = new Object[] {Integer.valueOf(index), Integer.valueOf(total)};
    // Get the right localized label (MessageFormat patterns use {0} and {1} as placeholders)
    String label = rb.getString("PageIndexLabel_Page");
    if (0 < total) {
        label = rb.getString("PageIndexLabel_PageOf");
    }
    // Return the localized label with injected values
    return MessageFormat.format(label, values);
}
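To see the method above at work without any file on disk, here is a minimal sketch that loads bundle content from strings. The class name, the helper load(), and the bundle contents (including the French label) are assumptions made for the illustration; in a real project the entries would live in *.properties files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class PageLabelDemo {
    // Hypothetical bundle contents, written with MessageFormat placeholders
    static final String ENGLISH =
        "PageIndexLabel_Page=Page {0}\n" +
        "PageIndexLabel_PageOf=Page {0} of {1}\n";
    static final String FRENCH =
        "PageIndexLabel_Page=Page {0}\n" +
        "PageIndexLabel_PageOf=Page {0} sur {1}\n";

    // Build a bundle from in-memory properties content
    static ResourceBundle load(String content) {
        try {
            return new PropertyResourceBundle(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same logic as the localized method, with the bundle passed in explicitly
    static String formulatePageIndex(ResourceBundle rb, int index, int total) {
        String key = 0 < total ? "PageIndexLabel_PageOf" : "PageIndexLabel_Page";
        return MessageFormat.format(rb.getString(key), index, total);
    }

    public static void main(String[] args) {
        System.out.println(formulatePageIndex(load(ENGLISH), 3, 10)); // Page 3 of 10
        System.out.println(formulatePageIndex(load(FRENCH), 3, 10));  // Page 3 sur 10
        System.out.println(formulatePageIndex(load(ENGLISH), 3, 0));  // Page 3
    }
}
```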

The second source of costs is the missed opportunity to deploy localized builds early in the development process. In Agile environments, we expect to get runnable builds on a regular basis (at the end of each sprint, every 4 to 6 weeks, for example). These fool-proof builds can be demoed to customers to get early feedback. If the development organization can work with translators iteratively, there is a good chance of detecting localization defects while the cost of fixing them is still low. In the past, I have seen product developments hit hard when bi-directional languages (like Arabic and Hebrew) were introduced...

Different aspects of the internationalization

Internationalization (i18n) [1] has two aspects:
  • The translation of the labels;
  • The localization (l10n) of these labels.
Localization takes into account the language and the country, sometimes with variants within a country. For example, Spanish is spoken in 19 identified countries. In Mexico (es_MX), the language is slightly different from the one generally spoken in Spain (es or es_ES). In Spain, there are also regional languages like Catalan (ca_ES). Locale identifiers build on ISO standards (ISO 639 for languages and ISO 3166 for countries): a two-letter language code, optionally followed by a two-letter country code and then by a variant code. If the parts after the language are missing, most programming languages fall back on common defaults.
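In Java, the fallback order can be observed directly. The sketch below asks the default ResourceBundle.Control for the candidate locales it would try for Mexican Spanish; the class name and the base name "Labels" are hypothetical, and no bundle is actually loaded.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class LocaleFallbackDemo {
    // Lists the locales Java would try, from most to least specific
    static List<Locale> candidates(Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        // "Labels" is only a placeholder base name here
        return control.getCandidateLocales("Labels", locale);
    }

    public static void main(String[] args) {
        // Mexican Spanish falls back to generic Spanish, then to the root bundle
        for (Locale candidate : candidates(Locale.forLanguageTag("es-MX"))) {
            System.out.println("[" + candidate + "]");
        }
    }
}
```

The last candidate is Locale.ROOT, whose string form is empty; it corresponds to the bundle without any locale suffix.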

In order to ease application localization, the Unicode Consortium maintains the Common Locale Data Repository (CLDR) [2]. This repository is used and updated by many companies like IBM, Sun, Microsoft, Oracle, etc. The repository describes rules on how to:
  • Localize currencies;
  • Localize metrics (distance, speed, temperature, etc.);
  • Localize dates and calendars.
As of today, I think only timezone definitions are still not centrally managed... This is especially bad because conversions between Universal Time (UTC) dates and local dates are operating system dependent (Sun Solaris has small differences with Microsoft Windows, for example). Many tools use the Unicode CLDR information. For example, each release of the Dojo toolkit uses this information to provide the Calendar widget for 27 locales [3].
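As a small illustration of what such locale rules buy you, the following sketch formats the same amount as a currency for three locales with the JDK's NumberFormat. The class and method names are hypothetical. Recent JDKs take their locale data from CLDR by default, so the exact output for some locales can vary with the JDK version, which is why only the US result is annotated.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocalizedFormatDemo {
    // Formats an amount with the currency conventions of the given locale
    static String price(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        System.out.println(price(1234.5, Locale.US));     // $1,234.50
        System.out.println(price(1234.5, Locale.FRANCE)); // symbol, separators, and position differ
        System.out.println(price(1234.5, Locale.JAPAN));  // no decimal part for yen
    }
}
```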

Internationalization with different programming languages

Almost all programming languages have ways to facilitate application globalization. Java provides resource bundles (*.properties files), Microsoft .NET has resource files (.resx files), Python has dictionaries, etc. Although JavaScript lacks native support for globalization, some libraries offer various levels of support. To my knowledge, the Dojo toolkit is the first to provide full support.
Although an application on the Google App Engine infrastructure can be developed with only one programming language (Python or Java at the time of writing), it is highly likely that developers will use some JavaScript libraries to speed up their development. This is without counting the delivery of a similar front-end as a native application (made with Adobe AIR, Microsoft .NET, C/C++, Groovy, etc.).

In different situations, I have seen developers manually moving label definitions from one environment to another. Sometimes, definitions were left over, cluttering the system. In Agile environments, developers should focus on the requirements for the current sprint, leaving some tuning for later sprints. For example, at one point during the development, some labels defined in a JavaScript bundle might be moved to a Java bundle because the localization will be done server-side in a JSP file.

My solution is to put all labels in one localized central repository. The dispatch among the different programming languages is done at build time. When I looked for a format for this repository, I evaluated the candidates against the following criteria:
  • Easily editable;
  • Has a standard format;
  • Usable by static validation processes;
  • Has excellent re-usability factors;
  • Easily extensible to new programming languages.
I chose the TMX format (Translation Memory eXchange [4]). This is an XML-based format (good for editing, extensibility, and use by static validation tools) which has been defined to allow translation memory export/import between different translation tools like Déjà Vu. The XLIFF format would have been another good candidate.

The following table illustrates the flow of interactions between the different actors in a development team. It shows that, once the developers have delivered a first TMX file, testers and translators can work independently to push tested and localized builds to the customer. As explained later, if developers tune the TMX entries without updating the labels themselves, translators and testers (at least from the l10n point of view) can stay out of the loop: only steps [1, 7, 8] are replayed.

Simplified view of the overall interaction flow
Actors: Developers, Testers, Translators, Build process, End-users
1. Developers: write labels in one language into the TMX file. These labels are extracted from design documents.
2. Build process: generate the application for one locale.
3. Testers: produce a generic bundle to identify non-extracted labels.
4. Build process: generate the application for two locales.
5. End-users: use the application in one locale (switching to the test language is hidden).
6. Translators: use the initial TMX to produce n localized TMXs.
7. Build process: generate the application for 2 + n locales.
8. End-users: can use the application in 1 + n locales.


The following code snippet shows how an entry in the base TMX file is defined.

Snippet of a translation unit definition for a TMX formatted file
<tu tuid="entry identifier" datatype="Text">
 <tuv xml:lang="locale identifier">
  <seg>localized content</seg>
 </tuv>
 <note>contextual information on the entry and relations with other entries</note>
 <prop type="x-tier">dojotk</prop>
 <prop type="x-tier">javarb</prop>
</tu>

The key features of the TMX format are:
  • The format can be validated with an external XSD (XML Schema Definition);
  • One entry (tu: translation unit) can contain many localized contents (tuv: translation unit value);
  • Developers have a normalized placeholder (<note/>) to register contextual information;
  • Extensions are used by the build process to target the type of resource bundle to receive the localized label.
With such an approach, I have seen a drastic reduction in translation mistakes, especially thanks to the <note/> tag. Sometimes, graphical elements contain inter-related labels that cannot be grouped under a generic entity. The following set of elements illustrates the situation. The TMX approach saves translators headaches because they are simply informed about the relation between the four entities.


The conversion from the TMX file to the various resource bundles is done by an XSL transformation. With the continuous integration handled by Ant, the corresponding target appends the XSLT file coordinates to a copy of the TMX file and then requests the transformation with the <xslt/> Ant task (also available as <style/>). Depending on the machine performance and on the TMX file size, I found that the process can be time consuming. If this is your case too, I suggest you write your own little Java program to handle it. You can also use mine ;)

Stylesheet transforming label definitions for the Dojo toolkit
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text" />
 <xsl:template match="/tmx/body">
  {
  <xsl:for-each select="tu">
    <xsl:for-each select="prop">
      <xsl:if test="@type='x-tier' and .='dojotk'">
        "<xsl:value-of select="../@tuid" />":"<xsl:value-of select="../tuv/seg" />",
      </xsl:if>
    </xsl:for-each>
  </xsl:for-each>
  "build": "@rwa.stageId@"}
 </xsl:template>
</xsl:stylesheet>

Use of the stylesheet above to convert Dojo toolkit related definitions from the TMX files by an Ant task [5]
<target name="convert-tmx">
  <style
    basedir="src/resources"
    destdir="src/resources"
    extension=".js"
    includes="*.tmx"
    style="src/resources/tmx2dojotk.xsl"
  />
</target>
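For readers who prefer the "little Java program" route mentioned earlier, here is a minimal sketch based on the JDK's javax.xml.transform API. It is not my actual tool: the class name, the sample TMX content, and the simplified stylesheet (which emits key=value; pairs rather than the Dojo bundle) are all assumptions made to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TmxTransformDemo {
    // Hypothetical TMX content: one entry targets the Dojo tier, the other the Java tier
    static final String TMX =
        "<tmx version=\"1.4\"><header/><body>" +
        "<tu tuid=\"greeting\"><prop type=\"x-tier\">dojotk</prop>" +
        "<tuv xml:lang=\"en\"><seg>Hello</seg></tuv></tu>" +
        "<tu tuid=\"serverOnly\"><prop type=\"x-tier\">javarb</prop>" +
        "<tuv xml:lang=\"en\"><seg>Server</seg></tuv></tu>" +
        "</body></tmx>";

    // Simplified stylesheet: keeps only the entries flagged for the Dojo tier
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"text\"/>" +
        "<xsl:template match=\"/tmx/body\">" +
        "<xsl:for-each select=\"tu[prop[@type='x-tier']='dojotk']\">" +
        "<xsl:value-of select=\"@tuid\"/>=<xsl:value-of select=\"tuv/seg\"/>;" +
        "</xsl:for-each>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    // Applies the stylesheet to the XML document and returns the text output
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(TMX, XSL));
    }
}
```

A real tool would read the TMX and XSL files from disk (StreamSource also accepts a File) and write one output per target tier and locale.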

A+, Dom
--
Sources:
  1. Introduction to internationalization and localization on Wikipedia.
  2. Unicode Common Locale Data Repository (CLDR).
  3. Dojo toolkit API: dojo.cldr, dojo.i18n, private dijit._Calendar, dojox.widget.Calendar, and dojox.widget.DailyCalendar.
  4. Definition of the Translation Memory eXchange (TMX) format.
  5. Reference of the XSLT/Style task for Ant scripts.

2 comments:

  1. Does anyone have an XSD file for TMX? LISA only sports a DTD.

  2. To my knowledge, there's no XSD available. Because the DTD hasn't been updated for quite a while, it can be safe to write (translate) the corresponding XSD. It can be made available as an open document (CC BY-NC-SA)...

    Any volunteer?

    I might follow that path in the near future. If I do, I'll post it on domderrien.github.com, with the source available for modification (patch) at github.com/DomDerrien/two-tiers-utils.

    A+, Dom
