Internationalization in Content Management vs Internationalization in applications

December 11, 2006

Here is how the following issues play out in typical applications versus content management systems (CMS).

1) Character set and encoding selection at the UI, application and data tiers

I mention all of them separately because the choice has traditionally not been obvious. Let's look at the responsibilities of each tier.

User Interface:

The user must have fonts that can render the chosen character set, and a keyboard setting that can type it (in other words, the browser and operating system must support it). Applications targeted at users on Mac OS X and Windows 2000 or later can assume that Unicode fonts are present, but others cannot. Other platforms typically require the user to install a font that covers the Unicode character set (perhaps not a pan-Unicode font, but at least one covering the given language).

It is also possible that the font we want to use does not come in the Unicode character set, which forces either a different font or a different character set.

The next big issue is search. You want to make sure that search engines can index and find your site, so you need to use an encoding that the local search engines support. For instance, Google supports Unicode, but not all search engines do.

Also, if the application delivers email, keep in mind that the adoption of Unicode in email has been slow, especially for Japanese. Some traditional email utilities do not handle Unicode, though the situation is changing fast now that operating systems released in the last five years support Unicode natively.

Are there other UIs apart from the browser, such as smartphones and Pocket PCs? Mobile devices released after 2000, including smartphones and Pocket PCs, support Unicode, but it is still common to find applications on these devices that do not.

For a typical CMS, this becomes very important. You need to know where you are delivering content to, and whether automatic, lossless character set translation is possible from the format in the repository to the format on the UI. Only then can you decide which encoding to use.
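One rough way to test whether a translation is lossless is to encode the text into the delivery encoding and decode it back. The sketch below is only an illustration; the function name `can_encode_losslessly` is my own and not part of any CMS.

```python
def can_encode_losslessly(text: str, target_encoding: str) -> bool:
    """Return True if text survives a round trip through target_encoding."""
    try:
        encoded = text.encode(target_encoding)
    except UnicodeEncodeError:
        # At least one character has no representation in the target.
        return False
    # Decoding must give back exactly what we started with.
    return encoded.decode(target_encoding) == text


french = "caf\u00e9"                              # "café"
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"    # "namaste" in Devanagari

print(can_encode_losslessly(french, "iso-8859-1"))  # True
print(can_encode_losslessly(hindi, "iso-8859-1"))   # False
print(can_encode_losslessly(hindi, "utf-8"))        # True
```

A check like this could run at publish time, before content leaves the repository for a channel with a narrower encoding.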

If you want to use Quark or MS Word for either input or output, you need to be able to support or convert the extra characters they use, like “Smart Quotes”. These translations might involve some “loss”.
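As an illustration of such a lossy conversion, here is a minimal sketch that flattens a handful of word-processor punctuation marks to plain ASCII. The mapping table and the name `flatten_word_punctuation` are assumptions for this example; a real conversion would need a much longer table.

```python
# Hypothetical mapping; only a few common characters are covered.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2026": "...",  # horizontal ellipsis
})


def flatten_word_punctuation(text: str) -> str:
    """Replace typographic punctuation with ASCII look-alikes."""
    return text.translate(SMART_TO_PLAIN)


print(flatten_word_punctuation("\u201cSmart Quotes\u201d"))  # "Smart Quotes"
```

The loss here is that opening and closing quotes become the same character, so the original text cannot be reconstructed from the converted one.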

Application Encoding and Characterset:

Applications need to emit formatted strings in the languages they support, e.g. a date in DD-MON-YYYY format. They may also need to sort alphabetically, search text, count characters and words, and compare strings. All of these require the application to support the character set and encoding of the given language. You might also have hardcoded strings, and you might want to keep your JSP/ASP.NET pages in the same encoding you render to.
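To see why counting and comparing need character set awareness, consider the same word stored in two valid Unicode forms. This is a small sketch using Python's standard `unicodedata` module; the helper name `same_text` is made up for the example.

```python
import unicodedata

composed = "caf\u00e9"      # the accented letter as a single code point
decomposed = "cafe\u0301"   # plain "e" followed by a combining acute accent

# Both render identically, yet a naive comparison and count disagree.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))  # True

# Character count and byte count also differ once encoded.
print(len(composed.encode("utf-8")))    # 5
```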

Database Encoding and Characterset:

Content encoding in the database is perhaps the most debated point. Choosing the correct encoding affects storage space, the number of characters that fit in typical text fields (as opposed to LONG and CLOB fields, which require additional programming to fetch), and the processing needed to store and interpret data, including character set and encoding conversion. The database is typically also the most expensive piece in the architecture of a CMS.

Typically people advocate choosing the encoding based on the language of 80% of the stored content. If 80% or more of the content is English, then UTF-8 saves space while requiring extra processing in some cases. If most content uses characters beyond US-ASCII (for example CJK or Indic scripts, which take three bytes per character in UTF-8 but two in UTF-16), then UTF-16 might be a better bet, though it wastes space on the English characters. If it's anything in between, the cost-benefit in either case is not clear. Chances are that if you want to support Western European languages only, ISO 8859-1 might turn out to be the most economical choice.
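The space trade-off is easy to measure on sample content. The sketch below counts encoded bytes for three short samples; the helper `encoded_sizes` and the sample strings are my own, and real numbers depend on your actual content mix and on how your database stores each encoding.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in each candidate encoding."""
    sizes = {}
    for encoding in ("utf-8", "utf-16-le", "iso-8859-1"):
        try:
            sizes[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            sizes[encoding] = None  # text cannot be stored in this encoding
    return sizes


samples = {
    "English": "Hello, world",
    "German": "Gr\u00fc\u00dfe aus K\u00f6ln",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
}

for language, text in samples.items():
    print(language, encoded_sizes(text))
# English {'utf-8': 12, 'utf-16-le': 24, 'iso-8859-1': 12}
# German {'utf-8': 17, 'utf-16-le': 28, 'iso-8859-1': 14}
# Hindi {'utf-8': 18, 'utf-16-le': 12, 'iso-8859-1': None}
```

Running this over a representative export of the repository, rather than three strings, would give a more honest basis for the 80% rule of thumb.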

2) Timezone, Date, Time

Internationalized applications typically need to support multiple timezones, either on a single installation or as one timezone per installation. When supporting multiple timezones, the application needs to choose a default timezone for the user and offer a way to change it. Typical applications support this per user, while in a CMS we worry more about the content creation side; on content delivery we usually do not care. We would typically show a date/time based on the perceived location of the site rather than the location of the user. Of course there are exceptions, like calendaring, schedules and car pickup times, which depend on user or event timezones and become important accordingly.
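The difference between site time and user time can be sketched by storing one UTC moment and rendering it twice. To keep the example self-contained I use fixed UTC offsets, which ignore daylight saving; the two zones and the name `render_timestamp` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for illustration only (no daylight saving rules).
SITE_TZ = timezone(timedelta(hours=5, minutes=30), "IST")  # assumed site location
USER_TZ = timezone(timedelta(hours=-5), "EST")             # assumed reader location


def render_timestamp(utc_moment: datetime, tz: timezone) -> str:
    """Format a timezone-aware moment in the given zone."""
    return utc_moment.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z")


published = datetime(2006, 12, 11, 18, 0, tzinfo=timezone.utc)

print(render_timestamp(published, SITE_TZ))  # 2006-12-11 23:30 IST
print(render_timestamp(published, USER_TZ))  # 2006-12-11 13:00 EST
```

A CMS delivering a news story would normally use the first form for every reader, while a calendaring feature would need the second.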

3) Currency

Most applications need to handle money and currency explicitly. In a CMS, currency is usually a translation issue and not an application issue.

4) Dimensions, speeds and other measurements

In a typical application it becomes important to handle dimensions in the measurement system in use (FPS or metric). In a CMS, only graphics may be dimension aware; for the rest of the content, this is again addressed as a translation issue.

Currency and dimensions as a translation issue: In a CMS, the numbers inherently form a part of the body of the content, and are either hard to identify and separate or not intended to be separated. They typically require translation even within the same language. For instance, a story on speed records needs to say 300 MPH in Britain, while it needs to say 500 KMPH in India. A statement like “an area the size of Luxembourg” for European consumers needs to be translated to something like “an area the size of Lake Iliamna” for Alaskan readers, or “3 times the size of New York City” for mainland American consumers. Similarly, an amount of 6 billion kroner needs to be translated to “half a billion pounds sterling” or “£500M”, depending on the space available.
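The speed example shows why this is translation rather than arithmetic: 300 MPH is about 483 km/h, but an editor would write 500. A rough sketch of that editorial rounding, assuming a rounding step of 50 and a function name of my own choosing:

```python
KM_PER_MILE = 1.609344  # exact by definition of the international mile


def mph_to_editorial_kmh(mph: float, step: int = 50) -> int:
    """Convert mph to km/h, then round to the nearest multiple of step."""
    exact = mph * KM_PER_MILE
    return int(round(exact / step) * step)


print(mph_to_editorial_kmh(300))          # 500
print(mph_to_editorial_kmh(300, step=1))  # 483
```

Which step is appropriate depends on the story, which is exactly why a translator or editor, not the application, usually makes the call.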

5) Labels and Navigation:

In a typical application, there is a one-to-one mapping between the different languages, so labels, messages and navigation are translated one to one. These are typically handled as “resource bundles”, which are text files keyed by message number: say, number 149 meaning “State” in the US English bundle, “County” in the UK English bundle, or “Province” in the Italian bundle.

Sometimes it is not that trivial and requires a change in the template; for instance, the address format differs between countries.

CMS-based sites may have further differences, such as a different set of content per market, or only a subset of translated content being available. This drives the need for an option to add or remove navigation items.
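A resource bundle lookup can be sketched with plain dictionaries. Real bundles live in files (for example Java `.properties` files); the in-memory `BUNDLES` table, the locale codes and the `label` function below are assumptions made for the example.

```python
# Hypothetical bundles keyed by locale, then by message number.
BUNDLES = {
    "en_US": {149: "State"},
    "en_GB": {149: "County"},
    "it_IT": {149: "Provincia"},  # Italian word for "province"
}


def label(message_id: int, locale: str, default_locale: str = "en_US") -> str:
    """Look up a label, falling back to the default locale if missing."""
    bundle = BUNDLES.get(locale, {})
    if message_id in bundle:
        return bundle[message_id]
    return BUNDLES[default_locale][message_id]


print(label(149, "en_GB"))  # County
print(label(149, "fr_FR"))  # State (no French bundle, so the default is used)
```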

6) Translation workflows 

Typically applications do not require a translation workflow, while CMS-driven global sites may. It is not a simple workflow. Every item created in the base language needs to state whether it is relevant for all global markets or for specific other markets. Those markets need to decide whether they want to carry the content and, if so, whether to send it for translation or accept it as is; otherwise they decline it. The translation itself needs to be previewed, and edited if required, before it makes it to the site.
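One way to picture this workflow is as a small state machine per content item and market. The state names and allowed transitions below are my reading of the steps above, not a description of any particular CMS product.

```python
from enum import Enum


class State(Enum):
    OFFERED = "offered to market"
    DECLINED = "declined"
    ACCEPTED_AS_IS = "accepted as is"
    IN_TRANSLATION = "sent for translation"
    IN_REVIEW = "translation in preview/edit"
    PUBLISHED = "published"


# Allowed moves; a rejected review goes back to translation.
TRANSITIONS = {
    State.OFFERED: {State.DECLINED, State.ACCEPTED_AS_IS, State.IN_TRANSLATION},
    State.ACCEPTED_AS_IS: {State.PUBLISHED},
    State.IN_TRANSLATION: {State.IN_REVIEW},
    State.IN_REVIEW: {State.IN_TRANSLATION, State.PUBLISHED},
    State.DECLINED: set(),
    State.PUBLISHED: set(),
}


def advance(current: State, target: State) -> State:
    """Move to target if the workflow allows it, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


state = State.OFFERED
for step in (State.IN_TRANSLATION, State.IN_REVIEW, State.PUBLISHED):
    state = advance(state, step)
print(state.name)  # PUBLISHED
```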

7) Spellchecks:

Spellcheck, though easily available for some languages, poses special challenges for certain others (like Swedish or Arabic), where words can be compounded or vowels may be dropped, and the result may or may not be correct depending on the context. Some CMS products may gain a short-lived differentiator by innovating here.

8) Phonetic keyboards

While it is a given that content authors will use the keyboard of the language they write in, the same is not necessarily true for end users. For instance in India, users like myself can read and write Hindi but can't type it; we have simply never used the keyboard. So for users like me, message boards and comment boxes provide an option of phonetic keyboards, where I type on an English keyboard and the system picks the corresponding Hindi letters.
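To give a feel for what a phonetic keyboard does, here is a toy transliterator for a handful of letters. The romanization scheme, the tiny tables and the name `to_devanagari` are all assumptions for this sketch; real input methods cover the full script and handle many more cases.

```python
CONSONANTS = {
    "k": "\u0915", "t": "\u0924", "n": "\u0928", "m": "\u092e",
    "r": "\u0930", "s": "\u0938", "h": "\u0939",
}
# Vowels at the start of a syllable use independent letters.
INDEPENDENT_VOWELS = {
    "a": "\u0905", "aa": "\u0906", "i": "\u0907",
    "u": "\u0909", "e": "\u090f", "o": "\u0913",
}
# Vowels after a consonant use dependent signs; "a" is inherent, so empty.
VOWEL_SIGNS = {
    "a": "", "aa": "\u093e", "i": "\u093f",
    "u": "\u0941", "e": "\u0947", "o": "\u094b",
}
VIRAMA = "\u094d"  # joins two consonants into a cluster


def to_devanagari(latin: str) -> str:
    """Greedy left-to-right transliteration of lowercase Latin input."""
    out = []
    i = 0
    after_consonant = False  # True when the last output was a bare consonant
    while i < len(latin):
        for width in (2, 1):  # try two-letter keys before one-letter keys
            chunk = latin[i:i + width]
            if len(chunk) < width:
                continue
            if chunk in VOWEL_SIGNS:
                table = VOWEL_SIGNS if after_consonant else INDEPENDENT_VOWELS
                out.append(table[chunk])
                after_consonant = False
                break
            if chunk in CONSONANTS:
                if after_consonant:
                    out.append(VIRAMA)
                out.append(CONSONANTS[chunk])
                after_consonant = True
                break
        else:
            # Unknown character (space, digit, ...): pass it through.
            out.append(latin[i])
            after_consonant = False
            width = 1
        i += width
    return "".join(out)


print(to_devanagari("namaste"))
```

Typing "namaste" produces the six code points for the Hindi greeting, with the virama inserted automatically between "s" and "t".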

Language Aware and Language Tolerant modes for internationalization:

Depending on your requirements, you can get away with a “Language Tolerant” application. Language Tolerant applications don't care what you put into them as long as it is in the correct encoding. They treat text much like an image, where the person uploading the image needs to worry about the size, palette, etc. As a result, they cannot offer language-specific features such as text sorting or spell check for the languages they merely tolerate, even though they may offer them for others.

Most CMS products, programming languages, databases etc. are aware of a few languages and tolerant of the rest. It requires care to ensure that the required features are provided, either directly or by integrating with an appropriate word processor or other external system to fill the gap.

On a further note, let me give a small warning on the cost of internationalization. Internationalization has two parts: globalization and localization. Globalization means making the core application tolerant of the different languages, currencies and dates it needs to support. Typically, an application built from the ground up will require about 30% extra effort for this.

Localization means taking a globalized application and providing a language-aware interface for a given language. This can be either trivial or very expensive, depending on the native support available from the core platforms and the desired features. It could drive you to purchase extra software licenses, spend more on integration, make more UI changes such as local address formatting, and spend a lot more on local validations such as zip code checkers and phone number format verification. This is the part that is difficult to estimate and lacks a benchmark number.
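Local validations are a good example of why this cost is hard to estimate: every market brings its own rules. The sketch below checks postal codes for three countries with deliberately simplified patterns; the table, the patterns and the name `is_valid_postal_code` are illustrative assumptions, not authoritative postal rules.

```python
import re

# Simplified, illustrative patterns only.
POSTAL_PATTERNS = {
    "US": re.compile(r"[0-9]{5}(-[0-9]{4})?"),               # ZIP or ZIP+4
    "IN": re.compile(r"[1-9][0-9]{5}"),                      # six-digit PIN
    "CA": re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]"),
}


def is_valid_postal_code(code: str, country: str) -> bool:
    """Validate code for country; unknown countries are accepted as-is."""
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        return True  # tolerant mode: no rule known, so do not reject
    return pattern.fullmatch(code.strip()) is not None


print(is_valid_postal_code("90210", "US"))    # True
print(is_valid_postal_code("9021", "US"))     # False
print(is_valid_postal_code("110001", "IN"))   # True
print(is_valid_postal_code("K1A 0B1", "CA"))  # True
```

Accepting codes for countries without a rule mirrors the tolerant mode described earlier: the application stores what it is given and only validates where it has local knowledge.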
