find . -name "*" -exec grep "Exception" {} \; -print | more

Makes sense? If you are a developer and cannot make sense of the above - chances are that you will appreciate a GUI tool to ease your searches in log files. 

A few colleagues (the idea came from Regu) created a tool they named insight last year and have just made it open source. Here is the direct download link. It comes with a GUI, so don't worry: you will not need to touch code or learn cryptic commands.

The developers who worked on it are quite sharp, so I am sure that if you ask a question, add to the wishlist or report a bug, you will get a response soon.

There was a time around 1999/2000 when first-generation portals were coming into place to solve the problem of consolidating information from existing web sites. Hence they were strong in web “clipping” and allowed easy creation of dashboards. The better ones offered single sign-on. However, they were up against very stiff competition: simple links and NT Domain/ADS authentication.

These capabilities were driven by the need for a single entry point to your corporate applications. That seemed to be the only driver.

Then came mash ups, JSR 168 and WSRP.

JSR 168, like J2EE JARs and WARs, offered the Java world a way to deploy the same portlet anywhere. WSRP, on the other hand, was more tuned to mash-ups. No matter where you are running the application, as long as it follows WSRP, you can club UI elements of these applications together in a single page.

Now again, these capabilities fix the same problem, but allow easier creation of mashups and dashboards. Apart from that, they also try to increase application longevity and relevance: by removing container lock-in, and by providing a standardized mash-up protocol at the front end that keeps intranets talking to applications on heterogeneous and non-standard platforms.

So the focus seems to shift from the end-user view to the development and deployment view of a single entry point to an enterprise.

After that, in the last two years, vendors scrambled to add a whole host of applications over the infrastructure - collaboration, content management and others.

In the last two years, the real change has started to happen. Enterprises have started looking at SOA. Business agility, M&A and dynamic business drivers in general, along with increasing spend on IT, are forcing enterprise applications to be more dynamic and agile than ever before.

The need for this agility drives Service Orientation at the back end layer, but leaves the choice of front end open.

Here, the portal vendors realized that they were in very good shape to cater to “above service bus” needs. They have Struts-based UI frameworks and other ways to create UI; they have navigation builders, authentication frameworks and SSO enablement; they provide frameworks for inter-application communication (for both data and events) and extremely good support for existing applications as well. So theoretically, just as SOA offers agility for business logic and business data, portal architecture offers the same for presentation. It caters to a single application with multiple brands as well as multiple applications with the same look and feel. It allows tying together applications built completely independently.

So transactional applications driven by SOA drive portal from the other direction.

Portal vendors have responded to it differently. WebSphere offers capabilities at both ends. BEA offers WebLogic Portal for SOA-driven needs, AquaLogic UI for intranet-driven needs and a combination of both for needs requiring both.

Personally, with so much happening on the presentation layer - especially RIA and partly disconnected applications - it is hard to imagine the presentation of browser based applications remaining static for a very long time.

With the above capabilities, the drivers which make an organization go to market for portal products have increased to cover:

1) Corporate intranets and extranets (the traditional need of a single entry point to corporate applications, moving on from just sign-on to deep links to integrated workflows across multiple applications, thus contributing to real productivity gains). Dashboards, report presentations and other aspects of BI are also being given increasing importance here.

2) Customer self service (some institutions like banks, telecoms etc. have the same user subscribing to multiple services; a single entry window to all of them leaves users less confused, thus reducing service costs).

3) Standardized web platforms for the organization (typically driven by SOA initiatives or by having an unmanageable set of heterogeneous products - like 3 portal and 6 CM products), which also provide a set of “mini” applications re-usable across the enterprise.

4) Collaboration (essentially a single workflow involving multiple people and multiple applications, primarily resulting in the creation of a document).

Looking at the new set of feature improvements in portals - increased mashup capability for non-WSRP applications, embracing outside-the-firewall tools and collaboration infrastructure, increasing content management, increasing interfaces with non-browser applications (like MS Office), user-created forms-based applications via BPM, and more - they seem to be poised to address yet another business need: the need for ad-hoc applications.

Thanks to my colleague Kishan A, I saw this write-up on TheServerSide, “Spring is the new Java EE” by Salil Deshpande, ex-CEO of The Middleware Company (the company that originally created TheServerSide.com and TheServerSide Java Symposium). It summarizes the changes in the Java world in the last 3 years and sums them up in one word - Spring.

I found it very interesting, as I had coded my last serious work in Java roughly 3 years back and then jumped on the .NET bandwagon. Of late I had been struggling to get up to date with Java technology, and was noticing roughly the same things that Salil points out so well.

So in short, the perception is:

1) EJBs are a thing of the past. POJOs are back. Spring is the platform now. Everything is based on Spring. <Quote>

“Last but not least, next generation application servers from BEA, and maybe IBM, will be built on top of Spring. Am I the only one that finds this mind-blowing?”

</Quote>

2) .NET and Java continue to co-exist, with many Java things being ported to .NET and, more slowly, .NET things being adopted in Java.

3) Service Oriented Architecture and open source seem to go hand in hand. There are open source ESBs like Mule.

4) Dependency injection frameworks and dependency injection metaframeworks seem to be the in thing. I had earlier expressed my frustration with the abundance of Java frameworks; however, Salil seems to say that the Java community has no confusion: Spring is the way to go, all the way.

5) It's the UI technology which has seen the most innovation. I had put my earlier views here; however, the alphabet soup is continuing to grow with JavaFX, F3, Flex being open source and what not.

6) RoR is very important, has good press, has some money (though small), and the war of metaprogramming has not yet been won - with Groovy, Grails and JRoR on JRuby.

Some of the other things I noted were:

  • The IDEs are actually complete now, not requiring you to go out to the command prompt every few seconds. Intellisense does work even for JavaScript. Most new projects immediately release Eclipse plug-ins.
  • Communication between tiers is still not easy. Unlike .NET, you have a lot more work to do. Hopefully, someone will apply the concept of Windows Communication Foundation to Java (or maybe it exists and I haven't seen it yet).
  • There seems to be a lack of a centre of gravity. Sun is no longer it. It could easily be IBM, but it doesn't seem to be. Oracle seems to be the most aggressive, but doesn't have a large fan base. The fragmented open source community seems to be the biggest driver.
  • There are more standards than ever, but less than enthusiastic compliance, with vendors having proprietary full-featured interfaces and partly handicapped standards-compliant interfaces.
  • There are hardly any clear and distinct differentiators that Java has now. It is hardly a market leader in any stack, from servers to mobiles. Without significant commercial investments, it could easily be “legacy” in the next 3 years.
  • Supported Linux servers (read Red Hat with JBoss) are more expensive than supported Windows servers (with IIS/.NET Framework as an app server). Yes, you read it right. This reinforces my previous point.
  • The learning curve for new developers is higher than ever before. A typical developer has to learn Java, JSP, JavaScript, Struts, Spring, AOP, Hibernate, SQL, XML DOM, Quartz, Swing, GWT, JMS, MDB, Axis and practically a new open source component for every task they take up. All projects have a complex framework; how it works remains black magic, and debugging with all that magic around you is like searching for your car keys under the lamp post (regardless of where they fell). By the time you learn the framework and catch up on the lost productivity, it is time for a new project with yet another complex framework. Whether the use of these open source components has actually increased our productivity remains a big question. Data on hours per FP, at least, is not going down. While open source projects, frameworks and components let us write less code and provide more features than we generally would, it's not really helping improve productivity. (Or maybe guys just spend the time on beer if they are ahead on the FP delivered.)

I really don't think that doomsday will happen. There is just too much investment already made in it for it to fade away. It will re-invent itself, and maybe the popularity of Spring is the beginning of it.

My Views on RIA

April 6, 2007

Having used asynchronous/partially refreshing interfaces and rich client interfaces since '99, I was quite surprised at Ajax and RIA suddenly being seen as new and happening. Thousands of DHTML/JavaScript snippets have always been available from individual developers and boutiques offering RIA, but it was a tough task finding what worked and what was flaky, and it invariably required a lot of coding.

However with RIA becoming popular, we started getting more integrated offerings in terms of toolkits. There were and are multiple approaches:

Pure Browser based:

- Pure JavaScript toolkits: Dojo Toolkit, Yahoo UI Library, Open Rico etc.

- Server side Java toolkits to generate RIA client code: GWT, Echo2 etc. (possibly JSF can also fall here)

Rich client:

- Flash based: either Flex/Atlas feeding Flash, pure Flash widgets, or Flash generated by OpenLaszlo

- Windows Presentation Foundation - Microsoft's answer to Flash - with WPF/E available, it now supports many browsers, not just IE 7.

- Traditional rich client (applets, ActiveX etc., and the not so traditional - Java's F3)

In the past, finding a good JavaScript or ActiveX component used to be the only practical option, with the rest being too heavy or too flaky. For instance, a few years back it took us over two weeks to optimize JavaScript tree menus so that they could render 2000 nodes.

Today, all of the above are viable. But the question is - which is most pervasive (i.e. will work for most users - based on the current hardware / software they have), which would last some time and which are least likely to break with server side application upgrades.

I find JSF and server side toolkits like GWT slightly difficult to use, as there are different versions of the JDK and different server side frameworks that we need to work on. However, any stand-alone client side item - whether Flash, applet or JavaScript - is good enough. I am in no way writing the server side toolkits off. If your need is limited to a single product/platform and you have developers who understand event-driven programming - like Java Swing - this may be the best option, allowing you to debug in proper IDEs, which must be a lot more robust than difficult-to-test JavaScript. I have talked to developers who swear by GWT having taken away their cross-browser worries. And I even find a greater willingness amongst programmers to write Java code using GWT than JavaScript. Now that is a very real advantage.

So among Flash, applets and JavaScript, my personal experience forces me away from applets. It's very difficult to make applets light and to get them to work (at least amongst the ones I have used - not necessarily coded), and they take just too much time to load - even F3. ActiveX used to be slightly more reliable but is increasingly facing trust issues. Flash is robust, fast and pervasive, and JavaScript-based RIAs work well, as long as they have gone through a lot of quality control and testing for cross-browser compatibility.

In my personal opinion, I would prefer JavaScript over Flash, even more so with the advent of toolkits. The reasons for that are:

  • It looks one with the rest of the page.
  • You can select text.
  • You don't have to suffer “click here to activate”, forcing you to click twice when all you want is to push a button or expand a node.
  • You don't want browser selection to be chunky etc.
  • You don't need to worry about JavaScript vs Flash (or whatever) interaction, or about how to share data and events between different active windows on the page if the entire page is not rich client.

But Flash-like clients are not a write-off; they bring huge value. In fact, in the content management, media and print fields they can have huge advantages which JavaScript can't offer.

  • You can do multiple file select - no need to upload one file at a time.
  • While editing (let's say you are posting a yellow page entry or classified), you can take advantage of kerning, hyphenation and justification, and are able to calculate real column inches.
  • You can actually play video and audio.
  • It's slightly easier to get them to actually work; after all, their environment is more uniform than the range of browsers. (But wait, before you get carried away by the 98% figure that Flash claims: Flash 7, 8 and 9 are all present, Flash doesn't come by default, you need to worry about making your script compliant across Flash versions, and adoption is driven by biggies like MySpace and YouTube.)

I am faced with a situation where I have to write components which work on a set of existing applications. The existing applications are based on Java 1.4/5/6 on different frameworks. So at this moment, since I don't understand how server side JavaScript generators can be used, I won't use them. I will go with JavaScript using Yahoo UI, and Flash where I must (generated using OpenLaszlo). However, once I understand server side toolkits better, I will give them a real shot.

If you need to start a new web application today - what would you base it on?

Will the client experience be managed by pure HTML or RIA (Yahoo UI or GWT or Dojo Toolkit or ATF or OpenLaszlo or Flash or Swing…)?

Will the web tier use MVC (Struts 2.0 or 1.3 or Spring MVC) or Tapestry or JSF (MyFaces, Seam)…?

What about templating and mashups? Tiles / Velocity / FreeMarker / portal servers…

How to address security and state management for an Ajax UI? Will you use a BPM engine?

And on the app layer? POJO, AOP (Spring…), EJB…

Which JDK: 1.4/5/6 (half the world is on 1.4)?

Java technology has never been as fragmented as it is now. There is no centre of gravity with Sun no longer being a dominant player.

This is making life extremely difficult for enterprises and architects - as you need to bet on a technology while creating enterprise applications which are supposed to last a few years.

We (MindTree, as a software development and consulting organization) are increasingly finding customers requesting our help to define a technology stack and, if possible, a framework. Neel, who heads the Framework group at MindTree, has an innovative solution in the form of a Meta Framework.

His team has created thin wrappers for various layers of frameworks which make accessing one layer from another uniform (regardless of whether it's Spring or an EJB behind the scenes, and regardless of whether it's Struts or Spring MVC calling the business logic).

This approach makes it possible to swap different layers in and out without impacting the other layers. This way he has been able to cut the problem in half (assuming only one layer will go desperately out of fashion at a time). Also, developers don't need to re-learn a lot if a new framework component is introduced.
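To make the thin-wrapper idea concrete, here is a rough sketch in Python rather than Java, purely for brevity. It is not the MindTree Meta Framework: every name in it (`CustomerService`, `PojoBackedService`, `EjbBackedService`, `web_action`) is hypothetical, and it only illustrates how a calling layer can depend on one uniform contract while the implementation behind it is swapped.

```python
from typing import Protocol


class CustomerService(Protocol):
    """Uniform contract the web tier codes against (hypothetical name)."""

    def lookup_customer(self, customer_id: int) -> str: ...


class PojoBackedService:
    """Stands in for a Spring-managed POJO behind the wrapper."""

    def lookup_customer(self, customer_id: int) -> str:
        return f"customer-{customer_id} (pojo)"


class EjbBackedService:
    """Stands in for a session EJB behind the same wrapper."""

    def lookup_customer(self, customer_id: int) -> str:
        return f"customer-{customer_id} (ejb)"


def web_action(service: CustomerService, customer_id: int) -> str:
    """A controller (think Struts action or Spring MVC handler) sees only the contract."""
    return service.lookup_customer(customer_id).upper()
```

The point of the sketch is that `web_action` never changes when the business layer implementation does, which is what limits the re-learning when one layer is replaced.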

He also spends time tracking the list of mature options at each layer and has opinions on which technologies you can bet on. As always, the best option depends on requirements and on what the customer organization has already invested in. Organizations may end up having hybrids, based on the demands of individual options.

Faced with the above dilemma for a development I am going to jump into, I approached him and walked back with a stack of

(YUI And/OR OpenLaszlo ) => (Struts 2 OR Spring MVC) => Spring

though I am still not convinced on why not Seam.

Today, for the third time in the 6 years I have put in as an architect, I am being pushed to create unnecessary distribution in the architecture to get around CPU licenses. I am still not convinced that I need to give in to it.

The problem is simple. Let's say I need a CMS, a search engine, an imaging library and a portal server in the architecture. All these components can happily co-exist on all machines, but the problem is that all this software has CPU-based licensing, and the vendors will possibly force the client to pay 3 times as much in licensing fees as required if we were to take the simple approach.

So what's the alternative? Distribute?

You put the imaging library on its own server, keep the CMS, search engine and portal on their own separate machines, and possibly create redundancy by having a spare machine on which all of these are installed.

The result is higher network traffic, slower applications and complex deployment operations.

It's high time vendors figured out a way of defining CPU thresholds on each machine, to let clients have, let's say, a 1 CPU license on a 4 CPU machine - or at least monitor the utilization for license fee enforcement rather than enforcing licenses for the entire deployment architecture.

Do you think it's reasonable to tell vendors: I will buy a 2 CPU license but put it on 4 machines, because I know I am not going to use more than that?

What do you do in these cases? Negotiate hard, pay up, go for a different license model with the vendor, or complicate the architecture?

Regu, whose IBM Yahoo OmniFind search review I posted earlier, has a very interesting byte about Conventions over Configuration.

He believes that with the increasing commoditization of IT, cost, productivity etc. are very important, and he goes on to suggest that we could use conventions instead of configuration (like Ruby on Rails does) towards this end. It's a great thought, I believe.

However, I do not believe IT is a commodity as of today. I won't explain that, as Sadagopan has done a good job of articulating the same. He in fact actively opposes Nicholas Carr's view that IT doesn't matter. I am not sure that there are many analysts who dispute the value of IT spending, especially with IT being 50% of capex these days. Read the story here http://123suds.blogspot.com/2007/01/it-does-matter.html

However, I strongly believe that most things start tailor-made and later bifurcate into commodity and designer ware. That is very likely to happen with IT as well. So commoditization is inevitable.

I am also a great believer in less code contributing to maintainability (rather than flexibility). So personally I am not a big fan of excessive configurability (as it invariably leads to a lot more code, until you use a rules engine). Have a look at this interesting post from Donald Ferguson - ex-IBM, new Microsoft employee.
http://www-03.ibm.com/developerworks/blogs/page/donferguson?entry=less_code

Similarly, I am not a big fan of code generators, as once you customize the generated code, the code generators cannot help you. But I think the meta programming guys have cracked it. RoR is showing the way for meta programming. Java has a poor cousin in JSR 52 (Standard Tag Library), which is just a start, and that too behind its time. One wants a lot more. Similarly, I have seen on at least two occasions, on large projects in our company requiring Swing-based forms, architects going in for meta-programming of those using an XML-based language they defined.

Using conventions does make sense. Swedish, Arabic, Sanskrit and to some extent most languages allow you to join words and prefix/suffix part-words to make new words which mean as much as sentences. If we can learn that, conventions should come naturally to us. It's a good thought, and Sun, IBM, Oracle or whosoever's job it is to drive Java these days - please take notice.

That we need to identify a user over a session is a given for most web apps. Sessions typically inject state, binding users to web servers. However, such binding imposes certain issues, like the inconvenience of having to log in again in case of server failure, and issues in distributing load and sessions evenly. Though these issues do not sound that important, their visibility imposes a challenge on architects to either convince the CIOs or to overcome them. Technology and vendors have been obliging, but does that help?

Session failover has been around in J2EE app servers for a while. Session failover has been available on the Microsoft platform since .NET was born. DB-based sessions and client-based sessions (either hidden variables or hidden frames) have always been present. But are they used?

A quick survey reveals that app server based/IIS based session failover is used by architects to answer the tough questions from CIOs, but ultimately it is not used in production due to the performance degradation it causes. Depending on the session size, enabling session clustering can slow down your system by anywhere from about 30% to up to 500%! So if you can limit the session size to < 10 KB, or if resources are not an issue, you can venture there.

DB-based sessions are quite heavy but slightly less expensive. The problem with these is that clustered DB servers are expensive. So you want to minimize the session size, both to increase performance and to keep the cost low by having a smaller session DB server.

Similarly, client-based sessions impose a two-way transfer of session data for every request, hence practically limiting them to the same range (< 10 KB).

So what if you have bigger data?

1) You can consider Cluster aware cache. Some statistics indicate that these are cheaper than combined Software/Hardware licenses for DB.

2) You can consider cluster unaware cache - but you need to make sure that all information required to “re-build” session data is available either in clustered session or with the client.

To take an example, one of the applications I was involved in required shopping for a hotel. In this case, we would keep the criteria which led to finding the hotel with the client. The hotel results were kept in a cluster-unaware cache. The load balancers were kept sticky. So on a user request, if the data is available in the session, users get the data; otherwise the query is fired to the source systems again.
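The hotel example can be sketched roughly as follows. This is an illustration in Python, not the code of that application: `StickyNodeCache`, `hotel_results` and the `query_source` callback are hypothetical names, and the cache is a plain in-memory dictionary standing in for whatever per-server cache is used.

```python
class StickyNodeCache:
    """Per-server, cluster-unaware cache; its contents are lost if the node dies."""

    def __init__(self):
        self._results = {}

    def get(self, key):
        return self._results.get(key)

    def put(self, key, value):
        self._results[key] = value


def hotel_results(criteria, cache, query_source):
    """Serve results for criteria sent by the client, re-querying on a cache miss."""
    # The client holds the search criteria, so any node can rebuild the results.
    key = tuple(sorted(criteria.items()))
    results = cache.get(key)
    if results is None:  # e.g. first request, or a new node after failover
        results = query_source(criteria)
        cache.put(key, results)
    return results
```

With sticky load balancing the same node normally answers repeat requests from its cache; after a failover the new node simply pays for one extra query to the source system instead of forcing the user to start over.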

Having said that, my quick survey tells me that DB-based sessions are the most popular for applications where you do want to use clustered sessions in production (rather than app server/IIS based sessions), even though they require additional code.

It is difficult to come across a CMS implementation where the business owners do not complain about the hardware. It could be that the speed is too slow, or it could be that the CMS systems require too much hardware. The seasonal nature of application usage does not make it any easier.

Fortunately, performance tuning is rapidly becoming more of a predictable science than the witchcraft and art it was perceived to be earlier.

Here are the top reasons why a typical CMS Based site could be slow:

1) Complex, dashboard style content delivery pages:

If the application has complex dashboard style pages, showing different content items from different “sections” of the content repository, chances are that it will require too many queries and will be too heavy on the database. Typically these interfaces will not have specifically selected content items; they might have the top 3 / latest 3 from different sections.

If you have such a scenario, you have to look at two things:

a) Cache. Consider caching different sections of the page or the page itself.

b) Denormalization: An ideal content repository is normalized. That means that the same information appears only once. So if we have tagged content items under specific categories, the tagging information resides with the content. This requires that for surfing each tag we have to go through the entire content repository (or an index spanning the entire content repository). Normalization also means that we break the information related to an asset into different relational tables based on the logical entities. For instance, if we have an image and an article content type, and each content type has some common attributes like create and publish info, then we will have three “tables”: BaseContent, ArticleContent and ImageContent. So to get the information to show on the page, I am not just scanning one index; I am scanning multiple tables, comparing different parameters and finally joining them.

This can get inefficient and slow. It is overcome by denormalizing, and there are two ways to do it. One way is to duplicate the data. For example, whenever a content item gets created or modified, we look at it and put its reference into separate database tables for each of its tags. Even here we may choose to have two tables: all content with the tag “Performance” and all published content with the tag “Performance”. The moment we do that, the processing needed to find the relevant articles reduces.

The second way would be to aggregate all the data which is required for article selection and for the summary display, and keep it in the same table - either the existing BaseContent or ArticleContent table, or a third cache table.

Denormalization increases the application complexity, as the CRUD operations now need to update multiple places.
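The first denormalization approach (a separate per-tag table of published content, maintained on every write) can be sketched like this. It is an in-memory Python illustration with hypothetical names (`ContentStore`, `published_by_tag`), not a schema from any particular CMS; in a real system the two dictionaries would be database tables.

```python
from collections import defaultdict


class ContentStore:
    def __init__(self):
        self.items = {}                           # normalized: id -> record
        self.published_by_tag = defaultdict(set)  # denormalized: tag -> ids

    def save(self, item_id, title, tags, published):
        """Create or update an item and keep the per-tag index in step."""
        old = self.items.get(item_id)
        if old:  # drop stale index entries before re-adding
            for tag in old["tags"]:
                self.published_by_tag[tag].discard(item_id)
        self.items[item_id] = {"title": title, "tags": set(tags), "published": published}
        if published:
            for tag in tags:
                self.published_by_tag[tag].add(item_id)

    def titles_for_tag(self, tag):
        """Read path: one index lookup, no scan of the whole repository."""
        return sorted(self.items[i]["title"] for i in self.published_by_tag[tag])
```

The extra bookkeeping inside `save` is exactly the added CRUD complexity mentioned above: reads get cheaper because writes do more work.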

2) Live data from interfaced applications.

Sometimes, in content delivery applications, we show live data from external sources, with data fetched in real time. Whenever this happens, we create a dependency on the response time of the external applications.

The most common methods used to overcome this are to either cache the runtime data, or do a nightly import of the third party data into a local database.
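Caching the runtime data can be as simple as a time-to-live wrapper around the external call. The sketch below is a minimal Python illustration; `TtlCache` and its `fetch` callback are hypothetical names, and the injectable `clock` parameter exists only so the behaviour can be exercised without waiting.

```python
import time


class TtlCache:
    """Serve external data from memory until it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # slow call to the external application
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry and entry[0] > now:
            return entry[1]          # still fresh: no external round trip
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is staleness: with a 60 second TTL a page may show data up to a minute old, which is usually acceptable for weather but may not be for stock prices.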

Apart from that, live data sometimes requires a lot of CPU cycles to convert the data stream into usable objects and ultimately into presentable HTML. One should look for opportunities to reduce the data from the source to what is required, and to choose formats which provide maximum efficiency in conversion.

3) XSLT processing and XML

Since XML became mainstream some six years back, many architects have given in to its elegance. They pass the data as XML, and the presentation layer uses XSLT to convert it to the desired HTML formats. Unfortunately, XSL transformation is a very expensive process and very difficult to performance-tune. XML navigation and deserialization itself is quite inefficient. If you look at it under a microscope: if you are using XPath, the XPath string has to be compiled at runtime to come up with the code for navigating the XML itself.

So don't give in to XML for internal processing until you see real benefits.

4) Retrieving the content itself

Your choice of content repository (database, file system or mix) and the means of accessing it have a huge implication on the efficiency and speed of rendering. Be careful while choosing non-native connectors to the repository. If you are writing connectors yourself, tune them to eternity. Consider static publishing in the most efficient format (like copying images to the /images folder as files, or putting story-based content in the format in which it will be consumed, i.e. HTML snippets or database rows).

5) Single Repository for content production and delivery

Some installations use the same installation for content production as well as content delivery. The content delivery part will typically have seasonality in use. This results in content production getting slowed down when content delivery is near its peak load. Some way of segregation, or throttling of peak load, helps.

6) Heavy background jobs

Most CMS systems need to do heavy batch processing - be it creation of thumbnails, indexing of content, processing reminders and alerts, etc.

7) Auto refreshes and alerts.

In some really dynamic CMSs or portals, some part of the delivery page needs to be continuously updated, be it content like breaking news, alerts like a new item added to your task list, or information from third party systems like current stock prices or weather information.

Traditionally applications refresh the entire page for that. We should consider refreshing only parts of the page using Ajax and Ajax-like technologies.

8) Live data set is too large.

Any CMS system will typically have a mix of long shelf life content and short shelf life content. This leads to a tendency to leave too much content in the live data store. Alan keeps stressing the importance of data purging, and I couldn't agree more.

As a rule of thumb, a 10-fold increase in data size makes your system twice as slow. But this is only query processing. It also makes selecting the useful content hard, and it makes finding outdated documents and accidentally linking to them more probable.
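Taken literally, the rule of thumb says that every factor of 10 in data size doubles the query time. The small Python helper below merely extrapolates that rule to other growth factors; the function name is made up, and the formula is only as good as the rule of thumb itself, not a measured model.

```python
import math


def slowdown_factor(growth):
    """Slowdown implied by 'every 10x in data doubles query time'."""
    return 2 ** math.log10(growth)
```

Under this assumption a 100-fold increase in live data would make queries about 4 times slower, and purging 90% of the live data would roughly halve query time.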

In short, give great stress to data expiry and segregate expired data from live data. It is usually easier to implement if you give an option to import from the archive repository to the live repository.

In a typical implementation, we will default a content expiry date unless and until the user changes the same.

The expiry algorithm also needs to check on content use, expire unused content and give a warning for used content which is supposed to have been expired. This is indeed complex, but worth the money.
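The expiry rules described above can be sketched roughly as below. This is a Python illustration under stated assumptions: the 90-day default shelf life, the use of an inbound link count as the measure of "content use", and the names `default_expiry` and `expiry_action` are all hypothetical, not taken from any specific CMS.

```python
from datetime import date, timedelta

DEFAULT_SHELF_LIFE = timedelta(days=90)  # assumed default, not from any product


def default_expiry(created, user_choice=None):
    """Every item gets an expiry date unless the user sets one explicitly."""
    return user_choice or created + DEFAULT_SHELF_LIFE


def expiry_action(expires_on, today, inbound_links):
    """Decide what the nightly expiry job should do with one content item."""
    if today < expires_on:
        return "keep"
    if inbound_links == 0:
        return "archive"  # expired and unused: move out of the live store
    return "warn"         # expired but still linked: ask the owner to review
```

Keeping the decision in one small function makes it easy to change the policy later, for example to archive unused content even before its expiry date.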

Here is how the following issues play out in different applications vs CMS systems

1) Characterset and Encoding selection at UI, application and data tiers

I mention all of them seperately as the choice traditionally has not been obvious. Lets look at the responsibilities of each of them.

User Interface:

The user should have fonts to render the given characterset, and keyboard setting to type them. (i.e. the browser should support them). The applications targeted for users on Mac OS X, and Windows 2000+ can assume the users to have unicode fonts but the others can not.  The others typically require the user to have installed a font for unicode characterset ( may not be pan-unicode but atleast the given language).

It is likely that the font we want to use doesnot come in Unicode Characterset, which forces either a different font, or a different characterset.

The next big issue is search. You want to make sure that the serach engines can index and find your site, and hence you need to use an encoding which the local search engine can support. For instance, google supports unicode but not all search engines do.

Also if the application delivers email, the adoption of Unicode has been slow, especially Japanese. Some of the traditional email utilities donot handle unicode, though the situation is changing fast with operating systems in last 5 years supporting unicode natively.

Are there other UI’s apart from browser, like smartphone and pocketPC (the mobile devices including smartphones and pocketPC- post 2000 support unicode, but it is still common to find applications used on these devices which do not support unicode.)

For a typical CMS system, this becomes very important. You need to know where you are delivering content to, and whether automatic, lossless character set translation is possible from the format in the repository to the format on the UI. Only then can you decide which encoding to use.

If you want to use Quark or MS Word either for input or output, you need to be able to support or convert the extra characters they use - like “Smart Quotes”. These translations might involve some “loss”.
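To make the “loss” concrete, here is a small sketch in Python. The fallback table and the two helper functions are illustrative assumptions, and the table is far from exhaustive; the point is only that curly quotes exist in Windows-1252 but not in ISO 8859-1, so converting to the latter forces a substitution.

```python
# Illustrative (not exhaustive) map from "smart" punctuation to plain ASCII.
ASCII_FALLBACKS = {
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2013": "-",    # en dash
    "\u2026": "...",  # horizontal ellipsis
}


def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def to_lossy(text: str, encoding: str) -> bytes:
    """Substitute known characters, then replace anything left with '?'."""
    table = str.maketrans(ASCII_FALLBACKS)
    return text.translate(table).encode(encoding, errors="replace")


pasted = "\u201cSmart Quotes\u201d"
print(can_encode(pasted, "cp1252"))      # Windows-1252 has curly quotes
print(can_encode(pasted, "iso-8859-1"))  # ISO 8859-1 does not
print(to_lossy(pasted, "iso-8859-1"))    # straight quotes: readable, but lossy
```

After the lossy conversion there is no way to tell which straight quotes were originally curly, so the round trip back to the word processor will not reproduce the original document.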

Application Encoding and Character Set:

Applications need to emit formatted strings in the languages they support - e.g. a date in DD-MON-YYYY format. They may also need to sort alphabetically, search text, count characters and words, and compare strings. All of these require the application to support the given character set and encoding for the given language. You might also have hardcoded strings, and you might want to keep your JSP/ASP.NET pages in the encoding you want to render, to support the same.
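A short Python sketch shows why counting and sorting depend on character set support. The `crude_sort_key` helper is an assumption for illustration only: it strips accents and case, which is a rough stand-in for real language-specific collation rules, not a replacement for them.

```python
import unicodedata


def crude_sort_key(word: str) -> str:
    """Rough approximation of alphabetic order: drop accents, ignore case.

    Real collation differs per language; this is only a sketch.
    """
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


word = "na\u00efve"                # "naive" written with i-diaeresis
char_count = len(word)             # counts characters
byte_count = len(word.encode("utf-8"))  # counts bytes, which is a different number

words = ["zebra", "\u00c4pfel", "apple"]  # second word starts with A-umlaut
by_codepoint = sorted(words)               # raw code point order
by_alphabet = sorted(words, key=crude_sort_key)

print(char_count, byte_count)
print(by_codepoint)
print(by_alphabet)
```

An application that counts bytes instead of characters, or sorts by raw code points, will misreport field lengths and file accented words after "z".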

Database Encoding and Character Set:

Content encoding in the database is perhaps the most debated point. Choosing the encoding affects storage space, the number of characters which can be stored in typical text fields (not the LONG and CLOB fields, which require additional programming to fetch), and the processing needed to store and interpret data, including charset and encoding conversion. The database is typically the most expensive piece in the architecture of a CMS as well.

Typically people advocate choosing the encoding based on the language used for 80% of the stored content. If 80% or more of the content is English, then UTF-8 saves space, while requiring extra processing in some cases. If most content uses characters outside US-ASCII, then UTF-16 might be a better bet, though it still wastes space for the English characters. If it is anything in between, then the cost-benefit in either case is not clear. Chances are that if you want to support Western European languages only, then ISO 8859-1 might turn out to be optimal.
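The storage trade-off is easy to measure. The sketch below compares the encoded size of three sample strings; the samples and the `encoded_size` helper are assumptions chosen for illustration, and `utf-16-le` is used so that the byte order mark does not distort the counts.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


english = "content management"  # 18 ASCII characters
japanese = "コンテンツ管理"        # 7 characters, all outside ASCII
german = "Größe"                # 5 characters, 2 of them outside ASCII

sizes = {
    name: (encoded_size(text, "utf-8"), encoded_size(text, "utf-16-le"))
    for name, text in [("english", english), ("japanese", japanese), ("german", german)]
}

for name, (utf8, utf16) in sizes.items():
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")

# ISO 8859-1 stores Western European text at one byte per character.
print("german in ISO 8859-1:", encoded_size(german, "iso-8859-1"), "bytes")
```

For the English sample UTF-8 needs half the space of UTF-16, for the Japanese sample UTF-16 is the smaller one, and for the German sample ISO 8859-1 beats both, which is the pattern the 80% rule of thumb tries to capture.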

2) Timezone, Date, Time

Internationalized applications typically need to support multiple timezones, either on a single installation, or as one timezone per installation. When supporting multiple timezones, the application needs to choose a default timezone for the user and offer a way to change it. In typical applications, we see this supported for end users, while in a CMS we worry more about the content creation side, and on content delivery we really do not care. We would typically show a date/time based on the perceived location of the site instead of the user's location. Of course there are exceptions - like calendaring, schedules, car pickup times etc., which depend on user or event timezones - and they become important accordingly.
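A minimal sketch of the "site default unless the user overrides it" rule follows. It assumes timestamps are stored in UTC and uses fixed offsets to stay self-contained; a real system would use named zones so that daylight saving changes are handled. `SITE_TZ`, `USER_TZ` and `display_time` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fixed-offset zones keep the sketch self-contained (no zone database needed).
SITE_TZ = timezone(timedelta(hours=5, minutes=30), "IST")  # assumed site location
USER_TZ = timezone(timedelta(hours=-5), "EST")             # assumed user preference


def display_time(stored_utc: datetime, user_tz: Optional[timezone] = None) -> datetime:
    """Convert a UTC timestamp to the user's zone, or to the site default."""
    target = SITE_TZ if user_tz is None else user_tz
    return stored_utc.astimezone(target)


published = datetime(2008, 1, 15, 12, 0, tzinfo=timezone.utc)
print(display_time(published).strftime("%d-%b-%Y %H:%M %Z"))           # site default
print(display_time(published, USER_TZ).strftime("%d-%b-%Y %H:%M %Z"))  # user override
```

Content delivery on a CMS site would normally call `display_time` without a user zone, while a calendaring feature would pass the user's or the event's zone explicitly.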

3) Currency

Most applications need to handle money and currency explicitly. In the case of CMS systems, currency is usually a translation issue and not an application issue.

4) Dimensions, speeds and other measurements

In a typical application it becomes important to handle dimensions in the measurement system in use (FPS or metric). In CMS systems, only graphics may be dimension aware; for the rest of the content, this is again addressed as a translation issue.

Currency and dimensions as a translation issue: In a CMS system, the numbers inherently form a part of the body of the content, and are either hard to identify and separate, or not intended to be separated. They typically require translation even within the same language. For instance, a story on speed records needs to say 300 MPH in Britain, while it needs to say 500 KMPH in India. A statement like “an area the size of Luxembourg” for European consumers needs to be translated to something like “an area the size of Lake Iliamna” for Alaskan readers, or “3 times the size of New York City” for mainland American consumers. Similarly, a sum of 6 billion kroner needs to be translated to “half a billion pounds sterling” or “500 M” depending on the space available.
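The speed record example shows why this is translation rather than conversion. In the sketch below, the arithmetic gives an exact figure, and a separate editorial rounding step produces the number a writer would actually print. The `editorial_round` helper and the choice of a step of 50 are assumptions for illustration.

```python
KM_PER_MILE = 1.609344  # exact by definition of the international mile


def mph_to_kmph(mph: float) -> float:
    """Plain unit conversion, as an application would do it."""
    return mph * KM_PER_MILE


def editorial_round(value: float, step: int) -> int:
    """Round to the nearest multiple of step, as an editor might for a headline."""
    return int(round(value / step) * step)


exact = mph_to_kmph(300)
headline = editorial_round(exact, 50)
print(f"exact: {exact:.1f} km/h, headline: {headline} km/h")
```

How coarse the rounding should be depends on the story and the audience, which is a judgment a translator makes and a conversion routine cannot.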

5) Labels and Navigation:

In a typical application, there is a 1 to 1 mapping between the different languages. Hence the labels, messages and navigation are translated 1 to 1. These are typically handled as “resource bundles”, which are text files - with, let's say, number 149 meaning “State” in the US English bundle, “County” in the UK English bundle, or “Province” in the Italian bundle.

Sometimes it is not trivial and requires a change in the template - for instance, the address format differs between countries.

CMS-based sites may have some more differences, like a different set of content, or only a subset of translated content being available. This drives the need for an option to add or remove navigation items per site.
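The bundle lookup itself is simple, as this sketch shows. The in-memory `BUNDLES` dictionary stands in for the text files, and the locale codes and the fallback rule are assumptions for illustration; the label values for key 149 are the ones from the example above.

```python
# In-memory stand-in for per-locale resource bundle files.
BUNDLES = {
    "en_US": {149: "State"},
    "en_GB": {149: "County"},
    "it_IT": {149: "Province"},
}

FALLBACK_LOCALE = "en_US"  # assumed base language of the application


def label(key: int, locale: str) -> str:
    """Look up a label, falling back to the base locale if it is missing."""
    bundle = BUNDLES.get(locale, {})
    if key in bundle:
        return bundle[key]
    return BUNDLES[FALLBACK_LOCALE][key]


print(label(149, "en_GB"))  # label from the UK English bundle
print(label(149, "fr_FR"))  # no French bundle here, so the base label is used
```

The page template only ever refers to key 149, so adding a language means adding a bundle, not touching the template.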

6) Translation workflows 

Typically applications do not have a requirement for a translation workflow, while CMS-driven global sites may have one. It is not a simple workflow. Every item created in the base language needs to state whether it is relevant for global markets, or for which specific other markets. These markets need to decide if they indeed want to carry the content, and if so, they need to send it for translation, decline the content, or accept it as is. The translation itself needs to be previewed, and edited if required, before it makes it to the site.
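One way to picture this workflow is as a small state machine that each market runs for each offered item. The state and action names below are hypothetical; only the overall flow (offer, then decline, accept as is, or translate, preview and edit, then publish) comes from the description above.

```python
from enum import Enum


class State(Enum):
    OFFERED = "offered"              # base-language item flagged for this market
    DECLINED = "declined"
    ACCEPTED_AS_IS = "accepted as is"
    IN_TRANSLATION = "in translation"
    IN_REVIEW = "in review"          # translation being previewed
    PUBLISHED = "published"


# Allowed (state, action) pairs and the state they lead to.
TRANSITIONS = {
    (State.OFFERED, "decline"): State.DECLINED,
    (State.OFFERED, "accept_as_is"): State.ACCEPTED_AS_IS,
    (State.OFFERED, "send_for_translation"): State.IN_TRANSLATION,
    (State.IN_TRANSLATION, "submit"): State.IN_REVIEW,
    (State.IN_REVIEW, "request_edits"): State.IN_TRANSLATION,
    (State.IN_REVIEW, "approve"): State.PUBLISHED,
    (State.ACCEPTED_AS_IS, "approve"): State.PUBLISHED,
}


def advance(state: State, action: str) -> State:
    """Apply one workflow action, rejecting anything the workflow does not allow."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} an item that is {state.value}") from None
```

Because the state is tracked per market, the same base-language item can be published in one market, still in review in another, and declined in a third.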

7) Spell checks:

Spell check, though easily available for some languages, poses special challenges for certain other languages (like Swedish, Arabic etc.) where words can be compounded or vowels may be dropped, and the result may or may not be correct depending on the context. Some CMS systems may offer a short-lived differentiator through innovations here.

8) Phonetic keyboards

While it is a given that content authors will use the keyboard of the language they are writing in, the same is not necessarily true for end users. For instance in India, users like myself can read and write Hindi but can't type it. We just haven't ever used the keyboard. So for users like myself, message boards and comment boxes provide an option of phonetic keyboards - where I type using an English keyboard and it picks up the correct set of Hindi letters.

Language Aware and Language Tolerant modes for internationalization:

Depending on your requirements, you can get away with “Language Tolerant” applications. Language Tolerant applications don't care what you are putting in them, as long as it is in the correct encoding. They just treat everything like an image, where the person uploading the image needs to worry about the size, palette, etc. As a result, they cannot offer text sorting, spell check etc. for some languages, while they can for others.

Most CMS systems, programming languages, databases etc. are aware of a few languages and merely tolerant of others. It requires care to ensure that the required features are provided, either directly or by integrating with an appropriate word processor or other external system to substitute for the missing capability.

On a further note, let me give a small warning on the cost of internationalization. Internationalization has two parts - globalization and localization. Globalization means making the core application tolerant of the different languages, currencies and date formats it needs to support. Typically a ground-up application will require about 30% extra effort for this.

Localization means taking a globalized application and providing a language-aware interface for a given language. This could be either trivial or very expensive, depending on the native support available from the core platforms and on the desired features. It could drive you to purchase extra software licenses, spend more money on integration, make more UI changes like local address formatting, and spend a lot more money on local validations - like zip code checkers, phone number format verification etc. This is the part which is difficult to estimate and lacks a benchmark number.
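To give a feel for why local validations add up, here is a sketch of a per-country postal code check. The two patterns are simplified assumptions (five digits with an optional four-digit suffix for the US, six digits not starting with zero for India) and check format only, not whether the code actually exists.

```python
import re
from typing import Optional

# Simplified format rules; real validation would also check against postal data.
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "IN": re.compile(r"[1-9]\d{5}"),
}


def is_valid_postal_code(code: str, country: str) -> Optional[bool]:
    """Return True/False for known countries, None when no rule is configured."""
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        return None  # not localized yet: the caller decides whether to accept
    return pattern.fullmatch(code.strip()) is not None
```

Every new market adds another entry to a table like this one, plus the equivalent for phone numbers and address layout, which is where the hard-to-estimate effort goes.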