That we need to identify a user across a session is a given for most web apps. Sessions typically introduce state, binding users to particular web servers. However, such binding causes problems: users are inconvenienced by having to log in again when a server fails, and it is hard to distribute load and sessions evenly. Though these issues may not sound important, their visibility challenges architects either to convince the CIOs or to overcome them. Technology and vendors have been obliging, but does that help?

Session failover has been around in J2EE app servers for a while. Session failover has been available on the Microsoft platform since .NET was born. DB based sessions and client based sessions (using either hidden variables or hidden frames) have always been present. But are they used?

A quick survey reveals that app server based or IIS based session failover is used by architects to answer the tough questions from CIOs, but ultimately it is not used in production because of the performance degradation it causes. Depending on the session size, enabling session clustering can slow down your system by anywhere from about 30% to as much as 500%! So if you can limit the session size to < 10 KB, or if resources are not an issue, you can venture there.

DB based sessions are quite heavy but slightly less expensive. The problem with these is that clustered DB servers are expensive. So you want to minimize the session size, both to increase performance and to keep the cost low by having a smaller Session DB server.

Similarly, client based sessions impose a two-way transfer of session data on every request, which in practice limits them to the same range (< 10 KB).
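To make the 10 KB figure something you can check rather than guess, here is a minimal sketch that measures a session once it is serialized. It assumes the session is a plain dictionary and that pickle is a fair stand-in for whatever serializer your platform uses; the names `SESSION_BUDGET_BYTES`, `session_size_bytes` and `fits_budget` are mine, purely for illustration.

```python
import pickle

# Assumed budget, taken from the rough 10 KB rule of thumb above.
SESSION_BUDGET_BYTES = 10 * 1024


def session_size_bytes(session: dict) -> int:
    """Return the size of the session once serialized for transfer or storage."""
    return len(pickle.dumps(session, protocol=pickle.HIGHEST_PROTOCOL))


def fits_budget(session: dict, budget: int = SESSION_BUDGET_BYTES) -> bool:
    """True when the serialized session stays within the budget."""
    return session_size_bytes(session) <= budget


# A session holding only identifiers is small ...
small = {"user_id": 42, "locale": "en_GB"}

# ... while one that carries a full result set is not.
large = {"user_id": 42, "results": ["hotel-%d" % i for i in range(5000)]}
```

Real serializers differ in overhead, so treat the numbers as an indication only.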

So what if you have bigger data?

1) You can consider a cluster-aware cache. Some statistics indicate that these are cheaper than the combined software and hardware licenses for a DB.

2) You can consider a cluster-unaware cache, but you need to make sure that all the information required to “re-build” the session data is available either in the clustered session or with the client.

To take an example, one of the applications I was involved in required shopping for a hotel. In this case, we kept the criteria which led to finding the hotel with the client. The hotel results were kept in a cluster-unaware cache. The load balancers were kept sticky. So on a user request, if the data is available in the session, the user gets the data; otherwise the query is fired at the source systems again.
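The pattern can be sketched roughly as follows. This is not the code from that project: `LocalResultCache`, `fake_source` and the shape of the criteria tuple are hypothetical names I am using to show the idea that the client carries the criteria and any server can rebuild the results from them.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical search criteria kept with the client: (city, check-in date, nights).
Criteria = Tuple[str, str, int]


class LocalResultCache:
    """Per-server (cluster-unaware) cache that can always be rebuilt from the criteria."""

    def __init__(self, search_source: Callable[[Criteria], List[str]]):
        self._search_source = search_source
        self._results: Dict[Criteria, List[str]] = {}
        self.source_calls = 0  # counts how often the source systems are queried

    def get_results(self, criteria: Criteria) -> List[str]:
        if criteria not in self._results:
            # Cache miss: fire the query at the source systems again.
            self.source_calls += 1
            self._results[criteria] = self._search_source(criteria)
        return self._results[criteria]


def fake_source(criteria: Criteria) -> List[str]:
    """Stand-in for the real source systems."""
    city, _, _ = criteria
    return [f"{city} Hotel {n}" for n in (1, 2, 3)]


criteria = ("Oslo", "2007-01-15", 2)   # travels with the client on every request

server_a = LocalResultCache(fake_source)
first = server_a.get_results(criteria)    # miss, queries the source
again = server_a.get_results(criteria)    # hit, served from the local cache

server_b = LocalResultCache(fake_source)  # after failover: a server with an empty cache
rebuilt = server_b.get_results(criteria)  # rebuilt from the client-held criteria
```

The user never has to repeat the search; the only cost of a failover is one extra trip to the source systems.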

Having said that, my quick survey tells me that DB based sessions are the most popular choice for applications where you do want to use clustered sessions in production (rather than app server or IIS based sessions), even though they require additional code.

It is difficult to come across a CMS implementation where the business owners do not complain about the hardware. Either the system is too slow, or the CMS requires too much hardware. The seasonal nature of application usage does not make it any easier.

Fortunately, performance tuning is rapidly becoming a predictable science rather than the witchcraft and art it was perceived to be earlier.

Here are the top reasons why a typical CMS Based site could be slow:

1) Complex, dashboard style content delivery pages:

If the application has complex dashboard style pages showing different content items from different “sections” of the content repository, chances are that it will require too many queries and will be too heavy on the database. Typically these interfaces do not show specifically selected content items; they might show the top 3 or latest 3 from different sections.

If you have such a scenario, you have to look at two things:

a) Cache. Consider caching different sections of the page or the page itself.

b) Denormalization: An ideal content repository is normalized, which means that the same information appears only once. So if we have tagged content items under specific categories, the tagging information resides with the content. This means that to browse by tag, we have to go through the entire content repository (or an index spanning the entire content repository). Normalization also means that we break the information related to an asset into different relational tables based on the logical entities. For instance, if we have an image and an article content type, and each content type has some common attributes like create and publish info, then we will have three “tables”: BaseContent, ArticleContent and ImageContent. So to get the information to show on the page, I am not just scanning one index; I am scanning multiple tables, comparing different parameters and finally joining them.

This can get inefficient and slow, and it is overcome by denormalizing. There are two ways to do this. One way is to duplicate the data. For example, whenever a content item is created or modified, we look at its tags and put a reference to it in a separate database table for each of those tags. Even here we may choose to have two tables: all content with the tag “Performance”, and all published content with the tag “Performance”. The moment we do that, the processing needed to find the relevant articles reduces.

The second way would be to aggregate all the data required for article selection and for the summary display, and keep it in the same table: either the existing BaseContent or ArticleContent table, or a third cache table.

Denormalization increases the application complexity, as the CRUD operations now need to update multiple places.
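Here is a small sketch of the first approach and of the extra work it pushes into the write path. It uses an in-memory SQLite database only so that it runs anywhere; the table names (`base_content`, `content_tag`, `published_by_tag`) and the `save_content` helper are assumptions of mine, not the schema of any particular CMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized tables: each fact is stored once.
    CREATE TABLE base_content (id INTEGER PRIMARY KEY, title TEXT, published INTEGER);
    CREATE TABLE content_tag (content_id INTEGER, tag TEXT);
    -- Denormalized lookup table: one row per tag and published item,
    -- carrying the title so a listing page needs no join.
    CREATE TABLE published_by_tag (tag TEXT, content_id INTEGER, title TEXT);
""")


def save_content(content_id, title, tags, published):
    """Create or update an item and keep the denormalized table in step."""
    with conn:  # one transaction, so both copies change together
        conn.execute("INSERT OR REPLACE INTO base_content VALUES (?, ?, ?)",
                     (content_id, title, int(published)))
        conn.execute("DELETE FROM content_tag WHERE content_id = ?", (content_id,))
        conn.executemany("INSERT INTO content_tag VALUES (?, ?)",
                         [(content_id, tag) for tag in tags])
        # The extra write that denormalization costs us.
        conn.execute("DELETE FROM published_by_tag WHERE content_id = ?", (content_id,))
        if published:
            conn.executemany("INSERT INTO published_by_tag VALUES (?, ?, ?)",
                             [(tag, content_id, title) for tag in tags])


def published_titles(tag):
    """Read path: a single-table lookup instead of a multi-table join."""
    rows = conn.execute(
        "SELECT title FROM published_by_tag WHERE tag = ? ORDER BY content_id", (tag,))
    return [row[0] for row in rows]


save_content(1, "Tuning a CMS", ["performance"], published=True)
save_content(2, "Draft notes", ["performance"], published=False)
save_content(3, "Cache basics", ["performance", "cache"], published=True)
```

The read becomes trivial, but every create, update and unpublish now has to touch the lookup table as well, which is exactly the complexity mentioned above.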

2) Live data from interfaced applications.

Sometimes, in content delivery applications, we show live data from external sources, fetched in real time. Whenever this happens, we create a dependency on the response time of the external applications.

The most common methods used to overcome this are either to cache the runtime data or to do a nightly import of the third party data into a local database.

Apart from that, live data sometimes requires a lot of CPU cycles to convert the data stream into usable objects and ultimately into presentable HTML. One should look for opportunities to reduce the data from the source to what is required, and to choose formats which provide maximum efficiency in conversion.
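Caching the runtime data can be as simple as remembering each answer for a fixed time. The sketch below is one way to do it; `TTLCache` and `slow_quote_service` are hypothetical names, and the clock is injectable only so that the behaviour can be shown without actually waiting.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Remember each fetched value for ttl_seconds, then fetch it again."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # still fresh: no external call
        value = self._fetch(key)       # missing or stale: call the external system
        self._entries[key] = (now, value)
        return value


calls = []


def slow_quote_service(symbol: str) -> str:
    """Stand-in for a slow third party system."""
    calls.append(symbol)
    return f"{symbol}:{len(calls)}"


now = [0.0]  # fake clock for the demonstration
cache = TTLCache(slow_quote_service, ttl_seconds=60, clock=lambda: now[0])

a = cache.get("ACME")   # miss: goes to the external system
b = cache.get("ACME")   # hit: answered locally
now[0] = 61.0           # a little over a minute later
c = cache.get("ACME")   # expired: goes out again
```

The time-to-live is the knob: the longer it is, the less you depend on the external response time, and the staler the data you may show.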

3) XSLT processing and XML

Since XML became mainstream some six years back, many architects have given in to its elegance. They pass the data as XML, and the presentation layer uses XSLT to convert it to the desired HTML formats. Unfortunately, XSL transformation is a very expensive process and very difficult to performance tune. XML navigation and deserialization are themselves quite inefficient. If you look at it under a microscope, if you are using XPath, the XPath string has to be compiled at runtime to come up with the code for navigating the XML itself.

So don't give in to XML for internal processing until you see real benefits.

4) Retrieving the content itself

Your choice of content repository (database, file system or a mix) and the means of accessing it have a huge implication on the efficiency and speed of rendering. Be careful while choosing non-native connectors to the repository. If you are writing the connectors yourself, tune them to eternity. Static publishing in the most efficient format helps: for example, copying images to the /images folder as files, or putting story based content in the format in which it will be consumed, i.e. as HTML snippets or as database rows.

5) Single Repository for content production and delivery

Some installations use the same installation for content production as well as content delivery. The content delivery part will typically have seasonality in its use. This results in content production slowing down when content delivery is near its peak load. Some way of segregating the two, or of throttling the peak load, helps.

6) Heavy background jobs

Most CMS systems need to do heavy batch processing, be it creation of thumbnails, indexing of content, processing of reminders and alerts, etc.

7) Auto refreshes and alerts.

In some really dynamic CMS sites or portals, some part of the delivery page needs to be continuously updated, be it content like breaking news, alerts like a new item added to your task list, or information from third party systems like current stock prices or weather information.

Traditionally, applications refresh the entire page for that. We should consider refreshing only parts of the page using Ajax and Ajax-like technologies.

8) Live data set is too large.

Any CMS system will typically have a mix of long shelf life content and short shelf life content. This leads to a tendency to leave too much content in the live data store. Alan keeps stressing the importance of data purging, and I couldn't agree more.

As a rule of thumb, a 10 fold increase in data size makes your system twice as slow. But this covers only query processing. A larger data set also makes selecting the useful content hard, and it increases the chances of finding outdated documents and accidentally linking to them.

In short, give great stress to data expiry and segregate expired data from live data. This is usually easier to implement if you give an option to import from the archive repository into the live repository.
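To get a feel for what the rule of thumb implies, here is a tiny sketch. It assumes the rule compounds, so that every further 10 fold growth doubles the time again; that assumption is mine, and `slowdown_factor` is just an illustrative name.

```python
import math


def slowdown_factor(growth: float) -> float:
    """Slowdown implied by 'ten times the data, twice as slow', applied repeatedly."""
    return 2 ** math.log10(growth)


# Example: how much slower the same queries get as the live store grows.
for growth in (10, 100, 1000):
    print(f"{growth}x the data -> about {slowdown_factor(growth):.1f}x slower")
```

Read the other way round, purging 90% of the live data would roughly halve the query time under the same assumption.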

In a typical implementation, we will set a default content expiry date, which applies unless and until the user changes it.

The expiry algorithm also needs to check on content use, expire unused content and give a warning for used content which is supposed to have been expired. This is indeed complex, but worth the money.
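One reading of that expiry check, as a sketch: items past their expiry date are archived if nothing uses them, and flagged with a warning if something still does. The `ContentItem` fields and the `review_expiry` function are hypothetical; a real CMS would have to work out the usage count from its own link data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class ContentItem:
    item_id: str
    expires_on: date
    referenced_by: int  # assumed: number of live pages that still link to the item


def review_expiry(items: List[ContentItem], today: date) -> Dict[str, List[str]]:
    """Sort items into keep / archive / warn buckets."""
    decisions: Dict[str, List[str]] = {"keep": [], "archive": [], "warn": []}
    for item in items:
        if item.expires_on > today:
            decisions["keep"].append(item.item_id)     # not due yet
        elif item.referenced_by == 0:
            decisions["archive"].append(item.item_id)  # due and unused: move out of live data
        else:
            decisions["warn"].append(item.item_id)     # due but still in use: tell someone
    return decisions


items = [
    ContentItem("press-release-2005", date(2006, 6, 30), referenced_by=0),
    ContentItem("old-price-list", date(2006, 6, 30), referenced_by=4),
    ContentItem("about-us", date(2010, 1, 1), referenced_by=12),
]
result = review_expiry(items, today=date(2007, 1, 1))
```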

Here is how the following issues play out in typical applications versus CMS systems:

1) Character set and encoding selection at the UI, application and data tiers

I mention all of them separately, as the choice traditionally has not been obvious. Let's look at the responsibilities of each of them.

User Interface:

The user should have fonts to render the given character set, and a keyboard setting to type it (i.e. the browser should support them). Applications targeted at users on Mac OS X and Windows 2000+ can assume that the users have Unicode fonts, but the others cannot. The others typically require the user to have installed a font for the Unicode character set (it may not be pan-Unicode, but it should at least cover the given language).

It is likely that the font we want to use does not come in the Unicode character set, which forces either a different font or a different character set.

The next big issue is search. You want to make sure that the search engines can index and find your site, and hence you need to use an encoding which the local search engine can support. For instance, Google supports Unicode, but not all search engines do.

Also, if the application delivers email, note that the adoption of Unicode there has been slow, especially for Japanese. Some of the traditional email utilities do not handle Unicode, though the situation is changing fast, with the operating systems of the last 5 years supporting Unicode natively.

Are there other UIs apart from the browser, like smartphones and Pocket PCs? Mobile devices made after 2000, including smartphones and Pocket PCs, support Unicode, but it is still common to find applications used on these devices which do not.

For a typical CMS system, this becomes very important. You need to know where you are delivering content to, and you need to know whether automatic, lossless character set translation is possible from the format in the repository to the format on the UI. Only then can you decide which encoding to use.

If you want to use Quark or MS Word for either input or output, you need to be able to support or convert the extra characters they use, like “Smart Quotes”. These translations might involve some “loss”.
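A short sketch of what such a conversion and its loss look like. The mapping covers only a handful of common word processor characters and is an illustration, not a complete list; `to_plain_punctuation` and `fits_charset` are names I made up for the example.

```python
# Typographic characters that word processors commonly insert.
SMART_PUNCTUATION = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
}
_TABLE = str.maketrans(SMART_PUNCTUATION)


def to_plain_punctuation(text: str) -> str:
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(_TABLE)


def fits_charset(text: str, encoding: str) -> bool:
    """True if every character of the text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


original = "\u201cSmart\u201d quotes\u2026"
converted = to_plain_punctuation(original)
```

After the conversion the text fits ISO 8859-1, but the opening and closing quotes have become the same character, so the original cannot be recovered. That is the “loss”.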

Application Encoding and Character Set:

The applications need to emit formatted strings in the languages they support, e.g. a date in DD-MON-YYYY format. They may also need to do an alphabetic sort. You may need to search text, count characters and words, or compare strings. All of these require the application to support the given character set and encoding for the given language. You might have hardcoded strings. You might also want to keep your JSP pages in the encoding in which you want to render them.

Database Encoding and Character Set:

Content encoding in the database is perhaps the most debated point. Choosing the correct encoding affects storage space, the number of characters which can be stored in typical text fields (as opposed to the Long and CLOB fields, which require additional programming to fetch), and the processing involved in storing, interpreting and converting the data. The database is typically the most expensive piece in the architecture of a CMS as well.

Typically people advocate choosing the encoding based on the language used for 80% of the stored content. If 80% or more of the content is English, then using UTF-8 saves space while requiring extra processing in some cases. If most content uses characters beyond US ASCII, then UTF-16 might be a better bet, though it still wastes space on the English characters. If it's anything in between, then the cost benefit in either case is not clear. Chances are that if you want to support Western European languages only, then ISO 8859-1 might turn out to be optimal.
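The space trade-off is easy to measure on a sample of your own content. The sketch below counts bytes for three short sample strings of my choosing; `encoded_sizes` is an illustrative helper, and "utf-16-le" is used so that the byte order mark does not distort such tiny samples.

```python
from typing import Dict, Optional


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Bytes needed to store the text in each encoding (None if it cannot be encoded)."""
    sizes: Dict[str, Optional[int]] = {}
    for name in ("utf-8", "utf-16-le", "iso-8859-1"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


english = "content management"                # 18 ASCII characters
french = "caf\u00e9"                          # 4 characters, one accented
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # "namaste" in Devanagari, 6 code points

english_sizes = encoded_sizes(english)
french_sizes = encoded_sizes(french)
hindi_sizes = encoded_sizes(hindi)
```

On these samples the English text doubles in size under UTF-16 and the Hindi text shrinks by a third, while the accented French word is still smallest in ISO 8859-1 and smaller in UTF-8 than in UTF-16. Measuring a representative slice of the real repository is worth doing before deciding.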

2) Timezone, Date, Time

Internationalized applications typically need to support multiple timezones, either on a single installation or with one timezone per installation. While supporting multiple timezones, the application needs to choose a default timezone for each user and offer them the option to change it. Typical applications do support this per-user setting, while in a CMS we worry more about the content creation side, and on content delivery we really do not care. We would typically show a date/time based on the perceived location of the site instead of the user's location. Of course there are exceptions, like calendaring, schedules, car pickup times etc., which are dependent on user or event timezones, and they become important accordingly.
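A minimal sketch of the usual arrangement: store one instant in UTC and convert it only when displaying. The two offsets below are assumptions for illustration (a site at UTC+1 and a reader at UTC+5:30), and fixed offsets are used to keep the example self-contained; a real application would use named timezones so that daylight saving is handled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical locations, expressed as fixed offsets from UTC.
SITE_TZ = timezone(timedelta(hours=1), "site")
USER_TZ = timezone(timedelta(hours=5, minutes=30), "user")


def display_time(stored_utc: datetime, tz: timezone) -> str:
    """Convert a stored UTC instant to the timezone chosen for display."""
    return stored_utc.astimezone(tz).strftime("%Y-%m-%d %H:%M")


# The repository stores one unambiguous instant.
published = datetime(2006, 12, 4, 22, 30, tzinfo=timezone.utc)

site_view = display_time(published, SITE_TZ)  # what a CMS delivery page would show
user_view = display_time(published, USER_TZ)  # what a per-user application would show
```

Note that the two views even disagree on the date, which is why the exceptions such as schedules and pickup times matter.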

3) Currency

Most applications need to handle money and currency specifically. In the case of CMS systems, currency is usually a translation issue and not an application issue.

4) Dimensions, speeds and other measurements

In a typical application it becomes important to handle dimensions in the system of units in use (FPS or metric). In CMS systems, only graphics may be dimension aware; for the rest of the content this is again addressed as a translation issue.

Currency and dimensions as a translation issue: In a CMS system, the numbers inherently form a part of the body of the content, and they are either hard to identify and separate or not intended to be separated. They typically require translation even within the same language. For instance, a story on speed records needs to say 300 MPH in Britain, while it needs to say 500 KMPH in India. A statement like “an area the size of Luxembourg” for European consumers needs to be translated to something like “an area the size of Lake Iliamna” for Alaskan readers, or “3 times the size of New York City” for mainland American consumers. Similarly, an amount of 6 billion kroner needs to be translated to “half a billion pounds sterling” or “500 M”, depending on the space available.

5) Labels and Navigation:

In a typical application, there is a 1 to 1 mapping between the different languages. Hence the labels, messages and navigation are translated 1 to 1. These are typically handled as “resource bundles”, which are text files in which, let's say, number 149 means “State” in the US English bundle, “County” in the UK English bundle, or “Province” in the Italian bundle.

Sometimes it's not trivial and requires a change in the template; for instance, the address format is different in different countries.

CMS based sites may have some more differences, like having a different set of content, or only a subset of the translated content being available. This drives the need to give an option of adding or removing navigation items.
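In code, a resource bundle lookup boils down to something like the sketch below. The bundles are plain dictionaries here rather than text files, the locale codes and the fallback to US English are my assumptions, and the label values are simply the ones from the example above.

```python
# One bundle per locale; the same numeric key means the same slot on the screen.
BUNDLES = {
    "en_US": {149: "State"},
    "en_GB": {149: "County"},
    "it_IT": {149: "Province"},
}
DEFAULT_LOCALE = "en_US"  # assumed fallback when a locale or key is missing


def label(key: int, locale: str) -> str:
    """Look the key up in the locale's bundle, falling back to the default bundle."""
    bundle = BUNDLES.get(locale, {})
    if key in bundle:
        return bundle[key]
    return BUNDLES[DEFAULT_LOCALE][key]


us_label = label(149, "en_US")
uk_label = label(149, "en_GB")
fallback_label = label(149, "fr_FR")  # no French bundle yet, so the default is used
```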

6) Translation workflows 

Typically applications do not have a requirement for a translation workflow, while CMS driven global sites may. It's not a simple workflow. Every item created in the base language needs to state whether it is relevant for global markets, or for which specific other markets. These markets need to decide if they indeed want to carry the content; if so, they need to send it for translation or accept it as is, and otherwise decline it. The translation itself needs to be previewed, and edited if required, before making it to the site.
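The workflow described above can be sketched as a small state machine, one instance per item and market. The state names, the `TRANSITIONS` table and the `MarketCopy` class are my own labels for the steps in the text, not the workflow model of any specific CMS.

```python
from enum import Enum


class State(Enum):
    OFFERED = "offered to market"
    DECLINED = "declined"
    ACCEPTED_AS_IS = "accepted as is"
    IN_TRANSLATION = "in translation"
    IN_REVIEW = "in preview and edit"
    PUBLISHED = "published"


# Which steps may follow which, per the description above.
TRANSITIONS = {
    State.OFFERED: {State.DECLINED, State.ACCEPTED_AS_IS, State.IN_TRANSLATION},
    State.ACCEPTED_AS_IS: {State.PUBLISHED},
    State.IN_TRANSLATION: {State.IN_REVIEW},
    State.IN_REVIEW: {State.IN_TRANSLATION, State.PUBLISHED},  # edits may go back
    State.DECLINED: set(),
    State.PUBLISHED: set(),
}


class MarketCopy:
    """Tracks one base-language item as it moves through one market's workflow."""

    def __init__(self, item_id: str, market: str):
        self.item_id = item_id
        self.market = market
        self.state = State.OFFERED

    def move_to(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.name} -> {new_state.name} is not allowed")
        self.state = new_state


italian_copy = MarketCopy("story-1", "it_IT")
italian_copy.move_to(State.IN_TRANSLATION)
italian_copy.move_to(State.IN_REVIEW)
italian_copy.move_to(State.PUBLISHED)

swedish_copy = MarketCopy("story-1", "sv_SE")
swedish_copy.move_to(State.DECLINED)  # this market chose not to carry the story
```

Multiply this by the number of items and markets and it becomes clear why this is not a simple workflow.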

7) Spell checks:

Spell check, though easily available for some languages, poses special challenges for certain other languages (like Swedish, Arabic etc.), where words may be compounded or vowels may be dropped, and the result may or may not be correct depending on the context. Some CMS systems may offer a short-lived differentiator through innovations here.

8) Phonetic keyboards

While it is a given that the content authors will use the keyboard of the language they are writing in, the same is not necessarily true for the end users. For instance, in India, users like myself can read and write Hindi but can't type it. We just haven't ever used the keyboard. So for users like myself, message boards and comment boxes provide an option of phonetic keyboards, where I type using an English keyboard and it picks the correct set of Hindi letters.

Language Aware and Language Tolerant modes for internationalization:

Depending on your requirements, you can get away with “Language Tolerant” applications. Language Tolerant applications don't care what you put in them as long as it is in the correct encoding. They just treat everything like an image, where the person uploading the image needs to worry about the size, palette, etc. Similarly, they cannot offer text sorting, spell check etc. in some applications, while they do in some others.

Most CMS systems, languages, databases etc. are aware of a few languages and tolerant of others. It requires care to ensure that the required features are provided, either directly or by integrating with an appropriate word processor or other external system to fill the gap.

On a further note, let me give a small warning on the cost of internationalization. Internationalization has two parts – globalization and localization. Globalization means making the core application tolerant of different languages, currency and dates it needs to support. Typically a ground up application will require about 30% extra effort for the same.

Localization means taking a globalized application and providing a language aware interface for a given language. This could be either trivial or very expensive, depending on the native support available from the core platforms and the desired features. It could drive you to purchase extra software licenses, spend more money on integration, make more UI changes like local address formatting, and spend a lot more money on local validations like zip code checkers, phone number format verification etc. This is the part which is difficult to estimate and lacks a benchmark number.

Wiki and the Enterprise

December 4, 2006

Alan recently wrote about why he is sceptical about Wikis in the enterprise.

My view of Wiki has been completely different. Yes – people used to call me geek, but now that I am no longer one, I still find myself using Wikis.

Here are some good examples of how enterprises benefit or may benefit from Wikis:

– Help: It makes a lot of sense to write help in a wiki. Users, when they are pained by the system, discover how to make it work. I have tried it and it works. The user community just has to be large enough.

– Policies and procedures: While policies and procedures are not good candidates for being edited by users, it is a good idea to have a comments page associated with them, where users can elaborate or give examples. I haven't tried this, so I don't know if it will work.

– Collaborative content creation: Wikis also provide a good platform for writing these policies and procedures in the beginning.

My own company preaches practitioner driven processes and is experimenting with re-writing its processes using a Wiki, and I am keenly waiting to see if it gets adopted. However, there is a catch: it's not that good if we need to have diagrams and graphics.

– Handbooks, group bookmarks and other updatable team info: I am in the software industry, and we have something called a “developers handbook”. This is like a quick reference telling us where things are and the steps for getting things done. It needs quick updates (for example, the URL of the latest release keeps changing). Wikis are a good place to host it.

Wikis do have a use (probably not as much as we geeks and the vendors think), and they are a powerful answer to many questions yet to be asked by the next generation of organizations with distributed head offices and distributed teams.