New SharePoint Templates are launched - but none for ECM
March 20, 2007
Last week, Microsoft launched the much-awaited SharePoint templates. These are a kind of solution accelerator, and similar templates have been available with earlier versions as well. However, the release was a bit disappointing: it covers only SharePoint Services, addressing workgroup scenarios with dashboards of events, projects and collections of documents. Most of these templates are upgrades of those available earlier, and nothing is available for ECM and WCM scenarios.
Like MS CMS, MOSS makes some things difficult, such as making a story appear in two places on a website, or making a display template look different from the content entry template. Application templates that cover such scenarios are definitely needed, as they would cut implementation time by 3-4 weeks.
The default templates and Role Based Templates also leave much to be desired, with only a simple WCM scenario available in the default templates.
While Microsoft’s marketing machine is positioning MOSS up from department to enterprise level, these templates are just not helping to confirm the same.
Let’s hope that better application templates are on the anvil and will be out soon.
Notes on Archival
March 11, 2007
Had the pleasure of talking to Dr. Ram (Ramachandran Narayanaswamy) on a flight back from Europe this week. Dr. Ram heads the storage vertical at MindTree, and we both had very passionate opinions about content archival. I expressed the same opinions I had in this post on Apoorv’s blog: Challenges of ECM 1.0 still not solved.
Dr. Ram needed some 90 year old records from the local council, and he was able to get them: a sheet of paper, hand written, still available and still understood. Contrast this with records archived about 20 years ago. They will be on an 8 inch floppy disk. As he puts it, there are multiple dimensions of complexity here:
1) You need to find the physical disk (let’s assume you do - after all, you can still find a 90 year old paper)
2) The media should be in good condition (let’s assume it is)
3) You need to have a drive to load it (let’s assume you archived a drive along with the disk, every year)
4) You need to be able to physically connect it (do you need a PC of yesteryear?)
5) You need drivers for it (do you need a PC/OS of yesteryear to be archived as well?)
6) You need to be able to read files from the file system of that time, in the encoding/character set of that time
7) You need a copy of WordStar to read the file, something which can run on current PCs/OSes - or, if you use an old machine/OS, a way to export the data to currently readable formats. Or will you need to print it and go for OCR?
All this is too complex and too much work - something you might be able to do for information which is very valuable. If so, what’s the point in archiving the rest?
Now imagine the data in archived databases - what are the chances of it making sense even if you are able to fulfil all 7 conditions above? What about images? Will JPEG be readable tomorrow? We are still able to see 100 year old photographs - but will we be able to see our albums, which are JPEG, just 20 years later?
So what’s the solution?
Constant upgrade of archived content? Isn’t that too expensive and too much work as well?
His opinion was that any long term storage needs to be readable in natural language (including database data). That makes sense for a piece of paper - but for digital records? Well, Dr. Ram said that we still have to find a solution, and that is one thing he will be looking at as part of his and his team’s work on storage.
Well, with the short-sightedness of an engineer and not a researcher, I look at it slightly differently. While there may be no solution today, I believe that if there ever is a solution, it will be for the most popular content format. Thus I would rather place my bets on HTML - better still if I have an XML version of the data behind these HTML pages. Also, the archive storage should be online, like a NAS, and not offline, so that when you upgrade your NAS you automatically upgrade your data as well.
Dr. Ram and others working in this field of archival and storage - please solve this soon, and solve it for documents first and databases later.
Are CPU Licenses the culprit for complicated architecture?
February 16, 2007
Today, for the third time in the 6 years I have put in as an architect, I am being pushed to create unnecessary distribution in the architecture to get around CPU licenses. I am still not convinced that I need to give in to it.
The problem is simple. Let’s say I need a CMS, a search engine, an imaging library and a portal server in the architecture. All these components can happily co-exist on all machines, but all of them have CPU based licensing, and the vendors will possibly force the client to pay 3 times the license fee actually required if we were to take the simple approach.
So what’s the alternative? Distribute?
You put the imaging library on its own server, keep the CMS, search engine and portal on their own separate machines, and possibly create redundancy by having a spare machine on which all of these are installed.
The result is higher network traffic, slower applications and complex deployment operations.
It’s high time vendors figured out a way of defining CPU thresholds on each machine, to let clients have, let’s say, a 1 CPU license on a 4 CPU machine - or at least monitored utilization for license fee enforcement, rather than enforcing licenses for the entire deployment architecture.
Do you think it’s reasonable to tell vendors: I will buy a 2 CPU license but put it on 4 machines, because I know I am not going to use more than that?
What do you do in these cases? Negotiate hard, pay up, go for a different license model with the vendor, or complicate the architecture?
Yet Another search on Google Coop
February 2, 2007
I figured Google coop could be a good way to search my feeds.
So I imported my OPML file, put the custom search engine at http://www.google.com/coop/cse?cx=007715917224222119143%3Agauyeww9pbk , added a couple of weeks of browser history and told it to search the web as well.
I am not highly impressed with the results yet, but I believe it has potential, especially as long as I am willing to remove sites from the list and not just add to it.
Interested people are most welcome to try it. I have also given everyone the option to edit, in case someone else wants to collaborate to get a better list out.
It’s mostly CMS and portal related sites, plus a few vendor sites (both commercial and open source - unfortunately the whole of Microsoft and IBM, as I couldn’t figure out how to include specific products only).
The danger is that I will have an even smaller chance of discovering things I don’t know about :-( That’s not too exciting.
Convention, configuration and metaprogramming
January 24, 2007
Regu, whose IBM Yahoo Omnifind search review I posted earlier, has a very interesting byte about Convention over Configuration.
He believes that with the increasing commoditization of IT, cost, productivity etc. are very important, and he goes on to suggest that we could use conventions instead of configuration (like Ruby on Rails does) towards this end. It’s a great thought, I believe.
However, I do not believe IT is a commodity as of today. I won’t explain that, as Sadagopan has done a good job of articulating the same. He in fact actively opposes Nicholas Carr’s view that IT doesn’t matter. I am not sure that there are many analysts who dispute the value of IT spending, especially with IT being 50% of capex these days. Read the story here: http://123suds.blogspot.com/2007/01/it-does-matter.html
However, I strongly believe that most things start tailor-made and later bifurcate into commodity and designer ware. That is very likely to happen with IT as well. So commoditization is inevitable.
I am also a great believer in less code contributing to maintainability (rather than flexibility). So personally I am not a big fan of excessive configurability (as it invariably leads to a lot more code, until you use a rules engine). Have a look at this interesting post from Donald Ferguson - ex-IBM, new Microsoft employee.
http://www-03.ibm.com/developerworks/blogs/page/donferguson?entry=less_code
Similarly, I am not a big fan of code generators, as once you customize the generated code, the code generators cannot help you. But I think the metaprogramming guys have cracked it. RoR is showing the way for metaprogramming. Java has a poor cousin in JSR 52 (the Standard Tag Library), which is just a start, and that too behind its time. One wants a lot more. Similarly, I have seen on at least two occasions, on large projects in our company requiring Swing based forms, architects going in for metaprogramming of those forms using an XML based language they defined.
Using convention does make sense. Swedish, Arabic, Sanskrit and to some extent most languages allow you to join words and prefix/suffix part-words to make new words which mean as much as sentences. If we can learn that, conventions should come naturally to us. It’s a good thought, and Sun, IBM, Oracle or whoever’s job it is to drive Java these days - please take notice.
Jakob Nielsen’s 10 Best Intranets finds no standard CMS
January 18, 2007
Jakob Nielsen published the study on the 10 Best Intranets, and it exposes some interesting facts - things we all knew existed, but never thought the best intranets were built on.
One of the points he makes, which I am going to talk about today:
“This year, all the winning intranets were template-driven and relied on a content management system (CMS). Strikingly, most intranets used their own homemade CMS. Thus, even though there are standards within each intranet, there’s no standard across intranets, even in the choice of CMS.”
Almost everyone I talk to about intranets can cite more than one example of prototypes and demos that became live intranets and flourished from there. Not surprisingly, many of the working prototypes tend to be based on open source products. However, this survey doesn’t mention any open source products, and that I do find surprising. Possibly the choice of products was so varied among these 10 companies that no product was used at multiple places.
However, more and more companies are looking at standardizing their intranet platforms. Amongst the projects that my company and I have been involved in, the platforms were implemented using varied technology sets, like Documentum and ATG Portal, Interwoven TeamSite + MediaBin and BEA portal, Vignette Suite, SharePoint + LAMP based open source, complete custom development, etc.
The point here is that the list of features required for an intranet is so varied that no matter what product set you go with, you will need to plug in applications and third party components, and do a lot of custom building. Usually the product choices are driven by the requirements of the first set of sites to be created, and that is not a bad strategy, as it helps time to market for the first set, thus keeping the interest of the organization high.
So which product you go with does not matter (you have to significantly customize all of them). What matters is that the product should be extensible, should let you plug in or override functionality like authentication, search etc., should make it easy to integrate with custom developed applications, and should have a licensing option friendly to your needs. Most critically, it should expose a lot of interfaces to allow third party applications to connect to it and, in certain cases, drive it.
I think the only vendor which would come close to providing a ready to use Intranet would be SAP - some day soon.
Yahoo/IBM Omnifind where?
January 1, 2007
A co-architect at MindTree, Regunath Balasubramanian, had a good look at IBM/Yahoo Omnifind. Here is what he had to say about it.
———
Omnifind offers three types of relevance - level from entry link, how recent the document is, and page rank in terms of links pointing to the page. These give a good default ranking when you are indexing a site by crawling.
Omnifind could crawl sites via an authenticated proxy, but could not crawl secure sites, at least not by default. It could crawl the file system. It was also not apparent how authenticated sites would work.
There is a limit of 400,000 docs, beyond which you need to upgrade to the commercial version.
It provides interfaces - very comprehensive for search, and slightly cumbersome for inserting into and deleting from the index.
There was a limit on max results (1024), which should not be a big problem in most cases.
The UI is nice and easy to use, and you can get started easily.
It supports the popular file formats out of the box.
It was also not apparent whether you could control the section of the page to be indexed - for instance, can you tell it not to index the keywords in the navigation?
Now the question is: where will you use this engine?
On looking under the hood, it uses Lucene, which is a popular Java search engine. That is the choice of many Java developers, so it is unlikely to be questioned.
It also has Apache Derby, which provides a file based database. Now Derby is not known to work in a clustered environment. Can you cluster it? At least for failover?
What exactly it uses Derby for is not clear. It could use it to store a cache of the web pages, thus reducing the load on Lucene, or possibly it is used to store only the configuration.
It seems to work well for crawling and searching. It’s too heavy for desktop search, so that doesn’t seem to be what it is aimed at.
In case you want to use it for searching database content, any value over vanilla Lucene is not apparent.
=================
So what market is it actually targeting? If it searches only non-authenticated HTTP sites, is it targeting searches on sites publishing documentation or something else available to all? That’s a very limited market to enter! Almost every intranet site and most internet sites will also have dynamic sections where ACL based search will be desired, forcing the architects to go for two solutions instead of one (if considering Omnifind).
It will definitely raise the bar for commercial search engines in this market and put the onus on them to prove why they should be worth the extra money - or to bring in low cost entry level licensing, the same thing that happened in the database market, where Oracle and IBM offer single user databases up to 2 GB for free.
Stateless clustering at Web Tier
December 26, 2006
That we need to identify a user over a session is a given for most web apps. Sessions typically inject state, binding users to web servers. However, such binding brings certain issues, like the inconvenience of having to log in again in case of server failure, and difficulty in distributing load and sessions evenly. Though these issues do not sound that important, their visibility challenges architects to either convince the CIOs or overcome them. Technology and vendors have been obliging, but does that help?
Session failover has been around in J2EE app servers for a while. Session failover has been available on the Microsoft platform since .NET was born. DB based sessions and client based sessions (either hidden variables or hidden frames) have always been present. But are they used?
A quick survey reveals that app server based/IIS based session failover is used by architects to answer the tough questions from CIOs, but ultimately it is not used in production due to the performance degradation it causes. Depending on the session size, enabling session clustering can slow down your system by anywhere from about 30% to 500%! So if you can limit the session size to < 10 KB, or if resources are not an issue, you can venture there.
DB based sessions are quite heavy but slightly less expensive. The problem with these is that clustered DB servers are expensive. So you want to minimize the session size, both to increase performance and to keep the cost low by having a smaller session DB server.
Similarly, client based sessions impose a two way transfer of session data for every request, hence practically limiting them to the same range (< 10 KB).
So what if you have bigger data?
1) You can consider a cluster aware cache. Some statistics indicate that these are cheaper than the combined software/hardware licenses for a DB.
2) You can consider a cluster unaware cache, but you need to make sure that all the information required to “re-build” the session data is available either in the clustered session or with the client.
To take an example, one of the applications I was involved in required shopping for a hotel. In this case, we would keep the criteria which led to finding the hotel with the client. The hotel results were kept in a cluster unaware cache. The load balancers were kept sticky. So on a user request, if the data is available in the session, users get the data; else the query is fired to the source systems again.
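The hotel example can be sketched roughly as follows. This is only an illustration of the idea, not the code we used: the class name LocalResultCache, the find_hotels function and the criteria tuple are all made up for this sketch. Each server holds its own unreplicated cache, and the small search criteria travel with the client, so any server can rebuild the results.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical search criteria kept with the client: (city, check-in date).
Criteria = Tuple[str, str]


class LocalResultCache:
    """Per-server cache that is NOT replicated across the cluster."""

    def __init__(self, source_query: Callable[[Criteria], List[str]]) -> None:
        self._source_query = source_query
        self._results: Dict[Criteria, List[str]] = {}
        self.source_calls = 0  # how often we had to go back to the source system

    def get_results(self, criteria: Criteria) -> List[str]:
        # Cache miss (first request, or failover to a fresh server):
        # rebuild the results from the criteria the client sent along.
        if criteria not in self._results:
            self.source_calls += 1
            self._results[criteria] = self._source_query(criteria)
        return self._results[criteria]


def find_hotels(criteria: Criteria) -> List[str]:
    """Stand-in for the expensive query against the source systems."""
    city, check_in = criteria
    return [f"{city} Hotel {n} ({check_in})" for n in (1, 2)]


server_a = LocalResultCache(find_hotels)
server_b = LocalResultCache(find_hotels)
criteria = ("Bangalore", "2007-01-15")  # sent by the client on every request

first = server_a.get_results(criteria)           # miss: query the source
again = server_a.get_results(criteria)           # sticky request: served locally
after_failover = server_b.get_results(criteria)  # server A is gone: rebuilt on B
```

The user sees the same results after the failover; the only cost is one extra query to the source systems.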
Having said that, my quick survey tells me that DB based sessions are the most popular choice for applications where you do want to use clustered sessions in production (rather than app server/IIS based sessions), even though they require additional code.
CMS performance - where did the Iron go?
December 12, 2006
It is difficult to come across a CMS implementation where the business owners do not complain about the hardware. It could be either that the speed is too slow, or that the CMS requires too much hardware. The seasonal nature of application usage does not make it any easier.
Fortunately, performance tuning is rapidly becoming a predictable science, rather than the witchcraft and art it was perceived to be earlier.
Here are the top reasons why a typical CMS based site could be slow:
1) Complex, dashboard style content delivery pages:
If the application has complex dashboard style pages, showing different content items from different “sections” of the content repository, chances are that it will require too many queries and will be too heavy on the database. Typically these interfaces will not have specifically selected content items; they might have the top 3 / latest 3 from different sections.
If you have such a scenario, you have to look at two things:
a) Cache. Consider caching different sections of the page or the page itself.
b) Denormalization: An ideal content repository is normalized. That means that the same information appears only once. So if we have tagged content items under specific categories, the tagging information resides with the content. This requires that to surf each tag, we have to go through the entire content repository (or an index spanning the entire content repository). Normalization also means that we break the information related to an asset into different relational tables based on the logical entities. For instance, if we have an image and an article content type, and each content type has some common attributes like create and publish info, then we will have three “tables”: BaseContent, ArticleContent and ImageContent. So to get the information to show on the page, I am not just scanning one index; I am scanning multiple tables, comparing different parameters and finally joining them.
This can get inefficient and slow. It is overcome by denormalizing, and there are two ways to do that. One way is to duplicate the data. For example, whenever a content item gets created or modified, we look at it and put its reference into the separate database tables for each of its tags. Even here we may choose to have two tables per tag: all content with the tag “Performance”, and all published content with the tag “Performance”. The moment we do that, the processing needed to find the relevant articles reduces.
The second way would be to aggregate all the data which is required for article selection and for the summary display, and keep it in the same table - either the existing BaseContent or ArticleContent table, or a third cache table.
Denormalization increases the application complexity, as the CRUD operations now need to update multiple places.
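To make the first way (duplicating the data) concrete, here is a minimal in-memory sketch. The ContentStore class and its two per-tag lookups are assumptions for illustration, standing in for the separate database tables described above. Note how save and delete each have to touch several places, which is exactly the extra CRUD complexity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ContentStore:
    """Toy repository with two denormalized per-tag lookups (hypothetical)."""

    def __init__(self) -> None:
        self.items: Dict[int, dict] = {}
        # Stand-ins for "all content with tag X" and "all published content with tag X".
        self.by_tag: Dict[str, Set[int]] = defaultdict(set)
        self.published_by_tag: Dict[str, Set[int]] = defaultdict(set)

    def save(self, item_id: int, tags: Iterable[str], published: bool) -> None:
        """Create or modify an item and keep both tag lookups in step."""
        self.delete(item_id)  # drop stale references left by an earlier version
        tag_set = set(tags)
        self.items[item_id] = {"tags": tag_set, "published": published}
        for tag in tag_set:
            self.by_tag[tag].add(item_id)
            if published:
                self.published_by_tag[tag].add(item_id)

    def delete(self, item_id: int) -> None:
        """Remove an item and every duplicated reference to it."""
        old = self.items.pop(item_id, None)
        if old is None:
            return
        for tag in old["tags"]:
            self.by_tag[tag].discard(item_id)
            self.published_by_tag[tag].discard(item_id)


store = ContentStore()
store.save(1, ["performance"], published=True)
store.save(2, ["performance", "cms"], published=False)
store.save(2, ["cms"], published=True)  # retagged and published later

# Finding published "performance" items is now a single lookup, no scan or join.
published_performance = sorted(store.published_by_tag["performance"])
```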
2) Live data from interfaced applications.
Sometimes, in the content delivery applications, we show live data from external sources, with the data fetched in real time. Whenever this happens, we create a dependency on the response time of the external applications.
The most common methods used to overcome this are to either cache the runtime data, or do a nightly import of the third party data into a local database.
Apart from that, live data sometimes requires lots of CPU cycles to convert the data stream into usable objects and ultimately into presentable HTML. One should look for opportunities to reduce the data from the source to what is required, and to choose formats which provide maximum efficiency in conversion.
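Caching the runtime data can be as simple as a small time-to-live cache in front of the external call. The sketch below is a generic illustration, not taken from any particular CMS; TTLCache, slow_quote and the 60 second lifetime are assumed names and values, and the clock is injectable only so the behaviour can be shown without waiting.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Caches values from a slow external source for a fixed number of seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no call to the external system
        value = self._fetch(key)  # missing or stale: pay the external cost once
        self._entries[key] = (now, value)
        return value


calls: List[str] = []


def slow_quote(symbol: str) -> str:
    """Stand-in for a call to a third party system (e.g. stock prices)."""
    calls.append(symbol)
    return f"{symbol}:{len(calls)}"


fake_now = [0.0]
cache = TTLCache(slow_quote, ttl_seconds=60, clock=lambda: fake_now[0])

a = cache.get("IBM")  # fetched from the source
b = cache.get("IBM")  # served from the cache
fake_now[0] = 61.0    # more than 60 seconds later
c = cache.get("IBM")  # entry expired, fetched again
```

The page now depends on the external response time at most once per minute per key, at the price of showing data that may be up to a minute old.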
3) XSLT processing and XML
Since XML became mainstream some six years back, many architects have given in to its elegance. They pass the data as XML, and the presentation layer uses XSLT to convert it to the desired HTML formats. Unfortunately, XSL transformation is a very expensive process and very difficult to performance tune. XML navigation and deserialization is itself quite inefficient. If you look at it under a microscope, when you are using XPath, the XPath string has to be compiled at runtime to come up with the code for navigating the XML itself.
So don’t give in to XML for internal processing until you see real benefits.
4) Retrieving the content itself
Your choice of content repository (database, file system or a mix) and the means of accessing it have huge implications for the efficiency and speed of rendering. Be careful while choosing non-native connectors to the repository. If you are writing connectors yourself, tune them to eternity. Consider static publishing in the most efficient format (like copying images to the /images folder as files, or putting story based content in the format in which it will be consumed, i.e. HTML snippets or database rows).
5) Single Repository for content production and delivery
Some installations use the same installation for content production as well as content delivery. The content delivery part will typically have seasonality in use. This results in content production getting slowed down when content delivery is near its peak loads. Some way of segregation, or throttling of peak load, helps.
6) Heavy background jobs
Most CMS systems need to do heavy batch processing, be it creation of thumbnails, indexing of content, processing reminders and alerts, etc.
7) Auto refreshes and alerts.
In some really dynamic CMSs or portals, some part of the delivery page needs to be continuously updated - be it content like breaking news, alerts like a new item added to your task list, or information from third party systems like current stock prices or weather information.
Traditionally, applications refresh the entire page for that. We should consider refreshing only parts of pages using Ajax and Ajax-like technologies.
8) Live data set is too large.
Any CMS will typically have a mix of long shelf life content and short shelf life content. This leads to a tendency to leave too much content in the live data store. Alan keeps stressing the importance of data purging, and I couldn’t agree more.
As a rule of thumb, a 10 fold increase in data size makes your system twice as slow. But that is only query processing. It also makes selecting the useful content hard, and increases the chances of finding outdated documents and accidentally linking to them.
In short, give great stress to data expiry and segregate expired data from live data. It is usually easier to implement if you give an option to import from the archive repository to the live repository.
In a typical implementation, we will default a content expiry date unless and until the user changes it.
The expiry algorithm also needs to check on content use, expire unused content and give a warning for used content which is supposed to have expired. This is indeed complex, but worth the money.
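A rough sketch of such an expiry sweep is below. Everything in it is an assumption made for illustration: the one year default shelf life, the ContentItem fields and the use of a simple reference count as the measure of “content use”. A real CMS would take these from its own repository and link tracking.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple

DEFAULT_SHELF_LIFE = timedelta(days=365)  # assumed default, not a recommendation


@dataclass
class ContentItem:
    name: str
    created: date
    expires: Optional[date] = None  # None means the author never changed it
    referenced_by: int = 0          # how many live pages still link to this item

    def __post_init__(self) -> None:
        # Default the expiry date unless the user supplied one.
        if self.expires is None:
            self.expires = self.created + DEFAULT_SHELF_LIFE


def sweep(items: List[ContentItem], today: date) -> Tuple[List[str], List[str], List[str]]:
    """Split items into (live, archive, warnings) as of `today`."""
    live: List[str] = []
    archive: List[str] = []
    warnings: List[str] = []
    for item in items:
        if item.expires > today:
            live.append(item.name)      # not yet due
        elif item.referenced_by == 0:
            archive.append(item.name)   # expired and unused: move out of live data
        else:
            live.append(item.name)      # expired but still used: keep it, but warn
            warnings.append(item.name)
    return live, archive, warnings


items = [
    ContentItem("press-release", created=date(2005, 1, 1)),
    ContentItem("logo", created=date(2005, 1, 1), referenced_by=3),
    ContentItem("news", created=date(2006, 11, 1)),
]
live, archive, warnings = sweep(items, today=date(2006, 12, 12))
```

Here the unused press release moves to the archive, while the expired logo stays live because pages still reference it, and shows up in the warnings list for an editor to review.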
Internationalization in Content Management vs Internationalization in applications
December 11, 2006
Here is how the following issues play out in applications vs CMS systems:
1) Character set and encoding selection at the UI, application and data tiers
I mention all of them separately, as the choice traditionally has not been obvious. Let’s look at the responsibilities of each of them.
User Interface:
The user should have fonts to render the given character set, and a keyboard setting to type it (i.e. the browser should support them). Applications targeted at users on Mac OS X and Windows 2000+ can assume the users have Unicode fonts, but others cannot. The others typically require the user to have installed a font for the Unicode character set (it may not be pan-Unicode, but it should at least cover the given language).
It is likely that the font we want to use does not come in the Unicode character set, which forces either a different font or a different character set.
The next big issue is search. You want to make sure that the search engines can index and find your site, and hence you need to use an encoding which the local search engine supports. For instance, Google supports Unicode, but not all search engines do.
Also, if the application delivers email, note that the adoption of Unicode there has been slow, especially for Japanese. Some of the traditional email utilities do not handle Unicode, though the situation is changing fast, with operating systems of the last 5 years supporting Unicode natively.
Are there other UIs apart from the browser, like smartphones and Pocket PCs? (Mobile devices, including smartphones and Pocket PCs, from after 2000 support Unicode, but it is still common to find applications used on these devices which do not.)
For a typical CMS, this becomes very important. You need to know where you are delivering content to, and you need to know whether automatic, lossless charset translation is possible from the format in the repository to the format on the UI; only then can you decide which encoding to use.
If you want to use Quark or MS Word for either input or output, you need to be able to support or convert the extra characters they use, like “Smart Quotes”. These translations might have some “loss”.
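The “lossless translation” question can be checked mechanically. The snippet below is a small sketch using Python’s built-in codecs; the helper name can_translate_losslessly is made up for this example. It shows that smart quotes survive a trip through UTF-8 and Windows-1252 but not through ISO 8859-1, and what a forced, lossy conversion looks like.

```python
def can_translate_losslessly(text: str, target_encoding: str) -> bool:
    """True if every character of `text` survives a round trip through the target encoding."""
    try:
        return text.encode(target_encoding).decode(target_encoding) == text
    except UnicodeEncodeError:
        return False


# Curly quotes as produced by a word processor (U+201C and U+201D).
quoted = "\u201cSmart Quotes\u201d"

ok_utf8 = can_translate_losslessly(quoted, "utf-8")
ok_cp1252 = can_translate_losslessly(quoted, "cp1252")
ok_latin1 = can_translate_losslessly(quoted, "iso-8859-1")

# Forcing the conversion anyway replaces the unsupported characters: this is the "loss".
lossy = quoted.encode("ascii", errors="replace").decode("ascii")
```

Running such a check over the repository before choosing a delivery encoding tells you whether you need a conversion step (for example, mapping curly quotes to straight ones) or a different target encoding.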
Application Encoding and Characterset:
The applications need to emit formatted strings in the languages they support, e.g. a date in DD-MON-YYYY format. They may also need to do alphabetic sorts, search text, count characters and words, and compare strings. All these require the application to support the given character set and encoding for the given language. You might have hardcoded strings. You might want to keep your JSP/ASP.NET pages in the encoding you want to render, to support the same.
Database Encoding and Characterset:
Content encoding in the database is perhaps the most debated point. Choosing the correct encoding affects storage space, the number of characters which can be supported in typical text fields (not the LONG and CLOB fields, which require additional programming to be fetched), and the processing involved in storing and interpreting data, charset to encoding conversion and the works. The database is typically the most expensive piece in the architecture of a CMS as well.
Typically, people advocate choosing the encoding based on the language in which 80% of the content is stored. In case 80% or more of the content is English, using UTF-8 saves space while requiring extra processing in some cases. In case most content uses characters beyond US ASCII, UTF-16 might be a better bet, though it still wastes space on the English characters. If it’s anything in between, the cost benefit in either case is not clear. Chances are that if you want to support western European languages only, ISO 8859-1 might turn out to be the most optimal.
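The space trade-off is easy to measure on sample content. The following sketch uses Python’s built-in codecs; the storage_bytes helper and the sample strings are made up for illustration, and a real database adds its own overheads on top of the raw byte counts.

```python
from typing import Dict, Optional


def storage_bytes(text: str) -> Dict[str, Optional[int]]:
    """Bytes needed per encoding; None where the encoding cannot represent the text."""
    sizes: Dict[str, Optional[int]] = {}
    # "utf-16-le" is used so the count excludes the 2 byte byte-order mark.
    for encoding in ("utf-8", "utf-16-le", "iso-8859-1"):
        try:
            sizes[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            sizes[encoding] = None
    return sizes


english = storage_bytes("content")                          # 7 ASCII characters
french = storage_bytes("caf\u00e9")                         # 4 characters, one accented
japanese = storage_bytes("\u30b3\u30f3\u30c6\u30f3\u30c4")  # 5 katakana characters
```

For the English word, UTF-8 and ISO 8859-1 need 7 bytes against 14 for UTF-16. For the Japanese word, UTF-16 needs 10 bytes against 15 for UTF-8, and ISO 8859-1 cannot store it at all.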
2) Timezone, Date, Time
Internationalized applications typically need to support multiple timezones, either on a single installation, or with one timezone per installation. While supporting multiple timezones, the applications need to choose a default timezone for the user and offer them the option to change it. In a typical application we see this supported for users, while in a CMS we are more worried about the content creation side, and on content delivery we really do not care. We would typically offer a date/time based on the perceived location of the site instead of the user location. Of course there are exceptions, like calendaring, schedules, car pickup times etc., which are dependent on user or event timezones, and they become important accordingly.
3) Currency
Most applications need to handle money and currency specifically. In the case of CMS systems, currency is usually a translation issue and not an application issue.
4) Dimensions, speeds and other measurements
In a typical application, it becomes important to handle dimensions in the system in use (FPS or metric). In CMS systems, only graphics may be dimension aware; for the rest of the content, this is again addressed as a translation issue.
Currency and dimensions as a translation issue: In a CMS, the numbers inherently form a part of the body of the content, and are either hard to identify and separate, or not intended to be separated. They typically require translation even within the same language. For instance, a story on speed records needs to say 300 MPH in Britain, while it needs to say 500 KMPH in India. A statement like “an area the size of Luxembourg” for European consumers needs to be translated to something like “an area the size of Lake Iliamna” for Alaskan readers, or “3 times the size of New York City” for mainland American consumers. Similarly, a sum of 6 billion kroner needs to be translated to “half a billion sterling” or “500 M” depending on the space available.
5) Labels and Navigation:
In a typical application, there is a 1 to 1 mapping between the different languages. Hence the labels, messages and navigation are translated 1 to 1. These are typically handled as “resource bundles”, which are text files - with, let’s say, number 149 meaning “State” in the US English bundle, “County” in the UK English bundle and “Province” in the Italian bundle.
Sometimes it’s not trivial and requires a change in the template - for instance, the address format is different in different countries.
CMS based sites may have some more differences, like having a different set of content, or only a subset of translated content being available. This drives the need to give an option of adding/removing navigation items.
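The resource bundle lookup described above can be sketched in a few lines. The bundles are shown here as in-memory dictionaries rather than text files, and the locale codes, the label function and the fallback to a default locale are assumptions added for illustration.

```python
from typing import Dict

# One bundle per locale; key 149 is the example label number used above.
BUNDLES: Dict[str, Dict[int, str]] = {
    "en_US": {149: "State"},
    "en_GB": {149: "County"},
    "it_IT": {149: "Province"},
}
DEFAULT_LOCALE = "en_US"  # assumed fallback when a locale or key is missing


def label(key: int, locale: str) -> str:
    """Return the label for `key` in `locale`, falling back to the default bundle."""
    bundle = BUNDLES.get(locale, {})
    if key in bundle:
        return bundle[key]
    return BUNDLES[DEFAULT_LOCALE][key]


uk_label = label(149, "en_GB")
untranslated = label(149, "sv_SE")  # no Swedish bundle yet, so the default is used
```

The template only ever refers to label 149; which word appears is decided by the bundle for the visitor’s locale.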
6) Translation workflows
Typically, applications do not have a requirement for a translation workflow, while CMS driven global sites may. It’s not a simple workflow. Every item created in the base language needs to state whether it is relevant for global markets, or for which specific other markets. These markets need to decide if they indeed want to carry the content, and if so, they need to send it for translation, accept it as is, or else decline the content. The translation itself needs to be previewed, and edited if required, before making it to the site.
7) SpellChecks:
Spellcheck, though easily available for some languages, poses special challenges for certain other languages (like Swedish, Arabic etc.) where words can be aggregated or vowels may be dropped, and the result may or may not be correct depending on the context. Some CMS systems may offer a short-lived differentiator through innovations here.
8) Phonetic keyboards
While it is a given that content authors will use the keyboard of the language they are writing in, the same is not necessarily true for end users. For instance in India, users like myself can read and write Hindi but can’t type it. We just haven’t ever used the keyboard. So for users like myself, message boards and comment boxes provide an option of phonetic keyboards, where I type using an English keyboard and it picks up the correct set of Hindi letters.
Language Aware and Language Tolerant modes for internationalization:
Depending on your requirements, you can get away with “Language Tolerant” applications. Language Tolerant applications don’t care what you are putting in them, as long as it is in the correct encoding. They treat everything like an image, where the person uploading the image needs to worry about the size, palette, etc. Similarly, they cannot offer text sorting, spell check etc. in some languages, while they do in some others.
Most CMS systems, programming languages, databases etc. are aware of a few languages and tolerant of others. It requires care to ensure that the required features are provided, either directly or by integrating with an appropriate word processor or other external system to substitute for the missing capability.
On a further note, let me give a small warning on the cost of internationalization. Internationalization has two parts: globalization and localization. Globalization means making the core application tolerant of the different languages, currencies and dates it needs to support. Typically, a ground up application will require about 30% extra effort for the same.
Localization means taking a globalized application and providing a language aware interface for a given language. This could be either trivial or very expensive, depending on the native support available from the core platforms and the desired features. It could drive you to purchase extra software licenses, spend more money on integration, make more UI changes like local address formatting, and spend a lot more money on local validations like zip code checkers, phone number format verification etc. This is the part which is difficult to estimate and lacks a benchmark number.

