Print to Web

September 24, 2006

Print content usually resides in PDF, Quark or Microsoft Word files. Frequently, that content needs to be taken onto the web.

The best place to tap into this is the print CMS, where the content still carries some metadata: the header and the body of a story are held together, so you can associate them.

At this point, there are only a few issues to deal with:

1) Character sets: a few special characters, such as the opening and closing typographic quotes, need to be converted into plain single or double quotes.

2) Special characters in the fonts used by print, for example for bullets

3) Hyphenation – if manually introduced

4) Unwanted line breaks – if manually introduced
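
The four clean-up steps above can be sketched roughly as follows. This is only a minimal illustration in Python: the function name `clean_print_text` and the replacement table are my assumptions, and the private-use code point is a stand-in for whatever bullet glyph your print font actually uses.

```python
import re

# Hypothetical mapping table; extend it for the fonts used in your own shop.
REPLACEMENTS = {
    "\u2018": "'",   # opening single quote
    "\u2019": "'",   # closing single quote
    "\u201c": '"',   # opening double quote
    "\u201d": '"',   # closing double quote
    "\u2022": "*",   # standard bullet
    "\uf0b7": "*",   # assumed private-use bullet from a symbol font
}

def clean_print_text(text):
    # Steps 1 and 2: map typographic quotes and font-specific bullets.
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Step 3: rejoin words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Step 4: turn single line breaks into spaces, keep blank lines
    # (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

Note that step 3 will also merge a genuine compound such as "well-known" if it happens to break at the hyphen, so the output still needs a human eye or a dictionary check.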

The second best place is the original content submitted by authors. The advantage is that you have the full content, not a version edited to fit the space available. The big disadvantage is that the editorial changes may be missing, so the content will require re-editing.

Once the print content leaves the print content management system, or if the original contribution (email, Word document or whatever) is lost, or if the final print version is required, the magnitude of the task increases manifold.

The story may now be split across multiple frames, with different frames for headline, date, body and so on. Correlating them can be a challenge.

The story may also be split across multiple pages (as in "Continued on page 6").
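
As a rough sketch of how page continuations might be stitched back together, assume each text frame has already been extracted with its page number. The `Frame` class, the `stitch` function and the exact "Continued on/from page N" wording are all assumptions here; real publications word these markers in many different ways.

```python
import re
from dataclasses import dataclass

# Assumed marker wording; adjust to the house style.
CONT_ON = re.compile(r"\(?\s*continued on page (\d+)\s*\)?\s*$", re.IGNORECASE)
CONT_FROM = re.compile(r"^\s*\(?\s*continued from page (\d+)\s*\)?\s*", re.IGNORECASE)

@dataclass
class Frame:
    page: int
    text: str

def stitch(frames):
    """Join frames ending in 'Continued on page N' with the matching
    'Continued from page M' frame on page N."""
    # Index continuation frames by (page they come from, page they are on).
    continuations = {}
    for f in frames:
        m = CONT_FROM.match(f.text)
        if m:
            continuations[(int(m.group(1)), f.page)] = f
    stories = []
    for f in frames:
        if CONT_FROM.match(f.text):
            continue  # attached to its opening frame below
        text, page = f.text, f.page
        while True:
            m = CONT_ON.search(text)
            if not m:
                break
            target = int(m.group(1))
            nxt = continuations.get((page, target))
            text = text[:m.start()].rstrip()  # drop the marker itself
            if nxt is None:
                break  # continuation not found; leave for manual fixing
            text = text + " " + CONT_FROM.sub("", nxt.text, count=1)
            page = target
        stories.append(text)
    return stories
```

This breaks down as soon as two stories on the same page both continue on the same target page, which is exactly the kind of case where a person has to step in.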

Even within the same frame, identifying metadata such as the author's name may not be easy, because multiple conventions are followed.
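
To illustrate the byline problem, here is a small sketch that tries a list of patterns against the first few lines of a story. The three conventions shown ("By Name", a leading dash, "Name, Staff Reporter") and the names `BYLINE_PATTERNS` and `find_author` are hypothetical examples, not a catalogue of what any particular publication does.

```python
import re

# Hypothetical byline conventions, tried in order.
BYLINE_PATTERNS = [
    # "By Jane Doe" / "BY JANE DOE"
    re.compile(r"^(?:By|BY|by)\s+(?P<name>.+)$"),
    # "- Jane Doe", also with an en or em dash
    re.compile(r"^[-\u2013\u2014]\s*(?P<name>[A-Z].+)$"),
    # "Jane Doe, Staff Reporter"
    re.compile(r"^(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+),\s*(?:Staff Reporter|Correspondent)$"),
]

def find_author(lines, max_lines=5):
    """Return the first byline found in the opening lines, or None."""
    for line in lines[:max_lines]:
        line = line.strip()
        for pattern in BYLINE_PATTERNS:
            m = pattern.match(line)
            if m:
                return m.group("name").strip()
    return None
```

A headline such as "By the river" would fool the first pattern, which is why the list of patterns has to grow with the layouts actually in use and why a manual check remains necessary.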

This means that such a system and process need to be developed over time, to cater for the different types of layouts commonly used in the organization, and by standardising on some kind of templates (frame sets) within the publishing house to make auto-recognition easy.

For existing content, the conversion tool needs to be visual, such as a Quark or Adobe plugin, which automatically finds a story but lets the user fix the mistakes it makes.

If such a tool can be made to auto-learn, it's even better.

The task is slightly easier for media where content runs top-down rather than side by side, as in typical magazines.

Even the digitization companies use a mix of automated and manual processes. It seems that for content to be re-purposed at the article level (not at the level of the entire edition), complete automation is still not practical, even though these tools allow conversion to HTML and XML.

Please do drop me a note if you have come across a better way of doing the same.

