Almost everyone wants guided navigation (also called multi-faceted navigation) now. Commercial vendors like Endeca rule the space, but they tend to get really expensive really fast. So I looked at Apache Solr (based on Lucene) to meet this need for a customer.

Solr is pretty impressive. You can do multi-faceted search and navigation based on pre-defined tags. The navigation can be based on string matches, matches against any one of multiple values, date ranges, etc., and you can optionally combine it with keyword search as well.
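To make this concrete, here is a minimal sketch of how such a request could be put together. It only builds the query string for Solr's select handler using the standard `q`, `fq`, `facet`, `facet.field` and `facet.mincount` parameters. The helper name `build_facet_query` and the field names `brand` and `category` are made up for illustration, and your schema will differ.

```python
from urllib.parse import urlencode


def build_facet_query(keywords=None, facet_fields=(), filters=None, rows=10):
    """Build a query string for a Solr select request with faceting turned on.

    Hypothetical helper: the field names passed in must exist in your schema.
    """
    params = [
        ("q", keywords or "*:*"),   # fall back to match-all when no keyword is given
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),          # ask Solr to return facet counts
        ("facet.mincount", 1),      # hide facet values with zero matching documents
    ]
    # One facet.field parameter per field the user can navigate by.
    for field in facet_fields:
        params.append(("facet.field", field))
    # Each facet value the user has already clicked becomes a filter query.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


# Keyword search for "laptop", narrowed to one brand, with counts for two facets.
query_string = build_facet_query("laptop", ["brand", "category"], {"brand": "Acme"})
print(query_string)
```

The resulting string is appended to the select URL of your Solr core. Each click on a facet value adds one more `fq` entry, which is what gives the "drill down" feel of guided navigation.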

This fits the bill where you have structured metadata (like a product catalogue, or product reviews such as those on CNET).

So what about “guided navigation” for content that has not been meta-tagged as effectively?

Now this is where it becomes challenging in the open source world. There are a few projects like Classifier4j, which uses a Bayesian filter that can be trained to auto-classify content. There are also projects like Carrot2, which do search result clustering. Carrot2 is pretty effective at choosing the phrases to cluster against: about 80% of the categories it determined in my tests were very meaningful. It did appear a bit slow in the tests I ran, though. I am not very sure of its performance on large result sets, or what percentage of meaningful categories it misses.
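For readers who have not seen a Bayesian text classifier before, here is a minimal sketch of the general idea: a multinomial naive Bayes with add-one smoothing. This is not Classifier4j's code or API, just an illustration of the technique. The class name, the whitespace tokenizer and the tiny training set are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Illustrative multinomial naive Bayes; not the Classifier4j implementation."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()                  # every word seen during training

    @staticmethod
    def tokenize(text):
        # Assumption: naive lowercase/whitespace tokenizer, no stemming or stop words.
        return text.lower().split()

    def train(self, text, category):
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = self.tokenize(text)
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


classifier = NaiveBayesClassifier()
classifier.train("cheap laptop with fast processor", "electronics")
classifier.train("tablet and laptop battery review", "electronics")
classifier.train("slow cooker recipe for chicken soup", "cooking")
classifier.train("easy recipe for tomato soup", "cooking")
print(classifier.classify("laptop battery"))
```

The predicted category could then be stored as a field on the document at index time, so that untagged content gets a facet to navigate by. In practice the quality depends heavily on how much labelled training content you have.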

The auto-classifiers need a lot more work. They are not simple plug and play, and I don't know of an effective open source alternative for this yet. So I am spending some time looking at it from the ground up, using a set of existing libraries that can provide a base for text classification. I will post an update if I make headway, and I would appreciate input from anyone who has got this working.

Search result prioritization can be done easily on defined metadata, but I have not tried any “learning” software here. Similarly, I am yet to try “suggestions” and spell checking.

In short: using Lucene/Solr for multi-faceted search is a very viable alternative to complex database queries and to expensive commercial engines that implement the same thing. But you are not getting a 1:1 equivalent of Endeca or Autonomy.