Stripping HTML out of Zend Search Lucene indexes
I'm using Zend Search Lucene (versions 0.92-beta and 1.04) to index the content of a web site. Basically it is all working, but the embedded HTML markup is causing a problem: a search for, say, the word 'family' returns almost every page in the site, because 'font-family' appears throughout the markup.
Is there a built-in Zend_Search_Lucene function that will strip out all of the HTML markup from the page before it is indexed? Or should I filter it through the PHP strip_tags() function before indexing it?
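For reference, here is a minimal sketch of the strip_tags() route I have in mind. The helper name html_to_index_text() and the sample page are made up for illustration, and the commented-out indexing calls assume the usual Zend_Search_Lucene_Document and Zend_Search_Lucene_Field::UnStored API with an already opened $index. As far as I can tell, strip_tags() removes the tags but keeps the text inside <style> and <script> blocks, so the sketch drops those blocks first:

```php
<?php
// Hypothetical helper (not part of Zend_Search_Lucene):
// reduce an HTML page to plain text suitable for indexing.
function html_to_index_text($html)
{
    // strip_tags() keeps the text between tags, so remove whole
    // <script> and <style> blocks first or their contents get indexed.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space before every tag so neighbouring words do not
    // run together once the tags are gone.
    $html = str_replace('<', ' <', $html);

    // Remove the remaining tags (including their attributes, so
    // inline style="font-family: ..." disappears too).
    $text = strip_tags($html);

    // Turn entities such as &amp; back into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page for illustration only.
$page = '<html><head><style>p { font-family: Arial; }</style></head>'
      . '<body><p style="font-family: Verdana">Holiday <b>cottages</b>'
      . ' to rent</p></body></html>';

$text = html_to_index_text($page);
echo $text, "\n";

// Assumed indexing step, left commented out so the sketch runs
// without Zend Framework installed:
// $doc = new Zend_Search_Lucene_Document();
// $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $text));
// $index->addDocument($doc);
```

With this in place only the visible page text would reach the index, so a query for 'family' should no longer match pages purely because of their styling.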
Andy