  #1 (permalink)  
Old 03-12-2008, 04:13 PM
Junior Member
 
Join Date: Mar 2008
Posts: 1
Stripping HTML out of Zend Search Lucene indexes

I'm using Zend Search Lucene (versions 0.92-beta and 1.04) to index the content of a web site. Basically it is all working, but the embedded HTML markup is causing a problem: a search for, say, the word 'family' returns almost every page in the site, because font-family appears in the markup of those pages.

Is there a built-in Zend_Search_Lucene function that will strip out all of the HTML markup from the page before it is indexed? Or should I filter it through the PHP strip_tags() function before indexing it?

Andy
  #2 (permalink)  
Old 03-18-2008, 04:33 PM
Member
 
Join Date: Aug 2007
Location: Sweden
Posts: 52
Send a message via MSN to Leif.Högberg

I would do the latter: strip the HTML before you index. You should only index keywords and key data.
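To illustrate the "strip before you index" approach, here is a minimal sketch in plain PHP. The helper name html_to_index_text() and the sample page are hypothetical, not from this thread. One detail worth knowing: strip_tags() removes the tags but keeps the text between <style> and <script> tags, so a stylesheet block containing font-family would still end up in the index unless those blocks are dropped first.

```php
<?php
// Hypothetical helper: reduce an HTML page to plain text before indexing.
function html_to_index_text(string $html): string
{
    // strip_tags() keeps the text inside <script> and <style>,
    // so remove those blocks (tags and contents) first.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space in front of every tag so that words in adjacent
    // elements do not fuse together once the tags are gone.
    $html = str_replace('<', ' <', $html);

    // Remove the remaining tags, including their style="..." attributes.
    $text = strip_tags($html);

    // Turn entities such as &amp; back into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample input (made up for illustration).
$page = '<style>p { font-family: Arial; }</style>'
      . '<p style="font-family: Arial">Our <b>family</b> tree</p>'
      . '<p>Tips &amp; news</p>';

echo html_to_index_text($page), "\n";
```

The cleaned string is what would then be handed to the index, for example as the value of a Zend_Search_Lucene_Field::UnStored() field, in place of the raw page source.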
  #3 (permalink)  
Old 03-18-2008, 04:42 PM
Senior Member
 
Join Date: Feb 2008
Posts: 112

Quote:
Originally Posted by andyt View Post
Is there a built-in Zend_Search_Lucene function that will strip out all of the HTML markup from the page before it is indexed?
Andy,

There is a built-in filter in ZF that you can use.

Zend Framework: Documentation
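
The post does not say which filter is meant. One candidate in Zend Framework 1 is Zend_Filter_StripTags, and that choice is an assumption here. A rough sketch of how it might be used follows; the wrapper name strip_markup() is hypothetical, and it falls back to PHP's strip_tags() when the framework class cannot be loaded.

```php
<?php
// Hypothetical wrapper: use the framework filter when it is available.
// Assumption: Zend Framework 1 is on the include path or autoloaded;
// if not, fall back to PHP's native strip_tags().
function strip_markup(string $html): string
{
    if (class_exists('Zend_Filter_StripTags')) {
        // With default options the filter removes all tags.
        $filter = new Zend_Filter_StripTags();
        return $filter->filter($html);
    }
    return strip_tags($html);
}

echo strip_markup('<p>Our <b>family</b> tree</p>'), "\n";
```

Either way the filtering happens before the text is added to the index, so the index itself never sees the markup.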
__________________
- xentek
All times are GMT.