04-03-2008, 03:50 PM
Junior Member
 
Join Date: Apr 2008
Posts: 1
Zend_Search_Lucene and umlauts

Hi!
I have a strange problem:

I am trying to index a website that is encoded in UTF-8, like this:

PHP Code:
// Use the UTF-8 analyzer for indexing
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()
);

$searchindex = Zend_Search_Lucene::create($dir); // $dir is set correctly

foreach ($urls as $url) { // $urls holds a list of URLs
    $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url['url']);
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $url['url']));
    $searchindex->addDocument($doc);
}
This runs without errors, but searching for a word that contains an umlaut always fails.
I checked the index by listing its terms with this code:
PHP Code:
header('Content-Type: text/html; charset=UTF-8');
/* ... */
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()
);

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$searchindex = Zend_Search_Lucene::open($dir);

// Print every term stored in the index
foreach ($searchindex->terms() as $term) {
    echo $term->key() . "<br />";
}
The result:
...
body�Accounts
body�Achtung
body�AdministrationsoberflĂ
body�Administratoren
body�Adressbestand
body�AdressbestĂ
body�Adressdaten
...

Well...
The term "AdministrationsoberflĂ" should be "Administrationsoberfläche", and "AdressbestĂ" should be "Adressbestände". Every word containing an umlaut is cut off after the first umlaut, which is also unreadable.
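The same kind of truncation can be reproduced in plain PHP, without the index. This is only a sketch under assumptions: it guesses that the UTF-8 bytes of the page are read as a single-byte charset somewhere before tokenizing (ISO-8859-2 is picked here purely because it yields the same characters as in my output), and it stands in for the analyzer with a simple "run of letters" match. The helper names `firstLetterRun` and `misdecode` are made up for the illustration and are not part of Zend_Search_Lucene. It needs the mbstring extension and PCRE with Unicode support.

```php
<?php
// Hypothetical helper: return the first run of Unicode letters in a UTF-8 string.
// This only imitates a letters-only tokenizer; it is not the Zend analyzer.
function firstLetterRun($utf8Text)
{
    return preg_match('/\p{L}+/u', $utf8Text, $m) === 1 ? $m[0] : '';
}

// Hypothetical helper: simulate UTF-8 bytes being misread as a single-byte
// charset and then re-encoded as UTF-8 (the assumed source of the problem).
function misdecode($utf8Text, $assumedCharset = 'ISO-8859-2')
{
    return mb_convert_encoding($utf8Text, 'UTF-8', $assumedCharset);
}

// This file itself must be saved as UTF-8 for the literal below to be correct.
$word = "Adressbestände";

// Correctly decoded text: the whole word is one run of letters.
echo firstLetterRun($word), "\n";

// Misdecoded text: "ä" (bytes C3 A4) becomes a letter followed by a currency
// sign, so the run of letters ends right after the first byte of the umlaut.
echo firstLetterRun(misdecode($word)), "\n";
```

If this guess is right, the words are already damaged before the analyzer sees them, which would point at how the HTML files are decoded rather than at the analyzer setting.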
What am I doing wrong?
Any ideas?
Thank you very much.
Tobias
All times are GMT.