  #1 (permalink)  
Old 04-21-2008, 04:06 AM
Junior Member
 
Join Date: Apr 2008
Posts: 6
Default Zend_Search_Lucene doesn't support UTF-8?

I tried Zend_Search_Lucene with some UTF-8 data, and I can't search it.
  #2 (permalink)  
Old 04-21-2008, 03:10 PM
Junior Member
 
Join Date: Apr 2008
Posts: 2
Default Zend_Search_Lucene doesn't support UTF-8

You can take a look at this:
Zend Framework: Documentation

Sorry for my English.
  #3 (permalink)  
Old 04-21-2008, 04:10 PM
Junior Member
 
Join Date: Apr 2008
Posts: 6
Default

I tried it, but nothing happened.

You can test with characters like these:
á ạ ầ ą

ZF doesn't index the UTF-8 characters correctly, and when I search, the characters in the results are unreadable
(with the browser set to View / Character Encoding / UTF-8).

Last edited by rassen : 04-23-2008 at 11:45 AM.
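
One plausible cause of the symptom above is a tokenizer that only recognises ASCII letters. The sketch below is plain PHP, not Zend_Search_Lucene code, and rests on the assumption that the default analyzer behaves roughly like the first function. Both function names are made up for illustration. It shows how such a tokenizer chops a UTF-8 word into fragments, while a Unicode-aware pattern keeps it whole.

```php
<?php
// Hypothetical helpers: they only imitate two tokenizing strategies and are
// NOT taken from Zend_Search_Lucene.

/** Split text into runs of ASCII letters; every other byte acts as a separator. */
function asciiTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

/** Split text into runs of Unicode letters; the /u modifier makes PCRE read UTF-8. */
function unicodeTokens(string $text): array
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

$text = 'gęsią jaźń';

// The multi-byte letters are treated as separators, so the words fall apart.
print_r(asciiTokens($text));

// With a Unicode-aware pattern both words survive as single tokens.
print_r(unicodeTokens($text));
```

If the index only ever sees fragments like these, a query containing the full accented word has nothing to match against.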
  #4 (permalink)  
Old 04-22-2008, 09:51 PM
Junior Member
 
Join Date: Aug 2007
Posts: 2
Default

You could write a text analyzer that replaces non-standard characters with their equivalents. For instance, you can replace
'ą' with 'a' or, more elaborately, 'ą' with 'xxxaxxx', and reverse the replacement during search.
Tomorrow I will send you sample code. It works perfectly.
  #5 (permalink)  
Old 04-23-2008, 11:23 AM
Junior Member
 
Join Date: Aug 2007
Posts: 2
Default

Quote:
Originally Posted by lucassus
You could write a text analyzer that replaces non-standard characters with their equivalents. For instance, you can replace
'ą' with 'a' or, more elaborately, 'ą' with 'xxxaxxx', and reverse the replacement during search.
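Code:
class Lucene_Helper {

	// Polish letters and the ASCII letter each one is folded to.
	// 'ź' maps to 'x' so that it stays distinct from 'ż' => 'z'.
	protected $_find = array('ą','ż','ś','ź','ę','ć','ń','ó','ł','Ą','Ż','Ś','Ź','Ę','Ć','Ń','Ó','Ł');
	protected $_replace = array('a','z','s','x','e','c','n','o','l','A','Z','S','X','E','C','N','O','L');

	/**
	 * Wrap each Polish letter as 'xxx<letter>xxx', then transliterate
	 * whatever non-ASCII characters remain.
	 *
	 * @param 	string 	$string
	 * @return 	string
	 */
	public function simplify($string) {
		foreach ($this->_find as $key => $value) {
			$string = str_replace($value, 'xxx' . $this->_replace[$key] . 'xxx', $string);
		}

		// The //TRANSLIT result depends on the iconv implementation and locale.
		$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
		return $string;
	}

	/**
	 * Turn the 'xxx<letter>xxx' markers written by simplify() back into Polish letters.
	 *
	 * @param 	string 	$string
	 * @return 	string
	 */
	public function unsimplify($string) {
		// No iconv() call here: ASCII input is already valid UTF-8, and
		// //TRANSLIT only has a meaning on the output charset.
		foreach ($this->_replace as $key => $value) {
			$string = str_replace('xxx' . $value . 'xxx', $this->_find[$key], $string);
		}

		return $string;
	}

}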
During indexing:
Code:
$luceneHelper = new Lucene_Helper();
$doc->addField(Zend_Search_Lucene_Field::UnStored('subject', $luceneHelper->simplify($this->subject)));
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', $luceneHelper->simplify($this->body)));
During search:
Code:
$queryStr = $luceneHelper->simplify('out query with zażółć gęsią jaźń ;)');
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
In the view you could highlight matches:
Code:
$luceneHelper = new Lucene_Helper();
$post = $postDAO->find($post_id)->current();
$highlightedBody = $this->query->highlightMatches($luceneHelper->simplify($post->body));
$highlightedSubject = $this->query->highlightMatches($luceneHelper->simplify($post->subject));
I hope this helps.
  #6 (permalink)  
Old 04-23-2008, 11:33 AM
Junior Member
 
Join Date: Apr 2008
Posts: 6
Default

Thank you, I'll check your example code.
I have one more question.

With UTF-8 data, can I find a record using plain ASCII keywords?
For example:

Here's my data string: "abc ćńó xyz"

and I search with the query: "cno"

Should the result return the record above? Is that possible?
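
A minimal sketch of how that could work, assuming the same folding is applied to the text at index time and to the query at search time. The function name foldPolish and the small character map are hypothetical, and the "search" here is only a toy word comparison, not a Zend_Search_Lucene query.

```php
<?php
// Hypothetical helper: fold a handful of Polish diacritics to plain ASCII.
// The map is deliberately small; it is not a complete transliteration table.
function foldPolish(string $text): string
{
    static $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    // strtr() with an array replaces whole multi-byte keys, so UTF-8 stays intact.
    return strtr($text, $map);
}

// Fold the stored text and the query in exactly the same way.
$indexed = foldPolish('abc ćńó xyz');
$query   = foldPolish('cno');

// Toy "search": is the folded query one of the folded words?
$found = in_array($query, explode(' ', $indexed), true);

var_dump($indexed, $found);
```

Because both sides pass through the same function, the accented query "ćńó" would match the record as well. The trade-off is that 'ź' and 'ż' both become 'z', so the folded text cannot be turned back into the original.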
  #7 (permalink)  
Old 04-23-2008, 11:44 AM
Junior Member
 
Join Date: Apr 2008
Posts: 6
Default

Quote:
Originally Posted by lucassus
You could write a text analyzer that replaces non-standard characters with their equivalents. For instance, you can replace
'ą' with 'a' or, more elaborately, 'ą' with 'xxxaxxx', and reverse the replacement during search.
Tomorrow I will send you sample code. It works perfectly.
I tested it. With your solution, my data does not come back intact.
The search query can be processed, but the data returned is not UTF-8.
I have thought about this: to keep my data intact, perhaps I need to store each field in two versions:
- a plain ASCII (non-UTF-8) version
- the original UTF-8 version

One for searching and one for display in the search results.
That costs too much.

(Sorry for my English.)
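
The two-version idea can be sketched with a toy in-memory index. Everything below (the array layout, foldForSearch, addDocument, search) is hypothetical and only illustrates the principle. In Zend_Search_Lucene the folded copy would presumably go into an UnStored field and the original into an UnIndexed field, but that mapping is an assumption and was not tested in this thread.

```php
<?php
// Hypothetical folding helper with a small illustrative map (lower case only).
function foldForSearch(string $text): string
{
    return strtr($text, [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
    ]);
}

// Store every document twice: a folded copy to match against and the
// untouched UTF-8 original to show in the results.
function addDocument(array &$index, string $original): void
{
    $index[] = [
        'search'  => foldForSearch($original), // matched against, never displayed
        'display' => $original,                // displayed, never matched against
    ];
}

// Fold the query the same way, match on the folded copy, return the originals.
function search(array $index, string $query): array
{
    $needle = foldForSearch($query);
    $hits = [];
    foreach ($index as $doc) {
        if ($needle !== '' && strpos($doc['search'], $needle) !== false) {
            $hits[] = $doc['display'];
        }
    }
    return $hits;
}

$index = [];
addDocument($index, 'abc ćńó xyz');
addDocument($index, 'zażółć gęsią jaźń');

print_r(search($index, 'cno'));
```

The extra cost is one more copy of each searchable text. If the search copy is indexed without being stored, only the terms of the folded version take up space in the index, not the folded text itself.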
All times are GMT.