Hi,
I'm putting together a search engine for a moderately busy forum using Lucene search (Zend_Search_Lucene). As a first step, I wrote a small program to index the posts, which are stored in a MySQL database. There are currently some 310,000 posts to index. This is the code I'm using:
PHP Code:
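// Resume after the last indexed post ID if one was passed in; otherwise start a fresh index.
// The int casts keep raw request input out of the SQL string below.
if (isset($_REQUEST['start']))
{
$startpid = (int) $_REQUEST['start'];
$newindex = false;
}
else
{
$startpid = 0;
$newindex = true;
}
// Batch size. This block must not touch $newindex, or a request that
// passes only 'start' would recreate (and wipe) the existing index.
if (isset($_REQUEST['offset']))
{
$offset = max(1, (int) $_REQUEST['offset']);
}
else
{
$offset = 50;
}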
$PostQuery = $db->query("SELECT p.pid, p.post, p.post_date, p.topic_id, p.queued, p.author_id, t.forum_id, t.approved
FROM ibf_posts p LEFT JOIN ibf_topics t ON ( p.topic_id=t.tid )
WHERE p.pid > {$startpid}
ORDER BY p.pid ASC
LIMIT {$offset}");
//$TopicQuery = $db->query("SELECT tid, title, forum_id, start_date, last_post, approved, starter_id FROM ibf_topics");
//create the index
$index = new Zend_Search_Lucene($indexdir, $newindex);
/*
Field type   stored?  indexed?  tokenized?  binary?
Keyword      yes      yes       no          no
UnIndexed    yes      no        no          no
Binary       yes      no        no          yes
Text         yes      yes       yes         no
UnStored     no       yes       yes         no
*/
$count = 0;
$lastpid = $startpid; // stays defined even if the query returns no rows
while ($post = $PostQuery->fetch())
{
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('pid', sanitize($post['pid']) ));
$doc->addField(Zend_Search_Lucene_Field::Keyword('forum_id', sanitize($post['forum_id']) ));
$doc->addField(Zend_Search_Lucene_Field::Keyword('topic_id', sanitize($post['topic_id']) ));
$doc->addField(Zend_Search_Lucene_Field::Keyword('queued', sanitize($post['queued']) ));
$doc->addField(Zend_Search_Lucene_Field::Keyword('approved', sanitize($post['approved']) ));
$doc->addField(Zend_Search_Lucene_Field::Keyword('post_date', sanitize($post['post_date']) ));
$doc->addField(Zend_Search_Lucene_Field::Keyword('author_id', sanitize($post['author_id']) ));
$doc->addField(Zend_Search_Lucene_Field::Text('post', sanitize($post['post']) ));
$lastpid = $post['pid'];
$index->addDocument($doc);
$count++;
}
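if ($count == $offset)
{
// A full batch was read, so there may be more posts: continue after the last post ID.
//$url = 'http://'.$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF']."?start={$count}&offset={$offset}";
$url = $_SERVER['PHP_SELF']."?start={$lastpid}&offset={$offset}";
//header("Location: ".$url);
}
else
{
// A short batch means the end of the table was reached, so there is nothing to redirect to.
$url = '';
$valid = false;
}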
echo "Processed {$count} posts <br /> Last post ID was {$lastpid} <br />";
$index->commit();
?>
<html>
<head>
<title>Indexing Forum</title>
<?php if (!empty($url)) { ?><meta http-equiv="refresh" content="1; URL=<?php echo htmlspecialchars($url) ?>"><?php } ?>
<meta name="keywords" content="automatic redirection">
</head>
<body>
Last post ID processed: <?php echo $lastpid ?>
</body>
</html>
The trouble is that, even though each post is very small, an attempt is made to optimize the index automatically after each document is added. By the time I reach about 30,000 posts, the optimization step takes longer than the 30 seconds a PHP process is allowed to run.
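For reference, Zend_Search_Lucene exposes settings that control how often this automatic merging happens, notably setMaxBufferedDocs() and setMergeFactor(). Below is a minimal sketch of raising them before a bulk load. The helper name tuneForBulkIndexing() is hypothetical and the default values are illustrative assumptions rather than measured recommendations; a larger buffer also means more memory used per request.

```php
<?php
// Sketch only: relax Zend_Search_Lucene's auto-optimization for a bulk load.
// tuneForBulkIndexing() is a hypothetical helper; the default values are
// illustrative assumptions, not tested recommendations.
function tuneForBulkIndexing($index, $bufferedDocs = 500, $mergeFactor = 50)
{
    // Buffer more documents in memory before a new segment is written to disk.
    $index->setMaxBufferedDocs($bufferedDocs);
    // Let more segments accumulate before they are merged into a larger one.
    $index->setMergeFactor($mergeFactor);
    return $index;
}

// Intended use with the script above (requires Zend Framework, so left commented):
// $index = new Zend_Search_Lucene($indexdir, $newindex);
// tuneForBulkIndexing($index);
// ... addDocument() loop ...
// $index->commit();
// $index->optimize(); // once, after the final batch rather than during the load
```

Whether this brings each batch under the time limit would still need to be measured against the real data.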
Is it normal for Lucene search to take this long? Indexing those 30,000 posts takes around 10 minutes, which seems extraordinarily slow (in comparison, sphinxsearch can index all 310,000 posts in just over 2 minutes).
Has anyone else had this problem, and how did you get around it?
Thanks,
Martin