01-20-2008, 02:53 PM
Lucene search scalability

Hi,
I'm putting together a search engine for a moderately busy forum using Lucene search (Zend_Search_Lucene). First, I wrote a small program to index the posts, which are stored in a MySQL database. There are currently some 310,000 posts to index. I'm using the following code:

PHP Code:

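<?php
// Resume point: only posts with a pid greater than this value are indexed.
// Cast to int so the value is safe to interpolate into the SQL query.
if (isset($_REQUEST['start']))
{
    $startpid = (int) $_REQUEST['start'];
    $newindex = false;   // resuming: open the existing index
}
else
{
    $startpid = 0;
    $newindex = true;    // no resume point: create a fresh index
}
// Batch size: number of posts indexed per request.
// It must not influence $newindex, or a resumed run without 'offset'
// would recreate (and wipe) the index.
if (isset($_REQUEST['offset']))
{
    $offset = max(1, (int) $_REQUEST['offset']);
}
else
{
    $offset = 50;
}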


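    // ORDER BY keeps the batches deterministic, so the last row fetched
    // always carries the highest pid of the batch.
    $PostQuery = $db->query("SELECT p.pid, p.post, p.post_date, p.topic_id, p.queued, p.author_id, t.forum_id, t.approved
                            FROM ibf_posts p LEFT JOIN ibf_topics t ON ( p.topic_id=t.tid )
                            WHERE p.pid > {$startpid}
                            ORDER BY p.pid
                            LIMIT {$offset}");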

    //$TopicQuery = $db->query("SELECT tid, title, forum_id, start_date, last_post, approved, starter_id FROM ibf_topics");

    // create the index on a fresh run, otherwise open the existing one
    $index = new Zend_Search_Lucene($indexdir, $newindex);

    /*
                        value stored?      indexed?      tokenized?       binary?
        Keyword             yes             yes             no              no
        UnIndexed           yes             no              no              no
        Binary              yes             no              no              yes
        Text                yes             yes             yes             no
        UnStored            no              yes             yes             no
    */
    $count = 0;
    $lastpid = $startpid;   // stays at the resume point if the query returns no rows
    while ($post = $PostQuery->fetch())
    {
        $doc = new Zend_Search_Lucene_Document();

        $doc->addField(Zend_Search_Lucene_Field::Keyword('pid', sanitize($post['pid']) ));
        $doc->addField(Zend_Search_Lucene_Field::Keyword('forum_id', sanitize($post['forum_id']) ));
        $doc->addField(Zend_Search_Lucene_Field::Keyword('topic_id', sanitize($post['topic_id']) ));
        $doc->addField(Zend_Search_Lucene_Field::Keyword('queued', sanitize($post['queued']) ));
        $doc->addField(Zend_Search_Lucene_Field::Keyword('approved', sanitize($post['approved']) ));
        $doc->addField(Zend_Search_Lucene_Field::Keyword('post_date', sanitize($post['post_date']) ));
        $doc->addField(Zend_Search_Lucene_Field::Keyword('author_id', sanitize($post['author_id']) ));
        $doc->addField(Zend_Search_Lucene_Field::Text('post', sanitize($post['post']) ));

        $lastpid = $post['pid'];

        $index->addDocument($doc);
        $count++;
    }
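    if ($count == $offset)
    {
        // a full batch means there may be more posts: continue after the last pid
        //$url = 'http://'.$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF']."?start={$count}&offset={$offset}";
        $url = $_SERVER['PHP_SELF']."?start={$lastpid}&offset={$offset}";
        //header("Location: ".$url);
    }
    else
    {
        // a short batch means the last post has been reached: nothing left to redirect to
        $url = null;
    }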

    echo "Processed {$count} posts <br /> Last post ID was {$lastpid} <br />";
    $index->commit();

?>
<html>
<head>
<title>Indexing Forum</title>
<?php if (!empty($url)) { ?><meta http-equiv="refresh" content="1; URL=<?php echo htmlspecialchars($url) ?>"><?php } ?>
<meta name="keywords" content="automatic redirection">
</head>
<body>
Last indexed post ID: <?php echo $lastpid ?>
</body>
</html>
The trouble is that, even though each post is very small, an attempt is made to optimize the index automatically after each document is added. By the time I reach about 30,000 posts, the optimization step takes longer than the 30 seconds a PHP process is allowed to run.
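
For context, here is a minimal sketch of the Zend_Search_Lucene settings that govern this automatic merging. It assumes Zend Framework 1 is on the include path; the index path and both numeric values are hypothetical placeholders, not measured recommendations. The idea is to buffer more documents and merge segments less often during a batch, leaving the full optimize() pass to a separate run that is not bound by the 30-second web limit.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location; substitute the real $indexdir.
$indexdir = '/path/to/index';

$index = Zend_Search_Lucene::open($indexdir);

// Keep more documents in memory before a new segment is written to disk.
// 500 is an illustrative value only.
$index->setMaxBufferedDocs(500);

// Let more segments accumulate before they are merged automatically.
// 50 is an illustrative value only.
$index->setMergeFactor(50);

// ... the addDocument() calls for one batch go here ...

$index->commit();

// A full optimization merges all segments into one and can take a long time,
// so it is left out of the per-batch request here. It could be run once at
// the end, for example from the command line:
// $index->optimize();
```

Higher values trade memory for fewer merges, so they would need to be checked against the memory limit of the PHP process.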

Is it normal for Lucene search to take so long? Indexing those 30,000 posts takes around 10 minutes, which seems extraordinarily slow (in comparison, Sphinx can index all 310,000 posts in just over 2 minutes).

Has anyone else had this problem, and if so, how did you get around it?

Thanks,
Martin

Last edited by coop : 01-20-2008 at 02:56 PM.