Zend Framework Forum

  #1 (permalink)  
Old 01-12-2010, 03:52 PM
Junior Member
 
Join Date: Sep 2009
Posts: 7
Zend_Search_Lucene and memory issues

Hi,

I have an index with 3 100 000 documents that is expected to grow to 12 000 000 documents in the short term and eventually to 20 000 000 or more in the mid term. I chose Lucene because it is built into Zend Framework and all my infrastructure is built on Zend Framework.

Building the index takes a long time, but this is not a problem. My problem comes when I try to search that index. The Apache process running the search eats up to 1 GB of RAM (the server is a dedicated Core 2 Duo with 4 GB of RAM running CentOS 5.4) and then dies. Sometimes the search returns results, sometimes it doesn't (process killed, blank page).

I've set PHP's memory limit to 1 GB, yet this is not enough to optimize my index. My CLI index optimization script runs out of memory after 6 minutes of optimizing.

I knew I would have to scale eventually, but I am wondering whether anyone else has run into similar issues, or has a similar data set without them. Should I upgrade the server, or should I try to find out why Zend_Search_Lucene eats up so much memory?

Thank you very much, any input is greatly appreciated.
  #2 (permalink)  
Old 01-13-2010, 07:29 AM
Member
 
Join Date: Mar 2009
Posts: 67

I think you can explore why it eats up so much memory before upgrading the server. How big is each document? And have you used Unstored fields?
  #3 (permalink)  
Old 01-13-2010, 05:07 PM
Junior Member
 
Join Date: Sep 2009
Posts: 7

Thanks for your reply, tjorriemorrie. Each document is about 1 to 2 KB. My documents are products, and they are indexed this way:

unique id - keyword
title - unstored
description - unstored
tags - unstored
keywords - unstored
price - unindexed

I search for keywords in the title, description, keywords and tags fields, then I retrieve the prices to apply a basic sort (by price, ascending or descending). However, my investigation shows that the script dies before it even gets to sorting the results...
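
The sort-then-paginate step described above can be sketched in plain PHP. This is only an illustration under assumptions: the hit rows are made up, and sortAndSlice() is a hypothetical helper, not part of Zend Framework. It sorts the prices once and slices out the requested page.

```php
<?php
// Hypothetical hit rows; in the real search they would come from the index.
$hits = array(
    array('productID' => 'A1', 'price' => '19.99'),
    array('productID' => 'B2', 'price' => '5.00'),
    array('productID' => 'C3', 'price' => '12.50'),
    array('productID' => 'D4', 'price' => '7.25'),
);

/**
 * Sort hits by price and return one page of them.
 * sortAndSlice() is an illustrative helper, not a Zend Framework API.
 */
function sortAndSlice(array $hits, $direction, $page, $rpp)
{
    // Map hit ID => numeric price so the hits themselves are not moved around.
    $prices = array();
    foreach ($hits as $id => $hit) {
        $prices[$id] = (float) $hit['price'];
    }

    if ($direction === 'desc') {
        arsort($prices, SORT_NUMERIC);
    } else {
        asort($prices, SORT_NUMERIC);
    }

    // Keep only the IDs that belong to the requested page.
    $pageIds = array_slice(array_keys($prices), ($page - 1) * $rpp, $rpp);

    $result = array();
    foreach ($pageIds as $id) {
        $result[] = $hits[$id];
    }
    return $result;
}

$firstPageAsc   = sortAndSlice($hits, 'asc', 1, 2);
$secondPageDesc = sortAndSlice($hits, 'desc', 2, 2);
```

Only the products on the requested page then need to be loaded from the database.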

Optimizing the index with optimize() helped but was not enough.
  #4 (permalink)  
Old 01-14-2010, 06:53 AM
Member
 
Join Date: Mar 2009
Posts: 67

Well you can post the code of the search method - maybe we can see something.
  #5 (permalink)  
Old 01-23-2010, 07:27 PM
Junior Member
 
Join Date: Sep 2009
Posts: 7

Sorry for taking so long to answer. I rewrote many parts of my search method to try to find where the issue comes from.

Since my last post, I have tried removing searchable fields from my index, moving the index to a RAM disk, and halving my data set. The bottleneck appears to be the find() method. With a 300 MB index, my search method eats up 1 GB of RAM and 99% of my CPU for 30 seconds.
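
One way to confirm that a single call is the bottleneck is to wrap it and record time and memory around it. The sketch below is an assumption about how that could be done, not the measurement actually used here: profileCall() is a hypothetical helper, and the workload is a stand-in for the real $index->find($query) call.

```php
<?php
/**
 * Run a callback and report elapsed time and memory growth.
 * profileCall() is a hypothetical helper for illustration only.
 */
function profileCall($callback)
{
    $startTime   = microtime(true);
    $startMemory = memory_get_usage();

    $result = call_user_func($callback);

    return array(
        'result'        => $result,
        'seconds'       => microtime(true) - $startTime,
        // Memory still held after the call, including the returned value.
        'memory_growth' => memory_get_usage() - $startMemory,
        // Highest usage seen so far in this process, not just in this call.
        'peak_bytes'    => memory_get_peak_usage(),
    );
}

// Stand-in workload; the real method would wrap $index->find($query) here.
$stats = profileCall(function () {
    return range(1, 100000);
});

printf(
    "%.3f s, %d bytes retained, peak %d bytes\n",
    $stats['seconds'],
    $stats['memory_growth'],
    $stats['peak_bytes']
);
```

Logging these numbers for each stage (query building, find(), facet counting, sorting) shows which one actually drives the process towards the memory limit.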

Here is my code:
Code:
        // Check that the consumer we are talking to is a valid consumer.
        $this->_checkConsumer();


        // Preparing the caching engine.
        $frontendOptions = array(
            'caching'                 => true,
            'cache_id_prefix'         => 'search_',
            'lifetime'                => 3*(24*60*60),  // 3 days
            'automatic_serialization' => true,
            'ignore_user_abort'       => true,
        );

        $backendOptions = array(
            'cache_dir'              => '/tmp',
            'hashed_directory_level' => 1,
        );

        $cache = Zend_Cache::factory('Core',
                                     'File',
                                     $frontendOptions,
                                     $backendOptions);

        
        // Check RPP count and displayed page.
        $pagination = $this->_getPagination(100, 1);
        
        // Check pagination limits
        if ($pagination['rpp'] > 100)
            $pagination['rpp'] = 100;


        $query = '';
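        // Split on any run of whitespace so repeated spaces do not produce empty "" terms.
        $terms = preg_split('/\s+/', urldecode($keywords), -1, PREG_SPLIT_NO_EMPTY);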
        for ($i=0; $i<count($terms); $i++)
        {
            $query .= "\"{$terms[$i]}\"";
            if ($i != count($terms) - 1)
                $query .= " AND ";
        }


        $front  = Zend_Controller_Front::getInstance();
        $params = $front->getRequest()->getParams();

        // Filter by country
        if (!empty($params['filter_by_country']))
        {
            $countries = explode(',', $params['filter_by_country']);

            $subquery = " AND (";
            $subquery .= "countries:{$countries[0]}";
            for ($i=1; $i<count($countries); $i++)
            {
                $subquery .= " OR countries:{$countries[$i]}";
            }
            $subquery .= ")";
            $query .= $subquery;
        }

        // Filter by category
        if (!empty($params['filter_by_category']))
        {
            $categories = explode(',', $params['filter_by_category']);

            $subquery = " AND (";
            $subquery .= "categories:{$categories[0]}";
            for ($i=1; $i<count($categories); $i++)
            {
                $subquery .= " AND categories:{$categories[$i]}";
            }
            $subquery .= ")";
            $query .= $subquery;
        }

        // Filter by subcategory
        if (!empty($params['filter_by_subcategory']))
        {
            $subcategories = explode(',', $params['filter_by_subcategory']);

            $subquery = " AND (";
            $subquery .= "subcategories:{$subcategories[0]}";
            for ($i=1; $i<count($subcategories); $i++)
            {
                $subquery .= " AND subcategories:{$subcategories[$i]}";
            }
            $subquery .= ")";
            $query .= $subquery;
        }


        // Hits are cached by query.
        $cachedItemID = md5($query);
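        // load() returns false on a cache miss; compare strictly so that a cached
        // empty result set is not mistaken for a miss and searched again.
        if (($hits = $cache->load($cachedItemID)) === false)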
        {
            // Cache miss, search and cache

            // Open the index
            $index = Zend_Search_Lucene::open('../data/products_index');

            // Limit the returned result set to prevent request timeouts.
            Zend_Search_Lucene::setResultSetLimit(500);

            $hits = $index->find($query);

            $newHits = array();
            foreach ($hits as $hitID => $hit)
            {
                $newHits[$hitID]['productID'] = $hit->productID;
                $newHits[$hitID]['price']     = $hit->price;
            }

            unset($hits);
            $hits = $newHits;

            // Cache hits returned
            $cache->save($hits, $cachedItemID);
        }


        // Determine categories/subcategories and their respective count
        $fcat = array();
        $fsub = array();
        foreach ($hits as $hit)
        {
            $product = new MyProject_Model_Product();
            $product->findByID($hit['productID']);
            if ($product->exists())
            {
                $vendors = $product->getVendors();
                foreach ($vendors as $vendor)
                {
                    $categories = $vendor->getCategories();
                    foreach ($categories as $category)
                    {
                        $fcat[$category[0]]['label_english'] = $category[1];
                        $fcat[$category[0]]['label_french'] = $category[2];
                        if (isset($fcat[$category[0]]['count']))
                            $fcat[$category[0]]['count']++;
                        else
                            $fcat[$category[0]]['count'] = 1;

                        $fsub[$category[3]]['label_english'] = $category[4];
                        $fsub[$category[3]]['label_french'] = $category[5];
                        if (isset($fsub[$category[3]]['count']))
                            $fsub[$category[3]]['count']++;
                        else
                            $fsub[$category[3]]['count'] = 1;
                        $fsub[$category[3]]['parent'] = $category[0];
                    }
                }
            }
        }


        // Append search results to answer.
        $answer = new MyProject_Rest_Answer_ProductsSearchResult(
            count($hits),
            $pagination['rpp'],
            $pagination['page']
        );
        $beginning = ($pagination['rpp'] * $pagination['page']) - $pagination['rpp'];
        $ending    = $pagination['rpp'] * $pagination['page'];

        $answer->appendCategories($fcat, $fsub);

        if (!empty($params['sort']))
        {
            $sorted = array();
            if ($params['sort'] == 'price-asc')
            {
                // Sort by prices
                foreach ($hits as $hitID =>$hit)
                {
                    $sorted[$hitID] = $hit['price'];
                }
                asort($sorted, SORT_NUMERIC);
            }
            
            if ($params['sort'] == 'price-desc')
            {
                // Sort by prices
                foreach ($hits as $hitID =>$hit)
                {
                    $sorted[$hitID] = $hit['price'];
                }
                arsort($sorted, SORT_NUMERIC);
            }

            // Set the array pointer to the beginning of the array
            reset($sorted);
            // Skip the unneeded rows
            for ($i=0; $i<$beginning; $i++)
            {
                next($sorted);
            }
            // Fetch the needed rows
            for ($i=$beginning; $i<$ending; $i++)
            {
                if (empty($hits[key($sorted)]))
                    break;

                $product = new MyProject_Model_Product();
                $product->findByID($hits[key($sorted)]['productID']);

                $answer->append($product);

                next($sorted);
            }
        }
        else
        {
            for ($i=$beginning; $i<$ending; $i++)
            {
                if (empty($hits[$i]))
                    break;

                $product = new MyProject_Model_Product();
                $product->findByID($hits[$i]['productID']);

                $answer->append($product);
            }
        }
        

        return $answer->toXML();
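
The query string in the code above is concatenated directly from request parameters, so a stray double quote or an empty filter value can produce a query that fails to parse. Below is a minimal sketch of a more defensive builder. The function names (quoteTerm, buildFilter, buildQuery) are hypothetical, and stripping quotes and backslashes is a simplification chosen for the example, not a full implementation of Lucene escaping.

```php
<?php
/** Quote a value as a phrase, dropping characters that would end the phrase early. */
function quoteTerm($value)
{
    return '"' . str_replace(array('"', '\\'), '', $value) . '"';
}

/** Build '(field:"a" OP field:"b")' from a comma-separated list; '' if there are no values. */
function buildFilter($field, $csv, $operator)
{
    $clauses = array();
    foreach (explode(',', $csv) as $value) {
        $value = trim($value);
        if ($value !== '') {
            $clauses[] = $field . ':' . quoteTerm($value);
        }
    }
    return $clauses ? '(' . implode(" $operator ", $clauses) . ')' : '';
}

/** Join all keywords and all non-empty filters with AND. */
function buildQuery($keywords, array $filters)
{
    $parts = array();
    foreach (preg_split('/\s+/', $keywords, -1, PREG_SPLIT_NO_EMPTY) as $term) {
        $quoted = quoteTerm($term);
        if ($quoted !== '""') {
            $parts[] = $quoted;
        }
    }
    foreach ($filters as $filter) {
        if ($filter !== '') {
            $parts[] = $filter;
        }
    }
    return implode(' AND ', $parts);
}

$query = buildQuery('red  "shoes', array(
    buildFilter('countries', 'CA,US', 'OR'),
    buildFilter('categories', '', 'AND'),
));
echo $query, "\n";
// prints: "red" AND "shoes" AND (countries:"CA" OR countries:"US")
```

Because empty filters are skipped, the result never starts with a dangling " AND (" when no keyword survives cleaning.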
  #6 (permalink)  
Old 01-28-2010, 10:15 AM
SirAdrian's Avatar
Member
 
Join Date: Apr 2008
Posts: 83

Hey,

I went down this road just over a year ago and came to a shocking conclusion: Zend Lucene is very slow. If you like Lucene, you can try to get the Java version working, which is much better.

We ended up changing to Sphinx and it handled our millions of documents in under a hundredth of a second with no memory problems. It's amazingly fast and has been very reliable so far. MySQL.com even uses it, as well as craigslist.

Despite being much more complicated to install and set up, actually getting up and running was faster with Sphinx than with Lucene. It's just more intuitive.

Hope this helps!
  #7 (permalink)  
Old 01-28-2010, 11:41 AM
Cristian's Avatar
Administrator
 
Join Date: Feb 2007
Location: Sibiu, Romania
Posts: 116

It seems other frameworks already have more Sphinx support.

For example, CodeIgniter has bundled classes for working with Sphinx.
  #8 (permalink)  
Old 01-28-2010, 11:57 AM
SirAdrian's Avatar
Member
 
Join Date: Apr 2008
Posts: 83

If you're able to do it through SphinxSE, it takes almost no application code. It took us very little time to get it up and running. SphinxQL looks interesting, too, and should not require much application code either.
  #9 (permalink)  
Old 02-01-2010, 03:26 AM
Junior Member
 
Join Date: Sep 2009
Posts: 7

Thanks for the replies!

I am currently trying Solr, as it will allow me to scale search performance as my infrastructure grows (with things like index replication, which Sphinx does not support yet).

I'll post back performance info here in a day or two.
  #10 (permalink)  
Old 02-05-2010, 05:25 PM
Junior Member
 
Join Date: Sep 2009
Posts: 7

Solr definitely rocks. With 2 000 000 documents I search in ~240ms (average).
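
For anyone wanting to try the same thing, here is a rough sketch of how a paginated, price-sorted search could be expressed as a request to Solr's select handler. Everything specific in it is an assumption for illustration: the base URL, the productID and price field names, and the buildSolrSelectUrl() helper. It only builds the URL, and sorting on price assumes that field is indexed in the Solr schema.

```php
<?php
/**
 * Build a Solr select URL for one page of results.
 * The base URL and field names are assumptions; adjust them to the real schema.
 */
function buildSolrSelectUrl($baseUrl, $keywords, $page, $rpp, $sort = null)
{
    $params = array(
        'q'     => $keywords,
        'start' => ($page - 1) * $rpp, // offset of the first row to return
        'rows'  => $rpp,               // page size
        'fl'    => 'productID,price',  // only fetch the fields that are needed
        'wt'    => 'json',
    );

    // Let Solr do the sorting instead of sorting every hit in PHP.
    if ($sort === 'price-asc') {
        $params['sort'] = 'price asc';
    } elseif ($sort === 'price-desc') {
        $params['sort'] = 'price desc';
    }

    return $baseUrl . '/select?' . http_build_query($params, '', '&');
}

$url = buildSolrSelectUrl('http://localhost:8983/solr', 'red shoes', 2, 100, 'price-asc');
echo $url, "\n";
```

With start and rows, only the requested page travels back to PHP, which avoids holding the whole result set in memory.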

I haven't tried Sphinx because Solr already supports replication and clustering, but Zend_Search_Lucene is indeed too slow in some cases.

Thanks for your help.
All times are GMT. The time now is 08:37 AM.

