
Thread: Zend_Search_Lucene and memory issues

  1. #1
    stukov is offline Junior Member
    Join Date
    Sep 2009
    Posts
    7

    Default Zend_Search_Lucene and memory issues

    Hi,

    I have an index with 3 100 000 documents that is expected to grow to 12 000 000 documents in the short term and eventually to 20 000 000 or more in the mid-term. I chose Lucene because it is built into Zend Framework and all my infrastructure is built upon Zend Framework.

    Building the index takes a long time, but this is not a problem. My problem comes when I try to search for anything in that index. The Apache process handling the search eats up to 1 GB of RAM (the server is a dedicated Core 2 Duo with 4 GB of RAM running CentOS 5.4) and then dies. Sometimes the search returns results, sometimes it doesn't (process killed, blank page).

    I've set PHP's memory limit to 1 GB, yet this is not enough to optimize my index. My CLI index optimization script runs out of memory after 6 minutes of optimizing.

    I knew I would have to scale, but I was wondering if someone else has run into similar issues, or has a similar data set with no such issue. Should I upgrade the server or should I try to find out why Zend_Search_Lucene eats up so much memory?

    Thank you very much, any input is greatly appreciated.
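
    One way to find out where the memory goes before deciding on hardware is to measure time and peak memory around the suspect call. This is a minimal sketch in plain PHP (5.3+ for the closure); profileCall() is a hypothetical helper, not part of Zend Framework, and the workload is a stand-in for the real search call.

```php
<?php
// Hypothetical helper (not part of Zend Framework): runs a callable and
// reports wall-clock time and memory figures around it.
function profileCall($callback)
{
    $startTime   = microtime(true);
    $startMemory = memory_get_usage(true);

    $result = call_user_func($callback);

    return array(
        'result'       => $result,
        'seconds'      => microtime(true) - $startTime,
        'memory_delta' => memory_get_usage(true) - $startMemory,
        'memory_peak'  => memory_get_peak_usage(true),
    );
}

// Stand-in workload; in the real script this would wrap the index search.
$report = profileCall(function () {
    return range(1, 100000);
});

printf(
    "%.3f s, peak %.1f MB\n",
    $report['seconds'],
    $report['memory_peak'] / 1048576
);
```

    Wrapping each stage separately (opening the index, searching, reading stored fields) should show which one accounts for the growth.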

  2. #2
    tjorriemorrie is offline Member
    Join Date
    Mar 2009
    Posts
    68

    Default

    I think you can explore why it eats up so much memory before upgrading the server. How big is each document? And have you used Unstored fields?

  3. #3
    stukov is offline Junior Member
    Join Date
    Sep 2009
    Posts
    7

    Default

    Thanks for your reply, tjorriemorrie. Each document is about 1 to 2 KB. My documents are products; they are indexed this way:

    unique id - keyword
    title - unstored
    description - unstored
    tags - unstored
    keywords - unstored
    price - unindexed

    I search for keywords in the title, description, keywords and tags fields, then I retrieve the prices to apply a basic sort (by price, ascending or descending), but my investigation shows that the script dies before it even gets to sorting the results...

    Optimizing the index with optimize() helped but was not enough.
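
    The sort-then-page step described above can be sketched in plain PHP as follows. sortAndPage() is a hypothetical helper and the sample hits are made up for illustration; the sketch assumes each hit has already been reduced to an array with 'productID' and 'price' keys.

```php
<?php
// Hypothetical helper: sorts hits by price and returns one page of them.
// $hits is a list of arrays with 'productID' and 'price' keys.
function sortAndPage(array $hits, $direction, $page, $perPage)
{
    usort($hits, function ($a, $b) use ($direction) {
        $diff = (float) $a['price'] - (float) $b['price'];
        if ($diff == 0) {
            return 0;
        }
        $order = ($diff < 0) ? -1 : 1;
        // Flip the comparison for descending order.
        return ($direction === 'desc') ? -$order : $order;
    });

    // Pages are 1-based; array_slice() stops at the end of the array.
    $offset = max(0, ($page - 1) * $perPage);
    return array_slice($hits, $offset, $perPage);
}

// Made-up sample data.
$hits = array(
    array('productID' => 'a', 'price' => '19.99'),
    array('productID' => 'b', 'price' => '5.00'),
    array('productID' => 'c', 'price' => '12.50'),
);

$page = sortAndPage($hits, 'asc', 1, 2);
// $page holds products 'b' and 'c', in that order
```

    Sorting the small reduced arrays rather than the hit objects keeps this step cheap, so it is unlikely to be where the memory goes.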

  4. #4
    tjorriemorrie is offline Member
    Join Date
    Mar 2009
    Posts
    68

    Default

    Well you can post the code of the search method - maybe we can see something.

  5. #5
    stukov is offline Junior Member
    Join Date
    Sep 2009
    Posts
    7

    Default

    Sorry for the time it took me to answer. I rewrote many parts of my search method to try to find where the issue comes from.

    Since my last post, I tried removing searchable fields from my index, putting the index on a RAM disk and halving my dataset. The bottleneck appears to be the find() method. With a 300 MB index, my search method eats up 1 GB of RAM and 99% of my CPU for 30 seconds.

    Here is my code:
    Code:
    // Check that the consumer we are talking to is a valid consumer.
            $this->_checkConsumer();
    
    
            // Preparing the caching engine.
            $frontendOptions = array(
                'caching'                 => true,
                'cache_id_prefix'         => 'search_',
                'lifetime'                => 3*(24*60*60),  // 3 days
                'automatic_serialization' => true,
                'ignore_user_abort'       => true,
            );
    
            $backendOptions = array(
                'cache_dir'              => '/tmp',
                'hashed_directory_level' => 1,
            );
    
            $cache = Zend_Cache::factory('Core',
                                         'File',
                                         $frontendOptions,
                                         $backendOptions);
    
            
            // Check RPP count and displayed page.
            $pagination = $this->_getPagination(100, 1);
            
            // Check pagination limits
            if ($pagination['rpp'] > 100)
                $pagination['rpp'] = 100;
    
    
            $query = '';
            $terms = explode(' ', urldecode($keywords));
            for ($i=0; $i<count($terms); $i++)
            {
                $query .= "\"{$terms[$i]}\"";
                if ($i != count($terms) - 1)
                    $query .= " AND ";
            }
    
    
            $front  = Zend_Controller_Front::getInstance();
            $params = $front->getRequest()->getParams();
    
            // Filter by country
            if (!empty($params['filter_by_country']))
            {
                $countries = explode(',', $params['filter_by_country']);
    
                $subquery = " AND (";
                $subquery .= "countries:{$countries[0]}";
                for ($i=1; $i<count($countries); $i++)
                {
                    $subquery .= " OR countries:{$countries[$i]}";
                }
                $subquery .= ")";
                $query .= $subquery;
            }
    
            // Filter by category
            if (!empty($params['filter_by_category']))
            {
                $categories = explode(',', $params['filter_by_category']);
    
                $subquery = " AND (";
                $subquery .= "categories:{$categories[0]}";
                for ($i=1; $i<count($categories); $i++)
                {
                    $subquery .= " AND categories:{$categories[$i]}";
                }
                $subquery .= ")";
                $query .= $subquery;
            }
    
            // Filter by subcategory
            if (!empty($params['filter_by_subcategory']))
            {
                $subcategories = explode(',', $params['filter_by_subcategory']);
    
                $subquery = " AND (";
                $subquery .= "subcategories:{$subcategories[0]}";
                for ($i=1; $i<count($subcategories); $i++)
                {
                    $subquery .= " AND subcategories:{$subcategories[$i]}";
                }
                $subquery .= ")";
                $query .= $subquery;
            }
    
    
            // Hits are cached by query.
            $cachedItemID = md5($query);
            if(!$hits = $cache->load($cachedItemID))
            {
                // Cache miss, search and cache
    
                // Open the index
                $index = Zend_Search_Lucene::open('../data/products_index');
    
                // Limit the returned result set to prevent request timeouts.
                Zend_Search_Lucene::setResultSetLimit(500);
    
                $hits = $index->find($query);
    
                $newHits = array();
                foreach ($hits as $hitID => $hit)
                {
                    $newHits[$hitID]['productID'] = $hit->productID;
                    $newHits[$hitID]['price']     = $hit->price;
                }
    
                unset($hits);
                $hits = $newHits;
    
                // Cache hits returned
                $cache->save($hits, $cachedItemID);
            }
    
    
            // Determine categories/subcategories and their respective count
            $fcat = array();
            $fsub = array();
            foreach ($hits as $hit)
            {
                $product = new MyProject_Model_Product();
                $product->findByID($hit['productID']);
                if ($product->exists())
                {
                    $vendors = $product->getVendors();
                    foreach ($vendors as $vendor)
                    {
                        $categories = $vendor->getCategories();
                        foreach ($categories as $category)
                        {
                            $fcat[$category[0]]['label_english'] = $category[1];
                            $fcat[$category[0]]['label_french'] = $category[2];
                            if (isset($fcat[$category[0]]['count']))
                                $fcat[$category[0]]['count']++;
                            else
                                $fcat[$category[0]]['count'] = 1;
    
                            $fsub[$category[3]]['label_english'] = $category[4];
                            $fsub[$category[3]]['label_french'] = $category[5];
                            if (isset($fsub[$category[3]]['count']))
                                $fsub[$category[3]]['count']++;
                            else
                                $fsub[$category[3]]['count'] = 1;
                            $fsub[$category[3]]['parent'] = $category[0];
                        }
                    }
                }
            }
    
    
            // Append search results to answer.
            $answer = new MyProject_Rest_Answer_ProductsSearchResult(
                count($hits),
                $pagination['rpp'],
                $pagination['page']
            );
            $beginning = ($pagination['rpp'] * $pagination['page']) - $pagination['rpp'];
            $ending    = $pagination['rpp'] * $pagination['page'];
    
            $answer->appendCategories($fcat, $fsub);
    
            if (!empty($params['sort']))
            {
                $sorted = array();
                if ($params['sort'] == 'price-asc')
                {
                    // Sort by prices
                    foreach ($hits as $hitID =>$hit)
                    {
                        $sorted[$hitID] = $hit['price'];
                    }
                    asort($sorted, SORT_NUMERIC);
                }
                
                if ($params['sort'] == 'price-desc')
                {
                    // Sort by prices
                    foreach ($hits as $hitID =>$hit)
                    {
                        $sorted[$hitID] = $hit['price'];
                    }
                    arsort($sorted, SORT_NUMERIC);
                }
    
                // Set the array pointer to the beginning of the array
                reset($sorted);
                // Skip the unneeded rows
                for ($i=0; $i<$beginning; $i++)
                {
                    next($sorted);
                }
                // Fetch the needed rows
                for ($i=$beginning; $i<$ending; $i++)
                {
                    if (empty($hits[key($sorted)]))
                        break;
    
                    $product = new MyProject_Model_Product();
                    $product->findByID($hits[key($sorted)]['productID']);
    
                    $answer->append($product);
    
                    next($sorted);
                }
            }
            else
            {
                for ($i=$beginning; $i<$ending; $i++)
                {
                    if (empty($hits[$i]))
                        break;
    
                    $product = new MyProject_Model_Product();
                    $product->findByID($hits[$i]['productID']);
    
                    $answer->append($product);
                }
            }
            
    
            return $answer->toXML();
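
    The query-building part of the code above can be sketched more compactly. This is a plain PHP sketch under stated assumptions, not the poster's method: quoteTerm(), buildFilter() and buildQuery() are hypothetical helpers, and the sample filter values are made up. Unlike explode(' ', ...), it drops the empty terms that repeated spaces produce, and it strips quotes and backslashes from user input instead of passing them to the query parser.

```php
<?php
// Hypothetical helper: wraps one term in double quotes after removing
// characters that would end or alter the quoted phrase.
function quoteTerm($term)
{
    $clean = str_replace(array('"', '\\'), '', $term);
    return ($clean === '') ? '' : '"' . $clean . '"';
}

// Hypothetical helper: builds a "(field:a OR field:b)" style group.
function buildFilter($field, array $values, $operator)
{
    $parts = array();
    foreach ($values as $value) {
        // Keep only letters, digits and underscores in filter values.
        $value = preg_replace('/[^A-Za-z0-9_]/', '', $value);
        if ($value !== '') {
            $parts[] = $field . ':' . $value;
        }
    }
    return $parts ? '(' . implode(" $operator ", $parts) . ')' : '';
}

function buildQuery($keywords, array $filters)
{
    // PREG_SPLIT_NO_EMPTY drops empty terms caused by repeated spaces.
    $terms   = preg_split('/\s+/', trim($keywords), -1, PREG_SPLIT_NO_EMPTY);
    $clauses = array_filter(array_map('quoteTerm', $terms), 'strlen');

    foreach ($filters as $field => $filter) {
        $clause = buildFilter($field, $filter['values'], $filter['operator']);
        if ($clause !== '') {
            $clauses[] = $clause;
        }
    }
    return implode(' AND ', $clauses);
}

$query = buildQuery('red  shoes', array(
    'countries'  => array('values' => array('ca', 'us'), 'operator' => 'OR'),
    'categories' => array('values' => array('12'), 'operator' => 'AND'),
));
echo $query, "\n";
// prints: "red" AND "shoes" AND (countries:ca OR countries:us) AND (categories:12)
```

    The resulting string has the same shape as the one built by the loops above, so it could be passed to find() and used as the cache key in the same way.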

  6. #6
    SirAdrian is offline Member
    Join Date
    Apr 2008
    Posts
    87

    Default

    Hey,

    I went down this road just over a year ago and came to a shocking answer: Zend Lucene is very slow. If you like Lucene, you can try to get the Java version working, which is much better.

    We ended up changing to Sphinx and it handled our millions of documents in under a hundredth of a second with no memory problems. It's amazingly fast and has been very reliable so far. MySQL.com even uses it, as well as craigslist.

    Despite being more complicated to set up and install, Sphinx actually got us up and running faster than Lucene did. It's just more intuitive.

    Hope this helps!

  7. #7
    Cristian is offline Administrator
    Join Date
    Feb 2007
    Location
    Sibiu, Romania
    Posts
    124

    Default

    It seems other frameworks already have more Sphinx support.

    For example, CodeIgniter has bundled classes for working with Sphinx.

  8. #8
    SirAdrian is offline Member
    Join Date
    Apr 2008
    Posts
    87

    Default

    If you're able to do it through SphinxSE, it takes almost no application code. Took us very little time to get it up and running. SphinxQL looks interesting, too. Both of these would not require much application code.

  9. #9
    stukov is offline Junior Member
    Join Date
    Sep 2009
    Posts
    7

    Default

    Thanks for the replies!

    I am currently trying Solr, as it will allow me to scale search performance as my infrastructure grows (with things like index replication, which Sphinx does not support yet).

    I'll post back performance info here in a day or two.

  10. #10
    stukov is offline Junior Member
    Join Date
    Sep 2009
    Posts
    7

    Default

    Solr definitely rocks. With 2 000 000 documents I search in ~240ms (average).

    I haven't tried Sphinx, because Solr already supports replication and clustering, but Zend_Search_Lucene is indeed too slow in some cases.

    Thanks for your help.
