
CirrusSearch renders every page in the search results probably just to tell the user how many bytes are in it
Closed, Resolved (Public)

Description

+++ This bug was initially created as a clone of Bug #55590 +++

Bug 55590 was "discovered" by CirrusSearch's overzealous rendering, so I'm cloning it to make this one. That bug causes crashing, which is bad, and the crashing happens even without CirrusSearch. It's just that CirrusSearch casts a wide net (due to this bug) and 55590 throws a bomb in the net, so the search results page blows up.

The part of the backtrace that matters here:
#17 /usr/local/apache/common-local/php-1.22wmf21/includes/search/SearchEngine.php(868): CirrusSearch->getTextFromContent(Object(Title), Object(WikitextContent))
#18 /usr/local/apache/common-local/php-1.22wmf21/includes/search/SearchEngine.php(954): SearchResult->initText()
#19 /usr/local/apache/common-local/php-1.22wmf21/includes/specials/SpecialSearch.php(651): SearchResult->getByteSize()
#20 /usr/local/apache/common-local/php-1.22wmf21/includes/specials/SpecialSearch.php(543): SpecialSearch->showHit(Object(CirrusSearchResult), Array)


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=55750

Details

Reference
bz55592

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:26 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz55592.

This makes showing results really really really slow.

Got started on this but I have to stop for the night. We already have the number of bytes in the article in Elasticsearch (called textLen) but it isn't stored (so it has to be retrieved from the source, slowing down queries). I'd like to store both the number of bytes and the number of words directly in Elasticsearch. I think it is worth overriding these methods to stop the rendering and return textLen for both, deprecate textLen, and replace it with text_bytes and text_words. The next step would be to reindex. Then stop using textLen and stop writing it. On the next reindex it won't be recreated.
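The migration described above could look roughly like the sketch below. It is written in Python only for illustration and is not the CirrusSearch code. The class name `StoredSizeResult` and the shape of the `fields` mapping are assumptions; only the field names `textLen`, `text_bytes`, and `text_words` come from the comment above. The point is that both sizes are read from fields returned with the hit, so nothing is rendered, and `textLen` acts as an interim fallback until the reindex has populated the new fields.

```python
from typing import Any, Mapping, Optional


class StoredSizeResult:
    """Hypothetical search result that reports sizes without rendering the page."""

    def __init__(self, fields: Mapping[str, Any]) -> None:
        # `fields` stands in for the fields returned with a single search hit
        # (an assumed shape, not the real CirrusSearch result object).
        self._fields = fields

    def _first_present(self, *names: str) -> Optional[int]:
        # Return the first field that exists, in order of preference.
        for name in names:
            if name in self._fields:
                return int(self._fields[name])
        return None

    def get_byte_size(self) -> Optional[int]:
        # Prefer the new field; fall back to the deprecated textLen
        # for documents that have not been reindexed yet.
        return self._first_present("text_bytes", "textLen")

    def get_word_count(self) -> Optional[int]:
        # Interim behaviour from the plan: textLen is returned for both
        # values until text_words exists in the index.
        return self._first_present("text_words", "textLen")
```

Once every document carries `text_bytes` and `text_words`, the `textLen` fallback can be dropped, which matches the last step of the plan.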

Ultimately I'd like to let Elasticsearch figure out the word length on its own but I'm not sure how to do that at this point. str_word_count will have to do for now.
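As a rough illustration of computing both values at index time, here is a Python sketch. The function name `size_fields` is hypothetical, and the regular expression only approximates what PHP's `str_word_count` does with its default settings (runs of ASCII letters, apostrophes, and hyphens). It does not reproduce PHP's locale handling, so the counts can differ on non-ASCII text.

```python
import re
from typing import Dict

# Approximation of str_word_count's default notion of a word:
# a run of ASCII letters, apostrophes, and hyphens.
_WORD = re.compile(r"[A-Za-z'-]+")


def size_fields(text: str) -> Dict[str, int]:
    """Build the two size fields for one document (hypothetical helper)."""
    return {
        # Byte length of the UTF-8 encoding, not the number of characters.
        "text_bytes": len(text.encode("utf-8")),
        # Word count under the ASCII-only approximation above.
        "text_words": len(_WORD.findall(text)),
    }
```

Computing the values once when the document is written means that showing a hit only needs to read two integers back.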

Change 89832 had a related patch set uploaded by Manybubbles:
Include wordCount and byteSize in result

https://gerrit.wikimedia.org/r/89832

Change 89832 merged by jenkins-bot:
Include wordCount and byteSize in result

https://gerrit.wikimedia.org/r/89832