User interface elements should not show up in CirrusSearch search result excerpts
Closed, Resolved (Public)

Description

Author: sumanah

Description:

  1. Search test2wiki for environment: https://test2.wikipedia.org/w/index.php?search=environment&title=Special%3ASearch
  2. Notice that the first result has, as the text snippet:

Geography
making maps Countries of the world Natural environment[edit | edit source] Climate Soil Rivers Rocks

  3. Click through to https://test2.wikipedia.org/wiki/Geography and notice that "[edit | edit source]" is not in the real text of the article.

I think CirrusSearch should not be displaying "[edit | edit source]" in the text snippets in the search results.


Version: unspecified
Severity: normal

Details

Reference
bz52906

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:49 AM)
bzimport added a project: CirrusSearch.
bzimport set Reference to bz52906.

sumanah wrote:

Another repro case, slightly different:

Search for "Valiant" (https://test2.wikipedia.org/w/index.php?search=valiant&title=Special%3ASearch) and you'll see the result "Blooper", with this text excerpt:

"A Bug's Life, Toy Story 2, Monsters, Inc., and Valiant. Contents 1 The "blooper" in pop culture 1"

The words after "Valiant" are part of the table of contents of the page.

sumanah wrote:

"[edit | edit source]" in the search results snippet

Attached:

edit-source-in-results.png (252×452 px, 22 KB)

Try this one: https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=edit+source&fulltext=Search

The action item here: remove the edit links and any other automatically added text from the page before dropping it into the search backend. Also, remove the table of contents if possible. I'm pretty sure the edit links and their ilk are super high priority, but I'm not sure of the priority on the table of contents.
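
For illustration only, here is a rough sketch of the kind of filtering described above, written in Python rather than taken from the actual patch. It assumes the rendered page is reasonably well-formed HTML and that the automatically added elements carry the CSS classes "mw-editsection" (section edit links) and "toc" (table of contents). The names SKIP_CLASSES, ContentTextExtractor and extract_search_text are hypothetical, and other generated text (such as the "JavaScript disabled" notice) would need its own marker added to the set.

```python
from html.parser import HTMLParser

# Assumption: classes that mark auto-generated UI elements in rendered output.
# This list is illustrative and not taken from the CirrusSearch fix.
SKIP_CLASSES = {"mw-editsection", "toc"}

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentTextExtractor(HTMLParser):
    """Collects page text while skipping subtrees marked as UI chrome."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            # Already inside a skipped subtree: just track nesting.
            self._skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & SKIP_CLASSES:
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the kept chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_search_text(html):
    """Return the text that would be sent to the search backend."""
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding this a heading that contains an edit-link span followed by a paragraph should yield only the heading and paragraph text. A real fix would more likely do this on the parser output inside MediaWiki itself; the depth counter here is just a simple way to drop a whole subtree with a streaming parser.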

sumanah wrote:

https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=video+sorry&fulltext=Search gets me a link to https://test2.wikipedia.org/wiki/Birch_beer that includes the excerpt:

a heart as big as a whale. Also: enjoy this video! Sorry, your browser either has JavaScript disabled

So that's one more automatically added bit of text to remove from the search corpus.

I've pushed a fix to gerrit: https://gerrit.wikimedia.org/r/#/c/80018/

I'll set this bug to PATCH_TO_REVIEW once I push some regression tests to review as well.

Tests: https://gerrit.wikimedia.org/r/#/c/80021/

I forgot to include the bug number in the commit messages but these links should help.

Tweaked the summary a bit. Older summary included: "[edit | edit source]", ToC text, & "JS disabled" warning. Genericized this to user interface elements and clarified that this is a CirrusSearch-specific issue.

Change 80018 had a related patch set uploaded by TTO:
Remove parts of rendered page from search.

https://gerrit.wikimedia.org/r/80018

Change 80018 merged by jenkins-bot:
Remove parts of rendered page from search.

https://gerrit.wikimedia.org/r/80018