
CirrusSearch: Remove as much non-sentence stuff as possible from article text
Closed, Resolved (Public)

Description

The snippets we generate actually contain stuff from within tables, image captions, and headings. These don't look great. If we could smash those into another field then the snippets would be nicer. We could also use the sentence fragmenter in the experimental highlighter.

Note: we already do this for headings. We should do it for tables, infoboxes, and the like. Maybe we should do it for a CSS class as well.
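
A minimal sketch of the split described above, assuming the article text is available as rendered HTML. It is not the CirrusSearch implementation: the `AUXILIARY_TAGS` set, the `TextSplitter` class, and the `split_text` helper are hypothetical names made up for illustration, and the sketch only looks at tag names (it does not handle the CSS-class case).

```python
from html.parser import HTMLParser

# Hypothetical set of elements whose text is treated as "non-sentence" content.
AUXILIARY_TAGS = {"table", "figcaption", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextSplitter(HTMLParser):
    """Collects text inside AUXILIARY_TAGS separately from the remaining text."""

    def __init__(self):
        super().__init__()
        self.aux_depth = 0   # > 0 while inside at least one auxiliary element
        self.main = []       # text fragments from ordinary prose
        self.aux = []        # text fragments from tables, captions, headings

    def handle_starttag(self, tag, attrs):
        if tag in AUXILIARY_TAGS:
            self.aux_depth += 1

    def handle_endtag(self, tag):
        if tag in AUXILIARY_TAGS and self.aux_depth > 0:
            self.aux_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        target = self.aux if self.aux_depth else self.main
        target.append(text)


def split_text(html):
    """Return (main_text, auxiliary_text) for a fragment of HTML."""
    parser = TextSplitter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.main), " ".join(parser.aux)
```

Snippets would then be built from the first returned string only, so table cells and captions no longer show up in them.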


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=61669

Details

Reference
bz63729

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:10 AM)
bzimport added a project: CirrusSearch.
bzimport set Reference to bz63729.
bzimport added a subscriber: Unknown Object (MLST).

Possibly a duplicate of bug 61669?

(In reply to Quiddity from comment #1)
> Possibly a duplicate of bug 61669?

Not a duplicate, but certainly related.

Rather than removing the text, we've moved it into another field:
https://gerrit.wikimedia.org/r/#/c/127140/

We'll still search that text, but matches in it will count for less and will be less likely to be highlighted.
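
To illustrate the idea of a second field that is still searched but counts for less, here is a rough sketch. It is not what the Gerrit change does: the field names `text` and `auxiliary_text`, the weights, and both helper functions are assumptions chosen for the example, and the scoring is a plain term count rather than a real relevance function.

```python
# Hypothetical per-field weights: matches in the auxiliary field count for less.
FIELD_WEIGHTS = {"text": 1.0, "auxiliary_text": 0.5}


def score(doc, query):
    """Sum weighted term counts over all searched fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


def pick_snippet_field(doc, query):
    """Pick the field to highlight, preferring higher-weighted fields.

    Returns None when no field contains any query term.
    """
    terms = query.lower().split()
    by_weight = sorted(FIELD_WEIGHTS, key=FIELD_WEIGHTS.get, reverse=True)
    for field in by_weight:
        words = doc.get(field, "").lower().split()
        if any(term in words for term in terms):
            return field
    return None
```

With this arrangement a document that only mentions a term in a table still matches, but it ranks below one that mentions it in prose, and the auxiliary field is only used for the snippet when the main text has no hit.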