CirrusSearch: Remove as much non-sentence stuff as possible from article text
Closed, ResolvedPublic

Assigned To

None

Authored By

	• Manybubbles
	Apr 9 2014, 3:07 PM

Description

The snippets we generate actually contain stuff from within tables, image captions, and headings. These don't look great. If we could smash those into another field then the snippets would be nicer. We could also use the sentence fragmenter in the experimental highlighter.

Note: we already do this for the headings. We should do it for tables and infoboxes and stuff. Maybe we should do it for a css class as well.
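A rough sketch of the idea, for illustration only: split rendered page HTML into sentence-like text and "auxiliary" text (tables, captions, infobox-style classes) so that each can go into its own field. This is not the CirrusSearch implementation; the tag and class lists, the `TextSplitter` class, and `split_text` are hypothetical names, and the sketch assumes well-nested HTML.

```python
from html.parser import HTMLParser

# Assumption: which elements and CSS classes count as "non-sentence" content.
AUX_TAGS = {"table", "caption", "figcaption"}
AUX_CLASSES = {"infobox", "thumbcaption"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextSplitter(HTMLParser):
    """Collects text into `main` or `aux` depending on the enclosing elements."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one bool per open element: did it open an aux region?
        self.aux_depth = 0   # number of aux regions currently open
        self.main = []
        self.aux = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        is_aux = tag in AUX_TAGS or bool(classes & AUX_CLASSES)
        self.stack.append(is_aux)
        if is_aux:
            self.aux_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        if self.stack.pop():
            self.aux_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            (self.aux if self.aux_depth else self.main).append(text)


def split_text(html):
    """Return (main_text, auxiliary_text) for a fragment of rendered HTML."""
    parser = TextSplitter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.main), " ".join(parser.aux)
```

Snippets would then be generated from the first value, while the second is still indexed for search. Matching on a CSS class, as suggested above, only needs another entry in `AUX_CLASSES`.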

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=61669

Details

Reference: bz63729

Event Timeline

• bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:10 AM

• bzimport added a project: CirrusSearch.

• bzimport set Reference to bz63729.

• bzimport added a subscriber: Unknown Object (MLST).

• Manybubbles created this task. Apr 9 2014, 3:07 PM

Possibly a duplicate of bug 61669?

(In reply to Quiddity from comment #1)

Possibly a duplicate of bug 61669?

Not a duplicate, but certainly related.

Rather than removing the text, we've moved it into another field:
https://gerrit.wikimedia.org/r/#/c/127140/

We'll still search that text, but matches in it will be worth less and will be less likely to be highlighted.
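As an illustration of "still searched, but worth less", here is a minimal sketch of an Elasticsearch-style `multi_match` query body that searches two fields and gives the second a lower boost. The field names `text` and `auxiliary_text`, the boost value, and the `build_query` helper are assumptions made for this example; the actual change is in the Gerrit link above. Highlighting preferences are not shown.

```python
def build_query(terms, aux_boost=0.5):
    """Build a query body that scores auxiliary-field matches lower.

    Assumption: `text` holds sentence-like prose and `auxiliary_text`
    holds table, caption, and infobox content.
    """
    return {
        "query": {
            "multi_match": {
                "query": terms,
                # "field^N" multiplies that field's score contribution by N.
                "fields": ["text", f"auxiliary_text^{aux_boost}"],
            }
        }
    }
```

With a boost below 1, a match that only occurs in a table or caption still returns the page, but ranks below a comparable match in the prose.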

• Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board. Apr 20 2015, 4:07 AM
