
CirrusSearch: Accelerated regex searches that stop early do not signal that
Open, Lowest, Public

Description

To keep load down on the search cluster, accelerated regex searches are only allowed to recheck a limited number of documents (10,000 right now). When that limit is reached, all subsequent documents are treated as not matching, and Cirrus doesn't signal to the user at all that this happened. This means that results are less reliable. OTOH this should only happen if your regex can't be accelerated down to a small subset of the wiki, which _should_ be reasonably rare. It happens if the regex actually does match more than the recheck limit, or if the regex is specific but the trigrams that we're able to extract from it still match too many documents.

Example:
insource:/ {{/ will match a ton of pages and underreport the number
insource:/ {{..ca/ will match fewer pages, but the only trigram that can be extracted from it (" {{") is still on too many pages
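
To illustrate why the second example still hits the limit, here is a rough sketch of trigram extraction for a toy pattern language. It is not the actual acceleration code: the function name `literal_trigrams` is made up, and it assumes the pattern contains only literal characters and `.` wildcards.

```python
def literal_trigrams(pattern: str) -> set[str]:
    """Collect trigrams from the literal runs of a toy pattern.

    Assumption: the pattern contains only literal characters and '.'
    wildcards. A real regex needs a proper parser; this is illustrative.
    """
    trigrams = set()
    # A wildcard breaks a literal run, so trigrams cannot span it.
    for run in pattern.split("."):
        # Runs shorter than three characters yield no trigrams.
        for i in range(len(run) - 2):
            trigrams.add(run[i:i + 3])
    return trigrams


if __name__ == "__main__":
    print(literal_trigrams(" {{"))
    print(literal_trigrams(" {{..ca"))
```

Both patterns produce the same single trigram, `" {{"`, because the trailing `ca` is too short to form one. The prefilter therefore selects the same candidate set for both, and the more specific regex gets no benefit from its extra characters.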

The plan is to allow the recheck code to signal back to Cirrus that it gave up, so Cirrus can let the user know that the results may not be consistent and tell them how to fix their regex. Unfortunately that first level of signalling requires Elasticsearch 1.4, which isn't quite released yet.
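
A minimal sketch of the intended behaviour, assuming a simple in-memory list of candidate documents. The names `RecheckResult`, `recheck` and `gave_up` are hypothetical and do not correspond to any Elasticsearch or CirrusSearch API; the point is only that the recheck loop reports hitting the limit instead of silently dropping the remaining candidates.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RecheckResult:
    matches: list = field(default_factory=list)
    gave_up: bool = False  # True if candidates were left unchecked


def recheck(candidates, regex, limit=10_000):
    """Run the real regex over prefiltered candidates, up to `limit` docs.

    `candidates` is an iterable of (doc_id, text) pairs (an assumption
    made for this sketch).
    """
    pattern = re.compile(regex)
    result = RecheckResult()
    for checked, (doc_id, text) in enumerate(candidates):
        if checked >= limit:
            # Stop here, but record that the remaining documents
            # were never examined.
            result.gave_up = True
            break
        if pattern.search(text):
            result.matches.append(doc_id)
    return result


if __name__ == "__main__":
    docs = [(i, "text {{template}}") for i in range(5)]
    result = recheck(docs, r" \{\{", limit=3)
    if result.gave_up:
        print("Results may be incomplete; try a more specific regex.")
```

With a flag like this, the caller can show a warning alongside the partial results rather than presenting them as complete.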


Version: unspecified
Severity: normal
Whiteboard: Elasticsearch_1.4

Details

Reference
bz72128

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:58 AM
bzimport set Reference to bz72128.
bzimport added a subscriber: Unknown Object (MLST).
Restricted Application added a subscriber: Aklapper.
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.