Search index text is empty if page contains unmatched "<"
Closed, Resolved, Public

Description

Author: nephele

Description:
Fix broken regexp in SearchUpdate.php (patch to r49794)

If an article contains a "<" symbol and there is no subsequent ">" symbol anywhere in the article, the si_text field for that article in the searchindex table ends up completely empty -- even the text in the article before the "<" symbol is wiped out. It is therefore impossible to search on any of the article's contents.

For example, http://www.uesp.net/wiki/UESPWiki:Mirror_Plan is currently triggering this bug; si_text is being set to ''. Although UESP is currently running MW1.10, the same bug occurs if the article is added to a test wiki running r49794.

The basic problem is an incorrect pair of parentheses in a preg_replace expression in SearchUpdate.php::doUpdate(). The attached patch file removes those parentheses. I also did some secondary cleanup of the expression by deleting some redundant chunks: "[A-Za-z0-9]*\\s*" is covered equally well by "[^>]*?", and the simpler expression doesn't mislead editors. The revised regexp successfully processes UESPWiki:Mirror_Plan, and also successfully processes some test pages containing HTML tags.
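
To illustrate the kind of simplified expression described above, here is a minimal sketch. It is not the attached patch: the pattern is assembled from the fragments quoted in this report, and the function name stripTagsForIndex is hypothetical. The report does not say why si_text ends up empty. The null check below guards against one plausible cause, which is an assumption here: preg_replace() returns null when PCRE reports an error such as hitting the backtrack limit.

```php
<?php
// Hypothetical helper, not MediaWiki's actual code: strips HTML-like tags
// from text before it is written to a search index.
function stripTagsForIndex( string $text ): string {
    // Assumed simplified pattern: "<", optional "/", optional whitespace,
    // one letter, then a lazy run of non-">" characters up to the closing ">".
    $pattern = '/<\/?\s*[A-Za-z][^>]*?>/';

    $result = preg_replace( $pattern, ' ', $text );

    // preg_replace() returns null on a PCRE error. Fall back to the
    // unmodified input rather than indexing an empty string.
    if ( $result === null ) {
        return $text;
    }
    return $result;
}

$samples = [
    'a <b>bold</b> word',                      // ordinary tags
    'mirror size < 5 GB, no closing bracket',  // unmatched "<"
    'text <b never closed',                    // tag-like start, no ">"
];
foreach ( $samples as $sample ) {
    echo stripTagsForIndex( $sample ), "\n";
}
```

With this pattern, complete tags are replaced by a space, while a "<" that has no later ">" leaves the surrounding text untouched, so the text before and after it stays searchable.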


Version: 1.16.x
Severity: normal

Details

Reference
bz18609

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:33 PM)
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz18609.

Thanks, patch applied in r58548, along with some tests.