
MySQL 4 MATCHes "Grouped Phrases" as substrings, not word boundaries.
Closed, Declined (Public)

Description

Author: morbus

Description:
Create a page with the phrase "Folktown Records" (plural). Then do a
search on "Folktown Record" (singular, WITH QUOTES). The quotes
group the words into a phrase so that you won't match "Folktown hasn't
a single record". The problem, however, is that the MySQL MATCH seems to
respect word boundaries on single words, but not on grouped words.
"Folktown Record" (with quotes) will MATCH "Folktown Records"
(plural) in an entry, "Grouped Phrase" will match "Ungrouped Phrase",
and so forth.

Without changing any code, you can confirm this behavior by looking for
" Folktown Record " (with the quotes AND leading and trailing spaces). Since
the space is now being treated literally, "Records" doesn't match. I think that
result is what most people are expecting when they type a grouped phrase,
but I sincerely doubt they'll make the cognitive leap to add leading and
trailing spaces to get the proper result.
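
The difference between the two behaviors can be modeled outside MySQL. The sketch below is Python, not MediaWiki code, and it does not reproduce MySQL's full-text implementation; both function names are hypothetical. It only contrasts raw substring matching (what the report describes) with matching on word boundaries (what a user likely expects).

```python
import re


def substring_phrase_match(phrase: str, text: str) -> bool:
    """Model of the reported behavior: the phrase matches as a raw substring."""
    return phrase.lower() in text.lower()


def word_boundary_phrase_match(phrase: str, text: str) -> bool:
    """Model of the expected behavior: the phrase must begin and end on word boundaries."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


text = "Folktown Records is a small label."

# Substring matching finds the singular phrase inside the plural one.
reported = substring_phrase_match("Folktown Record", text)
# Word-boundary matching rejects it because "Record" is followed by "s".
expected = word_boundary_phrase_match("Folktown Record", text)
```

Here `reported` is True and `expected` is False, which is the mismatch described above.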

To fix this in MW, we can turn every [quote] into
[space][quote][space]. In SearchEngine.php:parseQuery4, look for:

$searchon = wfStrencode( $searchon );
$this->mTitlecond = " MATCH(si_title) AGAINST('$searchon' IN BOOLEAN MODE)";
$this->mTextcond = " (MATCH(si_text) AGAINST('$searchon' IN BOOLEAN MODE) AND cur_is_redirect=0)";

and add a new line before it:

$searchon = str_replace( '"', ' " ', $searchon);
$searchon = wfStrencode( $searchon );
$this->mTitlecond = " MATCH(si_title) AGAINST('$searchon' IN BOOLEAN MODE)";
$this->mTextcond = " (MATCH(si_text) AGAINST('$searchon' IN BOOLEAN MODE) AND cur_is_redirect=0)";
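
To see what the added str_replace line does to the search string, here is a small Python equivalent. It is only an illustration of the string transformation; the function name pad_quotes is hypothetical, and the sketch leaves out the escaping done by wfStrencode and the SQL that follows.

```python
def pad_quotes(searchon: str) -> str:
    """Python equivalent of PHP's str_replace('"', ' " ', $searchon)."""
    return searchon.replace('"', ' " ')


# Every double quote gains a space on each side, so the phrase itself
# now begins and ends with a literal space.
padded = pad_quotes('"Folktown Record"')
```

With the input above, `padded` is the string ' " Folktown Record " ' (note the spaces inside the quotes), which is the same query a user would have to type by hand to get the space-delimited match.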


Version: 1.3.x
Severity: normal

Details

Reference
bz375

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 6:55 PM
bzimport set Reference to bz375.
bzimport added a subscriber: Unknown Object (MLST).

Probably should do this a few lines up where the query is being normalized; there's already different handling of quoted
phrases and non-quoted words for building text extract match regexps.

morbus wrote:

Moving it inside could be:

$searchon .= $terms[1] . $wgLang->stripForSearch( $terms[2] );
if( $terms[3] ) {
    $regexp = preg_quote( $terms[3] );

to:

$searchon .= $terms[1] . $wgLang->stripForSearch( $terms[2] );
$searchon = str_replace( '"', ' " ', $searchon);
if( $terms[3] ) {
    $regexp = preg_quote( $terms[3] );

bugzilla_wikipedia_org.to.jamesd wrote:

This breaks the search for pi requested in this bug:
http://bugzilla.wikimedia.org/show_bug.cgi?id=42 . At present you can search
for "3.14" and expect it to match "3.14159265".

Rather than adding undesired spaces, also add the words from inside the quotes
as separate terms outside the quotes. That will rank articles containing
"Folktown Record" above those containing "Folktown Records" but won't break
searches for "3.14".
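
A rough sketch of that alternative, in Python rather than the PHP of SearchEngine.php. The function name add_phrase_words is hypothetical, and the sketch only rewrites the query string; whether the extra terms actually rank exact matches higher depends on how MySQL scores them, which is not modeled here.

```python
import re


def add_phrase_words(searchon: str) -> str:
    """Append the words of every quoted phrase as extra unquoted terms."""
    # Collect the text between each pair of double quotes.
    phrases = re.findall(r'"([^"]*)"', searchon)
    # Split each phrase on whitespace to get its individual words.
    extra = [word for phrase in phrases for word in phrase.split()]
    if not extra:
        return searchon
    return searchon + " " + " ".join(extra)


augmented = add_phrase_words('"Folktown Record"')
```

The quoted phrase is left untouched, so a query such as "3.14" still matches as a substring, and the standalone words are added after it.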

MySQL 4 isn't supported anymore