Search results highlight partial word matches
Closed, Resolved, Public

Description

Author: morbus

Description:
I'm having "issues" with searching that I'm not exactly sure how to solve, and all of these are evident at the specified URL. In
essence, if I search for the word "four", I get absolutely no results. The SQL in question is roughly: SELECT * from searchindex
where MATCH (si_text) AGAINST ('+four' IN BOOLEAN MODE); (this is for MySQL 4, naturally). But, if I turn around and do a
decidedly MySQL 3.x query: SELECT * from searchindex where si_text LIKE '%four%'; I get back the two entries I expect. This
seems to tell me that the searchindex table is "Ok". To doublecheck, I dumped the table, deleted it, recreated it, and reimported
the data (thus recreated the indexes). Same result.

The real goal here is to show all matches for the word "EC" - I don't want "suspect" to be matched, but I want "-20 EC." and
similar entries (EC is a date measurement). To let MySQL search for these smaller words, I've already modified the my.cnf and set
it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK). But, somewhere in the wiki code (at the very least in
the display settings), searches are being done as strings, and not on word boundaries. Is there any way to force a word boundary? To
make matters worse, searching for "ec" at http://gamegrene.com/wiki/ "works" (because of my edit to my.cnf) but matches on
"suspect". However, searching for "ur", which should match on "procedure", doesn't return any results (but "procedure" does, as
opposed to "four").

ARggGh!


Version: 1.3.x
Severity: minor
URL: http://gamegrene.com/wiki/Special:Search?search=ec&fulltext=Search

Details

Reference
bz278

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 6:48 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz278.
bzimport added a subscriber: Unknown Object (MLST).

These are limitations of MySQL's full text search engine. You need to adjust MySQL's stopword list (which ignores "four") and
minimum word length (which ignores "EC"). Please see:
http://dev.mysql.com/doc/mysql/en/Fulltext_Fine-tuning.html
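
To illustrate the two limits mentioned above, here is a rough Python sketch of a toy full-text index. It is a simplified model, not MySQL's actual implementation: the names `indexed_words`, `fulltext_match`, and `like_match` are made up for this example, the stopword set is an illustrative subset, and the default minimum length of 4 is an assumption about the server's configuration.

```python
import re

# Assumed, heavily simplified model of full-text indexing: a word is indexed
# only if it is at least MIN_WORD_LEN characters long and is not a stopword.
MIN_WORD_LEN = 4              # assumption: the default minimum word length
STOPWORDS = {"four", "the"}   # illustrative subset; the thread says "four" is ignored


def indexed_words(text, min_len=MIN_WORD_LEN, stopwords=STOPWORDS):
    """Return the set of lowercased words that this toy index would keep."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if len(w) >= min_len and w not in stopwords}


def fulltext_match(term, text, **kwargs):
    """Whole-word lookup against the toy index (roughly MATCH ... AGAINST)."""
    return term.lower() in indexed_words(text, **kwargs)


def like_match(term, text):
    """Plain substring test (roughly si_text LIKE '%term%')."""
    return term.lower() in text.lower()


text = "The four suspects left in -20 EC."
print(fulltext_match("four", text))            # stopword, so never indexed
print(like_match("four", text))                # the substring is present
print(fulltext_match("ec", text))              # too short at the assumed default
print(fulltext_match("ec", text, min_len=2))   # indexed once the minimum is 2
```

In this model "four" is found by the substring test but not by the index lookup, and "ec" only becomes searchable after the minimum length is lowered, which matches the behaviour described in the report.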

morbus wrote:

As mentioned in the initial report, I already have revised MySQL's fulltext index: "To let MySQL search for these smaller words, I've already modified
the my.cnf and set it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK)." - otherwise, I wouldn't get any results at all for EC,
which I am (as per the original report). As for "four", that I didn't know, and I'll correct that shortly.

morbus wrote:

Just to reiterate more clearly:

  • I've lowered the full-text minimum word length to 2 letters.
  • I've rebuilt the table indexes with no success.
  • I've deleted, recreated, and reimported the searchindex table.
  • I want to search on word boundaries such that "EC" does not match "suspect".
  • When searching for "EC" at Gamegrene, we get five pages that I know match.
  • However, I don't know what exactly is matched. If MySQL MATCH() does word boundaries, then the MW display does string searching (as it always shows "suspect").
  • "ur" as in "procedure" shows no matches; "procedure" does.

http://gamegrene.com/wiki/Special:Search?search=ec&fulltext=Search only returns pages which contain "EC" by itself.

Can you clarify what exactly your problem is?

morbus wrote:

From three of my machines (different IPs, logged in or not), and another
person's machine entirely, we're NOT seeing "EC" by itself (word boundary).
We're seeing EC as a string. For instance, one of the returned results shows the
below, which is matching on "effect", "secret", and "ineffective".

Avazian Box (2331 bytes)

1: ...d quickly. This advancement came with the side effect of immense
greed. Many highly advanced magnetic ...
3: ...g new magnetic propulsion technologies, formed a secret team
intending to thwart the ongoing conflict.
5: ...which rendered all weapons of Avazian origin ineffective, and the
absorption of the magnetic field wou...

bugzilla_wikipedia_org.to.jamesd wrote:

"ec" is matched in the middle of a word. Other two character sequences are
typically not matched in the middle of a word. The desired behavior is to match
ec only when it is a whole word, not in the middle of words.

Can you explain what you mean by "match"? As far as I can tell, the search is *ONLY* returning pages in which "EC"
appears as a distinct word when asked to search for "EC". Nothing else. No other pages are returned.

So, is this about the *searching*?

Or, is it about the *highlighting* of text extracts in the search results display?

Can you please clarify?

morbus wrote:

Brion - exactly, that's what I don't know (from a previous entry): "When
searching for "EC" at Gamegrene, we get five pages that I know match. If MySQL
MATCH() does word boundaries, then the MW display does string searching (as it
always shows "suspect")."

If MySQL MATCH() does do word boundaries, then yeah, I guess I'm reporting a bug
in the display code (specifically, showHit() in SearchEngine.php).

Thanks for the patience.

Morbus, for general information on the fulltext search engine see
http://dev.mysql.com/doc/mysql/en/Fulltext_Boolean.html

Matches are on full words unless you use the * operator (eg, search for "apple*" finds "applet" and "applesauce" but search
for "apple" does not).
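
The whole-word versus `*` behaviour described above can be sketched in Python as follows. This is only a toy model of the matching rule, not the search engine's code; `boolean_term_matches` is a hypothetical helper invented for this illustration.

```python
import re


def boolean_term_matches(term, text):
    """Toy model: 'apple' needs a whole word; 'apple*' accepts any word with that prefix."""
    words = re.findall(r"\w+", text.lower())
    if term.endswith("*"):
        prefix = term[:-1].lower()
        return any(w.startswith(prefix) for w in words)
    return term.lower() in words


print(boolean_term_matches("apple*", "an applet and applesauce"))  # prefix search
print(boolean_term_matches("apple", "an applet and applesauce"))   # whole word only
print(boolean_term_matches("ec", "the prime suspect"))             # "ec" is not a word here
```

Under this rule a search for "ec" never returns a page merely because it contains "suspect", which is why the remaining problem lies in how the extracts are highlighted rather than in which pages are returned.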

Changed summary and sample URL to reflect the problem.

morbus wrote:

This has not been heavily tested yet, but the following revision
in SearchEngine.php:showHit() seems to do what I want:

$pat1 = "/(.*)(\b" . implode( "|", $this->mSearchterms ) . "\b)(.*)/i";

The generated pattern then becomes /(.*)(\bEC\b)(.*)/i or, in the case of
multiple searches /(.*)(\b20|EC\b)(.*)/i. This code is currently live
at the provided URL, so you can test as needed.

morbus wrote:

Sorry - the correct revision is:

$pat1 = "/(.*)(\b" . implode( "\b|\b", $this->mSearchterms ) . "\b)(.*)/i";

which creates a pattern like /(.*)(\b20\b|\bEC\b)(.*)/i.
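
For comparison, the difference between the two patterns can be reproduced with a small Python sketch using the standard `re` module, whose `\b` behaves the same way for these inputs. The helper `highlight_pattern` is hypothetical, the surrounding `(.*)` groups are left out for brevity, and the `re.escape` call is an addition of this sketch that is not part of the patch above.

```python
import re


def highlight_pattern(terms, per_term_boundaries=True):
    """Build the alternation used to find search terms in an extract.

    re.escape is an addition of this sketch (not in the patch above); it keeps
    terms containing regex metacharacters from changing the pattern.
    """
    escaped = [re.escape(t) for t in terms]
    if per_term_boundaries:
        body = "|".join(r"\b" + t + r"\b" for t in escaped)   # \b20\b|\bEC\b
    else:
        body = r"\b" + "|".join(escaped) + r"\b"              # \b20|EC\b
    return re.compile("(" + body + ")", re.IGNORECASE)


first = highlight_pattern(["20", "EC"], per_term_boundaries=False)
fixed = highlight_pattern(["20", "EC"])

for sample in ["-20 EC.", "side effect", "2000", "spec"]:
    print(sample, bool(first.search(sample)), bool(fixed.search(sample)))
```

With the first pattern the boundaries bind only to the outer ends of the alternation, so "2000" matches through `\b20` and "spec" matches through `EC\b`. With a boundary on both sides of every term, only the standalone "20" and "EC" are picked up.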

Fixed in r26269 for mainline, r26271 for lucenesearch extension.