
Multiple search terms are not enforced properly for Chinese
Closed, Resolved, Public

Description

Here the search string I give is "逢甲", so why does it behave as if I
merely typed "甲"?

$ w3m -dump "http://taizhongbus.jidanni.org/index.php?search=逢甲&fulltext=搜索"
Problem 1: raw $1:
有關搜索中公的更多詳情,參見$1。 (roughly: "For more details about searching 中公, see $1.")

  1. 大甲-龜殼村-海墘 (344字節)

Problem 2: it also matches on only one character of my two-character query:

  1. 大甲-海尾子 (685字節)
  2. 大甲-外埔-土城 (421字節)
  1. 大甲-龜殼村-海墘 (344字節)
  2. 大甲-豐原 (884字節)

The website is online, so you can test it yourself.


Version: 1.16.x
Severity: normal

Details

Reference
bz8445

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:30 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz8445.

OK, it looks like the splitting of characters (done to compensate for the lack of word spacing in Chinese text) happens after the boolean search query is constructed, leading to failure:

The input:
'逢甲'

is translated to a boolean query for a single required word:
'+逢甲'

which then gets split up by character, then encoded to compensate for encoding bugs:
'+ U8e980a2 U8e794b2'

The '+' gets detached from the characters, so it has no effect, and the search backend returns results that contain either character instead of requiring both.

As a workaround, you can quote the multi-character string, which ends up encoding correctly for a phrase search:
'+" U8e980a2 U8e794b2"'

OK, comparing

http://radioscanningtw.jidanni.org/index.php?search=學甲&ns0=1&title=特殊:搜尋&fulltext=Search
http://radioscanningtw.jidanni.org/index.php?search='學甲'&ns0=1&title=特殊:搜尋&fulltext=Search
http://radioscanningtw.jidanni.org/index.php?search="學甲"&ns0=1&title=特殊:搜尋&fulltext=Search

it is clear that only the final form gives the correct results.

Could you fellows glue the '+' that has fallen off back on, there
behind the scenes?

Wouldn't that be better than users of Asian-language sites thinking search is broken, or MediaWiki
needing to add instructions telling Asian users to double "quote" "every" Asian "string"
they want to search?

Alas, I see WMF doesn't use SpecialSearch.php anymore, but
these extensions instead:

$ w3m -dump http://zh.wikipedia.org/wiki/Special:Version | grep Search
MWSearch MWSearch plugin Brion Vibber and
OpenSearchXml OpenSearch XML interface Brion Vibber

So the best I can do for now is put a message in
MediaWiki:Searchresulttext: "If searching in Chinese, try your search
again with quote marks, 逢甲 -> "逢甲". Sorry."

SpecialSearch.php provides the front-end UI, and is indeed used on Wikimedia sites.

MWSearch provides an alternate back-end. PostgreSQL users also have a different search back-end. Unsurprisingly, different back-ends have different properties and do not all share the same bugs.

Created attachment 6211
CJK quoter

How about this patch? It seems to work and maybe doesn't break anything else.
All I'm trying to do is type those quote marks that Brion mentioned
for the user behind the scenes, instead of asking them up front to
type them in via some embarrassing message. Otherwise, what is the
logic of distributing a broken search without the least warning to the
user?
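
(The attachment itself is not inlined in this thread, so the following is only a guess at the idea, not the actual "CJK quoter" patch: quote a term behind the scenes whenever it contains Han characters, so the segmented characters stay grouped as one phrase.)

<?php
// Hypothetical sketch of the "CJK quoter" idea only -- not attachment 6211.
function quoteCjkTerm( $term ) {
    // Add the quotes for the user when the term contains Han characters.
    if ( preg_match( '/[\x{4e00}-\x{9fff}]/u', $term ) ) {
        return '"' . $term . '"';
    }
    return $term;
}

echo quoteCjkTerm( '逢甲' ), "\n";   // "逢甲"
echo quoteCjkTerm( 'hello' ), "\n";  // hello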

But as Wikipedia uses a better search, repairing this worse search
will be an uphill battle: without being forced to eat your own
medicine, you won't have any impetus to improve it.

So MediaWiki should distribute the good stuff it uses itself instead.

Anyway, note that I only patched zh-hans. This will not help the other
CJK languages that already have their own
languages/classes/Language*.php. Fortunately zh-tw doesn't, so it will
get this fix.

As for patch quality: well, since it seems nobody cares much about
this old search function, just chuck it in; it's better than nothing.

All I know is it works for me here on MySQL Linux etc.

Attached:

The patch as written can result in double-quoting, causing searches to fail if quotes were used in the original search term. With no quotes in the input it seems OK... it should be possible to tweak it not to add double quotes.

OK, tomorrow I will make the patch first scan to see if the user has put
any double quote marks in their input, and not tamper with their input if so.
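
Something along these lines, say (a hypothetical helper, not the actual follow-up patch): check for the user's own double quotes first and only add the automatic quoting when none are present.

<?php
// Hypothetical sketch of the planned refinement, not the committed code:
// leave the input alone if the user already quoted something themselves.
function quoteCjkTermSafely( $term ) {
    if ( strpos( $term, '"' ) !== false ) {
        return $term; // user supplied their own quoting; don't tamper
    }
    if ( preg_match( '/[\x{4e00}-\x{9fff}]/u', $term ) ) {
        return '"' . $term . '"';
    }
    return $term;
}

echo quoteCjkTermSafely( '"逢甲"' ), "\n"; // unchanged
echo quoteCjkTermSafely( '逢甲' ), "\n";   // "逢甲"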

Glad to know this is the right place to fix this bug, so I needn't look deeper
under the hood.

Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here.

Implementation committed in r52338:

Big fixup for Chinese word breaks and variant conversions in the MySQL search backend...

  • removed redundant variant terms for Chinese, which forces all search indexing to canonical zh-hans
  • added parens to properly group variants for languages such as Serbian which do need them at search time
  • added quotes to properly group multi-word terms coming out of stripForSearch, as for Chinese where we segment up the characters. This is based on Language::hasWordBreaks() check.
  • also cleaned up LanguageZh_hans::stripForSearch() to just do segmentation and pass on the Unicode stripping to the base Language implementation, avoiding scary code duplication. Segmentation was already pulled up to LanguageZh, but was being run again at the second level. :P
  • made a fix to Chinese word segmentation to handle the case where a Han character is followed by a Latin char or numeral; a space is now added after as well. Spaces are then normalized for prettiness.
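
As a rough standalone sketch of the quoting rule from the third bullet above (this is not the committed r52338 code; the boolean parameter stands in for the Language::hasWordBreaks() check): if the language lacks word breaks and segmentation produced several tokens for one user term, wrap them in quotes so MySQL boolean mode treats them as a single phrase.

<?php
// Standalone sketch, not the committed code: group a segmented term as a
// phrase when the content language has no word breaks.
function groupSegmentedTerm( $strippedTerm, $languageHasWordBreaks ) {
    if ( !$languageHasWordBreaks && strpos( $strippedTerm, ' ' ) !== false ) {
        return '"' . $strippedTerm . '"';
    }
    return $strippedTerm;
}

echo groupSegmentedTerm( 'U8e980a2 U8e794b2', false ), "\n"; // "U8e980a2 U8e794b2"
echo groupSegmentedTerm( 'hello', true ), "\n";              // hello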

"Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here."

Not all CJK languages omit interword spaces and not all languages which omit interword spaces are CJK:

  • Korean does use spaces between words. Quite possibly a full-width space character rather than ASCII 0x20.
  • Thai and Khmer (Cambodian) do not use spaces between words.
  • Note that both Unicode and HTML include means of indicating invisible word breaks for such languages. Then again a quick Google seems to indicate that the HTML "WBR" tag is neither official nor interpreted to have the same semantics by everybody.

Another approach would be to harvest Han compounds from sources such as EDICT, CEDICT, and the various Wiktionaries. Google does morphological analysis to determine which strings of Han characters are compounds that should be treated as words.
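
A toy sketch of that dictionary-based idea (the word list here is made up for illustration; harvested EDICT/CEDICT or Wiktionary data would supply real compounds): greedily segment a Han string by longest match against the compound list.

<?php
// Toy greedy longest-match segmenter; the lexicon is a made-up stand-in for
// harvested compound lists such as EDICT or CEDICT.
function segmentWithLexicon( $text, array $lexicon ) {
    $chars = preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY );
    $words = array();
    for ( $i = 0; $i < count( $chars ); ) {
        $bestLen = 1; // fall back to a single character
        for ( $len = 2; $i + $len <= count( $chars ); $len++ ) {
            $candidate = implode( '', array_slice( $chars, $i, $len ) );
            if ( in_array( $candidate, $lexicon ) ) {
                $bestLen = $len; // prefer the longest listed compound
            }
        }
        $words[] = implode( '', array_slice( $chars, $i, $bestLen ) );
        $i += $bestLen;
    }
    return implode( ' ', $words );
}

echo segmentWithLexicon( '大甲海尾子', array( '大甲', '海尾子' ) ), "\n";
// 大甲 海尾子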

Andrew Dunbar (hippietrail)

Glad Chinese is finally fixed. No more need for a "try Google instead" note
in MediaWiki:Searchresulttext!

Another approach would be to harvest Han compounds from sources such as EDICT,

Well, my wikis' compounds are all police department and bus stop names:
http://jidanni.org/comp/wiki/article-category.html .