
CJKFilter wrongly tokenizes a mixed CJK and non-CJK string.
Closed, Declined (Public)

Description

Author: mizuno.jun

Description:
a patch for CJKFilter.java and its test.

With the language=ja setting,
CJKFilter wrongly tokenizes a CJK string
if the string starts with non-CJK characters.

Example:
A string "abC1C2C3", where C1, C2, and C3 each denote a CJK character, is tokenized into
the token stream (abC1, C1C2, C2C3).
This should be (ab, C1C2, C2C3).

This behavior causes an odd snippet in search results.
The token stream (abC1, C1C2, C2C3) is combined into the word "abC1C1C2C3".
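
For illustration only, here is a minimal sketch of the expected behavior described above. It is not the attached patch and does not use the Lucene TokenFilter API. The class name CjkBigramSketch, the helper isCjk, and the choice of Han, Hiragana, and Katakana as "CJK" are all assumptions made for this example. It takes one input token, emits each non-CJK run whole, and emits overlapping bigrams for each CJK run.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {

    // Assumption: "CJK" here means Han, Hiragana, or Katakana script.
    // The real CJKFilter may classify characters differently.
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    // Splits one token into runs of CJK and non-CJK code points.
    // Non-CJK runs are kept whole; CJK runs become overlapping bigrams
    // (a lone CJK character is emitted as a single-character token).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            int start = i;
            boolean cjk = isCjk(cps[i]);
            while (i < cps.length && isCjk(cps[i]) == cjk) {
                i++;
            }
            if (!cjk) {
                tokens.add(new String(cps, start, i - start));
            } else if (i - start == 1) {
                tokens.add(new String(cps, start, 1));
            } else {
                for (int j = start; j + 1 < i; j++) {
                    tokens.add(new String(cps, j, 2));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "ab" followed by three Han characters (日本語), i.e. the "abC1C2C3" case.
        // Expected tokens: ab, 日本, 本語
        System.out.println(tokenize("ab\u65E5\u672C\u8A9E"));
    }
}
```

The point of the sketch is only that the run boundary between "ab" and C1 ends the non-CJK token, so no token such as abC1 is produced.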


Version: unspecified
Severity: normal

attachment cjkfilter.patch ignored as obsolete

Details

Reference
bz26997

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:15 PM
bzimport set Reference to bz26997.
bzimport added a subscriber: Unknown Object (MLST).

mizuno.jun wrote:

a patch for CJKFilter.java and its test.

The previous patch contained incorrect code:
tokens without a CJK character would be filtered incorrectly.
I have replaced the patch.

Attached:

sumanah wrote:

Jun, I'm sorry that it is taking so long for a developer to review your patch! I have added the "need-review" keyword to indicate that your patch awaits review. Thank you for the patch.

sumanah wrote:

Jun, I'm asking Oren Bochman to take a look at your patch. You might also be interested in working with him more generally to improve our Lucene search extension.

If you really want to work on this I think you can try to incorporate some existing project into the extension: http://stackoverflow.com/questions/5834371/is-there-any-good-open-source-or-freely-available-chinese-segmentation-algorithm

mizuno.jun wrote:

Hi,
It is welcome news that my patch might be reviewed.
I have used a patched lucene-search at my site for almost a year; however,
I am not sure my patch is valid.

The Lucene-Search extension is a fundamental tool at my site.
I don't know whether there is anything I can do, but
I will study the implementation of CJK support more closely.

sumanah wrote:

Jun: Thanks again for the patch. Are you interested in using developer access to directly suggest any future MediaWiki and MediaWiki extension improvements into our Git source control system?

https://www.mediawiki.org/wiki/Developer_access

https://www.mediawiki.org/wiki/Git/Workflow#How_to_submit_a_patch

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

In the meantime, lucene-search in Wikimedia has reached its end of life and will not be improved further.
Jun Mizuno: It would be awesome if you could check whether the problem still exists in the CirrusSearch extension that is being worked on (it is also a Lucene-based search for MediaWiki, backed by Elasticsearch instead of Wikimedia's home-grown lsearchd).

I don't see this issue in Cirrus/Elastic. Marking WONTFIX since lsearchd is end of life but adding cirrus-fixed tag.