Word boundary parameter \b not working with Unicode devanagari words
Open, Medium, Public

Description

On the Marathi-language Wikipedia I usually use contains_any(added_lines,"तू") to filter a given word, for example when I want to stop use of the word "तू".

To avoid false positives caused by prefixes and suffixes attached to the word, we want to use \b as a word boundary on either side of the word, or on both sides, as required.

We would like to be able to use the following:

*contains_any(added_lines,"तू\b") should work, so that we do not get a false positive on the word "तूप" and many similar words.

*contains_any(added_lines,"\bतू") should work, so that we do not get a false positive on the word "धातू" and many similar words.

*contains_any(added_lines,"\bतू\b") should work, so that we do not get a false positive on the word "दुकानातून" and many similar words.

The related edit (abuse) filter on the Marathi-language Wikipedia is http://mr.wikipedia.org/wiki/विशेष:संपादन_गाळणी/10

For words with few possible prefixes and suffixes we use the ! operator, but it is not sufficient for words where too many suffixes or prefixes are possible.

If \b, or any other good option for a word boundary, can be made to work, it will be useful to many wikis that use the Devanagari script, such as Hindi and many others.

See Also:

Details

Reference
bz46773

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:34 AM
bzimport set Reference to bz46773.
bzimport added a subscriber: Unknown Object (MLST).

*One suggestion was to use 'added_lines rlike "\bतू\b"', but this did not work either.

*It seems some European languages had problems related to \b and those have been resolved, so we request that the developers support languages using the Devanagari script in this respect.

quentinv57 wrote:

I just tried to fix this, using the following expression :

added_lines irlike "\bतू\b"

(this is the way it is done on the French Wikipedia, from what I've seen)

The expression was not matched as I expected. Apparently it comes from the fact that "\b" does not support UTF-8 characters.

Regards,

Quentinv57

Hi everyone,

Using rlike is indeed the way to go as contains_any works on plain strings, not regular expressions (in this context, \b is nothing more than an invalid escaped character).
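As a rough illustration of that distinction, here is a minimal Python sketch. The `contains_any` function below is a hypothetical stand-in that performs literal substring tests; it is not AbuseFilter's implementation, and how AbuseFilter itself treats the `\b` escape inside a string literal may differ.

```python
def contains_any(haystack: str, *needles: str) -> bool:
    """Hypothetical stand-in: literal substring test, no regex semantics."""
    return any(needle in haystack for needle in needles)


# A literal test cannot tell a whole word from the start of a longer word.
substring_hit = contains_any("तूप", "तू")

# Here the backslash and the "b" are searched for as two literal characters,
# so the needle is not found even in text that consists of the word alone.
literal_escape_hit = contains_any("तू", r"तू\b")

print(substring_hit, literal_escape_hit)  # prints: True False
```

In this sketch the boundary marker never acts as a boundary, which is why a regex operator such as rlike is needed before \b can mean anything.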

We've indeed already had the same problem on several European languages such as French and Portuguese (see bug 22761), but it has been fixed by updating PHP to a newer version which provides UTF-8-aware special characters.

Now it would be interesting to test PCRE alone, to see if it can handle this well.

Best regards

*In most filters we are shifting from contains_any(added_lines,"") to added_lines irlike "".

*Of course we still need a solution to the \b word boundary issue; it is very important to several of our filters and to Indic-language wikis.

*(BTW, a bit of a diversion from the subject: added_lines irlike "" seems to have some unstated limit on the number of words/strings it can handle in a single filter. Or does this behaviour occur only with the Devanagari script?)

(In reply to comment #3)

We've indeed already had the same problem on several European languages such as French and Portuguese (see bug 22761), but it has been fixed by updating PHP to a newer version which provides UTF-8-aware special characters.

So it should be reported upstream to PHP?

-upstream keyword: "Bugs marked this way *should* include a link to the upstream bug report in the "See Also" field!" (https://bugzilla.wikimedia.org/describekeywords.cgi)

(In reply to comment #6)

-upstream keyword: "Bugs marked this way *should* include a link to the upstream bug report in the "See Also" field!" (https://bugzilla.wikimedia.org/describekeywords.cgi)

Sure. That's why I added it.

But there's no PHP bug URL in the See Also field...

Is there something like a minimal test script to trigger this? Also wondering about our "PHP version" and "Package affected".
See https://bugs.php.net/report.php

Change 71718 had a related patch set uploaded by Hashar:
test word boundaries in devanagari words

https://gerrit.wikimedia.org/r/71718

Created attachment 12734
PCRE unit tests without and with unicode mode

The root cause is that PCRE does not look up Unicode character properties by default, so it does not recognize word boundaries in various scripts.

To make PCRE match the word boundaries, we need to have PCRE act in Unicode mode using the 'u' regex modifier. That makes PCRE look up the character properties in a huge table, which might be a bit slow.

So that is definitely doable, but we have to look at the performance impact.
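The difference between the two matching modes can be illustrated outside PCRE. The sketch below uses Python's `re` module as a stand-in (an assumption; it is not the engine AbuseFilter uses): its `re.ASCII` flag limits `\w`, and therefore `\b`, to ASCII letters, digits and underscore, roughly like a PCRE that does not consult Unicode properties, while the default mode uses Unicode character data. The helper name `has_whole_word` and the sample words "école" and "घर" are made up for illustration.

```python
import re


def has_whole_word(word: str, text: str, ascii_only: bool) -> bool:
    """Return True if `word` occurs in `text` with a word boundary on each side."""
    flags = re.ASCII if ascii_only else 0
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags) is not None


# ASCII-only mode: "é" is not a word character, so a boundary is wrongly
# seen inside "école" and the fragment "cole" counts as a whole word.
french_ascii = has_whole_word("cole", "\u00e9cole", ascii_only=True)

# Unicode mode: "é" is a letter, so there is no boundary inside the word.
french_unicode = has_whole_word("cole", "\u00e9cole", ascii_only=False)

# Devanagari letters are not word characters in ASCII-only mode either,
# so no boundary is found around "घर" there; Unicode mode sees the letters.
devanagari_ascii = has_whole_word("घर", "घर", ascii_only=True)
devanagari_unicode = has_whole_word("घर", "घर", ascii_only=False)

print(french_ascii, french_unicode, devanagari_ascii, devanagari_unicode)
# prints: True False False True
```

The sample word "घर" consists of letters only. Words containing vowel signs, such as the ones in this report, behave differently even in Unicode mode, so this sketch covers only the part of the problem that Unicode mode addresses.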

The change https://gerrit.wikimedia.org/r/71718 adds a lame test in MediaWiki core which shows the problem.

$ php phpunit.php --testdox includes/bug46773Test.php
PHPUnit 3.7.21 by Sebastian Bergmann.

Configuration read from /Users/amusso/projects/mediawiki/core/tests/phpunit/suite.xml

bug46773

  • Regex boundaries devanagari
  • Regex boundaries devanagari in unicode mode
  • Media wiki test case parent setup called

$

(An 'x' denotes a passing test.)

Attached is the --tap output of the test.

Change 71718 abandoned by Hashar:
test word boundaries in devanagari words

Reason:
That was an example for bug 46773

https://gerrit.wikimedia.org/r/71718

I ran a basic preg_match 1M times with and without the modifier; with the u modifier it took 15% longer on average.

Doing regexes isn't the only thing AbuseFilter does, so I think we would be safe enabling it with a flag, and then we can watch the performance of it to make sure we don't see anything too crazy.

(In reply to comment #14)

"safe enabling it with a flag"

Hi,

Are we expected to do any Edit filter testing at our local wiki?

Thanks and regards

Any good news for us on this bug, please?

AbuseFilter currently runs all regexes in Unicode mode. That apparently solves the problem for some languages but not for all: the Devanagari examples offered here still fail.

With respect to regex, the underlying issue is almost certainly related to how PCRE handles \b with multi-byte Unicode. I'm not sure at the moment whether this might be fixable with a version or configuration change or an upstream patch, or whether it is basically unfixable.

That said, in doing some testing, I found a workaround that might help some of the time. \b is essentially equivalent to testing whether the adjacent character is either not a word character or is the start or end of the string. Replacing \b with the equivalent explicit Unicode-property forms, such as:

\bतू\b  ->  (?:\P{Xwd}|\A)तू(?:\P{Xwd}|\Z)

seems to work more often than \b does.

Here "\P{Xwd}" denotes "not a letter, number, or underscore" under Unicode, \A is the start of the subject string, and \Z is its end (or the position just before a final newline).

For example:

"तूप" rlike "\bतू\b"

is unexpectedly true, but

"तूप" rlike "(?:\P{Xwd}|\A)तू(?:\P{Xwd}|\Z)"

is correctly false.

The trick explained above also works for: "दुकानातून".

Likewise:

"तू" rlike "(?:\P{Xwd}|\A)तू(?:\P{Xwd}|\Z)"

is true as expected.

However, this is only a partial solution, as the other example, "धातू", does not work correctly. (The vowel sign attached to the first character apparently still reads as not a letter.)

This trick is at best a band-aid to help filter writers in some situations, but maybe it will be of some use till someone comes up with a more complete solution.
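The partial success reported above can be reproduced outside PCRE. The following sketch uses Python's `re` module as a stand-in (an assumption; AbuseFilter itself uses PCRE through PHP): Python's `\w`, much like PCRE's `Xwd`, covers letters, numbers and underscore but not combining marks, which is the category that Devanagari vowel signs such as ा and ू fall in. `lookaround_match` mimics the `(?:\P{Xwd}|\A)` and `(?:\P{Xwd}|\Z)` workaround with lookarounds, and `mark_aware_match` is a hypothetical alternative that also treats Unicode marks (general categories Mn, Mc, Me) as word characters. All function names are invented here.

```python
import re
import unicodedata

WORD = "तू"


def lookaround_match(text: str) -> bool:
    """Mimic the workaround: no regex word character right before or after WORD."""
    pattern = r"(?<!\w)" + re.escape(WORD) + r"(?!\w)"
    return re.search(pattern, text) is not None


def is_word_char(ch: str) -> bool:
    """Letters, marks, numbers and underscore count as word characters."""
    return ch == "_" or unicodedata.category(ch)[0] in "LMN"


def mark_aware_match(text: str) -> bool:
    """True if WORD occurs with no word character (marks included) beside it."""
    start = text.find(WORD)
    while start != -1:
        end = start + len(WORD)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            return True
        start = text.find(WORD, start + 1)
    return False


samples = ["तू", "तूप", "दुकानातून", "धातू"]
plain_b = [re.search(r"\bतू\b", s) is not None for s in samples]
lookaround = [lookaround_match(s) for s in samples]
mark_aware = [mark_aware_match(s) for s in samples]

print(plain_b)     # [False, True, True, False]
print(lookaround)  # [True, False, False, True]
print(mark_aware)  # [True, False, False, False]
```

Only the first sample should match. In this sketch, plain `\b` gets three of the four wrong because the vowel sign ू is not a word character, the lookaround form still fails on "धातू" because ा is not one either, and counting marks as word characters gives the intended result. Whether an equivalent pattern (for example one that adds `\p{M}` to the excluded set) behaves the same way in AbuseFilter has not been tested here.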

Thanks, interesting. That's indeed what PCRE suggests doing in Unicode mode (see http://bugs.exim.org/show_bug.cgi?id=865#c1). The problem with ligatures/diacritics is worth reporting to them, if you have an example, because I don't see the topic discussed in their issue tracker.

Perhaps we should look for a regex engine which respects http://unicode.org/reports/tr18/ (is basic unicode support enough?) and offer it in AbuseFilter, at least as an option. There are some comparisons but I've not looked for an "official" one.

I don't know how T49512: Switch AbuseFilter to using Lua would affect this task, but I thought I should mention it here.