Abuse filter regex \b considers unicode characters as word boundaries
Closed, DeclinedPublic
Actions

Assigned To

None

Authored By

	• bzimport
	Mar 8 2010, 3:28 AM

Description

Author: delbu9c1

Description:
In analyzing a false positive, I've been trying to track down the reason my regex debugger says a regex doesn't match yet it does match on the abuse filter. Eventually I found what appears to be a good lead on the issue.

Details of the incorrect match are here: http://test.wikipedia.org/w/index.php?title=Special:AbuseLog&details=1784

It appears what's going on is the é (which appears to be encoded in UTF-8) is mishandled when testing against the regex. It appears to the regex engine as a word boundary, so the match succeeds (specifically, "\brence\b" matches "conférence").

Hopefully there's a way to correct this and it's not a problem in the heart of PHP instead.

Please let me know if you need any additional information.

Shirik @ enwiki

Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46773

Details

Reference: bz22761

Related Objects

Mentioned In: T48773: Word boundary parameter \b not working with Unicode devanagari words
Mentioned Here: T29987: AbuseFilter: Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5"

Event Timeline

• bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:01 PM

• bzimport added a project: AbuseFilter.

• bzimport set Reference to bz22761.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task.Mar 8 2010, 3:28 AM

Hello,

We have the same problem at frwiki: accented letters cause many false positives because they are recognized as word boundaries.

Best regards,

Arkanosis @ frwiki

Same problem on ptwikibooks, where a '\bpica\b' matched 'típica'.

(In reply to comment #2)

Same problem on ptwikibooks, where a '\bpica\b' matched 'típica'.

and also 'hípica' when this edit was made last week:
http://pt.wikibooks.org/?diff=prev&oldid=218088
This forced us have to use something like [^A-ZÁÀÂÃÇÉÊẼÍÓÒÔÕQ̃ÚŰÜŨ015] in a regex, so that it doesn't match accented letters

(In reply to comment #3)

[^A-ZÁÀÂÃÇÉÊẼÍÓÒÔÕQ̃ÚŰÜŨ015]

Where "015" is there as an workaround to T29987: AbuseFilter: Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5".

The /u flag is definitely there, as far as I can see.

public static function keywordRegex( $str, $regex, $pos, $insensitive = false ) {

		$str = $str->toString();
		$pattern = $regex->toString();

		$pattern = preg_replace( '!(\\\\\\\\)*(\\\\)?/!', '$1\/', $pattern );
		$pattern = "/$pattern/u";

		if( $insensitive ) {
			$pattern .= 'i';
		}

		$handler = new AFPRegexErrorHandler( $pattern, $pos );
		try {
			$handler->install();
			$result = preg_match( $pattern, $str );
			$handler->restore();
		} catch ( Exception $e ) {
			$handler->restore();
			throw $e;
		}
		return new AFPData( self::DBool, (bool)$result );

}

delirium wrote:

Is this still happening? My understanding is that it was a PHP behavior (possibly a "bug" depending on your interpretation) that was changed in more recent versions.

My understanding of the problem:

PHP passes its regexes to PCRE, and the traditional PCRE behavior is that the "traditional" character class \w is defined as just [A-Za-z0-9_], even when matching a UTF8 string. The word boundary class \b is defined based on \w, so behaves similarly. If you wanted the Unicode notion of "letter character", there are instead Unicode character classes, such as \pL. The /u switch in PHP just tells PCRE it's matching a Unicode string, but didn't change the definition of the legacy character classes.

But, PCRE added a switch 'PCRE_UCP' in mid-2010, which if set makes the traditional character classes into aliases for morally equivalent Unicode character classes. That should produce something closer to the expected behavior, at least in our case. From what I can find, this PCRE switch is enabled when /u is specified in newer versions of PHP, starting with 5.3.4, which came out in late 2010. I assume Wikimedia must be using a sufficiently new version by now?

delirium wrote:

*** Bug 36129 has been marked as a duplicate of this bug. ***

(In reply to comment #6)

Is this still happening? My understanding is that it was a PHP behavior
(possibly a "bug" depending on your interpretation) that was changed in more
recent versions.

<snip>

PCRE added a switch 'PCRE_UCP' in mid-2010, which if set makes the
traditional character classes into aliases for morally equivalent Unicode
character classes. That should produce something closer to the expected
behavior, at least in our case. From what I can find, this PCRE switch is
enabled when /u is specified in newer versions of PHP, starting with 5.3.4,
which came out in late 2010. I assume Wikimedia must be using a sufficiently
new version by now?

Indeed, I haven't managed to reproduce the bug this morning.
I'm marking the bug as RESOLVED FIXED but don't hesitate to reopen if it's not OK for you.

Thanks Mark for the detailed explanation!

Surely it's not FIXED if the behaviour changed in PHP -> WORKSFORME.
WMF wikis are currently using PHP 5.3.10-1ubuntu3.4+wmf1, by the way.

It's fixed somehow by updating PHP (it didn't workforme back in 2010, now it does). But I guess this is of little importance.

Thanks for the info on the PHP version :)

He7d3r mentioned this in T48773: Word boundary parameter \b not working with Unicode devanagari words.Oct 11 2017, 12:57 PM

Abuse filter regex \b considers unicode characters as word boundariesClosed, DeclinedPublicActions

Description

Details

Related Objects

Event Timeline

Abuse filter regex \b considers unicode characters as word boundaries
Closed, DeclinedPublic
Actions