
Spam-blacklist does not support unicode characters in regex, needed to filter internationalized domain names
Closed, Resolved · Public

Description

Author: alex.lazovsky

Example: http://пример.испытание was blacklisted (http://ru.wikipedia.org/w/index.php?diff=30178032), but I could still add this URL: http://ru.wikipedia.org/w/index.php?diff=30178038


Version: unspecified
Severity: major

Details

Reference
bz26332

Event Timeline

bzimport raised the priority of this task to High. · Nov 21 2014, 11:22 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz26332.

Presumably the SpamBlacklist extension needs to be modified to use the /u flag on the regexes it builds, so that they are interpreted as UTF-8.
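To illustrate why the flag matters (a Python sketch for demonstration, not the extension's actual PHP code): without Unicode mode, a byte-oriented regex engine sees UTF-8 text as opaque bytes, and assertions like \b stop working against non-ASCII domains:

```python
import re

text = "spam link: пример.испытание here"
data = text.encode("utf-8")  # what a byte-oriented regex engine sees

# With a Unicode-aware pattern (analogous to PCRE's /u flag), \b works:
assert re.search(r"\bпример\.испытание", text)

# Against raw UTF-8 bytes, \b fails: the space before the domain and the
# lead byte \xD0 are both non-word bytes, so no word boundary exists there.
byte_pattern = rb"\b\xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80"
assert re.search(byte_pattern, data) is None

# Dropping the \b makes the byte-level pattern match again:
assert re.search(byte_pattern[2:], data)
```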

As a temporary workaround, you can escape Unicode characters as \xHH sequences (replacing HH with the hex value of each UTF-8 byte). For example:

\bмакросъемка\.рф becomes \b\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84

\bпример\.испытание becomes \b\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5

alex.lazovsky wrote:

At first look, this workaround does not work:
http://ru.wikipedia.org/w/index.php?diff=30229518
http://ru.wikipedia.org/w/index.php?diff=30229527

Now I use AbuseFilter http://ru.wikipedia.org/wiki/Special:AbuseFilter/117 to block such links, but this approach has some drawbacks.

Sorry, the workaround should not have the \b in it (presumably because bytes like \xD0 aren't word characters outside UTF-8 mode).

\bмакросъемка\.рф becomes
\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84

\bпример\.испытание becomes
\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
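The escaped forms above can be generated mechanically. A small sketch (a hypothetical helper, not part of the extension): regex-escape ASCII characters and hex-escape the UTF-8 bytes of everything else:

```python
import re

def blacklist_escape(domain: str) -> str:
    """Escape a domain for a byte-oriented (non-/u) regex engine:
    regex-escape ASCII, hex-escape the UTF-8 bytes of everything else."""
    out = []
    for ch in domain:
        if ord(ch) < 0x80:
            out.append(re.escape(ch))
        else:
            out.append("".join("\\x%02X" % b for b in ch.encode("utf-8")))
    return "".join(out)

print(blacklist_escape("пример.испытание"))
# → \xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
```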


Would someone who knows about such things be able to comment on whether adding the /u flag to the generated regexes would have any adverse performance effects?

alex.lazovsky wrote:

This workaround works fine, thanks!

Alex

I haven't tried profiling, but tossing a /u on in SpamRegexBatch::buildRegexes() doesn't seem to break anything, at least. It should, however, be double-checked against the full-size blacklists.

However, this isn't necessarily sufficient for handling IDN domain spam, as it won't match the punycode form of the name if it's linked that way. It may require some normalization to really do this right.

Created attachment 8465
Suggested patch

Could you verify that the attached patch is where you think the /u should go to fix this?

Attached:

Any update? The workaround and/or the /u should be documented too. People at meta don't even know that there's a problem with unicode characters (https://meta.wikimedia.org/w/index.php?title=Talk:Spam_blacklist#non-ascii_are_not_blocked.3F).

Also, the workaround doesn't work on the Thai Wikipedia. We wish to block \bสุนัขไทยหลังอาน\.com, so we put \xe0\xb8\xaa\xe0\xb8\xb8\xe0\xb8\x99\xe0\xb8\xb1\xe0\xb8\x82\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2\xe0\xb8\xab\xe0\xb8\xa5\xe0\xb8\xb1\xe0\xb8\x87\xe0\xb8\xad\xe0\xb8\xb2\xe0\xb8\x99\.com, but it's still not blocked.

Try:

\b%E0%B8%AA%E0%B8%B8%E0%B8%99%E0%B8%B1%E0%B8%82%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%87%E0%B8%AD%E0%B8%B2%E0%B8%99.com

Yes, it works! Thx :) Again, this should be documented...
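For the record, the percent-encoded form works because links are matched in the form they appear in wikitext, where non-ASCII URL characters are typically percent-encoded. The %XX sequence for any string can be generated easily (a Python sketch, using the standard library):

```python
from urllib.parse import quote

# Percent-encode the UTF-8 bytes, giving the form that appears in URLs:
encoded = quote("สุนัขไทยหลังอาน")
print(encoded)
# → %E0%B8%AA%E0%B8%B8%E0%B8%99%E0%B8%B1%E0%B8%82%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%87%E0%B8%AD%E0%B8%B2%E0%B8%99
```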

The use of non-ASCII characters is ever increasing. I am wondering whether there is the capacity to get this addressed? Otherwise, can we have a Labs tool that generates the escaped form, to help out those who cannot easily produce it themselves?

@kaldari is this something that can have a poke within your project of fixes?

It looks like /u was added to the regexes back in 2011. Is this still broken?

> It looks like /u was added to the regexes back in 2011. Is this still broken?

The issue is probably not the regexes themselves but how we canonicalize non-ASCII URL characters. The regexes should maybe be post-processed to apply the same conversion (though perhaps not for the domain part of IDNs).
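The kind of canonicalization being suggested could look roughly like this (a hypothetical Python sketch; `canonicalize` is an invented helper, not the extension's code): decode punycode hostnames and percent-encoded paths into one Unicode spelling, so a single Unicode rule can match every way the link might be written:

```python
from urllib.parse import urlsplit, unquote

def canonicalize(url: str) -> str:
    """Hypothetical canonicalizer: decode punycode hostnames and
    percent-encoded paths so one Unicode rule matches every spelling."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if "xn--" in host:
        host = host.encode("ascii").decode("idna")
    return host + unquote(parts.path)

# A punycode host with a percent-encoded path normalizes to plain Unicode:
punycode_host = "пример.испытание".encode("idna").decode("ascii")
assert canonicalize("http://%s/%%D0%%BF" % punycode_host) == "пример.испытание/п"
```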

Can I flag this one to the community again? We are back to facing issues at Meta with the spam blacklist. The above workarounds are not working, and I believe that I have followed the stated conventions. Thanks.

Billinghurst added a subscriber: Aklapper.

@Aklapper Are we able to reopen this, or do I need to file a new bug report?

The component 㥗釛, as part of a URL like 㥗釛.domain.com, is not being blocked.

I will rephrase that: it works when entered as the converted hex text, though not as Unicode text. It would be most useful not to have to do a conversion.