Author: alex.lazovsky
Description:
Example: http://ru.wikipedia.org/w/index.php?diff=30178032 http://пример.испытание was blacklisted, but I can add this url http://ru.wikipedia.org/w/index.php?diff=30178038
Version: unspecified
Severity: major
Reported via bzimport, Dec 14 2010, 12:51 PM
Attachment: F7367 tmp.diff (Nov 21 2014, 11:22 PM)
Presumably the SpamBlacklist extension needs to be modified to add the u flag to the regexes it builds, so that PCRE interprets them as UTF-8.
As a temporary workaround, you can escape Unicode characters as \xHH sequences (replace HH with the hex value of each UTF-8 byte). For example:
\bмакросъемка\.рф becomes \b\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84
\bпример\.испытание becomes \b\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
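The escaped forms above can be generated mechanically rather than by hand. A minimal Python sketch (the extension itself is PHP; to_hex_escapes is a hypothetical helper, not part of SpamBlacklist):

```python
def to_hex_escapes(entry: str) -> str:
    """Rewrite every non-ASCII byte of the UTF-8 encoding as \\xHH."""
    out = []
    for byte in entry.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))         # keep ASCII, including regex syntax like \ and .
        else:
            out.append("\\x%02X" % byte)  # e.g. byte 0xD0 -> \xD0
    return "".join(out)

print(to_hex_escapes(r"пример\.испытание"))
# -> \xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
```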
alex.lazovsky wrote:
At first look, this workaround does not work:
http://ru.wikipedia.org/w/index.php?diff=30229518
http://ru.wikipedia.org/w/index.php?diff=30229527
Now I use AbuseFilter http://ru.wikipedia.org/wiki/Special:AbuseFilter/117 to block such links, but this approach has some drawbacks.
Sorry, the workaround should not have the \b in it (presumably because bytes like \xD0 are not word characters when the regex is not treated as UTF-8).
\bмакросъемка\.рф becomes
\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84
\bпример\.испытание becomes
\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
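The reason the \b has to go can be reproduced outside PHP. In Python, a str pattern behaves roughly like PCRE with /u and a bytes pattern like PCRE without it (an illustrative analogy, not SpamBlacklist code):

```python
import re

text = "see http://пример.испытание/ for spam"
pattern = r"пример\.испытание"

# Unicode mode (like PCRE with /u): Cyrillic letters count as word
# characters, so the \b boundary before the name matches.
assert re.search(r"\b" + pattern, text) is not None

# Byte mode (like PCRE without /u): \b only recognizes ASCII word
# characters, and the lead byte \xD0 is not one, so \b never matches...
assert re.search(rb"\b" + pattern.encode("utf-8"), text.encode("utf-8")) is None

# ...while the same byte pattern without \b matches fine.
assert re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None
```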
Would someone who knows about such things be able to comment on whether adding the /u flag to the generated regexes would have any adverse performance effects?
I haven't tried profiling, but tossing a /u on in SpamRegexBatch::buildRegexes() doesn't seem to break anything, at least. It should, however, be double-checked against the full-size blacklists.
However, this isn't necessarily sufficient for handling IDN domain spam, since it won't match the Punycode form of the name if it's linked that way. Some normalization may be required to really do this right.
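The normalization mentioned here could look roughly like the following. A hedged Python sketch using the standard library's idna codec (normalize_host is a hypothetical name; a real fix would live in the PHP extension):

```python
def normalize_host(host: str) -> str:
    """Return the ASCII (Punycode) form of an IDN hostname for matching."""
    return host.encode("idna").decode("ascii")

# A Unicode hostname maps to its "xn--" form...
canonical = normalize_host("пример.испытание")
assert canonical.startswith("xn--")

# ...and already-ASCII names pass through unchanged, so both blacklist
# entries and linked URLs can be normalized before comparison.
assert canonical == normalize_host(canonical)
```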
Created attachment 8465: Suggested patch
Could you verify that the attached patch is where you think the /u should go to fix this?
Any update? The workaround and/or the /u flag should be documented too. People at Meta don't even know that there's a problem with Unicode characters (https://meta.wikimedia.org/w/index.php?title=Talk:Spam_blacklist#non-ascii_are_not_blocked.3F).
Also, the workaround doesn't work on the Thai Wikipedia. We wish to block \bสุนัขไทยหลังอาน\.com, so we put \xe0\xb8\xaa\xe0\xb8\xb8\xe0\xb8\x99\xe0\xb8\xb1\xe0\xb8\x82\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2\xe0\xb8\xab\xe0\xb8\xa5\xe0\xb8\xb1\xe0\xb8\x87\xe0\xb8\xad\xe0\xb8\xb2\xe0\xb8\x99\.com in the blacklist, but it's still not blocked.
Try:
\b%E0%B8%AA%E0%B8%B8%E0%B8%99%E0%B8%B1%E0%B8%82%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%87%E0%B8%AD%E0%B8%B2%E0%B8%99.com
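The percent-encoded entry above is just the UTF-8 bytes of the name, %-escaped, so it can be produced with any UTF-8 percent-encoder. For example, in Python (a sketch for generating the entry, not part of the extension):

```python
from urllib.parse import quote

# quote() escapes each non-ASCII UTF-8 byte as %HH (uppercase hex).
name = "สุนัขไทยหลังอาน"
print(quote(name) + ".com")
# -> %E0%B8%AA%E0%B8%B8%E0%B8%99%E0%B8%B1%E0%B8%82%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%87%E0%B8%AD%E0%B8%B2%E0%B8%99.com
```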
The use of alternate characters is ever increasing. I am wondering whether there is the capacity to get this addressed? Otherwise, could we have a Labs tool that generates the alternate code, to help out those who cannot easily work out the produced alternates?
@kaldari: is this something that could get a poke within your project of fixes?
The issue is probably not the regexes themselves but how we canonicalize non-ASCII URL characters. The regexes should maybe be post-processed to do the same conversion (but maybe not for the domain part of IDNs).
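One way to read this suggestion: before matching, canonicalize the host to its Punycode form but percent-decode the rest of the URL. A speculative Python sketch of that idea (canonicalize is a hypothetical helper, not existing SpamBlacklist behavior):

```python
from urllib.parse import urlsplit, unquote

def canonicalize(url: str) -> str:
    """Punycode the hostname but percent-decode the path, for matching."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        host = host.encode("idna").decode("ascii")  # IDN -> xn-- form
    except UnicodeError:
        pass                                        # leave odd hosts alone
    return "%s://%s%s" % (parts.scheme, host, unquote(parts.path))

url = canonicalize("http://ไทย.example/a%20b")
# The host comes out in its xn-- form, the path percent-decoded.
assert "//xn--" in url
assert url.endswith("/a b")
```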
Can I flag this one again for the community? We are back facing issues at Meta with the spam blacklist. The above workarounds are not working, and I believe that I have followed the stated conventions. Thanks.
@Aklapper: are we able to reopen this, or do I need to file a new bug report?
Components like 㥗釛, as part of a URL such as 㥗釛.domain.com, are not being blocked.
I will rephrase that: it works when entered as the converted hex text, though not as Unicode text. It would be most useful not to have to do a conversion.