Author: kichik
Description:
The SpamBlacklist extension uses preg_match to search for bad URLs. When used
along with a big blacklist like chongqed's[1], it fails with the following
warning message:
Warning: preg_match(): Compilation failed: regular expression too large at offset 0
According to the PCRE source code used in PHP[2], the pattern size limit is
64 KB (2^16 bytes). That is pretty big, but chongqed's list is over 100 KB.
It's possible to use eregi instead, but it's very slow. On SourceForge servers,
it even dies on medium-sized pages because of the 10-second execution time
limit.
Attached is a patch that first looks for the bad URLs without using a regular
expression, and uses the regular expression only once, after one of them has
been found. The regular expression is used just to make sure a URL was really
found and not just the domain name.
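The idea can be sketched roughly as follows. This is not the attached patch: it is written in Python rather than PHP, the function name find_blacklisted_url is made up for illustration, blacklist entries are assumed to be plain domain strings, and the "is this really a URL" check is assumed to mean "preceded by http:// or https://".

```python
import re


def find_blacklisted_url(text, blacklist):
    """Return the first blacklist entry found inside a URL in text, else None.

    Hypothetical helper: entries are treated as literal strings, not as
    regular expression fragments.
    """
    lowered = text.lower()
    for entry in blacklist:
        needle = entry.lower()
        # Cheap check first: plain substring search, no regular expression.
        if needle not in lowered:
            continue
        # Small regular expression, built for this one entry only, to confirm
        # the hit is part of a URL and not just a mention of the domain name.
        pattern = r"https?://[a-z0-9\-.]*" + re.escape(needle)
        if re.search(pattern, lowered):
            return entry
    return None
```

Because the regular expression is built per entry, it stays tiny no matter how large the blacklist grows.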
Thinking about this again, I realized I'm forfeiting the advantages of a regular
expression to catch bad URLs. However, as regular expression usage is very low,
I believe this is a very reasonable trade-off, especially if you consider that
the other possibility is not having a blacklist at all.
It might be possible to split the regular expression into several smaller
regular expressions, each under 64 KB, and try them one by one. But I haven't
tried going down that road.
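An untested sketch of what that splitting might look like, again in Python for illustration. The names build_chunked_patterns and matches_any are hypothetical, entries are escaped and treated as literals, and the length of the pattern source is used as a rough stand-in for the compiled size that PCRE actually limits (Python's re module has no such 64 KB limit, so this only demonstrates the chunking logic).

```python
import re

LIMIT = 64 * 1024  # assumed per-pattern budget, measured on the pattern source


def build_chunked_patterns(entries, limit=LIMIT):
    """Group entries into alternations whose source length stays near limit."""
    chunks, current, size = [], [], 0
    for entry in entries:
        piece = re.escape(entry)
        added = len(piece) + 1  # +1 for the '|' separator
        # Start a new chunk when this entry would push the current one over.
        if current and size + added > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(piece)
        size += added
    if current:
        chunks.append(current)
    return [
        re.compile(r"https?://[a-z0-9\-.]*(?:" + "|".join(chunk) + ")",
                   re.IGNORECASE)
        for chunk in chunks
    ]


def matches_any(text, patterns):
    """Try the smaller patterns one by one; stop at the first match."""
    return any(p.search(text) for p in patterns)
```

A real implementation would need to leave headroom below the limit, since the compiled form of a pattern is not the same size as its source.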
[1] http://blacklist.chongqed.org/
[2] http://chora.php.net/co.php/php-src/ext/pcre/pcrelib/pcre_internal.h?php=ea47bb18dd995a11cb26cbb56196f3a4&r=1.1#198
Version: unspecified
Severity: major
URL: http://meta.wikimedia.org/wiki/SpamBlacklist_extension