
Large blacklists fail due to PCRE regex size limits
Closed, Resolved · Public

Description

Author: kichik

Description:
The SpamBlacklist extension uses preg_match to search for bad URLs. When used
along with a big blacklist like chongqed's[1], it fails with the following
warning message:

Warning: preg_match(): Compilation failed: regular expression too large at offset 0

According to the PCRE source code used in PHP[2], the compiled pattern limit is 64 KB
(2^16 bytes). That is fairly large, but chongqed's list is over 100 KB.
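To make the failure mode concrete, here is a minimal sketch (the domain names are invented for illustration) showing that a single alternation built from a large blacklist easily exceeds 2^16 bytes before PCRE even tries to compile it; a PCRE build with that limit rejects such a pattern at compile time.

```php
<?php
// Illustrative only: build one big alternation the way the extension does
// and measure the resulting pattern source.
$domains = array();
for ( $i = 0; $i < 10000; $i++ ) {
	$domains[] = sprintf( 'spam-domain-%05d.example', $i );
}
$pattern = '/http:\/\/[a-z0-9\-.]*(' . implode( '|', $domains ) . ')/Si';

// Already far beyond the 64 KB compiled-pattern limit described above.
$tooBig = strlen( $pattern ) > 65536;
```

Note that newer PHP builds based on PCRE2 allow much larger patterns, but on the PCRE build referenced in [2] a pattern of this size makes preg_match() fail with the "regular expression too large" warning.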

It's possible to use eregi instead, but it's very slow. On SourceForge servers,
it even dies because of the 10-second execution time limit on medium-sized pages.

Attached is a patch that first scans for each bad URL with a plain substring
search, and only then applies a small regular expression to that single entry. The
regular expression is used just to confirm that an actual URL was found and not
merely a bare mention of the domain name.

Thinking about this again, I realized I'm forfeiting the advantages of a regular
expression for catching bad URLs. However, since actual use of regular-expression
features in the blacklist is very low, I believe this is a very reasonable trade-off,
especially considering that the alternative is having no blacklist at all.
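The approach described above can be sketched as follows (the function name and test data are illustrative, not part of the actual patch): a cheap case-insensitive substring scan finds a candidate, and a tiny per-entry regex then confirms a real URL.

```php
<?php
// Minimal sketch of the substring-prefilter approach (illustrative names).
function findSpamUrl( $text, $blacklist ) {
	foreach ( $blacklist as $badurl ) {
		// Cheap case-insensitive substring scan first; no big regex is compiled.
		if ( stristr( $text, $badurl ) !== false ) {
			// Confirm a real URL, not just a bare mention of the domain.
			$pattern = '/https?:\/\/[a-z0-9\-.]*' . preg_quote( $badurl, '/' ) . '/i';
			if ( preg_match( $pattern, $text ) ) {
				return $badurl;
			}
		}
	}
	return false;
}
```

Because each confirming pattern contains only one blacklist entry, no pattern ever approaches the PCRE size limit, at the cost of scanning the text once per matching candidate.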

It might be possible to split the regular expression into several smaller
expressions, each under 64 KB, and try them one by one. But I haven't tried
going down that road.

[1] http://blacklist.chongqed.org/
[2]
http://chora.php.net/co.php/php-src/ext/pcre/pcrelib/pcre_internal.h?php=ea47bb18dd995a11cb26cbb56196f3a4&r=1.1#198


Version: unspecified
Severity: major
URL: http://meta.wikimedia.org/wiki/SpamBlacklist_extension

Details

Reference
bz3632

Event Timeline

bzimport raised the priority of this task to High. · Nov 21 2014, 8:50 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz3632.
bzimport added a subscriber: Unknown Object (MLST).

kichik wrote:

the proposed patch

attachment SpamBlacklist_body.php.patch ignored as obsolete

gregory.szorc wrote:

Modification to avoid size limit

Modification of original extension so that the list obtained is sliced into
segments small enough to fit into preg_match without error.

attachment SpamBlacklist_body.php ignored as obsolete

kichik wrote:

Comment on attachment 943
the proposed patch

Index: SpamBlacklist/SpamBlacklist_body.php

RCS file: /cvsroot/wikipedia/extensions/SpamBlacklist/SpamBlacklist_body.php,v
retrieving revision 1.11
diff -u -r1.11 SpamBlacklist_body.php
--- SpamBlacklist/SpamBlacklist_body.php	13 Jul 2005 22:41:14 -0000	1.11
+++ SpamBlacklist/SpamBlacklist_body.php	6 Oct 2005 22:02:30 -0000
@@ -3,7 +3,7 @@
 if ( defined( 'MEDIAWIKI' ) ) {

 class SpamBlacklist {
-	var $regex = false;
+	var $blacklist = false;
 	var $previousFilter = false;
 	var $files = array();
 	var $warningTime = 600;
@@ -55,11 +55,11 @@
 		}
-		if ( $this->regex === false || $recache ) {
+		if ( $this->blacklist === false || $recache ) {
 			if ( !$recache ) {
-				$this->regex = $wgMemc->get( "spam_blacklist_regex" );
+				$this->blacklist = $wgMemc->get( "spam_blacklist_blacklist" );
 			}
-			if ( !$this->regex ) {
+			if ( !$this->blacklist ) {
 				# Load lists
 				$lines = array();
 				wfDebug( "Constructing spam blacklist\n" );
@@ -100,31 +100,35 @@
 				# Strip comments and whitespace, then remove blanks
 				$lines = array_filter( array_map( 'trim', preg_replace( '/#.*$/', '', $lines ) ) );
-				# No lines, don't make a regex which will match everything
+				# No lines, don't make a blacklist which will match everything
 				if ( count( $lines ) == 0 ) {
 					wfDebug( "No lines\n" );
-					$this->regex = true;
+					$this->blacklist = false;
 				} else {
-					# Make regex
-					# It's faster using the S modifier even though it will usually only be run once
-					$this->regex = 'http://[a-z0-9\-.]*(' . implode( '|', $lines ) . ')';
-					$this->regex = '/' . str_replace( '/', '\/', preg_replace('|\\\*/|', '/', $this->regex) ) . '/Si';
+					$this->blacklist = $lines;
 				}
-				$wgMemc->set( "spam_blacklist_regex", $this->regex, 3600 );
+				$wgMemc->set( "spam_blacklist_blacklist", $this->blacklist, 3600 );
 			}
 		}
-		if ( $this->regex{0} == '/' ) {
+
+		$retVal = false;
+
+		if ( $this->blacklist ) {
 			# Do the match
-			wfDebug( "Checking text against regex: {$this->regex}\n" );
-			if ( preg_match( $this->regex, $text, $matches ) ) {
-				wfDebug( "Match!\n" );
-				EditPage::spamPage( $matches[0] );
-				$retVal = true;
-			} else {
-				$retVal = false;
+			wfDebug( "Checking text against blacklist\n" );
+			foreach ( $this->blacklist as $badurl ) {
+				if ( stristr( $text, $badurl ) ) {
+					$pattern = '/' . preg_quote('http', '/') . 's?' . preg_quote('://', '/');
+					$pattern .= '[a-z0-9\-\.]*' . preg_quote($badurl, '/') . '/i';
+					if ( preg_match( $pattern, $text ) ) {
+						wfDebug( "Match!\n" );
+						EditPage::spamPage( $badurl );
+						$retVal = true;
+						break;
+					}
+				}
 			}
-		} else {
-			$retVal = false;
 		}
 		wfProfileOut( $fname );

vaclav wrote:

This patch wasn't merged into the main tree, and it no longer applies to the
current SpamBlacklist extension version, so the problem with oversized regexes
persists. Could this improvement be merged into the main tree, or could someone
update this patch, please?

clodoaldo.pinto wrote:

Patch to avoid "regular expression too large" error

This patch still uses preg_match(). To avoid the "regular expression too large"
error, it splits the regex into smaller pieces using array_chunk().

Attached:
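The chunking approach described above can be sketched as follows (chunk size and function names are illustrative, not taken from the patch): the blacklist is split with array_chunk() so that no single compiled pattern approaches the PCRE limit, and the resulting regexes are tried one by one.

```php
<?php
// Illustrative sketch of splitting the blacklist into per-chunk regexes.
function buildChunkedRegexes( $lines, $chunkSize = 500 ) {
	$regexes = array();
	foreach ( array_chunk( $lines, $chunkSize ) as $chunk ) {
		// Quote each entry so plain domains are matched literally.
		$quoted = array_map( function ( $d ) { return preg_quote( $d, '/' ); }, $chunk );
		$regexes[] = '/https?:\/\/[a-z0-9\-.]*(' . implode( '|', $quoted ) . ')/Si';
	}
	return $regexes;
}

function matchChunkedBlacklist( $text, $regexes ) {
	foreach ( $regexes as $regex ) {
		if ( preg_match( $regex, $text, $m ) ) {
			return $m[1]; // the blacklisted entry that matched
		}
	}
	return false;
}
```

This keeps the speed of a single combined scan per chunk while staying under the compiled-pattern limit; the trade-off is one pass over the text per chunk.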

robchur wrote:

Fixed in r16537, with further improvements in r19197.