
$wgSpamRegex should be separated into summary- and page text-regex
Closed, ResolvedPublic

Description

In 1.14.0 RELEASE-NOTES we see

  • $wgSpamRegex now matches the edit summary and page move descriptions in addition to body text.

I'm sorry, but that's absolutely crazy, reckless, irresponsible. I'm
commenting it out in EditPage.php:

# Check for spam
$match = false; #JIDANNI turning OFF!!: $match = self::matchSpamRegex( $this->summary );

Please consider, e.g.:

$wgSpamRegex=array('/^\B$/',

This regular expression is what our wiki uses to prevent vicious page
blanking. (By the way, if one triggers it, oddly the function that
usually shows the user what the problem was doesn't say anything.)

Anyway, a blanked page is bad, but a blank comment is fine!

Now let's look at another regexp we use on our sites:

'/^[^{][[:ascii:]]*$/');

This regular expression means the user's edit must have at least one
Chinese character in it, because our wikis are all zh-tw language
wikis, and a pure ASCII post is surely spam.

However, a quick English, or NULL _summary_ is very common and
accepted on our wikis.
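
To make the problem concrete, here is a minimal sketch that runs the two patterns quoted above against a few sample strings. The helper name firstSpamMatch and the sample texts are made up for illustration; this is not MediaWiki's own matchSpamRegex.

```php
<?php
// The two patterns quoted above, exactly as they appear in this report.
$wgSpamRegex = array( '/^\B$/', '/^[^{][[:ascii:]]*$/' );

// Hypothetical helper for illustration only (not MediaWiki's matchSpamRegex):
// returns the first pattern that matches $text, or false if none does.
function firstSpamMatch( array $regexes, $text ) {
    foreach ( $regexes as $regex ) {
        if ( preg_match( $regex, $text ) === 1 ) {
            return $regex;
        }
    }
    return false;
}

// Made-up sample inputs: an empty string, pure ASCII, and Chinese text.
$samples = array(
    'blanked text'      => '',
    'ASCII-only text'   => 'fix typo',
    'text with Chinese' => '修正錯字',
);
foreach ( $samples as $label => $text ) {
    $hit = firstSpamMatch( $wgSpamRegex, $text );
    echo $label, ': ', ( $hit === false ? 'accepted' : "rejected by $hit" ), "\n";
}
```

The empty string and the ASCII-only string are both rejected, which is what we want for page text but not for an edit summary.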

Anyway, the rash decision to glue 'edit summary', 'page move descriptions',
and 'body text' together will have users banging down my door asking why
their postings are getting rejected now! Please let the
administrator glue them together if he wishes:
($wgSpamRegex['edit summary']= $wgSpamRegex['page move descriptions']=
$wgSpamRegex['body text'];) Don't arbitrarily glue them all together
for us!

Please instead run each one as a separate test.
You (the MediaWiki team) can have an array of arrays, and just do something like the PHP
version of
foreach('edit summary', 'page move description', 'body text' as $bla){ run the matcher of $wgSpamRegex[$bla] on $get->$bla}
or however you write it in PHP, which I am poor at.
And of course you need three different MediaWiki:Spamprotectiontext
messages now too. And please allow us to set them in LocalSettings.php:
$wgSpamProtectionText['body text']= and the other two too. Setting
them in MediaWiki:Spamprotectiontext is a big pain when you are making
a Wiki Family.
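
The request above might look roughly like the following sketch. Everything here is an assumption for illustration: the per-field array keys, the $spamRegexByField and $spamProtectionText variables, and the checkSubmission function are hypothetical and are not existing MediaWiki settings or APIs.

```php
<?php
// Hypothetical per-field configuration. These array keys are assumptions
// made for illustration; they are not existing MediaWiki settings.
$spamRegexByField = array(
    'body text'             => array( '/^\B$/', '/^[^{][[:ascii:]]*$/', '/<[Aa]/' ),
    'edit summary'          => array( '/<[Aa]/' ),
    'page move description' => array( '/<[Aa]/' ),
);

// Hypothetical per-field messages, in the spirit of $wgSpamProtectionText.
$spamProtectionText = array(
    'body text'             => 'The page text was rejected by the spam filter.',
    'edit summary'          => 'The edit summary was rejected by the spam filter.',
    'page move description' => 'The move reason was rejected by the spam filter.',
);

// Runs each field's own patterns against that field only.
// Returns an array of field => first matching pattern (empty if all pass).
function checkSubmission( array $regexByField, array $submission ) {
    $failures = array();
    foreach ( $regexByField as $field => $regexes ) {
        if ( !isset( $submission[$field] ) ) {
            continue; // field not part of this action, e.g. no move reason on an edit
        }
        foreach ( $regexes as $regex ) {
            if ( preg_match( $regex, $submission[$field] ) === 1 ) {
                $failures[$field] = $regex;
                break;
            }
        }
    }
    return $failures;
}

// Made-up edit: Chinese page text with a blank summary.
$edit = array( 'body text' => '新增公車站資料', 'edit summary' => '' );
foreach ( checkSubmission( $spamRegexByField, $edit ) as $field => $regex ) {
    echo $spamProtectionText[$field], " ($regex)\n";
}
```

In this sketch the blank summary passes, because the blanking pattern is listed only under 'body text', while the link pattern is still applied to all three fields.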

By the way, we also have a rule
/{{[Cc]\|\d\d\d\.\d{0,3}}}/
that I mention in Spamprotectiontext:
Radio frequencies must have at least four digits after the decimal place.

What would be neat is if each regexp could have its own optional text
that gets printed out.

Ah, you might say I should stop complaining and use this mentioned in DefaultSettings.php:

 * For a complete example, have a look at the SpamBlacklist extension.
 */
$wgFilterCallback = false;

Well I'll have you know that I did look at it, and it is all 100 times
overkill and un-understandable gobbledygook, so sorry. It didn't help
me one bit.

Anyway, I was doing fine until you glued all the tests together.
Next time I'll test while your release candidate is fresh. Sorry I
only discovered this (glue mess) now.

By the way, I also use /<[Aa]/, which stops attempted spam links. This
regexp I wish to use in all three places: summary, body text, etc.

I.e., I cannot live for long with no summary filtering (caused by my
above commenting out), as I know it is only a matter of time before they
attack, therefore I hope you will separate the three tests (and not
just toss in some var $ignoreEditSummary), by version 1.14.1. Thank
you.


Version: unspecified
Severity: normal

Details

Reference
bz17677

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:32 PM
bzimport set Reference to bz17677.

P.S. the above example should be '/^[[:ascii:]]*$/');

(No need to show the "[^{]", which is our local (
http://taizhongbus.jidanni.org/index.php?title=Template:B
http://radioscanningtw.jidanni.org/index.php?title=Template:C )
jazz, meaning it is OK to not even have one Chinese character, if one is
entering a bus stop or police frequency via these templates.)

Or maybe an even fancier array is needed
[/REGEXP1/,0,1,0,"No xyz allowed"]
[/REGEXP2/,1,1,1,null]
...
The 0,1,0 flags select which of the three tests the regexp applies to, followed by an optional message; if the message is null, it just prints the /REGEXP/ that triggered.

I.e., instead of three arrays, which will probably have a lot of duplication, use one array... OK, anything is OK, except the current gluing with no way to unglue short of hacking the source.

Hmmm, [/REGEXP2/,1,1,1,null] doesn't look too expandable for the future with more tests added. Sorry.
Maybe [/REGEXP2/,[1,1,1],null] would be better, so if a fourth test was added, older LocalSettings would still work.
(By older, I mean older than 1.14.2, but younger than 1.14.0 :-) OK, bye.)
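
A minimal sketch of that nested-array idea, assuming a rule format of array( regex, array( flags ), message or null ). The $rules variable, the findViolation function, and the flag order (body text, edit summary, move reason) are all hypothetical; nothing like this exists in MediaWiki as far as this report knows.

```php
<?php
// Hypothetical rule format: array( regex, array( body, summary, move ), message|null ).
// The flag order is an assumption made for this sketch.
$rules = array(
    array( '/{{[Cc]\|\d\d\d\.\d{0,3}}}/', array( 1, 0, 0 ),
        'Radio frequencies must have at least four digits after the decimal place.' ),
    array( '/<[Aa]/', array( 1, 1, 1 ), null ),
);

// $texts is a positional list in the same order as the flags.
// Returns the rule's message (or its regex if the message is null)
// for the first rule that triggers, or null if nothing triggers.
function findViolation( array $rules, array $texts ) {
    foreach ( $rules as $rule ) {
        list( $regex, $flags, $message ) = $rule;
        foreach ( $texts as $i => $text ) {
            // A flag missing from an older, shorter list is treated as "off",
            // so adding a fourth test later would not break old settings.
            if ( !empty( $flags[$i] ) && preg_match( $regex, $text ) === 1 ) {
                return $message !== null ? $message : $regex;
            }
        }
    }
    return null;
}

// Made-up submission: too few decimals in the body, empty summary and move reason.
echo findViolation( $rules, array( '{{C|145.25}}', '', '' ) ), "\n";
```

Keeping the flags in their own inner array is what lets the list grow without shifting the position of the message.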

ayg wrote:

You make some fairly good points. This change ignores some fairly reasonable use-cases for the spam regex.

(In reply to comment #4)

> You make some fairly good points. This change ignores some fairly reasonable
> use-cases for the spam regex.

Thanks. By the way, the patterns I mentioned, and no more, have kept us 100% spam free for years!

Also consider that some regexes are not applicable to the summary, so running them there is a wasted check.
IMHO a different regex for the summary is the way to go.
$wgSummarySpamRegex = $wgSpamRegex; is easy enough for people who like using the same one.