
Improve spam filtering for Mailman mailing lists
Closed, Duplicate (Public)

Description

The volume of spam getting through to the mailing list moderators on multiple Mailman-based mailing lists is increasing very significantly. It is common to see the same email address spam multiple lists one after the other. I moderate half a dozen lists, and in total am seeing about 150-200 spam emails being sent to the moderation queue on a daily basis, even after auto-discarding emails from the hundreds of addresses already on "auto-discard".

The volume started increasing around August and is steadily rising.


Version: wmf-deployment
Severity: normal

Details

Reference
bz56525

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 2:33 AM)
bzimport set Reference to bz56525.
bzimport added a subscriber: Unknown Object (MLST).

Does that mean that Mailman's /privacy/spam section is not sufficient?
Wondering if http://jamesh.id.au/articles/mailman-spamassassin/ would be overkill.

I'm not sure what is expected by this report. :-/

Thehelpfulonewiki wrote:

We do have some sort of configuration of SpamAssassin on sodium, but I'm not sure how up to date the details at https://wikitech.wikimedia.org/wiki/Mailing_lists#SpamAssassin are. Currently I believe SA adds X-Spam-Score headers to the emails, which allow you to configure spam filters through the list admin interface (at https://lists.wikimedia.org/mailman/admin/list-name/privacy/spam).

Risker, do you think we should add some of this spam filtering as a default across all mailing lists? Please could you check some of the messages that you go through this week to see if they have an X-Spam-Score header (you can check this when you are going through the moderation queue by clicking on the link to read full details about each email), and if they do, note a rough idea of the score that they're given?

we do run spamassassin and it scores the mails for you already. You can see the score in the headers as X-Spam-Score. what is missing is activating the filtering on it via the mailman ui, which is a "per list" thing a list admin can do. basically you put a regex in to filter by spam score. RobH just recently did that for the ops list, for example.

the main issue is finding the right score threshold to filter. this might vary per list. also, you can choose between just "hold" or "discard" as an action. when experimenting with the spam filter i suggest you select just "hold" first, which will prevent the messages from being delivered but you can still check them as a list admin/mod to make sure they are not false positives

Thehelpfulonewiki wrote:

I imagine most lists have mail from non-members set to hold, Daniel, so the "discard" option would probably be preferred; the difficulty is finding a level that's high enough to minimise the false positives. From the lists that I admin, this could be around 3+, but I'd like to see what scores other admins are getting.

try: Privacy options... -> Spam filters -> Spam Filter Regexp -> put in the value "x-spam-status: yes" -> Select your "action"

this would be the default settings. if that doesn't prove effective there should be other ways to make more specific regexes, see:

https://lists.wikimedia.org/mailman/admin/<LISTNAME>/?VARHELP=privacy/spam/header_filter_rules

Thehelpfulonewiki wrote:

So far I've been using X-Spam-Score: \d{1,2}\.\d \(\+{3,}\) as the Regexp - does yours affect all messages that have any X-Spam-Status?

Thehelpfulone, ya, that example using X-Spam-Score is what i used before and is basically what i meant by "other ways". The way i described above is another option that RobH activated today, to try out the defaults and see how well they work as opposed to specific values we'd have to pick.
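
For anyone comparing the two rules discussed above, here is a minimal Python sketch of how they behave against sample headers. It assumes the Spam Filter Regexp is applied as a case-insensitive regex search over the header text (I believe that is how Mailman 2 treats these entries, but that is worth verifying for your version). The `matches` helper and the three sample header snippets are made up for illustration.

```python
import re

# The two rules discussed in this thread.
STATUS_RULE = r"x-spam-status: yes"
SCORE_RULE = r"X-Spam-Score: \d{1,2}\.\d \(\+{3,}\)"


def matches(rule, headers):
    """Return True if the rule matches anywhere in the header text.

    Assumption: matching is a case-insensitive search, not a full match.
    """
    return re.search(rule, headers, re.IGNORECASE) is not None


# Hypothetical header snippets at three different scores.
low = "X-Spam-Score: 2.4 (++)\nX-Spam-Status: No, score=2.4"
mid = "X-Spam-Score: 3.1 (+++)\nX-Spam-Status: No, score=3.1"
high = "X-Spam-Score: 10.1 (++++++++++)\nX-Spam-Status: Yes, score=10.1"

for name, sample in [("low", low), ("mid", mid), ("high", high)]:
    print(name, matches(SCORE_RULE, sample), matches(STATUS_RULE, sample))
```

Under these assumptions the score rule catches anything with three or more plus signs (a score of 3 and up), while the status rule only fires once SpamAssassin itself has labelled the message as spam.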

(In reply to comment #6)

try: Privacy options... -> Spam filters -> Spam Filter Regexp -> put in the value "x-spam-status: yes" -> Select your "action"

this would be default settings. if that doesn't prove effective there should be other ways to make more specific regexes, see;

https://lists.wikimedia.org/mailman/admin/<LISTNAME>/?VARHELP=privacy/spam/header_filter_rules

Thank you, Daniel, this might be the ticket. If I put in "x-spam-status: yes" does it send *all* messages with a spam value to wherever I send it? (Likely it would be "discard".)

Just to give you a notion of the extent of the spam, there are 69 spam messages to functionaries-en-L in less than 24 hours; 26 to arbcom-en-appeals; 22 to arbcom-L. I've got a couple more on my list, but they're not being spammed nearly as badly, probably because they've had a very low number of people who've either been subscribers or have had posts accepted over the last 3-5 years.


Example from one email that was definitely spam:

X-Spam-Score: 3.1 (+++)
X-Spam-Report: Spam detection software, running on the system "mchenry.wikimedia.org", has
identified this incoming email as possible spam. If you have any
questions, see the administrator of that system for details.
Content analysis details: (3.1 points, 4.0 required)

I note that other emails sent to the same mailing list replace "mchenry.wikimedia.org" with "sodium.wikimedia.org" - not sure if that is relevant.

On scanning several emails to functionaries-en-L, most of them are well above the 4.0 spam score.
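
To make the threshold discussion concrete, here is a rough Python sketch that pulls the numeric score out of an X-Spam-Score header and maps it to an action. The two-tier policy and the threshold values (`hold_at`, `discard_at`) are purely illustrative assumptions, not recommended settings; the action names mirror the choices mentioned earlier in this thread.

```python
import re

# Matches e.g. "X-Spam-Score: 3.1 (+++)" at the start of a header line.
SCORE_RE = re.compile(
    r"^X-Spam-Score:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE | re.MULTILINE
)


def spam_score(headers):
    """Extract the numeric X-Spam-Score from raw header text, or None."""
    m = SCORE_RE.search(headers)
    return float(m.group(1)) if m else None


def choose_action(score, hold_at=3.0, discard_at=6.0):
    """Hypothetical two-tier policy; the thresholds are illustrative only."""
    if score is None:
        return "defer"      # no score header: leave it to the other rules
    if score >= discard_at:
        return "discard"
    if score >= hold_at:
        return "hold"       # keep for moderator review while tuning
    return "defer"


example = "X-Spam-Score: 3.1 (+++)\nSubject: something"
print(spam_score(example), choose_action(spam_score(example)))
```

With the example above, a 3.1 message would be held rather than discarded, which fits the earlier advice to start with "hold" while checking for false positives.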

I'm concerned about tweaking the "admin_immed_notify" settings so that list admins and moderators get fewer emails (right now we get one for every email sent to moderation, every one auto-rejected, and every one auto-discarded). It seems the alternatives are lots of emails, which wouldn't change even if we up the spam filters (they'd be auto-rejected), or no emails, which makes it less likely that we can recover legitimate emails that sometimes get missed in the haystack of spam.

I frequently see the same email address spamming a lot of the lists.wikimedia.org mailing lists in a serial fashion (seeing the same email come up on 3-5 of the lists that I moderate, one after the other). Is that something that can be filtered before the spam goes all the way through the system?
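
As a sketch of what spotting this pattern could look like, the Python below groups held messages by sender and reports addresses seen on several lists. It assumes you can export (list name, sender) pairs from the moderation queues somehow; the `cross_list_senders` helper, the `min_lists` cutoff, and the sample addresses are all hypothetical.

```python
from collections import defaultdict


def cross_list_senders(held, min_lists=3):
    """Return senders whose held mail appears on at least `min_lists` lists.

    `held` is an iterable of (list_name, sender_address) pairs.
    """
    lists_by_sender = defaultdict(set)
    for list_name, sender in held:
        # Normalise the address so case differences don't split a sender.
        lists_by_sender[sender.strip().lower()].add(list_name)
    return {
        sender: sorted(lists)
        for sender, lists in lists_by_sender.items()
        if len(lists) >= min_lists
    }


# Made-up sample data using list names mentioned in this report.
held = [
    ("functionaries-en-L", "Spammer@example.com"),
    ("arbcom-en-appeals", "spammer@example.com"),
    ("arbcom-L", "spammer@example.com"),
    ("arbcom-L", "someone@example.org"),
]
print(cross_list_senders(held))
```

This only identifies candidates for a shared blocklist; it does not by itself stop the mail before it reaches the lists.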

(In reply to comment #10)

I frequently see the same email address spamming a lot of the
lists.wikimedia.org mailing lists in a serial fashion (seeing the same email
come up on 3-5 of the lists that I moderate, one after the other). Is that
something that can be filtered before the spam goes all the way through the
system?

I did a little bit of searching and found https://bugzilla.mozilla.org/show_bug.cgi?id=681460 which is duped to a WONTFIXED bug.

If the X-Spam-Score filter method works, maybe we just need sane defaults?

I looked into activating it for the Wikimediaannounce-l list a while ago, but unfortunately there are a lot of false positives, see the example below where SpamAssassin thought that I am trying to commit money fraud ;)

(Note that Mailman includes this report in the mail header, i.e. any list subscriber can look through past messages and help find such false positives.)


From: Tilman Bayer <tbayer@wikimedia.org>
Date: Thu, 14 Nov 2013 22:49:56 -0800
Message-ID: <CAPDdKA5q+Nr2J6XmEvXUR6Fg=HKZv=NKg_QpLG_FMru-k-CYFg@mail.gmail.com>
Subject: Wikimedia Foundation Report, October 2013
To: wikimediaannounce-l@lists.wikimedia.org
Cc: Staff All <wmfall@lists.wikimedia.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 10.1 (++++++++++)
X-Spam-Report: Spam detection software, running on the system "sodium.wikimedia.org", has
identified this incoming email as possible spam. The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email. If you have any questions, see
the administrator of that system for details.

Content preview: Hi all, please find below the WMF report for October 2013,

in plain text. As always, the editable and formatted version has been published
on Meta: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Report,_October_2013
[...]

Content analysis details: (10.1 points, 4.0 required)

 pts rule name               description
---- ----------------------- --------------------------------------------------
-0.0 SPF_PASS                SPF: sender matches SPF record
 2.5 US_DOLLARS_3            BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
 0.0 WEIRD_PORT              URI: Uses non-standard port number for HTTP
 0.0 LOTS_OF_MONEY           Huge... sums of money
 0.0 T_DKIM_INVALID          DKIM-Signature header exists but is not valid
 3.6 MONEY_FRAUD_3           Lots of money and several fraud phrases
 4.0 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
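
For those hunting false positives like this one, a small Python sketch can tally the rule lines from such a report and show which rules contributed most. The `parse_rules` helper and the line pattern are assumptions based only on the report format shown above; `Decimal` is used so the sum is exact.

```python
import re
from decimal import Decimal

# Rule lines copied from the report above.
REPORT = """\
-0.0 SPF_PASS SPF: sender matches SPF record
2.5 US_DOLLARS_3 BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
0.0 WEIRD_PORT URI: Uses non-standard port number for HTTP
0.0 LOTS_OF_MONEY Huge... sums of money
0.0 T_DKIM_INVALID DKIM-Signature header exists but is not valid
3.6 MONEY_FRAUD_3 Lots of money and several fraud phrases
4.0 ADVANCE_FEE_2_NEW_MONEY Advance Fee fraud and lots of money
"""

# points, rule name, then free-text description
LINE_RE = re.compile(r"^\s*(-?\d+\.\d+)\s+([A-Z0-9_]+)\s+(.*)$")


def parse_rules(report):
    """Return a list of (points, rule_name) tuples from report lines."""
    rules = []
    for line in report.splitlines():
        m = LINE_RE.match(line)
        if m:
            rules.append((Decimal(m.group(1)), m.group(2)))
    return rules


rules = parse_rules(REPORT)
total = sum(points for points, _ in rules)
top = sorted(rules, reverse=True)[:3]

print("total:", total)
for points, name in top:
    print(points, name)
```

Here the three money-related rules account for the whole 10.1 score, which is the kind of thing a per-list threshold would have to allow for on an announcement list that publishes financial reports.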

(In reply to comment #12)

I looked into activating it for the Wikimediaannounce-l list a while ago, but
unfortunately there are a lot of false positives, see the example below where
SpamAssassin thought that I am trying to commit money fraud ;)

LOL, I assume we can't customise SpamAssassin rules per-list? I hope humans don't subconsciously discard those phrases as spam-sounding too, though. :p

I have been managing WMFR's mailing lists for some years, and we have quite a low level of spam in moderation. I don't know much about Exim+Mailman (we use Postfix+Sympa), so I may be missing some things, but I wonder about three config strategies:

  • throttling: I didn't see many such config parameters, apart from "smtp_accept_max = 4000" and the parameters around it; perhaps a finer config here, possibly per host, could limit the spam volume.
  • DNSBL: this has proved very effective on WMFR's mailing lists (with Zen-Spamhaus), and I guess it is much cheaper than SpamAssassin; I know there are some arguments about such methods, and if you don't want to enable it to reject connections, perhaps there is some program that could take it into account when computing the spam score.
  • blacklists/whitelists: WMFR's lists are generally whitelisted for members of the list, WMFR's members, and members of an additional global whitelist, and moderated for everyone else. We have a low level of spam, so a global blacklist is not needed, but perhaps it would be worth using one for WMF mailing lists. I am thinking of some webpage (or something in the Mailman interface) where all list admins could add "addresses of known spammers", and this could be used either in SpamAssassin or directly in Exim. This would save time for all list admins. Or, in a reverse scenario, a global whitelist could be implemented, manual or semi-automatic, with members of any mailing list, so that the moderation queue would have fewer false positives (non-spam messages awaiting moderation).
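
To illustrate the DNSBL idea, here is a minimal Python sketch of how such a lookup is usually formed: the IPv4 octets are reversed and prepended to the blocklist zone, and the existence of a DNS record means "listed". The function names, the injectable `resolver` parameter, and treating any answer as a listing are simplifying assumptions; this is not how Exim or SpamAssassin implement it.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone="zen.spamhaus.org", resolver=socket.gethostbyname):
    """Return True if the query name resolves (simplified: any answer counts)."""
    try:
        resolver(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no record found: not listed
    return True


# 192.0.2.1 is a documentation-only address, used here just to show the name.
print(dnsbl_query_name("192.0.2.1"))
```

In practice the check would run at SMTP time in the mail server, or feed into the spam score, rather than in a separate script; the sketch only shows the shape of the lookup.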

(In reply to comment #12)

I looked into activating it for the Wikimediaannounce-l list a while ago, but
unfortunately there are a lot of false positives,

LOL, I assume we can't customise SpamAssassin rules per-list?

You _can_ set the spam scores (and actions) per list, and the above is exactly the reason why i don't think setting a global default would be a good idea (and there is no global config). So yes, do it per list.

JohnLewis claimed this task.
JohnLewis subscribed.

Ambitiously setting to invalid as this is a local list issue. Documentation on how to implement this has been discussed above and is available at https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Fighting_spam_in_mailman.

Is there a separate report for improving the defaults in Wikimedia mailing lists?

@Nemo_bis no, as there is no default or global concept for this until mailman version 3, unfortunately. Also, as stated above, from a technical standpoint all tools and software are provided, including code that makes this painless and easy for list admins to use.

Another challenge it would be great to find a solution for is spammy subscription requests. Those don't seem to go through SpamAssassin.

I think spam subscriptions will mainly come from the web interface. Moreover, I don't think SpamAssassin could do much to distinguish valid from invalid emails that produce a subscription.

If we are concerned about fraudulent email subscriptions, we should start by placing a captcha in the web subscription UI.

Related: T112912

Ambitiously setting to invalid as this is a local list issue. Documentation on how to implement this has been discussed above and is available at https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Fighting_spam_in_mailman.

The closure felt too hasty but I was willing to give it a try. Two years later it's clear we aren't able to chase the list administrators for each of hundreds of lists, and it turns out that actually there is some global setting in mailman 2 which can help. (JohnLewis was blocked in the meanwhile and cannot clarify, but I don't know whether he had considered the configuration variables in mm_cfg.py.)

Instead of reopening, I'm merging this report into the more specific one about how to fully use the spamassassin scores we already have, on the lists which don't already do so.