Page MenuHomePhabricator

Mail notifications from fab.wmflabs.org delivered only days later (or not at all?)
Closed, ResolvedPublic

Description

Copying from http://fab.wmflabs.org/T348 ; confirmed by other users

Notification mail from http://fab.wmflabs.org/ has lag (and I think I also miss notifications, so they get lost). Two examples:

My last bugmail notification from http://fab.wmflabs.org/T335 I received was from Fri, 16 May 2014 20:24:51 +0000:

"aklapper added a subscriber: aklapper."

so I am explicitly subscribed to it. But I only realized that there are new comments on T335 via the feed icon on http://fab.wmflabs.org itself. I should have received notifications. My spam folder is empty and searching for "335" does not show any newer messages. I've checked both in my desktop email client and via the browser on mail.google.com.

Furthermore, for my visibility change in http://fab.wmflabs.org/T12 I received a notification message on Thu, 22 May 2014 23:31:27 UTC when I slept already for a few hours:

Delivered-To: aklapper@wikimedia.org
Received: by 10.140.47.144 with SMTP id m16csp235382qga; Thu, 22 May 2014
16:31:29 -0700 (PDT)
X-Received: by 10.224.134.194 with SMTP id k2mr1301090qat.5.1400801489448;
Thu, 22 May 2014 16:31:29 -0700 (PDT)
Return-Path: <root@wikimedia.org>
Received: from mchenry.wikimedia.org (mchenry.wikimedia.org.
[208.80.152.186]) by mx.google.com with ESMTP id
f1si1648140qai.52.2014.05.22.16.31.29 for <aklapper@wikimedia.org>; Thu, 22
May 2014 16:31:29 -0700 (PDT)
Received-SPF: pass (google.com: domain of root@wikimedia.org designates
208.80.152.186 as permitted sender)
Authentication-Results: mx.google.com; spf=pass (google.com: domain of
root@wikimedia.org designates 208.80.152.186 as permitted sender)
smtp.mail=root@wikimedia.org
Received: from [10.68.16.110] (port=33970 helo=fab2.eqiad.wmflabs) by
mchenry.wikimedia.org with esmtp (Exim 4.69) (envelope-from
<root@wikimedia.org>) id 1WncSN-00018T-UL for aklapper@wikimedia.org; Thu,
22 May 2014 23:31:29 +0000
Received: from jenkins-bot by fab2.eqiad.wmflabs with local (Exim 4.76)
(envelope-from <root@wikimedia.org>) id 1WncSN-0000Qp-N6 for
aklapper@wikimedia.org; Thu, 22 May 2014 23:31:27 +0000
Date: Thu, 22 May 2014 23:31:27 +0000

However I made this change on Tue, May 20, 16:55 UTC according to the date stamp in http://fab.wmflabs.org/T12 .
So there seems to be a delivery delay of more than two days.

My attention workflow is mostly email based so this is really really distracting.

<Coren> andre: If I had to venture a guess, that's a problem with the mail setup on the phabricator instances and not related to mchenry -- but it might be a problem in the *default* mail handling setup on instances.
My exim4 skillz are limited - I'm more of a postfix dude - but I could take a look at it if you open a bz for me (sorry, the labs workflow is still on the old tools) :-)
andre: I should say the first thing to look at would be the mail logs on the instance the email originates from; in particular, you want to look for emails being deffered.


Version: unspecified
Severity: major

Details

Reference
bz65861

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:12 AM
bzimport set Reference to bz65861.
bzimport added a subscriber: Unknown Object (MLST).

Another example from a few minutes ago:
http://fab.wmflabs.org/T148#26 says "Via Web · Fri, May 23, 6:25 AM"; email received more than five days later: Wed, 28 May 2014 17:31:39 UTC.

One useful bit of data that would be important to have: what instance is the email originally generated on? (I.e., is it the public facing one, a backend, etc)

Coren: Not sure if I understand the question correctly :-/

Received: from jenkins-bot by fab2.eqiad.wmflabs with local (Exim 4.76)

Or does this refer to how Phabricator internally handles mail?

Ideally, I need to be able to follow the mail flow from where it first enters the system - unless there is a relay /within/ the phabricator project, then fab2 is probably it. I'll start my investigation there.

The issue, for that instance at least, is that the MTA is running out of space for spooling mail (forcing them to be deferred) because /var is full.

In fab2's case, there are two proximate causes:

(a) While /var/log is properly on a separate volume, there exist two other log directories on the volume that take up several hundreds of megs of space (/var/log2 and /var/log3). At first glance, neither of those have been in actual use since May 13.

(b) while a separate volume has been set aside for the mysql database, phabricator itself seems to also use /var/tmp for large, persistent storage. Pointing it at (a subdirectory of) /mnt * would offer more resillience.

  • I note that the extra disk space allocated to the volume has been mounted at /mnt which is deprecated practice (in fact, it was never *correct* to do so but that is indeed how Labs used to do it while in Tempa). There is no necessity to change that now, but if the instance is rebuilt in the future it would be best if the "correct" mount point of /srv is used instead.

the instance is a completely manual setup and as has been pointed out before it really needs to be setup for real (via puppet, incl. log rotate etc., no manual mounts, enough space etc) (before it's actually being used for tickets)

At any rate, a simple stopgap would be to delete or move aside /var/log2 and /var/log3. This will free ~700M of space which will unclog email - at least as long as phabricator does not consume all of it).

afaik Chad has set it up but i doubt he will care about keeping those logs, so the suggestion sounds good

I've made tarballs of both directories to /mnt/log[23].tgz, freeing >600M of space.

Thanks so much, Coren (and Daniel)!

There seems to be a problem again for me.
I have received no Phabricator notifications since Thursday (though I have actively commented). http://fab.wmflabs.org/T294#38 was the last notification I received.
Could somebody take a look? :-/

/var was full again, because /var/tmp/phd/log had a 1.4G logfile(!)

I've truncated it to the last 50K lines, but this is trying to bail the sea with a spoon; at the very least that logfile needs to be configured to go into /var/log (which has more space) until the box is properly puppetized.

Also, given the speed of growth of logs, the proper setup will need to have aggressive log rotation.

Sigh. Thanks! Confirming I received mail now.

at the very least /var/tmp/phd/log needs to be configured to go into /var/log
(which has more space) until the box is properly puppetized.

Alright.

(In reply to Marc A. Pelletier from comment #13)

at the very least that logfile needs to be configured to go
into /var/log (which has more space) until the box is properly puppetized.

Is that being tracked anywhere?

Problem happens again, as expected. I've asked on Cloud-Services.

CC'ing Chase because this sounds relevant for the final setup:

(In reply to Marc A. Pelletier from comment #13)

/var was full again, because /var/tmp/phd/log had a 1.4G logfile(!)

I've truncated it to the last 50K lines, but this is trying to bail the sea
with a spoon; at the very least that logfile needs to be configured to go
into /var/log (which has more space) until the box is properly puppetized.

Also, given the speed of growth of logs, the proper setup will need to have
aggressive log rotation.

Same logfile. I've nuked it.