
no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs
Closed, ResolvedPublic

Description

I'm getting 503 errors from beta labs when I visit http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Topic:S2kt0b7x2jrv7gw0&action=compare-postsummary-revisions&topicsummary_newRevision=s3ems8b3lyhnoafm

I can't find these errors in deployment-bastion:/data/project/logs where I would expect to find something. I notice some weirdness:

  • No fatal.log since August 19 (maybe it's all in HHVM logs?)
  • No hhvm log
  • No apache log (but maybe this isn't relevant with hhvm?)
  • The error message and its HTTP headers give no indication of which server handled the request, so you have to visit all the front-end servers.

I think I found the problem in deployment-mediawiki02:/var/log/hhvm/error.log. Maybe this bug is more succinctly stated as "Server hhvm logs are not brought to deployment-bastion". FWIW, there is a fluorine:/a/mw-log/hhvm.log for production logs.
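For anyone else hitting this, a rough sketch of how to check each app server directly (the host list below is just the beta app servers mentioned in this task; adjust as needed):

for host in deployment-mediawiki01 deployment-mediawiki02; do
    ssh "$host" 'sudo tail -n 20 /var/log/hhvm/error.log'   # recent HHVM errors on each app server
done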

Here are the complete error headers and the text of the error message. To be clear, I'm not reporting this particular 503 error, I'm reporting problems *logging* 503 errors.

HTTP/1.1 503 Service Unavailable
Server: Varnish
Content-Type: text/html; charset=utf-8
X-Varnish: 281336368, 1116870927
Via: 1.1 varnish, 1.1 varnish
Content-Length: 2717
Accept-Ranges: bytes
Date: Mon, 20 Oct 2014 19:59:36 GMT
Age: 0
Connection: keep-alive
X-Cache: deployment-cache-text02 miss (0), deployment-cache-text02 frontend miss (0)
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
X-Analytics: php=zend
Set-Cookie: GeoIP=::::v4; Path=/; Domain=.wmflabs.org

Wikimedia Foundation
Error

Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes.
If you report this error to the Wikimedia System Administrators, please include the details below.
Request: GET http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Topic:S2kt0b7x2jrv7gw0&action=compare-postsummary-revisions&topicsummary_newRevision=s3ems8b3lyhnoafm, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 281335977
Forwarded for: 198.73.209.3, 127.0.0.1
Error: 503, Service Unavailable at Mon, 20 Oct 2014 19:57:01 GMT


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:51 AM
bzimport set Reference to bz72275.

Well operations/mediawiki-config has:

$ mwscript eval.php --wiki=enwiki
> print $wmfUdp2logDest;
10.68.16.58:8420
> 

which is deployment-bastion.
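One way to double-check that something is actually listening there (a hypothetical follow-up, not output captured for this task):

deployment-bastion:~$ sudo netstat -lnup | grep 8420   # confirm udp2log is bound to UDP port 8420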

The udp2log / python demux script crashes from time to time, so we have to restart it:

deployment-bastion:~$ sudo /etc/init.d/udp2log-mw restart

 * Restarting UDP log receiver (udp2log) udp2log-mw                    [ OK ]

So that should fix the log writing for now.
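A quick way to confirm that would be something like (sketch only; file names as seen earlier in this task):

deployment-bastion:~$ ls -lt /data/project/logs | head        # newest files should show fresh timestamps
deployment-bastion:~$ tail -f /data/project/logs/fatal.log    # and new entries should appear as errors occur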

We also have an instance, udplog.eqiad.wmflabs, which receives some other stream of udp2log events, maybe from the hhvm instances? That is rather inconsistent; maybe we should point everything at the udplog instance and remove the service from deployment-bastion. CCing Ori, who set up the new instance.

(In reply to Antoine "hashar" Musso (WMF) from comment #1)

udp2log / python demux script crash from time to time so we have to restart
it:
...
So that should fix the log writing for now.

No, there's nothing. There are a few archived hhvm-log-<DATE>.gz files in deployment-bastion:/data/project/logs/archive, the last one from August 28th.

We also have an instance udplog.eqiad.wmflabs which receives some other
stream of udp2log events, maybe from hhvm instance? That is rather
inconsistent

That seems to have the same content in /data/project/logs and ../archive as deployment-bastion

Also, ErikB comments that the HHVM fatals don't show up on logstash-beta.wmflabs.org; I don't know if that's a separate bug.

This means we don't notice problems on the beta cluster and they're harder to debug, so I'm bumping the priority.

I haven't taken the time to properly investigate the issue. But recently we got a new udp2log-mw instance on udplog.eqiad.wmflabs, in addition to the one that has always existed on deployment-bastion.eqiad.wmflabs.

So we have two daemons writing to the same files, and two logrotate runs on different instances that move the files and reload the daemon so it opens new files. That is definitely a race condition which eventually causes the logs to be written in the wrong place and/or kills the daemon.
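A rough way to see the duplication (hypothetical commands; hostnames as discussed in this task):

for host in deployment-bastion udplog; do
    ssh "$host" 'pgrep -l udp2log; grep -rl udp2log /etc/logrotate.d/'   # show the daemon and any logrotate rules on each host
done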

What would need to be done is to figure out what udplog.eqiad.wmflabs is for; Ori set it up (he is on CC of this bug), so there must be a purpose. From there, the easiest fix is probably to get rid of the udp2log-mw instance on deployment-bastion and point everything at the new udplog.eqiad.wmflabs. That would need a few changes in operations/puppet.git and operations/mediawiki-config.git.

(In reply to spage from comment #2)

We also have an instance udplog.eqiad.wmflabs

There's also a deployment-fluoride machine. Both udplog and deployment-bastion have the same /etc/udp2log/mw config trying to send '^(fatal|exception)' to it, but first the fatals have to make it into the logs.

That seems to have the same content in /data/project/logs and ../archive as
deployment-bastion

Duh, they all mount the same deployment-project NFS share :-)
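Something like this would have shown that right away (sketch only):

for host in deployment-bastion udplog; do ssh "$host" 'df -hT /data/project'; done   # both report the same NFS export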

Fatals come from the apache or hhvm process logs. It looks like we haven't recorded any new apache2.log data in the beta cluster since 2014-07-23 and the last hhvm-fatal.log is dated 2014-08-29. I would expect these log streams to come in via rsyslog forwarding from each host as configured by modules/mediawiki/files/rsyslog.conf in operations/puppet.
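A quick spot-check on an app server could look something like this (a sketch, assuming the rsyslog rule forwards over UDP to port 8420 as the rest of this task suggests):

cat /etc/rsyslog.d/40-mediawiki.conf        # see the forwarding rules rsyslog actually loaded
logger -t hhvm 'rsyslog forwarding test'    # emit a syslog message tagged as hhvm
sudo tcpdump -n -c 5 udp port 8420          # watch whether anything matching leaves the host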

Our loss of logging for the hhvm-fatal.log file lines up quite nicely with the first commit of that bit of refactored puppet configuration on Aug 28, 2014. The last log message we have is timestamped "Aug 28 19:47:43" and the change was merged on 08-28 13:47 (https://gerrit.wikimedia.org/r/#/c/154710/).

Debugging some...

On udplog.eqiad.wmflabs, running sudo tcpdump udp port 8420 does not show any log packets being received from any host.

On deployment-mediawiki01, running sudo tcpdump udp port 8420 and host 10.68.16.58 shows lots of traffic headed to deployment-bastion. This is traffic from the udp2log client built into MediaWiki. It's making it to the right place and being logged to files and forwarded to logstash.

Again on deployment-mediawiki01, running sudo tcpdump udp port 8420 and host 10.68.17.218 shows no traffic headed to udplog.eqiad.wmflabs.

The /etc/rsyslog.d/40-mediawiki.conf file on deployment-mediawiki01 matches https://github.com/wikimedia/operations-puppet/blob/565672c41b016b465afc871080dbf440bced9451/modules/mediawiki/files/rsyslog.conf. This config appears to be active as there is data in the local /var/log/apache2.log file.

So for some reason currently unknown to me, the rsyslog server on deployment-mediawiki01 is up and running and processing local files but not forwarding messages via udp to udplog.eqiad.wmflabs.
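Possible next steps on deployment-mediawiki01 (a sketch; not yet verified):

sudo rsyslogd -N1               # validate the loaded rsyslog configuration without restarting
sudo service rsyslog restart    # then restart and repeat the tcpdump checks above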

Change 176191 had a related patch set uploaded (by BryanDavis):
Use hiera to configure udp2log endpoint for ::mediawiki

https://gerrit.wikimedia.org/r/176191

Patch-For-Review

The combination of https://gerrit.wikimedia.org/r/176191 and https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=prev&oldid=136031 seems to have restored collective hhvm error reporting to /data/project/logs/hhvm.log on deployment-bastion.
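A simple way to confirm (sketch):

deployment-bastion:~$ tail -f /data/project/logs/hhvm.log   # new hhvm errors from the app servers should stream in here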

If we can get the puppet patch merged then we can also delete the udplog instance in the deployment-prep labs project.

I'm going to work on some followup puppet changes to get the hhvm log stream into logstash that we can use in production as well.

Change 176191 merged by Giuseppe Lavagetto:
Use hiera to configure udp2log endpoint for ::mediawiki

https://gerrit.wikimedia.org/r/176191

bd808 removed a project: Patch-For-Review.
bd808 set Security to None.

Apache and hhvm logs are flowing to /data/project/logs/ again and also being piped into logstash.

Joe mentioned this in Unknown Object (Diffusion Commit). Dec 10 2014, 8:37 PM