
Intermittent "502 Bad Gateway" errors on Wikimedia wikis via HTTPS
Closed, Declined · Public

Description

Intermittently, I get 502 Bad Gateway errors using HTTPS on en.wikipedia.org while logged in. The error page footer reads "nginx/1.1.19" or similar. https://en.wikipedia.org/wiki/List_of_Anything_Muppets is a sample URL. Refreshing the page resolves the error, but we should investigate and address what's causing these intermittent failures.
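
For anyone trying to reproduce this, here is a minimal sketch that repeatedly fetches a page over HTTPS and counts 502 responses; the URL and attempt count are arbitrary choices, not anything specified in the report.

# Minimal reproduction sketch (assumes the third-party "requests" library).
# Repeatedly fetch a page over HTTPS and count intermittent 502 responses.
import requests

URL = "https://en.wikipedia.org/wiki/List_of_Anything_Muppets"
ATTEMPTS = 50

bad_gateways = 0
for i in range(ATTEMPTS):
    resp = requests.get(URL, timeout=30)
    if resp.status_code == 502:
        bad_gateways += 1
        print("attempt %d: 502 Bad Gateway (%s)" % (i, resp.headers.get("Server", "unknown")))

print("%d/%d requests returned 502" % (bad_gateways, ATTEMPTS))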


Version: wmf-deployment
Severity: normal

Details

Reference
bz50891

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:54 AM
bzimport set Reference to bz50891.
bzimport added a subscriber: Unknown Object (MLST).

I just got this again at this URL: https://en.wikipedia.org/w/index.php?title=Special:LinkSearch&limit=250&offset=0&target=http%3A%2F%2Ftoolserver.org%2F~mzmcbride%2Fcgi-bin%2Fwatcher.


<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

I'll upload a screenshot momentarily.

Created attachment 12910
Screenshot of 502 Bad Gateway error on https://en.wikipedia.org

Attached:

Screen_Shot_2013-07-21_at_2.01.56_PM.png (934×1 px, 109 KB)

Just got one of these on https://meta.wikimedia.org/wiki/User:MF-Warburg/abuse. Reloading fixed the issue.

The source was basically the same as in comment 1, but no <!-- html comments -->.

It looks like traffic through the SSL cluster has doubled in the past month, and the eqiad hardware is being overloaded. We're adding some more nodes to the cluster.

Two new SSL servers were just pooled in eqiad. We'll need to do this in esams eventually as well, though esams has newer/better hardware. Please let me know if you're still having this issue.

I had to depool them due to issues with ipv6. I'll update the ticket when they are repooled.

They are repooled now and everything should be working. I'll close this as fixed. Please re-open if it's not.

Yeah, I got a 501 on Meta itself (the whole page) twice this evening too. Looking at it through Firebug, I see multiple 502s from upload as I wander the sites. The best page to test with that I've found is the Meta front page because it has a ton of images, but I've seen 3–4 on random enwiki pages too, and usually at least one on any page with images.

MZMcBride: Has this happened recently?

(In reply to comment #13)

MZMcBride: Has this happened recently?

I can confirm that I received a "502 Bad Gateway" (nginx/1.1.19) today on enwiki when following a perfectly fine link. The second time I followed the link, it took me where it was supposed to.

Just for the record, the link was https://en.wikipedia.org/w/index.php?title=User_talk%3ARYasmeen_%28WMF%29&diff=586114761&oldid=586070808

This hasn't happened recently for me. I wonder if this bug report should re-focus on better logging/monitoring of 502s.

Ori or Nemo: do you know if we graph this data (users hitting nginx gateway timeout errors --> 502s) anywhere or if it would be possible to do so?

(In reply to comment #16)

Ori or Nemo: do you know if we graph this data (users hitting nginx gateway timeout errors --> 502s) anywhere or if it would be possible to do so?

Presumably they appear in https://gdash.wikimedia.org/dashboards/reqerror/, mixed in with all the other 5xx errors (I'm not able to assess how complete or precise that report is). If this intermittent problem has the same cause as similar recent problems, i.e. network links at capacity, it might be more fruitful to set up a network monitoring tool like https://monitor.archive.org/weathermap/weathermap.html.
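
As an illustration of what dedicated 502 tracking could look like (as opposed to lumping everything into 5xx), here is a rough sketch that tails an access log and emits a per-status statsd counter. The log path, log format, and statsd address are hypothetical, not Wikimedia's actual setup.

# Hypothetical sketch: count 502 responses in an nginx access log and emit
# a statsd counter so they can be graphed separately from other 5xx errors.
import re
import socket
import time

LOG_PATH = "/var/log/nginx/access.log"      # hypothetical path
STATSD_ADDR = ("statsd.example.org", 8125)  # hypothetical statsd endpoint
STATUS_RE = re.compile(r'" (\d{3}) ')       # assumes "combined" log format

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def follow(path):
    # Yield new lines appended to the file, like `tail -f`.
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow(LOG_PATH):
    match = STATUS_RE.search(line)
    if match and match.group(1) == "502":
        sock.sendto(b"nginx.status.502:1|c", STATSD_ADDR)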

(In reply to MZMcBride from comment #15)

This hasn't happened recently for me.

If that's still the case I propose RESOLVED WORKSFORME.

I wonder if this bug report should re-focus on better logging/monitoring of 502s.

We only have 5xx-level monitoring. If you have specific recommendations, could you put them into a separate enhancement request?

(In reply to Andre Klapper from comment #18)

(In reply to MZMcBride from comment #15)

This hasn't happened recently for me.

If that's still the case I propose RESOLVED WORKSFORME.

Yeah, at this point this report is not particularly actionable. Icinga alerts were also added and are regularly acted upon, for instance:

23.48 <+icinga-wm_> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
00.03 <+icinga-wm_> RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
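
For context, that check flags CRITICAL when more than a given percentage of recent 5xx req/min datapoints exceed a threshold. Below is a minimal sketch of that style of thresholding, with illustrative values rather than the production configuration.

# Illustrative sketch of the thresholding style behind the alerts above:
# flag CRITICAL when more than critical_pct of datapoints exceed threshold.
def check_5xx(datapoints, threshold=500.0, critical_pct=1.0):
    if not datapoints:
        return "UNKNOWN", "no data"
    over = sum(1 for v in datapoints if v > threshold)
    pct = 100.0 * over / len(datapoints)
    if pct > critical_pct:
        return "CRITICAL", "%.2f%% of data exceeded the critical threshold [%s]" % (pct, threshold)
    return "OK", "Less than %.2f%% data above the threshold [%s]" % (critical_pct, threshold)

# Example: one spike of 502s in an otherwise quiet 14-minute window.
print(check_5xx([12, 30, 25, 640, 18, 22, 15, 11, 9, 27, 14, 31, 20, 16]))
# -> ('CRITICAL', '7.14% of data exceeded the critical threshold [500.0]')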