
git.wikimedia.org (gitblit) goes down when getting overloaded by Googlebot
Closed, Declined (Public)

Description

From the mailing list

from: planetenxin <planetenxin@web.de> via gmail.com
reply-to: Wikimedia developers <wikitech-l@lists.wikimedia.org>
to: wikitech-l <wikitech-l@lists.wikimedia.org>
date: Sun, Jul 21, 2013 at 10:15 PM
subject: [Wikitech-l] Git Proxy Error

Hi folks,
I'm constantly getting the following error on https://git.wikimedia.org/:
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /.
Reason: Error reading from remote server
/Alexander


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=51656
https://bugzilla.wikimedia.org/show_bug.cgi?id=49371

Details

Reference
bz51769

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 1:55 AM)
bzimport added a project: Gerrit.
bzimport set Reference to bz51769.
bzimport added a subscriber: Unknown Object (MLST).

This is probably a duplicate of bug 51656, but I'll just add that bug as a see also to this bug.

Confirming the current issue:

$ curl https://git.wikimedia.org/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/">GET&nbsp;/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

git.wikimedia.org being down is highly disruptive to support work in MediaWiki-General.

Restarted the service. Two things:

A) I'm doing some logging this time so I can figure out why it's crashing
B) I'm going to finish puppetizing and packaging this starting tomorrow so it'll take care of this better

Hopefully we can find an easy fix for this, but (B) is necessary anyway.

It's happening again. :-(

$ time curl https://git.wikimedia.org
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/">GET&nbsp;/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

real 12m1.110s
user 0m0.015s
sys 0m0.021s

For the record, it doesn't seem to be completely down, just really slow to respond. When I tried again, I got:

$ time curl https://git.wikimedia.org
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" ng-app="">

<!-- Head -->
<head>

		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
   		<title>Wikimedia</title>
		<link rel="icon" href="gitblt-favicon.png" type="image/png"/>
		
		<link rel="stylesheet" href="bootstrap/css/bootstrap.css"/>
		<link rel="stylesheet" href="bootstrap/css/iconic.css"/>
		<link rel="stylesheet" type="text/css" href="gitblit.css"/>

<link rel="stylesheet" type="text/css" href="bootstrap/css/bootstrap-responsive.css" />
<style type="text/css">
.navbar-inner {
background-color: #FBFBFB;
border-bottom: 1px solid #39688E !important;
}
.navbar ul li:focus, .navbar .active {
border-bottom: 4px solid #9C000F;
}
.navbar ul.nav li a {
color: #002060;
}
.navbar ul.nav .active a {
color: #002060;
}
.navbar ul.nav li a:hover {
color: #002060 !important;
}
</style>
<script type="text/javascript" src="https://www.google.com/jsapi"></script>
<script type="text/javascript" ><!--/*--><![CDATA[/*><!--*/
google.load("visualization", "1", {packages:["corechart"]});
google.setOnLoadCallback(drawChart);
function drawChart() {
var data_ffa5c540 = new google.visualization.DataTable();
data_ffa5c540.addColumn('string', 'repository');
data_ffa5c540.addColumn('number', 'commits');
data_ffa5c540.addRows(85);
data_ffa5c540.setValue(0, 0, 'mediawiki/core');
data_ffa5c540.setValue(0, 1, 2.0);
data_ffa5c540.setValue(1, 0, 'mediawiki/extensions');
data_ffa5c540.setValue(1, 1, 2.0);
data_ffa5c540.setValue(2, 0, 'mediawiki/extensions/TranslationNotifications');
data_ffa5c540.setValue(2, 1, 2.0);
data_ffa5c540.setValue(3, 0, 'mediawiki/extensions/PollNY');
data_ffa5c540.setValue(3, 1, 1.0);
data_ffa5c540.setValue(4, 0, 'mediawiki/extensions/Nuke');
data_ffa5c540.setValue(4, 1, 1.0);
data_ffa5c540.setValue(5, 0, 'mediawiki/extensions/LastModified');
data_ffa5c540.setValue(5, 1, 1.0);
data_ffa5c540.setValue(6, 0, 'mediawiki/extensions/MoodBar');
data_ffa5c540.setValue(6, 1, 1.0);
data_ffa5c540.setValue(7, 0, 'mediawiki/extensions/Validator');
data_ffa5c540.setValue(7, 1, 1.0);
data_ffa5c540.setValue(8, 0, 'mediawiki/extensions/Phalanx');
data_ffa5c540.setValue(8, 1, 1.0);
data_ffa5c540.setValue(9, 0, 'mediawiki/extensions/MassEditRegex');
data_ffa5c540.setValue(9, 1, 1.0);
var chart_ffa5c540 = new google.visualization.PieChart(document.getElementById('chartRepositories'));
chart_ffa5c540.draw(data_ffa5c540, { title: 'active repositories', colors:['#3810a6','#7110a6','#1068a6','#a61029','#1056a6','#63a610','#a64210','#109ca6','#7610a6','#a61083',], legend: { position:'none' } });

var data_39b7080b = new google.visualization.DataTable();
data_39b7080b.addColumn('string', 'author');
data_39b7080b.addColumn('number', 'commits');
data_39b7080b.addRows(17);
data_39b7080b.setValue(0, 0, 'jeroendedauw');
data_39b7080b.setValue(0, 1, 28.0);
data_39b7080b.setValue(1, 0, 'Ori Livneh');
data_39b7080b.setValue(1, 1, 12.0);
data_39b7080b.setValue(2, 0, 'Kunal Mehta');
data_39b7080b.setValue(2, 1, 6.0);
data_39b7080b.setValue(3, 0, 'Yuki Shira');
data_39b7080b.setValue(3, 1, 4.0);
data_39b7080b.setValue(4, 0, 'legoktm');
data_39b7080b.setValue(4, 1, 3.0);
data_39b7080b.setValue(5, 0, 'Hoo man');
data_39b7080b.setValue(5, 1, 2.0);
data_39b7080b.setValue(6, 0, 'Santhosh Thottingal');
data_39b7080b.setValue(6, 1, 2.0);
data_39b7080b.setValue(7, 0, 'Jeroen De Dauw');
data_39b7080b.setValue(7, 1, 2.0);
data_39b7080b.setValue(8, 0, 'Kip');
data_39b7080b.setValue(8, 1, 2.0);
data_39b7080b.setValue(9, 0, 'Marius Hoch');
data_39b7080b.setValue(9, 1, 2.0);
var chart_39b7080b = new google.visualization.PieChart(document.getElementById('chartAuthors'));
chart_39b7080b.draw(data_39b7080b, { title: 'active authors', colors:['#a6106c','#71a610','#a1a610','#a61040','#a68010','#a68010','#a67410','#1022a6','#104ca6','#1080a6',], legend: { position:'none' } });

}

/*-->]]>*/</script>

<script type="text/javascript" src="resources/com.gitblit.wicket.ng.NgController/angular.js"></script>
<script type="text/javascript" ><!--/*--><![CDATA[/*><!--*/
<!-- AngularJS projectsCtrl data controller -->
function projectsCtrl($scope) {
$scope.projectsList = [{"p":"mediawiki","n":"mediawiki","t":"18 mins ago","d":"2013-07-22","i":"","c":626},{"p":"labs","n":"labs","t":"5 hours ago","d":"2013-07-21","i":"","c":9},{"p":"operations","n":"operations","t":"8 hours ago","d":"2013-07-21","i":"","c":72},{"p":"main","n":"Main Repositories","t":"15 hours ago","d":"2013-07-21","i":"main group of repositories","c":19},{"p":"pywikibot","n":"pywikibot","t":"21 hours ago","d":"2013-07-21","i":"","c":4},{"p":"analytics","n":"analytics","t":"2 days ago","d":"2013-07-20","i":"","c":39},{"p":"wikimedia","n":"wikimedia","t":"3 days ago","d":"2013-07-19","i":"","c":24},{"p":"integration","n":"integration","t":"3 days ago","d":"2013-07-19","i":"","c":11},{"p":"qa","n":"qa","t":"3 days ago","d":"2013-07-19","i":"","c":1},{"p":"apps","n":"apps","t":"4 days ago","d":"2013-07-18","i":"","c":4},{"p":"wiktionary","n":"wiktionary","t":"4 weeks ago","d":"2013-06-21","i":"","c":1},{"p":"test","n":"test","t":"8 weeks ago","d":"2013-05-25","i":"","c":2},{"p":"officeit","n":"officeit","t":"4 months ago","d":"2013-04-08","i":"","c":1},{"p":"VisualEditor","n":"VisualEditor","t":"6 months ago","d":"2013-01-29","i":"","c":1},{"p":"glam","n":"glam","t":"10 months ago","d":"2012-10-11","i":"","c":1}];
}

/*-->]]>*/</script>

/*-->]]>*/</script>

</head>

<body>

		<!-- page content -->
		[a metric fuckton of HTML]

		<!-- Override Bootstrap's responsive menu background highlighting -->
		<style>
		@media (max-width: 979px) {
			.nav-collapse .nav > li > a:hover, .nav-collapse .dropdown-menu a:hover {
				background-color: #002060;
			}
			
			.navbar div > ul .dropdown-menu li a {
				color: #ccc;
			}
		}
		</style>
		
		<!-- Include scripts at end for faster page loading -->
		<script type="text/javascript" src="bootstrap/js/jquery.js"></script>
		<script type="text/javascript" src="bootstrap/js/bootstrap.js"></script>

</body>
</html>
real 10m1.110s
user 0m0.018s
sys 0m0.028s

Restarted the service again. Nothing was suspicious in the logs, but CPU was pegged at 100% :\

This seems to be an issue again.

14:42 apergos: (btw docs would be nice, is that really the right way to kick it?)
14:42 apergos: shot and restarted gitblit: on antinomy, cd /var/lib/gitblit, java -jar gitblit.jar & (see bug 51769)

Can we help Ariel out? :)

(In reply to comment #8)

14:42 apergos: (btw docs would be nice, is that really the right way to kick
it?)
14:42 apergos: shot and restarted gitblit: on antinomy, cd /var/lib/gitblit,
java -jar gitblit.jar & (see bug 51769)

Can we help Ariel out? :)

Peachey88 helpfully created https://wikitech.wikimedia.org/wiki/Git.wikimedia.org and https://wikitech.wikimedia.org/wiki/Gitblit. Thank you, Peachey88!

git.wikimedia.org continues to be broken. This is very frustrating.

We can't take a stack trace right now since gitblit is running with Java 6 and we are missing the jstack utility. That is bug 51859.

(In reply to comment #10)

git.wikimedia.org continues to be broken. This is very frustrating.

You're telling me :(

(In reply to comment #5)

For the record, it doesn't seem to be completely down, just really slow to
respond.

To me it just serves "Internal error" messages. When CPU usage goes down a bit, you may happen to get some content; maybe curl is just being very patient and retrying a lot until that happens.

I've got some new caching stuff that landed in master just for us :) Looking at turning that on today as well.

So, it's a little sluggish still but it's at least staying up. Did some more puppetizing and blocked some misbehaving spiders that were causing it to fall over all the time.

Still want to up the heap and see if we can get some better performance out of it.

Closing this as resolved/fixed for now.

It's again inaccessible (my browser doesn't receive any reply) with antimony at 100 % load.

(In reply to comment #17)

It's again inaccessible (my browser doesn't receive any reply) with antimony
at 100 % load.

Confirmed that git.wikimedia.org is inaccessible currently; no idea about antimony's load.

Re-opening this bug as it's no longer resolved. Thank you, Nemo, for commenting here.


$ time curl https://git.wikimedia.org/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/">GET&nbsp;/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

real 12m43.815s
user 0m0.010s
sys 0m0.024s

(In reply to comment #20)

Down again?

Indeed. :-(


$ time curl https://git.wikimedia.org/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/">GET&nbsp;/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

real 12m2.101s
user 0m0.010s
sys 0m0.028s

So bots aren't going to be allowed to index gitblit anymore since they can't seem to behave.

(In reply to comment #22)

So bots aren't going to be allowed to index gitblit anymore since they can't
seem to behave.

Do the misbehaving bots respect robots.txt?

(In reply to comment #23)

(In reply to comment #22)

So bots aren't going to be allowed to index gitblit anymore since they can't
seem to behave.

Do the misbehaving bots respect robots.txt?

Nope.

(In reply to comment #24)

Do the misbehaving bots respect robots.txt?

Nope.

Just to be clear, I hope that you block whatever is wished and needed except Google (so that we have at least one search engine).

Google is the one misbehaving.

I'm talking to the folks at google about it, will update as soon as I have more info.

Turns out that we were serving an empty robots.txt even though there is content. Logs show lots of:
[Wed Aug 07 21:44:15 2013] [error] [client XXX.XXX.XXX.XXX] proxy: Error reading from remote server returned by /robots.txt

Chad says this is likely due to some fancy proxy rewriting sending everything to gitblit. In the meantime the Google folks have cut back both parallel open connections and requests per second until we have this sorted out and can give them the heads-up.

merged this:

https://gerrit.wikimedia.org/r/#/c/77909/1/templates/apache/sites/git.wikimedia.org.erb

and:

08:42 mutante: on overloaded antimony: stopped gitblit, ran puppet, blocked Googlebot, restarted apache and gitblit
08:22 mutante: git.wikimedia.org back
08:21 mutante: attempted to restart gitblit on antimony

01:44 < mutante> apergos: oh.. Googlebot-Image/1.0 != Googlebot ?
01:44 < apergos> no
01:44 < apergos> googlebot is the one that rate limited
01:44 < apergos> that should be enough, it's the /zip/ paths primarily that were the problem
01:44 < mutante> Googlebot-Image is the one still being very active
01:44 < mutante> ok
01:44 < apergos> it's minor compared to the rest
01:44 < mutante> alright
01:44 < apergos> I did a count from the logs yeasterday
01:45 < mutante> now YandexBot kicking in :)
01:45 < apergos> please note it on the bz report so we don't forget
01:46 < mutante> java CPU usage isn't > 500% anymore though :)
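
A per-crawler count like the one mentioned in the IRC log above could be reproduced roughly as follows. This is a minimal sketch, assuming Apache's combined log format (where the user agent is the last double-quoted field); the function name `count_user_agents` and the log path in the usage note are hypothetical and not taken from antimony's actual setup.

```shell
#!/bin/sh
# Count requests per user agent, busiest first.
# Assumes combined log format: splitting a line on double quotes
# puts the user agent in the sixth field.
count_user_agents() {
    awk -F'"' 'NF >= 6 { print $6 }' | sort | uniq -c | sort -rn
}
```

Usage might look like `count_user_agents < /var/log/apache2/access.log | head` (the path is an assumption). Filtering the input with `grep ' /zip/'` first would narrow the count to the paths named as the main problem.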

(Resetting severity & priority as long as things are working again.)

Chad fixed the serving of robots.txt and we've unblocked Googlebot. Waiting for verification that Google has picked up the file, and we'll see how performance looks after that.

For the record: https://gerrit.wikimedia.org/r/#/c/78243/
(By the way: Ariel & Daniel, thanks for keeping this ticket updated.)

As reported by OsamaK last night in #wikimedia-tech and again today by rupert THURNER on wikitech-l (and I personally confirmed last evening), git.wikimedia.org is down yet again.


$ time curl https://git.wikimedia.org/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/">GET&nbsp;/</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

real 12m1.522s
user 0m0.010s
sys 0m0.026s

(In reply to comment #34)

created a bug at gitblit:
https://code.google.com/p/gitblit/issues/detail?id=294.

No need for that, I've already been in contact with upstream. It's mainly being overloaded by Googlebot.

Back up again at the moment...

Down again at the moment... I think gitblit could behave more gracefully when bots are overloading it.

Before restarting the Java process, could you trace it with:

  • jps -l to find out the process id
  • strace to see if it excessively calls into the operating system
  • jstack <pid> to dump the thread stacks
  • kill -QUIT <pid> to print the stack trace
  • jmap -heap <pid> to find memory usage
  • jmap -histo:live <pid> | head to find excessively used classes
  • if you have a UI, you might try jconsole or http://visualvm.java.net as well

You might want to set the java -XX:+HeapDumpOnOutOfMemoryError parameter when starting, so it writes a heap dump early enough. The dump can be analyzed with jhat.
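
The capture steps listed above could be wrapped in a small helper so they are run consistently before each restart. This is only a sketch: the function name `collect_diagnostics`, the output file names, and the `DRY_RUN` switch are hypothetical, and it assumes the JDK tools `jstack` and `jmap` are on the PATH (which, per bug 51859, was not yet the case on this host).

```shell
#!/bin/sh
# Capture JVM state before restarting, so the evidence is not lost.
# Usage: collect_diagnostics <pid> [output-dir]
# Set DRY_RUN=1 to print the commands instead of running them.
collect_diagnostics() {
    pid=$1
    out=${2:-.}
    run() {
        # $1 is the output file; the remaining arguments are the command.
        file=$1; shift
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$* > $file"
        else
            "$@" > "$file" 2>&1
        fi
    }
    run "$out/jstack.txt" jstack "$pid"
    run "$out/heap.txt"   jmap -heap "$pid"
    run "$out/histo.txt"  jmap -histo:live "$pid"
}
```

The process id would come from `jps -l`, for example `collect_diagnostics 1234 /tmp/gitblit-diag` (both values are placeholders). Running with `DRY_RUN=1` first shows what would be executed without touching the JVM.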

Stupid question: is crawl-delay in robots.txt worth trying?

Faidon Liambotis restarted it a couple of hours ago, without tracing it seems, and now it's dead again.

https://gerrit.wikimedia.org/r/#/c/78919/

As soon as the bot picks up the new copy of the robots.txt file, that should take care of it for now.
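
For illustration only, a robots.txt along the lines discussed in this thread might look like the sketch below. It is not the content of the Gerrit change linked above. The `/zip/` rule is taken from the IRC remark that those paths were the main problem, while the `Crawl-delay` value and the helper name `write_robots_txt` are arbitrary assumptions. As far as I know, Googlebot does not honor `Crawl-delay`, so for Google it is the `Disallow` rule that would matter.

```shell
#!/bin/sh
# Write an illustrative robots.txt to the given path.
# The rules are hypothetical examples, not the deployed file.
write_robots_txt() {
    cat > "$1" <<'EOF'
User-agent: *
Disallow: /zip/
Crawl-delay: 10
EOF
}
```

Calling `write_robots_txt ./robots.txt` produces a file that can be reviewed before it is put anywhere a crawler will see it.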

Max Semenik reported that gitblit should mark critical links with rel="nofollow".

Just a heads-up: Upstream in https://code.google.com/p/gitblit/issues/detail?id=274 asks "for discussion with Wikimedia folks on how to improve caching since the current strategy is not ideal".

Maybe http://git.zx2c4.com/cgit/about/ might be an alternative to gitblit as its main design criterion is speed and caching.

hashar claimed this task.
hashar subscribed.

The gitblit software powering git.wikimedia.org is in the process of being replaced by Phabricator Diffusion. See:

T111465: [keyresult] Deprecate gitblit in favor of Diffusion
Project: Gitblit-Deprecate