
Bot being served Error: 503, Service Unavailable (default error page) for API action=edit requests on large pages (caused by parsing the page three times on save)
Closed, ResolvedPublic

Description

90% of the time, my bots are hitting a technical issue when trying to edit. Here's one of the thousands of instances they got over the course of 3 hours of bot work.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Wikimedia Error</title>
<meta name="author" content="Mark Ryan"/>
<meta name="copyright" content="(c) 2005-2007 Mark Ryan and others. Text licensed under the GNU Free Documentation License. http://www.gnu.org/licenses/fdl.txt"/>

<style type="text/css"><!--
body {
background-color: #dbe5df;
font: 100% "Gill Sans MT", "Gill Sans", "Trebuchet MS", Helvetica, sans-serif;
margin: 0px;
}
.TechnicalStuff {
font-size: 0.8em;
font-style: italic;
text-align: center;
padding-bottom: 0.8em;
}
.Lines {
width: 100%;
height: 1px;
overflow: hidden;
font-size: 0.5px;
}
.ContentArea {
background-color: white;
padding: 0.8em 10% 0 10%;
font-size: 1.0em;
}
a:hover {
color: red;
}
h1, h2 {
margin: 0px;
font-size: 1.0em;
}
h2 {
background: #9fbfd8;
font-size: 1.2em;
font-weight: bold;
text-align: center;
}
h1 {
background: #dbe5df;
font: bold 1.5em "Gill Sans MT", "Gill Sans", Helvetica, Humanist, sans-serif;
text-transform: uppercase;
text-align: center;
width: 100%;
padding-top:0.8em;
}
-->
</style>
<script>/*<![CDATA[*/
function lines(s) {

		var c = s.split(' ');
		for (var i = 0; i < c.length; i++) {
			document.write('<div class="Lines" style="background-color:#' + c[i] + ';"></div>');
		}

}
//]]></script>
</head>

<body link="#24442E" text="#000000" vlink="#24442E" alink="#FF0000">
<h1>Wikimedia Foundation</h1>
<script>lines('ccd4cf bdc3bf adb1af 9ea09f dbe5df');</script>

<h2>Error</h2>

<script>lines('8f8f8f acacac c6c6c6 dbdbdb eaeaea f4f4f4');</script>

<!-- BEGIN CONTENT AREA -->
<div class="ContentArea">

<div id="en" lang="en">
<p>Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please <a href="//w/api.php" onclick="window.location.reload(false); return false">try again</a> in a few minutes.</p>
<p>The Wikimedia Foundation is a non-profit organisation which hosts some of the most popular sites on the Internet, including Wikipedia. It has a constant need to purchase new hardware. If you would like to help, please <a href="https://donate.wikimedia.org/?utm_medium=varnisherr&utm_source=20120809SQ01&utm_campaign=errorpage">donate</a>.</p>
<hr noshade="noshade" size="1px" width="80%" />
<div class="TechnicalStuff">
If you report this error to the Wikimedia System Administrators, please include the details below.<br/>
</div>
<div class="TechnicalStuff">
<bdo dir="ltr">
Request: POST http://en.wikipedia.org/w/api.php, from 130.203.176.150 via cp1055 frontend ([208.80.154.225]:80), Varnish XID 2270366995<br/>Forwarded for: 130.203.176.150<br/>Error: 503, Service Unavailable at Wed, 13 Nov 2013 15:30:37 GMT
</bdo>
</div>
</div>

</div>

<div>
<p>Your cache administrator is <a href="https://bugzilla.wikimedia.org/show_bug.cgi?id=18903">nobody</a>.</p>
</div>
<script>lines('9ea09f adb1af bdc3bf ccd4cf');</script>

</body>
</html>

This is taken straight from the bot that received it.


Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=59788

Details

Reference
bz57026

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:37 AM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz57026.
bzimport added a subscriber: Unknown Object (MLST).

What page are you trying to edit? What data are you posting?

(In reply to comment #1)

What page are you trying to edit? What data are you posting?

Trying to edit User:Cyberbot I/Current AfD's

Trying to add: An updated version of its contents.

(In reply to comment #2)

Trying to edit [[User:Cyberbot I/Current AfD's]]

Let me just note that this is a pretty large page (120K, takes 11s to render), which naturally doesn't mean it should fail to save, let alone return the default error page for API requests.

Checking the API logs, I see several identical edit requests that all took 40-50 seconds to process from start to finish. And looking at the page history, I see your bot has runs where it makes an edit every minute with the only difference being the "This table was last updated on" timestamp.

So it seems that what's probably going on here is that your API request is taking too long to process so the proxy times out and gives you an error page. And then your bot just keeps hammering away trying to make the same edit. A better idea here would be to catch the 503, wait a minute or two, and then see if the edit eventually saved.
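
A minimal sketch of that recovery strategy, as it might look in a Python bot using the requests library. This is not how any particular framework does it: the function name is hypothetical, the session is assumed to be already logged in, the CSRF token is assumed to be valid, and matching the newest revision's edit summary is just one possible way to verify the save.

import time
import requests

API = "https://en.wikipedia.org/w/api.php"

def edit_with_503_recovery(session, title, text, summary, token):
    """Attempt an edit; on a 503, wait and check whether it saved anyway."""
    resp = session.post(API, data={
        "action": "edit", "title": title, "text": text,
        "summary": summary, "token": token, "format": "json",
    })
    if resp.status_code != 503:
        return resp.json()          # normal API result (success or API-level error)

    # The proxy timed out, but the edit may still have gone through.
    time.sleep(120)                 # wait a minute or two, as suggested above
    check = session.get(API, params={
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "comment|timestamp|user", "rvlimit": 1, "format": "json",
    }).json()
    page = next(iter(check["query"]["pages"].values()))
    latest = page.get("revisions", [{}])[0]
    if latest.get("comment") == summary:
        return {"edit": {"result": "Success"}}   # the edit did save after all
    raise RuntimeError("503 received and the edit does not appear to have saved")

Comparing the revision timestamp or the saved content would work just as well as comparing the summary; the point is only that the bot verifies before retrying.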

(In reply to comment #5)

Checking the API logs, I see several identical edit requests that all took 40-50 seconds to process from start to finish. And looking at the page history, I see your bot has runs where it makes an edit every minute with the only difference being the "This table was last updated on" timestamp.

So it seems that what's probably going on here is that your API request is taking too long to process so the proxy times out and gives you an error page. And then your bot just keeps hammering away trying to make the same edit. A better idea here would be to catch the 503, wait a minute or two, and then see if the edit eventually saved.

Even better would be not to have it return a 503 to begin with. If the framework can't find a servedby parameter or a requestid, it considers the attempt failed. I've also had another user come to IRC with the exact same problem, and this was never a problem before, so clearly what needs to be fixed is not on my end.

(In reply to comment #5)

Checking the API logs, I see several identical edit requests that all took 40-50 seconds to process from start to finish. And looking at the page history, I see your bot has runs where it makes an edit every minute with the only difference being the "This table was last updated on" timestamp.

So it seems that what's probably going on here is that your API request is taking too long to process so the proxy times out and gives you an error page. And then your bot just keeps hammering away trying to make the same edit. A better idea here would be to catch the 503, wait a minute or two, and then see if the edit eventually saved.

And to add on: if the edit is going through, a human can see that, but a bot clearly can't when it's thrown an error message it can't parse, since the output isn't XML, PHP, or JSON formatted. Also, why is it taking 50 seconds to process?

(In reply to comment #6)

If the framework can't find a servedby parameter or a requestid, it considers the attempt failed.

Then fix your framework.

(In reply to comment #7)

And to add on: if the edit is going through, a human can see that, but a bot clearly can't when it's thrown an error message it can't parse, since the output isn't XML, PHP, or JSON formatted.

The bot should be able to detect the fact that the HTTP status code is 503. Then, just like a human might, the bot can wait a minute and then check if the edit showed up in the page history after all.

Also, why is it taking 50 seconds to process?

I'll try to look into that.

[Setting priority again to "normal". Please keep it there.]

Some data:

  • Parsing the new page text to extract external links for SpamBlacklistHooks::filterAPIEditBeforeSave takes 11 seconds.
  • Then that parse info gets thrown away by a call to Article::clear(), as part of making sure there aren't edit conflicts.
  • TemplateDataHooks::onPageContentSave needs a parse, too, so there goes another 11 seconds. And for some reason it passes null for the $serialization_format in the call to WikiPage::prepareContentForEdit, instead of the format actually being used for the parse.
  • Then saving the edit into the database needs the parse info too. And since the saved parse info from TemplateDataHooks::onPageContentSave has the wrong serialization format, it can't be reused so there goes another 11 seconds.

And that seems to about cover it: in my testing this morning, saving your page was taking about 33 seconds. If things are overloaded, each of those 11 seconds could well be more.
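
For reference, the per-parse cost can be estimated from the client side by timing one fresh parse of the page's wikitext via the API. This is only a rough sketch: it assumes the legacy formatversion=1 JSON result layout and an anonymous request, and it measures one of the three parses described above, not the whole save path.

import time
import requests

API = "https://en.wikipedia.org/w/api.php"    # the wiki from this report
TITLE = "User:Cyberbot I/Current AfD's"       # the page from this report

# Fetch the page's current wikitext.
rev = requests.get(API, params={
    "action": "query", "prop": "revisions", "titles": TITLE,
    "rvprop": "content", "rvlimit": 1, "format": "json",
}).json()
wikitext = next(iter(rev["query"]["pages"].values()))["revisions"][0]["*"]

# Time one fresh parse of that wikitext; the save path above does roughly
# three of these, which is where the ~33 seconds comes from.
start = time.time()
requests.post(API, data={
    "action": "parse", "text": wikitext, "title": TITLE,
    "prop": "text", "format": "json",
}).raise_for_status()
print("one parse took %.1fs" % (time.time() - start))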

Change 95481 had a related patch set uploaded by Aaron Schulz:
Avoid extra parsing on edit

https://gerrit.wikimedia.org/r/95481

Change 95481 merged by jenkins-bot:
Avoid extra parsing on edit

https://gerrit.wikimedia.org/r/95481

Change 95519 had a related patch set uploaded by Aaron Schulz:
Avoid extra parsing in prepareContentForEdit()

https://gerrit.wikimedia.org/r/95519

Change 95519 merged by jenkins-bot:
Avoid extra parsing in prepareContentForEdit()

https://gerrit.wikimedia.org/r/95519

We are still having this problem on labs.

POST: http://en.wikipedia.org/w/api.php
API Error...

Code: error503
Text: HTTP Error 503
The webserver's service is currently unavailable, however, the edit appears to have gone through.

This is still getting returned by my framework (Peachy), after the suggested update I made to it. Has this patch been reviewed and submitted yet?

Patches have been merged, as you can see above, and should go out with 1.23wmf5. See https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap for the schedule.

Change 97634 had a related patch set uploaded by Ori.livneh:
Avoid extra parsing in prepareContentForEdit()

https://gerrit.wikimedia.org/r/97634

Change 97634 merged by jenkins-bot:
Avoid extra parsing in prepareContentForEdit()

https://gerrit.wikimedia.org/r/97634

As I understand it, this was backported and deployed by Ori and Faidon yesterday.

Still getting errors. Will be happy to post what my framework is reporting.

I think there's still double rendering going on. On Commons, a null edit takes just slightly more than double the time a page preview takes. On my local copy, the two operations take roughly the same time (as they should).

I still see TemplateDataHooks::onPageContentSave mentioned in dbperformance.log

(In reply to comment #22)

I still see TemplateDataHooks::onPageContentSave mentioned in dbperformance.log

I think that's expected. As I read it, the fix above is meant to work by having the hook do the parsing, and then have core skip doing the parsing.

Testing locally with TemplateData, this seems to work, with parsing only happening in the TemplateData extension on page save.

<aside>It'd be nice if there were some way to do ?forceprofile=true/?forcetrace=true on page save. Maybe $wgDebugRedirects could somehow be triggered by a URL parameter, and whatever magic makes forcetrace work in places where forceprofile doesn't could be made to happen.</aside>

Change 101224 had a related patch set uploaded by Anomie:
Fix WikiPage::prepareContentForEdit's default format handling

https://gerrit.wikimedia.org/r/101224

(In reply to comment #21)

I think there's still double rendering going on. On Commons, a null edit takes just slightly more than double the time a page preview takes. On my local copy, the two operations take roughly the same time (as they should).

I redid the same sort of simulation as I did in comment 10 (anyone with access to terbium, feel free to look at /home/anomie/eval-bug57026.txt), and it seems there is no longer double parsing going on before the text is saved to the database on enwiki. I also checked Commons by simulating a null edit on a random page, and again I see no double parsing.

Doing a non-null edit on Commons, I see that abuse filter rule 87's use of added_links is triggering reparses (it seems to be broken such that it runs added_links for all actions). While it could probably use at least the same fix as in Gerrit change 95481 (and probably some checking to see whether it would break the cache by passing the old wikitext for some vars), I submitted Gerrit change 101224 to make core handle a null format better.

Also, there might still be a second parse from WikiPage::doEditUpdates() if the page uses magic words that access information about the current revision or if the page transcludes itself. I don't see any way around that.

And I'm not entirely sure that just timing a null edit isn't going to cause a parser cache miss for the view-after-save, particularly if your preferences don't match the defaults. Do you still see double timing if you do the null edit via the API?

Change 101224 merged by jenkins-bot:
Fix WikiPage::prepareContentForEdit's default format handling

https://gerrit.wikimedia.org/r/101224

(In reply to comment #25)

Also, there might still be a second parse from WikiPage::doEditUpdates() if the page uses magic words that access information about the current revision or if the page transcludes itself. I don't see any way around that.

And I'm not entirely sure that just timing a null edit isn't going to cause a parser cache miss for the view-after-save, particularly if your preferences don't match the defaults. Do you still see double timing if you do the null edit via the API?

I was looking at Firebug and recording the time it took to complete the POST request that results in a redirect. After the redirect, Firefox does a GET request, which usually parses the page again; I didn't count that.

I was testing https://commons.wikimedia.org/wiki/Commons:Featured_picture_candidates/Log/November_2013 and https://commons.wikimedia.org/wiki/Commons:Featured_picture_candidates/Log/October_2013 which a user at commons had previously identified as being slow.

I don't believe these pages are vary-revision, but it's hard to be 100% sure of that.

I was testing via index.php. When I test via api.php, page save actions take the same time as rendering a page preview, which is interesting. Thus the issue I'm seeing seems to be from the web interface only.

Change 101490 had a related patch set uploaded by Anomie:
Use WikiPage::prepareContentForEdit in SpamBlacklistHooks::filterMergedContent

https://gerrit.wikimedia.org/r/101490

(In reply to comment #27)

I was testing via index.php. When I test via api.php, page save actions take the same time as rendering a page preview, which is interesting. Thus the issue I'm seeing seems to be from the web interface only.

I did some investigating of the UI edit code path, and found that the culprit there is SpamBlacklistHooks::filterMergedContent.

Change 101490 merged by jenkins-bot:
Use WikiPage::prepareContentForEdit in SpamBlacklistHooks::filterMergedContent

https://gerrit.wikimedia.org/r/101490

We've cleaned up more instances of double-parsing, including one in the UI code path. Latest changes should go out with 1.23wmf9.

Before reopening this bug, please test that the timeouts are due to page save taking much longer than an equivalent preview (or API action=parse), that the page being saved does not use revision-specific magic words or parser functions, and that it does not transclude itself.
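
One rough way to make that save-vs-preview comparison from the client side is sketched below. It is only a sketch: the wiki and page are just the Commons examples from this thread, token handling is omitted (the "..." is a placeholder), a logged-in requests.Session is assumed, and saving the unchanged wikitext is effectively a null edit.

import time
import requests

API = "https://commons.wikimedia.org/w/api.php"    # adjust to the affected wiki
TITLE = "Commons:Featured picture candidates/Log/November 2013"

session = requests.Session()    # assumed to be logged in already
csrf_token = "..."              # obtain an edit token through your framework

def timed(label, fn):
    start = time.time()
    fn()
    print("%s: %.1fs" % (label, time.time() - start))

# Current wikitext of the page (legacy formatversion=1 layout assumed).
page = next(iter(session.get(API, params={
    "action": "query", "prop": "revisions", "titles": TITLE,
    "rvprop": "content", "rvlimit": 1, "format": "json",
}).json()["query"]["pages"].values()))
wikitext = page["revisions"][0]["*"]

# Equivalent preview: parse the same wikitext.
timed("action=parse", lambda: session.post(API, data={
    "action": "parse", "text": wikitext, "title": TITLE,
    "prop": "text", "format": "json",
}))

# Null edit: save the unchanged wikitext. If this takes much longer than the
# parse above, the extra time is going into the save path, not the rendering.
timed("action=edit (null)", lambda: session.post(API, data={
    "action": "edit", "title": TITLE, "text": wikitext,
    "token": csrf_token, "format": "json",
}))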

Wikidata is calling Edit hooks with non-editable pages as context, which breaks Gerrit change 101490. Gerrit change 106126 can be merged to reinstate it once bug 59788 is fixed.

The problem arises because Wikibase generates titles for new pages from a counter stored in the database, which is incremented in the transaction that actually creates the page. So, before the page is saved, the title isn't known. When calling a pre-save edit hook in such a situation, there is no way to provide the actual title of the page that will be created.

Perhaps it would be possible to grab a fresh ID and determine the Title earlier - but that means wasting an ID for every failed attempt to save. It also means refactoring quite a bit of code in Wikibase, on a critical code path.

I think it's really a fundamental question whether the actual title of a page can be assumed to be known before the page has been created. A related fundamental question is whether the context title of an edit is always the title of the page where the new content will be saved, or whether there are situations where this cannot sensibly be expected to be the case.

Date/Time: Thu, 09 Jan 2014 12:27:52 +0000
Method: POST
URL: http://cy.wikipedia.org/w/api.php (Parameters masked for security)
Raw Data:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Wikimedia Error</title>
<meta name="author" content="Mark Ryan"/>
<meta name="copyright" content="(c) 2005-2007 Mark Ryan and others. Text licensed under the GNU Free Documentation License. http://www.gnu.org/licenses/fdl.txt"/>

<style type="text/css"><!--
body {
background-color: #dbe5df;
font: 100% "Gill Sans MT", "Gill Sans", "Trebuchet MS", Helvetica, sans-serif;
margin: 0px;
}
.TechnicalStuff {
font-size: 0.8em;
font-style: italic;
text-align: center;
padding-bottom: 0.8em;
}
.Lines {
width: 100%;
height: 1px;
overflow: hidden;
font-size: 0.5px;
}
.ContentArea {
background-color: white;
padding: 0.8em 10% 0 10%;
font-size: 1.0em;
}
a:hover {
color: red;
}
h1, h2 {
margin: 0px;
font-size: 1.0em;
}
h2 {
background: #9fbfd8;
font-size: 1.2em;
font-weight: bold;
text-align: center;
}
h1 {
background: #dbe5df;
font: bold 1.5em "Gill Sans MT", "Gill Sans", Helvetica, Humanist, sans-serif;
text-transform: uppercase;
text-align: center;
width: 100%;
padding-top:0.8em;
}
-->
</style>
<script>/*<![CDATA[*/
function lines(s) {

		var c = s.split(' ');
		for (var i = 0; i < c.length; i++) {
			document.write('<div class="Lines" style="background-color:#' + c[i] + ';"></div>');
		}

}
//]]></script>
</head>

<body link="#24442E" text="#000000" vlink="#24442E" alink="#FF0000">
<h1>Wikimedia Foundation</h1>
<script>lines('ccd4cf bdc3bf adb1af 9ea09f dbe5df');</script>

<h2>Error</h2>

<script>lines('8f8f8f acacac c6c6c6 dbdbdb eaeaea f4f4f4');</script>

<!-- BEGIN CONTENT AREA -->
<div class="ContentArea">

<div id="en" lang="en">
<p>Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please <a href="//w/api.php" onclick="window.location.reload(false); return false">try again</a> in a few minutes.</p>
<hr noshade="noshade" size="1px" width="80%" />
<div class="TechnicalStuff">
If you report this error to the Wikimedia System Administrators, please include the details below.<br/>
</div>
<div class="TechnicalStuff">
<bdo dir="ltr">
Request: POST http://cy.wikipedia.org/w/api.php, from 10.64.0.102 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 3544388602<br/>Forwarded for: 10.4.1.153, 10.64.0.102<br/>Error: 503, Service Unavailable at Thu, 09 Jan 2014 12:28:08 GMT
</bdo>
</div>
</div>

</div>

<script>lines('9ea09f adb1af bdc3bf ccd4cf');</script>

</body>
</html>

UNSERIALIZATION FAILED

(In reply to comment #34)

Date/Time: Thu, 09 Jan 2014 12:27:52 +0000
Method: POST
URL: http://cy.wikipedia.org/w/api.php (Parameters masked for security)

That makes it difficult to debug this. But I don't see any long times in the API log around that time (the worst time so far today for cywiki was only 14186ms, far below the point where this bug shows up). I also don't see any errors or exceptions in exception.log or fatal.log around this time, which means it's probably unrelated to this specific bug (errors caused by long parse times).

Change 107266 had a related patch set uploaded by MarkAHershberger:
Avoid extra parsing in prepareContentForEdit()

https://gerrit.wikimedia.org/r/107266

Change 107266 merged by jenkins-bot:
Avoid extra parsing in prepareContentForEdit()

https://gerrit.wikimedia.org/r/107266

One of the patches was backported to core stable (causing bug 60054); does this mean there was some urgency for core too, and that this is not a Wikimedia-only bug? Please categorise it as appropriate.
What else needs to be done here? Cyberpower678, are you still seeing errors?

Main patches seem to be against core, so moving accordingly.

All six patches mentioned in this bug report have been merged.

What is left here?

(In reply to Andre Klapper from comment #40)

All six patches mentioned in this bug report have been merged.

What is left here?

One of the patches was reverted due to issues in Wikidata, see bug 59788.

Nemo_bis assigned this task to Anomie.
Nemo_bis set Security to None.