
beta labs mysteriously goes read-only overnight
Closed, ResolvedPublic

Description

I've been seeing this recently in the overnight runs of the browser tests. The build for VisualEditor will fail with a modal dialog that says "Error loading data from server: readonly. The wiki is currently in read-only mode. Would you like to retry?"

Here is an example from the overnight run Sunday 18 May: https://wmf.ci.cloudbees.com/job/VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome/512/testReport/(root)/VisualEditor/Edit_with_strings__outline_example_____Editing_with_%C3%84%C3%8B%C3%8F%C3%96%C3%9C___Editing_with_%C3%84%C3%8B%C3%8F%C3%96%C3%9C___/

I can't think of any reason why beta labs would be in read-only mode late on a Sunday (PDT).

I suspect this may also be the cause of occasional failures in other builds that report less information, for example the "too many connection resets (due to Net::ReadTimeout - Net::ReadTimeout)" error that we see in the MobileFrontend builds: https://wmf.ci.cloudbees.com/job/MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox/571/testReport/(root)/Check%20UI%20components/Check_existence_of_important_UI_components_on_other_pages_/


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=68349

Details

Reference
bz65486

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:23 AM
bzimport set Reference to bz65486.
bzimport added a subscriber: Unknown Object (MLST).

If I recall correctly, this is something that can happen when things go sideways with the database. Not sure if that's what's going on here, but may be worth looking into.

In one Sauce Labs failure, the test was POSTing to "http://en.wikipedia.beta.wmflabs.org/wiki/User:Selenium_user/firefox?vehidebetadialog=true&veaction=edit"

The message:

The wiki is currently in read-only mode. Would you like to retry?

This comes from ApiBase::dieReadOnly(). That method seems to be called only when wfReadOnly() is true, which is legacy code that lets us create a file on the cluster to disable edits entirely.

There is roughly a 0% chance it is being triggered that way unless something is messing with $wgReadOnly. So most probably the i18n message is being reused by another code path.

FWIW VisualEditor doesn't know about <readonlytext> – it's just passing on what it gets from the API.

Right, and we know about the unexpected readonly status only because VisualEditor seems to be the only thing that displays that error in a JavaScript confirm modal dialog. It might manifest in other ways that we would not see if not for the modal dialog that stops the test.
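
To make the read-only condition visible outside of VisualEditor's dialog, a test helper could inspect API responses directly. The following is only a minimal sketch: the function name `readonly_reason` is hypothetical, and the response shape (an `error` object with `code` and `info` keys) is an assumption based on the "readonly" code and message text quoted in the description, not a confirmed contract.

```python
from typing import Any, Mapping, Optional


def readonly_reason(response: Mapping[str, Any]) -> Optional[str]:
    """Return the error text if a decoded API response reports read-only mode.

    Assumed shape: {"error": {"code": "readonly", "info": "..."}}.
    Returns None for any other response.
    """
    error = response.get("error")
    if not isinstance(error, Mapping):
        return None
    if error.get("code") != "readonly":
        return None
    # Fall back to the bare code if the response carries no info text.
    return error.get("info") or "readonly"
```

A browser test could call this on each decoded API response and log a distinct failure reason, so that read-only episodes are not mistaken for generic timeouts.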

Created attachment 15558
Screenshot

I reproduced this issue today on beta labs; screenshot attached.

Attached:

Screen_Shot_2014-06-03_at_11.37.15_AM.png (509×1 px, 227 KB)

Antoine, would these messages be relevant? They do not seem to happen at any particular interval but they might be correlated to the time at which Rummana saw the problem.

@deployment-bastion:/data/project/logs$ tail -f dberror.log
Tue Jun 3 17:17:09 UTC 2014 deployment-apache01 testwiki Error connecting to 10.68.17.94: :real_connect(): (42000/1049): Unknown database 'testwikidatawiki'
Tue Jun 3 17:17:09 UTC 2014 deployment-apache01 testwiki Connection error: No working slave server: Unknown error (10.68.17.94)
Tue Jun 3 17:17:09 UTC 2014 deployment-apache01 testwiki Error connecting to 10.68.17.94: :real_connect(): (42000/1049): Unknown database 'testwikidatawiki'
Tue Jun 3 17:17:09 UTC 2014 deployment-apache01 testwiki Connection error: No working slave server: Unknown error (10.68.17.94)
Tue Jun 3 17:17:09 UTC 2014 deployment-apache01 testwiki Error connecting to 10.68.17.94: :real_connect(): (42000/1049): Unknown database 'testwikidatawiki'
Tue Jun 3 17:17:09 UTC 2014 deployment-apache01 testwiki Connection error: No working slave server: Unknown error (10.68.17.94)
Tue Jun 3 17:50:48 UTC 2014 deployment-apache01 testwiki Error connecting to 10.68.17.94: :real_connect(): (42000/1049): Unknown database 'testwikidatawiki'
Tue Jun 3 17:50:48 UTC 2014 deployment-apache01 testwiki Connection error: No working slave server: Unknown error (10.68.17.94)
Tue Jun 3 19:20:48 UTC 2014 deployment-apache01 testwiki Error connecting to 10.68.17.94: :real_connect(): (42000/1049): Unknown database 'testwikidatawiki'
Tue Jun 3 19:20:48 UTC 2014 deployment-apache01 testwiki Connection error: No working slave server: Unknown error (10.68.17.94)
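
To check whether these errors line up with the times the read-only dialog was seen, the log could be grouped by timestamp. This is a rough sketch that assumes every entry follows the layout shown above (six timestamp fields, then host, wiki, and message); the names `DbError`, `parse_line`, and `burst_times` are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional, Tuple


class DbError(NamedTuple):
    when: datetime
    host: str
    wiki: str
    message: str


def parse_line(line: str) -> Optional[DbError]:
    """Parse one dberror.log line; return None if it does not match the layout."""
    # Six timestamp tokens, host, wiki, then the rest of the line as the message.
    parts = line.split(None, 8)
    if len(parts) < 9:
        return None
    try:
        when = datetime.strptime(" ".join(parts[:6]), "%a %b %d %H:%M:%S UTC %Y")
    except ValueError:
        return None
    return DbError(when, parts[6], parts[7], parts[8].rstrip())


def burst_times(lines: Iterable[str]) -> List[Tuple[datetime, int]]:
    """Count parsed errors per timestamp, sorted chronologically."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry.when] += 1
    return sorted(counts.items())
```

The resulting timestamps are in UTC, so they would need converting before comparing them with a locally timed screenshot or test report.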

Adding Sean Pringle. This seems to be getting worse. I'd like to either update the db less often or else make it less disruptive.