Intermittent "cannot contact the database server" on https://en.wikipedia.org/
Closed, ResolvedPublic

Description

I've intermittently been getting "(Cannot contact the database server: Unknown error (10.0.6.42))" errors on https://en.wikipedia.org. This most recent time happened when trying to preview an edit. I don't edit much, but I've gotten the error a few times over the past week. It'd be nice if someone could check the frequency of such errors and examine what the underlying issue is.


Version: unspecified
Severity: normal

Details

Reference
bz31530

Event Timeline

bzimport raised the priority of this task to Unbreak Now! on Nov 21 2014, 11:58 PM.
bzimport set Reference to bz31530.
bzimport added a subscriber: Unknown Object (MLST).

Not that useful for ordinary people. But if there's an RT ticket saying something like "db32 randomly drops connections", should this bug be closed?

(In reply to comment #2)

Not that useful for ordinary people. But if there's an RT ticket saying
something like "db32 randomly drops connections", should this bug be closed?

Unless that RT ticket contains top-secret information, Bugzilla should always take precedence. Ops needs to get better about using RT only when absolutely necessary.

No. Definitely don't close bugs if an RT is created. We are looking for better ways to update both ways. I'd prefer we have a public way of tracking info.

Just got "(Can't contact the database server: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost))" on https://en.wikipedia.org.

(In reply to comment #5)

Just got "(Can't contact the database server: Can't connect to local MySQL
server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost))" on
https://en.wikipedia.org.

What was the URL? Was the error message inside a MediaWiki skin, or was it just a blank page with an error message? If the navigation elements were there, did they look normal, or was the site name incorrect?

titoxd.wikimedia wrote:

(In reply to comment #7)

(In reply to comment #5)

Just got "(Can't contact the database server: Can't connect to local MySQL
server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost))" on
https://en.wikipedia.org.

What was the URL? Was the error message inside a MediaWiki skin, or was it just
a blank page with an error message? If the navigation elements were there, did
they look normal, or was the site name incorrect?

I ran into the same error myself yesterday, but on http, not https. I found it when clicking on an internal link to http://en.wikipedia.org/wiki/2011_Pacific_hurricane_season. No MediaWiki skin was visible, just a white page with the localhost error message and a search bar. Unfortunately, I can't seem to replicate the problem consistently in any way.

When there's a connection error, a log entry is written by LoadBalancer, not Database. If an extension is creating its own Database objects with incorrect configuration, that would explain the lack of connection errors in dberror.log.

(In reply to comment #9)

When there's a connection error, a log entry is written by LoadBalancer, not
Database. If an extension is creating its own Database objects with incorrect
configuration, that would explain the lack of connection errors in dberror.log.

Actually none of that is true. Maybe an extension could make these errors somehow, but I'm not sure how.

The linked diff says "the one I'm getting has a "(Can't contact the database server: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost))" on it", which would be very wrong. It would mean trying to connect to a MySQL server running on the apaches themselves!

afeldman wrote:

A change to the job queue system in 1.18, made to fix an issue where the job
runners were hammering the enwiki master, resulted in a high number of locks,
triggering this MySQL bug: http://bugs.mysql.com/bug.php?id=49047 (thanks domas!)

r99650 removes the lock issue, and since deploying it we haven't seen any
connection errors to db32. I am going to build and package mysql 5.1.52@fb in
the near future, which includes a fix for MySQL bug 49047, after which we can
try reverting r99650.

Considering the cause and fix, it definitely seems that bugzilla was the correct place for this.