
populateBacklinkNamespace script causing massive slave lag on beta
Closed, Resolved · Public

Description

See https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/2741/console

The ones completed so far (timestamp is "Elapsed time"):

00:00:17.131 deployment-bastion-eqiad,enwikinews completed with result SUCCESS
00:00:17.131 deployment-bastion-eqiad,enwikiquote completed with result SUCCESS
01:02:40.299 deployment-bastion-eqiad,eswiki completed with result SUCCESS
01:02:40.305 deployment-bastion-eqiad,enwikibooks completed with result SUCCESS
01:02:40.305 deployment-bastion-eqiad,ee_prototypewiki completed with result SUCCESS
01:02:40.305 deployment-bastion-eqiad,testwiki completed with result SUCCESS
01:02:40.306 deployment-bastion-eqiad,eowiki completed with result SUCCESS

In other words, just doing eswiki took almost an hour.

We can't have the Beta Cluster throwing database locked errors for the entire day.


Version: unspecified
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65486

Details

Reference
bz68349

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:36 AM
bzimport set Reference to bz68349.
bzimport added a subscriber: Unknown Object (MLST).

To clarify: is this every time, or just a specific update?

If the schema is already up to date, it should finish within seconds (especially if the --quick option is present to skip the 5-second delay).

The previous job was aborted by hashar, and prior to that it failed on enwiki. So, for reference, the past two runs failed.

To be explicit: this causes browser tests to fail because the database is in read-only mode (ie: no edits can be made).

(In reply to Bawolff (Brian Wolff) from comment #1)

To clarify: is this every time, or just a specific update?

If the schema is already up to date, it should finish within seconds
(especially if the --quick option is present to skip the 5-second delay).

It normally completes quickly, e.g. https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/2730/console took 32 seconds.

For reference, according to https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=eswiki/lastBuild/console

the step taking a lot of time is:
00:00:17.120 Updating *_from_namespace fields in links tables.

Which is to be expected when you're updating that huge a table. (update from b8c038f6784ef0820)


Also, there's a 3-second jump at:

00:00:13.669 ...afl_namespace in table abuse_filter_log already modified by patch /mnt/srv/scap-stage-dir/php-master/extensions/AbuseFilter/db_patches/patch-afl-namespace_int.sql.
00:00:16.808 ...user_daily_contribs table already exists.

which is longer than I would expect (but not an issue).

We can't have the Beta Cluster throwing database locked errors for the entire day.

At first glance, I don't see any reason why this update should lock the database.

(In reply to Greg Grossmeier from comment #3)

To be explicit: this causes browser tests to fail because the database is in
read-only mode (ie: no edits can be made).

Just to write down what was said in IRC:

update.php caused massive slave lag (91 minutes currently; see http://en.wikipedia.beta.wmflabs.org/w/index.php?maxlag=-1 ), triggering MediaWiki to auto-lock the db. The wfWaitForSlaves() function in the update script is ineffective because it will only wait for at most 10 seconds.
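
For illustration only (this is not MediaWiki's actual implementation, just a self-contained sketch of the behaviour being described): a wait that is capped at 10 seconds returns even while the slave is still hours behind, so the script resumes writing and the lag keeps growing.

// Hypothetical sketch of a capped wait, not the real wfWaitForSlaves().
// $getLagSeconds is any callable that reports the slave's current lag.
function waitForSlavesCapped( callable $getLagSeconds, $capSeconds = 10 ) {
    $deadline = time() + $capSeconds;
    while ( $getLagSeconds() > 0 && time() < $deadline ) {
        sleep( 1 ); // give replication a chance to catch up
    }
    // After $capSeconds this returns no matter what, and the caller goes back
    // to issuing writes against a slave that may still be hours behind.
}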

Assuming it's running them in order (which seems likely), as of this comment it's taken 106 minutes to do 27k of BL-Wikidata's ~30k pages, so roughly an hour to do 15k pages; pages to come:

ruwiki ?k
metawiki 1k
simplewiki 232k
zhwiki ?k
hewiki 3k
enwikiversity 0k
enwiktionary 2k
commonswiki 28k
ukwiki ?k
en_rtlwiki ?k
sqwiki ?k
fawiki ?k
enwikisource 0k
kowiki ?k
dewiki 2k
jawiki ?k
labswiki ?k
arwiki ?k
cawiki ?k
hiwiki ?k
aawiki ?k
loginwiki 0k
enwiki 26k

That's more than ~300k pages to go (assuming the wikis I couldn't reach due to service timeouts – "?k" – are roughly 1k pages each), which will take a further 21 hours to complete.

At this point I'd suggest that we drop the valueless simplewiki clone (232k pages on a test wiki is insane) and call it the best of a bad job.

So right now the update script does a lot of little queries: batch sizes of 200, where each batch involves one SELECT for all the page_ids in the range plus separate UPDATE queries for each page_id. In total each batch has 1 SELECT and ~600 UPDATE queries, roughly as sketched below.
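
(Illustrative only, not the actual populateBacklinkNamespace.php code; it just mirrors the batching pattern described above, using MediaWiki's database helpers.)

// One SELECT per block of 200 page_ids, then one UPDATE per page per link
// table, i.e. about 3 * 200 = 600 UPDATEs per block.
$dbw = wfGetDB( DB_MASTER );
$batchSize = 200;
$blockStart = 1; // first block; the real script loops over all blocks
$blockEnd = $blockStart + $batchSize - 1;
$res = $dbw->select(
    'page',
    array( 'page_id', 'page_namespace' ),
    array( "page_id BETWEEN $blockStart AND $blockEnd" ),
    __METHOD__
);
foreach ( $res as $row ) {
    $linkTables = array( 'pagelinks' => 'pl', 'templatelinks' => 'tl', 'imagelinks' => 'il' );
    foreach ( $linkTables as $table => $prefix ) {
        $dbw->update(
            $table,
            array( "{$prefix}_from_namespace" => $row->page_namespace ),
            array( "{$prefix}_from" => $row->page_id ),
            __METHOD__
        );
    }
}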

Perhaps it would be more efficient to do something like

UPDATE pagelinks, page
SET pagelinks.pl_from_namespace = page.page_namespace
WHERE pagelinks.pl_from = page.page_id
AND pagelinks.pl_from BETWEEN $blockStart AND $blockEnd;

to get rid of the overhead of so many small queries? I don't really know.
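
A minimal sketch of that idea, assuming MediaWiki's 2014-era maintenance helpers (not the actual script; templatelinks and imagelinks would get analogous statements):

// Sketch only: one multi-table UPDATE per block instead of ~600 single-row
// UPDATEs, waiting for the slaves between blocks.
$dbw = wfGetDB( DB_MASTER );
$batchSize = 200;
$maxPageId = (int)$dbw->selectField( 'page', 'MAX(page_id)', '', __METHOD__ );
for ( $blockStart = 1; $blockStart <= $maxPageId; $blockStart += $batchSize ) {
    $blockEnd = $blockStart + $batchSize - 1;
    $dbw->query(
        "UPDATE pagelinks, page " .
        "SET pagelinks.pl_from_namespace = page.page_namespace " .
        "WHERE pagelinks.pl_from = page.page_id " .
        "AND pagelinks.pl_from BETWEEN $blockStart AND $blockEnd",
        __METHOD__
    );
    wfWaitForSlaves(); // still capped at ~10 seconds, see the lag discussion above
}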

I guess at the very least the batch size should be much smaller. I also wonder if something is perhaps wrong with deployment-db2; it has no entry in the wmflabs Ganglia.

Change 148296 had a related patch set uploaded by Brian Wolff:
Reduce batch size of populateBacklinkNamespace from 200 to 20

https://gerrit.wikimedia.org/r/148296

(In reply to Gerrit Notification Bot from comment #10)

Change 148296 had a related patch set uploaded by Brian Wolff:
Reduce batch size of populateBacklinkNamespace from 200 to 20

https://gerrit.wikimedia.org/r/148296

This is only somewhat related to the bug (as in, it would help if it had been there from the get-go, but probably not that much unless we restart the update script).


If it's really important that beta "work" for people, one possibility as a temporary hack would be to add something like the following to beta's config file (after the point where db-labs.php is loaded):

if ( !$wgCommandLineMode ) {
    // Web requests only: stop sending reads to the lagged slave.
    unset( $wgLBFactoryConf['sectionLoads']['DEFAULT']['deployment-db2'] );
}

This would make the web interface ignore the lagged slave (the update script will still wait for it in 10-second intervals). Things would be editable again, and load on the labs master db would increase by quite a bit (but it's beta; what's the worst-case scenario here?). [You should probably run this idea by someone else before actually doing it.]

  • Bug 68373 has been marked as a duplicate of this bug.

Created attachment 15999: log of update.php for the beta cluster simplewiki.

On beta simplewiki (which has roughly 250k pages), the console run is https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=simplewiki/2742/console (log attached to this report).

Aaron Schulz might be interested.

FWIW, the update.php job finished successfully. Lag on deployment-db2 seems to be holding at about 3 hours and 20 minutes for now. Things will probably be back to normal in several hours.

Slave lag is back down to 0. Guess this is fixed.

I want to leave this open until we've figured out if we can prevent this from happening again.

(In reply to Greg Grossmeier from comment #17)

I want to leave this open until we've figured out if we can prevent this
from happening again.

Well, the update is done. The update only gets run once, so it won't happen again on the beta wiki unless someone manually runs populateBacklinkNamespace.php --force or deletes the relevant entry in the updatelog table.

The deeper issue, of course, is that the population script had too big a batch size. If you want to see whether that is fixed, I guess it might make sense to remove the line from updatelog on beta after merging the patch from comment 10 and see whether the update still explodes.
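
For context, this is roughly how that gating works (a sketch under assumptions: the exact updatelog key for this update isn't given in this report, so 'populate *_from_namespace' is a guess, and $updater stands for the DatabaseUpdater instance that update.php uses):

// Sketch of the updatelog gating described above; the key name is an assumption.
if ( $updater->updateRowExists( 'populate *_from_namespace' ) ) {
    return; // already done once on this wiki, so update.php skips it
}
// ...run populateBacklinkNamespace.php here...
$updater->insertUpdateRow( 'populate *_from_namespace' );
// Deleting that row from the updatelog table (or passing --force to the
// maintenance script directly) is what would make it run again.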

I am also wondering how we are going to handle that update in production. Might end up taking a long time as well.

(In reply to Antoine "hashar" Musso from comment #19)

I am also wondering how we are going to handle that update in production.
Might end up taking a long time as well.

Towards the end, beta was updating about 200 rows every 30 seconds. enwiki's page_ids go up to 43371588, which gives ((30/200)*43371588)/(60*60*24) = 75.2

So roughly 75 days to update enwiki (assuming similar performance, which is questionable: enwiki has a much more powerful db, so it can probably do the update faster; on the other hand, it should probably have a much smaller batch size, which could potentially slow the update down. So who knows). Anyway, taking that very rough guess at face value, if the update takes 2.5 months, I don't see any problem. There's no deadline the update has to finish by.

(In reply to Bawolff (Brian Wolff) from comment #20)

(In reply to Antoine "hashar" Musso from comment #19)

I am also wondering how we are going to handle that update in production.
Might end up taking a long time as well.

Towards the end, beta was updating about 200 rows every 30 seconds. enwiki's
page_ids go up to 43371588, which gives ((30/200)*43371588)/(60*60*24) = 75.2

So roughly 75 days to update enwiki (assuming similar performance, which is
questionable: enwiki has a much more powerful db, so it can probably do the
update faster; on the other hand, it should probably have a much smaller batch
size, which could potentially slow the update down. So who knows). Anyway,
taking that very rough guess at face value, if the update takes 2.5 months, I
don't see any problem. There's no deadline the update has to finish by.

I'm no DBA, but running three UPDATEs for every page row doesn't sound like the brightest idea. I'm pretty sure MariaDB has much nicer performance if you speak to it in SQL like you proposed in comment #9.

(In reply to Antoine "hashar" Musso from comment #19)

I am also wondering how we are going to handle that update in production.
Might end up taking a long time as well.

It already happened in production. Which is the only reason why it was merged to begin with.

Remember folks: If your code goes to production and you want to make a database change, file a Schema Change bug and have our DBA (Sean) take care of it BEFORE you merge. Aaron did that right.

Excellent! So there is nothing to talk about anymore =) Beta is happy, slave lag is back to 0 seconds.

Topic closed.

(In reply to Greg Grossmeier from comment #22)

It already happened in production. Which is the only reason why it was
merged to begin with.

I'm not sure, but in production, this update may still be in progress. The only entry in https://wikitech.wikimedia.org/wiki/Server_Admin_Log I see regarding this is under July 30: "21:04 AaronSchulz: Started populateBacklinkNamespace.php on wikidata and commons".

The schema change done before the change was merged was to add some new columns and set them to a default value of 0. The update referred to in this report ("populateBacklinkNamespace script") would happen afterward, setting the correct values for those columns. That's why a $wgUseLinkNamespaceDBFields setting was added. It is not currently enabled in production.
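
For reference, enabling it would presumably be a one-line configuration change once the population script has finished everywhere (a sketch, not an actual production change):

// Only flip this once populateBacklinkNamespace.php has filled in the new
// *_from_namespace columns; until then they are all 0 and queries relying
// on them would give wrong results.
$wgUseLinkNamespaceDBFields = true;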

Change 148296 abandoned by Brian Wolff:
Reduce batch size of populateBacklinkNamespace from 200 to 20

Reason:
gerrit change 151027 addresses the same issue but probably more robustly

https://gerrit.wikimedia.org/r/148296

physik wrote:

Is there an option to skip this update in update.php?
I tried mwscript update.php --quick --nopurge --skip-compat-checks to run the updates that follow the step
"Updating *_from_namespace fields in links tables."
but nothing helped.