
[scap] Local sync script on any individual server should be atomic
Open, Medium, Public, Feature

Description

Currently, when we make site software updates with scap, sync-common-all, etc., the web servers keep serving requests while the sync runs.

This has the unfortunate side effect that a portion of web requests will come in to a server whose copy of MediaWiki is only partially updated, which can cause transient but very scary-looking errors. A common type of error is where files in different directories are both changed and depend on each other; this is especially problematic with skin files, since skins may be synced out ahead of time. The result can be big scary PHP fatal errors or exceptions.

We want the updates to be atomic, so any given request will get _either_ the old deployment version _or_ the new version, but never a mix.

There's two main ways we could implement this:

  1. Shut down Apache before rsync, restart it after.

Simple, but could make updates slower, or leave us with most machines out of service simultaneously for a minute or two.

  2. rsync to a staging directory, then swap the entire thing out for the live one.

(I'm not sure if it's possible to totally atomically swap out two directories in posix semantics.)

or maybe also

  3. rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

This would avoid holes in response time, but we may have a magical moving directory which could be confusing madness. :)

(Another thing to consider might be keeping the 'live' skin and extension JS/CSS files in a separate subdir, so we can update those en masse first with no code safety issues, then run the code updates -- atomic per server -- guaranteeing we'll have the new css/JS on all new hits.)


Version: unspecified
Severity: enhancement
Whiteboard: deploysprint-13

Details

Reference
bz20085

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:51 PM
bzimport added a project: Deployments.
bzimport set Reference to bz20085.
bzimport added a subscriber: Unknown Object (MLST).

For any of these we would need strict control over how many machines are being activated at any point in time, along with the length of the interval before the next block of activations, in order to minimize brownouts and blackouts.
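The batching idea in this comment can be sketched as follows. This is only an illustration under stated assumptions: `batches`, `rolling_update`, the batch size, and the interval are hypothetical names and values, not part of scap, and `update_host` stands in for whatever per-host sync step is actually used.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_update(
    hosts: Sequence[str],
    size: int,
    interval_s: float,
    update_host: Callable[[str], object],
    sleep: Callable[[float], object] = time.sleep,
) -> List[str]:
    """Update hosts one batch at a time, pausing between batches.

    `update_host` is a hypothetical callback that performs the per-host step.
    Returns the hosts in the order they were updated.
    """
    done: List[str] = []
    groups = list(batches(hosts, size))
    for index, group in enumerate(groups):
        for host in group:
            update_host(host)
            done.append(host)
        # No pause is needed after the final batch.
        if index < len(groups) - 1:
            sleep(interval_s)
    return done
```

Capping the batch size bounds how many servers can be out of service at once; the interval gives the first batch time to come back before the next one starts.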

Atomic scaps are a much, much smaller issue than having failed syncs or outdated trees running in the cluster for days.

(I'm not sure if it's possible to totally atomically swap out two directories in posix semantics.)

I don't think you can do so, but you can atomically replace a symlink so that it magically points to a different directory (I am using that approach in a scap script).
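A minimal sketch of the symlink approach described here, assuming a POSIX filesystem where `rename(2)` atomically replaces the destination. The function name and the temporary-name scheme are hypothetical; this is not the commenter's actual scap script.

```python
import os
import tempfile


def point_symlink_at(link_path: str, target_dir: str) -> None:
    """Repoint `link_path` at `target_dir` with a single atomic rename.

    A symlink cannot be edited in place, so a new link is created under a
    temporary name in the same directory and then renamed over the live one.
    """
    tmp_link = f"{link_path}.tmp.{os.getpid()}"
    # Clear a leftover temporary link from an earlier interrupted run.
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(target_dir, tmp_link)
    # os.replace() uses rename() on POSIX; it replaces the link itself
    # rather than following it into the directory it points at.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    old_dir = os.path.join(root, "old")
    new_dir = os.path.join(root, "new")
    os.mkdir(old_dir)
    os.mkdir(new_dir)
    live = os.path.join(root, "live")
    point_symlink_at(live, old_dir)
    point_symlink_at(live, new_dir)
    print(os.readlink(live))
```

Any path lookup through the link sees either the old directory or the new one, never a missing link.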

or maybe also

  3. rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

This looks even better: as long as the disks don't fill up from the extra copy, it is easy to change that configuration entry, and anyone doing a sync can do a graceful restart (no root-only errors).

This would avoid holes in response time, but we may have a magical moving directory which could be confusing madness. :)

The directory could be named by the revision, so it looks logical.

Today's mobile deployment removed SkinMobileBase.php from MobileFrontend, a file that was previously loaded on every request. This provided an illustration of the problem and gives some sense of its magnitude.

MaxSem started scap at 21:16. Between 21:18:39 and 21:26:24 we had a total of 65 fatals caused by SkinMobileBase.php having been deleted prior to the calling code receiving the update:

[25-Jun-2013 23:26:18] Fatal error: require() [<a href='function.require'>function.require</a>]: Failed opening required '/usr/local/apache/common-local/php-1.22wmf7/extensions/MobileFrontend/includes/skins/SkinMobileBase.php' (include_path='/usr/local/apache/common-local/php-1.22wmf7/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-1.22wmf7:/usr/local/lib/php:/usr/share/php') at /usr/local/apache/common-local/php-1.22wmf7/includes/AutoLoader.php on line 1155

Of mw* hosts in the mediawiki-installation group, 172 had no errors, 30 had one fatal, 16 had 2 fatals, and one host had three fatals.
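As a quick check that the per-host breakdown matches the total of 65 fatals, and to see what share of hosts was affected, the figures above can be tallied directly. The variable names are illustrative only.

```python
# Number of hosts keyed by how many fatals each logged (figures from the comment above).
hosts_by_fatal_count = {0: 172, 1: 30, 2: 16, 3: 1}

total_hosts = sum(hosts_by_fatal_count.values())
total_fatals = sum(count * hosts for count, hosts in hosts_by_fatal_count.items())
affected_hosts = total_hosts - hosts_by_fatal_count[0]
affected_share = affected_hosts / total_hosts

print(f"hosts: {total_hosts}")
print(f"fatals: {total_fatals}")
print(f"affected hosts: {affected_hosts} ({affected_share:.1%})")
```

The breakdown sums to 65 fatals across 219 hosts, with 47 hosts (roughly one in five) logging at least one fatal.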

Using rsync's --delete-after or --delete-delay option would not make scap atomic, but it could still significantly reduce the rate at which these kinds of errors occur.
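Why deferring deletions helps can be shown with a toy model. This is a simplified simulation, not rsync's actual behaviour: each tree maps a file name to the set of files it requires, the file names are hypothetical stand-ins for the SkinMobileBase.php case, and a state counts as broken when some file requires one that is missing.

```python
from typing import Dict, FrozenSet, List, Tuple

Tree = Dict[str, FrozenSet[str]]


def is_consistent(tree: Tree) -> bool:
    """True if every required file is present in the tree."""
    return all(dep in tree for deps in tree.values() for dep in deps)


def sync_steps(old: Tree, new: Tree, delete_first: bool) -> List[Tuple[str, str]]:
    """List the per-file operations, with deletions either first or last."""
    deletes = [("delete", name) for name in sorted(old) if name not in new]
    writes = [("write", name) for name in sorted(new) if old.get(name) != new[name]]
    return deletes + writes if delete_first else writes + deletes


def broken_states(old: Tree, new: Tree, delete_first: bool) -> int:
    """Count the states, one after each step, in which a requirement is unmet."""
    tree = dict(old)
    broken = 0
    for op, name in sync_steps(old, new, delete_first):
        if op == "delete":
            del tree[name]
        else:
            tree[name] = new[name]
        if not is_consistent(tree):
            broken += 1
    return broken


# Hypothetical trees: the new caller.php no longer requires the removed file.
OLD: Tree = {
    "caller.php": frozenset({"SkinMobileBase.php"}),
    "SkinMobileBase.php": frozenset(),
}
NEW: Tree = {"caller.php": frozenset()}

print("delete first:", broken_states(OLD, NEW, delete_first=True))
print("delete last: ", broken_states(OLD, NEW, delete_first=False))
```

In this model, deleting last removes the broken window for file removals. It does not help when a new caller lands before a new file it depends on, which is why the comment says this reduces errors without making the sync atomic.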

(In reply to Brion Vibber from comment #0)

There's two main ways we could implement this:

  1. Shut down Apache before rsync, restart it after.
  2. rsync to a staging directory, then swap the entire thing out for the live one.

or maybe also

  3. rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

The simple fix for this that is currently in use is passing the --delay-updates option to rsync (when called from scap). This makes the rsync to any given apache copy the files to a tmp dir and then switch them over as the last step (basically, brion's suggestion #2).
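For illustration, an rsync invocation using `--delay-updates` (the full spelling of the rsync flag) might be assembled as below. This is a sketch, not scap's actual command line: the helper name, the source and destination paths, and the choice to combine it with `--delete-after` are all assumptions.

```python
import shlex
from typing import List


def build_rsync_command(
    source: str,
    dest: str,
    delay_updates: bool = True,
    delete_after: bool = True,
) -> List[str]:
    """Build an rsync argument list for a per-server sync (hypothetical helper)."""
    cmd = ["rsync", "--archive"]
    if delay_updates:
        # Stage updated files and move them into place at the end of the transfer.
        cmd.append("--delay-updates")
    if delete_after:
        # Remove files missing from the source only after the transfer finishes.
        cmd.append("--delete-after")
    cmd += [source, dest]
    return cmd


# Hypothetical source module and destination path.
example = build_rsync_command("deploy-host.example::common/", "/srv/mediawiki/")
print(shlex.join(example))
```

The list form can be passed to `subprocess.run` without a shell, which avoids quoting problems in paths.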

There are also ideas on cluster-wide atomicity, but I'm calling that out of scope for this bug :).

Is --delay-updates good enough for this bug for now?

One thing --delay-updates doesn't fix is that the l10n cache isn't rebuilt on each node until after all nodes get the new code. This leads to errors like the ones seen yesterday.

This bug not yet being fixed caused bug 63791 today, the fifth or sixth VE breakage from this that I recall. :-( It'd be really great if we could get it fixed some time soon.

Greg: Should this have higher priority and an assignee set?

It requires re-writing the deployment system and the requirement is on the list of known issues for that work.

greg raised the priority of this task from Low to Medium. Jan 8 2015, 6:05 PM
greg moved this task from To Triage to Backlog (Tech) on the Deployments board.

This bug not yet being fixed caused T65791 today, the fifth or sixth VE breakage from this that I recall. :-( It'd be really great if we could get it fixed some time soon.

Actually, that was caused by T47877 (which is about some requests to newer-version servers interacting badly with older-version servers whilst in the middle of a scap). This task (T22085) is about a single request causing errors because its own file system is in the middle of an rsync that replaces files in /srv/mediawiki, causing e.g. a class to be missing or a variable to be undefined.

For example, deployment of 646b1441d0ef83c caused a brief flood of the following in the logs:

Notice: Undefined variable: wmgVisualEditorConsolidateFeedback in /srv/mediawiki/wmf-config/CommonSettings.php on line 2033

Because for a short time an individual server had the newer version of CommonSettings.php but not yet the updated InitialiseSettings.php.

Krinkle renamed this task from [scap] [l10n] Atomic updates for sync scripts to [scap] Local sync script on any individual server should be atomic. Dec 10 2015, 10:41 PM
Krinkle set Security to None.
Krinkle removed a subscriber: wikibugs-l-list.

Because for a short time an individual server had the newer version of CommonSettings.php but not yet the updated InitialiseSettings.php.

I hate that error, but I've never been able to bring myself to complicate the rsync commands used by a full scap to exclude CommonSettings.php and then sync it later. T73212: Make it possible to quickly and programmatically pool and depool application servers and T104352: Make scap able to depool/repool servers via the conftool API will really be the right fix for this class of problem.

Because for a short time an individual server had the newer version of CommonSettings.php but not yet the updated InitialiseSettings.php.

I hate that error, but I've never been able to bring myself to complicate the rsync commands used by a full scap to exclude CommonSettings.php and then sync it later.

Agreed, and that would actually cause problems since the dependency isn't always in the same direction. It's the responsibility of the deployer to sync files separately.
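The point about dependency direction can be illustrated with a small sketch. It assumes a simple two-file case where one file defines a setting and another uses it; `sync_order` and its arguments are hypothetical and do not correspond to any scap feature.

```python
from typing import List


def sync_order(change: str, definer: str, consumer: str) -> List[str]:
    """Return the order in which to sync two interdependent files.

    `definer` is the file that defines a setting and `consumer` is the file
    that reads it. Adding a setting needs the definition in place first;
    removing one needs the last use gone first.
    """
    if change == "add":
        return [definer, consumer]
    if change == "remove":
        return [consumer, definer]
    raise ValueError(f"unknown change type: {change!r}")


# Adding a new variable: sync the file that defines it before the file that reads it.
print(sync_order("add", "InitialiseSettings.php", "CommonSettings.php"))
# Removing a variable: sync the file that stops reading it first.
print(sync_order("remove", "InitialiseSettings.php", "CommonSettings.php"))
```

Because the safe order flips between the two cases, a fixed rule such as "always sync CommonSettings.php last" would be wrong for removals.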

T73212: Make it possible to quickly and programmatically pool and depool application servers
T104352: Make scap able to depool/repool servers via the conftool API

Reloading HHVM and/or using RepoAuthoritative mode is also an option.

Both of those options are really blocked on sane depooling at the pybal layer. It is actually possible to have scap restart hhvm on the hosts today but we don't advertise that or do it by default because the method it uses to signal pybal to depool doesn't actually work as hoped.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM
Aklapper removed a subscriber: Tfinc.