Maniphest T22085

[scap] Local sync script on any individual server should be atomic
Open, Medium · Public · Feature

Assigned To

None

Authored By

	• brion
	Aug 5 2009, 11:55 PM

Description

Currently, when we make site software updates with scap, sync-common-all, etc., the web servers are still running while the scripts work.

This has the unfortunate side effect that a portion of web requests will come in to a server whose copy of MediaWiki is only partially updated, which can cause transient but very scary-looking errors. A common type of error is where files in different directories are both changed and have a dependency on each other; especially problematic with skin files since skins may be synced out ahead of time... this can toss up big scary PHP fatal errors or exceptions.

We want the updates to be atomic, so any given request will get _either_ the old deployment version _or_ the new version, but never a mix.

There are two main ways we could implement this:

1. Shut down Apache before rsync, restart it after.

   Simple, but could make updates slower, or leave us with most machines out of service simultaneously for a minute or two.

2. rsync to a staging directory, then swap the entire thing out for the live one.

   (I'm not sure if it's possible to totally atomically swap out two directories in POSIX semantics.)

or maybe also

3. rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

   This would avoid holes in response time, but we may have a magical moving directory which could be confusing madness. :)

(Another thing to consider might be keeping the 'live' skin and extension JS/CSS files in a separate subdir, so we can update those en masse first with no code safety issues, then run the code updates -- atomic per server -- guaranteeing we'll have the new css/JS on all new hits.)

Version: unspecified
Severity: enhancement
Whiteboard: deploysprint-13

Details

Reference: bz20085

Related Objects

Status	Subtype	Assigned	Task
Open	Feature	None	T22085 [scap] Local sync script on any individual server should be atomic
Resolved		bd808	T29294 [scap] Rewrite Wikimedia's sync scripts (scap, sync-file, etc.)
Resolved		None	T104352 Make scap able to depool/repool servers via the conftool API
Resolved		Joe	T73212 Make it possible to quickly and programmatically pool and depool application servers
Resolved		None	T115899 Move scap target configuration to etcd
Resolved		Joe	T163565 Install conftool on deployment masters

Event Timeline

• bzimport raised the priority of this task from to Low. · Nov 21 2014, 10:51 PM

• bzimport added a project: Deployments.

• bzimport set Reference to bz20085.

• bzimport added a subscriber: Unknown Object (MLST).

• brion created this task. · Aug 5 2009, 11:55 PM

For any of these, we would need strict control over how many machines are being activated at any point in time, along with the interval before the next block of activations, in order to minimize brownouts and blackouts.
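The batching constraint described above could be sketched roughly as follows. This is only an illustration, not how scap behaves: `batches`, `rolling_activate`, the host names, and the `activate` callback are all hypothetical, and the batch size and interval are placeholders.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_activate(
    hosts: Sequence[str],
    batch_size: int,
    interval_s: float,
    activate: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Activate hosts one batch at a time, pausing between batches.

    `activate` is a hypothetical per-host step (for example, the local
    swap to the new code); only `batch_size` hosts are touched per round.
    """
    for index, group in enumerate(batches(hosts, batch_size)):
        if index > 0:
            sleep(interval_s)  # wait before starting the next block
        for host in group:
            activate(host)


if __name__ == "__main__":
    # Hypothetical host names; a zero interval keeps the demo instant.
    rolling_activate(["mw1", "mw2", "mw3", "mw4", "mw5"], 2, 0.0, print)
```

Passing `sleep` as a parameter keeps the pacing testable without real waiting.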

Atomic scaps are a much, much smaller issue than having failed syncs or outdated trees running in the cluster for days.

(I'm not sure if it's possible to totally atomically swap out two directories in posix semantics.)

I don't think you can do so, but you can atomically replace a symlink so that it magically points to a different directory (I am using that approach in a scap script).
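A minimal sketch of the symlink approach, assuming a POSIX system where `rename(2)` replaces the destination atomically. The function name `swap_symlink` and the `live` / `rev-*` paths are hypothetical and are not taken from the scap script mentioned above.

```python
import os
import tempfile


def swap_symlink(link_path: str, new_target: str) -> None:
    """Repoint the symlink at link_path to new_target in one rename.

    The new link is created under a temporary name next to the live one
    and then renamed over it, so link_path always resolves to either the
    old target or the new one.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # A scratch directory on the same filesystem gives a unique, unused name.
    tmp_dir = tempfile.mkdtemp(prefix=".swap-", dir=link_dir)
    tmp_link = os.path.join(tmp_dir, "link")
    try:
        os.symlink(new_target, tmp_link)
        # os.replace() calls rename(), which overwrites an existing symlink.
        os.replace(tmp_link, link_path)
    finally:
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)  # only reached if the rename failed
        os.rmdir(tmp_dir)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    old_rev = os.path.join(base, "rev-old")  # hypothetical revision dirs
    new_rev = os.path.join(base, "rev-new")
    os.mkdir(old_rev)
    os.mkdir(new_rev)
    live = os.path.join(base, "live")
    os.symlink(old_rev, live)
    swap_symlink(live, new_rev)
    print(os.readlink(live))
```

The swap only covers path lookups made after the rename. A request that already resolved some files through the old target can still see a mix unless the web server also pins each request to one resolved directory.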

or maybe also

rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

This looks even better, as long as the disks don't fill up from the copying: it is easy to change that configuration entry, and anyone doing a sync can do a graceful restart (no root-only errors).

This would avoid holes in response time, but we may have a magical moving directory which could be confusing madness. :)

The directory could be named by the revision, so it looks logical.

Today's mobile deployment removed SkinMobileBase.php from MobileFrontend, a file that was previously loaded on every request. This provided an illustration of the problem and gives some sense of its magnitude.

MaxSem started scap at 21:16. Between 21:18:39 and 21:26:24 we had a total of 65 fatals caused by SkinMobileBase.php having been deleted prior to the calling code receiving the update:

[25-Jun-2013 23:26:18] Fatal error: require() [<a href='function.require'>function.require</a>]: Failed opening required '/usr/local/apache/common-local/php-1.22wmf7/extensions/MobileFrontend/includes/skins/SkinMobileBase.php' (include_path='/usr/local/apache/common-local/php-1.22wmf7/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-1.22wmf7:/usr/local/lib/php:/usr/share/php') at /usr/local/apache/common-local/php-1.22wmf7/includes/AutoLoader.php on line 1155

Of mw* hosts in the mediawiki-installation group, 172 had no errors, 30 had one fatal, 16 had 2 fatals, and one host had three fatals.
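As a quick consistency check, the per-host breakdown above can be summed to confirm that it accounts for the 65 fatals and to see what share of hosts was affected. The variable names are illustrative; the figures are the ones quoted in this comment.

```python
# Hosts grouped by how many fatals each one logged during the scap.
hosts_by_fatal_count = {0: 172, 1: 30, 2: 16, 3: 1}

total_hosts = sum(hosts_by_fatal_count.values())
total_fatals = sum(count * hosts for count, hosts in hosts_by_fatal_count.items())
affected_hosts = total_hosts - hosts_by_fatal_count[0]

print(f"hosts: {total_hosts}, fatals: {total_fatals}")
print(f"{affected_hosts} hosts ({affected_hosts / total_hosts:.1%}) logged at least one fatal")
```

The breakdown sums to 65 fatals across 219 hosts, so roughly one host in five served at least one failed request during the sync window.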

Using rsync's --delete-after or --delete-delay option would not make scap atomic, but it could still significantly reduce the rate at which these kinds of errors occur.

(In reply to Brion Vibber from comment #0)

There's two main ways we could implement this:

Shut down Apache before rsync, restart it after.

rsync to a staging directory, then swap the entire thing out for the live one.

or maybe also

rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

The simple fix for this that is currently in use is passing the --delay-updates option to rsync (when called from scap). This makes it so the rsync to any given Apache server copies the files to a tmp dir and then switches them over as the last step (basically, brion's suggestion #2).
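For illustration, an rsync invocation combining this option with the delayed deletion mentioned earlier in the thread might be assembled as below. This is a sketch, not the command scap runs: `build_rsync_command` and the `deploy.example::common/` source are hypothetical, and the destination is borrowed from the path in the log excerpt above.

```python
import shlex
from typing import List


def build_rsync_command(source: str, dest: str) -> List[str]:
    """Return an rsync argv that defers renames and deletions to the end."""
    return [
        "rsync",
        "--archive",
        # Hold updated files in a temporary directory and rename them
        # into place at the end of the transfer.
        "--delay-updates",
        # Record deletions during the transfer, apply them afterwards.
        "--delete-delay",
        source,
        dest,
    ]


if __name__ == "__main__":
    # Hypothetical source module; the command is printed, not executed.
    command = build_rsync_command(
        "deploy.example::common/",
        "/usr/local/apache/common-local/",
    )
    print(shlex.join(command))
```

With `--delay-updates`, rsync still renames the staged files one after another, so this narrows the window of inconsistency but does not eliminate it.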

There are also ideas about cluster-wide atomicity, but I'm calling that out of scope for this bug :).

Is --delay-updates good enough for this bug for now?

One thing --delay-updates doesn't fix is that the l10n cache isn't rebuilt on each node until after all nodes get the new code. This leads to errors like those seen yesterday.

This bug not yet being fixed caused bug 63791 today, the fifth or sixth VE breakage from this that I recall. :-( It'd be really great if we could get it fixed some time soon.

Greg: Should this have higher priority and an assignee set?

It requires re-writing the deployment system and the requirement is on the list of known issues for that work.

greg raised the priority of this task from Low to Medium. · Jan 8 2015, 6:05 PM

greg moved this task from To Triage to Backlog (Tech) on the Deployments board.

In T22085#253069, @Jdforrester-WMF wrote:

This bug not yet being fixed caused T65791 today, the fifth or sixth VE breakage from this that I recall. :-( It'd be really great if we could get it fixed some time soon.

Actually, that was caused by T47877 (which is about some requests to newer-version servers interacting badly with older-version servers whilst in the middle of a scap). This task (T22085) is about a single request causing errors because its own file system is in the middle of an rsync that replaces files in /srv/mediawiki, causing e.g. a class to be missing or a variable to be undefined.

For example, deployment of 646b1441d0ef83c caused a brief flood of the following in the logs:

Notice: Undefined variable: wmgVisualEditorConsolidateFeedback in /srv/mediawiki/wmf-config/CommonSettings.php on line 2033

Because for a short time an individual server had the newer version of CommonSettings.php but not yet the updated InitialiseSettings.php.

Restricted Application added a subscriber: Aklapper. · Dec 10 2015, 10:40 PM

Krinkle renamed this task from [scap] [l10n] Atomic updates for sync scripts to [scap] Local sync script on any individual server should be atomic. · Dec 10 2015, 10:41 PM

Krinkle set Security to None.

Krinkle removed a subscriber: • wikibugs-l-list.

In T22085#1871011, @Krinkle wrote:

Because for a short time an individual server had the newer version of CommonSettings.php but not yet the updated InitialiseSettings.php.

I hate that error, but I've never been able to bring myself to complicate the rsync commands used by a full scap to exclude CommonSettings.php and then sync it later. T73212: Make it possible to quickly and programmatically pool and depool application servers/T104352: Make scap able to depool/repool servers via the conftool API will really be the right fix for this class of problem.

bd808 added subtasks: T73212: Make it possible to quickly and programmatically pool and depool application servers, T104352: Make scap able to depool/repool servers via the conftool API. · Dec 10 2015, 11:25 PM

In T22085#1871562, @bd808 wrote:

In T22085#1871011, @Krinkle wrote:

Because for a short time an individual server had the newer version of CommonSettings.php but not yet the updated InitialiseSettings.php.

I hate that error, but I've never been able to bring myself to complicate the rsync commands used by a full scap to exclude CommonSettings.php and then sync it later.

Agreed, and that would actually cause problems since the dependency isn't always in the same direction. It's the responsibility of the deployer to sync files separately.

T73212: Make it possible to quickly and programmatically pool and depool application servers
T104352: Make scap able to depool/repool servers via the conftool API

Reloading HHVM and/or using RepoAuthoritative mode is also an option.

In T22085#1871600, @Krinkle wrote:

T73212: Make it possible to quickly and programmatically pool and depool application servers
T104352: Make scap able to depool/repool servers via the conftool API

Reloading HHVM and/or using RepoAuthoritative mode is also an option.

Both of those options are really blocked on sane depooling at the pybal layer. It is actually possible to have scap restart hhvm on the hosts today but we don't advertise that or do it by default because the method it uses to signal pybal to depool doesn't actually work as hoped.

Krenair subscribed. · Dec 10 2015, 11:46 PM

Jdforrester-WMF awarded a token. · Jan 5 2016, 12:21 AM

Joe closed subtask T73212: Make it possible to quickly and programmatically pool and depool application servers as Resolved. · Feb 2 2016, 9:51 AM

greg edited projects, added scap2; removed Deployments. · Feb 9 2016, 11:34 PM

Meno25 unsubscribed. · Feb 19 2016, 5:51 PM

• mmodell edited projects, added Scap; removed scap2. · Feb 10 2017, 6:22 PM

Krinkle mentioned this in T157210: Gadget dependencies sometimes don't update. · Apr 28 2017, 10:09 PM

• mmodell moved this task from Needs triage to Debt on the Scap board. · Feb 1 2018, 12:19 AM

Krinkle mentioned this in T233769: Scap should delete files after other updates. · Sep 25 2019, 1:30 AM

Krinkle added a project: Release-Engineering-Team. · Sep 25 2019, 1:38 AM

greg edited projects, added Release-Engineering-Team (Deployment services); removed Release-Engineering-Team. · Feb 11 2020, 4:37 PM

thcipriani edited projects, added Release-Engineering-Team (thcipriani-workboard-fiddling); removed Release-Engineering-Team (Deployment services). · Apr 20 2021, 12:56 AM