App servers get into bad states when coming back online/are newly provisioned due to puppet/salt craziness
Closed, Declined
Public

Details

Reference
bz66050

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:25 AM
bzimport added a project: Deployments.
bzimport set Reference to bz66050.
bzimport added a subscriber: Unknown Object (MLST).

The puppet configuration that attempts to ensure that each apache server has the latest version of the mediawiki code and configuration is in ::mediawiki::sync. Specifically, Exec['mw-sync'] and Exec['mw-sync-rebuild-cdbs'] combine to perform the end-host scap steps of syncing with the state of the rsync server on tin. The Exec['mw-sync'] definition is marked as refreshonly => true, which means it will only be applied if something else explicitly asks for it to run.

The explicit ask comes from ::mediawiki::web, where Exec['apache-trigger-mw-sync'] is defined. This exec checks whether any apache2 processes are running. If none are found, it notifies Exec['mw-sync']. The Service['apache'] resource subscribes to Exec['mw-sync'] so that apache is started only after Exec['mw-sync'] has completed.
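For readers who don't have the manifests open, the trigger chain described above looks roughly like the sketch below. This is not the actual code: the resource titles come from this task, but the commands, the pgrep check and the service name are my assumptions, and the two classes are flattened into one snippet.

```puppet
# Sketch only. Titles are from this task; commands, the pgrep check and
# the 'apache2' service name are assumptions made for illustration.

# From ::mediawiki::sync: does nothing unless another resource notifies it.
exec { 'mw-sync':
  command     => '/usr/local/bin/sync-common',
  refreshonly => true,
  require     => File['/usr/local/bin/sync-common'],
}

# From ::mediawiki::web: runs only when no apache2 process exists, and a
# successful run sends a refresh event to Exec['mw-sync'].
exec { 'apache-trigger-mw-sync':
  command => '/bin/true',
  unless  => 'pgrep apache2',   # assumed check; exit 0 means apache is up
  path    => ['/bin', '/usr/bin'],
  notify  => Exec['mw-sync'],
}

# subscribe orders the service after the exec and refreshes it on a sync.
service { 'apache':
  ensure    => running,
  name      => 'apache2',       # assumed service name
  subscribe => Exec['mw-sync'],
}
```

If apache2 is already running, the unless check succeeds, the trigger exec is skipped, and no sync is requested on that run.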

There is at least one possible race condition in ::mediawiki::sync. Exec['mw-sync'] requires File['/usr/local/bin/sync-common'], but File['/usr/local/bin/sync-common'] is only a symlink to /srv/deployment/scap/scap/bin/sync-common, and that file is realized by Deployment::Target['scap'] (i.e. Trebuchet). The symlink can therefore exist while its target does not. There is no require to ensure that Trebuchet has deployed/updated sync-common before mw-sync invokes it.
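One possible way to close that gap, as an untested sketch: add the deployment target to the exec's require list. The symlink path and target are from this task; the command line and the exact shape of the resources are assumptions.

```puppet
# Sketch only: the added require on Deployment::Target['scap'] is the
# proposed change. The command is an assumption.
file { '/usr/local/bin/sync-common':
  ensure => link,
  target => '/srv/deployment/scap/scap/bin/sync-common',
}

exec { 'mw-sync':
  command     => '/usr/local/bin/sync-common',
  refreshonly => true,
  require     => [
    File['/usr/local/bin/sync-common'],
    Deployment::Target['scap'],   # wait for the Trebuchet target first
  ],
}
```

This only orders mw-sync after the resources that puppet manages inside the deployment target. If the actual checkout happens asynchronously through salt, the require alone may not be enough.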

It would probably be good to change the Service['apache'] subscribe to Exec['mw-sync-rebuild-cdbs'] so that Apache isn't started until after the l10n cache is present.
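A sketch of what that change might look like. The service name is an assumption, and this presumes Exec['mw-sync-rebuild-cdbs'] is itself ordered after Exec['mw-sync'].

```puppet
# Sketch only: subscribe to the last step of the sync instead of the first,
# so apache is not started before the l10n cache has been rebuilt.
service { 'apache':
  ensure    => running,
  name      => 'apache2',   # assumed service name
  subscribe => Exec['mw-sync-rebuild-cdbs'],
}
```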

I think there is an additional point of weakness here in the design of ::deployment::target. The creation of the salt grain notifies several execs. If the call to create the salt grain succeeds on the salt master but fails to notify the host applying via puppet due to a reporting timeout, these initial execs may never be called. This breaks puppet's notion of idempotent eventual consistency.
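To illustrate what I mean by eventual consistency: an exec guarded by a check on the end state is re-evaluated on every run, so a lost notification is repaired on the next run. The resource title, command and the checked path below are all hypothetical, not what ::deployment::target contains today.

```puppet
# Hypothetical sketch: title, command and the 'creates' path are invented
# for illustration and do not exist in the current manifests.
exec { 'deployment-initial-fetch-scap':
  command => '/usr/local/bin/hypothetical-initial-fetch scap',
  # Evaluated on every puppet run: the command runs whenever this path is
  # missing, whether or not the salt grain resource reported a change.
  creates => '/srv/deployment/scap/scap/.git',
}
```

A refreshonly exec, by contrast, gets exactly one chance: the run in which the notifying resource reports a change.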

Not sure this still applies.

I hit another race condition involving the salt grain deployment_server: it is updated on each puppet run and therefore notifies a full code sync on each puppet run (T146914). More or less related to this task.

MoritzMuehlenhoff subscribed.

Salt is being removed.