Page MenuHomePhabricator

[l10n] l10nupdate process should respect the scap lock file
Closed, ResolvedPublic

Description

Full scap log

This scap had a lot of errors (and does not seem to have had one of its desired effects, fixing the messages at https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout), but I think "file has vanished" may be one of the root ones.

Partial log (full log attached):


02:53:30 ['sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1064 returned [70]: 02:52:25 Copying to mw1064.eqiad.wmnet from mw1010.eqiad.wmnet
02:52:25 Started rsync common
file has vanished: "/php-1.24wmf21/cache/l10n/.l10n_cache-ak.cdb.FyWmNi" (in common)
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1536) [generator=3.0.9]
02:53:29 Finished rsync common (duration: 01m 04s)
02:53:29 Unhandled error:
Traceback (most recent call last):

File "/srv/deployment/scap/scap/scap/cli.py", line 262, in run                                                                                     
  exit_status = app.main(extra_args)                                                                                                               
File "/srv/deployment/scap/scap/scap/main.py", line 282, in main                                                                                   
  verbose=self.verbose                                                                                                                             
File "/srv/deployment/scap/scap/scap/tasks.py", line 275, in sync_common                                                                           
  subprocess.check_call(rsync)                                                                                                                     
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call                                                                                   
  raise CalledProcessError(retcode, cmd)

CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=/.svn/lock', '--exclude=/.git/objects', '--exclude=/.git//objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24

02:53:29 sync-common failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=/.svn/lock', '--exclude=/.git/objects', '--exclude=/.git//objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 24

This this repeats for various languages.


Version: wmf-deployment
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=70446

Attached:

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:55 AM
bzimport added a project: Deployments.
bzimport set Reference to bz70752.

(In reply to Matthew Flaschen from comment #0)

This scap had a lot of errors (and does not seem to have had one of its
desired effects, fixing the messages at
https://bits.wikimedia.org/en.wikipedia.org/load.
php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout), but I
think "file has vanished" may be one of the root ones.

I still don't really know what any of the underlying problems were, or what fixed what, but if you follow that link you can see the RL message blob is now correct (confirmed on GUI).

Was this the host where Roan killed the ssh session from tin or another host? In general this looks like rsync experiencing some sort of major on the receiving client side. With the horrible sync times and extreme load that the mw1010 server was under during this scap I'm actually not surprised to see a few errors like this occur.

I'm not sure there is much that could be done to adjust for this in the scap program itself. We could dial back the number of parallel processes we run or try to add some sort of load check in the rsync server selection process. The settings we are running now however have remained the same for many months and we have only seen these really bad sync times and high load on the rsync servers in the last week or two. Many (all?) of the rsync servers are job runner nodes. I'm sort of wondering if something has changed in the behavior or job load that is clashing with scap.

I'd really like to move from rsync to a git fetch or other method of synchronization that uses precomputed diffs and lessens the computational cost on both the client and server.

(In reply to Greg Grossmeier from comment #3)

Looks like an l10nupdate and scap at the same time is too much for tin.

Oh man. They would be fighting over the content of the l10n cdb files and l10nupdate doesn't do the nicest things possible to sync the blobs with the cluster. The l10nupdate process should probably respect the scap lock file. As Sam pointed out last week l10nupdate should be changed to actually use scap/sync-common to update the cluster and use the json representation to transmit deltas to the mw servers rather than direct cdb syncs.

(In reply to Bryan Davis from comment #4)

As Sam pointed out last week l10nupdate should be changed to actually use
scap/sync-common to update the cluster and use the json representation to
transmit deltas to the mw servers rather than direct cdb syncs.

https://bugzilla.wikimedia.org/show_bug.cgi?id=70446

Reedy: Can you take a look at this in relation to the "Are Scap and LocalisationUpdate actually fighting over l10nupdates?" issue?

(In reply to Bryan Davis from comment #2)

Was this the host where Roan killed the ssh session from tin or another
host?

This was actually a separate scap. I did two scaps:

[Thursday, September 11, 2014] [7:24:53 PM] <RoanKattouw> bd808: OK so should I just kill the ssh mw1063 process on tin so the scap will move on with 1 failure?
[Thursday, September 11, 2014] [7:24:54 PM] <bd808> AaronS: So this may be an issue in general? Or we need to not use job runners as rsync slaves anymore?
[Thursday, September 11, 2014] [7:25:01 PM] Nick James_F|Away is now known as James_F.
[Thursday, September 11, 2014] [7:25:01 PM] <bd808> RoanKattouw: Sure
[Thursday, September 11, 2014] [7:25:04 PM] <RoanKattouw> OK
[Thursday, September 11, 2014] [7:25:48 PM] <RoanKattouw> Done

...

[Thursday, September 11, 2014] [7:26:50 PM] <superm401> 23:25:11 Finished sync-apaches (duration: 87m 15s)

...

[Thursday, September 11, 2014] [7:29:49 PM] <logmsgbot> !log mattflaschen Finished scap: Deploy new GettingStarted recommendations A/B test (duration: 99m 34s)

...

[Thursday, September 11, 2014] [11:08:42 PM] <logmsgbot> !log mattflaschen Finished scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test (duration: 24m 38s)

This bug in regards to the second one (though the first one had its own issues).

greg removed Reedy as the assignee of this task.Mar 27 2015, 7:24 PM
greg set Security to None.

This may be fixed now that T72443 is resolved. @bd808 can confirm?

This may be fixed now that T72443 is resolved. @bd808 can confirm?

The sync itself respects the lock now, but the part of the process that actually modifies the cdb and json files (https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/scap/files/l10nupdate-1;3eabdda0846611e26acb88accdc939f2e69b1a3f$104) does not. This could still lead to files changing mid-scap.

Change 303923 had a related patch set uploaded (by BryanDavis):
l10nupdate: aquire scap lock before changing files

https://gerrit.wikimedia.org/r/303923

Change 303923 merged by Filippo Giunchedi:
l10nupdate: acquire scap lock before changing files

https://gerrit.wikimedia.org/r/303923

Completed at 2017-01-05 02:17:31+00:00. Copying LC files to /srv/mediawiki-staging
02:18:50 Updated 396 JSON file(s) in /srv/mediawiki-staging/php-1.29.0-wmf.6/cache/l10n
^GSyncing to Apaches at 2017-01-05 02:18:50+00:00
02:18:50 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 261, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 609, in main
    return super(SyncL10n, self).main(*extra_args)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 38, in main
    with utils.lock(self.config['lock_file'], self.arguments.message):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 419, in lock
    raise LockFailedError(details)
LockFailedError: Failed to acquire lock "/var/lock/scap"; owner is "l10nupdate"; reason is "no excuse given"
02:18:50 sync-l10n failed: <LockFailedError> Failed to acquire lock "/var/lock/scap"; owner is "l10nupdate"; reason is "no excuse given"

The ownership of the lock file is now:

tin:~
bd808$ ls -al /var/lock/scap
-rw-rw-r-- 1 l10nupdate l10nupdate 0 Jan  5 02:17 /var/lock/scap

The lock file should be rm'd after being released by the subshell. This matches the behavior of scap itself.

Change 330636 had a related patch set uploaded (by BryanDavis):
l10nupdate: clean up scap lock file after release

https://gerrit.wikimedia.org/r/330636

Change 330636 merged by Filippo Giunchedi:
l10nupdate: clean up scap lock file after release

https://gerrit.wikimedia.org/r/330636

bd808 moved this task from To Do to Done on the User-bd808 board.