
Unable to mount /public directories on queue nodes
Closed, Resolved (Public)

Description

On tools-login, I can see this file: /public/datasets/public/pdcwiki/20131115/pdcwiki-20131115-pages-articles.xml.bz2

However, the queue nodes cannot see the file. I get "Can't open input file /data/project/checkwiki/dumps/pdcwiki-20131115-pages-articles.xml.bz2: No such file or directory"

If I copy the file to /data/project/checkwiki, the queue can see the file and run normally.

I think the problem started around 0:00z. Programs started at 0:00z and 0:01z ran just fine, but programs started at 0:03z died, and every program started since has also died.

Bryan


Version: unspecified
Severity: normal

Details

Reference
bz57479

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:39 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz57479.

I'm not sure I understand your bug report: the error message you mention does not match the file you expected (/public/datasets/... vs /data/project). It looks like your tool isn't trying to read the file from where you expect it to?

It would have been more helpful if I hadn't quoted the result of a test run.

Several queue machines cannot see /public/datasets/public/dumps/*
Note: This doesn't affect all queue machines.
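Since only some queue machines are affected, a quick per-node check can narrow down which ones. The following is a minimal sketch, not the check actually used in this ticket: the function names `dir_is_populated` and `report_dir` are made up for illustration, and the host names in the usage comment are examples only.

```shell
#!/bin/sh
# dir_is_populated DIR
# Succeeds if DIR exists and contains at least one entry (hypothetical helper).
dir_is_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# report_dir DIR
# Prints a one-line status for DIR on the local host.
report_dir() {
    if dir_is_populated "$1"; then
        echo "$1: populated"
    else
        echo "$1: empty or missing"
    fi
}

# Possible usage from the login host (host names are examples only):
#   for host in tools-exec-01 tools-exec-02; do
#       printf '%s ' "$host"
#       ssh "$host" 'ls -A /public | grep -q . && echo ok || echo EMPTY'
#   done
```

An empty listing is only a hint: an automount root that has not been accessed yet may legitimately show nothing, so opening a known file underneath it is the more reliable test.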

I meant to say "Can't open input file /public/datasets/public/dumps/pdcwiki-20131115-pages-articles.xml.bz2: No such file or directory"

I tested things out on the 24th and still got "Can't open input file /public/datasets/public/gdwiki/20131123/gdwiki-20131123-pages-articles.xml.bz2: No such file or directory."

I just tested things again and I still get the same error.

On tools-exec-01, /public is empty and /home looks rather sparse:

scfc@tools-login:~$ ssh tools-exec-01 ls -l /public
total 0
scfc@tools-login:~$ ssh tools-exec-01 ls -l /home
total 20
drwx------ 3 dapete wikidev 4096 Nov 24 17:43 dapete
drwxr-xr-x 2 gmetric gmetric 4096 Feb 27 2013 gmetric
drwx------ 4 marc wikidev 4096 Nov 22 17:09 marc
drwx------ 3 scfc wikidev 4096 Nov 26 01:16 scfc
drwxr-xr-x 3 ubuntu ubuntu 4096 Feb 27 2013 ubuntu
scfc@tools-login:~$

Also, on tools-exec-01 (at least) my home directory is not my "real" one:

scfc@tools-exec-01:~$ ll /home/scfc
total 32
drwx------ 3 scfc wikidev 4096 Nov 26 01:16 ./
drwxr-xr-x 7 root root 4096 Nov 25 22:50 ../
-rw------- 1 scfc wikidev 242 Nov 26 01:16 .bash_history
-rw------- 1 scfc wikidev 220 Nov 25 22:50 .bash_logout
-rw------- 1 scfc wikidev 3387 Nov 25 22:50 .bashrc
drwx------ 2 scfc wikidev 4096 Nov 25 22:50 .cache/
-rw------- 1 root root 43 Nov 26 01:12 .lesshst
-rw------- 1 scfc wikidev 675 Nov 25 22:50 .profile
scfc@tools-exec-01:~$

I don't see any differences in /etc/auto* compared to tools-exec-02, but it looks like the automounts for /home and /public aren't working (/data/project is mounted fine).
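To tell a working automount from one that only has its autofs placeholder in place, the kernel's mount table can be inspected directly. This is a rough sketch under assumptions: `has_real_mount` is a hypothetical helper, and it assumes the usual `/proc/mounts` layout (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# has_real_mount PATH [MOUNTS_FILE]
# Succeeds if MOUNTS_FILE (default: /proc/mounts) lists a filesystem other
# than the autofs placeholder mounted exactly at PATH.
has_real_mount() {
    awk -v path="$1" '
        $2 == path && $3 != "autofs" { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# Possible usage on a queue node:
#   for p in /public /data/project; do
#       has_real_mount "$p" && echo "$p mounted" || echo "$p NOT mounted"
#   done
```

The path must be the exact mount point. If a directory such as /home is served by an indirect map, the real mount sits on the subdirectory (for example /home/scfc), so that is the path to pass.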

Yeah, I just checked and it's definitely broken.

Annoyingly, autofs is really bad at restarting if any mounts are active, so there is little to do but drain the node from all jobs and wait for it to be idle before forcibly restarting it.

I'm going to remove it from the queue allocation now and let it drain. It'll take a while before every job goes away (I don't want to disrupt running tools), but it won't be assigned new jobs in the meantime, so nothing will hit the broken /public.
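The drain-then-restart approach amounts to: stop new jobs from landing on the node, poll until no jobs remain, then restart. The waiting step could be sketched as below. Everything here is an assumption rather than what was actually run: `wait_until_idle` is a hypothetical helper, and the command that counts remaining jobs is supplied by the caller because the ticket does not say how jobs were counted.

```shell
#!/bin/sh
# wait_until_idle COUNT_CMD [INTERVAL_SECONDS] [MAX_POLLS]
# Polls COUNT_CMD, which must print the number of jobs still running on the
# node. Returns 0 as soon as it prints 0, or 1 after MAX_POLLS attempts.
wait_until_idle() {
    count_cmd=$1
    interval=${2:-60}
    max_polls=${3:-1440}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        remaining=$($count_cmd)
        if [ "$remaining" = 0 ]; then
            return 0
        fi
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 1
}

# Possible usage (count_jobs_on_node is a placeholder you would write for
# your scheduler; on a Grid Engine style cluster, disabling the node's queue
# instances beforehand is commonly done with `qmod -d`, which is an
# assumption about this setup):
#   wait_until_idle count_jobs_on_node 300 && echo "node is idle, safe to restart"
```

Bounding the number of polls keeps the wait from hanging forever if a long-running job never exits; at that point someone has to decide whether to disrupt it.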

tools-exec-01 was restarted in early December, and /home and /public now seem to be properly mounted.