
WMFLabs: Auto-creation of home directories broken (new members and instances unable to login)
Closed, Resolved (Public)

Description

As of late, home directory creation for users is broken.

I got errors in two scenarios:

  • Added user 'rxy' as a member of the 'cvn' group.
  • When he connects to a pre-existing pmtpa instance that I (krinkle) can log in to fine, he gets:

$ ssh cvn-app2.pmtpa.wmflabs
(..)
Creating directory '/home/rxy'.
Unable to create and initialize directory '/home/rxy'.

  • Created a new instance in eqiad.
  • Connecting to it 15 minutes after its creation, I still get:

$ ssh cvn-app3.eqiad.wmflabs
(..)
Creating directory '/home/krinkle'.
Unable to create and initialize directory '/home/krinkle'.

I've updated my .ssh/config with the most recent version of the example on https://wikitech.wikimedia.org/wiki/Help:Access#ProxyCommand, but that only made it worse (the main difference is using bastion-eqiad instead of bastion2.pmtpa).
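
Roughly, the ProxyCommand setup on that page amounts to a stanza like the following (a sketch only; the exact bastion hostname and options here are assumptions, the wikitech page is authoritative):

# sketch: hostname and options are assumptions, see the wikitech page
Host *.eqiad.wmflabs
    User krinkle
    ProxyCommand ssh -a -W %h:%p bastion-eqiad.wmflabs.org

With -W, the ssh run as the ProxyCommand asks the bastion to open a TCP connection to the instance's port 22 and forward stdio over it, which is why a failure shows up as the 'channel 0: open failed: connect failed' error below.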

With that config I can't even connect to it:

$ ssh cvn-app3.eqiad.wmflabs
channel 0: open failed: connect failed: Connection timed out
ssh_exchange_identification: Connection closed by remote host

Trying manually:

$ ssh -A bastion1.eqiad.wmflabs
krinkle at bastion1.eqiad.wmflabs in ~
$ ping cvn-app3
PING cvn-app3.eqiad.wmflabs (10.68.16.170) 56(84) bytes of data.
64 bytes from cvn-app3.eqiad.wmflabs (10.68.16.170): icmp_req=1 ttl=64 time=2.26 ms
64 bytes from cvn-app3.eqiad.wmflabs (10.68.16.170): icmp_req=2 ttl=64 time=0.71 ms
$ ssh cvn-app3
ssh: connect to host cvn-app3 port 22: Connection timed out
$ ssh cvn-app3.eqiad.wmflabs
ssh: connect to host cvn-app3.eqiad.wmflabs port 22: Connection timed out
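
For what it's worth, the same thing can be checked with a bare TCP probe from the bastion, to rule out anything ssh-specific (a sketch; assumes netcat is installed there):

$ # probe TCP port 22 on the instance from the bastion (no data sent, 5 s timeout)
$ nc -zv -w 5 cvn-app3 22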


Version: unspecified
Severity: major

Details

Reference
bz62771

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 2:58 AM)
bzimport set Reference to bz62771.

The pmtpa issue is an unrelated and soon-to-be-moot gluster failure.

The eqiad issue I've seen before, but don't know how to fix (other than by waiting and rebooting). Perhaps Coren will have time to debug this sometime soon...

Confirmed; I had the exact same issue today with two newly created eqiad instances. The problem disappeared after rebooting the second one a second time (or so).

The nature of the problem is known (the instance attempts to mount /home and /data/project before the NFS server has updated its ACLs for it, then caches the negative result for some time), but a proper fix hasn't been found yet.

I have some ideas on how to prevent this from happening that I will be trying today.

In the meantime, doing a reboot at least 10 minutes after the issue occurs, then waiting at least another 20 minutes, seems to be sufficient to let the ACL time out.
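
If it's unclear whether the reboot/wait actually helped, one way to check from a working session on the instance (a sketch; the paths are the ones named above) is whether /home and /data/project came up as NFS mounts and are writable:

$ # check that /home and /data/project are actually mounted (and from where)
$ mount | grep -E ' /(home|data/project) '
$ # quick write test; fails if the share came up read-only
$ touch /data/project/.rw-test && rm /data/project/.rw-test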

(In reply to Marc A. Pelletier from comment #3)

> The nature of the problem is known (the instance attempts to mount /home and /data/project before the NFS server has updated its ACLs for it, then caches the negative result for some time), but a proper fix hasn't been found yet.
>
> I have some ideas on how to prevent this from happening that I will be trying today.
>
> In the meantime, doing a reboot at least 10 minutes after the issue occurs, then waiting at least another 20 minutes, seems to be sufficient to let the ACL time out.

I rebooted cvn-app3 shortly after I created it and it still wasn't working; that's when I reported this bug.

I rebooted it again yesterday, and once more just now. I'm still getting:

krinkle at KrinkleMac in ~
$ ssh cvn-app3.eqiad.wmflabs
channel 0: open failed: connect failed: Connection timed out
ssh_exchange_identification: Connection closed by remote host

Could be unrelated, but it also hasn't shown any signs of life in Ganglia since its creation (including during creation):
http://ganglia.wmflabs.org/latest/?c=cvn&h=cvn-app3

The race condition has been prevented for new images (that is, attempts to mount a filesystem before it has been made available read-write will now fail rather than mount read-only); subsequent puppet runs will try again.

This will prevent the fundamental issue (and the annoying caching that makes it hard to go away), but not for existing instances, which will still require some manual intervention.
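
Conceptually, the check added for new images amounts to something like the following (an illustrative sketch only, not the actual puppet code; the server and export names are placeholders):

if mount -t nfs nfs-server.example:/exports/project/home /home; then
    # probe that the share is actually writable; if it came up read-only,
    # unmount it again so a later puppet run can retry once the ACLs are in place
    if ! ( touch /home/.rw-probe && rm -f /home/.rw-probe ) 2>/dev/null; then
        umount /home
    fi
fi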