WMFLabs: New instances with precise image are broken (puppet run fails, no ssh access possible)
Closed, Declined (Public)

Description

Creating a new instance with the precise image fails and leaves the instance inaccessible over ssh.

I wanted to create an additional integration-slave running Precise to scale out our Jenkins pool, but it failed to provision properly.

https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=integration&instanceid=b65a604d-40ef-4b16-b527-bfb862ca3904&region=eqiad

Oct 7 08:54:09 integration-slave1004 puppet-agent[981]: Enabling Puppet.
Oct 7 08:54:09 integration-slave1004 puppet-agent[773]: Could not request certificate: getaddrinfo: Name or service not known
Oct 7 08:54:10 integration-slave1004 puppet-agent[932]: Could not request certificate: getaddrinfo: Name or service not known
Oct 7 08:55:11 integration-slave1004 nslcd[901]: [b0dc51] <group/member="root"> ldap_start_tls_s() failed: Can't contact LDAP server: Connection timed out (uri="ldap://virt0.wikimedia.org:389")
Oct 7 08:55:11 integration-slave1004 nslcd[901]: [b0dc51] <group/member="root"> failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection refused
Oct 7 08:55:11 integration-slave1004 nslcd[901]: [334873] <group/member="root"> ldap_start_tls_s() failed: Can't contact LDAP server: Connection timed out (uri="ldap://virt0.wikimedia.org:389")
Oct 7 08:55:11 integration-slave1004 nslcd[901]: [334873] <group/member="root"> failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out
Oct 7 08:55:12 integration-slave1004 nslcd[901]: [b0dc51] <group/member="root"> connected to LDAP server ldap://virt1000.wikimedia.org:389
Oct 7 08:55:12 integration-slave1004 nslcd[901]: [b0dc51] <group/member="root"> ldap_result() failed: No such object
Oct 7 08:55:12 integration-slave1004 nslcd[901]: [b0dc51] <group/member="root"> ldap_result() failed: No such object
Oct 7 08:55:13 integration-slave1004 nslcd[901]: [334873] <group/member="root"> connected to LDAP server ldap://virt1000.wikimedia.org:389
Oct 7 08:55:13 integration-slave1004 nslcd[901]: [334873] <group/member="root"> ldap_result() failed: No such object
Oct 7 08:55:13 integration-slave1004 nslcd[901]: [334873] <group/member="root"> ldap_result() failed: No such object
..
Oct 7 08:55:19 integration-slave1004 puppet-agent[1218]: Creating a new SSL key for i-00000670.eqiad.wmflabs
..
Oct 7 08:55:28 integration-slave1004 nslcd[1059]: [3c9869] <group(all)> ldap_start_tls_s() failed: Can't contact LDAP server: Connection timed out (uri="ldap://virt0.wikimedia.org:389")
Oct 7 08:55:28 integration-slave1004 nslcd[1059]: [3c9869] <group(all)> failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [3c9869] <group(all)> connected to LDAP server ldap://virt1000.wikimedia.org:389
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [3c9869] <group(all)> ldap_result() failed: No such object
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [7b23c6] <group/member="puppet"> ldap_start_tls_s() failed: Can't contact LDAP server: Connection timed out (uri="ldap://virt0.wikimedia.org:389")
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [7b23c6] <group/member="puppet"> failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [7b23c6] <group/member="puppet"> connected to LDAP server ldap://virt1000.wikimedia.org:389
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [7b23c6] <group/member="puppet"> ldap_result() failed: No such object
Oct 7 08:55:29 integration-slave1004 nslcd[1059]: [7b23c6] <group/member="puppet"> ldap_result() failed: No such object
Oct 7 08:55:30 integration-slave1004 nslcd[1059]: [334873] <group/member="puppet"> ldap_start_tls_s() failed: Can't contact LDAP server: Connection timed out (uri="ldap://virt0.wikimedia.org:389")
Oct 7 08:55:30 integration-slave1004 nslcd[1059]: [334873] <group/member="puppet"> failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out
Oct 7 08:55:30 integration-slave1004 nslcd[1059]: [334873] <group/member="puppet"> connected to LDAP server ldap://virt1000.wikimedia.org:389
Oct 7 08:55:30 integration-slave1004 nslcd[1059]: [334873] <group/member="puppet"> ldap_result() failed: No such object
Oct 7 08:55:30 integration-slave1004 nslcd[1059]: [334873] <group/member="puppet"> ldap_result() failed: No such object
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [b0dc51] <group/member="puppet"> ldap_start_tls_s() failed: Can't contact LDAP server: Connection timed out (uri="ldap://virt0.wikimedia.org:389")
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [b0dc51] <group/member="puppet"> failed to bind to LDAP server ldap://virt0.wikimedia.org:389: Can't contact LDAP server: Connection timed out
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [b0dc51] <group/member="puppet"> connected to LDAP server ldap://virt1000.wikimedia.org:389
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [b0dc51] <group/member="puppet"> ldap_result() failed: No such object
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [b0dc51] <group/member="puppet"> ldap_result() failed: No such object
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [e8944a] <group/member="root"> ldap_result() failed: No such object
Oct 7 08:55:33 integration-slave1004 nslcd[1059]: [e8944a] <group/member="root"> ldap_result() failed: No such object
..
Oct 7 09:18:48 integration-slave1004 puppet-agent[932]: Could not request certificate: getaddrinfo: Temporary failure in name resolution


Version: unspecified
Severity: normal

Details

Reference
bz71741

Event Timeline

bzimport set the priority of this task to Needs Triage. (Nov 22 2014, 3:51 AM)
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz71741.

I suspect the labs image for Ubuntu Precise hasn't been updated to take into account the recent LDAP changes (phasing out pmtpa / LDAP renaming). It seems to me the image needs to be refreshed; for continuous integration purposes we still need Precise instances.

I just tested this a moment ago, and it worked fine for me. On Friday I installed a new precise base image that uses the new LDAP settings and includes an updated bash and a separate /var/log partition.

OK -- that last comment was both right and wrong.

New instances /do/ work. But there's still a smattering of virt0 and virt1000 references in them, which I am cleaning up.

I don't think the ldap thing is the problem. The log I pasted in comment 0 shows that it tried both. It's failing for a different reason.
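For reference, the "tried both" observation can be checked mechanically. The following is a rough sketch only, assuming the nslcd message wording seen in the pasted log; the function name `summarize_nslcd` and the two regular expressions are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Patterns assume the nslcd wording seen in the pasted log, e.g.
#   "failed to bind to LDAP server ldap://host:389: ..."
#   "connected to LDAP server ldap://host:389"
BIND_FAIL = re.compile(r"failed to bind to LDAP server ldap://([^:/\s]+)")
CONNECTED = re.compile(r"connected to LDAP server ldap://([^:/\s]+)")


def summarize_nslcd(lines):
    """Count bind failures and successful connections per LDAP host."""
    failed, connected = Counter(), Counter()
    for line in lines:
        match = BIND_FAIL.search(line)
        if match:
            failed[match.group(1)] += 1
            continue
        match = CONNECTED.search(line)
        if match:
            connected[match.group(1)] += 1
    return failed, connected


if __name__ == "__main__":
    import sys

    # Usage: pipe the console output in on stdin.
    failed, connected = summarize_nslcd(sys.stdin)
    for host in sorted(set(failed) | set(connected)):
        print(f"{host}: {failed[host]} bind failures, {connected[host]} connections")
```

Run against the console output from comment 0, this should report only bind failures for virt0.wikimedia.org and successful connections for virt1000.wikimedia.org, which is consistent with the client falling back to the second server rather than giving up on LDAP entirely.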

I just created new images last night which seem generally happier. Try again?

The existing instance was never fixed, but new instances do seem to work fine (assuming it's not a race condition). I'll nuke the instance and re-create it for now.