
Instances fail to initialize on initial boot due to network communication failures
Closed, Resolved · Public

Description

Initial boot console log for failing instance

I'm trying to build the four m1.large elasticsearch hosts for beta.eqiad in the deployment-prep project. Instance creation via the wikitech web interface succeeds and the hosts begin their initial boot process. During this first boot the hosts experience failures communicating with the LDAP servers and the labs puppetmaster. This leaves them in an unusable state where ssh by normal users is not possible. Rebooting the instances does not seem to correct the issues. This is possibly due to the failure of the initial puppet run.

The failure does not seem to be isolated to the deployment-prep project or the m1.large image. I can reproduce the problem in the wikimania-support and logstash projects and with small, medium, large and xlarge instances.

First seen by me around 2014-03-21T22:49Z, though that was my first attempt to build new instances that day. The problem persists today. Times in the IRC logs below are MDT (GMT-6).
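
As a rough way to separate a routing problem from a service outage, a probe like the one below can check whether a broken instance can even open TCP connections to the LDAP and puppetmaster ports. This is only a sketch: the hostnames are placeholders and the ports (389 for LDAP, 8140 for the puppet master) are the usual defaults, not values confirmed for labs.

```
#!/usr/bin/env python3
"""Rough reachability probe for the services a new labs instance needs on
first boot. Hostnames below are placeholders, not the real labs servers."""
import socket

# (description, host, port) -- hosts are assumptions for illustration only
TARGETS = [
    ("LDAP", "ldap.example.wmflabs", 389),
    ("puppetmaster", "puppet.example.wmflabs", 8140),
]

def probe(host, port, timeout=5):
    """Return None if a TCP connection succeeds, otherwise the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return str(exc)

for name, host, port in TARGETS:
    err = probe(host, port)
    status = "reachable" if err is None else f"unreachable ({err})"
    print(f"{name} {host}:{port} -> {status}")
```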

[16:49:29] <bd808> Coren: I'm trying to create some new instances for the deploymnet-prep project in eqiad and they are blowing up on initial boot with ldap connection timeout errors to the instance console.
[16:49:49] <Coren> o_O
[16:49:50] <bd808> Instances are deploymnet-es[012]
[16:50:37] <Coren> bd808: Checking.
[16:50:56] <bd808> The last time I saw this Andrew eventually found out the server they we placed on was missing a network cable
[16:51:24] <Coren> bd808: That's not the case here; the box is actually alive and reachable.
[16:51:32] <bd808> i-0000026[cde] if that helps
[16:52:01] <Coren> It also seems to have only a partial puppet run done.
[17:00:07] <Coren> bd808: I'm honestly not seeing anything wrong with your instances, except for the fact that it does't look like puppet ran correctly.
[17:00:39] <Coren> bd808: LDAP is up, at least, so I don't know where the connection errors might come from except, perhaps, that the config files weren't puppeted in?
[17:00:58] <Coren> Stupid question, have you tried rebooting them to force a new puppet run?
[17:01:46] <bd808> Coren: So … reboot and hope?
[17:01:53] <bd808> jinx
[17:02:02] <bd808> I can totally do that
[17:02:15] <bd808> and I can nuke them and try again if that doesn't work
[17:04:24] <bd808> "deployment-es0 puppet-agent[1200]: Could not request certificate: Connection timed out"
[17:04:51] * bd808 will blow them up and start over
[17:04:57] <Coren> Wait, that has nothing to do with LDAP; that's the puppet master being out of its gourd (which would explain why you don't have a complete puppet run)
[17:06:38] <bd808> They look jacked up. "Could not set 'directory on ensure: File exists - /var/run/puppet"
[17:07:14] <bd808> puppet agent failed to start on reboot
[17:07:37] <Coren> Well yeah, if it doesn't have a cert then it can't work.
[17:07:41] *** andrewbogott_afk is now known as andrewbogott
[17:07:56] <Coren> bd808: Try just one at first. I want to see why the first run failed.
[17:08:25] <bd808> Ok. I'll start with es0
[17:11:13] <bd808> Coren: Could not parse configuration file: Certificate names must be lower case; see #1168
[17:11:20] <bd808> Coren: Starting puppet agent [fail]
[17:12:05] <bd808> That's initial boot on the "new" es0 (i-0000026f)
[17:19:56] <bd808> Coren: Same final result "deployment-es0 puppet-agent[1194]: Could not request certificate: Connection timed out - connect(2)"

[17:22:33] <andrewbogott> bd808: what project is this?
[17:22:44] <bd808> andrewbogott: deployment-prep
[17:24:06] <bd808> 4 m1.large image creations in a row have died on first boot with logs full of ldap timeouts from nslcd followed by failure to get the cert from the puppet master
[17:24:56] <andrewbogott> bd808: is it just large instances that fail?
[17:25:16] <bd808> andrewbogott: I haven't tried other sizes today
[17:26:49] <bd808> I'm setting up the cirrus cluster. Created 3 m1.large in rapid succession, got an "instance not created" error when trying to create the 4th. Went to console of es0 (first one made) and saw these errors.
[17:27:27] <bd808> The next two instances showed the same error logs. Nuked es0 and created it again
[17:27:31] <bd808> same outcome
[17:28:54] <andrewbogott> bd808: your project was pushed right up against the quota for cores. I don't know if that was the problem, but… I just raised it quite a bit.

[17:51:31] <andrewbogott> bd808: have you ever had a large size instance work?
[17:51:51] <andrewbogott> I just tried, small is working but large it not… trying medium now
[17:52:07] <andrewbogott> Why would that affect network connectivity? I cannot guess.
[17:52:20] <bd808> andrewbogott: That's a good question. I don't know that I've tried to build one before. small and xl have worked in the past
[17:53:53] <bd808> I'm sure Nik wouldn't mind having xlarge instances if that's the case
[18:00:00] <bd808> andrewbogott: Not totally confirmed yet, but it looks like xlarge may be having the same issues
[18:00:25] <andrewbogott> Yeah, I can't make anything but 'small' start up.


Version: unspecified
Severity: critical


Details

Reference
bz62958

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:53 AM
bzimport set Reference to bz62958.
bzimport added a subscriber: Unknown Object (MLST).

Here's something I noticed that is different about one of the instances having this problem, deployment-elastic01 (i-00000275.eqiad.wmflabs): it has been given the IP address 10.68.17.2. All of the other eqiad instances in the deployment-prep project have IP addresses that fall within the 10.68.16.0/24 CIDR range.

The assigned range for eqiad labs seems to be 10.68.16.0/21, but is it possible that there is a firewall or ACL rule somewhere that is set to 10.68.16.0/24 instead, which would block LDAP and puppet communications?
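
As a sanity check on that theory, here is a sketch using Python's standard ipaddress module; the /24 is only the suspected scope of the stale rule, not a confirmed configuration:

```
import ipaddress

addr = ipaddress.ip_address("10.68.17.2")         # deployment-elastic01
assigned = ipaddress.ip_network("10.68.16.0/21")  # documented eqiad labs range
suspect = ipaddress.ip_network("10.68.16.0/24")   # suspected stale route/ACL scope

print(addr in assigned)  # True  -- inside the assigned /21
print(addr in suspect)   # False -- outside the /24, so a /24-only rule would miss it
```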

Hah, I just noticed that a second ago as well. Pursuing that idea now...

  • Bug 62999 has been marked as a duplicate of this bug.

This seems to be fixed now. Andrew reported via IRC that a static route for 10.68.16.0/24 was found on the routers. I assume this has been changed to a static route for 10.68.16.0/21.

The deployment-elastic01.eqiad.wmflabs instance that I left running with the broken configuration recovered and was able to communicate with LDAP and the labs puppetmaster. Some configuration seemed to remain broken, however, as the instance was not recognising me as a member of the group that is allowed to run sudo without a password. I tried one reboot to see if this would self-correct and, when it didn't, I deleted the instance and built a replacement. The replacement is working as expected.