Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005)
Closed, ResolvedPublic
Assigned To

None

Authored By

	hashar
	Oct 8 2014, 8:44 AM

Description

The Jenkins master on gallium is unable to connect to the deployment-cxserver01.eqiad.wmflabs instance to update the content translation code base (the job is beta-cxserver-update-eqiad).

When starting the Jenkins agent over ssh:

[10/08/14 08:41:58] [SSH] Opening SSH connection to 10.68.17.162:22.
java.io.IOException: There was a problem while connecting to 10.68.17.162:22
...
[10/08/14 08:42:01] [SSH] Connection closed.
[10/08/14 08:42:01] Launch failed - cleaning up connection
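To tell a dead or unreachable instance apart from an SSH-level problem (keys, agent launch), a plain TCP check against port 22 can help. The following is only a minimal sketch using the Python standard library; the helper name `ssh_port_reachable` and the 5 second timeout are assumptions made for illustration, not part of the Jenkins setup described here.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper: it only checks that something accepts TCP
    connections on the port, it does not authenticate or speak SSH.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `ssh_port_reachable("10.68.17.162")` returning `False` would suggest that the instance itself is down or unreachable, rather than a credential or agent problem on the Jenkins side.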

Version: unspecified
Severity: normal

Details

Reference: bz71783

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T73783 Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005)
		Resolved		coren	T73873 role::labs::lvm::mnt ends up with make-instance-vg: failed to create new partition

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 3:53 AM

• bzimport added a project: Cloud-VPS.

• bzimport set Reference to bz71783.

• bzimport added a subscriber: Unknown Object (MLST).

hashar created this task. Oct 8 2014, 8:44 AM

The virt1005 compute node died overnight, might explain the issue.

The instance is hosted on virt1005 which died overnight. I have marked the Jenkins slave as offline: https://integration.wikimedia.org/ci/computer/deployment-cxserver01/

I attempted to reboot it via OpenStackManager but it does not come back. I guess the VM is corrupted.

Impact:

The content translation server is not running for the beta cluster.
Code updates for the content translation servers are obviously not pushed :D

Moving to Infrastructure (corrupted VM apparently)

Link to instance information: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000421.eqiad.wmflabs

If this instance has any important data I can try to reclaim the drive contents. Otherwise you should just delete and recreate.

(In reply to Andrew Bogott from comment #4)

If this instance has any important data I can try to reclaim the drive
contents. Otherwise you should just delete and recreate.

Thank you for the time spent on investigating the issue.

I will check with the cxserver folks (Kartik and Niklas, added to cc) and see whether they need any data. Else we will recreate it and update relevant configuration files. It is fully puppetized AFAIK.

Creating an instance deployment-cxserver02:

Size: m1.medium
OS: Ubuntu Trusty
Security rules: default, cxserver

I.e., the same as deployment-cxserver01 used to be.

Kartik confirmed we can get rid of the instance. Since the beta cluster is out of quota, that is convenient.

(Context:

The virt1005 compute node died overnight, might explain the issue.

https://lists.wikimedia.org/pipermail/labs-l/2014-October/002982.html )

CRITICAL: deployment-prep.deployment-cxserver02.puppetagent.failed_events.value (100.00%)

(In reply to Greg Grossmeier from comment #9)

CRITICAL:
deployment-prep.deployment-cxserver02.puppetagent.failed_events.value
(100.00%)

Yup that is due to this bug. I wanted to acknowledge the alarm, but since the monitor is on the production Icinga I lack permissions to do so.

(In reply to Antoine "hashar" Musso from comment #10)

(In reply to Greg Grossmeier from comment #9)

CRITICAL:
deployment-prep.deployment-cxserver02.puppetagent.failed_events.value
(100.00%)

Yup that is due to this bug. I wanted to acknowledge the alarm, but since
the monitor is on the production Icinga I lack permissions to do so.

Yuvi: Halp? How should we address this?

So Puppet fails on cxserver02 because it tries to create an LVM volume and fails (/mnt, I think), leading to cascading failures (among which this is one, I presume). ^d ran into the same problem on his new ES box there as well, I think.

I'll investigate in a bit, but andrewbogott/coren/others feel free to take this as well...

(In reply to Yuvi Panda from comment #12)

So puppet fails on cxserver02 because it tries to create a lvm volume and
fails (/mnt, I think), leading to cascading failures (among which this is
one, I presume). ^d ran into the same problem on his new ES box there as
well, I think.

I'll investigate in a bit, but andrewbogott/coren/others feel free to take
this as well...

Yup, I have filed another bug for it under Labs > Infrastructure:

https://bugzilla.wikimedia.org/show_bug.cgi?id=71873