analytics1012 fails Hadoop applications and jobs
Closed, Resolved (Public)

Assigned To

None

Authored By

	QChris
	Apr 3 2014, 10:13 AM

Description

When walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], it seems that applications failed if
and only if they were started on analytics1012:8042 [2].

I also checked about a dozen succeeded applications (hence started on
nodes other than analytics1012:8042), and their subordinate mapreduce
jobs likewise failed if and only if they were run on
analytics1012:8042 [3].

Is there something wrong with analytics1012:8042?
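
The "failed if and only if started on analytics1012:8042" observation can be checked mechanically by tallying outcomes per node. What follows is only a minimal sketch: the record layout (dicts with "node" and "status" keys), both function names, and the sample records are hypothetical and made up for illustration. They are not data taken from the cluster.

```python
from collections import defaultdict


def tally_by_node(apps):
    """Count final statuses per node.

    `apps` is a list of dicts with the (hypothetical) keys
    "node" and "status", e.g. as copied by hand from the cluster UI.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for app in apps:
        counts[app["node"]][app["status"]] += 1
    # Convert nested defaultdicts to plain dicts for easy printing/comparison.
    return {node: dict(statuses) for node, statuses in counts.items()}


def suspect_nodes(apps):
    """Return nodes that have failures and not a single success."""
    tally = tally_by_node(apps)
    return sorted(
        node
        for node, statuses in tally.items()
        if statuses.get("FAILED", 0) > 0 and statuses.get("SUCCEEDED", 0) == 0
    )


if __name__ == "__main__":
    # Illustrative records only, not real job history.
    sample = [
        {"node": "analytics1012:8042", "status": "FAILED"},
        {"node": "analytics1015:8042", "status": "SUCCEEDED"},
        {"node": "analytics1012:8042", "status": "FAILED"},
    ]
    print(tally_by_node(sample))
    print(suspect_nodes(sample))
```

A node that shows up in `suspect_nodes` while every other node has only successes would match the pattern described above.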

[1] http://analytics1010.eqiad.wmnet:8088/cluster

[3] For example, application 1387838787660_2796 [4] was started on
analytics1015:8042 and hence succeeded. But it had one failed map
attempt, which was again on analytics1012:8042 [5].

Such failed subordinate mapreduce jobs on analytics1012:8042 report
timeouts, for example:

AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs

[4] http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796

[5] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED

Version: unspecified
Severity: normal

Details

Reference: bz63470

Event Timeline

• bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:10 AM

• bzimport added a project: Analytics-General-or-Unknown.

• bzimport set Reference to bz63470.

• bzimport added a subscriber: Unknown Object (MLST).

QChris created this task. Apr 3 2014, 10:13 AM

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1522

Bug 63472 might be related.

This matches some anecdotal evidence from Oliver that there were problems with the analytics1012 node.

Diederik updated the java version IIRC. I do not know how he made this change.

I suspect the fastest way forward with this node is to decommission it and repave it, because we don't really know what Diederik did with it. Perhaps puppet can tell us if the versions are different?

(In reply to Toby Negrin from comment #3)

This matches some anecdotal evidence from Oliver that there were problems
with the analytics1012 node.

Yep. I reported this a while ago, but it looks like the bug turned out to be a pair of bugs ("analytics1012 keeps dropping jobs" and "INSERT OVERWRITE doesn't work") and the second one masked the first.

Diederik updated the java version IIRC. I do not know how he made this
change.

Not sure of the details, but I'm pretty sure he just went into the box and upgraded by hand.

otto wrote:

YES! Found it. /etc/hosts had a bad IP listed on analytics1012 for itself. Fixed and things look much better now!
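
A minimal sketch of how a wrong self-entry in /etc/hosts could be spotted, assuming the correct address of the node is known from elsewhere (for example DNS or the interface configuration). The function names and the sample addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical; the actual bad and correct IPs are not recorded in this task.

```python
def addresses_for(hosts_text, hostname):
    """Return the IPs that a hosts-file text maps to `hostname`."""
    found = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname in names:
            found.append(ip)
    return found


def mismatched(hosts_text, hostname, expected_ip):
    """Return hosts-file IPs for `hostname` that differ from `expected_ip`."""
    return [ip for ip in addresses_for(hosts_text, hostname) if ip != expected_ip]


if __name__ == "__main__":
    # Illustrative content only; addresses are placeholders.
    sample = (
        "127.0.0.1 localhost\n"
        "192.0.2.99 analytics1012.eqiad.wmnet analytics1012  # stale entry\n"
    )
    print(mismatched(sample, "analytics1012", "192.0.2.12"))
```

On a real node, `hosts_text` would be the content of /etc/hosts and `expected_ip` the address the node is actually reachable on; a non-empty result points at an entry like the one fixed here.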
