
analytics1012 fails Hadoop applications and jobs
Closed, Resolved (Public)

Description

When walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], it seems applications failed if and
only if they were started on analytics1012:8042 [2].

I also checked about a dozen applications that succeeded (and hence
were started on nodes other than analytics1012:8042); their subordinate
MapReduce jobs again failed if and only if they ran on
analytics1012:8042 [3].

Is there something wrong with analytics1012:8042?
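
The cross-check described above can be sketched as a small per-node tally. This is only an illustration, not what was actually run: it assumes the application records have already been collected (for example by hand from the cluster page) as dictionaries with the hypothetical keys `host` and `status`, and the sample records are made up.

```python
from collections import Counter, defaultdict


def tally_by_host(apps):
    """Count application outcomes per node.

    `apps` is an iterable of dicts with the hypothetical keys
    'host' (e.g. 'analytics1012:8042') and 'status' (e.g. 'FAILED').
    """
    counts = defaultdict(Counter)
    for app in apps:
        counts[app["host"]][app["status"]] += 1
    return counts


def hosts_that_always_fail(apps):
    """Return the hosts on which applications failed and none succeeded."""
    return sorted(
        host
        for host, c in tally_by_host(apps).items()
        if c["FAILED"] > 0 and c["SUCCEEDED"] == 0
    )


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"host": "analytics1012:8042", "status": "FAILED"},
        {"host": "analytics1015:8042", "status": "SUCCEEDED"},
        {"host": "analytics1012:8042", "status": "FAILED"},
    ]
    print(hosts_that_always_fail(sample))
```

A node that appears in this list while every other node has only successes is the "if and only if" pattern reported here.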

[1] http://analytics1010.eqiad.wmnet:8088/cluster

[2] The URLs for the corresponding failed applications are
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786

[3] So for example application 1387838787660_2796 [4] was started on
analytics1015:8042 and hence succeeded. But it had one failed map
attempt, which was again on analytics1012:8042 [5].

The subordinate MapReduce jobs that fail on analytics1012:8042 report
timeouts, for example:

AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs
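
When going through many such diagnostics, it can help to split the line into its parts. The following is a minimal sketch, assuming the message always has the shape shown above; the function name and the returned keys are hypothetical.

```python
import re

# Matches diagnostics shaped like the timeout line quoted in this report.
TIMEOUT_RE = re.compile(
    r"AttemptID:attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+) "
    r"Timed out after (\d+) secs"
)


def parse_timeout_line(line):
    """Return the parts of a timeout diagnostic, or None if it does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    timestamp, job_seq, kind, task, attempt, secs = match.groups()
    return {
        "job_id": f"job_{timestamp}_{job_seq}",
        "task_type": "map" if kind == "m" else "reduce",
        "task_number": int(task),
        "attempt_number": int(attempt),
        "timeout_secs": int(secs),
    }


if __name__ == "__main__":
    line = "AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs"
    print(parse_timeout_line(line))
```

The derived `job_id` corresponds to the job named in the job history URL [5].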

[4] http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796

[5] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED


Version: unspecified
Severity: normal

Details

Reference
bz63470

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:10 AM
bzimport set Reference to bz63470.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1522

This matches some anecdotal evidence from Oliver that there were problems with the analytics2012 node.

Diederik updated the java version IIRC. I do not know how he made this change.

I suspect the fastest way forward with this node is to decommission it and repave it, because we don't really know what Diederik did with it. Perhaps puppet can tell us if the versions are different?

(In reply to Toby Negrin from comment #3)

> This matches some anecdotal evidence from Oliver that there were problems
> with the analytics2012 node.

Yep. I reported this a while ago, but it looks like the bug turned out to be a pair of bugs ("analytics1012 keeps dropping jobs" and "INSERT OVERWRITE doesn't work") and the second one masked the first.

> Diederik updated the java version IIRC. I do not know how he made this
> change.

Not sure of the details, but I'm pretty sure he just went into the box and upgraded by hand.

otto wrote:

YES! Found it. /etc/hosts had a bad IP listed on analytics1012 for itself. Fixed and things look much better now!
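
For checking other nodes for the same kind of problem, a rough sketch follows. It assumes you already know the IP address each host should have; the function names are hypothetical, and the addresses in the example are made-up documentation addresses, since the actual bad entry is not recorded in this ticket.

```python
def hosts_entries(hosts_text, hostname):
    """Return the IP addresses that hosts-file text maps to `hostname`."""
    ips = []
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname in names:
            ips.append(ip)
    return ips


def has_bad_self_entry(hosts_text, hostname, expected_ip):
    """True if any entry for `hostname` points at something other than `expected_ip`."""
    return any(ip != expected_ip for ip in hosts_entries(hosts_text, hostname))


if __name__ == "__main__":
    # Made-up file content and addresses, for illustration only.
    text = (
        "127.0.0.1 localhost\n"
        "192.0.2.99 analytics1012.eqiad.wmnet analytics1012  # stale entry\n"
    )
    print(has_bad_self_entry(text, "analytics1012", "192.0.2.12"))
```

On a real node the text would come from reading /etc/hosts, and the expected address from whatever source of truth is used for host addresses (an assumption here).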