
Hive freezes when starting a query, and produces the following error...
Closed, Resolved · Public

Description

"Ended Job = job_1387838787660_1390 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_1390/
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec"

This happens to different types of queries, at different times, and doesn't seem to bear any relation to the query itself; I reran the query that generated the error /this/ time immediately after it errored out, and it worked fine.


Version: unspecified
Severity: major

Details

Reference
bz61100

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:53 AM
bzimport set Reference to bz61100.
bzimport added a subscriber: Unknown Object (MLST).

(Presumably the actual error console can break the errors down by task and so provide more useful data than 'code 2')

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1440

This bug (or class of bug) has continued to make itself known. It's particularly concerning and frequent when running queries that contain subqueries, since such a query is treated as multiple jobs, which increases the probability that one will fail - and if any ONE element fails, the whole query fails. As an example, I've been running variants of:

INSERT OVERWRITE TABLE ironholds.distinct_ip
SELECT distip
FROM (SELECT ip AS distip, COUNT(*) AS count FROM wmf.webrequest_mobile
      WHERE year = 2014 AND month = 1 AND day = 20
        AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1', 'text/html\; charset=UTF-8', 'text/html')
      GROUP BY ip HAVING COUNT(*) >= 2) sub1 LIMIT 10000;

and I've had three failures out of the previous four queries (which, since each query runs as two jobs, works out as 3 failed jobs out of 8). Syntactically valid queries failing seemingly randomly with no explanation is a pretty substantial blocker to being able to rely on Hive for production tasks.
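
To make the compounding explicit (an illustration only; p here is an assumed per-job failure rate, not a measured one): if Hive splits a query into k independent MapReduce jobs and each fails with probability p, the query fails whenever any one job does, so

P(query fails) = 1 - (1 - p)^k

With one subquery (k = 2) and a 20% per-job failure rate, that is already 1 - 0.8^2 = 0.36, i.e. roughly one failed query in three.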

There were indeed some issues with analytics1012: it was running an old version of Java. Ottomata has resolved that, and I tried your query with success.
@Oliver: can you run your query again to confirm that the issue has been resolved?

Now fixed; analytics1012 had an outdated version of Java.

Still broken, still on analytics1012 - see task 1387838787660_1540. Most helpfully, the error message was " Application application_1387838787660_1540 failed 1 times due to . Failing the application. "

Digging through more log files, I found:

2014-02-14 01:05:07,873 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1387838787660_1547_r_000542_0: Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Unable to rename output from: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_task_tmp.-mr-10002/_tmp.000542_0 to: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_tmp.-mr-10002/000542_0

This maps to a Hive issue: https://issues.apache.org/jira/browse/HIVE-4605

@Oliver: can you rerun the query without the OVERWRITE statement and see if that solves the problem?
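
For reference, a minimal sketch of that suggestion (assuming "without the OVERWRITE statement" means dropping the INSERT OVERWRITE clause entirely and just fetching the result, rather than switching to INSERT INTO):

-- Hypothetical variant of the query above with the INSERT OVERWRITE removed,
-- so the job output is fetched directly instead of being renamed into ironholds.distinct_ip.
SELECT distip
FROM (SELECT ip AS distip, COUNT(*) AS count FROM wmf.webrequest_mobile
      WHERE year = 2014 AND month = 1 AND day = 20
        AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1', 'text/html\; charset=UTF-8', 'text/html')
      GROUP BY ip HAVING COUNT(*) >= 2) sub1 LIMIT 10000;

If the rename failure only happens when writing the final table, this variant would be expected to complete; if it still fails, the problem lies in the intermediate job output rather than the overwrite itself.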

Otto -- can you just pull this machine from the cluster? It's causing a lot of problems and we should repave it or something.

thanks,

-Toby

otto wrote:

Oliver's most recent issue doesn't seem to have anything to do with analytics1012 anymore. He's still having problems, just not related to his initial report.

There's also this issue:
https://issues.apache.org/jira/browse/HIVE-3828

Ooh; plausible. Thanks for the explanation :). I'm confused as to why it's only /sometimes/ failing, though.

otto wrote:

Btw, the analytics1012 problem is fixed, woo!

Ironholds claimed this task.

Seems resolved.