
Hive is broken on stat1002
Closed, Resolved, Public

Description

ironholds@stat1002:~$ hive
Unable to determine Hadoop version information.
'hadoop version' returned:
No default-logstash-fields.properties resource present, using defaults
Hadoop 2.3.0-cdh5.0.2
Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
Compiled by jenkins on 2014-06-09T16:20Z
Compiled with protoc 2.5.0
From source with checksum 75596fe27f833e512f27fbdaaa7b0ab
This command was run using /usr/lib/hadoop/hadoop-common-2.3.0-cdh5.0.2.jar


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=70330

Details

Reference
bz70203

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:38 AM
bzimport set Reference to bz70203.
bzimport added a subscriber: Unknown Object (MLST).

(just wanted to file the same bug :-) )

The breakage happened around 2014-08-30 ~00:49 [1].

Around that time bc8e34859268b6943f1e2c9621bd01bdc6676371 got merged,
which turned gelf logging on.

(We saw gelf logging cause the exact same problems four days ago [2];
that was worked around by turning gelf logging off, see
82cab341b6070d95437b00f005280fed3289dcac.)


The immediate work-around is to create an empty default-logstash-fields.properties in the current directory:

touch default-logstash-fields.properties

Then hive starts without issues again, and queries etc. also work.
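
For example, the whole session on stat1002 might look like the following (only a sketch; the directory and the test query are illustrative, the only required step is the touch):

# run from whatever directory you normally start hive in
cd ~
touch default-logstash-fields.properties

# hive now starts again and can answer queries, e.g.
hive -e 'SHOW DATABASES;'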


[1] I had a couple of jobs running during the night.
At 00:47:19 the last successful one started.
At 00:49:00 the first failing job started.

[2] See
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140826.txt
starting at 20:49:30, and
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140826.txt
starting at 20:55:17

Opsen -- can we please consider some sort of sanity check after cluster maintenance? I'm also wondering whether the data quality scripts also broke.

Thanks for grabbing, Christian.

jgage wrote:

Sorry folks. I did sanity-check with 'hdfs', but because that output is just a warning I didn't think it would cause problems. I'll also test with 'hive' in the future. I did a lot of research into upstream defaults before making this change and was surprised at the outcome. I'll disable gelf again for now.
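
Such a post-maintenance check could even be scripted; the following is a rough sketch only (the commands and the warning text come from this task, the script itself and its exact checks are hypothetical):

#!/bin/bash
# Hypothetical post-maintenance sanity check for Hadoop client hosts (e.g. stat1002).
set -e

# 'hadoop version' should not warn about a missing logstash resource
# (the symptom that broke hive in this task).
if hadoop version 2>&1 | grep -q 'No default-logstash-fields.properties'; then
  echo "hadoop version warns about a missing logstash properties file" >&2
  exit 1
fi

# hdfs should be reachable ...
hdfs dfs -ls / > /dev/null

# ... and hive should start and answer a trivial query.
hive -e 'SHOW DATABASES;' > /dev/null

echo "client sanity check passed"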

I discovered this ticket via Google search results while troubleshooting :P

(Adding jgage to CC)

(In reply to Toby Negrin from comment #2)

I'm also wondering if the data quality scripts also broke.

Even if they did, our setup allows us to re-check partitions easily
without confusing Icinga. So we're safe and prepared for that.

However, the hive breakage is limited to non-cluster machines
like stat1002, while the monitoring runs from within the cluster.
So the monitoring is working:

+---------------------+--------+--------+--------+--------+
| Date                |  bits  |  text  | mobile | upload |
+---------------------+--------+--------+--------+--------+
[...]
| 2014-08-30T00:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T01:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T02:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T03:xx:xx |    X   |    X   |    X   |    X   |  <-- problematic commit was merged.
| 2014-08-30T04:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T05:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T06:xx:xx |    .   |    .   |    .   |    X   |  <-- needs investigation
| 2014-08-30T07:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T08:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T09:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T10:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T11:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-08-30T12:xx:xx |    .   |    .   |    .   |    .   |    
[...]

Statuses:

. --> Partition is ok
X --> Partition is not ok (duplicates, missing, or nulls)

Thanks for grabbing, Christian.

I didn't grab the issue -- I just provided a work-around :-)
There is not much I can do there. Only ops people can merge
to the operations/puppet repo. And since there is a workaround that
makes hive work again on stat1002, I think we can safely wait for
a proper fix next week.

Let's not forget: Hive is not yet a production service ;-)

Christian --

I ran a hive query and redirected the output to a file -- so I thought hive was running :(

Totally agree -- Hive is not a production service and there is no expectation of off-hour support.

Gage --

We can cc you on all tickets if you want. We are pretty bugzilla-focused here. Let's discuss Tuesday.

thanks all

-Toby

Works for me again (hence closing). Thanks!

Just to keep bugs connected:

(In reply to christian from comment #4)

The monitoring however runs from within the cluster. So the monitoring
is working:

+---------------------+--------+--------+--------+--------+
| Date                |  bits  |  text  | mobile | upload |
+---------------------+--------+--------+--------+--------+
[...]

[...]

| 2014-08-30T03:xx:xx |    X   |    X   |    X   |    X   |  <-- problematic commit was merged.

This monitoring alert is tracked in bug 70330

[...]

| 2014-08-30T06:xx:xx |    .   |    .   |    .   |    X   |  <-- needs investigation

This monitoring alert is tracked in bug 70331