
Wikipedia Zero job for 2014-03-01 failed on Hadoop with "java.io.IOException: stored gzip size doesn't match decompressed size"
Closed, ResolvedPublic

Description

The tail of the relevant log is

-----8<-----Begin: log tail-----8<-----
2014-04-01 11:24:09,655 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:12,656 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:15,657 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:18,658 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:21,659 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:24,660 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:27,673 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2014-04-01 11:24:27,730 [Thread-2] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.io.IOException: stored gzip size doesn't match decompressed size

at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeTrailerState(BuiltInGzipDecompressor.java:389)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:224)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:239)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

2014-04-01 11:24:27,924 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6017: Job failed! Error - NA
-----8<-----End: log tail-----8<-----

I'll investigate whether it's a random failure or something broke.


Version: unspecified
Severity: normal

Details

Reference
bz63371

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:04 AM
bzimport set Reference to bz63371.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1505

Rerunning the job gave the same result, so it's probably not some random failure.

Mhmm ... the uncompressed zero files for today are, for the first time,
above 2^32 bytes. Trimming each file to below 2^32 bytes makes things
work again.

Our big data tooling cannot take more than 32-bit sized data?

And it's 1st April ... epic :-D
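The 2^32 boundary matches the gzip format itself: per RFC 1952, the ISIZE field in a gzip trailer stores the uncompressed length *modulo 2^32*, and the failing check in `BuiltInGzipDecompressor.executeTrailerState` presumably compared that 32-bit field against the full decompressed length. A minimal Python sketch of the trailer behavior (small input for illustration; the real files exceeded 2^32 bytes):

```python
import gzip
import struct

# RFC 1952: the last 4 bytes of a gzip stream (ISIZE, little-endian)
# hold the uncompressed size modulo 2^32. For inputs >= 2^32 bytes the
# stored value therefore no longer equals the true length, which is
# what the decompressor's trailer check tripped over.
data = b"x" * 100_000
blob = gzip.compress(data)
isize = struct.unpack("<I", blob[-4:])[0]
print(isize)  # 100000, i.e. len(data) % 2**32
```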

Upstream bug seems to be

https://issues.apache.org/jira/browse/HADOOP-8900

That fix is included in Hadoop 1.2.0, but the Pig snapshot version we
used up to now for Wikipedia Zero builds against Hadoop <1.2.0.

Rebuilding the current Pig head from source also uses Hadoop <1.2.0.

Cloudera picks up the upstream fix in CDH 4.2.0. However, the CDH
4.2.0 pig jar from

https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/pig/pig/0.10.0-cdh4.2.0/pig-0.10.0-cdh4.2.0.jar

does not bundle its dependencies and fails with

Exception in thread "main" java.lang.NoClassDefFoundError: jline/ConsoleReaderInputStream
        at java.lang.Class.getDeclaredMethods0(Native Method)
[...]

Adding all dependencies by hand would be heavy lifting.

However, Cloudera's archive at

http://archive-primary.cloudera.com/cdh4/cdh/4/pig-0.10.0-cdh4.2.0.tar.gz

holds the fully built sources. In that archive,

pig-0.10.0-cdh4.2.0.jar

is the jar with all dependencies bundled, so it can be used to run Pig
in local mode without extending the classpath by hand.
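Since a jar is a plain zip archive, whether a given pig jar actually bundles a dependency such as jline can be checked up front. A small sketch (the `bundles` helper and the in-memory demo jar are made up for illustration):

```python
import io
import zipfile

def bundles(jar, prefix):
    """Return True if the jar (a plain zip) contains any entry under prefix."""
    with zipfile.ZipFile(jar) as zf:
        return any(name.startswith(prefix) for name in zf.namelist())

# Demo with a tiny in-memory "jar" containing one bundled jline class;
# against a real file you would pass its path instead.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("jline/ConsoleReaderInputStream.class", b"\xca\xfe\xba\xbe")
print(bundles(buf, "jline/"))  # True
```

Running this check against the bare Maven-repository jar would have flagged the missing jline classes before the `NoClassDefFoundError` showed up at runtime.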

Using that jar, the carrier file could be generated again.

I'll do some more tests tomorrow to make sure the switch in the Pig
version does not affect the numbers.

I recomputed the data for a few days using the new pig.jar, and it matched
the data we received from the old jar.
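An order-insensitive comparison of the old and new job outputs can be sketched as follows (file names and row contents are fabricated for the example; the actual output format may differ):

```python
import os
import tempfile
from collections import Counter

def same_rows(path_a, path_b):
    # Compare as multisets of lines: order-insensitive but
    # duplicate-sensitive, since reducers may emit rows in any order.
    with open(path_a) as a, open(path_b) as b:
        return Counter(a) == Counter(b)

# Demo: two hypothetical per-carrier count files with rows in different order.
with tempfile.TemporaryDirectory() as d:
    old_out = os.path.join(d, "old_pig.tsv")
    new_out = os.path.join(d, "new_pig.tsv")
    with open(old_out, "w") as f:
        f.write("carrierA\t42\ncarrierB\t7\n")
    with open(new_out, "w") as f:
        f.write("carrierB\t7\ncarrierA\t42\n")
    matched = same_rows(old_out, new_out)
print(matched)  # True
```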

Logs did not show any peculiarities with the new pig.jar.

Thanks Christian -- nice work.

-Toby