
Raw webrequest partition for 'upload' for 2014-10-10T15:xx:xx not marked successful
Closed, ResolvedPublic

Description

For the hour 2014-10-10T15:xx:xx, the upload partition [1] was not marked
successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 10:42:42 // exit code: 0
cwd: ~
cluster-scripts/dump_webrequest_status.sh

+---------------------+--------+--------+--------+--------+
| Date                |  bits  |  text  | mobile | upload |
+---------------------+--------+--------+--------+--------+

[...]

| 2014-10-10T13:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-10T14:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-10T15:xx:xx |    .   |    .   |    .   |    X   |    
| 2014-10-10T16:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-10T17:xx:xx |    .   |    .   |    .   |    .   |

[...]

+---------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
X --> Partition is not ok (duplicates, missing, or nulls)

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=71994

Details

Reference
bz71948

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:47 AM
bzimport set Reference to bz71948.
bzimport added a subscriber: Unknown Object (MLST).

The Oozie job that checks that partition has status KILLED [1], and it
seems to have been killed by user hdfs at 17:28 [2].
A few minutes later, the bundles were restarted, so I assume the
partition-checking job was killed deliberately.

However, since the job's sequence statistics had not been fully
computed (the job was killed at 95% of the reduce step), I started the
recomputation job by hand.

The sequence stats recomputation is done, and the partition has neither
missing nor duplicate entries.

Hence, I manually marked the partition good.
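
For readers unfamiliar with the check, here is a minimal sketch of how sequence statistics could separate an ok partition from a bad one. It assumes each host stamps its requests with a consecutive sequence number, so gaps mean missing entries and repeats mean duplicates. The function names and the host names in the usage note are hypothetical; this is not the actual Hive query run by the generate_sequence_statistics action.

```python
def sequence_stats(sequence_numbers):
    """Summarize one host's sequence numbers for one hourly partition.

    Assumption: sequence numbers are consecutive integers per host;
    None stands for a null sequence value.
    """
    nulls = sum(1 for s in sequence_numbers if s is None)
    seen = [s for s in sequence_numbers if s is not None]
    if not seen:
        return {"expected": 0, "actual": 0, "missing": 0,
                "duplicates": 0, "nulls": nulls}
    distinct = set(seen)
    # With consecutive numbering, the range spanned is the expected count.
    expected = max(distinct) - min(distinct) + 1
    return {
        "expected": expected,
        "actual": len(seen),
        "missing": expected - len(distinct),
        "duplicates": len(seen) - len(distinct),
        "nulls": nulls,
    }


def partition_is_ok(per_host):
    """Return True if no host shows missing, duplicate, or null entries."""
    for seqs in per_host.values():
        stats = sequence_stats(seqs)
        if stats["missing"] or stats["duplicates"] or stats["nulls"]:
            return False
    return True
```

With made-up host names, partition_is_ok({"hostA": [5, 6, 7], "hostB": [1, 2]}) returns True, while partition_is_ok({"hostA": [5, 7]}) returns False because sequence number 6 is missing.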

[1]

qchris@analytics1027:~$ oozie job -verbose -info 0037425-140725140105408-oozie-oozi-W

Job ID : 0037425-140725140105408-oozie-oozi-W

Workflow Name : hive_add_partition-wmf_raw.webrequest-upload,2014,10,10,15-wf
App Path : hdfs://analytics-hadoop/wmf/refinery/current/oozie/webrequest/partition/add/workflow.xml
Status : KILLED
Run : 0
User : hdfs
Group : -
Created : 2014-10-10 17:04:54 GMT
Started : 2014-10-10 17:04:54 GMT
Last Modified : 2014-10-10 17:28:15 GMT
Ended : 2014-10-10 17:28:13 GMT
CoordAction ID: 0003812-140725140105408-oozie-oozi-C@2060

Actions

ID Console URL Error Code Error Message External ID External Status Name Retries Tracker URI Type Started Status Ended

0037425-140725140105408-oozie-oozi-W@:start: - - - - OK :start: 0 - :START: 2014-10-10 17:04:54 GMT OK 2014-10-10 17:04:54 GMT

0037425-140725140105408-oozie-oozi-W@add_partition http://analytics1027.eqiad.wmnet:11000/oozie?job=0037426-140725140105408-oozie-oozi-W - - 0037426-140725140105408-oozie-oozi-W SUCCEEDED add_partition 0 local sub-workflow 2014-10-10 17:04:54 GMT OK 2014-10-10 17:05:11 GMT

0037425-140725140105408-oozie-oozi-W@generate_sequence_statistics http://analytics1010.eqiad.wmnet:8088/proxy/application_1409078537822_38526/ - - job_1409078537822_38526 KILLED generate_sequence_statistics 0 resourcemanager.analytics.eqiad.wmnet:8032 hive 2014-10-10 17:05:11 GMT KILLED 2014-10-10 17:28:15 GMT

[2] See HDFS's /var/log/hadoop-yarn/apps/hdfs/logs/application_1409078537822_38526/analytics1029.eqiad.wmnet_8041 line 607:
2014-10-10 17:28:13,907 INFO [IPC Server handler 0 on 36062] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1409078537822_38526 received from hdfs (auth:SIMPLE) at 10.64.36.127
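
To find out who killed a job without reading the whole application log, a small filter along these lines could pull the relevant fields out of such lines. This is only a sketch: the regular expression is modeled on the single log line quoted above, and the names KILL_RE and parse_kill_line are made up for illustration.

```python
import re

# Pattern modeled on the one MRClientService line quoted above; other
# Hadoop versions may word this message differently (assumption).
KILL_RE = re.compile(
    r"^:?(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO .*?"
    r"Kill job (?P<job>job_\d+_\d+) received from (?P<user>\S+) "
    r"\(auth:(?P<auth>\w+)\) at (?P<addr>\S+)"
)


def parse_kill_line(line):
    """Return time, job, user, auth, and addr as a dict, or None if no match."""
    match = KILL_RE.search(line.strip())
    return match.groupdict() if match else None
```

Applied to the line from [2], this yields user hdfs, job job_1409078537822_38526, and time 2014-10-10 17:28:13, which matches the Ended timestamp in the Oozie output in [1].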