
Raw webrequest partitions for 2014-10-20T13/1H not marked successful
Closed, DeclinedPublic

Description

None of the webrequest partitions [1] for 2014-10-20T13/1H have been
marked successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 09:43:10 // exit code: 0
cwd: ~/refinery/hive/webrequest
~/cluster-scripts/dump_webrequest_status.sh

+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+

[...]

| 2014-10-20T11/1H |    .   |    .   |    .   |    .   |    
| 2014-10-20T12/1H |    .   |    .   |    .   |    .   |    
| 2014-10-20T13/1H |    X   |    X   |    X   |    X   |    
| 2014-10-20T14/1H |    .   |    .   |    .   |    .   |    
| 2014-10-20T15/1H |    .   |    .   |    .   |    .   |

[...]

+------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
M --> Partition manually marked ok
X --> Partition is not ok (duplicates, missing, or nulls)
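
For illustration only, a table in the format above can be scanned mechanically for partitions that are not ok. The Python sketch below is an assumption about how one might do this by hand; the function name `failed_partitions` is hypothetical, and this is not how dump_webrequest_status.sh itself is implemented.

```python
def failed_partitions(table_text):
    """Return {partition: [sources not ok]} for rows that contain an 'X' status."""
    header = None
    failed = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, "[...]" elisions and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data row: Date, bits, mobile, text, upload
            continue
        bad = [name for name, status in zip(header[1:], cells[1:]) if status == "X"]
        if bad:
            failed[cells[0]] = bad
    return failed


sample = """
+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+
| 2014-10-20T12/1H |    .   |    .   |    .   |    .   |
| 2014-10-20T13/1H |    X   |    X   |    X   |    X   |
| 2014-10-20T14/1H |    .   |    .   |    .   |    .   |
+------------------+--------+--------+--------+--------+
"""

print(failed_partitions(sample))
# {'2014-10-20T13/1H': ['bits', 'mobile', 'text', 'upload']}
```

Rows marked `.` or `M` are treated as ok here, matching the legend above.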

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=72306

Details

Reference
bz72296

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:52 AM
bzimport set Reference to bz72296.
bzimport added a subscriber: Unknown Object (MLST).

The affected period is 2014-10-20T13:07:11--2014-10-20T13:25:38.
Only ulsfo caches were affected, but all of them were.

The affected period shows roughly 2M duplicates, which correspond to

  • 79 seconds of ulsfo data, or
  • 15 seconds of total data.

The affected period shows roughly 27M missing lines, which correspond to

  • 16 minutes of ulsfo data, or
  • 3 minutes of total data.
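
As a rough cross-check of these figures, the following Python sketch derives the request rates they imply. It assumes the window start falls on 2014-10-20 (the date of the affected partition) and takes the rounded counts of 2M and 27M at face value, so the results are only approximate; the variable and function names are made up for this illustration.

```python
from datetime import datetime

# Affected window; the start date is assumed to be 2014-10-20.
start = datetime(2014, 10, 20, 13, 7, 11)
end = datetime(2014, 10, 20, 13, 25, 38)
window_seconds = (end - start).total_seconds()  # 1107 s, about 18.5 minutes

duplicates = 2_000_000   # rounded figure from the comment above
missing = 27_000_000     # rounded figure from the comment above


def implied_rate(lines, seconds):
    """Lines per second implied by 'N lines are worth S seconds of data'."""
    return lines / seconds


rates = {
    "ulsfo (from duplicates)": implied_rate(duplicates, 79),      # ~25,300 lines/s
    "total (from duplicates)": implied_rate(duplicates, 15),      # ~133,300 lines/s
    "ulsfo (from missing)": implied_rate(missing, 16 * 60),       # ~28,100 lines/s
    "total (from missing)": implied_rate(missing, 3 * 60),        # 150,000 lines/s
}

# Both pairs of figures put ulsfo at a bit under a fifth of total traffic.
ulsfo_share_from_duplicates = 15 / 79      # ~0.19
ulsfo_share_from_missing = 3 / 16          # ~0.19

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} lines/s")
print(f"window: {window_seconds:.0f} s")
```

The two estimates agree to within the rounding of the inputs, and the 16 minutes of missing ulsfo data account for most of the roughly 18.5 minute window.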

Ops reported [1] that network issues between ulsfo and eqiad started
at 13:07. This aligns with and explains the issues that we're seeing.

[1] https://lists.wikimedia.org/mailman/private/ops/2014-October/042274.html