Page MenuHomePhabricator

Make webrequest partition validation handle races between time and sequence numbers
Closed, ResolvedPublic

Description

The timestamp reported by varnish is taken when the request arrives.
The sequence number reported by varnish is taken when the response is
sent.

So when requests Foo and Bar arrive at the same cache, and Foo arrives
slightly before Bar, but Bar get its response sooner, it may occur that
the requests get logged as:

+--------------+----------+-----------------+
| request name |     Time | Sequence number |
+--------------+----------+-----------------+
| ...          | 09:59:59 |            4708 |
| ...          | 09:59:59 |            4709 |
| Foo          | 09:59:59 |            4711 |
| Bar          | 10:00:00 |            4710 |
| ...          | 10:00:00 |            4712 |
| ...          | 10:00:00 |            4713 |
+--------------+----------+-----------------+

Since we partition by timestamp, the partition for the 9th hour would
look like

+--------------+----------+-----------------+
| request name |     Time | Sequence number |
+--------------+----------+-----------------+
| ...          | 09:59:59 |            4708 |
| ...          | 09:59:59 |            4709 |
| Foo          | 09:59:59 |            4711 |
+--------------+----------+-----------------+

(hence appearing to miss sequence number 4710), and the partition for
the 10th hour would look like:

+--------------+----------+-----------------+
| request name |     Time | Sequence number |
+--------------+----------+-----------------+
| Bar          | 10:00:00 |            4710 |
| ...          | 10:00:00 |            4712 |
| ...          | 10:00:00 |            4713 |
+--------------+----------+-----------------+

(hence appearing to miss sequence number 4711).

So both partitions look like they'd be missing lines when being looked
at in isolation, and our per partition monitoring flags them both as
faulty.

But when looking at both partitions combined, no line is actually
missing, and the monitoring could flag them as ok.

In the past 2 weeks, we had two such occasions (one for bits, one for
upload).

The manual fix is simple: Generate the _SUCCESS file by hand for both
partitions.

Let's improve our monitoring to be aware of such races and check
for them automatically.

(One way would be that if the naive validation fails, Oozie's <error
to="..."> would no longer go to "kill", but to a follow-up step that
check specifically for this race.)


Version: unspecified
Severity: normal

Details

Reference
bz69615

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:34 AM
bzimport set Reference to bz69615.
bzimport added a subscriber: Unknown Object (MLST).

Just to help us keep track how relevant this is, it again happened on

2014-08-17T20:xx:xx/2014-08-17T21:xx:xx (on upload)
2014-08-18T20:xx:xx/2014-08-18T21:xx:xx (on upload)

Change 155578 had a related patch set uploaded by QChris:
DO NOT MERGE HiveQL file to check for sequence stats across hours

https://gerrit.wikimedia.org/r/155578

Change 155578 abandoned by QChris:
DO NOT MERGE HiveQL file to check for sequence stats across hours

Reason:
Just parked code to aid manual checking across hours

https://gerrit.wikimedia.org/r/155578

Happened again for:

2014-08-23T16:xx:xx/2014-08-23T17:xx:xx (on upload)

Happened again for:

2014-08-31T15:xx:xx/2014-08-31T16:xx:xx (on upload)
2014-09-02T17:xx:xx/2014-09-02T18:xx:xx (on upload)

Happened again for:

2014-09-04T22:xx:xx/2014-09-04T23:xx:xx (on upload)

Happened again for:

2014-09-28T14:xx:xx/2014-09-28T15:xx:xx (on upload)

Happened again for:

2014-10-09T17:xx:xx/2014-10-09T18:xx:xx (on upload)

This looks like something we should resource. Christian, how does this line up with the other things that you are working on in severity?

In terms of severity and other work, it does not line up too well.

And since we're suffering more severe issues, it has not been fixed yet.

But since this race-condition a distraction nonetheless, it's one of the
things I meant when bringing up “work on cluster infrastructure” during
the discussion of goals for this quarter.

So, for me, it's also well within scope, and the fix will happen once the more
pressing things are done.

ok -- makes sense. Thanks Christian.

Happened again for:

2014-10-21T14/2H (on upload)

Happened again for:

2014-10-22T06/2H (on bits)

Happened again for:

2014-10-24T19/2H (on bits)

Happened again for:

2014-10-29T01/2H (on upload)

Happened again for:

2014-10-29T08/2H (on bits)

Happened again for:

2014-11-05T14/2H (on bits)

Happened again for:

2014-11-08T00/2H (on bits)

Happened again for:

2014-11-08T12/2H (on text) <-- first time on text

Happened again for:

2014-11-09T21/2H (on upload)

Happened again for:

2014-11-10T10/2H (on upload)

Happened again for

  • 2014-11-28T07/2H (on bits)
  • 2014-11-28T18/2H (on text)
kevinator set Security to None.

Happened again for:

  • 2014-12-07T15/2H (on bits)

Happened again for:

  • 2014-12-06T19/2H (on upload)

Happened again for:

  • 2014-11-30T21/2H (on upload)

Happened again for:

2014-12-09T12/2H (on text)

Happened again for:

2014-12-15T15/2H (on text)

Happened again for:

  • 2014-11-11T16/2H (on upload)

Happened again for:

  • 2014-11-12T18/2H (on upload)
Ottomata claimed this task.
Ottomata added subscribers: mforns, Ottomata.

@mforns did some work to make our validation and alarms account for this. Resolving.