Make webrequest partition validation handle races between time and sequence numbers
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	QChris
	Aug 15 2014, 4:57 PM

Description

The timestamp reported by varnish is taken when the request arrives.
The sequence number reported by varnish is taken when the response is
sent.

So when requests Foo and Bar arrive at the same cache, and Foo arrives
slightly before Bar, but Bar get its response sooner, it may occur that
the requests get logged as:

+--------------+----------+-----------------+
| request name |     Time | Sequence number |
+--------------+----------+-----------------+
| ...          | 09:59:59 |            4708 |
| ...          | 09:59:59 |            4709 |
| Foo          | 09:59:59 |            4711 |
| Bar          | 10:00:00 |            4710 |
| ...          | 10:00:00 |            4712 |
| ...          | 10:00:00 |            4713 |
+--------------+----------+-----------------+

Since we partition by timestamp, the partition for the 9th hour would
look like

+--------------+----------+-----------------+
| request name |     Time | Sequence number |
+--------------+----------+-----------------+
| ...          | 09:59:59 |            4708 |
| ...          | 09:59:59 |            4709 |
| Foo          | 09:59:59 |            4711 |
+--------------+----------+-----------------+

(hence appearing to miss sequence number 4710), and the partition for
the 10th hour would look like:

+--------------+----------+-----------------+
| request name |     Time | Sequence number |
+--------------+----------+-----------------+
| Bar          | 10:00:00 |            4710 |
| ...          | 10:00:00 |            4712 |
| ...          | 10:00:00 |            4713 |
+--------------+----------+-----------------+

(hence appearing to miss sequence number 4711).

So both partitions look like they'd be missing lines when being looked
at in isolation, and our per partition monitoring flags them both as
faulty.

But when looking at both partitions combined, no line is actually
missing, and the monitoring could flag them as ok.

In the past 2 weeks, we had two such occasions (one for bits, one for
upload).

The manual fix is simple: Generate the _SUCCESS file by hand for both
partitions.

Let's improve our monitoring to be aware of such races and check
for them automatically.

(One way would be that if the naive validation fails, Oozie's <error
to="..."> would no longer go to "kill", but to a follow-up step that
check specifically for this race.)

Version: unspecified
Severity: normal

Details

Reference: bz69615

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		Ottomata	T72085 Raw webrequest partitions that were not marked successful
		Resolved		Ottomata	T71615 Make webrequest partition validation handle races between time and sequence numbers

Event Timeline

• bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:34 AM

• bzimport added a project: Analytics-Clusters.

• bzimport set Reference to bz69615.

• bzimport added a subscriber: Unknown Object (MLST).

QChris created this task.Aug 15 2014, 4:57 PM

nice find!

Just to help us keep track how relevant this is, it again happened on

2014-08-17T20:xx:xx/2014-08-17T21:xx:xx (on upload)
2014-08-18T20:xx:xx/2014-08-18T21:xx:xx (on upload)

Change 155578 had a related patch set uploaded by QChris:
DO NOT MERGE HiveQL file to check for sequence stats across hours

https://gerrit.wikimedia.org/r/155578

Change 155578 abandoned by QChris:
DO NOT MERGE HiveQL file to check for sequence stats across hours

Reason:
Just parked code to aid manual checking across hours

https://gerrit.wikimedia.org/r/155578

Happened again for:

2014-08-23T16:xx:xx/2014-08-23T17:xx:xx (on upload)

Happened again for:

2014-08-31T15:xx:xx/2014-08-31T16:xx:xx (on upload)
2014-09-02T17:xx:xx/2014-09-02T18:xx:xx (on upload)

Happened again for:

2014-09-04T22:xx:xx/2014-09-04T23:xx:xx (on upload)

Happened again for:

2014-09-28T14:xx:xx/2014-09-28T15:xx:xx (on upload)

Happened again for:

2014-10-09T17:xx:xx/2014-10-09T18:xx:xx (on upload)

This looks like something we should resource. Christian, how does this line up with the other things that you are working on in severity?

In terms of severity and other work, it does not line up too well.

And since we're suffering more severe issues, it has not been fixed yet.

But since this race-condition a distraction nonetheless, it's one of the
things I meant when bringing up “work on cluster infrastructure” during
the discussion of goals for this quarter.

So, for me, it's also well within scope, and the fix will happen once the more
pressing things are done.

ok -- makes sense. Thanks Christian.

Happened again for:

2014-10-21T14/2H (on upload)

Happened again for:

2014-10-22T06/2H (on bits)

Happened again for:

2014-10-24T19/2H (on bits)

Happened again for:

2014-10-29T01/2H (on upload)

Happened again for:

2014-10-29T08/2H (on bits)

Happened again for:

2014-11-05T14/2H (on bits)

Happened again for:

2014-11-08T00/2H (on bits)

Happened again for:

2014-11-08T12/2H (on text) <-- first time on text

Happened again for:

2014-11-09T21/2H (on upload)

Happened again for:

2014-11-10T10/2H (on upload)

Happened again for

2014-11-28T07/2H (on bits)
2014-11-28T18/2H (on text)

Ironholds unsubscribed.Dec 5 2014, 7:39 PM

• kevinator triaged this task as Medium priority.Dec 5 2014, 7:47 PM

• kevinator set Security to None.

Happened again for:

2014-12-07T15/2H (on bits)

Happened again for:

2014-12-06T19/2H (on upload)

Happened again for:

2014-11-30T21/2H (on upload)

QChris mentioned this in T77022: Raw webrequest partitions for 2014-12-07T18/1H not marked successful.Dec 8 2014, 2:52 PM

Happened again for:

2014-12-09T12/2H (on text)

Happened again for:

2014-12-15T15/2H (on text)

Happened again for:

2014-11-11T16/2H (on upload)

Happened again for:

2014-11-12T18/2H (on upload)

@mforns did some work to make our validation and alarms account for this. Resolving.

Restricted Application added a project: Analytics. · View Herald TranscriptJan 24 2017, 3:12 PM

Aklapper removed a project: Analytics.Jul 4 2020, 7:59 AM

Maintenance_bot removed a project: Patch-For-Review.Jul 4 2020, 8:11 AM

Make webrequest partition validation handle races between time and sequence numbersClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Make webrequest partition validation handle races between time and sequence numbers
Closed, ResolvedPublic
Actions

Related Objects
Search...