The timestamp reported by varnish is taken when the request arrives.
The sequence number reported by varnish is taken when the response is
sent.
So when requests Foo and Bar arrive at the same cache, and Foo arrives
slightly before Bar, but Bar gets its response sooner, the requests may
be logged as:
+--------------+----------+-----------------+
| request name | Time     | Sequence number |
+--------------+----------+-----------------+
| ...          | 09:59:59 | 4708            |
| ...          | 09:59:59 | 4709            |
| Foo          | 09:59:59 | 4711            |
| Bar          | 10:00:00 | 4710            |
| ...          | 10:00:00 | 4712            |
| ...          | 10:00:00 | 4713            |
+--------------+----------+-----------------+
Since we partition by timestamp, the partition for the 9th hour would
look like:
+--------------+----------+-----------------+
| request name | Time     | Sequence number |
+--------------+----------+-----------------+
| ...          | 09:59:59 | 4708            |
| ...          | 09:59:59 | 4709            |
| Foo          | 09:59:59 | 4711            |
+--------------+----------+-----------------+
(hence appearing to miss sequence number 4710), and the partition for
the 10th hour would look like:
+--------------+----------+-----------------+
| request name | Time     | Sequence number |
+--------------+----------+-----------------+
| Bar          | 10:00:00 | 4710            |
| ...          | 10:00:00 | 4712            |
| ...          | 10:00:00 | 4713            |
+--------------+----------+-----------------+
(hence appearing to miss sequence number 4711).
So each partition, looked at in isolation, appears to be missing lines,
and our per-partition monitoring flags them both as faulty.
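
To make the false alarm concrete, here is a minimal sketch of a naive
per-partition gap check. It is not our actual monitoring code; the
function name missing_sequence_numbers and the list representation of a
partition are assumptions made for illustration, and the sequence
numbers are the ones from the tables above.

```python
def missing_sequence_numbers(seqs):
    """Return the sequence numbers absent between min(seqs) and max(seqs)."""
    present = set(seqs)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1)
            if n not in present]


# Sequence numbers per hourly partition, taken from the example above.
hour_09 = [4708, 4709, 4711]
hour_10 = [4710, 4712, 4713]

# Each partition on its own appears to have a hole.
print(missing_sequence_numbers(hour_09))  # [4710]
print(missing_sequence_numbers(hour_10))  # [4711]
```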
But when looking at both partitions combined, no line is actually
missing, and the monitoring could flag them as ok.
In the past 2 weeks, we had two such occasions (one for bits, one for
upload).
The manual fix is simple: Generate the _SUCCESS file by hand for both
partitions.
Let's improve our monitoring to be aware of such races and check
for them automatically.
(One way would be that if the naive validation fails, Oozie's <error
to="..."> would no longer go to "kill", but to a follow-up step that
checks specifically for this race.)
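
A rough sketch of what such a follow-up check could compute, assuming
it can read the sequence numbers of the neighbouring partitions for the
same cache host. The function name unexplained_gaps and the in-memory
lists are hypothetical; a real step would more likely run as a query
against the partitions.

```python
def unexplained_gaps(partition, neighbours):
    """Return gaps in `partition` that no neighbouring partition fills.

    `partition` and each entry of `neighbours` are iterables of sequence
    numbers from a single cache host (an assumption of this sketch).
    """
    present = set(partition)
    if not present:
        return []
    gaps = set(range(min(present), max(present) + 1)) - present
    for other in neighbours:
        # A number logged in an adjacent partition is not really missing.
        gaps -= set(other)
    return sorted(gaps)


hour_09 = [4708, 4709, 4711]
hour_10 = [4710, 4712, 4713]

# The boundary race from the example: both partitions turn out fine.
print(unexplained_gaps(hour_09, [hour_10]))  # []
print(unexplained_gaps(hour_10, [hour_09]))  # []

# A genuinely lost line (4709) is still reported.
print(unexplained_gaps([4708, 4711], [hour_10]))  # [4709]
```

With a check like this, a partition would only be marked faulty if the
returned list is non-empty; otherwise the _SUCCESS file could be
generated automatically.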
Version: unspecified
Severity: normal