
Lines with “nan” for “Request service time” column and empty HTTP status code in cache logs
Closed, Resolved · Public

Description

Since 2013-12-10 we're seeing 15-40 lines per day in the
sampled-1000 cache logs that have “nan” in the “Request service time”
column [1].
For the mobile-sampled-100 cache logs it is 10-30 such lines per day [2].
The zero logs do not show any affected lines so far.
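
A rough per-day count can be reproduced with a pipeline along these lines (only a sketch: instead of addressing the “Request service time” column by position, it simply looks for a tab-delimited literal “nan” field and assumes no other column legitimately contains one):

zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20131210.gz | grep -cP '\tnan\t'   # count lines with a bare nan field

on stat1002.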

The affected lines come with “-/” for the “Request status / HTTP
status code” column, and “Reply size” is “-”.

[1] The first occurrence for the sampled-1000 stream can be obtained by
running

zgrep 49094481570 /a/squid/archive/sampled/sampled-1000.tsv.log-20131210.gz

on stat1002. That line is for cp1050.eqiad.wmnet on
2013-12-10T03:36:04.

[2] The first occurrence for the mobile-sampled-100 stream can be
obtained by running

zgrep 9650126710 /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20131210.gz

on stat1002. That line is for cp1047.eqiad.wmnet on
2013-12-10T02:01:24.


Version: unspecified
Severity: normal

Details

Reference
bz59645

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:17 AM
bzimport set Reference to bz59645.
bzimport added a subscriber: Unknown Object (MLST).

The host names of the affected lines range over all hosts in
upload-eqiad, upload-esams, and mobile-eqiad, but no other cache.
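
Such a distribution can be checked with something along these lines (a sketch only; it assumes the cache host name is the first field of /home/qchris/nan.tsv, analogous to the cut -f 8 command further below):

cut -f 1 /home/qchris/nan.tsv | sort | uniq -c | sort -n   # field 1 assumed to hold the cache host name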

The request methods of the affected lines look mangled. HEAD and GET
account for ~55% of the affected lines. For the other 45% of the
affected lines, the logged request method looks like part of the HTTP
header [1].

For about 8% of the affected lines, the URL column points to the
mobile version of jawiki and ptwiki. None of those 8% show a
User-Agent, Referer, or Accept-Language.

For >80% of the affected lines, the URL column is upload.wikimedia.org.

For the remaining ~10% of affected lines, the URL column does not
make much sense. They are, for example, “http://-en-US,en;q=0.8”, so
here too header information spilled over into this column.

The requested URLs are not distributed all over the place, but rather
concentrated on a few values. For example, the requested URL for ~33%
of the affected lines was:

http://upload.wikimedia.orgHTTP/1.1

For ~12% it was:

http://upload.wikimedia.org/wikipedia/en/c/ca/2M_TV_(logo).gif""
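
The URL distribution can be inspected the same way (again only a sketch; it assumes the URL sits in the field right after the request method, i.e. field 9 of /home/qchris/nan.tsv):

cut -f 9 /home/qchris/nan.tsv | sort | uniq -c | sort -rn   # field 9 assumed to hold the URL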

And those requests are not clustered around a certain date/time;
they span the whole range between 2013-12-10 and now.

The referrer is “-” for ~60% of affected lines, and for ~32% it is
some URL from mobile enwiki.

The logged User-Agents and Accept-Languages look sane.

[1] Here are some examples:

Accept-Language:
x-Mercedes-Benz_Three_Point_Star.jpg
exagonal_Icon.svg.png
.svg.png
Referer:
HTTP/1.1

To see more of them, run

cut -f 8 /home/qchris/nan.tsv | sort | uniq -c | sort -n

on stat1002.

Hi Christian, Erik -- what's the impact of this? Do we need to spend some time working with ops to track it down?

thanks,

-Toby

Per Andrew/Stefan -- this ends up causing problems with Hive.

Magnus has created a patch; we hope to get this deployed this week.

(In reply to comment #2)

what's the impact of this?

It's not super critical.
But since it seems to be more than just the nans, more lines than the
initial few may be affected.

Do we need to spend some time working with ops to track it down?

Ops already know and had a first look :-)

It seems only the nans started on 2013-12-10. Some bogus requests were
there before, but it seems they have been treated differently.
Not sure yet.

(In reply to comment #3)

Per Andrew/Stefan -- this ends up causing problems with Hive.

Magnus has created a patch; we hope to get this deployed this week.

Yes, but that'll only address the issue on the varnishkafka side.

varnishncsa (hence udp2log, webstatscollector, tsvs) does not benefit
from this fix.

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1391