
Duplicate entries with missing referers in webrequest logs.
Closed, Invalid (Public)

Description

So, I've been noodling around in the request logs recently and I've seen a lot of rows that have null entries for some columns. Not too big a deal with most of em - some are things like referrer, or user language, where I can see it not being provided by the client.

Today, though, I encountered requests without a MIME type. Not even requests for weird things - I'm talking about pages like the enwiki main page, or the article on India. I've traced the examples I pulled out back to Hive itself and confirmed that the fields are blank there, too (happy to provide em in private to anyone investigating this).

I'm kinda confused about what's going on. It shouldn't really be possible to receive a request and return a response without that data, and the invalid requests are coming from both Android and iPhone devices. An investigation upstream (for example, checking whether they're null in the varnish memstore, too) would be most welcome.


Version: unspecified
Severity: normal

Details

Reference
bz61063

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 2:51 AM)
bzimport set Reference to bz61063.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1439

Further investigation:

I went through some of the request logs manually and found duplicate requests, about 9-10ms apart, the latter of which had the MIME type and referer stripped. This could be the source of both the MIME type data loss and the referer data loss we've seen with Special:BannerRandom hits. Matt Walker theorises that the problem may be that we're consuming request log data from multiple layers of varnish machines, and thus getting the same requests multiple times. I'm going to yank out the hostnames for the weird hinky hits I've noticed to see whether that holds up.
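
For anyone who wants to repeat the manual check, here is a minimal sketch of the pairing logic in Python. It is not the query that was actually run. The field names (`ip`, `uri_path`, `dt`, `referer`, `content_type`), the 15ms gap, and the treatment of `-` as an empty value are all assumptions for illustration; adjust them to whatever the real log schema uses.

```python
from datetime import timedelta

# Hypothetical field names; the real webrequest schema may differ.
CHECKED_FIELDS = ("referer", "content_type")


def is_missing(value):
    """Treat None, empty strings and a bare '-' as 'no value' (an assumption)."""
    return value in (None, "", "-")


def find_stripped_duplicates(rows, max_gap=timedelta(milliseconds=15)):
    """Find requests repeated within max_gap where the repeat lost a field.

    rows is an iterable of dicts with 'ip', 'uri_path', 'dt' (a datetime)
    and the fields named in CHECKED_FIELDS. Returns a list of
    (first, second, lost_fields) tuples.
    """
    def identity(row):
        return (row["ip"], row["uri_path"])

    # Sort so that repeats of the same request sit next to each other in time order.
    ordered = sorted(rows, key=lambda r: (identity(r), r["dt"]))

    pairs = []
    for first, second in zip(ordered, ordered[1:]):
        if identity(first) != identity(second):
            continue
        if second["dt"] - first["dt"] > max_gap:
            continue
        # Only flag fields that were present on the first hit and gone on the repeat.
        lost = [
            field for field in CHECKED_FIELDS
            if not is_missing(first.get(field)) and is_missing(second.get(field))
        ]
        if lost:
            pairs.append((first, second, lost))
    return pairs
```

Adding a hostname field to each row and printing it for both halves of every pair would be one way to test the multiple-varnish-layers theory.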

(In reply to comment #0)

> Today, though, I encountered requests without a MIME type.

Requests without a MIME type are fine in many settings.
We're seeing many of them.

Since you seem to be able to reproduce this, could you provide a short snippet
that lets us produce such a log line?

(I am not asking for the log line itself, but for some chain of actions that
lets us see a log line that you are concerned about.)

Alright, you want to hunt for:

* hits to uri_path /wiki/File:Thailand_Surin_locator_map.svg
* between 2014-01-20T10:14:00 and 2014-01-20T10:15:00

(hopefully that's anonymised enough)
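
If it helps, the hunt above could be sketched like this in Python, assuming the log rows have already been parsed into dicts. The path and the time window come from the two bullets above; the field names `uri_path` and `dt` are assumptions rather than the confirmed schema.

```python
from datetime import datetime

# Path and window as given in this comment; field names below are assumed.
TARGET_PATH = "/wiki/File:Thailand_Surin_locator_map.svg"
WINDOW_START = datetime(2014, 1, 20, 10, 14, 0)
WINDOW_END = datetime(2014, 1, 20, 10, 15, 0)


def hits_in_window(rows, uri_path=TARGET_PATH, start=WINDOW_START, end=WINDOW_END):
    """Return rows for uri_path with start <= dt <= end, oldest first."""
    matches = [
        row for row in rows
        if row["uri_path"] == uri_path and start <= row["dt"] <= end
    ]
    return sorted(matches, key=lambda row: row["dt"])
```

Any two rows that come back a few milliseconds apart are the ones worth comparing field by field.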

From that particular example, it looks like the (intact) request was a MISS from the varnish cache's point of view, which explains the immediate repeat of the request. Whether it's also responsible for the lack of referrer data is too network engineer-y for me to know - but it is a potential limitation if we want to use MIME type filtering for, say, pageviews. The good news is that, assuming my data sample is representative (and it's probably off, since it's 128k mobile views from a specific date), this only happens about 0.03 percent of the time.

*Blinks* actually, looking at that example, the MIME type is intact, it's the referrer that's vanished. My brain is...clearly not on today.

Closing for now, since there doesn't seem to be any easy link to identify why the referers and such are missing. Blah. I need to do a lot more work before BZing things, I think.