
Analytics: Can we start quoting our logging fields?
Closed, Resolved (Public)

Description

I'm sat here looking at a 6MB user agent field. It's not /actually/ a 6MB user agent field; it's a user agent field where some browser designer decided "let's put tabs in our UA, that won't cause anyone any problems!" and so, of course, the tab-separated files we store our logs in happily passed those tabs through unescaped, meaning that when the TSV was read in, the field overflowed.

In the absence of hunting down the people who made that decision at the browser end and forcing them to use the internet through an early and experimental IE version for all of time, could we start quoting the fields in the request logs? I'm not sure how Erik Z reads his files in, but if it's tab-sensitive we're potentially looking at a data loss issue with wikistats. If it's not, we're looking at a data loss issue with my work. Either is to be avoided ;p.
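To make the failure mode concrete, here is a minimal sketch (the three-column layout and values are invented for illustration, not the actual request-log format) of how a raw tab inside a user agent shifts columns in an unquoted TSV, and how CSV-style quoting would keep the record to three fields:

```python
# Illustration only: how a raw tab inside a user-agent value breaks a
# plain tab-separated record, and how CSV-style quoting avoids it.
import csv
import io

ua_with_tab = "Mozilla/5.0 (SomeBrowser;\tweird build)"
record = ["2014-01-01T00:00:00", "/wiki/Main_Page", ua_with_tab]

# Naive TSV: the embedded tab yields four fields instead of three.
naive_line = "\t".join(record)
print(len(naive_line.split("\t")))  # -> 4

# Quoted TSV: csv wraps the offending field in quotes, so it parses
# back into exactly three fields.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL).writerow(record)
parsed = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(len(parsed))  # -> 3
```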

Obviously VK will solve for this once it's dealing with the whole firehose.


Version: unspecified
Severity: normal

Details

Reference
bz60184

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 3:03 AM)
bzimport set Reference to bz60184.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

> I'm sat here looking at a 6MB user agent field.

Interesting.
I was under the impression that requests >8K get truncated.
That's obviously wrong then :-)

Where can I find this user agent field?

> I'm not sure how Erik Z reads his files in, but if it's tab-sensitive we're
> potentially looking at a data loss issue with wikistats.

Although files may come in with a wrong number of columns, it's actually
only a minor problem. For example, in December 2013 only about 0.0028%
of the rows in the sampled-1000 stream had a wrong column count. In January
2014 it is 0.0029% so far.
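(For the curious, a rough sketch of the kind of check those percentages imply, assuming a fixed expected column count; the constant and the file handling below are placeholders rather than the actual wikistats tooling:)

```python
# Count lines whose tab-split field count differs from the expected column count.
import sys

EXPECTED_FIELDS = 14  # placeholder, not the stream's real column count

total = bad = 0
with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
    for line in f:
        total += 1
        if len(line.rstrip("\n").split("\t")) != EXPECTED_FIELDS:
            bad += 1

print(f"{bad}/{total} rows ({100.0 * bad / max(total, 1):.4f}%) have a wrong column count")
```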

Adding escaping to the files would require changes throughout all of our
infrastructure (e.g. Wikipedia Zero), which I'd prefer we avoid.

To put that 0.0029% into perspective: udp2log dropped 0.4% of the
packets in December, and compared with historical values that is an
exceptionally low packet drop rate:
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm

> Obviously VK will solve for this once it's dealing with the whole firehose.

VK being varnishkafka?
If so ... yes, I'd say waiting for Hadoop with the new JSON data
structures would be a good solution :-)
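For what it's worth, a tiny sketch of why JSON-encoded records sidestep the tab problem entirely; the field name below is made up for illustration, not the actual varnishkafka schema:

```python
# JSON escapes the embedded tab as \t, so one record stays one field.
import json

ua_with_tab = "Mozilla/5.0 (SomeBrowser;\tweird build)"
line = json.dumps({"user_agent": ua_with_tab})
assert "\t" not in line                               # no raw tab in the serialized record
assert json.loads(line)["user_agent"] == ua_with_tab  # and it round-trips intact
print(line)
```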

bingle-admin wrote:

Prioritization and scheduling of this bug are tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1395

Ironholds claimed this task.

This will be resolved by switching to Hadoop; done.