
Add sanitized User-Agent to default fields logged by EventLogging
Closed, Duplicate (Public)

Description

Logging sanitized user-agents allows us to diagnose browser-specific performance and usability issues. UAs have already been logged as part of http://meta.wikimedia.org/wiki/Schema:NavigationTiming and we added them to the instrumentation requirements for http://meta.wikimedia.org/wiki/Schema:Edit.

This proposal is to make UA a default field logged by EventLogging for all client-side events.


Version: unspecified
Severity: normal
See Also:
T58575: Browser and platform stats for logged-in vs. anon users for security and product support decisions

Details

Reference
bz52295

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:05 AM
bzimport set Reference to bz52295.
bzimport added a subscriber: Unknown Object (MLST).

swalling wrote:

Can you expand on what "sanitized" means for user agents?

Instead of logging the full, unparsed UA string we match it against a list of the N most popular browser/agents and log everything else as "other".

Or perhaps not "match against a list", and instead simply bucket them - top 10 (or 100) get left as-is, rest get bucketed by OS/browser with details removed. Marc-Andre has implemented this in labs, so ccing him.
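To make the two proposals above concrete, here is a minimal sketch of the bucketing variant: a UA string on an allow-list of the most common agents is kept as-is, and anything else is reduced to an OS/browser pair with version details removed. The function name `sanitize_user_agent`, the allow-list contents, and the regex tables are all assumptions made for illustration; this is not the Tool Labs implementation mentioned above.

```python
import re

# Hypothetical allow-list: in practice this would hold the N most common
# user-agent strings observed in the logs.
TOP_USER_AGENTS = frozenset({
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0",
})

# Ordered (pattern, label) pairs; the first match wins, so specific tokens
# must come before the generic ones they also contain (Chrome UAs also
# contain "Safari/", Android UAs also contain "Linux").
BROWSER_PATTERNS = [
    (re.compile(r"Opera|OPR/"), "Opera"),
    (re.compile(r"Chrome/"), "Chrome"),
    (re.compile(r"Firefox/"), "Firefox"),
    (re.compile(r"MSIE |Trident/"), "IE"),
    (re.compile(r"Safari/"), "Safari"),
]
OS_PATTERNS = [
    (re.compile(r"Android"), "Android"),
    (re.compile(r"iPhone|iPad|iPod"), "iOS"),
    (re.compile(r"Windows"), "Windows"),
    (re.compile(r"Mac OS X"), "Mac OS X"),
    (re.compile(r"Linux"), "Linux"),
]


def _first_match(patterns, ua):
    """Return the label of the first pattern found in ua, else "other"."""
    for pattern, label in patterns:
        if pattern.search(ua):
            return label
    return "other"


def sanitize_user_agent(ua):
    """Keep allow-listed UAs verbatim; bucket the rest as "OS/browser"."""
    if ua in TOP_USER_AGENTS:
        return ua
    return "{}/{}".format(
        _first_match(OS_PATTERNS, ua),
        _first_match(BROWSER_PATTERNS, ua),
    )
```

Under these assumptions, an unlisted desktop Chrome on Linux would be logged as `Linux/Chrome`, and an unrecognized client such as a command-line tool as `other/other`.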

Only the sanitization part has been implemented, given the smaller scope of usefulness for Tool Labs (where debugging against specific versions is less of an issue). I was planning to use DevCamp as an opportunity to hack at Bugzilla bugs; I'll whip up a PHP version of the sanitizing code then for inclusion in EventLogging.

That said, keeping a dynamic "top N" for bucketing may or may not be reasonable in terms of performance for something called that often; we'll have to see how that fares in practice.

Marc, any news on the PHP version?

We also have new use cases from VE (estimating how many new registered users have VE-capable browsers) and MultiMedia (see https://meta.wikimedia.org/wiki/Schema:MediaViewerPerf) that would benefit from this change.

Change 102817 had a related patch set uploaded by Ori.livneh:
Add user-agent header to the format spec of EventLogging's varnishncsa instance

https://gerrit.wikimedia.org/r/102817

(In reply to comment #7)

Change 102817 had a related patch set uploaded by Ori.livneh:
Add user-agent header to the format spec of EventLogging's varnishncsa
instance

https://gerrit.wikimedia.org/r/102817

Will this make it possible to get statistics of browsers used for logged-in requests as discussed in https://bugzilla.wikimedia.org/show_bug.cgi?id=56575 ?

(In reply to comment #8)

Will this make it possible to get statistics of browsers used for logged-in
requests as discussed in
https://bugzilla.wikimedia.org/show_bug.cgi?id=56575 ?

Bug 56575 is already possible (as the initial report here said, we already log the UA in specific cases). There's an in-progress patch at https://gerrit.wikimedia.org/r/#/c/93526/ .

This is about logging it by default, which is possible, but not required for bug 56575.

(In reply to comment #9)

This is about logging it by default, which is possible, but not required for
bug 56575.

Yes, sub-sampling would definitely be sufficient to get a representative sample. For the data I am primarily interested in it would have to be sub-sampling of all page views though, which is again very close to what is discussed in this bug.

Is there information about the authentication status in the custom varnishncsa logging? If so, then https://gerrit.wikimedia.org/r/102817 could be used directly to get browser market shares for anonymous vs. logged-in users.

(In reply to comment #7)

Change 102817 had a related patch set uploaded by Ori.livneh:
Add user-agent header to the format spec of EventLogging's varnishncsa
instance

https://gerrit.wikimedia.org/r/102817

Is there a list of said callers doing "user-agent logging and processing" somewhere, for curiosity, to track the progress on their standardisation (which is a very good thing to do) and to help define requirements?

(In reply to comment #10)

Yes, sub-sampling would definitely be sufficient to get a representative
sample.

In what cases is sub-sampling not sufficient? Can EventLogging default to sampling for UA unless otherwise requested by callers? Or maybe the bucketing mentioned above has the same results, is the implementation mentioned in comment 4 described somewhere and/or the place where logs go documented (the latter was asked by MZMcBride in https://gerrit.wikimedia.org/r/#/c/93526/ )?
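As a rough illustration of "default to sampling unless otherwise requested by callers", the sketch below attaches the UA to only a fraction of events unless the caller overrides the rate. The names `DEFAULT_UA_SAMPLE_RATE`, `should_log_user_agent`, `build_event` and the `userAgent` field are hypothetical and do not correspond to any existing EventLogging API.

```python
import random

# Hypothetical default: attach the UA to roughly 1 in 100 events.
DEFAULT_UA_SAMPLE_RATE = 0.01


def should_log_user_agent(sample_rate=None, rng=random.random):
    """Decide whether this event carries the UA.

    sample_rate=None means "use the default"; callers that need the UA on
    every event pass 1.0, and callers that never want it pass 0.0.
    """
    rate = DEFAULT_UA_SAMPLE_RATE if sample_rate is None else sample_rate
    return rng() < rate


def build_event(fields, user_agent, ua_sample_rate=None, rng=random.random):
    """Return a copy of fields, with the UA added only when sampled in."""
    event = dict(fields)
    if should_log_user_agent(ua_sample_rate, rng):
        event["userAgent"] = user_agent
    return event
```

Sampling of this kind is enough for aggregate browser shares, but not for schemas that need the UA on each individual event (for example, to debug a specific failed edit), which is why the rate is left overridable.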

Just for fun (everybody here knows already), panopticlick.eff.org gives my main browser's UA an entropy of 13.25 bits, making it the most identifying item of all; in the secondary browser (Chromium) it's 14.42 bits, the third worst after accept-language and plugins.
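For readers unfamiliar with these figures: assuming they are bits of surprisal, i.e. -log2(p) where p is the fraction of observed browsers sharing the same value, they convert to "one in N browsers" as sketched below. The helper names are made up for this example.

```python
import math


def surprisal_bits(fraction):
    """Bits of identifying information for a value shared by this fraction."""
    return -math.log2(fraction)


def one_in(bits):
    """Inverse view: the value is shared by about one in this many browsers."""
    return 2 ** bits
```

By this reading, 13.25 bits corresponds to roughly one browser in 9,700, and 14.42 bits to roughly one in 21,900.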

(In reply to comment #11)

Or maybe the bucketing
mentioned above has the same results, is the implementation mentioned in
comment 4 described somewhere and/or the place where logs go documented (the
latter was asked by MZMcBride in https://gerrit.wikimedia.org/r/#/c/93526/ )?

It's documented at https://wikitech.wikimedia.org/wiki/EventLogging#Data_storage; it might need an update. Basically, there are text logs (mainly used for debugging), Mongo (not sure if any analysts actually use this), and MySQL (commonly used by analysts).

You can see how the MySQL tables are generated at https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging.git/36cd7fbd9f763f369fb3d7ae503ef4c9133f99bf/server%2Feventlogging%2Fjrm.py

Change 102817 merged by Ottomata:
Add user-agent header to the format spec of EventLogging's varnishncsa instance

https://gerrit.wikimedia.org/r/102817

Nemo – I started adding some details on the sanitization logic (expanding on Nuria's draft) here: https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization

This is still a draft, we will add more information on the next steps (particularly on the bucketing, which hasn't been implemented yet).

(In reply to comment #14)

Nemo – I started adding some details on the sanitization logic (expanding on
Nuria's draft) here:
https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization

This is still a draft, we will add more information on the next steps
(particularly on the bucketing, which hasn't been implemented yet).

Thank you very much! Watchlisted, will look later.

(In reply to comment #8)

(In reply to comment #7)

Change 102817 had a related patch set uploaded by Ori.livneh:
Add user-agent header to the format spec of EventLogging's varnishncsa
instance

https://gerrit.wikimedia.org/r/102817

Will this make it possible to get statistics of browsers used for logged-in
requests as discussed in
https://bugzilla.wikimedia.org/show_bug.cgi?id=56575 ?

(Sorry everyone for not answering this comment any sooner.)

Gabriel, this change is intended only for EventLogging data, so (once fully implemented) you would hopefully be able to get some user agent data. But note that the data does not represent all requests to the site equally, only those for which EventLogging events are sent out.

(In reply to comment #10)

(In reply to comment #9)

This is about logging it by default, which is possible, but not required for
bug 56575.

Yes, sub-sampling would definitely be sufficient to get a representative
sample. For the data I am primarily interested in it would have to be
sub-sampling of all page views though, which is again very close to what is
discussed in this bug.

Is there information about the authentication status in the custom
varnishncsa
logging? If so, then https://gerrit.wikimedia.org/r/102817 could be used
directly to get browser market shares for anonymous vs. logged-in users.

There is no authentication info in varnishncsa at the time of logging. But with this change, once fully implemented, the UA will be logged for all events. Events themselves do have info about the logged-in status of the user.

(In reply to comment #14)

Nemo – I started adding some details on the sanitization logic (expanding on
Nuria's draft) here:
https://www.mediawiki.org/wiki/EventLogging/UserAgentAnonymization

This is still a draft, we will add more information on the next steps
(particularly on the bucketing, which hasn't been implemented yet).

As Marc pointed out above, "keeping a dynamic 'top N' for bucketing may or may not be reasonable in terms of performance for something called that often; we'll have to see how that fares in practice". A logging solution needs to be as light as possible, which means decoupled from any kind of storage lookups upon logging.
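One way to keep the logging path free of storage lookups, sketched here purely as an assumption about how this could be structured, is to compute the "top N" offline from a batch of past logs and hand the logger an immutable in-memory set, so each event costs a single hash lookup. `compute_top_n` and `UserAgentBucketer` are hypothetical names, not part of EventLogging.

```python
from collections import Counter


def compute_top_n(user_agents, n):
    """Offline step: derive the N most common UA strings from past logs."""
    return frozenset(ua for ua, _ in Counter(user_agents).most_common(n))


class UserAgentBucketer:
    """Hot-path helper holding a static snapshot of the top-N set.

    The snapshot is replaced by constructing a new instance (e.g. on a
    periodic reload); nothing is read from storage while logging.
    """

    def __init__(self, top_user_agents):
        self._top = frozenset(top_user_agents)

    def bucket(self, ua):
        # Constant-time set membership test; no I/O on the logging path.
        return ua if ua in self._top else "other"
```

The trade-off is staleness: a newly popular browser stays in "other" until the next offline refresh, which may be acceptable if the list is recomputed regularly.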

[Referenced Gerrit patch has been merged; resetting status]

[moving from MediaWiki extensions to Analytics product - see bug 61946]

From [[mail:analytics]]: "We also finished a numbers of unplanned tasks: [...] User Agent discussions (EventLogging)". Any published notes/recap?

We have hired a hard-working Product Manager for Analytics who is getting up to speed on the issues around User Agents and privacy. He will publish documentation once he's had time to catch up.

This happened a while ago, except the "sanitized" part (I think). Can this task be closed, or should it be refocused on the sanitization part?