
Add sampling support in EventLogging
Closed, Declined. Public

Description

Different teams have implemented ad-hoc solutions to introduce sampling in EventLogging in order to perform measurements of the usage of features where a sample provides sufficient data to answer a research question.

In some cases, sampling needs to be applied to all events (so that, for example, only 1 out of 1000 events is logged). In other cases, unique clients need to be sampled by setting a session token so that only data for clients included in the sample is collected.
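The second case (sampling unique clients by session token) could be sketched roughly as follows. This is only an illustration: `hashToken` and `isClientInSample` are hypothetical names, not part of EventLogging, and the token is assumed to be an arbitrary string that stays stable for the session. Hashing the token keeps the decision deterministic, so a given client is either always in the sample or never in it.

```javascript
// Hypothetical helpers for illustration; not the EventLogging API.

// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function hashToken(token) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Returns true for roughly 1 out of `ratio` distinct tokens.
// The same token always yields the same answer.
function isClientInSample(token, ratio) {
  if (!Number.isInteger(ratio) || ratio < 1) {
    throw new RangeError('ratio must be an integer >= 1');
  }
  return hashToken(token) % ratio === 0;
}

// Usage: decide once per session whether this client's events are collected.
console.log(isClientInSample('example-session-token', 1000));
```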

This pattern is sufficiently common to justify the creation of a general purpose solution to the problem (the most recent request for sampled data is [1]). The desired sampling method and rate could be specified via a dedicated element of a JSON schema; by default no sampling would be applied.

[1] http://lists.wikimedia.org/pipermail/analytics/2014-May/002053.html


Version: unspecified
Severity: normal

Details

Reference
bz65500

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:24 AM
bzimport set Reference to bz65500.
bzimport added a subscriber: Unknown Object (MLST).

Just a note about wording as terminology in this bug is confusing. We are mixing sampling ratio (1:100) with, let's say, a 'statistical sample' (a set of users/requests with a different treatment from the majority).

In some cases, sampling needs to be applied to all events (so that, for example, only 1 out of 1000 events is logged)

Even if we keep track of the sampling ratio in the schema (up for discussion), it will likely just be an informative number in the short term. It likely will not be used to decide whether an event needs to be created and logged, as doing those types of checks every time an event is generated could turn out to be a performance bottleneck (this depends on caching policies and bootstrapping of schemas).

In other cases, unique clients need to be sampled
by setting a session token so that only data for
clients included in the sample is collected.

I do not think we want EL in any way to keep track of users or sessions to decide whether data needs to be logged. EL is a light system to keep track of events and as such it is agnostic to the events being logged. I do not see us doing any modifications in this regard to EL clients in the near future.

Adding comments posted by ori on e-mail thread:

"to do this [sampling] in the schema itself confuses the structure of the data with the mechanics of its use. I think having a couple of helpers in JavaScript and PHP for simple random sampling is sufficient."

Given how complicated it is to deal with schema changes (need to update schema version in PHP/JS code, probably backport that to deployment, add a UNION ALL to any query using the schema), I agree putting the configuration for this in the schema would be a really bad idea.

On the other hand, putting the sampling ratio itself in the schema would be helpful. It is part of the data (or at the very least part of the context needed to interpret the data) and the logs should be self-contained in that regard. If I need to look up EL data from three months ago for some reason, I really, really don't want to go through the git logs for operations/mediawiki-config to figure out what sampling ratio was in use at the time.

So I think samplingRatio should be one of the core fields like wiki or user agent, should default to 1, and EL should offer a way (such as an extra parameter to log()) to change it for a given log event. (And once we are there, it could also make the dice roll to log or not, as a convenience. I.e. something like mw.eventLog.logWithSampling( 'Foo', 100, data ) would log data 1% of the time, with the samplingRatio field set to 100, and do nothing the rest of the time.)
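A minimal sketch of what such a helper might look like, written as a standalone function rather than as the real mw.eventLog method (which did not exist at the time of this comment). The `logFn` parameter is an assumption standing in for the actual logging call, and `random` is injectable only to make the dice roll testable.

```javascript
// Hypothetical helper for illustration; not the actual EventLogging API.
// `logFn( schema, event )` stands in for the real logging call.
function logWithSampling(schema, samplingRatio, data, logFn, random = Math.random) {
  if (!Number.isInteger(samplingRatio) || samplingRatio < 1) {
    throw new RangeError('samplingRatio must be an integer >= 1');
  }
  // Roll the dice: keep the event with probability 1 / samplingRatio.
  if (random() * samplingRatio >= 1) {
    return false;
  }
  // Store the ratio with the event so the logged data is self-contained.
  logFn(schema, Object.assign({}, data, { samplingRatio: samplingRatio }));
  return true;
}

// Usage: log roughly 1 in 100 'Foo' events, each tagged with samplingRatio 100.
logWithSampling('Foo', 100, { action: 'click' }, (schema, event) => {
  console.log(schema, event);
});
```

With samplingRatio left at 1, every event is logged, which matches the proposed default.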

mw.eventLog.logWithSampling( 'Foo', 100, data )

I certainly agree that having a method that eases logging with sampling on the js end is a must.

If I need to look up EL data from three months ago for some reason, I really, really don't want to go through the git logs for operations/mediawiki-config to figure out what sampling ratio was in use at the time.

I understand the problem with changes on sampling if you are counting "absolute numbers" like "number of users who click here".
Now, are those truly useful numbers? (Just a thought: by definition they have no context.) Without knowing a bunch of context (including the sampling ratio) you cannot interpret an absolute count value. For example, if your absolute count of clicks increases because you changed your UI significantly, how are you keeping track of that event?

.... Ideally we would have annotations in graphs as a method of keeping track of these issues.... but yes, this is something to think about.

I lean towards that

In my experience with MediaViewer, we rarely cared about absolute numbers (there were cases when we did, but only a few), but we cared a lot about relative differences between various events (do the users click this or that button more?) and the same event at different points in time (did the number of clicks increase since deploying the new layout?).

So my point still stands: I often need to compare different numbers which don't necessarily have the same sampling ratio, and if the ratio is not available from the database, things become really uncomfortable. (Just consider a simple thing like showing a graph of the number of clicks in the last month, if we have to change the sampling rate halfway because the servers are overloaded, and there is no way to correct the counts inside the SQL query.)
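To illustrate the correction, here is a rough sketch assuming each logged row carries its own samplingRatio (the field names `day` and `samplingRatio` are assumptions for the example). An event logged at ratio r stands in for roughly r real events, so summing the ratios instead of counting rows gives comparable estimates across a rate change. In SQL the same idea would amount to summing the ratio column rather than counting rows.

```javascript
// Hypothetical row shape: { day: string, samplingRatio?: number }.
// Rows without a samplingRatio are treated as unsampled (ratio 1).
function estimateDailyCounts(events) {
  const totals = new Map();
  for (const { day, samplingRatio = 1 } of events) {
    // Each sampled row represents about `samplingRatio` real events.
    totals.set(day, (totals.get(day) || 0) + samplingRatio);
  }
  return totals;
}

// Usage: the sampling rate changes from 1:100 to 1:1000 between the two days.
const exampleRows = [
  { day: '2014-05-01', samplingRatio: 100 },
  { day: '2014-05-01', samplingRatio: 100 },
  { day: '2014-05-02', samplingRatio: 1000 }
];
console.log(estimateDailyCounts(exampleRows));
```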

Hi @Nuria, I just want to bump this thread as the issue has come up enough times in the last few months that I asked the team for solutions and they pointed me to this. The web team has had different sampling in beta and stable for some time now and is looking to do different sampling for the Russian vs. Hungarian wikis for a Hovercards A/B test. While the absolute numbers are not as important, the relative numbers are. There have been numerous instances where tracking the date and parameters of sampling has been an issue. I am open to other solutions, but wanted to hear your thoughts on whether this is feasible or whether you have any other ideas that might address the need here.

@JKatzWMF :

The EventLogging client has added client-side sampling abilities. Thus far, though, the only way to keep track of the sampling ratio is to send it with the schema (suboptimal, I know, but that is the state of affairs).

See addition of inSample method: https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/8b3cb1b6557e64adb3a071764a8e1697e1bb0204

Main point about counting "absolute" counts still stands, though. Those are hardly ever useful.

@Nuria Thanks! I wasn't aware of that possibility, but it makes sense and seems like a decent fit for now, at least from a querier's perspective.