Mobile and zero graphs lack api PageViews
Closed, Declined, Public

Description

Although api requests may constitute page views, we do not count them in
our software (e.g.: Zero dashboards), and we hence underreport.

Running for example

zgrep 1965912739 /a/squid/archive/sampled/sampled-1000.tsv.log-20130928.gz
zgrep 1935427111 /a/squid/archive/sampled/sampled-1000.tsv.log-20130928.gz

on stats1002 gives two log lines that meet our current definition of page
view, but they do not get counted.

We could for example fix our software, or update the definition of page view.


Version: unspecified
Severity: normal

Details

Reference
bz54782

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:27 AM
bzimport set Reference to bz54782.
bzimport added a subscriber: Unknown Object (MLST).

Let's fix the software and not the pageview definition ;)

But the real problem is of course that API calls are not getting tagged with the X-CS header; that should be fixed.

X-CS (and, by extension, X-Analytics from Varnish) is *not* meant to be used for analytics purposes. It is used for tagging requests so that separate responses can be created (and cached independently).

There is no reason to do carrier tagging on API requests (or other requests, for that matter): there is no such need, and doing so on the hot path of requests is bad for performance.

If you want to track these from an analytics standpoint, you should post-process the logs, similar to how you do GeoIP lookups. Zero gives us a JSON file with carrier prefixes via a meta.wm.org URL, so it should be easy to adapt your analytics code.
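A minimal sketch of that post-processing idea, assuming the carrier data can be reduced to a mapping from carrier ID to a list of CIDR prefixes. The mapping shape, the carrier names and the prefixes below are hypothetical placeholders, not the actual layout of the JSON published on meta.

```python
import ipaddress

# Hypothetical carrier -> prefixes mapping (documentation address ranges).
# The real data would be loaded from the JSON file mentioned above.
CARRIER_PREFIXES = {
    "carrier-a": ["192.0.2.0/24"],
    "carrier-b": ["198.51.100.0/25", "2001:db8::/32"],
}


def build_index(prefixes_by_carrier):
    """Turn the mapping into a list of (network, carrier), longest prefix first."""
    index = []
    for carrier, prefixes in prefixes_by_carrier.items():
        for prefix in prefixes:
            index.append((ipaddress.ip_network(prefix), carrier))
    index.sort(key=lambda item: item[0].prefixlen, reverse=True)
    return index


def carrier_for_ip(index, ip_string):
    """Return the carrier whose prefix contains the address, or None."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        return None  # malformed client IP in the log line
    for network, carrier in index:
        if network.version == ip.version and ip in network:
            return carrier
    return None


INDEX = build_index(CARRIER_PREFIXES)
```

A linear scan like this is only meant to show the lookup; a log pipeline handling many prefixes would likely want a trie or a sorted-interval structure instead.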

X-Analytics, as its name implies, was introduced for real-time analytics IIRC, and is never used for anything Zero related. Are you saying it was a mistake to add it and to build a UDP-based real-time analytics system? (note - this is not sarcasm, I'm really trying to understand your position with regards to analyzing our traffic)

I don't understand the question or what UDP-based has to do with anything. It's incorrect that we don't use X-Analytics for anything Zero related -- Varnish stores the carrier (X-CS) into the "zero" X-Analytics key.
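For readers on the log-processing side, here is a small sketch of pulling the "zero" key back out of a logged X-Analytics value. It assumes the value is a semicolon-separated list of key=value pairs; the second key in the usage example is made up for illustration.

```python
def parse_x_analytics(value):
    """Parse 'k1=v1;k2=v2' into a dict, skipping empty or malformed parts."""
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        key, _, val = part.partition("=")
        fields[key.strip()] = val.strip()
    return fields


def zero_carrier(value):
    """Return the carrier stored under the 'zero' key, or None if absent."""
    return parse_x_analytics(value).get("zero")
```

For example, `zero_carrier("zero=639-02;other=1")` yields `"639-02"`, while a line without the key (or with a bare `-`) yields `None`.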

I'm saying that doing this is fine for mobile/zero traffic (mobile Varnish cluster), since we do carrier detection in that request path anyway, for the carrier banners, and copying a header into another header is not a big effort.

Doing carrier detection in realtime for other kinds of traffic (e.g. upload), though, is costly and there's no benefit compared to doing it on the log processing pipeline, as far as I'm aware.

I agree with Faidon.
Why should we put unnecessary burden on the performance critical
part of the infrastructure?

Especially, since it would not buy us anything:
As written in the bug's summary, the problem affects both /mobile and/ zero.

So even if we had a solution for zero, we'd still need a solution for mobile.
As they are closely connected, it did not make sense to me to separate them.
Please go ahead and split the bug if we strive for separate solutions for
zero and mobile.

So, Diederik responded to my request in #54779 and did an initial specification draft for X-Analytics (thanks!).

While reading it, I found another potential source of divergence between our interpretations of the "zero" field. The page mentions "this marker is only set for requests from the mobile varnishes (so no api, bits, upload, ...) and only if also the domains is free for the carrier.". The first part is what we've flagged in this bug; the second part of that sentence is equally important, though.

Part of the netmapper work is to remove as much logic as possible from Varnish; this includes removing all domain/language handling for carriers, so that Varnish only does IP matching and tags the request with X-CS (which we subsequently copy to X-Analytics).

This hasn't been completed yet for old carriers, but as of yesterday with Yuri's I176b6c, all future carriers will get automatically tagged with X-CS (= X-Analytics) *irrespective* of whether that particular domain/subdomain is free.

So the implementation differs already from the spec and this was a deliberate change to the implementation that we've been planning for months: moving in the direction of removing the business logic from Varnish and simplifying our configs.

This really needs more coordination because it means that all the carrier dashboards will overcount pageview requests and we have had no proper warning or time to change our scripts. https://gerrit.wikimedia.org/r/#/c/86708/ is the changeset.

Note that Yuri asked for an expedited merge a few minutes after submission due to a new carrier signup. As it's likely that it affects a carrier already, we can't revert now. The numbers will be off just for that new carrier(s), though.

Faidon, thank you for pointing it out, but we have already discussed this with analytics - and we agreed that I will tell them when we switch to "tag all traffic & use zero configs to determine what is actually whitelisted" mode. Plus, I have already given them some Python code to pull config JSON blobs from meta to use for such analysis.

As for yesterday's patch: according to Dan, all new carriers will be ALL languages for both M & Zero, which means no change in stats. In case a carrier with more limitations comes onboard, I was planning to still add them to the Varnish config until ESI gets implemented completely, X-CS variance gets turned off, and analytics starts doing extra analysis.
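To illustrate the "tag all traffic & use zero configs to determine what is actually whitelisted" mode, here is a rough sketch of the check analytics would have to do on its side. The config shape (a list of free sites plus an optional language list, where an empty list means all languages) and the carrier IDs are assumptions made for this example; the real config blobs on meta may look different.

```python
# Hypothetical per-carrier configs; not the real Zero config schema.
CARRIER_CONFIGS = {
    "carrier-a": {"sites": ["m", "zero"], "languages": []},      # all languages
    "carrier-b": {"sites": ["zero"], "languages": ["en", "fr"]},
}


def is_free(carrier_id, host, configs):
    """Return True if <lang>.<site>.wikipedia.org is whitelisted for the carrier."""
    config = configs.get(carrier_id)
    if config is None:
        return False
    labels = host.lower().split(".")
    # Only hosts of the form <lang>.<site>.wikipedia.org are considered here.
    if len(labels) != 4 or labels[2:] != ["wikipedia", "org"]:
        return False
    lang, site = labels[0], labels[1]
    if site not in config["sites"]:
        return False
    return not config["languages"] or lang in config["languages"]
```

With such a check in the pipeline, a request tagged with a carrier but served from a non-whitelisted domain could be dropped from that carrier's dashboard instead of being overcounted.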

but as of yesterday with Yuri's I176b6c, all future carriers will get automatically tagged with X-CS (= X-Analytics) *irrespective* of whether that particular domain/subdomain is free.

I might be wrong here, but the way I read
https://gerrit.wikimedia.org/r/#/c/86708/3/templates/varnish/zero.inc.vcl.erb
is that X-CS2 (appended “2”) gets set unconditionally. X-CS (no “2” at
the end) only gets set if the domain matches.
(Note that 639-02, 297-01, 426-04 whitelist both).

As X-Analytics gets set from X-CS (no “2” at the end), the description on

https://wikitech.wikimedia.org/wiki/X-Analytics

should still be accurate.

(In reply to comment #9)

because it means that all the carrier
dashboards will overcount pageview requests [...]

No.

Your kraken code boils down to checking hasValidSubDomain [1] which
checks whether or not the domain is free.

[1] http://git.wikimedia.org/blob/analytics%2Fkraken.git/05685d1ed816f63b7b82a4ae5d5f7842f6ab0540/kraken-pig%2Fsrc%2Fmain%2Fjava%2Forg%2Fwikimedia%2Fanalytics%2Fkraken%2Fpig%2FZeroFilterFunc.java#L235

(In reply to comment #11)

Faidon, thank you for pointing it out,

Yes. Thanks Faidon for pointing that out.
This is indeed much appreciated.

but we have already discussed this with analytics - and we agreed
that I will tell them when we switch to "tag all traffic & use zero
configs to determine what is actually whitelisted" mode.

That's true. We're tracking that at:

https://mingle.corp.wikimedia.org/projects/analytics/cards/1193

I added tasks to make sure the documentation gets updated if that
change goes live.

(In reply to comment #12)

If I'm not mistaken, API calls invoked from a page served from a mobile web subdomain, such as en.m.wikipedia.org, usually hit the API endpoint on the same domain, that is, in this example, en.m.wikipedia.org. So if the X-CS header is present, I believe that means the X-Analytics field will contain a zero=<X-CS> value and the entry will be in the zero.tsv.log-yyyymmdd file.

The stuff in zero.tsv.log-* thus has API hits, so I think it's a matter of the Pageview.isWebstatscollectorPageview() additionally whitelisting '/w/api.php' hits according to Jon's definitions (instead of just '/wiki/') to ensure counting of the W0 traffic.

As for other non-W0 mobile web hits, I would need to understand the code better, but I'm guessing that additional definitions in Pageview.isWebstatscollectorPageview() would be reflected in the x1000 estimate of hits.
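A minimal sketch of the proposed change, written in Python rather than in the Java of Pageview.isWebstatscollectorPageview(), whose actual body is not shown in this thread. The function name and the `count_api` switch are made up for the example, and a real definition would presumably restrict which API hits count (for instance by action), which this sketch does not do.

```python
from urllib.parse import urlsplit

ARTICLE_PREFIX = "/wiki/"
API_PATH = "/w/api.php"


def is_pageview(url, count_api=True):
    """Treat /wiki/ paths as pageviews and, optionally, /w/api.php hits too."""
    path = urlsplit(url).path
    if path.startswith(ARTICLE_PREFIX):
        return True
    return count_api and path == API_PATH
```

Running the same log through the predicate with `count_api=False` and `count_api=True` would give a quick estimate of how much traffic the current definition leaves out.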

(In reply to comment #15)

By the way, I agree that if all langs are tagged, that means stuff needs to be revisited.

(In reply to comment #15)

If I'm not mistaken, API calls invoked from a page served from a mobile web
subdomain, such en.m.wikipedia.org, usually hit the API endpoint on the same
domain, that is in this example, en.m.wikipedia.org. [...]

Does that mean that mobile api requests are expected to /exclusively/
come from the mobile domains (“m”, “zero”, ...)?

Best regards,
Christian

P.S.: This would contradict our endpoint discovery documentation at
[1], which declares the canonical place for the endpoint can be found
in the RSD. From the mobile site, this gives [2]:

<link rel="EditURI" type="application/rsd+xml" href="//en.wikipedia.org/w/api.php?action=rsd" />

So we are already led off of the mobile site. And getting the rsd
file, we are sent [3] to the plain desktop api endpoint:

[...] apiLink="http://en.wikipedia.org/w/api.php" [...]

So following our documentation, http://en.wikipedia.org/w/api.php
looks like the canonical api endpoint that is to be used even from
the mobile site.

The examples from the bug's description follow this pattern.

[1] http://www.mediawiki.org/wiki/Api#The_endpoint
[2] curl --user-agent 'iPhone' 'http://en.m.wikipedia.org/wiki/Calvin_and_Hobbes' | grep EditURI
[3] curl --user-agent 'iPhone' 'http://en.wikipedia.org/w/api.php?action=rsd'
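The discovery step from [2] can also be sketched in Python. This only parses an HTML string for the EditURI link and resolves the protocol-relative href against the page URL; it does not fetch anything. The class and function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EditUriFinder(HTMLParser):
    """Remember the href of the first <link rel="EditURI"> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        if attributes.get("rel") == "EditURI":
            self.href = attributes.get("href")


def find_edit_uri(page_url, html):
    """Return the absolute RSD URL advertised by the page, or None."""
    finder = EditUriFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    # A protocol-relative href inherits the scheme of the page it came from.
    return urljoin(page_url, finder.href)


SAMPLE = ('<link rel="EditURI" type="application/rsd+xml" '
          'href="//en.wikipedia.org/w/api.php?action=rsd" />')
```

Fed the link element quoted above together with the mobile page URL, this resolves to the desktop endpoint, which is the point being made: a client that follows the documented discovery path leaves the mobile domain.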

(In reply to comment #17)

(In reply to comment #15)

If I'm not mistaken, API calls invoked from a page served from a mobile web
subdomain, such en.m.wikipedia.org, usually hit the API endpoint on the same
domain, that is in this example, en.m.wikipedia.org. [...]

Does that mean that mobile api requests are expected to /exclusively/
come from the mobile domains (“m”, “zero”, ...)?

I believe it depends:

  • If it's mobile web (e.g., the stock Browser on Android or the stock Safari on iOS), then mdot or zerodot would be domains in play from what I can tell, yes. I've added Jon Robson to the bug in case he has additional information.
  • If it's the Wikipedia App (User-Agent will contain 'WikipediaMobile' in it), then it seems that the domains in play are more commonly desktop ones like de.wikipedia.org. Did "mobile" in the subject line of this bug and the Mingle card refer solely to the app, or also to the mobile web?
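The two cases above could be told apart in log processing roughly as follows. This is a sketch that relies only on the two signals named in this comment ('WikipediaMobile' in the User-Agent, and an "m" or "zero" label in front of the project domain); the category names are made up.

```python
MOBILE_LABELS = ("m", "zero")


def classify_request(host, user_agent):
    """Return 'app', 'mobile-web' or 'other' for a logged request."""
    if "WikipediaMobile" in user_agent:
        return "app"
    labels = host.lower().split(".")
    # en.m.wikipedia.org -> ['en', 'm', 'wikipedia', 'org']; the label just
    # before the project domain tells mdot/zerodot apart from desktop hosts.
    if len(labels) >= 3 and labels[-3] in MOBILE_LABELS:
        return "mobile-web"
    return "other"
```

Note that an app request against de.wikipedia.org is classified by its User-Agent alone, so the desktop host does not hide it.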

IIRC, lang.(m|zero).wikipedia.org end up rewritten at the origin to lang.wikipedia.org. That said, it seems like in the zero.tsv.log-* files there are lots of api.php hits that have their domain names in the subdomains of m.wikipedia.org (sampled-1000.tsv.log* also seem to have lots of m.wikipedia.org subdomain'd api.php hits).

Only taking the W0 case into consideration for the moment, the following turns up hits that seem close to the definition of an API hit described at https://raw.github.com/wikimedia/metrics/master/pageviews/new_mobile_pageviews_report/pageview_definition.png - the only problem seems to be that the Referer doesn't perfectly conform to the format of action=mobileview&title=<title>.

$ grep m.wikipedia.org zero.tsv.log-20130930 | grep mobileview | grep sections=all

That a small percentage of the overall hits in the W0 logs /nearly/ match the expected format constituting an API hit may mean that we should ask Jon if the definition of an API-based web hit for the mobile web can also have a Referer that looks a little different than what's described at https://raw.github.com/wikimedia/metrics/master/pageviews/new_mobile_pageviews_report/pageview_definition.png.
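The grep pipeline above matches substrings anywhere in the log line. A slightly stricter variant, sketched here under the assumption that the request URL has already been extracted from the line, checks the host, path and query parameters separately. It deliberately ignores the Referer, since its expected format is the open question.

```python
from urllib.parse import urlsplit, parse_qs


def is_mobileview_hit(url):
    """True for api.php hits on *.m.wikipedia.org with action=mobileview and sections=all."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.endswith(".m.wikipedia.org"):
        return False
    if parts.path != "/w/api.php":
        return False
    query = parse_qs(parts.query)
    return query.get("action") == ["mobileview"] and query.get("sections") == ["all"]
```

Unlike the grep version, this would not match a line where "mobileview" only shows up in the Referer or User-Agent field.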

The examples from the bug's description follow this pattern.

I think I see what you're saying. The first example (1965912739) is for the Wikipedia *App*, a telltale sign being the string 'WikipediaMobile' in the User-Agent field. Incidentally, there are indeed some 'WikipediaMobile' UA hits in the W0 logs.

Sounds like we probably need to do a videoconference working session.

(In reply to comment #18)

  • If it's mobile web (e.g., the stock Browser on Android or the stock Safari on iOS), then mdot or zerodot would be domains in play from what I can tell, yes.

Although the mdot and zerodot domain's EditURIs refer to the desktop
API endpoint?

[...]
Did "mobile" in the subject line of this bug and the Mingle
card refer solely to the app, or also to the mobile web?

For me it's about both.

I discovered the problem through the above provided examples, and they
seem to be for the app.

However, the issue is not limited to the app. For example

zgrep -e '-1210387371' /a/squid/archive/sampled/sampled-1000.tsv.log-20131010.gz

is an iPhone (non WikipediaMobile) requesting the desktop rsd with a
mobile site referrer. So this visitor sticks to the documentation on
how to discover the real api endpoint, and ends up at the desktop site
api with his mobile phone coming from the mobile site.

But maybe those two are separate concerns?
If you think that WikipediaMobile and non-WikipediaMobile requests are
separate things, let me know, and I'll split the bug.

[...] That said, it seems like in the zero.tsv.log-* files there are lots of api.php hits that have their domain names in the subdomains of m.wikipedia.org [...]

Yes. The api requests on the mdot and zerodot subdomains are there.
They are totally ok.

[...]
Sounds like we probably need to do a videoconference working session.

I'd love to. I sent an email to arrange for this.

(In reply to comment #19)

However, the issue is not limited to the app. For example

zgrep -e '-1210387371' /a/squid/archive/sampled/sampled-1000.tsv.log-20131010.gz

is an iPhone (non WikipediaMobile) requesting the desktop rsd with a
mobile site referrer. So this visitor sticks to the documentation on
how to discover the real api endpoint, and ends up at the desktop site
api with his mobile phone coming from the mobile site.

Strange. Looks like that type of request is pretty rare - try

$ zgrep 'action=rsd' /a/squid/archive/sampled/sampled-1000.tsv.log-20131010.gz | grep m.wikipedia.org | wc -l

for example - although I suppose there are always more variations on this theme that could add up. I'm wondering if this has something to do with a plugin or a certain type of app making a hit, or maybe some sort of banner running that attempted the hit in contravention of the same-origin policy (or maybe made a CORS pre-flight request, yet was out of the ordinary).
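For anyone who prefers to script this, the count above can be reproduced in Python. The sketch does plain substring matching like the grep pipeline; the scaling helper assumes that each line in a sampled-1000 file stands for roughly 1000 real requests, as the "x1000 estimate" mentioned earlier in this bug suggests.

```python
import gzip


def count_matching_lines(path, needles):
    """Count lines in a gzipped log that contain every one of the substrings."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if all(needle in line for needle in needles):
                count += 1
    return count


def estimate_total(sampled_count, sampling_factor=1000):
    """Scale a count from a 1-in-N sampled log up to an estimate of real hits."""
    return sampled_count * sampling_factor
```

Usage would be along the lines of `count_matching_lines(path, ["action=rsd", "m.wikipedia.org"])`, with `path` pointing at the sampled log used above.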

But maybe those two are separate concerns?
If you think that WikipediaMobile and non-WikipediaMobile requests are
separate things, let me know, and I'll split the bug.

Yeah, I think WikipediaMobile (app) requests should be separate from non-WikipediaMobile (probably a web browser). Maybe Brion and Jon can provide some input here, too.

[...]
Sounds like we probably need to do a videoconference working session.

I'd love to. I sent an email to arrange for this.

Thanks for the videoconference. I felt it was good.

Oh yeah, one more thing...

(In reply to comment #19)

(In reply to comment #18)

  • If it's mobile web (e.g., the stock Browser on Android or the stock Safari on iOS), then mdot or zerodot would be domains in play from what I can tell, yes.

Although the mdot and zerodot domain's EditURIs refer to the desktop
API endpoint?

Ah, here's what that EditURI thing is for, per OutputPage.php in MediaWiki core; see the code blob below. Hits on it look rather uncommon (0.006%, that is, 6 thousandths of a percent) compared with the rest of the hits. I think this code should probably consult the skin to get the correct URL, at least if the browser is supposed to observe the same-origin policy for this. Then again, it is a small percentage of traffic.

For the sake of completeness, I suspect it's more likely the RSD is hit from within a "WebView" in mobile apps or desktop applications or "out of bounds" browser plugins a good chunk of the time. "WebView" components in mobile apps and desktop applications aren't necessarily bound by the SOP, sometimes mobile app and desktop application programmers even use these things in their apps to get around "pesky" security obstruction in the normal browser. Anyhow, this is just one case of unexpected hits against the desktop endpoint, and I realize there may be others.

if ( $wgEnableAPI ) {
	# Real Simple Discovery link, provides auto-discovery information
	# for the MediaWiki API (and potentially additional custom API
	# support such as WordPress or Twitter-compatible APIs for a
	# blogging extension, etc)
	$tags['rsd'] = Html::element( 'link', array(
		'rel' => 'EditURI',
		'type' => 'application/rsd+xml',
		// Output a protocol-relative URL here if $wgServer is protocol-relative
		// Whether RSD accepts relative or protocol-relative URLs is completely undocumented, though
		'href' => wfExpandUrl( wfAppendQuery( wfScript( 'api' ), array( 'action' => 'rsd' ) ), PROTO_RELATIVE ),
	) );
}