Page MenuHomePhabricator

CentralAuth not caching Special:CentralAutoLogin/start
Closed, ResolvedPublic

Description

Ops pointed out that almost 60% of the apache traffic is calls to Special:CentralAutoLogin. About half of that traffic is to /start, so it looks like Special:CentralAutoLogin/start is still not cached, even though we don't disable client caching in the output page. MediaWiki is sending Cache-Control private.

csteipp@linux-hv06:~> curl -Ik 'https://en.wikipedia.org/wiki/index.php?title=Special:CentralAutoLogin/start&type=script&returnto=One&returntoquery=' | grep Cache-Control
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

It seems like that is because OutputPage checks to see if $this->mSquidMaxage != 0, otherwise it sends out the private header. On my dev system, I added $this->getOutput()->setSquidMaxage( 3600 ), and I then get

csteipp@linux-hv06:~> curl -Ik 'https://127.0.0.1/wiki_port83/de/index.php?title=Special:CentralAutoLogin/start&type=script&returnto=One&returntoquery='
HTTP/1.1 302 Found
Date: Mon, 16 Sep 2013 23:32:21 GMT
Server: Apache/2.2.22 (Linux/SUSE)
X-Powered-By: PHP/5.3.17
X-Content-Type-Options: nosniff
P3P: CP="This is not a P3P policy! See http://127.0.0.1/wiki_port83/de/index.php?title=Special:CentralAutoLogin/P3P for more info."
Vary: Accept-Encoding,Cookie
Cache-Control: s-maxage=3600, must-revalidate, max-age=0
Location: https://127.0.0.1/wiki_port83/pool/index.php?title=Special:CentralAutoLogin/checkLoggedIn&wikiid=dewiki83&proto=https&type=script&returnto=One&returntoquery=
Content-Type: text/html; charset=utf-8

Is that enough to have varnish cache these, but logged in users will always bypass the cache?


Version: master
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=52206

Details

Reference
bz54195

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:52 AM
bzimport set Reference to bz54195.

Change 85036 had a related patch set uploaded by CSteipp:
Set Squid maxage so proxies will cache

https://gerrit.wikimedia.org/r/85036

Change 85036 merged by jenkins-bot:
Set Squid maxage so proxies will cache

https://gerrit.wikimedia.org/r/85036

Faidon, Tim, or Mark, it would be good if you guys could take a look at gerrit 85036 too, and just make sure Brad and my assumptions about squid / varnish are correct. If anything doesn't seem right, let us know asap!

This change has been deployed, and load to eqiad application servers seems to have dropped a bit.

20:31 logmsgbot: csteipp synchronized php-1.22wmf17/extensions/CentralAuth 'update to master for SUL fix'
20:28 logmsgbot: csteipp synchronized php-1.22wmf18/extensions/CentralAuth 'update to master for SUL fix'

CentralAutoLogin traffic is still > 50% of all Apache traffic, so no huge effect there.

CentralAutoLogin/start is 27.5% in my small sample, however this has little to do with caching: all of the URLs are different because of the "returnto" query parameter. There is of course benefit with having those cached when "returnto" is some popular page (e.g. Main_Page).

Thanks Faidon. In the majority case, the ping to check if the anon user is logged in, we shouldn't need the returnto.. so if that's being added, I think we can fix that.

Brad, do remember why we add returnto and returntoquery in centralautologin.js for non-login pages? It's not used except when we get to the /setCookies phase, and there, it only adds a window.location if the request had a 'return' parameter, which I don't ever see set. So it seems like we're passing it around without actually using it.. but am I missing something?

Change 86207 had a related patch set uploaded by CSteipp:
Remove returnto for basic login test

https://gerrit.wikimedia.org/r/86207

(In reply to comment #6)

Brad, do remember why we add returnto and returntoquery in
centralautologin.js
for non-login pages? It's not used except when we get to the /setCookies
phase,
and there, it only adds a window.location if the request had a 'return'
parameter, which I don't ever see set. So it seems like we're passing it
around
without actually using it.. but am I missing something?

git blame would point you to Gerrit change 83007. The returnto gets used in SkinTemplate.php in core, when it generates the #p-personal replacement.

Change 86207 abandoned by CSteipp:
Remove returnto for basic login test

Reason:
We'll find a better solution

https://gerrit.wikimedia.org/r/86207

(In reply to comment #8)

(In reply to comment #6)

Brad, do remember why we add returnto and returntoquery in
centralautologin.js
for non-login pages? It's not used except when we get to the /setCookies
phase,
and there, it only adds a window.location if the request had a 'return'
parameter, which I don't ever see set. So it seems like we're passing it
around
without actually using it.. but am I missing something?

git blame would point you to Gerrit change #83007. The returnto gets used in
SkinTemplate.php in core, when it generates the #p-personal replacement.

Ah, thanks. Could we build the correct returnto and update the link in the javascript that we return from /setCookies?

(In reply to comment #10)

Ah, thanks. Could we build the correct returnto and update the link in the
javascript that we return from /setCookies?

Probably. Although my gut reaction is "this #p-personal thing keeps getting more and more hacky".

Change 86265 had a related patch set uploaded by Anomie:
For caching, avoid passing returnto to Special:CentralAutoLogin

https://gerrit.wikimedia.org/r/86265

Change 86265 merged by jenkins-bot:
For caching, avoid passing returnto to Special:CentralAutoLogin

https://gerrit.wikimedia.org/r/86265

The fix had the intended effect but I still see those URLs (without returnto this time) not being cached at all. It seems MediaWiki emits "Cache-control: private, must-revalidate, max-age=0" and "Expires: Thu, 01 Jan 1970 00:00:00 GMT"; a request/response sample would be:

GET /wiki/Special:CentralAutoLogin/start?type=script HTTP/1.0
Host: zh.wikipedia.org
Accept: */*
User-Agent: (...)
Accept-Encoding: gzip,deflate,sdch
Accept-Language: zh-CN,zh;q=0.8
Via: 1.0 cp1015.eqiad.wmnet:3128 (squid/2.7.STABLE9)
X-Forwarded-For: (...)
Connection: keep-alive

HTTP/1.1 302 Found
Date: Fri, 18 Oct 2013 00:35:12 GMT
Server: Apache
X-Powered-By: PHP/5.3.10-1ubuntu3.8+wmf2
X-Content-Type-Options: nosniff
Cache-control: private, must-revalidate, max-age=0
Vary: Accept-Encoding,X-Forwarded-Proto,Cookie
X-Vary-Options: Accept-Encoding;list-contains=gzip,X-Forwarded-Proto,Cookie;string-contains=zhwikiToken;string-contains=zhwikiLoggedOut;string-contains=forceHTTPS;string-contains=zhwikiSession;string-contains=centralauth_Token;string-contains=centralauth_Session;string-contains=centralauth_LoggedOut;string-contains=mf_useformat;string-contains=stopMobileRedirect
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Location: http://zh.wikipedia.org/w/index.php?title=Special:%E4%B8%AD%E5%A4%AE%E8%AE%A4%E8%AF%81%E7%99%BB%E5%BD%95/start&type=script
Content-Encoding: gzip
Content-Length: 20
Keep-Alive: timeout=2, max=147
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8

This looks like another issue, since the user is just being redirected to a translated page name. That seems like it would be a safe redirect for mediawiki to cache, and we probably haven't had sufficient volumes to worry about caching it in the past. But maybe someone else knows for sure?

If the user actually gets to Special:CentralAutoLogin/start, they are redirected to loginwiki's Special:CentralAutoLogin/checkLoggedIn, which should be cached, and I'm guessing those are since Faidon didn't see many of those.

Faidon also mentioned on irc that he was seeing http->https redirects. We wouldn't want to cache those, since those are based on the user being logged in.

What do you mean "on the user being logged in"? Is it a preference (Cookie?), geolocation data for the IP or something else? We could likely enforce some logic with Vary/XVO/Varnish VCL and avoid hitting apaches for all of these.

(In reply to comment #15)

Faidon also mentioned on irc that he was seeing http->https redirects. We
wouldn't want to cache those, since those are based on the user being logged
in.

For Special:CentralAutoLogin, it redirects to https if $wgSecureLogin is set and wfCanIPUseHTTPS() returns true for the IP, regardless of whether the user is logged in or not. If the user has the forceHTTPS cookie or is logged in and returns true for $user->requiresHTTPS(), then I believe MediaWiki will do the same sort of redirect before getting to CentralAuth.

BTW, should that redirect be varying on X-Forwarded-Proto? Probably, so I submitted Gerrit change 90543 to do that.

Yes, the redirects should vary on XFP, but right now it won't have any effect, since these redirects also come with Cache-Control: private so they're not cached anyway.

I guess the rationale behind no caching is that the redirects are dependent on the client's IP and you can't Vary on that. We can do tricks to solve this (e.g. do the GeoIP lookup on the Varnish side for particular URLs and set an X-Country header that you could vary on, for example), but before we go down that road I'd like to analyze the traffic a bit further and figure out why these rank so high in amount of req/s.

Any updates Faidon? I'm curious to know if you found anything, so we can make sure this is as minimal load on the apaches as possible.

OK, I had another quick look, as this in combination with consistent hashing directing /wiki/Special:CentralAutoLogin/start?type=script to a single Varnish backend produced a partial outage by exhausting its resources. (this has been workarounded since with https://gerrit.wikimedia.org/r/#/c/95458/ )

This is far more than just "caching". I'm going to report these here so we don't get lost into multiple bugs, but feel free to fork into other bugs as needed:

  1. From Varnish text frontends, i.e. *all* text client-side requests (hits & misses), the top 10 URLs are: 6895.39 RxURL /wiki/Special:CentralAutoLogin/start?type=script 5627.74 RxURL /w/index.php?title=MediaWiki:Wikiminiatlas.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400 3235.81 RxURL /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=https 2137.83 RxURL /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=http 1981.98 RxURL /w/index.php?title=MediaWiki:EnhancedCollapsibleElements.js&action=raw&ctype=text/javascript 1943.81 RxURL /w/index.php?title=MediaWiki:Common.js/NormalizeCharWidth.js&action=raw&ctype=text/javascript 1567.70 RxURL /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwiki&proto=https&type=script 1383.12 RxURL /favicon.ico 1266.87 RxURL /w/index.php?title=%E7%89%B9%E5%88%A5:%E4%B8%AD%E5%A4%AE%E7%AE%A1%E7%90%86%E8%87%AA%E5%8B%95%E3%83%AD%E3%82%B0%E3%82%A4%E3%83%B3/star 1020.66 RxURL /w/index.php?title=MediaWiki:OSM.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400

(also note how the first is double the requests from the third and triple the requests from fourth)

and from the top-50 URLs, 10 (20%) are CentralAutoLogin URLs, namely:

6944.76 RxURL          /wiki/Special:CentralAutoLogin/start?type=script
3003.28 RxURL          /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=https
2273.40 RxURL          /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=http
1452.62 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwiki&proto=https&type=script
 551.49 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=jawiki&proto=https&type=script
 362.93 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=eswiki&proto=https&type=script
 298.33 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=zhwiki&proto=http&type=script
 293.00 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwiki&proto=http&type=script
 289.84 RxURL          /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=1x1&proto=https
 240.99 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=ptwiki&proto=https&type=script

Moreover, these stats do not include the localized URLs that I'm seeing, such as
http://de.wikipedia.org/w/index.php?title=Spezial:Zentrale_automatische_Anmeldung/start&type=script
(localized URLs for such basic functions are a bad idea, as it doesn't help us aggregate URLs counts and find outliers like the above)

And, surpisingly, does not include the 1x1 URL that I see e.g. on all(?) enwiki pages as:
<noscript><img src="//en.wikipedia.org/w/index.php?title=Special:CentralAutoLogin/start&amp;type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript></div>
which also comes with Cache-control: private.

The fact that these URLs are so popular points to far more than just editors attempting to login -- it surely includes anonymous page views and it's probably aggravated by the fact that due to caching headers, we do not make use of client-side caching. I see ext.centralauth.centralautologin.js doing such requests and that script being loaded on every anon, but it's clearly not my area of expertise.

Those URLs not being cached also means these get passed through to appservers and increasing their load significantly, which the original point of this report. It's also very apparent now, though, that they're causing problems in all kinds of layers in our infrastructure (Varnish frontend/backend hit ratio is down the drain, Varnish backends were/are suffering) and possibly causing a client-side slow-down.

I think we can fix the caching issues -even for redirects depending on GeoIP- but first and foremost we need to stop forcing all page views to requests all kinds of (uncacheable) CentralAutoLogin URLs in this way.

Oh, and the for the record, the top 5 from Varnish backends, i.e. frontend misses, which filters out cacheable hotspots:

6964.24 RxURL          /wiki/Special:CentralAutoLogin/start?type=script
5650.14 RxURL          /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=https
3943.42 RxURL          /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=http
2417.36 RxURL          /wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwiki&proto=http&type=script
1513.81 RxURL          /check

(all but /check being passed through as-is to appservers)

Unless the client lacks/blocks localStorage and cookies, it should be one pageview per client per site visited per day that is hitting /start?type=script, /checkLoggedIn?wikiid=*&proto=http(s)&type=script, and /createSession?gu_id=0&type=script&proto=http(s). There will also be extra hits coming from the login form, but those hits will include "&return=1" and likely returnto and returntoquery parameters.

As long as each of those URLs is varied on the CentralAuth cookies, they all should be fully cacheable.

We can't just not have anons hit these, though, because anons hitting them is an integral part of the SUL2 login scheme: browsers will no longer allow us to set cookies for (e.g.) wikivoyage.org on login if the user has never visited wikivoyage.org before, so when a user does visit wikivoyage.org anonymously we need to somehow check whether they're logged in centrally.

/start?type=1x1 does appear on anon pageviews, but there it's wrapped in <noscript> tags so it should only be hit when the browser lacks JavaScript. /start?type=1x1&from=* appears on the pageview after login.

Change 95609 had a related patch set uploaded by Anomie:
Skip Special:CentralAutoLogin/start for JS anons

https://gerrit.wikimedia.org/r/95609

Change 95609 merged by jenkins-bot:
Skip Special:CentralAutoLogin/start for JS anons

https://gerrit.wikimedia.org/r/95609

Change 98461 had a related patch set uploaded by Anomie:
Don't localize Special:CentralAutoLogin

https://gerrit.wikimedia.org/r/98461

Change 98461 merged by jenkins-bot:
Don't localize Special:CentralAutoLogin

https://gerrit.wikimedia.org/r/98461

I updated 1.23wmf5 (including enwiki) to include the patch that removes calls to /start for anons. After that, Faidon was looking at the logs and observed:

<paravoid> /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=http is first
\x03<\x08paravoid\x08>\x08\x0f /wiki/Special:CentralAutoLogin/checkLoggedIn?type=script&wikiid=enwiki&proto=http is second, /wiki/Special:CentralAutoLogin/checkLoggedIn?type=script&wikiid=eswiki&proto=http is third
<paravoid> on backend requests

<paravoid> 6392.37 RxURL /w/index.php?title=MediaWiki:Wikiminiatlas.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400
<paravoid> 6133.61 RxURL /wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=http
<paravoid> 4772.53 RxURL /w/api.php
<paravoid> on frontends

<paravoid> I also see on no5 /w/index.php?title=Spezial:Zentrale_automatische_Anmeldung/createSession&gu_id=0&type=script&proto=http so I guess https://gerrit.wikimedia.org/r/#/c/98461/ didn't work

It's very odd that /createSession is more popular than /checkLoggedIn, since /checkLoggedIn redirecting to /createSession is the only way /createSession is called. On another bug, Brad and I have been discussion removing the call to createSession for annons, and ending the protocol once we know they're anonymous. So we can address the symptoms, but I'm still concerned we have something weird going on causing this.

Change 100508 had a related patch set uploaded by CSteipp:
Don't redirects anons to /createSession

https://gerrit.wikimedia.org/r/100508

(In reply to comment #27)

<paravoid>
/wiki/Special:CentralAutoLogin/createSession?gu_id=0&type=script&proto=http
is
first
\x03<\x08paravoid\x08>\x08\x0f
/wiki/Special:CentralAutoLogin/
checkLoggedIn?type=script&wikiid=enwiki&proto=http
is second,
/wiki/Special:CentralAutoLogin/
checkLoggedIn?type=script&wikiid=eswiki&proto=http
is third
<paravoid> on backend requests

It's very odd that /createSession is more popular than /checkLoggedIn, since
/checkLoggedIn redirecting to /createSession is the only way /createSession
is called.

Not really. /checkLoggedIn has the wikiid parameter to tell it which wiki to redirect back to, so you have to add together /checkLoggedIn?wikiid=enwiki, /checkLoggedIn?wikiid=eswiki, and so on. /createSession will be on all the different domains, but it looks like the stats above aren't being split that way so all the /createSession?gu_id=0 are already added together. Or is /createSession still more popular after doing that?

<paravoid> I also see on no5
/w/index.php?title=Spezial:Zentrale_automatische_Anmeldung/
createSession&gu_id=0&type=script&proto=http
so I guess https://gerrit.wikimedia.org/r/#/c/98461/ didn't work

I'm not so sure, it doesn't seem like a localisation update was run after the patch was backported to wmf5. The nightly update just ran now, and now it seems to no longer be redirecting.

(In reply to comment #29)

It's very odd that /createSession is more popular than /checkLoggedIn, since
/checkLoggedIn redirecting to /createSession is the only way /createSession
is called.

Not really. /checkLoggedIn has the wikiid parameter to tell it which wiki to
redirect back to, so you have to add together /checkLoggedIn?wikiid=enwiki,
/checkLoggedIn?wikiid=eswiki, and so on. /createSession will be on all the
different domains, but it looks like the stats above aren't being split that
way so all the /createSession?gu_id=0 are already added together. Or is
/createSession still more popular after doing that?

Good point, I think those were only counting full urls. Faidon, could we get a list of urls without query parameters, to make sure /checkLoggedIn is being called roughly the same amount as /createSession?

Brad, if you get time to review and merge gerrit 100508, I'm happy to deploy it today, so that we reduce the load as much as possible before we make any changes to how we pass the parameters.

On a sample of 100k requests, "/createSession" matched 3650 times, while "/checkLoggedIn" 3679 times (this is literally greps, not filtered per domain or query params). I tried 3 different such samples with similar results.

Thanks Faidon, that is exactly what I was expecting. Just wanted to confirm.

Change 100508 merged by jenkins-bot:
Don't redirects anons to /createSession

https://gerrit.wikimedia.org/r/100508

These were all deployed and the effect is nicely visualized here:
https://ganglia.wikimedia.org/latest/stacked.php?c=Application+servers+eqiad&host_regex=&m=ap_rps&r=year&st=1388322880
(application server requests per second)

I'm happy with seeing this bug report closed. Chris, you mentioned some other optimizations you had in mind. Are you still planning these? Are we going to track them im this bug?

The only remaining thing would be to (very carefully) return a cached response from Varnish for loginwiki's /checkLoggedIn, if the user doesn't have any loggin cookies, since we return the exact same content for all logged-out users (but unique content for logged in users).

MediaWiki should be sending the correct headers-- CC:public, with a max-age of 1200 when the request came from a logged out user, or CC:private when the user is logged in. But I think Varnish still ignores all caching for Special:CentralAutoLogin after bug 53032?

(In reply to comment #35 by csteipp)

The only remaining thing would be to (very carefully) return a cached
response from Varnish for loginwiki's /checkLoggedIn

[...]

But I think Varnish still ignores all caching for
Special:CentralAutoLogin after bug 53032?

All patches are in - resetting status.

Chris: Is this still highest priority or is this "only" high now?

I'll let Faidon weigh in on how much we need to setup the varnish cache for that page. But given the current load on the backend vs the potential impact (positive, and negative if we mess any part of it up), I'm not sure how high of a priority we want that. Bumping this to normal for now.

Can this be closed or is this still a problem.

I don't think there are any remaining issue that are high enough priority for anyone to work on. If anyone in ops thinks different, let me know and we can plan it out.