Page MenuHomePhabricator

Random API issues - authenticated user seems unauthenticated randomly
Closed, ResolvedPublic

Description

Author: damian

Description:
I have 2 bots - ClueBot NG and ClueBot III - both using the same php class. These have previously run fine in pmtpa tools and where running fine in eqiad tools.

Since earlier today, api requests using a session id seem to fail a random amount of time - CBNG is missing maybe 20% of edits, III about 80%.

Tracing out the CURL requests - the session is kept, a successful login is done, but tokens are missing/the current user looks logged out.

A sample request:

  • About to connect() to en.wikipedia.org port 80 (#0)
  • Trying 208.80.154.224... * connected

POST /w/api.php?action=login&format=php HTTP/1.1

User-Agent: ClueBot/2.0
Host: en.wikipedia.org
Accept: */*
Accept-Encoding: deflate, gzip
Content-Length: 268
Content-Type: multipart/form-data; boundary=----------------------------0882348d6018

< HTTP/1.1 200 OK
< Server: Apache
< X-Powered-By: PHP/5.3.10-1ubuntu3.9+wmf1
< X-Content-Type-Options: nosniff
< Cache-control: private
< X-Frame-Options: SAMEORIGIN
< Content-Encoding: gzip
< Vary: Accept-Encoding
< X-Vary-Options: Accept-Encoding;list-contains=gzip
< Content-Type: application/vnd.php.serialized; charset=utf-8
< X-Varnish: 2852251556, 1052575718
< Via: 1.1 varnish, 1.1 varnish
< Content-Length: 162
< Accept-Ranges: bytes
< Date: Wed, 05 Mar 2014 23:27:10 GMT
< Age: 0
< Connection: keep-alive
< X-Cache: cp1066 miss (0), cp1068 frontend miss (0)

  • Added cookie enwikiSession="< REMOVED >" for domain en.wikipedia.org, path /, expire 0

< Set-Cookie: enwikiSession=< REMOVED >; path=/; HttpOnly; GeoIP=::::v4; path=/
<

  • Closing connection #0

POST: http://en.wikipedia.org/w/api.php?action=login&format=php (0.065814018249512 s) (194 b)
Array
(

[login] => Array
    (
        [result] => NeedToken
        [token] => < REMOVED >
        [cookieprefix] => enwiki
        [sessionid] => < REMOVED >
    )

)

  • About to connect() to en.wikipedia.org port 80 (#0)
  • Trying 208.80.154.224... * connected

POST /w/api.php?action=login&format=php HTTP/1.1

User-Agent: ClueBot/2.0
Host: en.wikipedia.org
Accept: */*
Accept-Encoding: deflate, gzip
Cookie: enwikiSession=< REMOVED >
Content-Length: 396
Content-Type: multipart/form-data; boundary=----------------------------ce92f93b040d

< HTTP/1.1 200 OK
< Server: Apache
< X-Powered-By: PHP/5.3.10-1ubuntu3.9+wmf1
< X-Content-Type-Options: nosniff
< Cache-control: private
< P3P: CP="This is not a P3P policy! See http://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
< X-Frame-Options: SAMEORIGIN
< Content-Encoding: gzip
< Vary: Accept-Encoding
< X-Vary-Options: Accept-Encoding;list-contains=gzip
< Content-Type: application/vnd.php.serialized; charset=utf-8
< X-Varnish: 2852251622, 1518781136
< Via: 1.1 varnish, 1.1 varnish
< Content-Length: 205
< Accept-Ranges: bytes
< Date: Wed, 05 Mar 2014 23:27:10 GMT
< Age: 0
< Connection: keep-alive
< X-Cache: cp1066 miss (0), cp1067 frontend miss (0)

  • Added cookie centralauth_User="ClueBot+III" for domain wikipedia.org, path /, expire 1396654030

< Set-Cookie: centralauth_User=ClueBot+III; expires=Fri, 04-Apr-2014 23:27:10 GMT; path=/; domain=.wikipedia.org; httponly; GeoIP=::::v4; path=/
<

  • Closing connection #0

POST: http://en.wikipedia.org/w/api.php?action=login&format=php (0.11309099197388 s) (256 b)
Array
(

[login] => Array
    (
        [result] => Success
        [lguserid] => < REMOVED >
        [lgusername] => ClueBot III
        [lgtoken] => < REMOVED >
        [cookieprefix] => enwiki
        [sessionid] => < REMOVED >
    )

)

  • About to connect() to en.wikipedia.org port 80 (#0)
  • Trying 208.80.154.224... * connected

GET /w/api.php?action=query&prop=revisions&titles=User%3ADamianZaremba&rvlimit=1&rvprop=timestamp|ids|user|comment&format=php&meta=userinfo&rvdir=older HTTP/1.1

User-Agent: ClueBot/2.0
Host: en.wikipedia.org
Accept: */*
Accept-Encoding: deflate, gzip
Cookie: centralauth_User=ClueBot+III; enwikiSession=< REMOVED >

< HTTP/1.1 200 OK
< Server: Apache
< X-Powered-By: PHP/5.3.10-1ubuntu3.9+wmf1
< X-Content-Type-Options: nosniff
< Cache-control: private
< X-Frame-Options: SAMEORIGIN
< Content-Encoding: gzip
< Vary: Accept-Encoding
< X-Vary-Options: Accept-Encoding;list-contains=gzip
< Content-Type: application/vnd.php.serialized; charset=utf-8
< X-Varnish: 2479625599, 1518781362
< Via: 1.1 varnish, 1.1 varnish
< Content-Length: 277
< Accept-Ranges: bytes
< Date: Wed, 05 Mar 2014 23:27:11 GMT
< Age: 0
< Connection: keep-alive
< X-Cache: cp1053 miss (0), cp1067 frontend miss (0)

  • Added cookie GeoIP="::::v4" for domain en.wikipedia.org, path /, expire 0

< Set-Cookie: GeoIP=::::v4; path=/
<

  • Closing connection #0

GET: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=User%3ADamianZaremba&rvlimit=1&rvprop=timestamp|ids|user|comment&format=php&meta=userinfo&rvdir=older (0.083457946777344 s) (475 b) (200 code)
Array
(

[0] => Array
    (
        [revid] => 510098058
        [parentid] => 509798527
        [user] => DamianZaremba
        [timestamp] => 2012-08-31T11:46:49Z
        [comment] => 
    )

[ns] => 2
[title] => User:DamianZaremba
[currentuser] => 10.68.16.7
[continue] => 
[pageid] => 29562889

)

All the values and return looks good - but the current user is anon and any actions using tokens or doing privileged actions fail.

This is the same across multiple bot users, all hitting the enwiki api from tools in eqiad.

Also tested from pmtpa to rule out migration issues and this issue is still present - it was not last week.

Possibly related: The internal labs IPs are now the anon ips, rather than the external - I believe it use to be the external, but apparently it shouldn't matter.

The varnish header suggests this is getting past the cache ok and fails 'randomly', so I'm having a hard time tracking down the issue.

Others have mentioned it as per http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard#Possible_bot_malfunctioning.3F and other 'lost' sessions.

Maybe some migration related issue outstanding?

Hopefully others can add more clarity.


Version: wmf-deployment
Severity: major

Details

Reference
bz62288

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:02 AM
bzimport set Reference to bz62288.
bzimport added a subscriber: Unknown Object (MLST).

< HTTP/1.1 200 OK
< Server: Apache
< X-Powered-By: PHP/5.3.10-1ubuntu3.9+wmf1
< X-Content-Type-Options: nosniff
< Cache-control: private
< P3P: CP="This is not a P3P policy! See
http://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
< X-Frame-Options: SAMEORIGIN
< Content-Encoding: gzip
< Vary: Accept-Encoding
< X-Vary-Options: Accept-Encoding;list-contains=gzip
< Content-Type: application/vnd.php.serialized; charset=utf-8
< X-Varnish: 2852251622, 1518781136
< Via: 1.1 varnish, 1.1 varnish
< Content-Length: 205
< Accept-Ranges: bytes
< Date: Wed, 05 Mar 2014 23:27:10 GMT
< Age: 0
< Connection: keep-alive
< X-Cache: cp1066 miss (0), cp1067 frontend miss (0)

  • Added cookie centralauth_User="ClueBot+III" for domain wikipedia.org, path

/, expire 1396654030
< Set-Cookie: centralauth_User=ClueBot+III; expires=Fri, 04-Apr-2014
23:27:10 GMT; path=/; domain=.wikipedia.org; httponly; GeoIP=::::v4; path=/
<

  • Closing connection #0

There should be additional cookies being set here: centralauth_Token, centralauth_Session, probably forceHTTPS, and also enwikiUserID and enwikiUserName.

The one Set-Cookie header here looks weird too. Why the additional "; GeoIP=::::v4; path=/" after the cookie?

I bet this has something to do with the new GeoIP stuff they did recently, it smells like their code is screwing up any existing Set-Cookie headers in the response.

What is wrong with the way the GeoIP cookie is added to the set-cookie header in the example above?

We have the exact same issue on the beta cluster, logged as bug 62244. Copy pasting from there:


On deployment-cache-text1.pmtpa.wmflabs I edited /etc/varnish/text-frontend.inc.vcl and commented out the GeoIP snippet added by change 113935:

https://gerrit.wikimedia.org/r/#/c/113935/6/templates/varnish/text-frontend.inc.vcl.erb,unified )

Restarted varnish with:

/etc/init.d/varnish restart ; /etc/init.d/varnish-frontend restart

I then ran Chris script and I received three cookies:

< Set-Cookie: centralauth_Session=3428f185da995eaaf9550ae54b4e634f; path=/; httponly
< Set-Cookie: centralauth_User=TestBug16864; expires=Sat, 05-Apr-2014 09:34:59 GMT; path=/; httponly
< Set-Cookie: centralauth_Token=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/; httponly
< Location: http://en.wikipedia.beta.wmflabs.org/wiki/Special:CentralLogin/complete?token=ef3afb1dbc82549c1655b6841d8614ae
< Age: 0
< X-Cache: deployment-cache-text1 miss (0), deployment-cache-text1 frontend miss (0)
< Cache-Control: private, s-maxage=0, max-age=0, must-revalidate


Your bug report seems to indicate that happens in production as well.

This might have been fixed by Faidon with: https://gerrit.wikimedia.org/r/#/c/117004/ which "Disable cookie-based geolocation on text varnishes".

Can you please try again?

(In reply to Ori Livneh from comment #2)

What is wrong with the way the GeoIP cookie is added to the set-cookie
header in the example above?

You mean for example in this header:

Set-Cookie: centralauth_User=ClueBot+III; expires=Fri, 04-Apr-2014 23:27:10 GMT; path=/; domain=.wikipedia.org; httponly; GeoIP=::::v4; path=/

It's not adding a GeoIP cookie at all, according to RFC 6265. It's adding to extra parameters to the end of the centralauth_User cookie.

The more important issue is that the other six Set-Cookie headers that should have been in the response are entirely missing.

(In reply to Antoine "hashar" Musso from comment #4)

This might have been fixed by Faidon with:
https://gerrit.wikimedia.org/r/#/c/117004/ which "Disable cookie-based
geolocation on text varnishes".

Can you please try again?

The issue is currently not occurring with the same bot script that was showing the issue last night. I observe that the Set-Cookie headers do not have "; GeoIP=::::v4; path=/" appended.

damian wrote:

I can confirm that change being pulled out has fixed the bots. Been happily editing ever since and upon restart get all the cookies etc set properly as described above.

This was fixed around 9:30am UTC by Faidon who disabled a Varnish feature deployed yesterday. Details at https://bugzilla.wikimedia.org/show_bug.cgi?id=62244#c19

I am marking this bug as a dupe of bug 62244 since that is where we have the technical details and script to reproduce the issue.

  • This bug has been marked as a duplicate of bug 62244 ***