
stream.wikimedia.org throws websocket.WebSocketException: Handshake Status 502 Bad Gateway
Closed, Resolved, Public

Description

Using either


Version: wmf-deployment
Severity: normal

Details

Reference
bz66989

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:33 AM
bzimport added a project: EventStreams.
bzimport set Reference to bz66989.
bzimport added a subscriber: Unknown Object (MLST).

Note that it does eventually work if you are persistent enough about reconnecting.
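As a rough illustration of that workaround, here is a minimal retry sketch. The `connect` callable, the attempt count and the backoff delays are all hypothetical placeholders, not part of any real client library; a real client would catch its own handshake exception rather than a bare `Exception`.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    `connect` is a hypothetical zero-argument callable that either returns
    a connection or raises when the handshake fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as error:  # assumption: any exception means "retry"
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Stand-in for a flaky endpoint: fails twice, then succeeds.
calls = []
delays = []


def flaky_connect():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("Handshake Status 502 Bad Gateway")
    return "connected"


result = connect_with_retry(flaky_connect, sleep=delays.append)
print(result, len(calls), delays)  # prints "connected 3 [0.5, 1.0]"
```

Passing `sleep` as a parameter is only there so the sketch can be exercised without actually waiting.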

http://codepen.io/Krinkle/full/laucI/ seems to work most of the time, but about one time in 20 I see the following in the network requests:

ws://stream.wikimedia.org/socket.io/1/websocket/281487980761

Error during WebSocket handshake: Unexpected response code: 502

After that it falls back to xhr-polling with loads of paired POST/GET requests.

If we want to get this running against beta, I have a WIP for that: https://gerrit.wikimedia.org/r/#/c/138312/. Ideas and code are welcome on how to allow for beta in our site/family structure.

Merlijn van Deen offered to look into this with me and we were able to identify the problem: the WebSocket handshake requires two round-trips to the server, and the load balancers were configured to distribute incoming requests across backends in a round-robin fashion. Because the requests that make up the initial handshake follow each other in quick succession, the most common case was for one request to be routed to one server, and the follow-up request to be routed to another server, which had not started negotiating a session with the client and was therefore not expecting the request.

This also explains why it sometimes worked: if another client's request intervened between the two requests, the follow-up request would be routed back to the same server and the handshake would succeed.
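The failure mode described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: two backends, strict round-robin, and a two-step handshake in which step 2 succeeds only on the backend that handled step 1. The class and function names are made up for illustration and do not correspond to the real load balancer or stream server code.

```python
from itertools import cycle


class Backend:
    """A backend that accepts step 2 only from clients it saw in step 1."""

    def __init__(self, name):
        self.name = name
        self.pending = set()  # clients that completed step 1 on this backend

    def handle(self, client, step):
        if step == 1:
            self.pending.add(client)
            return 200
        # Step 2 fails on a backend that never started a session for this client.
        return 200 if client in self.pending else 502


class RoundRobinBalancer:
    """Sends each incoming request to the next backend in turn."""

    def __init__(self, backends):
        self._next = cycle(backends)

    def route(self, client, step):
        return next(self._next).handle(client, step)


def handshake(balancer, client, intervening_requests=0):
    """Run the two-step handshake and return the status of step 2."""
    balancer.route(client, 1)
    for i in range(intervening_requests):
        # Requests from other clients that arrive between the two steps.
        balancer.route(f"other-{i}", 1)
    return balancer.route(client, 2)


def fresh_balancer():
    return RoundRobinBalancer([Backend("a"), Backend("b")])


back_to_back = handshake(fresh_balancer(), "client-1")
with_one_between = handshake(fresh_balancer(), "client-1", intervening_requests=1)
print(back_to_back, with_one_between)  # prints "502 200"
```

With the two steps arriving back to back, the second lands on the other backend and gets a 502; a single intervening request shifts the rotation so that the second step returns to the original backend.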

Giuseppe and I decided to temporarily "fix" this by simply shutting down one of the servers, causing all requests to get routed to the single remaining server. This made the errors go away, validating the diagnosis. The more permanent fix is to use a different scheduling algorithm that makes sessions sticky. This is implemented in https://gerrit.wikimedia.org/r/#/c/152960/, which will most likely be deployed in the next few days.
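To illustrate what "sticky" scheduling buys here, the sketch below picks a backend by hashing the client address, so both steps of a handshake always reach the same backend. This is one common way to get stickiness and is an assumption on my part; it is not necessarily the algorithm used in the linked change. All names are hypothetical, and the addresses come from the documentation range 192.0.2.0/24.

```python
import hashlib


class Backend:
    """A backend that accepts step 2 only from clients it saw in step 1."""

    def __init__(self, name):
        self.name = name
        self.pending = set()

    def handle(self, client, step):
        if step == 1:
            self.pending.add(client)
            return 200
        return 200 if client in self.pending else 502


class SourceHashBalancer:
    """Chooses the backend from a hash of the client address (illustrative)."""

    def __init__(self, backends):
        self.backends = backends

    def pick(self, client_addr):
        digest = hashlib.sha256(client_addr.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.backends)
        return self.backends[index]

    def route(self, client_addr, step):
        return self.pick(client_addr).handle(client_addr, step)


balancer = SourceHashBalancer([Backend("a"), Backend("b")])
clients = [f"192.0.2.{n}" for n in range(1, 21)]

# Interleave the traffic: every client sends step 1 before anyone sends step 2.
for addr in clients:
    balancer.route(addr, 1)
statuses = [balancer.route(addr, 2) for addr in clients]
print(all(status == 200 for status in statuses))  # prints "True"
```

Because the backend depends only on the client address, the order in which requests from different clients arrive no longer matters. The trade-off is that load is spread per client rather than per request.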

ori claimed this task.