Today during the deploy, several of the parsoid hosts remained down for quite a long time. Their logs were filled with:
{"name":"parsoid","hostname":"wtp1001","pid":14985,"level":40,"logType":"warning","wiki":"enwiktionary
","title":"綉","oldId":26153377,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153
377","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153377","time":"2014-11-1
3T21:15:42.711Z","v":0}
{"name":"parsoid","hostname":"wtp1001","pid":14984,"level":40,"logType":"warning","wiki":"enwiktionary
","title":"綊","oldId":17538871,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538
871","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538871","time":"2014-11-1
3T21:15:42.713Z","v":0}
[warning][worker][14995] shutting down
[warning][worker][16672] shutting down
[warning][worker][14987] shutting down
[info][master][14981] shutting down, killing workers
[warning][worker][14996] shutting down
[warning][worker][15004] shutting down
[warning][worker][14990] shutting down
[warning][worker][14998] shutting down
[warning][worker][14988] shutting down
[warning][worker][14992] shutting down
[warning][worker][14985] shutting down
[warning][worker][15011] shutting down
[warning][worker][14984] shutting down
[warning][worker][15001] shutting down
[warning][worker][15007] shutting down
{"name":"parsoid","hostname":"wtp1001","pid":15003,"level":30,"logType":"info","wiki":"rowiki","title"
:"2012","oldId":8889091,"msg":"completed parsing in 7469 ms","longMsg":"completed parsing in\n7469\nms
","time":"2014-11-13T21:15:43.087Z","v":0}
[warning][worker][15003] shutting down
[info][master][14981] exiting
[info][master][16784] initializing 15 workers
[info][worker][16793] loading ...
[info][worker][16790] loading ...
[info][worker][16788] loading ...
[info][worker][16798] loading ...
[info][worker][16795] loading ...
[info][worker][16801] loading ...
[info][worker][16806] loading ...
[info][worker][16793] ready on :8000
[error][worker][16793] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.5:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[info][worker][16790] ready on :8000
[info][worker][16798] ready on :8000
[info][worker][16795] ready on :8000
[error][worker][16790] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.3:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[error][worker][16798] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.7:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[info][worker][16801] ready on :8000
[error][worker][16795] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.6:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[error][worker][16801] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.9:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[info][worker][16806] ready on :8000
[info][worker][16791] loading ...
[error][worker][16806] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.11:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[info][worker][16787] loading ...
[info][worker][16788] ready on :8000
[error][worker][16788] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.2:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[info][worker][16799] loading ...
[info][worker][16804] loading ...
[info][worker][16810] loading ...
[info][worker][16808] loading ...
[info][worker][16812] loading ...
[info][worker][16814] loading ...
[info][worker][16791] ready on :8000
[error][worker][16791] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.4:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[info][worker][16787] ready on :8000
[error][worker][16787] Port 8000 is already in use. Exiting.
Stack:
HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18) HTTPServer.EventEmitter.emit (events.js:95:17) net.js:1081:16 Object.1:1 (cluster.js:592:5) handleResponse (cluster.js:171:41) respond (cluster.js:192:5)
[..etc...]
Two things to note here: first, logs from the master aren't going through bunyan, and thus not ending up in logstash. There's is some discussion of this on https://gerrit.wikimedia.org/r/171618.
Secondly, we need to use SO_REUSEADDR when we open port 800 so that we don't have to wait for the full linger time to bring up the service after a deploy.
See http://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t
Version: unspecified
Severity: normal