Page MenuHomePhabricator

node 0.10 memory leak / lockup of client (possible test cases)
Closed, ResolvedPublic

Description

On RT-testing clients, a few pages seem to cause long post-result pauses (could be serializer, could be diff classifier) .. which could either be memory leak, a library issue, ..

plwiki:Podsumowanie_startów_zespołu_Lotus_w_Formule_1 (88064 ms)
itwiki:Campionato_Dilettanti_1958-1959 (parses in 11558 ms)
itwiki:Città_della_Pennsylvania_(M-P) (parses in 35763 ms)

This could be node 0.10 specific, but I haven't tested on node 0.8 yet. I can verify that at least the plwiki page above takes a loooooong time to parse.

But, adding info for now before I run out so this can be investigated later.


Version: unspecified
Severity: normal

Details

Reference
bz58952

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:25 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz58952.

Another: enwiki:Russia_at_the_2004_Summer_Olympics in 35387 ms

Change 103639 had a related patch set uploaded by GWicke:
Bug 58952: Avoid repeated require of Util from defines

https://gerrit.wikimedia.org/r/103639

Change 103756 had a related patch set uploaded by GWicke:
Bug 58952 WIP: Try to make sure env and old documents are freed early

https://gerrit.wikimedia.org/r/103756

Actually, on some of these pages, it looks like at least in RT testing, diff classification is the culprit. Same behavior on node 0.8.26, node 0.10.19,and node 0.10.24.


[subbu@earth tests] time node roundtrip-test.js --prefix itwiki Campionato_Dilettanti_1958-1959 > /dev/null
completed parsing of itwiki:Campionato_Dilettanti_1958-1959 in 11524 ms
wt2html + html2wt time: 14674

130.808u 1.072s 2:28.11 89.0% 0+0k 0+0io 0pf+0w

[subbu@earth tests] time node roundtrip-test.js --prefix plwiki 'Podsumowanie_startów_zespołu_Lotus_w_Formule_1' > /dev/null
completed parsing of plwiki:Podsumowanie_startów_zespołu_Lotus_w_Formule_1 in 57277 ms
wt2html + html2wt time: 64485

464.153u 3.188s 9:25.44 82.6% 0+0k 0+0io 0pf+0w

node 0.10 seems faster than 0.8 on the plwiki page. In addition, roundtrip-test.js on node 0.8 failed with FATAL ERROR: JS Allocation failed - process out of memory.

So, RT-issues on these pages are present in both 0.8 and 0.10, it seems.

It looks like jsdiff is the culprit at least in rt-testing for inefficient processing of some pages -- main cause for both long processing times + memory usage (peak 2.3g virtual, 1.3g resident) on the plwiki test page.


[subbu@earth tests] time node roundtrip-test.js --prefix itwiki Campionato_Dilettanti_1958-1959 > /dev/null
completed parsing of itwiki:Campionato_Dilettanti_1958-1959 in 45254 ms
after parse: Thu Dec 26 2013 15:48:46 GMT-0600 (CST)
wt2html + html2wt time: 49303
after serialization before rt-diff: Thu Dec 26 2013 15:48:49 GMT-0600 (CST)
after jsdiff: Thu Dec 26 2013 15:51:12 GMT-0600 (CST)
after convert-diff: Thu Dec 26 2013 15:51:12 GMT-0600 (CST)
before double-rt-diff: Thu Dec 26 2013 15:51:13 GMT-0600 (CST)
before checkIfSignificant: Thu Dec 26 2013 15:51:55 GMT-0600 (CST)

159.505u 1.272s 3:58.13 67.5% 0+0k 0+0io 0pf+0w

Relevant entry for plwiki:Podsumowanie_startów_zespołu_Lotus_w_Formule_1


after serialization before rt-diff: Thu Dec 26 2013 15:56:42 GMT-0600 (CST)

after jsdiff: Thu Dec 26 2013 16:03:40 GMT-0600 (CST)

The general GC / possible memory leak issue happens without jsdiff too. The backend server forked by roundtrip-test.js grows in rss when processing many requests in a row until 1.7G is reached. Only pure wt2html and html2wt modes are used there, and only html2wt POSTs seem to trigger the memory growth. Some of my patches above are geared towards making it more likely for the small incremental collections in 0.10 to still pick up garbage as early as possible by explicitly breaking reference cycles.

The v8 GC in node 0.10 is apparently geared more towards low pause times than thorough GC. I experimented with a limited --max_old_space_size and a few other GC related parameters to force full old space collections, and was able to get rt tests to run within 300m rss for a while. GC would use most time when it got close to the limit though, so more experimentation with a slightly higher limit is needed.

This issue is also discussed at https://github.com/joyent/node/issues/5828. The GC in v8 3.20 (will be in node 0.12) might have relevant fixes.

Hmm ... okay .. so, I guess it looks like we have to do a direct upgrade to 0.12 instead then unless we get a reliable workaround with the --max_old_space_size param.

Also possibly relevant: The "LATENCY AND IDLE GARBAGE COLLECTION" paragraph in http://blog.nodejs.org/2013/03/11/node-v0-10-0-stable/

Change 105875 had a related patch set uploaded by GWicke:
WIP Bug 58952: Help incremental GC further by breaking refs

https://gerrit.wikimedia.org/r/105875

Change 105875 merged by jenkins-bot:
Bug 58952: Help incremental GC further by breaking refs

https://gerrit.wikimedia.org/r/105875

The 0.10 specific memory leak issue seems to be solved by these patches. After the last round-trip test run, the largest node process used 600m with most using around 300, which is very close to what we have seen in 0.8.

The jsdiff issue on large pages is however still present and indpendent of the node version. I have created bug 59840 to track just the jsdiff issue.

I'm closing this bug as fixed, as it was primarily about node 0.10 issues.

A dsh command to list memory of all node processes in rt testing (from https://www.mediawiki.org/wiki/Parsoid/Round-trip_testing#Some_dsh_tricks):

dsh -M -cf /home/gwicke/rtclients ps -C nodejs -o rss= | sort -n -k2