Page MenuHomePhabricator

Use Map and Set instead of plain objects / ES6 shim cleanup
Closed, ResolvedPublic

Description

We currently use plain objects in many places where users can pass in keys. Even with Object.create(null), users can pass in the key __proto__ which will happily set the prototype, for example to an object if that is what we set as a value.

ES6 / harmony Map and Set (see http://wiki.ecmascript.org/doku.php?id=harmony:simple_maps_and_sets) fix this problem (and allow objects as keys), so we should use them. Native Map / Set is not yet complete in node 0.10, so we should probably use one of the existing shims to get it:

https://github.com/paulmillr/es6-shim
https://github.com/Benvie/harmony-collections

The es6 shim also gives us more goodies which might be useful to have for cleaner code. We should not go overboard though, and only use features where a native implementation is likely to happen soon.


Version: unspecified
Severity: normal

Details

Reference
bz53241

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 2:11 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz53241.

(In reply to comment #1)

Another contender: https://github.com/WebReflection/es6-collections

This one uses linear search on a key array to support object keys. Search for 'function betterIndexOf' in https://github.com/WebReflection/es6-collections/blob/master/src/es6-collections.js.

https://github.com/paulmillr/es6-shim/blob/master/es6-shim.js

uses iteration on a linked list in its Map implementation. Search for 'function Map'.

https://github.com/Benvie/harmony-collections/blob/master/harmony-collections.js

seems to be the only one of the contenders so far that uses native objects for primitive keys (and linear search for object keys), so only incurs constant overhead for each string key access. proto is mapped to a randomized string. Memory overhead is only incurred per Map, but not per item.

Seems to be the best option for large maps. For small maps linear search is likely faster.

The standard thunk is something like:

function Map() { this.store = Object.create(null); }
Map.prototype.get = function( k ) { return this.store['$'+k]; }
Map.prototype.set = function( k, v ) { this.store['$'+k] = v; }

...but we should be a little careful not to go overboard here. In many cases we know that 'proto' cannot be a possible key. In other case we are doing queries only (for example, to determine whether a tag name is a member of a specific set), and Object.create(null).proto===null, so that's perfectly safe as well.

So, sure we should use safer maps where needed, but it probably doesn't have to be a hugely-invasive patch set.

(In reply to comment #5)

The standard thunk is something like:

function Map() { this.store = Object.create(null); }
Map.prototype.get = function( k ) { return this.store['$'+k]; }
Map.prototype.set = function( k, v ) { this.store['$'+k] = v; }

This is basically what harmony-collections does for primitive keys when native maps are not available. It also implements the full spec, which is good for consistency even if we don't need objects as keys right now.

So, sure we should use safer maps where needed, but it probably doesn't have
to
be a hugely-invasive patch set.

Yup, only where users pass in arbitrary keys for new items, as I noted in comment #1.

Change 85432 had a related patch set uploaded by Arlolra:
WIP: ES6 shim cleanup.

https://gerrit.wikimedia.org/r/85432

harmony-collections are quite slow in production:


Latest master:
[subbu@earth tests] time node parserTests --wt2html --wt2wt --selser --html2wt --html2html >&! /dev/null

94.013u 0.728s 1:34.32 100.4% 0+0k 0+0io 0pf+0w

[subbu@earth lib] git checkout 064cee74
[subbu@earth tests] time node parserTests --wt2html --wt2wt --selser --html2wt --html2html > & ! /dev/null

59.475u 0.456s 0:59.65 100.4% 0+0k 0+0io 0pf+0w

Cscott has a fork that performs better at https://github.com/cscott/harmony-collections.

I've fixed the performance regression by patching harmony-collections.

I might also patch es6-shim to be O(1) so that would be another option.

(In reply to comment #10)

I've fixed the performance regression by patching harmony-collections.

I might also patch es6-shim to be O(1) so that would be another option.

Can you post the current timings?

$ node --version
v0.10.20
$ git checkout master
$ time tests/parserTests.js --quiet
real 1m8.476s
user 1m3.036s
sys 0m0.628s

$ git checkout no-es6
$ time tests/parserTests.js --quiet
real 0m37.091s
user 0m37.036s
sys 0m0.444s

With the gross hack in https://gerrit.wikimedia.org/r/88148 :

$ time tests/parserTests.js --quiet
real 0m39.008s
user 0m38.876s
sys 0m0.540s

With https://gerrit.wikimedia.org/r/88194 and https://gerrit.wikimedia.org/r/88193 :

$ time tests/parserTests.js --quiet
real 0m42.924s
user 0m42.668s
sys 0m0.456s

With es6-shim (https://gerrit.wikimedia.org/r/88241 and https://gerrit.wikimedia.org/r/88242 ):

$ time tests/parserTests.js --quiet
real 0m46.265s
user 0m46.228s
sys 0m0.488s

Unlike the harmony-collections patch, this is a fork that can be upstreamed (https://github.com/paulmillr/es6-shim/issues/148).

Don't know why exactly this is slower than the harmony-collections version; I suspect it's due to the other stuff hacked by es6-shim. The actual Map/Set implementation should be faster.

(In reply to comment #13)

Don't know why exactly this is slower than the harmony-collections version; I
suspect it's due to the other stuff hacked by es6-shim. The actual Map/Set
implementation should be faster.

I was wrong. Simply adding es6-shim does not make the parserTests benchmark, in my tests. The es6-shim implementation seems to be slower solely because its constructor is slower. We make 88,143 instances of Set via JSUtils.arrayToSet during a run of parserTests, and those constructor calls seem to account for the extra 7s of runtime.

OK. I suggest moving to es6-shim. The codebase is much simpler.

Things left to do:

a) profile our code and figure out why we are creating so many different Set objects during a parserTests run. Possible we can hoist many of them out (into mediawiki.wikitext.constants.js) which would greatly reduce the speed penalty from using es6-shim.

b) (possibly) hack es6-shim to lazy-initialize the Map objects. Only create the backing linked-list if non-string keys are ever used. [gwicke suggests that numeric keys could also be sped up, by prefixing string keys with '$' instead of testing against 'proto']

c) (possibly) add a FastStringSet to JSUtils (along the lines of https://gerrit.wikimedia.org/r/88148 ).

c2) figure out what the hotspots are for Set.has() -- possibly only a few sites actually need to use FastStringSet.

d) subbu has also requested re-doing some of the performance comparisons above using a single (large) article as a benchmark, instead of ParserTests. This would help determine whether the performance loss was significant in real use (as opposed to just in parserTests).

Arlo: I'm moving on to other tasks right now. Go ahead and tackle any of the above that you like.

Regarding (d), here is one possible page: fr:Coupe_du_pays_de_Galles_de_football
See https://gist.github.com/subbuss/6876447 for some tests I did y'day morning.

wrt (b) https://github.com/cscott/es6-shim/tree/faster-map recovers the lost speed:

$ time tests/parserTests.js --quiet
real 0m38.924s
user 0m38.896s
sys 0m0.464s

Subbu, here's a comparison as requested in d),
https://gist.github.com/arlolra/6913667

So it's clear, in this case:

master is before that merge that just went in, so ec485737eaa10f5336578e5a9f54fe944de9fc23

no-es6 is master with, git revert 560c79e9a1ee8d01bc9cc4b862f280fa6752a610

and es6regress is what just went in dea9cbfa8ff4a6aea12930ca13c2d0f3491d8477

Still seeing a regression so taking one of Cscott's solutions is probably best.

Change 88242 had a related patch set uploaded by Cscott:
Replace harmony-collections with es6-shim.

https://gerrit.wikimedia.org/r/88242

Change 88242 merged by jenkins-bot:
Replace harmony-collections with es6-shim

https://gerrit.wikimedia.org/r/88242

Seems good.

starting parsing of fr:Coupe_du_pays_de_Galles_de_football
completed parsing of fr:Coupe_du_pays_de_Galles_de_football in 6671 ms
starting parsing of fr:Coupe_du_pays_de_Galles_de_football
completed parsing of fr:Coupe_du_pays_de_Galles_de_football in 4247 ms
starting parsing of fr:Coupe_du_pays_de_Galles_de_football
completed parsing of fr:Coupe_du_pays_de_Galles_de_football in 4395 ms