Page MenuHomePhabricator

VisualEditor: Data model needs characters, not code points
Closed, ResolvedPublic

Description

At present, the VisualEditor treats UTF-16 code points as if they were synonymous with abstract characters. Here are two cases where this causes bugs:

  1. UTF-16 uses a surrogate pair to represent each Unicode character above U+FFFF. For instance, U+282E2 ('elevator' in Cantonese) is a single character represented in Javascript as "\uD860\uDEE2". In a plain textarea, this behaves like a single character from the point of view of the user. However in the VisualEditor, cursoring and backspacing requires two presses; and after cursoring once, any text typed will go in the middle of the surrogate pair, creating invalid UTF-16. (see The Unicode Standard, Version 6.2, Section 3.8, Surrogates).
  1. Combining accents can be used in sequences to build up abstract characters. For example, the Javascript string "m\u0300" represents a single abstract character (m with grave accent). In a plain textarea, this behaves like a single character when cursoring, but like two characters when backspacing (so the first backspace just removes the accent). However in the VisualEditor, cursoring requires two presses; and after cursoring once, any typed text will go between the letter and the accent, creating an inappropriate dangling combining accent.

These kinds of issues occur because the DataModel uses Arrays with code point elements, say ['\uD860', '\uDEE2', ..., 'm', '\u0300']). My hunch is that this is slightly too low level, and it should instead use abstract character elements, say ['\uD860\uDEE2', ..., 'm\u0300'], where each element represents a whole character.

A good start would be to abstract out away calls to string.split( '' ) into a single function like this:

ve.splitCharacters = function ( value ) {
    return value.split( /(?![\uDC00-\uDFFF])/ ); // don't split surrogate pairs
};

The rest of the codebase should call this function to perform splits, and then not assume that data[i] is a single character. Then we can refine splitCharacters as needed.

Alternatively, since the overwhelming majority of characters will in fact be single code points, perhaps the DataModel structure could "encode" the exceptional multi-code point characters as objects, so that 'typeof data[i] === "string"' can still detect the simple cases.

This sounds like a big change for a small issue, but I think it would avoid problems in the future. With a character representation, you can safely perform useful operations like splicing and truncating without having to check the surrounding context very carefully every time.


Version: unspecified
Severity: normal

Details

Reference
bz48630

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:18 AM
bzimport set Reference to bz48630.

A related question is when invalid UTF-16 is checked and how errors are handled.

I don't know enough to suggest where in the system the checks should be. But it obviously relates to the parts of ve.dm.Converter and ve.ce.Surface (for paste) which would call this new splitCharacters function. Note that two chunks of invalid UTF-16 can concatenate into one valid but unexpected chunk.

Ideally merging would be done at the transaction level (e.g. in fixupInsertion), although we would need to extend that method to be allowed to modify the remove part of a transaction, e.g.

Data: ['F','o','o']

Transaction:

Retain: 3
Replace:
  Remove: ''
  Insert: '<accent>'

Would become:

Transaction:

Retain: 2
Replace:
  Remove: 'o'
  Insert: 'o<accent>'

Another major issue is that window.getSelection (which is what we use to get cursor position, via Rangy) calculates using number of codepoints, not characters, so this would need to be translated.

(In reply to comment #3)

Another major issue is that window.getSelection (which is what we use to get
cursor position, via Rangy) calculates using number of codepoints, not
characters, so this would need to be translated.

Which would be done in ve.ce.getOffsetFromTextNode

Related URL: https://gerrit.wikimedia.org/r/64965 (Gerrit Change I8d936fb15d82f73cd45fac142c540a7950850d55)

To be precise about what should constitute a character, see Grapheme Cluster Boundary Rules from TR29 (http://unicode.org/reports/tr29/) . We probably want to implement these, except insofar as it differs from the browsers' implementations of what constitutes a character for cursoring purposes.

The important thing right now is to abstract the code away from assuming single-codepoint characters, so the structure is correct. Then we can get the splitting rules right in isolation.

Ed's main block of code is merged and will go out with wmf5; keeping this open for follow-up, and removing milestone of tomorrow.

  • Bug 49233 has been marked as a duplicate of this bug. ***

Marking this as merged; follow-up in bug 49257.