
VisualEditor: Backspace deletes combined character clusters together with diacritics
Closed, Resolved | Public | 40 Estimated Story Points

Description

Some scripts, among them Arabic, Hebrew, and most scripts of India and Southeast Asia, are written as consonants with vowel marks attached to them.

In most text editors and word processors, when the cursor is after a combination of a consonant and a vowel and the backspace key is pressed, the vowel is deleted first and then the consonant.

For example, the Devanagari combination गा (ग [g] + ा [a]) consists of two Unicode characters, which the font joins automatically. If the cursor is after them and you press the backspace key, the second character ( ा) is supposed to be deleted first, and only then the first (ग). That is what happens in most text editors, including MediaWiki's source editor.
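To make the two-character claim concrete, here is a minimal Python sketch that lists the code points in the cluster. The helper name describe is made up for this illustration and has nothing to do with VisualEditor's code.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

cluster = "\u0917\u093e"  # the Devanagari cluster from the description

for code_point, name, category in describe(cluster):
    print(code_point, name, category)
```

The second entry has a mark category (Mc), which is why fonts and editors attach it to the preceding consonant.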

In the VisualEditor, backspace immediately deletes the whole cluster. This behavior is unexpected for most users.

To complicate things, when the cursor is before the combined character and the Delete key is pressed, the expected behavior is to delete the whole cluster. This is what happens in the VisualEditor now, and it must be preserved. For cursor movement in either direction, the cluster must also be treated as one character: if the cursor is before गा and the right arrow key is pressed, the cursor is supposed to move immediately to after the गा. This also works correctly now and must be preserved.
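The three expected behaviors can be sketched in Python as follows. This is only an illustration, not VisualEditor's implementation: all function names are hypothetical, and a cluster is approximated as one base character followed by any characters in a Unicode mark category, which is much simpler than the full grapheme cluster rules.

```python
import unicodedata

def is_mark(ch):
    # Assumption: any mark category (Mn, Mc, Me) continues the current cluster.
    return unicodedata.category(ch).startswith("M")

def cluster_end(text, pos):
    """Index just past the cluster that starts at pos."""
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return end

def backspace(text, cursor):
    """Remove only the single code point before the cursor."""
    if cursor == 0:
        return text, 0
    return text[:cursor - 1] + text[cursor:], cursor - 1

def delete_forward(text, cursor):
    """Remove the whole cluster after the cursor."""
    if cursor >= len(text):
        return text, cursor
    return text[:cursor] + text[cluster_end(text, cursor):], cursor

def move_right(text, cursor):
    """Move the cursor past the whole cluster."""
    if cursor >= len(text):
        return cursor
    return cluster_end(text, cursor)

text = "\u0917\u093e"  # the Devanagari cluster from the description
print(backspace(text, 2))       # vowel sign removed, consonant kept
print(delete_forward(text, 0))  # whole cluster removed
print(move_right(text, 0))      # cursor jumps past both code points
```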


Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=49233
https://bugzilla.wikimedia.org/show_bug.cgi?id=53757

Details

Reference
bz51472

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:04 AM
bzimport set Reference to bz51472.

The expected behaviour of backspace can be different with different writing systems.

In Indic scripts, as explained in the bug description, the most common behaviour is that backspace erases one character and delete erases a whole cluster. See:
http://publib.boulder.ibm.com/infocenter/hodhelp/v10r0/topic/com.ibm.hod.doc/help/hindi.html#hindispecialkeys
http://www-archive.mozilla.org/projects/ctl/tests/#indiceditoper

For other scripts it might be different, particularly for Latin, Greek and Cyrillic. Because these scripts have precomposed accented characters, users expect single characters and character sequences (base character + combining diacritic) that represent one unit to behave the same way, i.e. backspace and delete both erase the base together with the diacritic. For example, the single character à and the two-character sequence ɛ̀ should be treated the same way.

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries talks about this.
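The point about precomposed characters can be checked with Python's standard unicodedata module. The sketch below is only an illustration and the helper name nfc_length is made up: it shows that a + combining grave composes into a single code point under NFC, while ɛ + combining grave stays as two code points because Unicode has no precomposed form for it, even though both look like one accented letter to the user.

```python
import unicodedata

def nfc_length(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

a_grave = "a\u0300"            # a + COMBINING GRAVE ACCENT
open_e_grave = "\u025b\u0300"  # LATIN SMALL LETTER OPEN E + COMBINING GRAVE ACCENT

print(nfc_length(a_grave))       # composes to the single code point U+00E0
print(nfc_length(open_e_grave))  # no precomposed form exists, so it stays a sequence
```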

(In reply to comment #1)

Could this be a dupe of -
https://bugzilla.wikimedia.org/show_bug.cgi?id=49233 ?

Well, that one is marked as FIXED, and this one is definitely not fixed on master.

This is how we expect backspace to work with our current support of grapheme clusters. As Denis points out, being able to delete combining marks separately would have to be enabled on a per-script basis, as we wouldn't want to require multiple keystrokes to remove e-acute, or a Jamo-constructed Hangul character.
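One way to picture a per-script switch is the Python sketch below. Everything in it is an assumption made for illustration: the function names, the mark-based approximation of a cluster, and the hard-coded list of code point ranges (the Hebrew, Arabic and Devanagari blocks) standing in for a real script lookup. It is not how VisualEditor decides this.

```python
import unicodedata

# Hypothetical policy table: Unicode blocks whose clusters are taken apart
# one code point at a time by backspace (Hebrew, Arabic, Devanagari).
PER_CODE_POINT_RANGES = [(0x0590, 0x05FF), (0x0600, 0x06FF), (0x0900, 0x097F)]

def is_mark(ch):
    # Assumption: any mark category (Mn, Mc, Me) continues the current cluster.
    return unicodedata.category(ch).startswith("M")

def cluster_start(text, cursor):
    """Index of the base character of the cluster that ends at cursor."""
    start = cursor - 1
    while start > 0 and is_mark(text[start]):
        start -= 1
    return start

def deletes_per_code_point(ch):
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in PER_CODE_POINT_RANGES)

def backspace(text, cursor):
    """Delete one code point or the whole cluster, depending on the script."""
    if cursor == 0:
        return text, 0
    start = cluster_start(text, cursor)
    if deletes_per_code_point(text[start]):
        start = cursor - 1  # listed scripts: remove only the last code point
    return text[:start] + text[cursor:], start

print(backspace("e\u0301", 2))       # e-acute as a sequence: one keystroke removes both
print(backspace("\u0917\u093e", 2))  # Devanagari cluster: only the vowel sign is removed
```

This sketch does not cover Jamo-constructed Hangul, because conjoining jamo are letters rather than marks and need the full grapheme cluster rules.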

A fix for this is in progress in gerrit change 80689.

Change 80689 had a related patch set uploaded by Divec:
DONTMERGE:Revert model to use simple UTF-16 code units

https://gerrit.wikimedia.org/r/80689

Change 80689 merged by jenkins-bot:
Revert model to use simple UTF-16 code units

https://gerrit.wikimedia.org/r/80689

Jdforrester-WMF set Security to None.