VisualEditor: ve.dm.SurfaceFragment.wordBoundaryPattern treats non-lower-ASCII word characters as boundaries
Closed, Resolved, Public

Assigned To

Authored By

	• TrevorParscal
	Jan 18 2013, 12:30 AM

Description

See http://inimino.org/~inimino/blog/javascript_cset for some work in this area.

Version: unspecified
Severity: major

Details

Reference: bz44085

Related Objects

		Status	Subtype	Assigned	Task
		Invalid		Jdforrester-WMF	T35077 VisualEditor multilingual input / i18n issues (tracking)
		Resolved		Esanders	T46085 VisualEditor: ve.dm.SurfaceFragment.wordBoundaryPattern treats non-lower-ASCII word characters as boundaries

Event Timeline

• bzimport raised the priority of this task to High. Nov 22 2014, 1:22 AM

• bzimport added projects: VE-deploy-2013-04-01, VisualEditor-DataModel, I18n.

• bzimport set Reference to bz44085.

• TrevorParscal created this task. Jan 18 2013, 12:30 AM

Bit of clarification:

When the user clicks the link button in the toolbar and they haven't selected any text, we expand the selection in both directions from the cursor position and select the word the cursor is in, make that a link, then show the link inspector. The code that expands the selection to a full word is in ve.dm.SurfaceFragment, and apparently treats non-ASCII characters as word boundaries. The practical bug that this leads to is that if you put the cursor in the middle of "Möckernbrücke" (or "égalité", if you prefer French) and click the link button, only "ckernbr" (or "galit", respectively) will be selected and linkified. Obviously this is a problem for i18n in languages using an extended Latin alphabet like German, French and Polish, but it's a total nightmare for non-Latin languages like Russian, Hebrew and Japanese.
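To make the failure concrete, here is a minimal sketch of the expansion step, assuming an ASCII-only word-character class. `expandToWord` and `asciiWord` are hypothetical names for illustration, not the actual ve.dm.SurfaceFragment code.

```javascript
// Hypothetical helper, not the real VisualEditor implementation.
// Starting at `offset`, grow the range in both directions for as long as
// the neighbouring character matches `wordChar`.
// Note: this indexes by UTF-16 code unit, so it ignores non-BMP characters.
function expandToWord( text, offset, wordChar ) {
	var start = offset,
		end = offset;
	while ( start > 0 && wordChar.test( text[ start - 1 ] ) ) {
		start--;
	}
	while ( end < text.length && wordChar.test( text[ end ] ) ) {
		end++;
	}
	return text.slice( start, end );
}

// Assumed ASCII-only character class; anything else counts as a boundary.
var asciiWord = /[A-Za-z0-9_]/;

expandToWord( 'Möckernbrücke', 4, asciiWord ); // → 'ckernbr'
expandToWord( 'égalité', 3, asciiWord ); // → 'galit'
```

With a class like this, "ö", "ü" and "é" stop the expansion exactly as described above.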

Actually, Chinese and Japanese don't mark word boundaries in the text at all. The only way to detect them is with a dictionary. We'll need a special case for these languages so we don't end up selecting entire sentences.

http://xregexp.com/ has unicode character class support. We may be able to pick out the data we need from it instead of using the whole library.

To begin with, here is a patch that adds some test structure and fixes what we have already: https://gerrit.wikimedia.org/r/#/c/53564

If you're going to do lexicon-based word boundary detection in Chinese, maybe you could use a word list stored in a client-side Bloom Filter.
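A rough sketch of what such a client-side filter could look like, assuming a bit array and a few seeded FNV-1a-style hashes. The `BloomFilter` name, the sizes and the hash choice are all illustrative assumptions, not an existing VisualEditor component.

```javascript
// Hypothetical Bloom filter for a word list. It can report false positives
// but never false negatives for words that were added.
function BloomFilter( bitCount, hashCount ) {
	this.bitCount = bitCount;
	this.hashCount = hashCount;
	this.bits = new Uint8Array( Math.ceil( bitCount / 8 ) );
}

// FNV-1a-style hash over UTF-16 code units, varied by a small seed.
BloomFilter.prototype.hash = function ( word, seed ) {
	var i,
		h = ( 2166136261 ^ seed ) >>> 0;
	for ( i = 0; i < word.length; i++ ) {
		h ^= word.charCodeAt( i );
		h = Math.imul( h, 16777619 ) >>> 0;
	}
	return h % this.bitCount;
};

// Set one bit per hash function.
BloomFilter.prototype.add = function ( word ) {
	var i, bit;
	for ( i = 0; i < this.hashCount; i++ ) {
		bit = this.hash( word, i );
		this.bits[ bit >> 3 ] |= 1 << ( bit & 7 );
	}
};

// True if every bit for this word is set (the word is possibly in the list).
BloomFilter.prototype.mightContain = function ( word ) {
	var i, bit;
	for ( i = 0; i < this.hashCount; i++ ) {
		bit = this.hash( word, i );
		if ( !( this.bits[ bit >> 3 ] & ( 1 << ( bit & 7 ) ) ) ) {
			return false;
		}
	}
	return true;
};

var lexicon = new BloomFilter( 1024, 3 );
lexicon.add( '中文' );
lexicon.mightContain( '中文' ); // → true
```

A real word list would need a much larger bit array, sized for the acceptable false-positive rate.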

I don't know if it's as much of a problem in Japanese; you could probably use (?<=\P{Han})(?=\p{Han}) as a good start (i.e. there is a word break be.
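The explanation above is cut off, so the following only illustrates the pattern as written, adapted to JavaScript syntax. It assumes an engine with ES2018 lookbehind and Unicode property escapes, where the script has to be spelled `\p{Script=Han}` and the `u` flag is required. `splitBeforeHan` is a hypothetical name.

```javascript
// Zero-width match between a non-Han character and a following Han character.
var beforeHan = /(?<=\P{Script=Han})(?=\p{Script=Han})/u;

// Hypothetical helper: split text wherever a Han run starts after non-Han text.
function splitBeforeHan( text ) {
	return text.split( beforeHan );
}

splitBeforeHan( '私は東京に行く' ); // → [ '私は', '東京に', '行く' ]
```

This is only a heuristic: it puts a break where kana gives way to kanji, and does nothing for runs that are entirely kana or entirely kanji.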

As an incremental improvement I've expanded the letters and numbers groups to their Unicode categories: https://gerrit.wikimedia.org/r/#/c/53583/
We still need to think about which punctuation categories to add.
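As an illustration of the idea (not of the linked patch, which may well spell out explicit character ranges), here is a sketch that treats the Unicode Letter and Number categories as word characters. It assumes ES2018 property escapes, and `wordAt` is a hypothetical helper.

```javascript
// Hypothetical helper: return the run of letters/numbers that contains or
// touches `offset`, or '' if the offset is not next to a word.
function wordAt( text, offset ) {
	// \p{L} = any letter, \p{N} = any number; punctuation is deliberately left out.
	var wordPattern = /[\p{L}\p{N}]+/gu,
		match;
	while ( ( match = wordPattern.exec( text ) ) !== null ) {
		if ( match.index <= offset && offset <= match.index + match[ 0 ].length ) {
			return match[ 0 ];
		}
	}
	return '';
}

wordAt( 'Möckernbrücke', 4 ); // → 'Möckernbrücke'
wordAt( 'привет мир', 8 ); // → 'мир'
```

Apostrophes and hyphens still end a word here, which is where the open question about punctuation categories comes in.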

The Unicode standard has a fair amount to say on the matter. Ideally we would implement their standard.

http://www.unicode.org/reports/tr29/#Word_Boundaries

Like this: https://gerrit.wikimedia.org/r/#/c/54480 (well, apart from non-BMP characters...)
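For comparison, engines that ship `Intl.Segmenter` (which did not exist when this task was filed) expose word segmentation directly. As far as I know, implementations generally base it on UAX #29, though the exact behaviour is implementation-defined. A minimal sketch, with `wordAtOffset` as a hypothetical wrapper:

```javascript
// Hypothetical wrapper around Intl.Segmenter; requires a runtime that provides it.
function wordAtOffset( text, offset, locale ) {
	var segmenter = new Intl.Segmenter( locale, { granularity: 'word' } ),
		// containing() returns the segment covering the given code unit index,
		// or undefined if the index is out of range.
		segment = segmenter.segment( text ).containing( offset );
	// isWordLike is false for whitespace and punctuation segments.
	return segment && segment.isWordLike ? segment.segment : '';
}

wordAtOffset( 'Möckernbrücke ist schön', 4, 'de' ); // → 'Möckernbrücke'
```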

Jdforrester-WMF added a project: VisualEditor. Dec 3 2014, 2:22 AM
