
CharInsert cannot insert combining diacritics on their own
Closed, ResolvedPublic

Description

IPA and some languages need glyphs that have no precomposed form in
Unicode. The only way to write them is with combining diacritics.
For the languages concerned this is not a major issue: the alphabet is not too
large, so each combination can be listed, for example
<charinsert>ɛ̀ ɛ́ ɛ̌ ɛ̂</charinsert> (that's U+025B with the diacritics U+0300,
U+0301, U+030C and U+0302).
For IPA, inserting the diacritics on their own would be necessary.
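
A minimal sketch (Python, standard library only) of what "no precomposed form" means here: NFC normalization cannot merge U+025B with any of the four marks, whereas "e" plus U+0301 does collapse to a single code point. The names BASE and MARKS are only illustrative.

```python
import unicodedata

# U+025B (LATIN SMALL LETTER OPEN E) and the four combining marks from the report.
BASE = "\u025B"
MARKS = ["\u0300", "\u0301", "\u0302", "\u030C"]

for mark in MARKS:
    composed = unicodedata.normalize("NFC", BASE + mark)
    # The pair stays two code points: Unicode has no precomposed character for it.
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}: {len(composed)} code points")

# Contrast: "e" + U+0301 has a precomposed form (U+00E9), so NFC yields one code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")
print(len(e_acute))
```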

Currently adding a combining diacritic (any of U+0300-U+036F) between
<charinsert> tags does not work.

  • <charinsert> ̀ </charinsert> is useless since the diacritic attaches to the
    preceding character, but we want to insert only the diacritic, not the
    previous character with the diacritic. Besides, it's not displayed at all.
  • <charinsert>&#x302;</charinsert> is not displayed either.

This isn't a major bug, since any serious IPA user should have an input method
(though there are close to none: one for Mac from SIL, and incomplete or buggy
ones for Linux). But it would be nice if it worked.


Version: unspecified
Severity: enhancement
URL: http://test.leuksman.com/index.php?title=MediaWiki:Edittools&oldid=10632

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 8:54 PM
bzimport added projects: CharInsert, I18n.
bzimport set Reference to bz3821.
bzimport added a subscriber: Unknown Object (MLST).

By the way, this could be used to reduce the number of characters to insert.
Instead of having é á ó ú í ń ŕ ś ź etc., we could simply have ́ , which would follow
the character it combines with. But people aren't used to this, or aware of it.
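
A small sketch of that idea, assuming the text is NFC-normalized at some point after insertion (later in this thread it is said that normalisation happens on preview and save): a single combining acute, typed after the base letter, ends up as the usual precomposed letter. The helper name with_acute is made up for the illustration.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"

def with_acute(base: str) -> str:
    """Append U+0301 and compose to a single code point where Unicode has one."""
    return unicodedata.normalize("NFC", base + COMBINING_ACUTE)

# One button (the combining acute) covers all of these base letters.
bases = "eaouinrsz"
composed = [with_acute(b) for b in bases]
print(" ".join(composed))
```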

gangleri wrote:

There are some other bugs related to this:
compare http://www.fileformat.info/info/unicode/char/30b/index.htm
see http://en.wikipedia.org/wiki/%CC%8B and
http://en.wikipedia.org/w/index.php?title=%CC%8B&action=edit

bugs requiring more or less restrictions:
bug 3819: strip phantom general punctuation characters from page titles
bug 1524: usernames should use unicode whitelist

gangleri wrote:

See also
Bug 4175 comment 1
Bug 4175: feature request: provide a way of inserting UTF-characters by
specifying them as HTML entities or in &#nnnn; or &#xnnnn; notation

robchur wrote:

(In reply to comment #2)

> There are some other bugs related to this:
> compare http://www.fileformat.info/info/unicode/char/30b/index.htm
> see http://en.wikipedia.org/wiki/%CC%8B and
> http://en.wikipedia.org/w/index.php?title=%CC%8B&action=edit
>
> bugs requiring more or less restrictions:
> bug 3819: strip phantom general punctuation characters from page titles
> bug 1524: usernames should use unicode whitelist

Not related to this at all. They're asking for different things altogether.

gangleri wrote:

(In reply to comment #4)

> Not related to this at all. They're asking for different things altogether.

The "relation" should point to what the community *wants* to be implemented
first. Every feature is used both by contributors and by "vandals". The question
is if we provide first features to make it "easier" for "vandals" to keep sysops
busy or if we should agree first

  • on having a "unicode whitelist for usernames" (bug 1524)
  • on "restricting Unicode Control Characters in titles" (bug 3696)
  • on disallowing the use of "various whitespace characters in titles" (bug 1414).

These three are "restrictive requirements"; bug 3821 requires more functionality.
To push that discussion I filed:
Bug 4185: feature request: provide a notification for irregular links

robchur wrote:

That's incorrect. The bugs you added have nothing to do with this one, which is
asking for a way for certain unicode character entities to be preserved during
editing and saved and "acknowledged" in the wiki markup in the database, when
parsing, etc.

Please keep all your "maybe related" and "(not actually) related to" and
"somehow related to" comments out of the way unless they _ARE_.

Finally, this is a discussion thread used by those who develop and are
interested in developing the MediaWiki software; community issues aren't
discussed here. So the following assumption on "what the question is"...

"The question is if we provide first features to make it "easier" for "vandals"
to keep sysops
busy or if we should agree first"

...is completely inappropriate for this bug and feature request tracker.

gangleri wrote:

Hallo!

changed subject from "diacritic cannot be used on their own" to
"Define and implement a markup syntax for CharInsert"
added url
http://test.leuksman.com/index.php?title=MediaWiki:Edittools&oldid=10632

Denis, please reference your tests with URLs. I could not find any of your
contributions in the relevant MediaWiki messages on various wikis.

Please try to use different browsers when looking at the various versions of
FiverAlpha's MediaWiki:Edittools. There are some links with browser tests.
Don't be surprised when you see nothing useful at
http://www.fileformat.info/info/unicode/block/combining_diacritical_marks/utf8test.htm
. It seems that these characters cannot be used together with a space. This is
the problem mentioned by Denis.

To see the tests you should verify that your computer is configured properly for
combined characters in ar:, he:, hi: . See the relevant WP:EN pages.

I could not find a syntax description. This is why I made various tests.

  • I used many variations: unicode characters, &#xnnn; notation and %nn%nn%nn
    notation. While unicode characters can be concatenated, the characters in &#nnnn;
    and &#xnnnn; notation *must* use the "+" character to "get something useful".
  • Because I did not spend time identifying characters which can follow "space"
    and generate a visible link with CharInsert, I used *underscore* to make
    some tests. Try to see for yourself which of the 26 versions generate the result
    from 17 :: _̯ ::.

The real problem with CharInsert is that there is no way (or I could find no way)
to have an entry that

  • inserts foo
  • renders the link with bar

This would be required for a whole group of characters: IPA diacritics,
whitespace, general punctuation; it could also be used to label longer sequences of
characters.

wiki markup syntax can not be used here. one could use
::foo:::bar::
or
foo/bar
because "
" would break Apaches however (?)
see bug 2088: Consecutive /s merged in PATH_INFO URLs on Apache 2.

The markup syntax will allow the name of the link to include both the unicode
character itself and a description. This will prevent the insertion of homoglyphs.
I am not sure whether there are punctuation characters which are added before the
character and others after the character. Naming could look like bar=foobar+ or
bar=+foobar. The url contains RTL characters and RTL punctuation characters.

The workaround offered with example 17 is that users will need only to remove
the underscore after inserting a character from the Unicode block "Combining
Diacritical Marks". There are 112 such characters and the number of combinations
can be reduced considerably.

A question about the implementation is which formats are allowed: Unicode
characters, HTML entities (named, decimal, hex) and also (?) %nn%nn%nn notation?
How would CharInsert interfere with normalisation? I understand that
normalisation is performed when a page is submitted as a preview or as a save
(but not with action=purge).
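
To make the notations concrete, here is a hedged sketch (Python standard library, not CharInsert code) showing that the literal character, the hexadecimal and decimal references, and the %nn%nn notation can all be decoded to the same code point. It uses U+030B, the character behind the %CC%8B links quoted earlier in this thread.

```python
import html
import unicodedata
from urllib.parse import unquote

literal = "\u030B"                    # the character itself
from_hex = html.unescape("&#x30B;")   # hexadecimal character reference
from_dec = html.unescape("&#779;")    # decimal character reference (0x30B == 779)
from_percent = unquote("%CC%8B")      # percent-encoded UTF-8 bytes CC 8B

# All four notations name the same single code point.
same = literal == from_hex == from_dec == from_percent
print(same, unicodedata.name(literal))
```

Whichever notations end up being accepted, decoding them to one internal form before building the links would keep the rest of the processing independent of how the editor typed the character.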

Once characters are added you may see them *only* when editing, as with :: _̯ ::,
which shows in the edit box as :: ̯ :: but disappears when saved. Other characters,
such as general punctuation characters and/or zero width space, you will not see at
all. This relates to comment 3 and comment 5 above.

I noticed that CharInsert conflicts (does not render well) with # lists. Please
look in the history of the url for appropriate examples.

The CVS url is
http://cvs.sourceforge.net/viewcvs.py/wikipedia/extensions/CharInsert .

Please contact Brion to become a sysop if you would like to test some issues
related to [[Devanāgarī]], [[Niqqud]] (bug 2399) issues or others. CharInsert
should work properly with all these characters.

best regards reinhardt [[user:gangleri]]

gangleri wrote:

*** Bug 4175 has been marked as a duplicate of this bug. ***

There's a test on http://fr.wiktionary.org/wiki/MediaWiki:Edittools
using <charinsert> ː ‿ ́ ̀ ̌ ̂ </charinsert> for IPA (API in French)

None of the combining diacritics are available. Yet they are in the wikitext.

gangleri wrote:

(In reply to comment #7)

> wiki markup syntax can not be used here. one could use
> ::foo:::bar::
> or
> foo/bar//

maybe
:::foo::bar:::
or
/foobar///
is better to parse

(In reply to comment #9)

> There's a test on http://fr.wiktionary.org/wiki/MediaWiki:Edittools
> using <charinsert> ː ‿ ́ ̀ ̌ ̂ </charinsert> for IPA (API in French)
>
> None of the combining diacritics are available. Yet they are in the wikitext.

[[wiktionary:fr:MediaWiki:Edittools]] is changed now.
I made a successful test at [[wiktionary:fr:MediaWiki:Edittools]].

Items discussed at # irc://irc.freenode.net/#wiktionary .

because of an FF bug !?

*questions*
a) Is there a way to configure Monobook.css so that characters from one
range, such as the Unicode blocks "Combining Diacritical Marks" and
"IPA Extensions", use FONT_A as preferred *font* and the other characters another
preferred font?
Why this question? [[MediaWiki:Edittools]] should be available when editing
*any* page (especially in WIKT projects). However, the best font for the content
language might differ from the best font for IPA. How could this be solved?
b) Is there a way to configure Monobook.css so that characters from one
range, such as the Unicode block "Hebrew", have one *size* and the others
*another* size?
You can see that I left the Hebrew characters at
http://test.leuksman.com/view/MediaWiki:Edittools *not* embedded in <small>
</small> in order to keep them legible.

*note*
It seems that CharInsert conflicts with <div ...> </div>.
See
http://test.leuksman.com/index.php?title=MediaWiki:Edittools&diff=10651&oldid=10639

gangleri wrote:

screen dump for bug 03821 MediaWiki:Edittools seen with FireFox

as discussed / agreed at IRC

Attached:

bugzilla_03821_Firefox_001.jpg (662×949 px, 128 KB)

gangleri wrote:

screen dump for bug 03821 MediaWiki:Edittools seen with IE and Lucida Sans Unicode

as discussed / agreed at IRC

Attached:

bugzilla_03821_IE_and_Lucida_Sans_Unicode_001.jpg (680×918 px, 126 KB)

gangleri wrote:

(In reply to comment #10)

> (In reply to comment #7)
>
> > wiki markup syntax can not be used here. one could use
> > ::foo:::bar::
> > or
> > foo/bar//
>
> maybe
> :::foo::bar:::
> or
> /foobar///
> is better to parse

bug 3550: Character insertion box should have titles for the characters
is about a "hint" for mouseover

a syntax could be
:::foo::bar::hint::: or :::foo::bar::: then hint==bar
or
/foobarhint/ or /foobar/// then hint==bar

*note* use only one, or use another character than ":" or "/"

Since this is a specialized extension, making yet another piece of base markup for it doesn't
make any sense. Resolving WONTFIX.

reopening the bug.
it is still not possible to insert combining diacritics on their own.
this bug is still relevant.

Putting the base character before the charinsert block (such as: a<charinsert>'</charinsert>) sort of works, but it looks rather ugly. See the Hebrew ("Héber") charset in the Hungarian Wikipedia edittools for example: http://hu.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Homokoz%C3%B3&action=edit (leads to sandbox)

Proposed syntax: <charinsert title="hacek" display="č">ˇ</charinsert> (in case of multiple characters in the charinsert block, all but the first are thrown away), which would result in something like <a title="hacek" href="javascript:insertTag('ˇ');return false;">č</a>. This also takes care of bug 3550 (important since some diacritics are hard to differentiate visually).

Ref #1è (Tisza Gergő): you write:
''in case of multiple characters in the charinsert block, all but the first are
thrown away''

Certainly not, this would be exactly against the purpose of the initial bug, as we really need the possibility of inserting a sequence of '''several''' characters when they do not exist as precomposed characters.

My proposal would be similar to yours, but the text element content of the charinsert tag would be '''fully''' inserted.

But given the fact that the text content of charinsert is a list of characters, the best that we can do is to allow separating them by spaces (including the possibility of using an initial space to avoid the collision and possible precomposition with the trailing ">" character that terminates the charinsert start tag). Then split this text by spaces, ignoring leading and trailing spaces.
Each remaining sequence becomes a candidate for insertion in the list.

Note that the characters in the text content may also be inserted using numeric character references (to their unicode code point value in decimal or hexadecimal). For full XML conformance (and easier editing in edit tools featuring characters not found in the native script and language of the host wiki), these numeric character references '''must''' still be interpreted equivalently.
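
A minimal sketch of the split-then-decode step described above, in Python rather than in the extension's own code; split_charinsert is a hypothetical helper name. Splitting happens before the references are decoded, so an encoded space stays inside its item, and only ASCII whitespace separates items, so a literal U+00A0 survives as an item.

```python
import html
import re

def split_charinsert(content: str) -> list[str]:
    """Split the text content of a charinsert block into insertable sequences."""
    # Split on ASCII whitespace only; leading and trailing separators yield
    # empty strings, which are dropped.
    tokens = [t for t in re.split(r"[ \t\r\n]+", content) if t]
    # Decode numeric character references after splitting, so that a literal
    # character and its decimal or hexadecimal reference are equivalent.
    return [html.unescape(t) for t in tokens]

items = split_charinsert(" \u025B\u0300 &#x25B;&#x301; &#770; ")
print([[f"U+{ord(c):04X}" for c in item] for item in items])
```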

But I would favor another syntax where the text content of the charinsert tag would always be the displayed string, using an attribute only to qualify some spans and limit the characters that will be actually inserted. For example:
<charinsert>č Č š Š ž Ž <char title="hacek" insert="ˇ">&#x25cc;ˇ</char></charinsert>

Note that for isolated diacritics (combining characters), the base character displayed before the diacritic in the selector (it is needed only there and should not be inserted) should probably be implied. Above I use U+25CC (◌) DOTTED CIRCLE, which is the recommended one (and supported in many fonts), to avoid confusion with the actual precomposed letters, which should be directly clickable and visibly distinct. In that case the insert attribute used above (which specifies the string actually inserted in the edited text when the displayed character is selected) could be avoided completely:
<charinsert>č Č š Š ž Ž <comb title="hacek">ˇ</comb></charinsert>

If some diacritics cannot be used with the default dotted circle, an "ignorable" (but displayed) quoted substring could be specified:
<charinsert>č Č š Š ž Ž <comb title="hacek"><q>&#x25cc;</q>ˇ</comb></charinsert>

Note that the separate <comb> element is not really different from the <char> element above. If the intent is to insert a single Unicode character, the fact that the referenced character is combining can be inferred directly from the Unicode character properties (there are not so many combining characters in Unicode, so they could easily be detected in Javascript from a small preinitialized array of booleans indexed by character, for fast lookups; note also that some Unicode combining characters are themselves decomposable, so the inserted text can be any string that starts with a combining character). In that case, this reduces the code to just:

<charinsert>č Č š Š ž Ž <span title="hacek"><q>&#x25cc;</q>ˇ</span></charinsert>

when specifying the base character explicitly (U+25CC here, even if this is the default), or

<charinsert>č Č š Š ž Ž <span title="hacek">ˇ</span></charinsert>

when using the default (the string to insert will just ignore the substrings between <q>, and if the resulting string still starts with a combining character, a leading U+25CC DOTTED CIRCLE will be displayed implicitly).
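
The detection from character properties could look roughly like the sketch below. It is written in Python with the standard unicodedata tables instead of the preinitialized Javascript array suggested above, and the function names are illustrative only. It tests the general category (Mn, Mc, Me) rather than the combining class, so marks whose combining class is 0 are still recognized.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def starts_with_combining(text: str) -> bool:
    """True if the first code point is a mark (general category Mn, Mc or Me)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

def display_form(text: str) -> str:
    """What the selector shows: a dotted circle is prepended before a leading mark."""
    return DOTTED_CIRCLE + text if starts_with_combining(text) else text

# U+030C COMBINING CARON gets the implied base; the precomposed letter U+010D
# and the spacing caron U+02C7 are shown unchanged.
for sample in ("\u030C", "\u010D", "\u02C7"):
    shown = display_form(sample)
    print([f"U+{ord(c):04X}" for c in shown])
```

In every case the text to insert would remain the original string; only the displayed label changes.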

However, I still think that using a separate <comb> element will be more explicit, and will allow a different presentation that can be customized: for example, the display tool could show the diacritic after a non-breaking space U+00A0 instead of a dotted circle, and use a distinctive background color, or display it within a table cell with a thin dotted border, according to the site's stylesheet or the user's preferences.

So I militate for:

<charinsert>č Č š Š ž Ž <comb>ˇ</comb></charinsert>

(the simplest form), with the following optional extensions:

<charinsert>č Č š <char title="S with caron">Š</char> ž Ž <comb title="hacek"><q>&nbsp;</q>ˇ</comb></charinsert>

The content of charinsert will be a free list of text elements or <char> elements or <comb> elements. Here is the DTD:

<!ELEMENT charinsert (#PCDATA | char | comb)* >
<!ELEMENT char (#PCDATA | q)* ><!-- force the normal presentation -->
<!ELEMENT comb (#PCDATA | q)* ><!-- force the alternate presentation for combining characters or strings starting by one -->
<!ELEMENT q (#PCDATA) >
<!ATTLIST comb
  title CDATA #IMPLIED ><!-- default is empty -->
<!ATTLIST char
  title CDATA #IMPLIED ><!-- default is empty -->

Note also that *not all* diacritics are combining in Unicode: this is true for Thai, which is encoded in visual order without using combining characters for leading diacritics. These will often still not display correctly if they are used before any random character (with which they may create ligatures, or they could simply be displayed with an undesired trailing dotted circle generated by the text renderer or by the font).

There is also the need to support the insertion of other "invisible" characters, notably format controls, and to render various spaces in the display tool so that they are easily distinguished (for example in clickable table cells).
In those cases, it may even be desirable not to display at all the character that will be inserted when the table cell is clicked (for example, "ZWJ", "ZWNJ", "NBSP"...). Generally, in those cases, a separate label will be used instead of the character itself, and this label should probably be displayed in a smaller font within the table cell.

The "title" attribute is not made for this: its role is to give a hint, which may be much longer than what is displayed in the clickable table cell (and too large to fit there cleanly). Such a hint will be displayed and made accessible elsewhere: for example in a "bubble", on the browser's status bar, or in any other more convenient single element within the HTML page whose content is refreshed to display this hint string when the table cell is active or hovered by the mouse. It may also be vocalized, or directed (out of flow) to the contextual helper line of a Braille reader, according to the local preferences of the user, who can reserve a part of the display pad for HTML "title" hints or image descriptions. The title attribute is therefore descriptive and can be arbitrarily long (it could be several sentences); it is not intended to contain visual abbreviations like "NBSP".

To support an alternate representation of the character using abbreviations (which will replace the actual rendering of the character within the table cell), we can use another optional attribute. It may even be preferable to make visual distinctions for format controls:

<!ELEMENT charinsert (#PCDATA | char | comb | ctrl)* ><!-- new definition here: adding ctrl -->
<!ELEMENT ctrl (#PCDATA | q)* ><!-- force the alternate presentation for controls or strings starting by one -->
<!ATTLIST ctrl
  title CDATA #IMPLIED
  alt   CDATA #IMPLIED ><!-- both default to empty -->
<!ATTLIST comb
  title CDATA #IMPLIED
  alt   CDATA #IMPLIED ><!-- both default to empty -->
<!ATTLIST char
  title CDATA #IMPLIED
  alt   CDATA #IMPLIED ><!-- both default to empty -->

For example:

<charinsert><char title="non-breaking space" alt="NBSP">&nbsp;</char></charinsert>
<charinsert><comb title="acute accent" alt="ʹ">&#x301;</comb></charinsert>
<charinsert><comb title="dotted circle">&#x25CC;</comb></charinsert>
<charinsert><ctrl title="zero-width space" alt="ZWSP">&#x200B;</ctrl></charinsert><!-- this is not a format control -->
<charinsert><ctrl title="zero-width non-joiner" alt="ZWNJ">&#x200C;</ctrl></charinsert><!-- this is a format control -->
<charinsert><ctrl title="left-to-right override" alt="»">&#x202D;</ctrl></charinsert>
<charinsert><ctrl title="right-to-left override" alt="«">&#x202E;</ctrl></charinsert>
<charinsert><char title="narrow non-breaking space" alt="NNBSP">&#x202F;</char></charinsert><!-- this is not a control, it is visible ! -->

Separating the <char>, <comb>, and <ctrl> elements allows distinct presentations when needed (such as distinct table cell background colors in a character selector). The <q> element will only be usable when there is no "alt=" attribute, since "alt=" removes the display of the actual character within the rendered table cell.

The U+25CC default base character (DOTTED CIRCLE) for sequences starting by a combining character will be implied and generated automatically by default, if:

  • the text content within the <comb> or <ctrl> or <char> (including the text within <q> elements which are also rendered by simple concatenation, but discarded from the actually inserted text) starts by a combining character (even if its "combining class" is 0, which just means that it can just never be precomposed with a prior base character, and never be reordered through normalization).
  • and there's no alt attribute in the <comb> or <ctrl> or <char> element start tag.

It will also be used implicitly if the characters within <charinsert> are packed in a space-separated string without a surrounding <comb>, <ctrl> or <char> to subqualify them (meaning that by default they are treated as if each space-separated sequence was within a <char> element with unspecified "alt=" and "title=" attributes).

It will be also illegal to use <q> within the content of <charinsert>, except within the content of <char> or <ctrl> or <comb> as it would not be clear where to associate them with surrounding space separated char sequences. They must be delimited, at least within a <char> without attributes.

The text content of a single <char>, <comb> or <ctrl> element can contain any text; it is not restricted to a single Unicode character. It can also include spaces (however, the spaces are implicitly packed, with leading and trailing spaces discarded). If one still wants to include a literal space within the text to insert, one that must not be discarded completely, I propose adding the <space> element:

<!ELEMENT space EMPTY >
<!ATTLIST space
  title CDATA #IMPLIED
  alt   CDATA #IMPLIED ><!-- both default to empty -->

And allow it within the content of <charinsert>, <ctrl>, <comb> and <char> :

<!ELEMENT charinsert (#PCDATA | space | char | comb | ctrl)* ><!-- new definition here: adding space -->
<!ELEMENT ctrl (#PCDATA | space | q)* ><!-- force the alternate presentation for controls or strings starting by one -->
<!ELEMENT char (#PCDATA | space | q)* ><!-- force the normal presentation -->
<!ELEMENT comb (#PCDATA | space | q)* ><!-- force the alternate presentation for combining characters or strings starting by one -->

All these definitions are easy to parse through XML DOM in the Wiki parser. I hope they are precise enough. Comments are welcome.

Philippe.

Note that I prefer using a XML/DOM syntax, instead of a Wiki-like syntax, because parsing wiki is already very tricky, and <charinsert> has a very limited usage and is not intended to be used within normal wiki pages, but only within the "Mediawiki:" configuration and localization namespace. I think it's best to use existing (and strong) DOM parsers here.
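
As a rough illustration of how little code a DOM-style parse needs, here is a sketch using Python's xml.etree.ElementTree; this is not the parser MediaWiki would use, and button_for is a made-up helper. It handles only <char>, <comb> and <ctrl> children and ignores loose text. The sample uses numeric character references throughout, because a plain XML parser does not know the named entity &nbsp; unless a DTD declares it.

```python
import xml.etree.ElementTree as ET

def button_for(elem):
    """Return (display, insert, title) for one <char>, <comb> or <ctrl> element."""
    display = insert = elem.text or ""
    for child in elem:
        if child.tag == "q":
            display += child.text or ""   # <q> content is shown but never inserted
        tail = child.tail or ""           # text following the child is shown and inserted
        display += tail
        insert += tail
    alt = elem.get("alt")
    # When alt= is given, it replaces the rendering of the character itself.
    return (alt if alt is not None else display, insert, elem.get("title", ""))

SOURCE = (
    "<charinsert>"
    '<comb title="hacek"><q>&#x25CC;</q>&#x30C;</comb>'
    '<char title="non-breaking space" alt="NBSP">&#xA0;</char>'
    "</charinsert>"
)

root = ET.fromstring(SOURCE)
buttons = [button_for(e) for e in root if e.tag in ("char", "comb", "ctrl")]
for display, insert, title in buttons:
    print(repr(display), repr(insert), title)
```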

As an alternative, if space separation is not an option to separate sequences, because character lists within <charinsert> are preferably packed as much as possible like:
<charinsert>a-zA-Z àÀâÂäÄ çÇ éÉèÈêÊëË îÎïÎ ôÔöÖ ùÙûÛüÜ ÿŸ</charinsert>
where the extensions (<char>, <ctrl>, <comb>, <q> elements above) are just present when needed, then text elements contained (separately from the <char>, <ctrl> and <comb>) will be parsed specially: spaces will be ignored there, unless they are specified as a content-less <space> element.

The only way to separate elements that must contain several characters (including spaces) that must be inserted as whole will be to separate them by embedding them within <char>, <ctrl> or <comb> subelements as above. So when parsing the text child elements of <charinsert>, what we get is a list of isolated Unicode characters (which are implicitly converted as if they were each within a <char> child element, unless they are combining or controls according to their Unicode properties, in which case they are converted into the relevant <comb> or <ctrl>), and of <char> elements (or <ctrl> or <comb> elements) or <space>. The list is explicitly ordered for the default display order (but the user interface may allow this list to be sorted differently, without having to edit it).

An interesting option for specifying ranges of characters is to support the A-Z syntax, in the text child elements of <charinsert> only (but not within the child elements of <char>, <ctrl> or <comb>, which can only be used for a single unbreakable sequence, and not in <space>, which is empty, or in <q>, which is used for displaying ignorable characters, and not in the "alt=" and "title=" attributes above). In that case the HYPHEN-MINUS (-) is a syntactic character which is only valid between two isolated Unicode characters. To specify a literal hyphen-minus as an insert string, use <char>-</char>.
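
A possible reading of this range rule, sketched in Python; expand_ranges is a hypothetical name, and rejecting descending ranges is an assumption, since the proposal does not say what should happen there. Spaces are skipped, as proposed for packed text, and a hyphen-minus that is not between two characters is rejected.

```python
def expand_ranges(text: str) -> list[str]:
    """Expand 'a-z' style ranges; any other character is an item by itself."""
    items: list[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                      # spaces are ignored in packed text
        elif ch == "-":
            raise ValueError("hyphen-minus is only valid between two characters")
        elif i + 2 < len(text) and text[i + 1] == "-":
            first, last = ord(ch), ord(text[i + 2])
            if first > last:
                raise ValueError(f"descending range {ch}-{text[i + 2]}")
            items.extend(chr(cp) for cp in range(first, last + 1))
            i += 3
        else:
            items.append(ch)
            i += 1
    return items

print(expand_ranges("a-c X"))
print(len(expand_ranges("a-zA-Z")))
```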

To specify a literal string to insert (for example a wikitable), also use <char>, as in:
<charinsert>
<char title="insert a wikitable" alt="Table">{|class="wikitable"
|-
! Head1 !! Head2
|-
| Cell1 || Cell2
|}
</char>
</charinsert>

It may also be interesting to use an image/icon instead of a "alt" text (rendered in a small font within the selector table cell). Instead of using <char> or <ctrl> or <comb>, use an <icon> element instead, whose child text element will be the text to insert in the wiki editor:

<!ELEMENT charinsert (#PCDATA | space | char | comb | ctrl | icon)* ><!-- new definition here: adding icon -->
...
<!ELEMENT icon (#PCDATA | space)* >
<!ATTLIST icon
  title CDATA #IMPLIED
  alt   CDATA #IMPLIED
  src   CDATA #IMPLIED >
<!-- title: default is empty: the descriptive hint -->
<!-- alt: default is empty: a small alternative text if the image can't be loaded -->
<!-- src: URI : a page name URN ... -->

As the default rendering of <charinsert> would probably generate a list of table cells, it may need other things such as the maximum number of cells to render in each row, and row breaks, or row (unselectable) headers for subgroups of characters or strings.

Adding headers will be tricky, because a header can be any kind of visual content in the GUI, so it may be any mix of HTML and/or wikitext... but maybe it can be restricted to basic group titles only (and the wiki parser will generate the appropriate interface for the character selector). Groups could be, for example, the name of a script, or the name of a language, orthography or notation which needs those characters or strings to insert.

Generally, we'll have a list of groups (possibly rendered as a combo box within a <select> form input element). But it may contain optional subgroups (for example in the larger sets of ideograms or Hangul syllables, or within the Latin script group for romanized languages). The purpose of <charinsert> is of course to generate a form with clickable buttons. It should be able to generate arbitrary lists of buttons and allow them to be structured and possibly not all displayed at the same time: if a group is selected, its list replaces the list of subgroups currently displayed for the group previously selected at the same level. Comboboxes are not necessarily the best option (for usability), as they can be replaced by a horizontal "ribbon" that lists either individual buttons (instances of <char>, <ctrl>, <comb>, <space>) or other subgroups, all listed inline within it: pressing a subgroup just activates another ribbon at the lower level, and ribbons can be stacked vertically.

In that case we'll have recursively:

<!ELEMENT charinsert (#PCDATA | charinsert | space | char | comb | ctrl | icon)* ><!-- new -->

And to name each group, all that is required is that the <charinsert> element itself also has the "title=" attribute for the description hint (which should then be required), and the "alt=" for the text rendered for the group header itself (optional, default will be the same as the title).

Will it need an icon rendered instead of this title text? Should it need more HTML/CSS rendering, we could then allow <charinsert> to contain a <caption> element as its first child, whose content would be rendered instead of the title and could contain arbitrary HTML/CSS code, including images. If that "rich" caption cannot be rendered, the title should still be present and used (for example when rendering a <charinsert> that represents a group without any subgroup as a named item within a combo box generated from a higher-level <charinsert> container that allows selecting within a list of groups). Comboboxes are precious to save screen space, but their rendering is more limited than general ribbons (toolbars, or blocks of inline elements), as they can only show short strings of text with limited formatting.

It may also be interesting to allow other dynamic actions when the table cell, button or span is clicked: by default it uses a Javascript charsinsert() event, but why not allow something else, such as calling custom Javascript? In that case, <charinsert> could use an optional attribute containing the default Javascript event handler to call, which would be inherited by all elements in its parsed list of characters or strings, and the same attribute could be overridden by specifying it within <char>, <ctrl>, <comb> or <space>. This optional attribute should be onclick="" (and there may optionally be onmouseover="" and so on).

We could also have CSS attributes, for the <charinsert> element (specifying the default CSS attributes of each generated button) or specifically for each of the <char>, <ctrl>, <comb> and <space> child elements (if we assume that each of them will visually generate a table cell, i.e. a rectangular inline span containing one or more visual blocks). But I don't think that <charinsert> should carry any CSS for the whole container: it will probably generate an inline span, and if needed it can be formatted by including it within a usual HTML <span> or <div> element.

matmarex subscribed.

I'll be honest, I did not even read the awesomely long comments above. I would like to propose a very simple solution: just ensure the links have a minimum width. That will make them clickable even if the characters to insert are of zero width.

The following patch makes code like <charinsert>̈ ̀ ̃ ̋ ̭</charinsert> work as expected. (Note that there are spaces between the individual combining marks.)

Change 406792 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/CharInsert@master] Enforce minimum width for very narrow or zero-width items

https://gerrit.wikimedia.org/r/406792

Change 406792 merged by jenkins-bot:
[mediawiki/extensions/CharInsert@master] Enforce minimum width for very narrow or zero-width items

https://gerrit.wikimedia.org/r/406792