Setting labels should normalize some things, API should return the actual label on success
Closed, Resolved (Public)

Description

When setting the label of an item (via the API), some normalization should be done. The case I am thinking about right now is several consecutive spaces within the label, as in "My    Item", where the four spaces should be replaced with one. This would be consistent with MediaWiki page titles, where the same thing is done.

Also, the API should return the label when setting it, so we can grab it and display it to the user in the UI accordingly.


Version: unspecified
Severity: normal
Whiteboard: storypoints: 5
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=36432

Details

Reference
bz36439

Event Timeline

bzimport raised the priority of this task from to Unbreak Now!. Nov 22 2014, 12:27 AM
bzimport set Reference to bz36439.
bzimport added a subscriber: Unknown Object (MLST).

See also "Bug 36432 - Normalize titles and namespaces". Whitespace (and also the underscore) is stripped before and after the text, in some places it is also stripped inside the text, some upper-/lowercasing is done, and so forth.

It is not clear where ordinary normalization should be done, that is, in the API or in the WikibaseItem.

If the strings somehow change before, during, or after storing, the pre- and post-normalized forms should be reported so the UI can adjust itself accordingly.

The API does report the new values "as is" after they are set, in the same style as the rest of the API, but this is somewhat cumbersome to unwind later. They are reported in a "normalized" structure with "from" and "to" if they differ, but this can later lead to an inconsistency if several language attributes are set at the same time.

A better solution would be to unconditionally report back the structure as it actually is after the changes.
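The "report back unconditionally" idea can be sketched roughly as follows. This is only an illustration: the LabelResult type, its JSON field names, and the setLabels function are hypothetical and are not the actual Wikibase API format, and the whitespace handling is a stand-in for whatever normalization the Repo really applies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// LabelResult is a hypothetical response entry holding the value as stored.
type LabelResult struct {
	Language string `json:"language"`
	Value    string `json:"value"`
}

// setLabels pretends to store the submitted labels and returns the stored
// state for every language, whether or not normalization changed the value.
func setLabels(submitted map[string]string) map[string]LabelResult {
	out := make(map[string]LabelResult, len(submitted))
	for lang, raw := range submitted {
		// Stand-in normalization (assumption): trim and collapse whitespace.
		stored := strings.Join(strings.Fields(raw), " ")
		out[lang] = LabelResult{Language: lang, Value: stored}
	}
	return out
}

func main() {
	resp := setLabels(map[string]string{"en": "My    Item", "nb": "Min ting"})
	b, err := json.Marshal(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because every language appears in the response, the UI never has to guess whether a missing "from"/"to" pair means "unchanged" or "not reported".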

This actually does work for the labels if I am not mistaken -- but it does not seem to work for descriptions and aliases.

There is a very rudimentary mechanism in place for labels. I propose we treat labels and aliases similarly to titles, but I am less sure how harshly we should normalize the description. I'm tempted to do something similar to what is done for summaries, that is, allow links but disallow templates.

The following normalization should be done for Labels, Descriptions, and Aliases:

  • Unicode normalization of the labels to be done on the Repo.
  • Trimming
  • Internal whitespace compression

The UI should display the returned normalized value.
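A minimal sketch of the trimming and internal whitespace compression steps listed above, using only the Go standard library. The function name normalizeWhitespace is made up for illustration. The Unicode normalization step is left out because the standard library has no NFC function; a package such as golang.org/x/text/unicode/norm would be one candidate for that step.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWhitespace trims leading and trailing whitespace and collapses
// every internal run of whitespace into a single space.
func normalizeWhitespace(s string) string {
	// strings.Fields splits around runs of Unicode whitespace and never
	// returns empty fields, so joining with one space does both jobs.
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", normalizeWhitespace("  My    Item \t")) // "My Item"
}
```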

Note

  • the vast majority of input data is already in form C, using precomposed characters
  • Form C is supposed to be relatively lossless, with the only changes being invisible transformations between base character + combining character sequences and precomposed chars. In theory text should never change appearance because it's been normalized to form C.
  • and further, the W3C recommends it

http://www.mediawiki.org/wiki/Unicode_normalization_considerations#What_is_it.3F

This means that an accented character works if it can be normalized into a precomposed character. For example, O₂ and O² work because they can be normalized into precomposed characters. The sequence of an a followed by U+030A COMBINING RING ABOVE might be interpreted as U+00E5 LATIN SMALL LETTER A WITH RING ABOVE, but it can also be interpreted as an a followed by a small ring. The same thing happens with a lot of accented letters.

There is also the problem of similar-looking characters, as the following shows:

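package main

import "fmt"

func main() {
	// U+212B ANGSTROM SIGN, UTF-8 bytes E2 84 AB
	a1 := string([]byte{0xe2, 0x84, 0xab})
	// U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, UTF-8 bytes C3 85
	a2 := string([]byte{0xc3, 0x85})
	// The glyphs look alike, but the byte sequences differ.
	fmt.Println(a1, a2, a1 == a2)
}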

Prints:

Å Å false

The first character is U+212B ANGSTROM SIGN, while the second is U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, which is the usual character in Danish and Norwegian.

For now the aliases, labels and descriptions will be normalized into form C, then the text will be trimmed of leading and trailing whitespace, and internal whitespace will be compressed. Whitespace will only be handled for a limited set of whitespace characters.
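To illustrate what "a limited set of whitespace characters" could mean in practice, here is a small sketch. The set chosen here (space, tab, line feed, carriage return) and the names isHandledSpace and trimAndCompress are assumptions for illustration; the thread does not say which characters are actually in the set.

```go
package main

import (
	"fmt"
	"strings"
)

// isHandledSpace reports whether r belongs to the limited whitespace set.
// The set (space, tab, LF, CR) is an assumption, not taken from the source.
func isHandledSpace(r rune) bool {
	switch r {
	case ' ', '\t', '\n', '\r':
		return true
	}
	return false
}

// trimAndCompress drops leading and trailing handled whitespace and replaces
// each internal run of handled whitespace with a single space.
func trimAndCompress(s string) string {
	return strings.Join(strings.FieldsFunc(s, isHandledSpace), " ")
}

func main() {
	fmt.Printf("%q\n", trimAndCompress("\t My \n  Item  "))
	// U+00A0 NO-BREAK SPACE is not in the set, so it is left untouched.
	fmt.Printf("%q\n", trimAndCompress(" My\u00a0\u00a0Item "))
}
```

With a limited set, characters such as the no-break space pass through unchanged, so two labels that look identical on screen can still differ after normalization.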

Thanks for the write up!

What would

toNFC(a1) == toNFC(a2)

return?

Some results from normalization:

Source     Encoded            Normalized       Comment
Åland      %C3%85land         %C3%85land       precomposed code point for the character
Åland      A%CC%8Aland        %C3%85land       A followed by combining ring above
Ångstrom   %E2%84%ABngstrom   %C3%85ngstrom    the initial letter is the code point for a unit

So it seems that our current normalization (form C) rewrites a capital letter A followed by a combining ring above into a single valid code point.
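The "encoded" values above are the percent-encoded UTF-8 bytes of each source string. A short sketch that reproduces them with the Go standard library (the helper name encode is made up for illustration):

```go
package main

import (
	"fmt"
	"net/url"
)

// encode percent-encodes the UTF-8 bytes of s, leaving ASCII letters as-is.
func encode(s string) string {
	return url.PathEscape(s)
}

func main() {
	sources := []string{
		"\u00C5land",    // precomposed A with ring above
		"A\u030Aland",   // A followed by combining ring above
		"\u212Bngstrom", // angstrom sign
	}
	for _, s := range sources {
		fmt.Println(encode(s))
	}
	// Prints:
	// %C3%85land
	// A%CC%8Aland
	// %E2%84%ABngstrom
}
```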

"Characters are decomposed and then recomposed by canonical equivalence."

It seems it will only fail in cases with multiple combining characters, but I'm not sure whether that will ever happen.
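The "decomposed and then recomposed" behaviour quoted above, and the earlier question about toNFC(a1) == toNFC(a2), can be illustrated with a toy sketch. This is not a real NFC implementation: the two tables are hand-written and cover only the characters discussed in this thread, toyNFC is a made-up name, and the multiple-combining-character case is not attempted. Real code would rely on a full implementation such as golang.org/x/text/unicode/norm.

```go
package main

import "fmt"

// decompose maps a few code points to base letter + combining mark.
// Toy table for illustration; real NFC uses the full Unicode data.
var decompose = map[rune]string{
	'\u212B': "A\u030A", // ANGSTROM SIGN
	'\u00C5': "A\u030A", // LATIN CAPITAL LETTER A WITH RING ABOVE
	'\u00E5': "a\u030A", // LATIN SMALL LETTER A WITH RING ABOVE
}

// compose maps base letter + combining mark back to one code point.
// The angstrom sign is never a composition result.
var compose = map[[2]rune]rune{
	{'A', '\u030A'}: '\u00C5',
	{'a', '\u030A'}: '\u00E5',
}

// toyNFC decomposes every known code point, then recomposes adjacent pairs.
func toyNFC(s string) string {
	var dec []rune
	for _, r := range s {
		if d, ok := decompose[r]; ok {
			dec = append(dec, []rune(d)...)
		} else {
			dec = append(dec, r)
		}
	}
	var out []rune
	for _, r := range dec {
		if n := len(out); n > 0 {
			if c, ok := compose[[2]rune{out[n-1], r}]; ok {
				out[n-1] = c
				continue
			}
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	a1 := "\u212B"  // angstrom sign
	a2 := "\u00C5"  // A with ring above
	a3 := "A\u030A" // A followed by combining ring above
	fmt.Println(a1 == a2, toyNFC(a1) == toyNFC(a2), toyNFC(a3) == a2)
	// Prints: false true true
}
```

Under these assumptions the comparison after normalization returns true, which matches the table: all three spellings end up as %C3%85.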

In my opinion, this works now, case closed.

See also http://en.wikipedia.org/wiki/Unicode_normalization#Normalization

Just for the record, conversion of the initial letter in Ångstrøm into a normal codepoint for Å seems a little bit weird.

Verified in Wikidata demo time for sprint 8