Turkish needs lc / uc methods
Open, Low, Public

Description

Split from bug 31490.

Our Turkish language class lacks a proper implementation of the lc() and uc() methods. Turkish uses both a dotted i and a dotless ı, which means that I and i are actually different letters in that language!

Useful context to read is https://en.wikipedia.org/wiki/Dotted_and_dotless_I
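As a rough illustration of the casing rules involved, here is a minimal Python sketch. The function names turkish_lc and turkish_uc are hypothetical and are not MediaWiki's Language API; the sketch also assumes the input uses precomposed characters (İ as a single code point).

```python
# Hypothetical helpers for illustration only; not MediaWiki code.

def turkish_lc(text: str) -> str:
    """Lower-case using Turkish rules: I -> ı (dotless), İ -> i (dotted)."""
    # Handle the two special capitals first, then lower-case everything else.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_uc(text: str) -> str:
    """Upper-case using Turkish rules: i -> İ (dotted), ı -> I (dotless)."""
    # Handle the two special small letters first, then upper-case the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()


print(turkish_lc("IŞIK"))      # ışık
print(turkish_uc("istanbul"))  # İSTANBUL
```

A language-neutral lower() would instead turn I into a dotted i, which is the wrong letter for Turkish text.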

An implementation was deployed on the WMF wikis for MediaWiki 1.18, but it was reverted by r99289 and r99290 because the patches broke magic words and related parser functions (e.g. {{#lcfirst}}) on the Turkish wikis.

The MediaWiki code that handles magic words normalizes them to lower case using the content language (look for lc() calls in the MagicWord class). Hence a magic word such as LCFIRST is treated just like any Turkish word (since we use the content language): it ends up lower-cased with a dotless ı (lcfırst), and the word is not found.
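The failure can be reproduced in a few lines. This is a simplified sketch, not the MagicWord class itself: MAGIC_WORDS, find_magic_word and the two lc functions are hypothetical stand-ins, and the table is assumed to hold lower-cased ASCII keys.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the magic word table (lower-cased keys).
MAGIC_WORDS = {"lcfirst": "lcfirst", "ucfirst": "ucfirst"}


def english_lc(text: str) -> str:
    return text.lower()


def turkish_lc(text: str) -> str:
    # Turkish rules: I -> ı (dotless), İ -> i (dotted).
    return text.replace("I", "ı").replace("İ", "i").lower()


def find_magic_word(word: str, lc: Callable[[str], str]) -> Optional[str]:
    """Normalize the word with the content language's lc, then look it up."""
    return MAGIC_WORDS.get(lc(word))


print(find_magic_word("LCFIRST", english_lc))  # lcfirst
print(find_magic_word("LCFIRST", turkish_lc))  # None, the key became "lcfırst"
```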

Two possibilities:

  • magic words could optionally be made an array that references a language. We could then use that language's lc / uc implementations.
  • for the Turkish language, forge magic word aliases with both the dotted and the dotless i, e.g. 'ucfirst' (dotted i) could have an alias 'ucfırst' (dotless ı). Both would then be valid.
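Both possibilities could look roughly like the sketch below. All names here (LC_BY_LANGUAGE, MAGIC_WORDS_WITH_LANGUAGE, find_with_language, ALIAS_TO_CANONICAL, find_with_aliases) are hypothetical and chosen for illustration; the real magic word storage is organized differently.

```python
from typing import Optional


def turkish_lc(text: str) -> str:
    # Turkish rules: I -> ı (dotless), İ -> i (dotted).
    return text.replace("I", "ı").replace("İ", "i").lower()


LC_BY_LANGUAGE = {"en": str.lower, "tr": turkish_lc}

# Option 1: each magic word records the language whose case rules apply to it.
MAGIC_WORDS_WITH_LANGUAGE = [("en", "lcfirst"), ("en", "ucfirst")]


def find_with_language(word: str) -> Optional[str]:
    for lang, canonical in MAGIC_WORDS_WITH_LANGUAGE:
        if LC_BY_LANGUAGE[lang](word) == canonical:
            return canonical
    return None


# Option 2: keep normalizing with the Turkish lc, but register a dotless
# alias next to every magic word that contains an i.
ALIAS_TO_CANONICAL = {}
for canonical in ("lcfirst", "ucfirst"):
    ALIAS_TO_CANONICAL[canonical] = canonical
    ALIAS_TO_CANONICAL[canonical.replace("i", "ı")] = canonical


def find_with_aliases(word: str) -> Optional[str]:
    return ALIAS_TO_CANONICAL.get(turkish_lc(word))


print(find_with_language("LCFIRST"))  # lcfirst
print(find_with_aliases("LCFIRST"))   # lcfirst, found via the alias "lcfırst"
```

Option 1 changes how magic words are stored, while option 2 only adds data for Turkish, at the cost of doubling the aliases for every word that contains an i.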

Optionally, parser functions could accept a parameter to change the language being used. This would let Turkish projects use the English lc / uc functions, for example to upper-case iPhone to IPhone (dotless I).
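A language parameter on a parser function might behave like this sketch. The ucfirst signature and its lang argument are assumptions made for illustration, not an existing MediaWiki interface.

```python
def turkish_uc(text: str) -> str:
    # Turkish rules: i -> İ (dotted), ı -> I (dotless).
    return text.replace("i", "İ").replace("ı", "I").upper()


UC_BY_LANGUAGE = {"en": str.upper, "tr": turkish_uc}


def ucfirst(text: str, lang: str = "tr") -> str:
    """Upper-case only the first character, using the rules of `lang`."""
    return UC_BY_LANGUAGE[lang](text[:1]) + text[1:]


print(ucfirst("iPhone"))             # İPhone (content language rules)
print(ucfirst("iPhone", lang="en"))  # IPhone (English rules, dotless I)
```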

Details

Reference
bz33643

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:06 AM
bzimport set Reference to bz33643.
bzimport added a subscriber: Unknown Object (MLST).
  • Bug 33299 has been marked as a duplicate of this bug.
  • Bug 32707 has been marked as a duplicate of this bug.
  • Bug 40012 has been marked as a duplicate of this bug.

This is still an ongoing issue, though I am not working on it myself.

vitomedia wrote:

The issue with system messages on the TR projects is quite annoying (see Bug 40012), so it'd be fantastic if it could be worked out.

hashar updated the task description.
hashar added a subscriber: Arrbee.

Hi @Arrbee, I am triaging some old bugs of mine. I believe this one will fit the language team pretty well but might involve people from the Parsing team.

The summary is that Turkish has two characters for i, one with a dot and one without (https://en.wikipedia.org/wiki/Dotted_and_dotless_I), and the lc and uc methods need to take that into account (we already do for lcfirst and ucfirst). A few patches I wrote ages ago ended up being reverted because lc is also used to normalize wikitext magic words using the content language. Hence LCFIRST ended up normalized to lcfırst (with a lower case dotless i), which is not in the magic word array. It also broke reaching special pages.

So I don't quite know how to fix it, but it would be nice to have someone who knows the internals of our language classes dig into it, possibly with the help of people who actually know Turkish :]