
special firstChar() routine for Korean characters
Closed, Resolved, Public

Description

Author: puzzlet

Description:
Since written Korean (hangul) is syllabic, pages in a category are sectioned by
their initial syllable rather than by letter or phoneme. As a result, almost
every page ends up in its own section. Look at the URL, which is the Korean
equivalent of Category:People in the English Wikipedia. On the Korean category
page, many pages have a section to themselves; the English analogue would be
Category:Austrian_people falling in an "Au" section, Category:Polish_people
falling in a "Pol" section, etc. (They could be recategorized to
Category:People_by_nationality of course, but that's not the point of the
discussion.)

Every hangul syllable block can be divided into consonants and vowels, so a
better index scheme for category pages would be to section pages by the initial
consonant of their first syllable:

  • articles starting with anything from 가(U+AC00) to 낗(U+B097) under a section titled ㄱ(U+1100),
  • from 나(U+B098) to 닣(U+B2E3) under ㄴ(U+1102),
  • from 다(U+B2E4) to 띻(U+B77B) under ㄷ(U+1103),
  • from 라(U+B77C) to 맇(U+B9C7) under ㄹ(U+1105),
  • from 마(U+B9C8) to 밓(U+BC13) under ㅁ(U+1106),
  • from 바(U+BC14) to 삫(U+C0AB) under ㅂ(U+1107),
  • from 사(U+C0AC) to 앃(U+C543) under ㅅ(U+1109),
  • from 아(U+C544) to 잏(U+C78F) under ㅇ(U+110B),
  • from 자(U+C790) to 찧(U+CC27) under ㅈ(U+110C),
  • from 차(U+CC28) to 칳(U+CE73) under ㅊ(U+110E),
  • from 카(U+CE74) to 킿(U+D0BF) under ㅋ(U+110F),
  • from 타(U+D0C0) to 팋(U+D30B) under ㅌ(U+1110),
  • from 파(U+D30C) to 핗(U+D557) under ㅍ(U+1111),
  • and from 하(U+D558) to 힣(U+D7A3) under ㅎ(U+1112).
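
As a rough illustration of the proposed scheme, the ranges above can be turned into a lookup table. This is a Python sketch, not MediaWiki's PHP code; the names `hangul_section`, `_RANGE_STARTS` and `_SECTION_TITLES` are made up for this example, and the code points are copied from the list.

```python
from bisect import bisect_right
from typing import Optional

# First code point of each range in the list above (hypothetical table name).
_RANGE_STARTS = [0xAC00, 0xB098, 0xB2E4, 0xB77C, 0xB9C8, 0xBC14, 0xC0AC,
                 0xC544, 0xC790, 0xCC28, 0xCE74, 0xD0C0, 0xD30C, 0xD558]
# Section title code point for each range, in the same order.
_SECTION_TITLES = [0x1100, 0x1102, 0x1103, 0x1105, 0x1106, 0x1107, 0x1109,
                   0x110B, 0x110C, 0x110E, 0x110F, 0x1110, 0x1111, 0x1112]
_HANGUL_LAST = 0xD7A3  # last code point of the final range


def hangul_section(ch: str) -> Optional[str]:
    """Return the section title for a Hangul syllable, or None otherwise."""
    cp = ord(ch)
    if not (_RANGE_STARTS[0] <= cp <= _HANGUL_LAST):
        return None
    # Index of the last range whose start is <= cp.
    idx = bisect_right(_RANGE_STARTS, cp) - 1
    return chr(_SECTION_TITLES[idx])
```

With this table, `hangul_section("\uB097")` and `hangul_section("\uAC00")` both give the U+1100 section, while `hangul_section("A")` gives `None`, so non-Hangul titles are left to the existing logic.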

Version: unspecified
Severity: enhancement
URL: http://ko.wikipedia.org/wiki/Category:%EC%9D%B8%EB%AC%BC

Details

Reference
bz1701

Related Objects

Status  Subtype  Assigned  Task
Declined  None
Resolved  None

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 8:14 PM
bzimport set Reference to bz1701.
bzimport added a subscriber: Unknown Object (MLST).

avarab wrote:

A duplicate of bug 1984.

*** This bug has been marked as a duplicate of 1984 ***

puzzlet wrote:

Patch for LanguageUtf8.php

Attached:

puzzlet wrote:

Changes in LanguageKo.php work fine on the Korean Wikipedia, but multilingual
projects like Meta-wiki and Wikisource need to be updated too. I attached the
patch file, which only modifies firstChar() to treat the Hangul Syllables area
(U+AC00 ~ U+D7A3) specially; for any other character it behaves as before. But
I'm not sure which file is the appropriate one to patch - Language.php or
LanguageUtf8.php. Take this for a test -
http://wikisource.org/wiki/Category:%ED%95%9C%EA%B5%AD%EC%96%B4 - which should
have no more than 10 sections after the commit.
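
A minimal sketch of the behaviour described in this comment, written in Python rather than PHP and not taken from the attached patch. It relies on the Hangul Syllables block holding 21 × 28 = 588 syllables per initial consonant, and it assumes doubled initial consonants are folded into their plain counterparts, as the ranges in the description do. `first_char` and `INITIAL_TO_SECTION` are hypothetical names.

```python
HANGUL_BASE = 0xAC00              # first code point of the Hangul Syllables area
HANGUL_LAST = 0xD7A3              # last code point of the Hangul Syllables area
SYLLABLES_PER_INITIAL = 21 * 28   # 21 vowels x 28 finals = 588

# Index = initial consonant number (0..18) in Unicode order.
# Doubled consonants share the section of the plain consonant before them.
INITIAL_TO_SECTION = [
    0x1100, 0x1100, 0x1102, 0x1103, 0x1103, 0x1105, 0x1106,
    0x1107, 0x1107, 0x1109, 0x1109, 0x110B, 0x110C, 0x110C,
    0x110E, 0x110F, 0x1110, 0x1111, 0x1112,
]


def first_char(title: str) -> str:
    """Return the section key for a page title (illustrative only)."""
    if not title:
        return ""
    ch = title[0]
    cp = ord(ch)
    if HANGUL_BASE <= cp <= HANGUL_LAST:
        # Special case: reduce the syllable to its initial consonant.
        initial = (cp - HANGUL_BASE) // SYLLABLES_PER_INITIAL
        return chr(INITIAL_TO_SECTION[initial])
    # Any other character is returned unchanged, as before.
    return ch
```

For example, `first_char("Polish people")` still returns `"P"`, while a title starting with U+D55C maps to the U+1112 section.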

puzzlet wrote:

It's now OK for Korean Wikisource (
http://ko.wikisource.org/wiki/%EB%B6%84%EB%A5%98:%EC%8B%9C%EC%A1%B0 ) but
multilingual wikis like Meta-wiki still have this issue (
http://meta.wikimedia.org/wiki/Category:KO ).

My point is that this feature should be applied universally, wherever page
names contain Korean characters.

anon.hui wrote:

I second this: the ko firstChar() should apply to wikis in all languages, especially multilingual wikis,
not just to the ko wiki.

kjoonlee wrote:

Another vote for support here.

Done in r35055. Also did a tiny bit of cleanup to use utf8ToCodepoint() func instead of the manual UTF-8 decomp code.

(Could just use raw characters here instead of the hex positions, should one desire, but this isn't a performance-critical code path.)
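
To illustrate the parenthetical remark about raw characters versus hex positions, here are the two styles of range check for the first range in the description. This is a Python sketch, not the code committed in r35055, and both function names are hypothetical.

```python
def in_first_range_hex(ch: str) -> bool:
    """Range check written with hex code point positions."""
    return 0xAC00 <= ord(ch) <= 0xB097


def in_first_range_raw(ch: str) -> bool:
    """Same check written with the raw boundary characters.

    Python compares single-character strings by code point, so the two
    functions agree for any single character.
    """
    return "가" <= ch <= "낗"
```

The raw form is easier to read against the list in the description; the hex form does not depend on the source file's encoding being handled correctly.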