
Category collation sort order should ignore spaces, hyphens, apostrophes (?)
Open, Low, Public, Feature

Description

Category collation sort order should ignore spaces, hyphens, apostrophes.

PHP's Collator can ignore all punctuation in comparisons through the Collator::ALTERNATE_HANDLING attribute; however, it would be useful to keep some punctuation significant, for example commas, so that biographies sort correctly ("Last, First" > "Las, Tzzzzz").
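To make the comma example concrete, here is a minimal sketch in Python (not MediaWiki's or ICU's actual collation). It assumes a hypothetical helper `sort_key` and a plain code-point comparison of casefolded strings; the set of ignorable characters is taken from the task title and is an assumption, not an agreed list.

```python
# Characters treated as ignorable in this sketch (assumption: taken from the
# task title; U+2019 is the typographic apostrophe).
IGNORABLE = frozenset({" ", "-", "'", "\u2019"})


def sort_key(title, ignorable=IGNORABLE):
    """Return a comparison key with the ignorable characters removed.

    Hypothetical helper: it compares casefolded code points, which is only
    a stand-in for real language-aware collation.
    """
    return "".join(ch for ch in title.casefold() if ch not in ignorable)


names = ["Last, First", "Las, Tzzzzz"]

# Commas stay significant: "Las, ..." sorts before "Last, ...".
print(sorted(names, key=sort_key))

# Ignoring commas as well flips the order, which is the case to avoid.
print(sorted(names, key=lambda t: sort_key(t, IGNORABLE | {","})))
```

With commas kept, the keys are "las,tzzzzz" and "last,first", so the shorter surname sorts first; once commas are dropped too, the keys become "lastzzzzz" and "lastfirst" and the order reverses.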

Suggested at https://fr.wikipedia.org/wiki/Wikipédia:Le_Bistro/26_septembre_2013#Discussion


Version: 1.22.0
Severity: enhancement

Details

Reference
bz54689

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 2:20 AM
bzimport set Reference to bz54689.
bzimport added a subscriber: Unknown Object (MLST).

Spaces and hyphens are too much! Actually we need to preserve word separators, with the exception (tunable by language) of apostrophes, which should be considered either ignorable (when they are used as elision marks, most often at the end of words, which are then fused with the next word) or significant letters (when they denote a phoneme such as a glottal stop, most often in leading position after a word separator).

So in English, French and Italian, the apostrophe is ignorable for collation and plain-text searches ("its OK" and "it's OK" will match each other).
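A rough sketch of what "tunable by language" could look like, in Python. The helper `search_key` and the language-code set are hypothetical; only English, French and Italian come from the comment above, and any other language code is simply treated as keeping the apostrophe significant.

```python
# ASCII apostrophe and the typographic apostrophe U+2019.
APOSTROPHES = ("'", "\u2019")

# Languages where the apostrophe is ignorable (assumption: only the three
# languages named in the comment above).
APOSTROPHE_IGNORABLE = frozenset({"en", "fr", "it"})


def search_key(text, lang):
    """Return a casefolded key, dropping apostrophes where they are ignorable."""
    key = text.casefold()
    if lang in APOSTROPHE_IGNORABLE:
        for mark in APOSTROPHES:
            key = key.replace(mark, "")
    return key


# In English both spellings produce the same key.
print(search_key("it's OK", "en") == search_key("its OK", "en"))

# For a language outside the set, the apostrophe stays significant.
print(search_key("it's OK", "nap") == search_key("its OK", "nap"))
```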

But in Neapolitan (for example) there are two distinct types of apostrophe: final elision marks and initial glottal stops. Both may occur in sequence. Some Neapolitan articles use the ASCII double quote (") for such a pair, only to avoid the two quotes ('') being interpreted as italics in MediaWiki syntax; Neapolitan wikis would have done better to use distinct left and right apostrophes. You can detect these unexpected double quotes because they are surrounded by letters, with no space on either side.
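The detection rule (a double quote with a letter directly on each side) can be sketched as a regular expression in Python. The helper name `find_embedded_quotes` is hypothetical, and the sample string is a made-up placeholder, not real Neapolitan text.

```python
import re

# A single letter: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

# An ASCII double quote with a letter immediately before and after it.
EMBEDDED_QUOTE = re.compile(rf'(?<={LETTER})"(?={LETTER})')


def find_embedded_quotes(text):
    """Return the positions of double quotes that sit between two letters."""
    return [match.start() for match in EMBEDDED_QUOTE.finditer(text)]


# Placeholder sample: one embedded quote, plus an ordinary quoted word
# whose quotes each have a space on one side.
sample = 'abc"def and "quoted" word'
print(find_embedded_quotes(sample))
```

Ordinary quotation marks are skipped because each has a space (or the string boundary) on at least one side.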

But MediaWiki currently cannot treat ('') between two letters (with no space on either side) as two apostrophes (a right apostrophe for the final elision, then a left apostrophe for the initial glottal stop): it always interprets them as the wiki syntax for italics. Single words that switch between roman and italics in the middle are extremely rare, and where needed you could still insert a <nowiki/> before or after the ('') markup to restore its function as an italic style delimiter.

Partial workaround: articles can also use ('<nowiki/>') to separate the apostrophes, but this does not work in contexts where markup is undesired (such as page titles or title attributes of elements). Users also cannot use the ASCII double-quote kludge in these contexts, because both single and double quotes can occur anywhere in plain text. So they should use the left and right apostrophes instead of the ASCII apostrophe-quote and double quote.

This may not be a concern for this particular bug report, but sortkeys beginning with or consisting only of a space or other punctuation mark should be handled separately. On the English Wikipedia, at least, a sortkey beginning with a space is frequently used to sort "key" articles (especially the category's eponymous article) to the top of the category. I've also seen asterisks used for this purpose, and it wouldn't surprise me if other language wikis have similar conventions.
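One possible way to handle such sortkeys separately, sketched in Python: keep any leading run of non-alphanumeric characters untouched and strip ignorable characters only from the remainder. The helper `sort_key_keeping_prefix` and the ignorable set are assumptions for illustration, and the comparison is by code point rather than real collation.

```python
# Characters ignored inside the body of a sortkey (assumption: from the
# task title).
IGNORABLE = frozenset({" ", "-", "'", "\u2019"})


def sort_key_keeping_prefix(sortkey):
    """Keep leading spaces/punctuation as-is; strip ignorables afterwards."""
    # Find the end of the leading run of non-alphanumeric characters.
    split = 0
    while split < len(sortkey) and not sortkey[split].isalnum():
        split += 1
    prefix, rest = sortkey[:split], sortkey[split:]
    body = "".join(ch for ch in rest.casefold() if ch not in IGNORABLE)
    return prefix + body


keys = ["Apple", " France", "*Index", "Zebra-fish"]

# Keys with a leading space or asterisk still sort to the top.
print(sorted(keys, key=sort_key_keeping_prefix))
```

Because space and asterisk have lower code points than letters, the prefixed keys keep sorting first while the hyphen inside "Zebra-fish" is still ignored.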

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM
Aklapper removed a subscriber: wikibugs-l-list.