
Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Portuguese Wikipedia
Closed, Resolved (Public)

Description

On T32996, Tim recommended changing the collation method to "uca-default" on a language-by-language basis, after checking each language for correct collation and first-letter behaviour on a test wiki.

According to the tests made by Bawolff, the "uca-default" collation method seems to work fine on Portuguese:
http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/59758/focus=59767
and in the same topic, Tim said the results are good enough.

On
pt:Wikipédia:Esplanada/propostas/Melhorar a ordenação das páginas com títulos acentuados (20mar2012)
editors from Portuguese Wikipedia agreed to enable uca-default sorting on ptwiki.

So, I believe it is feasible to set $wgCategoryCollation to 'uca-default' on ptwiki (and make any necessary updates, or run the necessary maintenance scripts).


Version: unspecified
Severity: enhancement
URL: https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Esplanada/propostas/Melhorar_a_ordena%C3%A7%C3%A3o_das_p%C3%A1ginas_com_t%C3%ADtulos_acentuados_%2820mar2012%29

Details

Reference
bz35632

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 12:16 AM)
bzimport set Reference to bz35632.

Tim's recommendation, more precisely, is:

"So I recommend doing this change on a language-by-language basis, after checking
each language for correct collation and first-letter behaviour on a test wiki.

Also, it would be nice to know in advance what percentage of sort keys will be
larger than the 230 bytes allowed by the database field, and if that percentage
is significant, whether there are categories on the target wikis where the
order will be changed by truncation after 230 bytes."
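The second concern in the quote (whether truncation at 230 bytes can change the order within a category) can be checked mechanically. Below is a minimal Python sketch, not the script actually used on this task. It assumes sort keys are available as raw byte strings paired with a category name and are compared byte-wise; the function name and the sample data are hypothetical. Truncation can only matter when two distinct keys in the same category share their first 230 bytes, because they then tie after truncation.

```python
from collections import defaultdict

LIMIT = 230  # bytes allowed by the sort key field, per the quote above


def categories_affected_by_truncation(entries, limit=LIMIT):
    """Return the categories whose order could change after truncation.

    entries: iterable of (category, sortkey) pairs, sortkey being bytes.
    Two distinct keys that share the same first `limit` bytes become
    indistinguishable once truncated, so their relative order is no
    longer determined by the key itself.
    """
    full_keys_by_prefix = defaultdict(set)
    for category, key in entries:
        full_keys_by_prefix[(category, key[:limit])].add(key)
    return {
        category
        for (category, _prefix), keys in full_keys_by_prefix.items()
        if len(keys) > 1
    }


# Hypothetical sample data, for illustration only.
sample = [
    ("A", b"x" * 300 + b"1"),  # collides with the next key after truncation
    ("A", b"x" * 300 + b"2"),
    ("B", b"y" * 300),         # long, but unique within its category
    ("B", b"z" * 10),          # short key, unaffected
]
affected = categories_affected_by_truncation(sample)
print(affected)
```

In this sample only category "A" is reported, since "B" has a long key but no second key sharing its first 230 bytes.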

For the first part: if I prepare a test wiki for you with this setting enabled, would you be willing to populate it with pages and categories for this test?

Sure. I could request some help on ptwiki's village pump.

Just to be sure: wouldn't it be easier to use Special:Import to get a list of (categorized) pages directly from ptwiki?

Indeed, that could also be a way.

I will prepare that this Monday or Tuesday.

Sort key size histogram for ptwiki with uca-default:

0-25: 1349546 |********
26-51: 2124309 |************
52-76: 878662 |****
77-102: 163018 |**
103-128: 42182 |*
129-154: 13498 |
155-180: 3402 |
181-205: 1679 |
206-231: 482 |
232-257: 214 |
258-283: 59 |
284-309: 42 |
310-334: 8 |
335-360: 2 |
361-386: 2 |
387-412: 0 |
413-438: 2 |
439-463: 0 |
464-489: 0 |
490-516: 3 |

99.993% of category entries have sort keys smaller than the limit of 230 bytes; 332 entries would have their sort keys truncated. It's unlikely that the order of any categories would be affected by truncation. The total index size would go up from about 116MB to 172MB.
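The figures in this comment can be reproduced from the histogram. The following Python sketch is only an illustration of the arithmetic, not the script that produced the histogram. It assumes that "truncated" means every bin whose lower bound is above 230. The 206-231 bin straddles the limit and is counted as within it here, which is the reading that matches the 332 figure.

```python
# Histogram rows copied from the comment above: (low, high, count).
HISTOGRAM = [
    (0, 25, 1349546), (26, 51, 2124309), (52, 76, 878662),
    (77, 102, 163018), (103, 128, 42182), (129, 154, 13498),
    (155, 180, 3402), (181, 205, 1679), (206, 231, 482),
    (232, 257, 214), (258, 283, 59), (284, 309, 42),
    (310, 334, 8), (335, 360, 2), (361, 386, 2),
    (387, 412, 0), (413, 438, 2), (439, 463, 0),
    (464, 489, 0), (490, 516, 3),
]
LIMIT = 230  # sort key field size in bytes


def truncation_stats(rows, limit):
    """Return (total entries, entries over the limit, percent within it)."""
    total = sum(count for _low, _high, count in rows)
    # Assumption: a bin counts as truncated only if it starts above the limit.
    over = sum(count for low, _high, count in rows if low > limit)
    return total, over, 100.0 * (total - over) / total


total, truncated, pct_ok = truncation_stats(HISTOGRAM, LIMIT)
print(f"{total} entries, {truncated} over the limit, {pct_ok:.3f}% within it")
# prints "4577110 entries, 332 over the limit, 99.993% within it"
```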

I think we just need to schedule a deployment window now.

Scheduled for Tuesday, August 21, 23:30-01:30 (next day; 4:30pm-6:30pm PDT). Tim will be doing this deploy.

Created attachment 11125
Changes related to [[Categoria:Sociedade de Transportes Colectivos do Porto]]

For the record: a user noticed that two articles were out of the expected order, for no apparent reason, in one of our categories:
https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Caf%C3%A9_dos_programadores&oldid=32271948#Ordena.C3.A7.C3.A3o_de_categorias

It seems they fixed the order by editing the sort key (the text after the pipe in the category link), as in
https://pt.wikipedia.org/w/index.php?title=Linha_602_da_STCP&diff=32266548&oldid=32216579

Here are the differences between two requests to the API before and after the changes made:
https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:P%C3%A1gina_de_testes/1&diff=32271872

I'm also attaching a copy of the list of recent changes to articles of that category in case any of them are relevant.
