
action=parse shows a different sortkey than the one output by prop=categories (prefix vs actual binary sortkey)
Open, Medium, Public, Feature

Description

With a very long sortkey (>255 bytes), the API module action=parse outputs the full sortkey, but the database later stores a truncated one.

Please truncate the sortkey when adding to the ParserOutput, so action=parse shows the right sortkey.

Please truncate at whole multibyte character boundaries, so that no partial characters end up in the database.
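A minimal sketch of the requested truncation, written in Python for illustration rather than as MediaWiki code. The function name `truncate_utf8` is hypothetical; the 255-byte limit comes from the description above. It cuts at the byte limit and then drops any incomplete trailing multibyte sequence:

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes.

    Hypothetical helper: never leaves part of a multibyte character behind.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit. Because the input was valid UTF-8, only the
    # tail can be an incomplete sequence, and errors="ignore" drops it.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# 200 two-byte characters (400 bytes) are cut to 127 whole characters.
shortened = truncate_utf8("é" * 200)
```

With two-byte characters, the result occupies 254 bytes instead of 255, because the final half character is discarded rather than stored.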

By the way, you can remove the truncation from LinksUpdate (r79706).

Thanks.


Version: 1.18.x
Severity: enhancement

Details

Reference
bz26614

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 11:15 PM
bzimport set Reference to bz26614.

The reason I didn't do it in ParserOutput (well, in addition to the parser being scary) is that an extension can add stuff to that array in several different ways, which makes it hard to put the check directly in ParserOutput.

Perhaps we could add a check in ParserOutput::addCategory where categories are added 99% of the time, and still keep the extra substring check in LinksUpdate. Does that sound sane?

(In reply to comment #1)

The reason i didn't do it in ParserOutput (well in addition to the parser being
scary), is that an extension can add stuff to that array in several different
ways, which makes it hard to put the check directly in ParserOutput.
Perhaps we could add a check in ParserOutput::addCategory where categories are
added 99% of the time, and still keep the extra substring check in LinksUpdate.
Does that sound sane?

Yes, if the truncation inside LinksUpdate is still needed because of extensions, then keep the extra substring check.

On second thought, even if it was truncated in ParserOutput, the API would still output the wrong sortkey on action=parse, because it's outputting the equivalent of cl_sortkey_prefix, not cl_sortkey. The question is, should it be outputting the prefix, or the final (binary) sortkey?

Does anyone know how these values are actually used in practice (and so which is appropriate)? Perhaps we should output both? I'm leaning towards outputting the final binary sortkey here, but I could also imagine someone would have use for the human-readable sortkey as well, so I'm unsure.

cc'ing Aryeh Gregor in case he has any thoughts on this, seeing as he did most of the category stuff.

I'm guessing this has some relation to bug 24650

ayg wrote:

The way they're used in practice is that if cl_sortkey_prefix is empty, then cl_sortkey = $wgContLang->convertToSortkey( page_title ). Otherwise, cl_sortkey = $wgContLang->convertToSortkey( cl_sortkey_prefix . "\0" . page_title ). See Title::getCategorySortkey() and Language::convertToSortkey().
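The rule described above can be sketched as follows. This is a Python illustration, not the MediaWiki implementation: `convert_to_sortkey` is a hypothetical stand-in for Language::convertToSortkey() that assumes a simple uppercase collation, and `category_sortkey` is a made-up name for the combined logic.

```python
def convert_to_sortkey(text: str) -> str:
    # Hypothetical stand-in for Language::convertToSortkey().
    # Assumption: plain uppercasing; a real collation may return
    # arbitrary binary data instead.
    return text.upper()


def category_sortkey(page_title: str, prefix: str = "") -> str:
    """Build the cl_sortkey value from cl_sortkey_prefix and the page title."""
    if prefix == "":
        return convert_to_sortkey(page_title)
    # A NUL character separates the prefix from the page title.
    return convert_to_sortkey(prefix + "\0" + page_title)


no_prefix = category_sortkey("Main Page")
with_prefix = category_sortkey("Main Page", "bar")
```

Under the uppercase assumption, the prefix alone ("bar") and the stored sortkey differ even before any truncation, which is the mismatch this report is about.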

cl_sortkey is in general an arbitrary binary string which may bear no discernible relationship to the original page title or sortkey prefix, once we start using proper ICU or CLDR or whatever for sortkeys. So I can't imagine why anyone would want the actual value; logically, you'd only want to sort by it.

I don't know why you'd want cl_sortkey_prefix either, for that matter (although it is going to be human-readable text). I can't see any use for it other than constructing cl_sortkey. Do you know of anyone who actually wants or uses these?

This isn't actually an API per se, it's more the parser sort of.

Where is it actually output from a parse query? I can't see it anywhere obviously.

bug 24650 has been fixed now.

Where are these actually used anywhere by anyone? I'm poking around, and have logged bug 26736 as a related bug, depending on this perceived usefulness.

Especially, as per Aryeh, it may move further from being a human readable thing, which is probably useless.

+------------+-------------------+
| cl_sortkey | cl_sortkey_prefix |
+------------+-------------------+
FULLTWOLOW
MAIN PAGE
+------------+-------------------+

Where is it actually output from a parse query? I can't see it anywhere
obviously.

The sortkey attribute of the cl element in this query:
api.php?action=parse&text=[[category:Foo|bar]]&prop=categories
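For anyone reproducing this, the query above needs its wikitext URL-encoded. A small Python sketch, assuming only the relative `api.php` path shown above (the function name `build_parse_query` is hypothetical, and no host is implied):

```python
from urllib.parse import parse_qs, urlencode


def build_parse_query(wikitext: str) -> str:
    """Return the relative api.php URL asking action=parse for categories."""
    params = {
        "action": "parse",
        "text": wikitext,
        "prop": "categories",
    }
    # urlencode percent-encodes the brackets, colon and pipe in the wikitext.
    return "api.php?" + urlencode(params)


url = build_parse_query("[[category:Foo|bar]]")
# Decoding the query string gives back the original wikitext.
decoded = parse_qs(url.split("?", 1)[1])
```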

Honestly, it wouldn't be horrible to leave it the way it is. In a sense, it's outputting what it reads the sortkey as, not what the final sortkey is.

Since Bug #24650 is fixed now, this obviously wasn't blocking it. Removing blocker status.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM