Cite: 'Cite error ref too many keys' not generated if name chunk contains other than A–Z, a–z, 0–9
Open, Medium, Public

Description

When a reference name is not enclosed in quotes and contains an invalid character, it should trigger 'Cite error ref too many keys'. Example:

<ref name = smith p1>x</ref>

But if the name fragment after the space contains (not just begins with) any character other than A–Z, a–z, 0–9, the error is not triggered. Example:

<ref name = smith p!1>x</ref>

This should generate the error message, but instead the name is silently truncated at the space.


Version: unspecified
Severity: normal

Details

Reference
bz42040

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:00 AM
bzimport added a project: Cite.
bzimport set Reference to bz42040.
bzimport added a subscriber: Unknown Object (MLST).
thiemowmde added subscribers: thiemowmde, Anomie, cscott and 2 others.

I can confirm this is still an issue. I understand why it looks like a bug in the Cite extension, but it is actually an issue with MediaWiki's Parser, specifically with the way it processes attributes for so-called "tag hooks". The <ref> tag is one of these.

This is the most minimal example I could come up with:

This shows the expected error message
<ref x>a</ref>
This does not
<ref !>a</ref>

I was able to track the issue down to the line https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/parser/Parser.php$4039, which calls Sanitizer::decodeTagAttributes(). This function is called with the expected string !, but returns an empty array. From that point on, later code has no way of knowing that the tag contained an invalid attribute.

There is a line that intentionally skips attributes with invalid characters, without reporting anything. This line was introduced in 2018 via https://gerrit.wikimedia.org/r/471363. However, a closer look shows that the same issue existed before that change: the old code relied on a regular expression that skipped invalid characters, with exactly the same result.
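To make the effect easier to see, here is a rough model of the behaviour in Python. It is not MediaWiki's code: the function names, the tokenizing regular expression, the A–Z, a–z, 0–9 name rule (taken from the task title) and the list of known keys are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified model. NOT the real Sanitizer::decodeTagAttributes().
# A token is a name, optionally followed by "= value" (unquoted values only).
TOKEN = re.compile(r'([^\s=]+)(?:\s*=\s*([^\s=]+))?')
# Assumption taken from the task title: only A-Z, a-z, 0-9 count as valid.
VALID_NAME = re.compile(r'[A-Za-z0-9]+')


def decode_tag_attributes(text):
    """Return a dict of attributes, silently skipping invalid names."""
    attributes = {}
    for match in TOKEN.finditer(text):
        name = match.group(1)
        value = match.group(2) or ''
        if not VALID_NAME.fullmatch(name):
            continue  # dropped without any report to the caller
        attributes[name.lower()] = value
    return attributes


def too_many_keys(attributes, known=('name', 'group')):
    """Model of the Cite-side check: any unknown key is an error."""
    return any(key not in known for key in attributes)
```

In this model, `too_many_keys(decode_tag_attributes('x'))` and `too_many_keys(decode_tag_attributes('name = smith p1'))` both report an error, while `'!'` and `'name = smith p!1'` do not, because the invalid name never reaches the check.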

I'm really not sure there is anything we can do about this. I believe it would be wrong to add the ! to the list of attributes, as this would most certainly create a lot of new issues in code that consumes such a set of attributes. Another idea is to let the Parser report such invalid character sequences. However, this would be a breaking change in the Parser. Historically, there was no such thing as "invalid" wikitext.
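For illustration only, the second idea could look roughly like the following Python sketch. The function name, the regular expressions and the tuple return value are all hypothetical and do not correspond to any existing MediaWiki API.

```python
import re

# Hypothetical sketch of a decoder that reports what it rejects.
# Names and patterns are assumptions for illustration, not MediaWiki code.
TOKEN = re.compile(r'([^\s=]+)(?:\s*=\s*([^\s=]+))?')
VALID_NAME = re.compile(r'[A-Za-z0-9]+')


def decode_with_report(text):
    """Return (attributes, rejected) instead of only the attributes."""
    attributes = {}
    rejected = []
    for match in TOKEN.finditer(text):
        name = match.group(1)
        value = match.group(2) or ''
        if VALID_NAME.fullmatch(name):
            attributes[name.lower()] = value
        else:
            rejected.append(name)  # kept so a caller could raise an error
    return attributes, rejected


attributes, rejected = decode_with_report('name = smith p!1')
if rejected:
    print('invalid attribute names:', rejected)
```

Note that the return value changes shape in this sketch, which hints at why every existing caller would be affected by such a change.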