
API: add normalized info also for unicode normalization of titles
Closed, ResolvedPublic

Description

When you give the API titles in non-NFC form (see URL), in the output they are silently normalized to NFC, which makes it difficult for the user to match the input with the output.

So there should be a 'normalized' entry for every given title in non-NFC form, like the ones for other title normalizations.


URL: http://en.wikipedia.org/w/api.php?action=query&titles=%CE%A5%CC%88
See Also:
T47848: API is applying normalization that Title doesn't

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:32 PM
bzimport set Reference to bz27849.
bzimport added a subscriber: Unknown Object (MLST).

IIRC, this normalization is applied on raw input in WebRequest, so the API code would only ever see the NFC form in the first place.

For it to know anything had changed, it would have to manually compare against $_GET and $_POST source variables.
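For illustration, a rough sketch of what that manual comparison might look like (assuming the 'titles' parameter; this is not actual API code):

// Hedged sketch: WebRequest has already cleaned the value up by the time the
// API sees it, so the only place the original bytes survive is the raw
// superglobal.
$raw = isset( $_GET['titles'] ) ? $_GET['titles'] : '';
$clean = $wgRequest->getVal( 'titles', '' );
if ( $raw !== '' && $raw !== $clean ) {
	// The client's input was not already NFC-normalized (or was not even
	// valid UTF-8); only $raw still holds what was actually sent.
}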

Bryan.TongMinh wrote:

We can add a function to WebRequest to return the original value instead of the normalized one.

Looks like we might need to cache it earlier...

It looks like whenever the normalization is called, it just overrides them all.

Bryan.TongMinh wrote:

The normalization is done in getGPCValue. Just add a boolean parameter $normalize.
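A rough sketch of what that suggestion could look like (the method body here is simplified and partly guessed, not the actual WebRequest code):

// Sketch only: let callers opt out of UTF-8 normalization so the API can
// still see the original (pre-NFC) value.
private function getGPCValue( $arr, $name, $default, $normalize = true ) {
	if ( !isset( $arr[$name] ) ) {
		return $default;
	}
	$data = $arr[$name];
	if ( $normalize && !is_array( $data ) ) {
		// This is where WebRequest currently cleans everything up to NFC.
		$data = $this->normalizeUnicode( $data );
	}
	return $data;
}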

This should also be implemented for the special UTF-8 normalization that is done for ml and ar, e.g. http://ml.wikipedia.org/w/api.php?action=query&titles=അനിമേഷന്‍ (I think it's NFKC?).

I took a stab at this this afternoon, but ran into an issue that I think makes this impossible to solve. I managed to delay Unicode normalization of the titles parameter until ApiPageSet::processTitlesArray(), and got ?action=query&titles=Ϋ&format=jsonfm to output a 'normalized' object. However, all data in the API result data structure is Unicode-normalized before being output, so you get stuff like:

		"normalized": [
			{
				"from": "\u03ab",
				"to": "\u03ab"
			}
		],

where the "from" entry was originally "\u03a5\u0308" (the value specified in the query string) but got normalized prior to being output. This means from and to will always be equal (apart from underscores-to-spaces and other existing normalizations), so this is useless.

I could armor the from value to protect it from Unicode normalization (I've written code for that before; I threw it out but I should be able to reproduce it quickly), but that would allow the injection of arbitrary non-normalized data into the result, which may be invalid UTF-8, which would break e.g. XML parsers.

Is there a way I can do this only for cases where we want this? Is "\u03a5\u0308" a string that is valid UTF-8/Unicode but is nevertheless changed by Language::normalize()? Is this true for all cases where we want this feature? Is it possible to detect this somehow? CC Brion because he probably knows more about this subject than I do.

Created attachment 8504
Stashing my work-in-progress changes here, this is as good a place as any

attachment unicode-normalize.patch ignored as obsolete

Created attachment 8505
Patch with debugging code removed


Bryan.TongMinh wrote:

(In reply to comment #6)

I could armor the from value to protect it from Unicode normalization (I've
written code for that before; I threw it out but I should be able to reproduce
it quickly), but that would allow the injection of arbitrary non-normalized
data into the result, which may be invalid UTF-8, which would break e.g. XML
parsers.

Invalid UTF-8 is essentially random binary data and should thus be encoded, for example in base64.

(In reply to comment #9)

Invalid UTF-8 is essentially random binary data and should thus be encoded, for
example in base64.

Yeah. But I think it's fair not to offer this feature (normalization data) when the client gives us invalid UTF-8. Clients sending invalid UTF-8 deserve that punishment :)

However, I think there are edge cases where we normalize things that are already valid UTF-8 but not quite the way we like to see it. I'd like to be able to detect those cases.

There are essentially two layers of work here, which our input validation merges into a single step:

  1. invalid UTF-8 sequences must be found and replaced with valid placeholder characters
  2. valid UTF-8 sequences are normalized to form C (e.g., replacing 'e' followed by 'combining acute accent' with the precombined character 'e with acute')

The invalid UTF-8 sequences found in part 1) cannot be represented as strings in JSON or XML output, because JSON and XML formats are based on Unicode text. Even if you wanted them, you can't just output them directly, nor can you use any escaping method to represent the original bad sequences.

Outputting the original bogus UTF-8 into the document would cause it to be unreadable, breaking the API.

Most likely, only 2) is of real interest: "\u03a5\u0308" is a perfectly valid Unicode string, and can be shipped around either with the JSON string escapes as above or as literals in any Unicode encoding for any JSON or XML document. We can perfectly well expect clients to send that string, and we should be able to represent it in output.

That we normalize strings into NFC for most internal purposes should generally be an implementation detail of our data formats and how we do title comparisons, so it's reasonable to expect clients that input a given non-NFC string to see the same thing on the other side when we report how we normalized the title string.

Only UTF-8 sequence validation should run at the $wgRequest boundary; stuff like the NFC conversion to avoid extra combining characters should really happen at processing and comparison boundaries like Title normalization.

So in short: don't worry about representing invalid UTF-8 byte sequences: either use a 'before' value that's been validated as UTF-8, or let the API output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion on all output)

(In reply to comment #11)

So in short: don't worry about representing invalid UTF-8 byte sequences:
either use a 'before' value that's been validated as UTF-8, or let the API
output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion on
all output)

Yeah I wanted to do the former, but I have no idea how. What kind of function would I call to find out if something is valid UTF-8?

(This is mostly why I'm asking you for help, I know next to nothing about MediaWiki's Unicode facilities, and IIRC you wrote them ;) )

Honestly I don't think we have a good way to do that right now; UtfNormal combines it with the NFC stuff in quickIsNFCVerify(), and our fallbacks mean that a call to iconv() or mb_convert_encoding() might not actually apply anything...

Blech!

Can't you do something like
$string2 = $string;
UtfNormal::quickIsNFCVerify( $string2 );
$stringIsValidUTF8 = ( $string === $string2 );

As far as I can tell, quickIsNFCVerify doesn't seem to do anything with the string argument other than remove invalid sequences and remove control characters (or replace them with the replacement character).

(In reply to comment #14)

Can't you do something like
$string2 = $string;
UtfNormal::quickIsNFCVerify( $string2 );
$stringIsValidUTF8 = ( $string === $string2 );

As far as I can tell, quickIsNFCVerify doesn't seem to do anything with the
string argument other than remove invalid sequences and remove control
characters (or replace them with the replacement character).

Hmmmmm, you know what, that should work just fine actually. :)

Downside: may be slower than UtfNormal::cleanUp() on some input texts on some systems, eg if NORMALIZE_ICU is on and using that extension. In other modes, that same code is already getting run if we're calling UtfNormal::cleanUp(), so it should be about the same speed for common cases if we're using either the default or the NORMALIZE_INTL mode (since it calls quickIsNFCVerify anyway to validate UTF-8 before doing the normalization call).
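Putting that exchange together, a minimal helper based on the trick above might look like this (the helper name is hypothetical, not existing MediaWiki code):

// quickIsNFCVerify() cleans its argument in place, so if a copy survives
// unchanged, the input was already clean, valid UTF-8.
function isCleanUtf8( $string ) {
	$copy = $string;
	UtfNormal::quickIsNFCVerify( $copy );
	return $copy === $string;
}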

Just some statistics from my interwiki bot:
Each of my API requests normally contains 50 title values. The title values themselves are the result of other API requests, so they should all be valid UTF-8.
Even after reading the normalized and converted information, on average 17 of the requested titles cannot be found in the result on mlwiki, and on arwiki it's about 3-4 titles.

To work around this I re-request each unmatched title in its own request, so I know that the single element contained in the response must correspond to the requested title. In summary, my bot needs about 18 read requests for 50 titles on mlwiki instead of only one.

By the way, if I recall correctly we do some other normalization beyond NFC for ml and ar wikis (it is applied only on wikis with those content languages, for performance reasons, so if you get an interwiki link title from an en wiki, it might have different normalization on ml or ar).

Bryan.TongMinh wrote:

(In reply to comment #15)

(In reply to comment #14)

Can't you do something like
$string2 = $string;
UtfNormal::quickIsNFCVerify( $string2 );
$stringIsValidUTF8 = ( $string === $string2 );

As far as I can tell, quickIsNFCVerify doesn't seem to do anything with the
string argument other than remove invalid sequences and remove control
characters (or replace them with the replacement character).

Hmmmmm, you know what, that should work just fine actually. :)

Downside: may be slower than UtfNormal::cleanUp() on some input texts on some
systems, eg if NORMALIZE_ICU is on and using that extension. In other modes,
that same code is already getting run if we're calling UtfNormal::cleanUp(), so
it should be about the same speed for common cases if we're using either the
default or the NORMALIZE_INTL mode (since it calls quickIsNFCVerify anyway to
validate UTF-8 before doing the normalization call).

It's only done on strings of at most 255 bytes, so the slowdown should be negligible.

Bryan, Bawolff,

Could one of you take this and make the necessary changes to close the bug?

leaving this as a deployment blocker since all that seems to be needed here is a SMOP.

(In reply to comment #20)

leaving this as a deployment blocker since all that seems to be needed here is
a SMOP.

This could potentially lead to invalid output for XML formats (since certain characters like some control characters are not allowed to appear in XML files, even in entity form)

(In reply to comment #21)

(In reply to comment #20)

leaving this as a deployment blocker since all that seems to be needed here is
a SMOP.

This could potentially lead to invalid output for XML formats (since certain
characters like some control characters are not allowed to appear in XML files,
even in entity form)

Hmm, I suppose I should read what other people say before commenting ;)

switch to milestone, remove release tracking dep

Moved patch into Gerrit, see https://gerrit.wikimedia.org/r/#/c/22831/ . It doesn't actually work yet, because the unnormalized data needs to be armored to bypass ApiResult::cleanUpUTF8() somehow.

punting to some point in the future.

(In reply to comment #0)

When you give the API titles in non-NFC form (see URL), in the output they are
silently normalized to NFC, which makes it difficult for the user to match the
input with the output.

I suppose bug 45848 is another way to look at the problem? It's causing major problems with some API consumers like LiquidThreads; Nikerabbit proposed a solution there.

Change 22831 abandoned by Hashar:
(bug 27849) Add normalized info for Unicode normalization of titles

Reason:
Cleaning up very old change. Feel free to resurrect if there is any interest in finishing this.

https://gerrit.wikimedia.org/r/22831

High priority set for more than three years => reflecting reality by setting to Normal priority. Comment 27 and comment 28 describe the status here.

We currently go to great lengths both to prevent any non-NFC text from entering MediaWiki and to prevent any non-NFC text from leaving it. I don't think we should change that. My preferred solution to this would be mapping the index of each input title in titles (rather than the text itself, like we now do in normalized) to the normalized name.

Change 306491 had a related patch set uploaded (by Anomie):
API: Warn when input parameters are normalized

https://gerrit.wikimedia.org/r/306491

Change 306491 merged by jenkins-bot:
API: Warn when input parameters are normalized

https://gerrit.wikimedia.org/r/306491

matmarex assigned this task to Anomie.
matmarex removed a project: Patch-For-Review.

Note that there are some interesting things about the new entries in 'normalized':

  • A new 'fromencoded' key is added; if its value is true, the 'from' value must be percent-decoded (URL-decoded). This is because it may include bytes that we can't safely include in the output (invalid UTF-8, which is just a pain to work with, or null bytes, which can't be represented in XML at all).
  • 'normalized' is no longer a direct map. If input needs both NFC normalization and MediaWiki title canonicalization, there will be two entries and you have to follow the map twice.

For example: "ǎ" (a + combining caron) gets NFC-normalized to "ǎ" (precomposed a with caron), which when treated as a page title gets normalized to "Ǎ".

/w/api.php?action=query&format=json&formatversion=2&titles=a%CC%8C

{
    "batchcomplete": true,
    "warnings": {
        "query": {
            "warnings": "The value passed for 'titles' contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\\t), LF (\\n), and CR (\\r)."
        }
    },
    "query": {
        "normalized": [
            {
                "fromencoded": true,
                "from": "a%CC%8C",
                "to": "ǎ"
            },
            {
                "fromencoded": false,
                "from": "ǎ",
                "to": "Ǎ"
            }
        ],
        "pages": [
            {
                "ns": 0,
                "title": "Ǎ",
                "missing": true
            }
        ]
    }
}

/w/api.php?action=query&format=xml&titles=a%CC%8C

<?xml version="1.0"?>
<api batchcomplete="">
  <warnings>
    <query xml:space="preserve">The value passed for 'titles' contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\t), LF (\n), and CR (\r).</query>
  </warnings>
  <query>
    <normalized>
      <n fromencoded="" from="a%CC%8C" to="ǎ" />
      <n from="ǎ" to="Ǎ" />
    </normalized>
    <pages>
      <page _idx="-1" ns="0" title="Ǎ" missing="" />
    </pages>
  </query>
</api>
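For reference, a hedged client-side sketch (not part of MediaWiki; the function name is hypothetical) of how a consumer could follow the 'normalized' chain from the formatversion=2 JSON response above, percent-decoding 'from' whenever 'fromencoded' is true:

<?php
// Resolve a requested title to the title the API reports on, by repeatedly
// following matching 'normalized' entries until nothing changes.
function resolveTitle( array $normalized, string $requested ): string {
	$title = $requested;
	do {
		$changed = false;
		foreach ( $normalized as $entry ) {
			$from = !empty( $entry['fromencoded'] )
				? rawurldecode( $entry['from'] )
				: $entry['from'];
			if ( $from === $title && $entry['to'] !== $title ) {
				$title = $entry['to'];
				$changed = true;
				break;
			}
		}
	} while ( $changed );
	return $title;
}

// With the example above, "a\u{030C}" (a + combining caron) resolves via
// "a%CC%8C" -> "ǎ" -> "Ǎ", matching the title reported in 'pages'.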