Page MenuHomePhabricator

[Task] Apply normalization to string values in statements
Closed, ResolvedPublic

Description

I suggest to trim leading/trailing spaces around statements of string type, and apply unicode normalization.

Due to a c&p error I added some trainling spaces at a VIAF statement and had to remove them with an extra edit, see URL.

Example URL: https://www.wikidata.org/w/index.php?title=Q6486700&diff=10051134&oldid=10051107


Other issues:

  • CR
  • unprintable characters (if not included in unicode normalization)

Details

Reference
bz45925

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
bzimport added a subscriber: Unknown Object (MLST).

The "malformed input" error is thrown by the RegexValidator in function buildStringType from WikibaseDataTypeBuilders. It often happens when copy-pasting text from other sources (see http://www.wikidata.org/wiki/Wikidata:Project_chat#Error when adding commonscat) and it is made worse by bug 63301 which sometimes adds newlines when pressing return to save the claim, thus triggering the error.

The trimming was previously done in ValueView but it has since been removed (see http://github.com/wmde/ValueView/commit/0a8350999bb1ee028db9487286173aee0b20640f). We could use the StringNormalizer class to trim the string and also remove incomplete UTF-8 sequences (see related bug 50486), however I'm not familiar with the data value processing code so I'm not sure where is the correct place to do it.

Just to be clear: ValueView did not trim the value, it just checked for empty or whitespace-only strings.

At the time of writing, the current behaviour seems to be: When a value with a leading space character is submitted, the back-end parser will return a "malformed input" error.
I wonder whether silent trimming should rather be implemented in the back-end parser instead of in the front-end.
Can some authority please decide on that?

IIRC When I discussed this with Daniel his comment was that the parser shouldn't do anything to the string but only parse it. That makes sense to me.
If we only do it in the frontend then these could still get in via the API for example. That seems sub-optimal.

Daniel: Can you chime in on the pros and cons of doing this in the backend please?

I said that? Hm, then I have to disagree with myself there... The parsevalue API module (resp. StringValueParser) should trim input (and apply utf8 normalization).

Hm... apparently, StringValueParser does not create StringValues from strings. We seem to use the NullParser for this, which returns an UnknownValue. Whatever.

So, let's have an actual StringValueParser that applies normalization.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:33 PM
thiemowmde renamed this task from Trim spaces around statements of string type to [Task] Trim spaces around statements of string type.Aug 13 2015, 3:19 PM
thiemowmde lowered the priority of this task from High to Medium.
thiemowmde updated the task description. (Show Details)
thiemowmde added a project: DataValues.

wbparsevalue now supports datatype=string. The UI should start using it. In fact, the UI should use the same code to parse, and at the same time validate, all data values. All special case code for string values should go away.

daniel raised the priority of this task from Medium to High.Aug 27 2015, 3:09 PM

Bumping to high because of implications missing unicode normalization. This could potentially break JSON dumps.

daniel renamed this task from [Task] Trim spaces around statements of string type to [Task] Apply normalization to string values in statements.Aug 27 2015, 3:11 PM
daniel updated the task description. (Show Details)

From what I wrote in T147037:

I think it's a frequent problem when strings start or end with odd characters or even include CR in the middle.

It happens a lot when trying to create statements for article titles. Oddly, I think the same text for labels tends to work.

Instead of rejecting it completely, it would be helpful if a cleaned version was offered for saving instead.

Please distinguish between

a) normalization applied by the API, when receiving a JASON structure representing a sting value.
b) normalization applied by the API when parsing user input, and returning a string value
c) normalization applied by the JavaScript widget before sending user input to the API

I believe we should do (b), but not (a). We can also do (c), but with (b) in place, this seems redundant.

The reason I believe we should not do (a) is that we are not dealing with user input then, but with a DataValue object. DataValue objects are either valid or they are not, we should not guess intent. Clients that supply DataValues should take care to apply any normalization they desire.

thiemowmde raised the priority of this task from High to Needs Triage.
thiemowmde removed a project: Patch-For-Review.

This has been known about for SEVEN YEARS?? Why do I even bother reporting anything

Because unfortunately there are too many to not loose track of some important ones :(

The only real option that I see working if we want to allow whitespace values in the UI is....

  • Have a list of whitespace chars that we want to trim
  • If a string is made up ONLY of these whitespace chars, do not trim
  • If a string has these whitespace chars at the beginning or end, and also has other chars, trim them?

If the cases of having whitespace at the start or end is all covered by the “only whitespace” case, then we can just add this validation in the API, and if commons calls this api (as it should) then this would fix both issues. T251380 and this

I suggest two separate data types for strings that should be trimmed (which should be the norm) and "binary" strings that aren't.

I've run into this a lot recently. It can be difficult when you are importing from a data source that may have less clean data, and tracking down something like an illegal non-breaking space character in a value is a major pain, especially when it can't even be seen in the UI message. These sorts of characters are already normalized on input in other places in MediaWiki, so it makes sense to just silently normalize rather than give the user a vague error they may or may not be able to fix, or expect them to do all their own normalization on data sources before importing.

Hi @Lucas_Werkmeister_WMDE, how about "illegal non-breaking space characters" (see Dominicbm's comment)? Shouldn't these already be normalized? (wbparsevalue shows me a warning, but it seems to do the job). Can you confirm this or are we also not using the parse API for relevant sting inputs in the Wikibase UI?

Our UI uses the parse API for strings (since T261071#6493792), but if I understand correctly, @Dominicbm is using the API directly, not editing via the UI. We could decide to do more of these “cleanups” as normalizations, like the Unicode normalization (which means they would also apply when not using the parse API); I’m not sure how I feel about that myself.

Another thing we could do, which would be a more “harmless” change, would be to improve the error message you get from the API for bad strings. Currently, it’s a generic “malformed value” message, but it should be possible to provide a custom message without too much trouble (ValidatorBuilders::getCommonStringValidators() uses a RegexValidator to check whitespace, which already supports a custom $errorCode as an optional constructor argument) – might even qualify as a good first task.

(Also, IMHO, the meaning of this task is by now so broad that it’s almost pointless. Initially it was about trailing whitespace actually being saved, which I assume hasn’t been an issue for years; at some point in between, it was about unicode normalization, which we now also do; now it feels like a generic catch-all for whitespace-related problems?)

Hi @Dominicbm, thank you for your comment. Are you using the API or the UI for your edits? In case you are using the API, please try to use the parse API (wbparsevalue) before making the actual edit. In case you are using the UI, could you please provide us with additional information?

@Lucas_Werkmeister_WMDE: I agree about the task. I want to close this task and asked you to find the right way to do it (e.g. create a more specific ticket for what still might be a problem). And yes, this should sit in the parse API and not in the edit API. I like your idea to improve the documentation and make this a first task. Thx!

Let's consider this to be an epic fail of problem management in Wikidata. This is now open for more than 7 years, and it is about trimming whitespaces. facepalm.
Everybody paralysed? The current proposal is to improve documentation and error messages, needless to say this is NOT the solution but an annoyance.
The UI is apparently not a first class interface to Wikidata.

Hi @Dominicbm, thank you for your comment. Are you using the API or the UI for your edits? In case you are using the API, please try to use the parse API (wbparsevalue) before making the actual edit. In case you are using the UI, could you please provide us with additional information?

@Lucas_Werkmeister_WMDE: I agree about the task. I want to close this task and asked you to find the right way to do it (e.g. create a more specific ticket for what still might be a problem). And yes, this should sit in the parse API and not in the edit API. I like your idea to improve the documentation and make this a first task. Thx!

Sorry to confuse things. I thought this was the same behavior in the UI, which is why I was saying it would be very difficult to expect a user to debug a message when characters like NBSP are invisible. To clarify, I do think I observe this issue in the UI, but I've only tried in SDC, not Wikidata. Maybe this issue should be split out if it's only in Commons. In any case, yes, I am personally encountering this in the API. My process was that I first experienced this in the API, and then I tried copying and pasting into the UI for debugging. That's where I say that the error message is pretty useless in the UI. This is me trying to enter in three different variations with trailing whitespace/NBSP, and all look the same with no clarity of what is the problem.

Screen Shot 2022-03-23 at 6.05.16 PM.png (990×1 px, 405 KB)

Regarding wbparsevalue, this is a useful suggestion that I was actually unaware of it. Perhaps that suggestion would be useful user education to also put in an improved error message.

I'm not sure I can utilize this for my use case, though. I run DPLA_bot, which is sometimes making millions of edits in a batch. I would rather not double the number of API requests because of a few hundred of those millions having illegal characters.

I get that we're discussing a couple of different things now. Sorry again if I contributed to making it messy.

Hi @Dominicbm, thank you for your additional input!

I do think I observe this issue in the UI, but I've only tried in SDC, not Wikidata.
Maybe this issue should be split out if it's only in Commons.

Yes, this should be specific to SDC. There seems to be a more specific task for that already, see T251380.

In any case, yes, I am personally encountering this in the API. (...)
That's where I say that the error message is pretty useless in the UI.

Thank you for explaining your process, this helped! I have created a new task based on your and Luca's input about improving the error message, see T304943.

Manuel claimed this task.

I am closing this task as the original scope seems solved already!