[Task] Apply normalization to string values in statements
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Raymond
	Mar 9 2013, 9:04 AM

Description

I suggest to trim leading/trailing spaces around statements of string type, and apply unicode normalization.

Due to a c&p error I added some trainling spaces at a VIAF statement and had to remove them with an extra edit, see URL.

Example URL: https://www.wikidata.org/w/index.php?title=Q6486700&diff=10051134&oldid=10051107

Other issues:

CR
unprintable characters (if not included in unicode normalization)

Details

Reference: bz45925

Related Objects
Search...

Status	Assigned	Task
Open	None	T251380 Trim text input before saving on Structured Data on Commons string input
Resolved	Manuel	T47925 [Task] Apply normalization to string values in statements
Resolved	daniel	T104873 [Task] wbparsevalue should support StringValues, instead of relying on UnknownValue
Resolved	daniel	T104874 Implement standalone StringValueParser

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

• bzimport set Reference to bz45925.Nov 22 2014, 1:32 AM

• bzimport added a subscriber: Unknown Object (MLST).

Raymond created this task.Mar 9 2013, 9:04 AM

The "malformed input" error is thrown by the RegexValidator in function buildStringType from WikibaseDataTypeBuilders. It often happens when copy-pasting text from other sources (see http://www.wikidata.org/wiki/Wikidata:Project_chat#Error when adding commonscat) and it is made worse by bug 63301 which sometimes adds newlines when pressing return to save the claim, thus triggering the error.

The trimming was previously done in ValueView but it has since been removed (see http://github.com/wmde/ValueView/commit/0a8350999bb1ee028db9487286173aee0b20640f). We could use the StringNormalizer class to trim the string and also remove incomplete UTF-8 sequences (see related bug 50486), however I'm not familiar with the data value processing code so I'm not sure where is the correct place to do it.

Sorry, correct link is https://www.wikidata.org/wiki/Wikidata:Project_chat#Error_when_adding_commonscat

Just to be clear: ValueView did not trim the value, it just checked for empty or whitespace-only strings.

At the time of writing, the current behaviour seems to be: When a value with a leading space character is submitted, the back-end parser will return a "malformed input" error.
I wonder whether silent trimming should rather be implemented in the back-end parser instead of in the front-end.
Can some authority please decide on that?

IIRC When I discussed this with Daniel his comment was that the parser shouldn't do anything to the string but only parse it. That makes sense to me.
If we only do it in the frontend then these could still get in via the API for example. That seems sub-optimal.

Daniel: Can you chime in on the pros and cons of doing this in the backend please?

I said that? Hm, then I have to disagree with myself there... The parsevalue API module (resp. StringValueParser) should trim input (and apply utf8 normalization).

Hm... apparently, StringValueParser does not create StringValues from strings. We seem to use the NullParser for this, which returns an UnknownValue. Whatever.

So, let's have an actual StringValueParser that applies normalization.

Lydia_Pintscher added a project: Wikidata.Dec 1 2014, 2:29 PM

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:33 PM

Lydia_Pintscher moved this task from incoming to needs discussion or investigation on the Wikidata board.Dec 27 2014, 4:40 PM

Lydia_Pintscher added a project: patch-welcome.

Lydia_Pintscher set Security to None.

Lydia_Pintscher removed a subscriber: Wikidata-bugs.

Lydia_Pintscher moved this task from needs discussion or investigation to ready to go on the Wikidata board.Mar 17 2015, 10:02 AM

Ricordisamoa subscribed.Jul 27 2015, 7:25 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 27 2015, 7:25 AM

thiemowmde renamed this task from Trim spaces around statements of string type to [Task] Trim spaces around statements of string type.Aug 13 2015, 3:19 PM

thiemowmde lowered the priority of this task from High to Medium.

thiemowmde added a subtask: T104873: [Task] wbparsevalue should support StringValues, instead of relying on UnknownValue.

thiemowmde updated the task description. (Show Details)

thiemowmde added a project: DataValues.

Patches for review:

https://github.com/DataValues/Common/pull/35

wbparsevalue now supports datatype=string. The UI should start using it. In fact, the UI should use the same code to parse, and at the same time validate, all data values. All special case code for string values should go away.

Bumping to high because of implications missing unicode normalization. This could potentially break JSON dumps.

daniel renamed this task from [Task] Trim spaces around statements of string type to [Task] Apply normalization to string values in statements.Aug 27 2015, 3:11 PM

daniel updated the task description. (Show Details)

daniel reopened subtask T104873: [Task] wbparsevalue should support StringValues, instead of relying on UnknownValue as Open.Sep 7 2015, 1:42 PM

• Jonas closed subtask T104873: [Task] wbparsevalue should support StringValues, instead of relying on UnknownValue as Resolved.Sep 11 2015, 12:10 PM

matej_suchanek reopened this task as Open.Oct 2 2016, 2:34 PM

matej_suchanek closed this task as a duplicate of T147037: When Strings that end in "\n" get entered into the string interface Wikidata should simply strip the "\n" away instead of throwing an error.

matej_suchanek merged a task: T147037: When Strings that end in "\n" get entered into the string interface Wikidata should simply strip the "\n" away instead of throwing an error.

matej_suchanek added subscribers: ChristianKl, Esc3300.

From what I wrote in T147037:

I think it's a frequent problem when strings start or end with odd characters or even include CR in the middle.

It happens a lot when trying to create statements for article titles. Oddly, I think the same text for labels tends to work.

Instead of rejecting it completely, it would be helpful if a cleaned version was offered for saving instead.

Esc3300 awarded a token.Oct 2 2016, 3:23 PM

Esc3300 updated the task description. (Show Details)

ChristianKl awarded a token.Oct 2 2016, 3:50 PM

Please distinguish between

a) normalization applied by the API, when receiving a JASON structure representing a sting value.
b) normalization applied by the API when parsing user input, and returning a string value
c) normalization applied by the JavaScript widget before sending user input to the API

I believe we should do (b), but not (a). We can also do (c), but with (b) in place, this seems redundant.

The reason I believe we should not do (a) is that we are not dealing with user input then, but with a DataValue object. DataValue objects are either valid or they are not, we should not guess intent. Clients that supply DataValues should take care to apply any normalization they desire.

• AnotherLadsgroup assigned this task to thiemowmde.Jul 15 2017, 11:11 AM

• AnotherLadsgroup added a project: Wikidata-Former-Sprint-Board.

• AnotherLadsgroup moved this task from Proposed to Review on the Wikidata-Former-Sprint-Board board.

thiemowmde moved this task from Review to Proposed on the Wikidata-Former-Sprint-Board board.Jul 19 2017, 10:04 AM

matej_suchanek merged a task: T168810: Chomp leading and trailing spaces in Property:P742.Sep 2 2017, 3:25 PM

matej_suchanek added a subscriber: Billinghurst.

Lydia_Pintscher removed a project: Wikidata-Former-Sprint-Board.Nov 21 2017, 8:54 AM

matej_suchanek merged a task: T186411: Tailing whitspaces in string-values fail with confusing error message.Feb 4 2018, 9:18 AM

matej_suchanek added a subscriber: MichaelSchoenitzer.

MichaelSchoenitzer awarded a token.Apr 7 2018, 2:56 PM

MichaelSchoenitzer mentioned this in T188802: String values with leading/trailing whitespace are (inconsistently) reported as invalid.Apr 7 2018, 2:59 PM

thiemowmde removed thiemowmde as the assignee of this task.Sep 4 2018, 12:02 PM

thiemowmde raised the priority of this task from High to Needs Triage.

thiemowmde removed a project: Patch-For-Review.

Herzi.Pinki subscribed.Apr 20 2020, 12:54 PM

Lea_Lacroix_WMDE mentioned this in T250695: Improve the error message in case of a blank space in a string.Apr 20 2020, 1:50 PM

Bugreporter added a parent task: T251380: Trim text input before saving on Structured Data on Commons string input.May 27 2020, 10:09 PM

Bugreporter merged a task: T261071: Malformed input error on text which is not malformed.Aug 23 2020, 11:46 PM

Bugreporter added a subscriber: Nikki.

This has been known about for SEVEN YEARS?? Why do I even bother reporting anything

Nikki unsubscribed.Aug 24 2020, 12:38 PM

Because unfortunately there are too many to not loose track of some important ones :(

Lydia_Pintscher moved this task from Incoming to Needs Wikidata PM Work on the Wikidata-Campsite board.Aug 24 2020, 1:08 PM

Addshore mentioned this in T261071: Malformed input error on text which is not malformed.Sep 3 2020, 10:41 AM

The only real option that I see working if we want to allow whitespace values in the UI is....

Have a list of whitespace chars that we want to trim
If a string is made up ONLY of these whitespace chars, do not trim
If a string has these whitespace chars at the beginning or end, and also has other chars, trim them?

If the cases of having whitespace at the start or end is all covered by the “only whitespace” case, then we can just add this validation in the API, and if commons calls this api (as it should) then this would fix both issues. T251380 and this

I suggest two separate data types for strings that should be trimmed (which should be the norm) and "binary" strings that aren't.

Lucas_Werkmeister_WMDE subscribed.Aug 3 2021, 2:09 PM

Manuel subscribed.Aug 3 2021, 2:09 PM

Bugreporter mentioned this in T251480: Normalize pagenames/filenames on save in Wikibase.Aug 23 2021, 2:27 PM

Nikki mentioned this in T251380: Trim text input before saving on Structured Data on Commons string input.Feb 1 2022, 6:16 PM

I've run into this a lot recently. It can be difficult when you are importing from a data source that may have less clean data, and tracking down something like an illegal non-breaking space character in a value is a major pain, especially when it can't even be seen in the UI message. These sorts of characters are already normalized on input in other places in MediaWiki, so it makes sense to just silently normalize rather than give the user a vague error they may or may not be able to fix, or expect them to do all their own normalization on data sources before importing.

Hi @Lucas_Werkmeister_WMDE, how about "illegal non-breaking space characters" (see Dominicbm's comment)? Shouldn't these already be normalized? (wbparsevalue shows me a warning, but it seems to do the job). Can you confirm this or are we also not using the parse API for relevant sting inputs in the Wikibase UI?

Our UI uses the parse API for strings (since T261071#6493792), but if I understand correctly, @Dominicbm is using the API directly, not editing via the UI. We could decide to do more of these “cleanups” as normalizations, like the Unicode normalization (which means they would also apply when not using the parse API); I’m not sure how I feel about that myself.

Another thing we could do, which would be a more “harmless” change, would be to improve the error message you get from the API for bad strings. Currently, it’s a generic “malformed value” message, but it should be possible to provide a custom message without too much trouble (ValidatorBuilders::getCommonStringValidators() uses a RegexValidator to check whitespace, which already supports a custom $errorCode as an optional constructor argument) – might even qualify as a good first task.

(Also, IMHO, the meaning of this task is by now so broad that it’s almost pointless. Initially it was about trailing whitespace actually being saved, which I assume hasn’t been an issue for years; at some point in between, it was about unicode normalization, which we now also do; now it feels like a generic catch-all for whitespace-related problems?)

Hi @Dominicbm, thank you for your comment. Are you using the API or the UI for your edits? In case you are using the API, please try to use the parse API (wbparsevalue) before making the actual edit. In case you are using the UI, could you please provide us with additional information?

@Lucas_Werkmeister_WMDE: I agree about the task. I want to close this task and asked you to find the right way to do it (e.g. create a more specific ticket for what still might be a problem). And yes, this should sit in the parse API and not in the edit API. I like your idea to improve the documentation and make this a first task. Thx!

Let's consider this to be an epic fail of problem management in Wikidata. This is now open for more than 7 years, and it is about trimming whitespaces. facepalm.
Everybody paralysed? The current proposal is to improve documentation and error messages, needless to say this is NOT the solution but an annoyance.
The UI is apparently not a first class interface to Wikidata.

thiemowmde unsubscribed.Mar 23 2022, 9:59 PM

In T47925#7800934, @Manuel wrote:

Hi @Dominicbm, thank you for your comment. Are you using the API or the UI for your edits? In case you are using the API, please try to use the parse API (wbparsevalue) before making the actual edit. In case you are using the UI, could you please provide us with additional information?

@Lucas_Werkmeister_WMDE: I agree about the task. I want to close this task and asked you to find the right way to do it (e.g. create a more specific ticket for what still might be a problem). And yes, this should sit in the parse API and not in the edit API. I like your idea to improve the documentation and make this a first task. Thx!

Sorry to confuse things. I thought this was the same behavior in the UI, which is why I was saying it would be very difficult to expect a user to debug a message when characters like NBSP are invisible. To clarify, I do think I observe this issue in the UI, but I've only tried in SDC, not Wikidata. Maybe this issue should be split out if it's only in Commons. In any case, yes, I am personally encountering this in the API. My process was that I first experienced this in the API, and then I tried copying and pasting into the UI for debugging. That's where I say that the error message is pretty useless in the UI. This is me trying to enter in three different variations with trailing whitespace/NBSP, and all look the same with no clarity of what is the problem.

Screen Shot 2022-03-23 at 6.05.16 PM.png (990×1 px, 405 KB)

Regarding wbparsevalue, this is a useful suggestion that I was actually unaware of it. Perhaps that suggestion would be useful user education to also put in an improved error message.

I'm not sure I can utilize this for my use case, though. I run DPLA_bot, which is sometimes making millions of edits in a batch. I would rather not double the number of API requests because of a few hundred of those millions having illegal characters.

I get that we're discussing a couple of different things now. Sorry again if I contributed to making it messy.

Manuel mentioned this in T304943: Make Wikibase error message "Malformed input" more meaningful .Mar 29 2022, 1:08 PM

Hi @Dominicbm, thank you for your additional input!

I do think I observe this issue in the UI, but I've only tried in SDC, not Wikidata.
Maybe this issue should be split out if it's only in Commons.

Yes, this should be specific to SDC. There seems to be a more specific task for that already, see T251380.

In any case, yes, I am personally encountering this in the API. (...)
That's where I say that the error message is pretty useless in the UI.

Thank you for explaining your process, this helped! I have created a new task based on your and Luca's input about improving the error message, see T304943.

I am closing this task as the original scope seems solved already!

	F35021645: Screen Shot 2022-03-23 at 6.05.16 PM.png
	Mar 23 2022, 10:15 PM

[Task] Apply normalization to string values in statementsClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

[Task] Apply normalization to string values in statements
Closed, ResolvedPublic
Actions

Related Objects
Search...