
Invalid Timevalues stored in database
Closed, Resolved (Public)

Description

Author: byrial

Description:
While parsing a database dump for Wikidata item [[Q441536]], I found this statement:

{"m":["value",570,"time",{"time":"+0000000 1998-10-21T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"}],"q":[],"g":"q441536$3568592E-BA6D-48A3-9FBE-C558F8A47415","rank":1,"refs":[[["value",143,"wikibase-entityid",{"entity-type":"item","numeric-id":328}]]]}

There is a space in the year after the first 7 digits. When I look at the wiki page for the item, the date is shown as "October 21, 1 BCE", so only the leading digits (all zeroes) up to the space are used. The correct year is of course 1998.

You will also see the space in the diff for the insertion of the statement: http://www.wikidata.org/w/index.php?title=Q441536&diff=48523994&oldid=48523992
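
For anyone who wants to reproduce the check against a dump, here is a minimal Python sketch that pulls the time string out of the statement quoted above and tests it for whitespace. The positional layout of the "m" entry (snak type, property id, data type, value) is assumed from this one example, and `extract_time` is a hypothetical helper name, not part of any Wikibase tooling.

```python
import json

# The statement quoted in this report, copied verbatim.
RAW_STATEMENT = '{"m":["value",570,"time",{"time":"+0000000 1998-10-21T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"}],"q":[],"g":"q441536$3568592E-BA6D-48A3-9FBE-C558F8A47415","rank":1,"refs":[[["value",143,"wikibase-entityid",{"entity-type":"item","numeric-id":328}]]]}'


def extract_time(statement_json):
    """Return the time string of a 'time' main snak, or None if there is none."""
    statement = json.loads(statement_json)
    snak = statement.get("m", [])
    # Assumed layout, taken from the example: [snak type, property id, data type, value]
    if len(snak) == 4 and snak[0] == "value" and snak[2] == "time":
        return snak[3]["time"]
    return None


time_string = extract_time(RAW_STATEMENT)
# Any whitespace inside the time string marks it as malformed.
has_whitespace = any(ch.isspace() for ch in time_string)
print(repr(time_string), has_whitespace)
```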


Version: unspecified
Severity: major

Details

Reference
bz49425

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:07 AM
bzimport set Reference to bz49425.
bzimport added a subscriber: Unknown Object (MLST).

byrial wrote:

There is also a space between the day and "T" in some cases, like this "+00000001940-10-5 T00:00:00Z" (taken from Q1825747, diff for insertion: http://www.wikidata.org/w/index.php?title=Q1825747&diff=48577191&oldid=45206040 )

That gives this error text on the wiki page for the item: The value does not comply with the property's definition.
The value's data value type "ununserializable" does not match the property's data type's data value type "time".

There is a new complete database dump in progress right now. When it is done, I can make a list of all occurrences of invalid time values.

This is caused by a bug in the bot making the edit. Please advise the bot's owner and block the bot if necessary.

Of course, Wikidata shouldn't accept broken dates. This is a known issue: time values are currently not properly validated by the API, see bug 49264. Since I6990983 is merged, this is technically fixed; I recommend a backport of the fix though, and a hotfix deployment, so let's keep the bug open until that is done.

Related URL: https://gerrit.wikimedia.org/r/67962 (Gerrit Change I6990983ef0c0cad7c9d4f271bdf803902b94230b)

byrial wrote:

There are 55 cases of malformed time values in the new database dump dated 2013-06-10. There is a list at http://www.wikidata.org/wiki/User:Byrial/Bad_time_values

Related URL: https://gerrit.wikimedia.org/r/68397 (Gerrit Change Ib3e7b16c203d08008d7465859af0e1e7f940db14)

https://gerrit.wikimedia.org/r/68397 (Gerrit Change Ib3e7b16c203d08008d7465859af0e1e7f940db14) | change ABANDONED [by Daniel Kinzler]

byrial wrote:

There were 15 new cases of malformed time values in the database dump of 2013-06-23, all inserted by the same bot on 2013-06-14 and 2013-06-15. They were values like:

+0000000or-02-22T00:00:00Z
+00000001603.-01-01T00:00:00Z
+00000001239-06-17/18T00:00:00Z
+00000001650)-06-19T00:00:00Z
+00000001601/1602-05-02T00:00:00Z
+00000001869-09-26 (disputed)T00:00:00Z
+0000000or-07-31T00:00:00Z
+0000000January-01-16T00:00:00Z
+00000001878''(''Some-05-12T00:00:00Z
+00000001587/8-01-12T00:00:00Z
+00000001766?-03-16T00:00:00Z
+00000001985-10-22 correct date is October 27, 1985T00:00:00Z
+0000000or-12-17T00:00:00Z
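
A rough way to flag values like these is a strict pattern match. The sketch below assumes the shape seen in the well-formed part of the examples in this report (a sign, an 11-digit zero-padded year, then `-MM-DDThh:mm:ssZ`); it is not the validation rule Wikibase itself applies, and `is_well_formed` is a hypothetical name.

```python
import re

# Assumed shape: sign, 11-digit zero-padded year, -MM-DD, "T", hh:mm:ss, "Z".
TIME_RE = re.compile(r"[+-]\d{11}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def is_well_formed(value):
    """True only if the whole string matches the assumed time format."""
    return TIME_RE.fullmatch(value) is not None


samples = [
    "+00000001998-10-21T00:00:00Z",      # what Q441536 should have contained
    "+0000000 1998-10-21T00:00:00Z",     # space inside the year
    "+00000001940-10-5 T00:00:00Z",      # space between day and "T"
    "+0000000or-02-22T00:00:00Z",        # text instead of a year
    "+00000001239-06-17/18T00:00:00Z",   # day range
    "+00000001766?-03-16T00:00:00Z",     # uncertainty marker
]

for value in samples:
    print("ok " if is_well_formed(value) else "BAD", value)
```

Only the first sample passes. The pattern checks shape only, so it does not reject impossible months or days.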

Note: with I67b9ae480c, this should not happen any more. I67b9ae480c provides input validation for time values.

It does not however enforce time format rules on TimeValue objects, so "bad" time values are not yet detected when found in the database, etc.
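
To illustrate the difference between the two levels, here is a hedged Python sketch of what enforcing the format on the value object itself could look like. `StrictTimeValue` and `load_stored` are hypothetical names and this is not the Wikibase TimeValue code; the 11-digit year pattern is assumed from the examples in this report.

```python
import re
from dataclasses import dataclass

# Assumed format: sign, 11-digit year, -MM-DD, "T", hh:mm:ss, "Z".
_TIME_RE = re.compile(r"[+-]\d{11}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


@dataclass(frozen=True)
class StrictTimeValue:
    """Hypothetical value object that refuses to exist with a malformed time."""
    time: str

    def __post_init__(self):
        # The check runs on every construction, not only on API input,
        # so values read back from storage are checked as well.
        if _TIME_RE.fullmatch(self.time) is None:
            raise ValueError(f"malformed time value: {self.time!r}")


def load_stored(time_string):
    """Build a value from stored data; return None when it is malformed."""
    try:
        return StrictTimeValue(time_string)
    except ValueError:
        return None


print(load_stored("+00000001998-10-21T00:00:00Z"))
print(load_stored("+00000001869-09-26 (disputed)T00:00:00Z"))
```

With the check in the constructor, a bad value already sitting in the database surfaces as soon as it is loaded, instead of only being stopped at the API.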

This bug should remain open until strict validation is implemented (see I72d6b6d890), but I don't think it's very urgent any more since it should now be impossible to enter bad values via the API.

Keep this open until I72d6b6d89 is merged.

It is merged now. We will check the dump one more time and then close this.

byrial wrote:

Either this is fixed now, or the bots are better. There were no malformed time values in the 2013-07-17 database dump.