Length of dump text and length field in API do not match
Closed, Declined (Public)

Description

The length of the dump text and the length field in the API do not match (even after UTF-8 encoding) because of inconsistent line break characters and leading/trailing whitespace.
Note that this results in false negatives when detecting identity reverts.

Current workaround:
Strip whitespace from the beginning and end of the text and replace every "\r\n" (Windows line break) with "\n". With this approach, consistency between the API and the dump is acceptable (99%) but still imperfect.
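
A minimal sketch of the workaround above, in Python. The function names are hypothetical and not part of any MediaWiki tool, and SHA-1 is only an assumed choice of checksum for comparing revisions; the idea is to normalize both the API text and the dump text before measuring or hashing them.

```python
import hashlib


def normalize_revision_text(text: str) -> str:
    """Apply the workaround: trim both ends, then unify Windows line breaks."""
    return text.strip().replace("\r\n", "\n")


def utf8_length(text: str) -> int:
    """Length in bytes after UTF-8 encoding (not the number of characters)."""
    return len(text.encode("utf-8"))


def normalized_checksum(text: str) -> str:
    """Checksum of the normalized text, usable for identity-revert comparison."""
    normalized = normalize_revision_text(text)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical example: the same revision as seen in a dump and in the API.
dump_text = "First line\r\nSecond line\n\n"
api_text = "First line\nSecond line"

same_revision = normalized_checksum(dump_text) == normalized_checksum(api_text)
```

Without normalization the two strings differ in length and checksum, which is how an identity revert would be missed.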


Version: unspecified
Severity: minor

Details

Reference
bz27773

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:27 PM)
bzimport set Reference to bz27773.

Are there any specific examples?

Are whitespace mismatches due to problems parsing the way whitespace is encoded in the XML, or due to the XML dumps actually containing incorrect whitespace?

(The dumps may well contain incorrect whitespace, most likely due to inconsistencies in parsing the previous whitespace when doing multiple passes combining text from previous dumps with new stub dumps, etc.)

(In reply to comment #1)

Are there any specific examples?

Are whitespace mismatches due to problems parsing the way whitespace is encoded
in the XML, or due to the XML dumps actually containing incorrect whitespace?

Do the XML dumps use the xml:space="preserve" attribute?

I would like a specific page ID, revision ID and dump file to look at, if someone can point me to one.

Anarchism(12)
RevisionId: 233194

From the 2010-01-30 XML dump, at the end of revision 233194 (notice the line breaks before the closing </text> tag):

[...]
/Talk &lt;br&gt;

/Todo &lt;br&gt;

[[Anarchy/Talk]] [http://www.wikipedia.com/wiki.cgi?action=history&amp;id=Anarchy Anarchy History] (The content of Anarchy and Anarchism have since been merged into this version)

</text>

From the API (http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=233194&rvprop=content&format=jsonfm) (notice that the string ends right after the last non-whitespace character)

{
	"query": {
		"pages": {
			"12": {
				"pageid": 12,
				"ns": 0,
				"title": "Anarchism",
				"revisions": [
					{
						"*": "''Anarchism'' is <removed most of the text here -Aaron Halfaker> (The content of Anarchy and Anarchism have since been merged into this version)"
					}
				]
			}
		}
	}
}

(Yes, the XML files have <text xml:space="preserve"> in them.)

I had a look at the output we get from ExternalStore::fetchFromURL()

The text we get back has a newline after the final parenthesis.

That text is 8884 bytes long, which matches the rev_len recorded in the revision table and in the XML dump file. When I apply the various conversions for & < > " and strip the ^Ms (carriage returns), I get the byte count of the text entry in the XML file: 8930.
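
The XML-side calculation can be sketched roughly as follows. This assumes the dump escapes exactly the four characters named above and that carriage returns are simply dropped; the helper name is hypothetical and the real dump writer may behave differently.

```python
from xml.sax.saxutils import escape


def xml_text_byte_length(stored_text: str) -> int:
    """Estimate the byte length of a revision as it appears inside <text>."""
    # Drop carriage returns (the ^M characters).
    without_cr = stored_text.replace("\r", "")
    # escape() handles & < >; the extra mapping adds the double quote.
    escaped = escape(without_cr, {'"': "&quot;"})
    return len(escaped.encode("utf-8"))


# Small illustrative input, not the real revision text.
sample = 'a & "b"\r\n'
sample_length = xml_text_byte_length(sample)
```

Each `&` grows by 4 bytes, each `<` or `>` by 3, and each `"` by 5, so the escaped length is always at least the stored length minus the number of carriage returns removed.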

When I do the same conversions for the JSON format (for " \r \n and /), I come up with 9160 bytes, one byte longer than the actual JSON output text, 9159. My conclusion is that the json formatter or perhaps generally the API loses that newline at the end.
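
A comparable sketch for the JSON side. It assumes the API output escapes forward slashes as `\/` and leaves non-ASCII characters unescaped; both are assumptions about the formatter at the time, and the helper name is hypothetical.

```python
import json


def json_text_byte_length(stored_text: str) -> int:
    """Estimate the byte length of a revision as a JSON string body."""
    # json.dumps escapes quotes, backslashes and control characters such as
    # \r and \n; [1:-1] drops the surrounding double quotes.
    encoded = json.dumps(stored_text, ensure_ascii=False)[1:-1]
    # Python leaves "/" alone, so apply the assumed "\/" escaping by hand.
    encoded = encoded.replace("/", "\\/")
    return len(encoded.encode("utf-8"))


# Small illustrative input, not the real revision text.
sample = 'a/"b"\r\n'
sample_length = json_text_byte_length(sample)
```

Comparing this estimate with the length of the string actually returned by the API is one way to see whether a character went missing at the end.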

Nemo_bis added subscribers: Anomie, Nemo_bis.

Note that this results in false negatives when detecting identity reverts

AFAICS this was resolved downstream, or at least https://pythonhosted.org/mediawiki-utilities/lib/reverts.html#mw-lib-reverts doesn't mention any such issue.

My conclusion is that the json formatter or perhaps generally the API loses that newline at the end.

Then, if still current/interesting, it should be reported as an API bug. Cc @Anomie.

My conclusion is that the json formatter or perhaps generally the API loses that newline at the end.

Then, if still current/interesting, it should be reported as an API bug. Cc @Anomie.

The one example given, https://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=233194&rvprop=content&format=jsonfm, does not match the reported output with respect to trailing newlines. The page content from the JSON object is currently 8884 bytes long.
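
For anyone who wants to repeat this check, a rough sketch: it assumes the response was fetched with format=json (jsonfm is the HTML-wrapped debugging variant) and has the structure shown earlier in this task. The function names are hypothetical, and the response below is a made-up stand-in, not the real revision.

```python
import json


def revision_content(response: dict, page_id: str) -> str:
    """Pull the wikitext out of an action=query&prop=revisions response."""
    return response["query"]["pages"][page_id]["revisions"][0]["*"]


def describe_content(content: str) -> tuple:
    """Return (UTF-8 byte length, whether the text ends with a newline)."""
    return len(content.encode("utf-8")), content.endswith("\n")


# Stand-in response body with a trailing newline in the content.
raw = '{"query": {"pages": {"12": {"revisions": [{"*": "some text\\n"}]}}}}'
byte_length, has_trailing_newline = describe_content(
    revision_content(json.loads(raw), "12")
)
```

If the byte length equals rev_len and the trailing newline is present, the API and the stored text agree for that revision.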