
Add characters changed per revision for stub and full article dumps
Closed, Resolved (Public)

Description

Edit analytics needs the change in character count (the delta) recorded for each revision, in both the stub and full article dumps.
Rob suggested that using PHP's UTF-8 support (e.g. just calling mb_strlen($buffer, 'UTF-8')) to quickly dispose of the multi-byte problem would give us a fairly accurate character count. Counting characters rather than bytes will allow us to compare edit sizes across different languages.

If there are serious performance concerns then we can fall back to byte count.
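
As a rough illustration of why the two counts differ, the sketch below compares character and byte lengths and derives a per-revision delta from each. It is written in Python rather than PHP purely for illustration: len(text) counts code points, which is comparable to what mb_strlen($buffer, 'UTF-8') reports, while the length of the UTF-8 encoding is the byte count. The function names and the sample revisions are hypothetical and are not taken from the dump code.

```python
def char_len(text: str) -> int:
    # Number of Unicode code points in the text.
    return len(text)


def byte_len(text: str) -> int:
    # Number of bytes once the text is encoded as UTF-8.
    return len(text.encode("utf-8"))


def deltas(revisions, measure):
    """Change in length of each revision relative to the previous one.

    The first revision is compared against an empty page (length 0).
    """
    result = []
    previous = 0
    for text in revisions:
        current = measure(text)
        result.append(current - previous)
        previous = current
    return result


# Hypothetical revision history of one page.
revisions = ["Hello", "Hello, мир", "Hello"]

char_deltas = deltas(revisions, char_len)
byte_deltas = deltas(revisions, byte_len)
```

With this sample, the second revision adds five characters but eight bytes, because each Cyrillic letter takes two bytes in UTF-8. That gap is what makes byte deltas harder to compare across languages.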


Version: unspecified
Severity: enhancement

Details

Reference
bz26563

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:11 PM
bzimport set Reference to bz26563.

Byte count will be way easier, and might happen sooner than character count, since we already have revision length in the database. Ariel asks that we update the version number of the dumps if that happens, so users of the dumps can correlate contents to versions.

The code to modify is here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Export.php?view=markup

To update the version, we need to update schemaVersion().

In order for this to get into production, it of course needs to get deployed to the production branch.

Ariel doesn't have time to implement this right now, so an interested volunteer would be appreciated.

Committed r79856 into trunk. I did bytes because characters was a little more involved. I added byte counts to both stub and full dumps.

I thought about not including the byte count in the full dump because it's pretty trivial to get that count from most XML parsers. However, it is nice to have the byte count that doesn't include any XML escaping introduced by the dump, so I left it in.
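
To make the escaping point concrete, here is a small sketch of how the length of the text as it appears inside the XML can differ from the length of the original revision text. It uses Python's standard xml.sax.saxutils.escape as a stand-in; the exact escaping applied by the dump writer may differ, and the helper names and sample string are hypothetical.

```python
from xml.sax.saxutils import escape


def raw_byte_len(text: str) -> int:
    # Byte length of the revision text itself.
    return len(text.encode("utf-8"))


def escaped_byte_len(text: str) -> int:
    # Byte length after &, < and > are replaced by XML entities.
    return len(escape(text).encode("utf-8"))


sample = "a < b & c"
raw = raw_byte_len(sample)
escaped = escaped_byte_len(sample)
```

A consumer that measures the escaped form without decoding entities first would overcount, which is one reason to carry the unescaped byte count in the dump itself.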

I'll document how I'd go about characters, just in case anyone wants to tackle it. The JOIN of the "text" table in WikiExporter::dumpFrom would have to be performed even in the case of a stub dump. WikiExporter()->text would need to be passed as a new parameter into XMLDumpWriter::writeRevision(). The stub logic in XMLDumpWriter::writeRevision() would need to be changed to use the new parameter to see if we're dealing with a stub dump, rather than inferring it from the absence of text. Finally, mb_strlen($foo, 'UTF-8') could be called. It's not a ton of code (probably 10-15 lines of code change, tops) but that's less likely to get fast-tracked to production.

(In reply to comment #2)

Wouldn't this cause stub dumps to load the text of each revision, significantly slowing down their generation?

Exactly. What we want to do is follow the same procedure we did for bytes: add a field in the revision table, automatically populate it for new revs, run a job to populate for old revs.
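
The three steps in that procedure (add a field, populate it for new revisions, backfill old ones in a job) could look roughly like the sketch below. It uses an in-memory SQLite database and Python only to show the shape of the work. The single-table layout, the column name rev_char_len and all function names are assumptions made for this example; MediaWiki's real schema stores revision text separately.

```python
import sqlite3


def add_char_len_column(conn):
    # Step 1: add the new field; existing rows get NULL.
    conn.execute("ALTER TABLE revision ADD COLUMN rev_char_len INTEGER")


def insert_revision(conn, rev_id, text):
    # Step 2: new revisions are populated at write time.
    conn.execute(
        "INSERT INTO revision (rev_id, rev_text, rev_char_len) VALUES (?, ?, ?)",
        (rev_id, text, len(text)),
    )


def backfill(conn, batch_size=100):
    # Step 3: a job fills in old revisions in small batches.
    total = 0
    while True:
        rows = conn.execute(
            "SELECT rev_id, rev_text FROM revision "
            "WHERE rev_char_len IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "UPDATE revision SET rev_char_len = ? WHERE rev_id = ?",
            [(len(text), rev_id) for rev_id, text in rows],
        )
        conn.commit()
        total += len(rows)
    return total


# Hypothetical data: two old revisions exist before the column is added.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text TEXT)")
conn.executemany(
    "INSERT INTO revision (rev_id, rev_text) VALUES (?, ?)",
    [(1, "Hello"), (2, "Hello, мир")],
)
add_char_len_column(conn)
insert_revision(conn, 3, "new text")
updated = backfill(conn, batch_size=1)
lengths = conn.execute(
    "SELECT rev_id, rev_char_len FROM revision ORDER BY rev_id"
).fetchall()
```

Once the count is stored, a stub dump can read it from the revision row without loading any text, which avoids the slowdown raised above.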

Even more reason to punt on character count. :) If we ever add character count to the database, we really ought to address bug 21860 (checksum per rev) while we're at it.

This is fixed in r79856, and will be deployed as part of 1.17

Maybe we should include the delta byte count or cumulative number of bytes in the database to enable feature requests such as:

The updated schema never got published on mediawiki.org: bug 22750

This will break anything trying to automatically run XSD validation due to being unable to fetch the schema file.

I think we can close this bug now, can't we?