
Data too long for column 'rev_comment'
Closed, ResolvedPublic

Description

Author: mohamed.m.k

Description:
C:\dumper>java -client -classpath mwdumper.jar;mysql-connector-java-3.1.14/mysql-connector-java-3.1.14-bin.jar org.mediawiki.dumper.Dumper "--output=mysql://127.0.0.1/wikiar?user=usr&password=pass" "--format=sql:1.5" "D:\arwiki-20080405-pages-articles.xml.bz2"
1,000 pages (25.65/sec), 1,000 revs (25.65/sec)
2,000 pages (20.713/sec), 2,000 revs (20.713/sec)
3,000 pages (24.385/sec), 3,000 revs (24.385/sec)
4,000 pages (24.352/sec), 4,000 revs (24.352/sec)
5,000 pages (25.293/sec), 5,000 revs (25.293/sec)
Exception in thread "main" java.io.IOException: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'rev_comment' at row 809

at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
at org.mediawiki.dumper.Dumper.main(Unknown Source)

mysql->5.0.50-enterprise-gpl-nt

C:\dumper>java -showversion
java version "1.6.0_04"
Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)


Version: unspecified
Severity: critical

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:05 PM
bzimport set Reference to bz13721.

Can you double-check that the proper encoding's being used?

The most compatible case is probably to use the binary schema. You may or may not have troubles with other modes.

mohamed.m.k wrote:

I have tried all three:

  1. Backwards-compatible UTF-8
  2. Experimental MySQL 4.1/5.0 UTF-8
  3. Experimental MySQL 4.1/5.0 binary

But I get exactly the same error each time (I drop the DB and then reinstall MediaWiki). Does mwdumper have an encoding or schema setting that I should change?

mohamed.m.k wrote:

I found that the error isn't caused by mwdumper but by the data dumps. The dump contains comments that are longer than the column type allows. When I changed rev_comment from tinyblob to blob, it imported without errors. Should this be changed in MediaWiki?

jcsahnwaldt wrote:

As a workaround until the dump is fixed, mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary. I implemented this fix and checked it in at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia?view=rev&revision=1771 . Seems to fix that problem for me. Feel free to copy that code back to mediawiki if you want.

Ahhhh ok I think I see the base issue -- if a 2-byte or 3-byte char is cut off at the 255-byte boundary when stored, it becomes an invalid char. The XML dump outputter runs UTF-8 validation and turns the bad char into a valid U+FFFD ... which is 3 bytes of UTF-8, so the comment ends up over the 255-byte limit again.
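The arithmetic behind this can be reproduced in a few lines of Java. This is only an illustrative sketch, not code from MediaWiki or mwdumper: the class and method names are made up, and it assumes the repair step behaves like Java's own UTF-8 decoder, which replaces a malformed sequence with U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncationDemo {
    // Hypothetical helper: cut the UTF-8 bytes of `comment` at `limit` bytes,
    // then "repair" the result by decoding and re-encoding it.
    // Returns the byte length after the repair.
    static int lengthAfterCutAndRepair(String comment, int limit) {
        byte[] full = comment.getBytes(StandardCharsets.UTF_8);
        byte[] cut = Arrays.copyOf(full, Math.min(limit, full.length));
        // Decoding replaces the dangling lead byte with U+FFFD.
        String repaired = new String(cut, StandardCharsets.UTF_8);
        return repaired.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 254; i++) {
            sb.append('a');          // 254 one-byte characters
        }
        sb.append('\u0627');         // ARABIC LETTER ALEF, 2 bytes in UTF-8
        // 254 bytes + 1 dangling byte are stored; the repair yields 254 + 3 bytes.
        System.out.println(lengthAfterCutAndRepair(sb.toString(), 255)); // prints 257
    }
}
```

So a comment that was cut to exactly 255 bytes in the database can come back out of the dump as 257 bytes.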

Yeah, this should be fixed in our DB and MediaWiki should be smarter about truncation, but in the meantime it should be easy to make mwdumper smarter for this too.
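A byte-aware truncation along these lines could look roughly like the sketch below. It is not the code from the attachment or from the dbpedia revision mentioned above; the class name `CommentTruncator` and the method `truncateUtf8` are hypothetical. The idea is to count UTF-8 bytes per code point and stop before the limit would be exceeded, so no character is ever split.

```java
import java.nio.charset.StandardCharsets;

public class CommentTruncator {
    // Hypothetical helper: return the longest prefix of `s` whose UTF-8
    // encoding is at most `maxBytes` bytes, cutting only between code points.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 length of this code point. An unpaired surrogate is
            // counted as 3 bytes, which can only overestimate the real size.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 254; i++) {
            sb.append('a');
        }
        sb.append('\u0627').append('b'); // 2-byte char straddles the limit
        String cut = truncateUtf8(sb.toString(), 255);
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // prints 254
    }
}
```

The result may be a byte or two shorter than the limit, but it always decodes cleanly and never grows when re-encoded.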

Created attachment 7263
truncate comment at 255 Bytes

It also works when you append

&jdbcCompliantTruncation=false

to the --output parameter.
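For anyone building the --output value programmatically, the small sketch below shows one way to append the option. The helper `withParam` is hypothetical and not part of mwdumper; `jdbcCompliantTruncation` itself is a MySQL Connector/J connection property. As far as I know, the Connector/J documentation says the property has no effect when the server's SQL mode includes STRICT_TRANS_TABLES, so it may not help on every setup.

```java
public class OutputUrl {
    // Hypothetical helper: append name=value to a URL, using '?' for the
    // first parameter and '&' for any later one.
    static String withParam(String url, String name, String value) {
        char sep = url.indexOf('?') >= 0 ? '&' : '?';
        return url + sep + name + "=" + value;
    }

    public static void main(String[] args) {
        String base = "mysql://127.0.0.1/wikiar?user=usr&password=pass";
        // Pass the result to mwdumper as "--output=<result>".
        System.out.println(withParam(base, "jdbcCompliantTruncation", "false"));
    }
}
```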

I have also added a patch that truncates the comment, based on the implementation by Christopher Sahnwaldt (comment 4).


jcsahnwaldt wrote:

Minor gripe: the patch uses String.isEmpty(), which was only added in JDK 1.6. Maybe use String.length() == 0 instead, so MWDumper still compiles under 1.5.

This doesn't suddenly become a blocker after 3 years of existence... :)

(In reply to comment #9)
Why isn't it? mwdumper can't be used to import dumps because of this bug.

(In reply to comment #11)
Unfortunately it works only for the jdbc connector, and it's not a solution for the sql output, is it?

(In reply to comment #12)

> (In reply to comment #11)
> Unfortunately it works only for the jdbc connector, and it's not a solution for the sql output, is it?

Yes, that is true. For the raw sql this is not a solution.

(In reply to comment #11)

> See comment 6 for a workaround

Didn't work for me. Still gives Data too long for column 'rev_comment'.

Change Ic078f6ee is merged now.

hashar added subscribers: Yves.renier, hashar.

The change "Truncate comment at 255 Bytes" (https://gerrit.wikimedia.org/r/c/mediawiki/tools/mwdumper/+/30932) never got merged and was abandoned at some point. It is now being proposed again as https://gerrit.wikimedia.org/r/c/mediawiki/tools/mwdumper/+/720998/ by @Yves.renier

Change 720998 had a related patch set uploaded (by Hashar; author: Yves.renier):

[mediawiki/tools/mwdumper@master] Fix the truncating of UTF8 characters

https://gerrit.wikimedia.org/r/720998

Change 720998 merged by jenkins-bot:

[mediawiki/tools/mwdumper@master] Fix the truncating of UTF8 characters

https://gerrit.wikimedia.org/r/720998

Tentatively that is fixed in the master branch after restoring an old patch :) Thank you @Yves.renier