Import strips angle brackets on some installations (libxml2 entity bug)
Closed, Resolved, Public

Description

When exporting, the greater-than and less-than signs are turned into HTML entities, but the importer doesn't seem to account for this. Importing via either interwiki or XML upload leaves "ref" and "/ref" all over the place, missing their angle brackets.

Marking as CRITICAL as it's a blocker to export/import.


Version: unspecified
Severity: normal

Details

Reference
bz16554

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:28 PM
bzimport set Reference to bz16554.
bzimport added a subscriber: Unknown Object (MLST).

Works just fine for me. They are turned into entities on export -- which is correct -- and reinterpreted back to their original values on import -- which is correct.
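
For readers following along, the expected round trip can be sketched in a few lines of PHP. This is only an illustration, not MediaWiki's export or import code: the sample wikitext is made up, and the last step merely simulates the reported symptom by deleting the entities instead of decoding them.

```php
<?php
// Illustration only: sample text and the "broken" simulation are assumptions.
$wikitext = 'Fact.<ref>Source</ref>';

// Export side: markup characters are escaped as entities.
$exported = htmlspecialchars( $wikitext, ENT_QUOTES, 'UTF-8' );

// Import side (healthy parser): entities are decoded back to the original.
$imported = htmlspecialchars_decode( $exported, ENT_QUOTES );

// Simulated symptom: entities are dropped rather than decoded.
$broken = preg_replace( '/&(?:lt|gt|amp|quot);/', '', $exported );

echo $exported, "\n"; // Fact.&lt;ref&gt;Source&lt;/ref&gt;
echo $imported, "\n"; // Fact.<ref>Source</ref>
echo $broken, "\n";   // Fact.refSource/ref
```

The last line matches the "ref" and "/ref" debris described in the report.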

Not so sure... I ran an interwiki import of "Test" from enwiki to my localhost last night and ended up with all < and > stripped, exposing an HTML comment. This is vanilla trunk, pretty standard config.

Works for me with both transwiki and file import. Downgrading severity. Need more information about the installations where this occurs.

Doesn't seem to happen on my win32 box (no extensions, no tidy), but continues to happen on my CentOS machine (identical LocalSettings). Anything in particular you want me to check?

It's likely that the entity &lt; is not being sent (decoded) to the character data handler. Maybe it's being sent to some other handler (such as the default handler), maybe it's just discarded. The reason for this probably has something to do with the version or configuration of the libxml2 library. What would be nice is if you could help debug it. I think the first thing to try would be something like:

Index: includes/Import.php
===================================================================
--- includes/Import.php (revision 45593)
+++ includes/Import.php (working copy)
@@ -864,6 +864,7 @@
 			$this->appendfield = $name;
 			xml_set_element_handler( $parser, "in_nothing", "out_append" );
 			xml_set_character_data_handler( $parser, "char_append" );
+			xml_set_default_handler( $parser, "char_append" );
 			break;
 		case "contributor":
 			$this->push( "contributor" );

and then see what comes out on import. Please also report the following information about your system:

  • From phpinfo(), whether there's a --with-libxml-dir or --with-libexpat-dir under "Configure Command" and what it's set to
  • If expat is used, what version it is
  • What the libxml2 version is
  • The distribution and package version used to install PHP
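
The same question (which handler, if any, receives the decoded entity) can also be checked outside MediaWiki. The following is a minimal standalone sketch, assuming a PHP version with closures and short array syntax; the `$seen` array and the sample input are illustrative and not taken from Import.php.

```php
<?php
// Standalone sketch: record what each handler receives for a small input.
$seen = [ 'cdata' => [], 'default' => [] ];

$parser = xml_parser_create( 'UTF-8' );

// Element handlers that do nothing, so only text-like data is recorded.
xml_set_element_handler(
	$parser,
	function ( $p, $name, $attrs ) {
	},
	function ( $p, $name ) {
	}
);
xml_set_character_data_handler( $parser, function ( $p, $data ) use ( &$seen ) {
	$seen['cdata'][] = $data;
} );
xml_set_default_handler( $parser, function ( $p, $data ) use ( &$seen ) {
	$seen['default'][] = $data;
} );

$ok = xml_parse( $parser, '<text>a&lt;ref&gt;b</text>', true );
if ( !$ok ) {
	echo 'Parse error: ', xml_error_string( xml_get_error_code( $parser ) ), "\n";
}

$cdata = implode( '', $seen['cdata'] );
echo 'character data handler saw: ', var_export( $cdata, true ), "\n";
echo 'default handler saw:        ', var_export( implode( '', $seen['default'] ), true ), "\n";
```

On an unaffected system the character data handler should see `a<ref>b` in total; if the brackets are missing from both outputs, the entities are being discarded rather than routed elsewhere.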

(In reply to comment #5)

> It's likely that the entity &lt; is not being sent (decoded) to the character
> data handler. Maybe it's being sent to some other handler (such as the default
> handler), maybe it's just discarded. The reason for this probably has something
> to do with the version or configuration of the libxml2 library. What would be
> nice is if you could help debug it. I think the first thing to try would be
> something like:
>
> Index: includes/Import.php
> ===================================================================
> --- includes/Import.php (revision 45593)
> +++ includes/Import.php (working copy)
> @@ -864,6 +864,7 @@
>  			$this->appendfield = $name;
>  			xml_set_element_handler( $parser, "in_nothing", "out_append" );
>  			xml_set_character_data_handler( $parser, "char_append" );
> +			xml_set_default_handler( $parser, "char_append" );
>  			break;
>  		case "contributor":
>  			$this->push( "contributor" );
>
> and then see what comes out on import.

Didn't fix it, no change in behavior.

> Please also report the following information about your system:
>
>   • From phpinfo(), whether there's a --with-libxml-dir or --with-libexpat-dir under "Configure Command" and what it's set to
>   • If expat is used, what version it is
>   • What the libxml2 version is
>   • The distribution and package version used to install PHP

--with-libxml-dir=/opt/xml2/ version 2.7.2.
Not using --with-expat. We're on PHP 5.2.6. This is (was? I know 5.2.8 is out) the default php-mysql build for CentOS 4.7, as far as I know. I haven't changed it.
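
Most of the requested details can be gathered with a short script instead of reading through the phpinfo() page by hand. This is a rough sketch under a few assumptions: `LIBXML_DOTTED_VERSION` reports the libxml2 version as PHP sees it (which may differ from other copies installed on the system), and some builds do not expose a "Configure Command" line at all, in which case a placeholder is printed.

```php
<?php
// Sketch: print the details asked for above. Output labels are illustrative.
echo 'PHP version:       ', PHP_VERSION, "\n";
echo 'xml extension:     ', extension_loaded( 'xml' ) ? 'loaded' : 'missing', "\n";
echo 'libxml2 version:   ',
	defined( 'LIBXML_DOTTED_VERSION' ) ? LIBXML_DOTTED_VERSION : 'unknown', "\n";

// Capture the general phpinfo() section and strip markup so the same
// pattern works for both the web (HTML) and CLI (plain text) output.
ob_start();
phpinfo( INFO_GENERAL );
$general = html_entity_decode( strip_tags( ob_get_clean() ) );

$configure = '(not reported by this build)';
if ( preg_match( '/Configure Command[ \t]*(?:=>)?[ \t]*(.+)/', $general, $m ) ) {
	$configure = trim( $m[1] );
}
echo 'Configure command: ', $configure, "\n";
```

The configure line is where `--with-libxml-dir` or `--with-libexpat-dir` would show up if they were used.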

Created attachment 5655
Export of enwiki:Test

Here's the exact XML I've been attempting this on and getting the same error on upload and interwiki, 100% of the time. It's the Special:Export of "Test" from enwiki, r45489, importing to r45536

CentOS 4.7 base is still on PHP 4, and CentOS 4.7 plus has 5.1.6. I'm assuming PHP and libxml2 are both source installs.

Submitted upstream at http://bugs.php.net/bug.php?id=47066 . The workaround is to recompile with an ancient libxml2.

Which was duped to http://bugs.php.net/bug.php?id=45996. Reported as fixed within the last 24 hours. Note, however, that it requires the (not yet released) libxml2 2.7.3.

I suggest we leave this open until it's confirmed to be fixed on all commonly-used versions of libxml2. It'll help people search for a workaround.

rrichards (via IRC) advises us to migrate to xmlreader. The old xml extension suffers from inelegant and easily-broken expat-compatibility code.

There's (fairly minimal) XMLReader-based code in backupPrefetch.inc which might be helpful as a base to work from; I do agree it's a much nicer interface to work with, and a redo of the import code would be a lot cleaner using it.
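
To give a feel for the interface, here is a minimal XMLReader sketch that pulls the text of each revision out of a dump. It is not the code from backupPrefetch.inc; the sample document is a stripped-down stand-in for a real export (no namespace, no attributes), and it assumes the xmlreader extension is available.

```php
<?php
// Simplified stand-in for an export file; real dumps carry more structure.
$xml = '<mediawiki><page><title>Test</title><revision>'
	. '<text>Fact.&lt;ref&gt;Source&lt;/ref&gt;</text>'
	. '</revision></page></mediawiki>';

$reader = new XMLReader();
$reader->XML( $xml );

$texts = [];
// Pull-style loop: advance node by node and pick out <text> elements.
while ( $reader->read() ) {
	if ( $reader->nodeType === XMLReader::ELEMENT && $reader->name === 'text' ) {
		// readString() returns the element's text content with entities decoded.
		$texts[] = $reader->readString();
	}
}
$reader->close();

foreach ( $texts as $text ) {
	echo $text, "\n";
}
```

There are no handler callbacks to register, which is a large part of why the pull interface is easier to reason about than the old xml extension.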

Note though that XMLReader is not bundled with PHP 5.0 (available only via PECL), and in 5.1 and later it's on by default in a *fresh* compile but many distro packages may not install it by default.

If we rely on XMLReader for core import functionality, we'll want to officially drop PHP 5.0 compatibility and do a check for the extension at install time (and at run time so we can fail gracefully).
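
A run-time check of that kind could look roughly like the following. The function name is hypothetical and the message wording is only a placeholder; how the installer would actually surface the result is left open.

```php
<?php
// Hypothetical helper, not an existing MediaWiki function.
function xmlReaderAvailable() {
	// Check both the extension and the class, in case one is present without the other.
	return extension_loaded( 'xmlreader' ) && class_exists( 'XMLReader' );
}

if ( xmlReaderAvailable() ) {
	echo "XMLReader is available; import can proceed.\n";
} else {
	// Fail gracefully instead of hitting a fatal "class not found" error later.
	echo "XMLReader is missing; install or enable the xmlreader extension to use import.\n";
}
```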

*** Bug 18022 has been marked as a duplicate of this bug. ***

One option might be to do a runtime test, like we do for the PHP 5.0 64-bit array index bug; if the XML parser is buggy, we can throw a nice visible error explaining that you have to fix your installation instead of silently corrupting input.
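
Such a runtime test can be quite small: parse a tiny document containing entities and compare what the character data handler receives with what it should receive. The sketch below illustrates the idea only; the function name and test string are made up, and it is not the check that was later committed.

```php
<?php
// Hypothetical self-test: returns true if the xml extension decodes entities.
function xmlParserDecodesEntities() {
	$collected = '';
	$parser = xml_parser_create( 'UTF-8' );
	xml_set_character_data_handler( $parser, function ( $p, $data ) use ( &$collected ) {
		// Character data may arrive in several chunks, so concatenate.
		$collected .= $data;
	} );
	$parsed = xml_parse( $parser, '<t>&lt;ref&gt; &amp; &quot;x&quot;</t>', true );
	return $parsed && $collected === '<ref> & "x"';
}

if ( xmlParserDecodesEntities() ) {
	echo "XML entity decoding looks sane.\n";
} else {
	echo "XML parser drops or mangles entities; fix PHP/libxml2 before importing.\n";
}
```

Because it tests behavior rather than version numbers, it keeps working even when a distribution backports the fix or ships an unexpected library combination.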

Exactly.

This will prevent no end of pain months later when they try to untangle their garbled edits.

Top priority if I were in charge.

I would issue an announcement:

"If you have used Special:Import, ...., ....,
since approximately .....
please check your imported pages for subtle corruptions, e.g.,
< Please see my [http://example.com/index.php?title=Resume&uselang=en resume]
---
> Please see my [http://example.com/index.php?title=Resumeuselang=en resume]

Unnoticed, these corruptions may become entangled in later edits,
making repair even more frustrating.
Users are advised to upgrade to MediaWiki 1.14.xx, 1.13.yy,..
The new versions of Special:Import,...
contain a test that will terminate with an error message:
The following faulty libraries are out of MediaWiki's control and must be
updated first to avoid data corruption: ..."

I hope I'm not overdoing it, but subtle data corruption is one of the
most insidious bugs.

This is an edge case, affecting a small subset of installs. Resetting priority and severity.

*** Bug 18355 has been marked as a duplicate of this bug. ***

seth wrote:

Anyone care to give me a direct solution then, if there is one? I recently moved off of Wikia and I really need to import this Database Dump as soon as possible, and this bracket problem is holding me back.

Upgrade PHP and libxml2 to the latest versions, or downgrade them to versions from before the problem. There are several links above which should provide more information.

*** Bug 18877 has been marked as a duplicate of this bug. ***

moissinac wrote:

"Upgrade PHP and libxml2 to the latest versions, or downgrade them to versions
from before the problem. There are several links above which should provide
more information."

The latest version available at the time of this comment is 2.7.2, and it doesn't work with that version. So comment #23 is erroneous.

The links above refer to libxml2 2.7.3, which is not clearly available. I just spent nearly an hour trying to find it, with nothing so far. The code from "the W3C svn base libxml2 module, updated hourly" (libxml2-cvs-snapshot.tar.gz) is 2.7.2.

I will try to install a previous version on my XAMPP install. If anyone has tried it, please give me an idea of the result. Thanks.

Did you not check their website?

http://www.xmlsoft.org/news.html - latest version (as of January) is 2.7.3.

moissinac wrote:

Sure.
I don't know when or where the mistake was, but I really did see v2.7.2 listed as the latest.
Now I can find 2.7.3.
I will try it tomorrow.
Thanks for the comment.

I added a test in r54828 which'll run at install or update time, but I don't have a broken system to test atm so haven't confirmed it...

Checked against a known broken system. Works.

Marking this as WORKSFORME, as everything has been fixed upstream. Installs affected by this bug are inherently broken, both for MediaWiki and other PHP web apps (same as 64bit bug we check for). There's really nothing more we can do here.

Also confirmed that it detects the bug on CentOS 5 liveCD with a libxml 2.7.2 RPM smashed on top.

I'm re-marking this FIXED. :)

Also went ahead and merged this to REL1_15, so if we push out a 1.15.2 release it'll include the check, as will 1.16.x releases when they come.

ayg wrote:

Does this really warrant an uncircumventable fatal error on install? If this is only a problem for import/export, maybe we could just raise a warning and disable those? The user *might* have other apps they care about that are broken by the bug, but it's perfectly possible they don't, and there's no reason to flat-out prohibit installation in that case. Rgoodermote on IRC was running into this bug and was fairly frustrated that installation of 1.16 just failed unpreventably.

(Also, fixed the version number in the error message in r57568.)

We know it affects export/import badly. To be honest, I haven't looked elsewhere in MediaWiki to see what else we might be corrupting, or whether everything else is clear.

If it's only an import/export issue, then we can probably get away with just disabling those features rather than flat-out prohibiting install.

*** Bug 24238 has been marked as a duplicate of this bug. ***

wvcs wrote:

This bug is affecting later versions as well. Once the install of all the components for MediaWiki 1.15.4 on Solaris 9 is completed and I begin the configuration from the web page, I receive the following message:

"Your system has a combination of PHP and libxml which is buggy and can cause
hidden data corruption in MediaWiki and other web apps. Upgrade to PHP 5.2.9 or
later and libxml2 2.7.3 or later! ABORTING (http://bugs.php.net?id=45996 for
details)."

However, my installation is PHP 5.2.13 and libxml 2.7.7, both of which are later
than the above two versions.

Was this a problem in previous versions of MediaWiki? I ask because we've been running 1.13.5 for months (years?) with the bad combination of PHP and libxml2 and never noticed any issues. Granted we don't do much importing. Now we are trying to upgrade to 1.16.0 and can't because of this error. Since we can't do anything about the versions of PHP and libxml on our server, I'm tempted to comment out the check and move on. Any advice?

(In reply to comment #35)

> This bug is affecting later versions as well. Once the install of all the
> components for MediaWiki 1.15.4 on Solaris 9 is completed and I begin the
> configuration from the web page, I receive the following message:
>
> "Your system has a combination of PHP and libxml which is buggy and can cause
> hidden data corruption in MediaWiki and other web apps. Upgrade to PHP 5.2.9 or
> later and libxml2 2.7.3 or later! ABORTING (http://bugs.php.net?id=45996 for
> details)."
>
> However, my installation is PHP 5.2.13 and libxml 2.7.7, both of which are later
> than the above two versions.

The installer tests for the bug itself, not for version numbers. It's possible that the documented versions we know work on most systems don't work in all circumstances, or that your PHP is actually linked with a different version of libxml2 than the one you're seeing on your system. (It's even possible that it's a slightly different, but related bug!)
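
One way to look into the "linked with a different version" possibility is to compare the libxml2 version PHP was compiled against with the version of the library actually loaded at run time. The sketch below assumes the `LIBXML_VERSION` and `LIBXML_LOADED_VERSION` constants are defined by the PHP build in question (hence the `defined()` guards), and the helper name is made up for illustration.

```php
<?php
// Hypothetical helper: turn libxml2's numeric version (e.g. 20707) into "2.7.7".
function dottedLibxmlVersion( $number ) {
	$number = (int)$number;
	return sprintf( '%d.%d.%d', (int)( $number / 10000 ), (int)( $number / 100 ) % 100, $number % 100 );
}

// Version of the headers PHP was built with.
$compiled = defined( 'LIBXML_VERSION' ) ? (int)LIBXML_VERSION : null;
// Version of the shared library loaded right now (numeric string, possibly with a suffix).
$loaded = defined( 'LIBXML_LOADED_VERSION' ) ? (int)LIBXML_LOADED_VERSION : null;

echo 'compiled against: ', $compiled === null ? 'unknown' : dottedLibxmlVersion( $compiled ), "\n";
echo 'loaded at run time: ', $loaded === null ? 'unknown' : dottedLibxmlVersion( $loaded ), "\n";

if ( $compiled !== null && $loaded !== null && $compiled !== $loaded ) {
	echo "Note: PHP is running with a different libxml2 than it was built against.\n";
}
```

Neither number necessarily matches what the system package manager reports, since PHP may be linked against a private copy.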

(In reply to comment #36)

> Was this a problem in previous versions of MediaWiki? I ask because we've been
> running 1.13.5 for months (years?) with the bad combination of PHP and libxml2
> and never noticed any issues. Granted we don't do much importing. Now we are
> trying to upgrade to 1.16.0 and can't because of this error. Since we can't do
> anything about the versions of PHP and libxml on our server, I'm tempted to
> comment out the check and move on. Any advice?

If you have the bug it would cause breakage on any version of MediaWiki, in at least the particular areas using XML parsing.

We added the big flashy warning on the installer because people would often not realize their setup was broken until *after* they ended up corrupting a bunch of data and getting very confused...

You might be able to get away with disabling the check as long as you don't use any of the following:

  • Special:Import or its various command-line friends
  • Blahtex, ExternalData, FCKEditor, MediaVid, SyntaxHighlight_GeSHi, WiktionaryInflection extensions

There may also be problems with SVG handling, as well as in other areas that didn't show up on a search for xml_parser_create().
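
Anyone wanting to repeat that search on their own installation, including locally installed extensions, could use something like the following sketch. The function name is hypothetical, and the list of file suffixes is an assumption; a plain text match will also report mentions in comments or strings, including this script itself.

```php
<?php
// Hypothetical helper: list PHP source files under $root that call
// xml_parser_create() or xml_parser_create_ns().
function findXmlParserCallers( $root ) {
	$hits = [];
	$files = new RecursiveIteratorIterator(
		new RecursiveDirectoryIterator( $root, FilesystemIterator::SKIP_DOTS )
	);
	foreach ( $files as $file ) {
		// Only look at regular files with a PHP-like suffix.
		if ( !$file->isFile() || !preg_match( '/\.(php|inc)$/', $file->getFilename() ) ) {
			continue;
		}
		$source = file_get_contents( $file->getPathname() );
		if ( $source !== false && preg_match( '/\bxml_parser_create(_ns)?\s*\(/', $source ) ) {
			$hits[] = $file->getPathname();
		}
	}
	sort( $hits );
	return $hits;
}

// Usage: php find-xml-callers.php /path/to/mediawiki
if ( isset( $argv[1] ) && is_dir( $argv[1] ) ) {
	foreach ( findXmlParserCallers( $argv[1] ) as $path ) {
		echo $path, "\n";
	}
}
```

As noted above, code that reaches the XML parser indirectly will not show up in a search like this.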

You may also be more evilly affected with other apps running on your server; similar bugs were very disruptive to StatusNet's identi.ca site back when it was running on a flaky Solaris setup that we couldn't upgrade ourselves... we fixed that problem by changing hosts! :P

Be aware that disabling these checks is at your own risk -- you're acknowledging that you know that the software is telling you it will not work properly on your system.

I'm re-resolving this bug; if there's a better resource to help people diagnose and upgrade their broken PHP setups we can change the link, but that's about all we can do at this stage.

*** Bug 30526 has been marked as a duplicate of this bug. ***