
Entries in the image table with invalid titles
Closed, ResolvedPublic

Details

Reference
bz14365

Related Objects

StatusSubtypeAssignedTask
OpenFeatureNone
ResolvedNone

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:12 PM
bzimport set Reference to bz14365.
bzimport added a subscriber: Unknown Object (MLST).

maintenance/cleanupImages.php appears to be broken; probably needs to be updated for recent DB load balancer changes. (via cleanupTable.inc via FiveUpgrade.inc)

*** Bug 16056 has been marked as a duplicate of this bug. ***

List of invalid titles in the Commons image table, from bug 16056:
*Eaton&apos;s_Ninth_Floor_Restaurant.jpg
*Honest_Ed&apos;s.jpg
*Passing_the_time_at_Jeffrey&apos;s_Bay-_South_Africa.jpg
*Toronto&apos;s_Opera_House_-_In_construction.jpg
*Here&apos;s_looking_at_you_kid-raffi_torres.jpg
*Nice_Côte_d&apos;Azur.jpg
*Ward&apos;s_ferry_line_-_1_12hr_later.jpg

Increasing priority: when trying to work with those images (so far only with the API), a PHP error is produced:
PHP fatal error in
/usr/local/apache/common-local/php-1.5/includes/filerepo/RepoGroup.php line 94:
Call to a member function getDBkey() on a non-object

There are also images with an invalid title containing the &quot; entity:
*Picswiss_BE-90-14_Hotel_&quot;Oeschinensee&quot;_beim_Oeschinensee.jpg
*Picswiss_BE-90-15_Hotel_und_Berghaus_&quot;Oeschinensee&quot;_beim_Oeschinensee.jpg
*Picswiss_BE-94-01_Kirche_von_Würzbrunnen_(Röthenbach)_-_&quot;Gotthelf-Kirche&quo.jpg
*Picswiss_BE-94-07_Kirche_von_Würzbrunnen_(Röthenbach)_-_&quot;Gotthelf-Kirche&quot.jpg
*Picswiss_GR-84-06_Splügen-_Hinterrhein,_Hotel_&quot;Suretta&quot;.jpg
*Picswiss_GR-84-14_Hotel_&quot;Weiss_Kreuz&quot;_in_Splügen.jpg
*Picswiss_GR-84-16_Splügen_mit_Teurihorn_(Talstation_der_&quot;Tambo&quot;-Bahnen).jpg
*Picswiss_GR-84-31_Splügen_mit_dem_Teurihorn_(Talstation_&quot;Tambo&quot;-Bahnen).jpg
*Picswiss_GR-84-32_Ruine_&quot;Zur_Burg&quot;_in_Splügen.jpg

Uploaded August 2007

The fatal error should be fixed by r44370: now those titles are simply skipped. Actually fixing them requires running maintenance/cleanupImages.php. Reducing priority back to normal.
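
As a rough illustration of the "skip instead of crash" behavior, here is a minimal Python sketch. It is not the actual change in r44370: the names `parse_title` and `valid_db_keys` are hypothetical, and the validity rule (reject any name that still contains something shaped like a character reference) is an assumption made only for this example.

```python
import re

# Assumption for illustration: a name that still contains something shaped
# like a character reference is treated as an invalid title.
ENTITY_LIKE = re.compile(r"&[A-Za-z0-9#]+;")


def parse_title(name):
    """Hypothetical title parser: returns None for names it rejects."""
    if not name or ENTITY_LIKE.search(name):
        return None
    return name.replace(" ", "_")


def valid_db_keys(names):
    """Return keys for the valid names, skipping any that fail to parse."""
    keys = []
    for name in names:
        title = parse_title(name)
        if title is None:
            continue  # skip, rather than calling a method on a non-object
        keys.append(title)
    return keys
```

Skipping only avoids the fatal error; the rows with bad names remain in the table until a cleanup script renames them.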

The Picswiss files have been fixed now, but the ones with &apos; are still broken, apparently because Sanitizer::decodeCharReferences() doesn't recognize it. I've committed a fix in r45387 -- it's a bit ugly due to the special status of &apos; as the only named character entity defined in XHTML 1.0 but not in HTML 4.01.
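
To illustrate the failure mode, here is a minimal Python sketch (not MediaWiki's PHP code). It assumes a decoder that looks named entities up in a table; the abbreviated tables and the names `HTML4_ENTITIES`, `XHTML_ENTITIES`, and `decode_char_references` are hypothetical stand-ins, not the real Sanitizer implementation.

```python
import re

# Hypothetical, deliberately tiny stand-in for an HTML 4.01 entity table.
# Like HTML 4.01 itself, it has no entry for "apos".
HTML4_ENTITIES = {"amp": 0x26, "lt": 0x3C, "gt": 0x3E, "quot": 0x22}

# The same table plus the one extra named entity that XHTML 1.0 defines.
XHTML_ENTITIES = dict(HTML4_ENTITIES, apos=0x27)

# Matches decimal (&#39;), hexadecimal (&#x27;), and named (&quot;) references.
CHAR_REF = re.compile(r"&(?:#(\d+)|#[xX]([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));")


def decode_char_references(text, entities):
    """Decode character references; leave unrecognized ones verbatim."""

    def replace(match):
        decimal, hexadecimal, name = match.groups()
        if name is not None:
            code_point = entities.get(name)
        elif decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        if code_point is None or code_point > 0x10FFFF:
            return match.group(0)  # unknown or out of range: keep as-is
        return chr(code_point)

    return CHAR_REF.sub(replace, text)
```

With the HTML 4.01-style table, a name containing `&quot;` decodes while one containing `&apos;` passes through unchanged; with `XHTML_ENTITIES`, both decode.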

ayg wrote:

(In reply to comment #6)

The Picswiss files have been fixed now, but the ones with &apos; are still
broken, apparently because Sanitizer::decodeCharReferences() doesn't recognize
it. I've committed a fix in r45387 -- it's a bit ugly due to the special
status of &apos; as the only named character entity defined in XHTML 1.0 but
not in HTML 4.01.

Why do we care about HTML 4.01?

Reverted in r45477 as the special casing seems totally unnecessary...

If we add &apos; to $wgHtmlEntities, the Sanitizer will allow it through normalizeCharReferences(). This could cause inconsistent rendering on old browsers that predate XHTML, or apparently even on some rather modern versions of IE if the page is served with a doctype (or maybe MIME type?) indicating HTML4 rather than XHTML. See e.g. http://cssvault.com/blog/2007/10/17/internet-explorer-apos-feature/

Still, we claim to be serving XHTML, so I suppose this should not be a problem (even if I do believe we're serving it with a "text/html" MIME type). And even on old browsers that don't support it, all that's likely to happen is that it'll be rendered verbatim (as indeed the sanitizer currently forces it to be).

ayg wrote:

We always serve with an XHTML doctype, so that's not an issue. What browsers won't recognize it? Are we talking like NN4 here, or like IE6?

I've tried searching for a browser compatibility table for &apos;, but haven't found any so far. Anyway, XHTML has existed for almost a decade now, so I suspect only very old browsers would be completely unaware of it. The IE behavior worries me more: one page I found, http://seewhatever.de/blog/?p=114 , says even IE 7 won't recognize &apos; in HTML mode, and seems to suggest that it's the MIME type that makes the difference.

Anyway, why do we output named character entities at all? We already have a table of the Unicode code points corresponding to all of them, so it would be trivial to make normalizeEntity() output numeric entities only.
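
A minimal Python sketch of that idea, for illustration only: the table is abbreviated and the function name and signature are hypothetical, not the real normalizeEntity(). Escaping unknown names so that they render verbatim is also an assumption of this sketch.

```python
# Hypothetical, abbreviated name -> code point table (includes XHTML's "apos").
ENTITY_CODE_POINTS = {"amp": 0x26, "lt": 0x3C, "gt": 0x3E, "quot": 0x22, "apos": 0x27}


def normalize_entity(name, table=ENTITY_CODE_POINTS):
    """Return a numeric character reference for a known entity name.

    Unknown names get their ampersand escaped so they render verbatim.
    """
    code_point = table.get(name)
    if code_point is None:
        return "&#38;" + name + ";"
    return "&#%d;" % code_point
```

Because the output never contains a named entity, it does not matter whether a given browser knows the name `apos`.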

ayg wrote:

Because that's uglier, maybe? It should be simple enough to test &apos; in various browsers and see which work, anyway.

&apos; is not valid XHTML per se; it's only usable in XHTML because XHTML is supposed to be XML, and XHTML is XML only if served with an XML MIME type. If served as text/html, it's just the version of HTML that the browser supports (i.e. still HTML4 in many cases) sugared with some invalid XMLisms that the browser may or may not support - see http://www.w3.org/TR/xhtml1/#C_16.

Mozilla, Opera and Webkit accept &apos; in HTML anyway (maybe because it is valid HTML5), but IE versions 6, 7 and 8 do not.

ayg wrote:

Yeah, confirmed, IE doesn't accept &apos;, even IE8 with <!doctype html>.

What's the state of this then?

cleanupImages in its current incarnation finds no issues...

The images with &apos; in the title (see comment 3) appear to still exist in the image table, so I guess there's still an issue. r45387 would've fixed it, but Brion reverted it and apparently nobody ever got around to either unreverting it or committing the alternative fix Brion suggested.

(In reply to comment #16)

The images with &apos; in the title (see comment 3) appear to still exist in
the image table, so I guess there's still an issue. r45387 would've fixed it,
but Brion reverted it and apparently nobody ever got around to either
unreverting it or committing the alternative fix Brion suggested.

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Sanitizer.php?view=annotate#l75

r89681

So brion did fix it...

So it's in 1.18

(In reply to comment #17)

Does cleanupImages need to be run anywhere? Is there anything else that needs to happen?

*** This bug has been marked as a duplicate of bug 22939 ***