
Broken UTF-8 cutoff breaks display in some browsers
Closed, Declined
Public

Description

Author: timwi

Description:
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=855680&group_id=34373&atid=411192
Originally submitted by Nobody/Anonymous - nobody 2003-12-07 10:32

When someone writes a long summary comment, it
messes up RecentChanges, History, and other text.

I think this is unique to languages using 2-byte
characters - when a character is cut off in the middle, it
turns into some weird character, and affects other parts
of the page.

As an example, please see the following history page in
which the text (including the sidebar) is inappropriately
italicized.

http://ja.wikipedia.org/w/wiki.phtml?title=Wikipedia:%
E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%
83%87%E3%82%A3%E3%82%A2%E3%81%AE%E4%BB%
B2%E9%96%93&action=history

When this happens at RecentChanges, it is quite difficult
to read through it.

As a fix, it would be nice to automatically detect a summary
comment that is too long and ask the user to shorten it.

Or there may be a way to properly cut 2byte-char texts.
That would be good, too.

Or maybe some other solution is available.

Thanks for the help,

Tomos ( wiki_tomos at hotmail dot com )

  • Additional comments ------------------------

Date: 2003-12-07 11:31
Sender: SF user vibber

Confirmed; this seems to be a problem with how Internet
Explorer handles broken UTF-8 code; in at least some
circumstances it will eat the non-UTF-8-trail byte(s) that
follow the broken sequence. (I presume it's reading ahead
the entire number of bytes that the head byte specifies and
eating the false tail bytes instead of resynchronizing at
the break point. That's a real shame, since this ability is
one of the neatest things about UTF-8 compared with
traditional double-byte character sets.)

In the attached screenshot (from IE 6.0 on WinXP) this shows
it destroying the following ")" and even the "<" that starts
the closing </em> tag, so the rest of the page is left in
italics when the markup is incorrectly interpreted.

Most other browsers I have tested (Mozilla, Camino, Safari)
replace only the broken sequence with a placeholder 'broken'
glyph, and correctly restart the UTF-8 interpretation at the
next byte, which as ASCII is itself a valid UTF-8 character
sequence. Konqueror 3.1.2 seems to break the following ")"
but not the "<", so the tags at least are intact.
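
The resynchronizing behaviour described above can be illustrated with Python's built-in UTF-8 decoder. This is only a sketch of the expected decoder behaviour, not code from MediaWiki or from any browser; the sample bytes and variable names are made up for the illustration.

```python
# A 2-byte lead byte (0xC5) whose trail byte was cut off, followed by ASCII
# text and the start of a closing tag.
broken = b"ba\xc5)</em>"

# errors="replace" substitutes U+FFFD for the broken sequence only and
# resumes decoding at the very next byte, so ")" and "<" survive.
decoded = broken.decode("utf-8", errors="replace")

print(decoded)
```

A decoder that instead consumed the number of bytes announced by the lead byte would swallow the ")" here, which matches the damage reported for IE.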

Text gets cut off at maximum lengths in a number of places;
titles as well as comments have a max size in the database,
which knows nothing of UTF-8 and treats our data as raw byte
strings. We should add a UTF-8-safe max-byte-length string
trimming function to our code to keep the bad ones out on
general principle; since we can't stop IE from choking on
them, we should also go through and eliminate any remaining
in the database.
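
A UTF-8-safe byte-length trimmer of the kind proposed above could look roughly like this. It is a minimal Python sketch for illustration only, not MediaWiki code; the function name `truncate_utf8` and its parameters are hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 encoding fits in max_bytes without
    splitting a multi-byte character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    cut = max_bytes
    # raw[cut] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # character, so back up to that character's lead byte.
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut].decode("utf-8")


print(truncate_utf8("odbył)", 5))
```

Backing up from the cut point needs at most three steps, because a UTF-8 character has at most three continuation bytes.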

Impact: mostly a cosmetic annoyance, but because of the
ability to damage markup in some popular browsers it could
harm usability. It's unlikely that cross-site scripting
attacks are possible through this, but it's bad juju anyway.
Database should be cleaned of any broken strings there are
now, and code should be fixed to avoid putting them in in
the future.

Only affects UTF-8 wikis, but that's a large and growing
portion of the user base (and we want to switch everything
to UTF-8 at some point). Asian languages are particularly
affected because UTF-8 balloons to 3 bytes per character in
most Asian scripts, so the byte limits are reached with a
smaller number of characters.
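
The effect of the byte limit on different scripts can be checked with a short Python sketch. The 255-byte limit is an assumption chosen for illustration (it is the size of a TINYBLOB column), and the sample strings and the helper name `utf8_len` are made up.

```python
MAX_BYTES = 255  # assumed column limit, for illustration only


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


samples = {"ASCII": "Wikipedia", "Polish": "odbył", "Japanese": "ウィキペディア"}
for label, text in samples.items():
    print(label, len(text), "characters,", utf8_len(text), "bytes")

# At 3 bytes per character, only this many characters fit in the limit.
chars_that_fit = MAX_BYTES // 3
print(chars_that_fit)
```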


Version: unspecified
Severity: minor
OS: Windows XP

Details

Reference
bz332

Related Objects

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 6:48 PM
bzimport set Reference to bz332.
bzimport added a subscriber: Unknown Object (MLST).

mulukhiyya-lj wrote:

Partial screenshot of problem in IE 6/WinXP

originally taken by Brion; copied from SF.net.

Attached:

explorer-broken-utf8.png (146×810 px, 7 KB)

mulukhiyya-lj wrote:

This bug also has another kind of effect, which I just remembered :)

Example
http://ja.wikipedia.org/w/wiki.phtml?title=%E4%BB%99%E5%8F%B0%E5%B8%82&action=edit&oldid=805477

For some reason the end of the wikitext in the edit box is broken, so there are no buttons etc., and
it seems there is nothing to do but revert it. As I suspected, if the original edits ran into such
trouble, they could not be reverted without a sysop. Do, do, do, dōdeshō?

plugwash wrote:

im not convinced that the utf-8 tag is being used correctly here

utf8 This keyword tags bugs that would automatically be fixed if all wikis
without exception would use UTF-8.

it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
this tag is supposed to mean

another possible fix would be to parse for broken utf-8 at output time (Which
may be easier than trying to find all places where strings are chopped)

mulukhiyya-lj wrote:

(In reply to comment #3)

im not convinced that the utf-8 tag is being used correctly here

utf8 This keyword tags bugs that would automatically be fixed if all wikis
without exception would use UTF-8.

it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
this tag is supposed to mean

You are right; therefore I was TOO wrong and incomparably SLOW! I am sorry for
my poor comprehension, and thank you for correcting. Now I understand. Or, at
least, I hope so.

By the way, I have just tried to write some naive code out of interest. But I cannot
guess how useful it is.

mulukhiyya-lj wrote:

A naive code to solve similar problems

attachment cleanup_utf8_end.php ignored as obsolete

mulukhiyya-lj wrote:

A naive code to solve similar problems (revised)

Attached:

robchur wrote:

*** Bug 5401 has been marked as a duplicate of this bug. ***

ayg wrote:

Is this still an issue?

mulukhiyya_soup wrote:

Just a few months ago, an automated tool, on Wikimedia Toolserver, seemed to stumble at this bug (malformed XML whatever?). But sorry my memory about that case is a bit obscure...

gangleri wrote:

screen dump - special Recentchanges - deletion event text should truncate at UTF-8 character boundaries · 01.jpg

(In reply to comment #8)

Is this still an issue?

I just wanted to create a new report with the summary

[[special:Recentchanges]] - deletion event text should truncate at UTF-8 character boundaries

� the Unicode Character REPLACEMENT CHARACTER U+FFFD
http://www.fileformat.info/info/unicode/char/fffd/index.htm
HTML Entity (decimal) &#65533;
HTML Entity (hex) &#xfffd;
UTF-8 (hex) 0xEF 0xBF 0xBD (efbfbd)
shows up in [[yi:special:Recentchanges]].

It does not show up in [[yi:special:Logs/delete]].

How does this relate to
bug 12359 Deletion summary lengths problems ?

Best regards Reinhardt [[user:Gangleri]]

references:

[[yi:special:Version]] shows

  • MediaWiki: 1.12alpha (r30286)
  • PHP: 5.1.4 (apache)
  • MySQL: 4.0.29-nightly-20070112-wikimedia-log

Attached:

special_Recentchanges_-_deletion_event_text_should_truncate_at_UTF-8_character_boundaries_·_01.jpg (704×1 px, 192 KB)

Some recent examples:

Description cut-off at the first byte: http://commons.wikimedia.org/wiki/Image:Banka_mydlana.jpg?uselang=pl

hexdump:

00003970 44 20 2d 2d 3c 61 20 68 72 65 66 3d 22 2f 77 2f |D --<a href="/w/|
00003980 69 6e 64 65 78 2e 70 68 70 3f 74 69 74 6c 65 3d |index.php?title=|
00003990 57 69 6b 69 70 65 64 79 73 74 61 3a 4d 72 74 6e |Wikipedysta:Mrtn|
000039a0 26 61 6d 70 3b 61 63 74 69 6f 6e 3d 65 64 69 74 |&amp;action=edit|
000039b0 26 61 6d 70 3b 72 65 64 6c 69 6e 6b 3d 31 22 20 |&amp;redlink=1" |
000039c0 63 6c 61 73 73 3d 22 6e 65 77 22 20 74 69 74 6c |class="new" titl|
000039d0 65 3d 22 57 69 6b 69 70 65 64 79 73 74 61 3a 4d |e="Wikipedysta:M|
000039e0 72 74 6e 20 28 6a 65 73 7a 63 7a 65 20 6e 69 65 |rtn (jeszcze nie|
000039f0 20 75 74 77 6f 72 7a 6f 6e 61 29 22 3e 4d 61 72 | utworzona)">Mar|
00003a00 63 69 6e 20 44 65 72 c4 99 67 6f 77 73 6b 69 3c |cin Der..gowski<|
00003a10 2f 61 3e 20 32 31 3a 30 35 2c 20 32 36 20 73 69 |/a> 21:05, 26 si|
00003a20 65 20 32 30 30 34 20 28 43 45 53 54 29 20 20 5a |e 2004 (CEST) Z|
00003a30 64 6a c4 99 63 69 65 20 70 72 7a 65 64 73 74 61 |dj..cie przedsta|
00003a40 77 69 61 20 62 61 c5 29 3c 2f 73 70 61 6e 3e 3c |wia ba.)</span><|

At byte 0x3a46 one can see 0xC5 byte standing alone.

Another one:

http://pl.wikipedia.org/w/index.php?useskin=monobook&title=Grafika%3ABialyMarszWloclawek.jpg&redirect=no

000026a0 9b 63 69 c5 82 20 77 20 31 39 39 31 20 72 6f 6b |.ci.. w 1991 rok|
000026b0 75 20 4a 61 6e 20 50 61 77 65 c5 82 20 49 49 2c |u Jan Pawe.. II,|
000026c0 20 64 6f 20 6f 64 64 61 6c 6f 6e 65 67 6f 20 31 | do oddalonego 1|
000026d0 32 6b 6d 20 70 6f 64 20 77 c5 82 6f 63 c5 82 61 |2km pod w..oc..a|
000026e0 77 73 6b 69 65 67 6f 20 6c 6f 74 6e 69 73 6b 61 |wskiego lotniska|
000026f0 20 4b 72 75 73 7a 79 6e 2c 20 67 64 7a 69 65 20 | Kruszyn, gdzie |
00002700 6f 64 62 79 c5 29 3c 2f 73 70 61 6e 3e 3c 2f 74 |odby.)</span></t|

And here this is byte 0x2704 - also cutoff character 0xc5
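
The offset of a stray byte like the ones in these dumps can also be located programmatically. The following Python sketch is only an illustration of that check, not an existing MediaWiki tool; the function name `first_invalid_utf8_offset` is hypothetical, and the sample bytes are copied from the dump row at 0x2700 above.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# The 16 bytes of the dump row starting at offset 0x2700.
row = bytes.fromhex("6f 64 62 79 c5 29 3c 2f 73 70 61 6e 3e 3c 2f 74")
offset = first_invalid_utf8_offset(row)
print(hex(0x2700 + offset))
```

The same check could be run over stored descriptions to find rows that need cleaning up.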

Just found out that in the recordUpload2() function:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/filerepo/LocalFile.php?revision=38312&content-type=text%2Fplain

$comment parameter is inserted as-is into the

image.img_description and oldimage.oi_description - both fields are TINYBLOBs, so they get cut off after 255 bytes, regardless of character boundaries.

An additional check should be introduced there plus existing database entries should be cleaned up.

*** Bug 11087 has been marked as a duplicate of this bug. ***

Removing testme – still present in the current trunk (see e.g. http://cs.wikipedia.org/w/index.php?diff=2989674), raising severity at least to minor (we are generating invalid UTF-8!), adding a tracking bug dependence.

fran wrote:

Fixed in r40837. All output to the browser will now be scanned for invalid forms per the rules in RFC 3629; invalid forms will be replaced with �. :)

Hold on for a second… OK, this is a solution to the “breaks display” part of this bug, and it is a nice improvement of the general behavior of MW. But still, shouldn’t we, in the first place, do the string cutoffs properly?

There is absolutely no reason why the history should display any � characters. Or should I open a new bug for that?

a) the data should be stored correctly in the first place

b) this post-op scan on all output looks like it will perform abominably.

This needs to be reverted.

And of course c) it duplicates existing code for UTF-8 fixups. :)

Reverted r40837, r40839, r40840 in r40861.

Adding testme. Please test with Internet Explorer 8 and note the result here.

Bryan.TongMinh wrote:

*** Bug 19712 has been marked as a duplicate of this bug. ***

It seems we're still generating upload summaries with badly truncated UTF-8. See for example http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

Bug 28649 (fixed in r95456) is related to this bug. In r62387 also some truncate bugs are fixed. Are there still any truncate bugs?

(In reply to comment #21)

It seems we're still generating upload summaries with badly truncated UTF-8.
See for example
http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

This was fixed in r103362 (well for new uploads anyways, uploads before this revision would still be affected).

I'm not aware of any more examples of this bug.

Tested in IE, seems to have no issues any more. closing