Broken UTF-8 cutoff breaks display in some browsers
Closed, Declined (Public)

Assigned To

None

Authored By

	• bzimport
	Sep 3 2004, 3:23 AM

Description

Author: timwi

Description:
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=855680&group_id=34373&atid=411192
Originally submitted by Nobody/Anonymous - nobody 2003-12-07 10:32

When someone writes a long summary comment, it
messes up RecentChanges, History, and other text.

I think this is unique to languages using 2-byte
characters: when a character is cut off in the middle, it
turns into some weird character and affects other parts
of the page.

As an example, please see the following history page in
which the text (including the sidebar) is inappropriately
italicized.

http://ja.wikipedia.org/w/wiki.phtml?title=Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AE%E4%BB%B2%E9%96%93&action=history

When this happens at RecentChanges, it is quite difficult
to read through it.

As a fix, it would be nice to automatically detect an overly
long summary comment and ask the user to shorten it.

Or there may be a way to properly cut 2-byte-character text.
That would be good, too.

Or maybe some other solution is available.

Thanks for the help,

Tomos ( wiki_tomos at hotmail dot com )

Additional comments ------------------------

Date: 2003-12-07 11:31
Sender: SF user vibber

Confirmed; this seems to be a problem with how Internet
Explorer handles broken UTF-8 code; in at least some
circumstances it will eat the non-UTF-8-trail byte(s) that
follow the broken sequence. (I presume it's reading ahead
the entire number of bytes that the head byte specifies and
eating the false tail bytes instead of resynchronizing at
the break point. That's a real shame, since this ability is
one of the neatest things about UTF-8 compared with
traditional double-byte character sets.)

In the attached screenshot (from IE 6.0 on WinXP) this shows
it destroying the following ")" and even the "<" that starts
the closing </em> tag, so the rest of the page is left in
italics when the markup is incorrectly interpreted.

Most other browsers I have tested (Mozilla, Camino, Safari)
replace only the broken sequence with a placeholder 'broken'
glyph, and correctly restart the UTF-8 interpretation at the
next byte, which as ASCII is itself a valid UTF-8 character
sequence. Konqueror 3.1.2 seems to break the following ")"
but not the "<", so the tags at least are intact.
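
The two behaviours described above can be modelled in a few lines of Python.
This is only a sketch: decode_read_ahead is a hypothetical model of the
presumed read-ahead behaviour, not a reconstruction of any browser's code,
and the sample markup is made up for illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def expected_length(lead: int) -> int:
    """Sequence length announced by a UTF-8 lead byte (0 if not a lead byte)."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0


def decode_read_ahead(data: bytes) -> str:
    """Hypothetical faulty decoder: trust the lead byte and consume that many
    bytes, whether or not they are valid trail bytes."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i]) or 1
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # the whole chunk is swallowed
        i += n
    return "".join(out)


def decode_resync(data: bytes) -> str:
    """Replace only the broken sequence and restart at the next byte."""
    return data.decode("utf-8", errors="replace")


# A summary cut one byte into its second character, followed by markup.
broken = b"<em>(" + "日本".encode("utf-8")[:4] + b")</em>"
print(decode_read_ahead(broken))  # the ")" and "<" are eaten
print(decode_resync(broken))      # the closing tag survives
```

In the read-ahead model the dangling lead byte announces a 3-byte sequence,
so the ")" and "<" that follow are consumed along with it, which matches the
damage seen in the screenshot.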

Text gets cut off at maximum lengths in a number of places;
titles as well as comments have a max size in the database,
which knows nothing of UTF-8 and treats our data as raw byte
strings. We should add a function to our code to perform a
UTF-8-safe max-byte-length string trimmer to keep the bad
ones out on general principle; since we can't fix IE from
choking on them we should also go through and eliminate any
remaining in the database.
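
A minimal sketch of such a UTF-8-safe max-byte-length trimmer, in Python for
illustration. The name truncate_utf8 is hypothetical and this is not the
code that was eventually committed; it assumes the input is valid UTF-8 and
that max_bytes is not negative.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Trim valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # Continuation bytes look like 10xxxxxx. If the first dropped byte is one,
    # the cut falls inside a character, so back up to that character's lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


summary = "ウィキペディアの仲間".encode("utf-8")  # 3 bytes per character
print(truncate_utf8(summary, 10).decode("utf-8"))  # three whole characters, 9 bytes
```

The result may be up to three bytes shorter than the limit, which is the
price of never storing a partial character.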

Impact: mostly a cosmetic annoyance, but because of the
ability to damage markup in some popular browsers it could
harm usability. It's unlikely that cross-site scripting
attacks are possible through this, but it's bad juju anyway.
Database should be cleaned of any broken strings there are
now, and code should be fixed to avoid putting them in in
the future.

Only affects UTF-8 wikis, but that's a large and growing
portion of the user base (and we want to switch everything
to UTF-8 at some point). Asian languages are particularly
affected because UTF-8 balloons to 3 bytes per character in
most Asian scripts, so the byte limits are reached with a
smaller number of characters.
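
A quick illustration of the arithmetic, as a sketch. The 255-byte limit is
an assumption taken from the TINYBLOB fields mentioned later in this report,
and chars_that_fit is a hypothetical helper name.

```python
LIMIT_BYTES = 255  # assumed field limit


def chars_that_fit(ch: str, limit: int = LIMIT_BYTES) -> int:
    """How many copies of ch fit into limit bytes when encoded as UTF-8."""
    return limit // len(ch.encode("utf-8"))


for name, ch in {"ASCII": "a", "Polish": "ł", "Japanese": "日"}.items():
    width = len(ch.encode("utf-8"))
    print(f"{name}: {width} byte(s) per character, {chars_that_fit(ch)} fit")
```

With three bytes per character, only 85 characters fit into 255 bytes.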

Version: unspecified
Severity: minor
OS: Windows XP

Details

Reference: bz332

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T5969 Unicode (UTF-8, utf8) compatibility (tracking)
		Declined		None	T2332 Broken UTF-8 cutoff breaks display in some browsers

Event Timeline

• bzimport raised the priority of this task from to Low. Nov 21 2014, 6:48 PM

• bzimport added projects: Browser-Support-Internet-Explorer, MediaWiki-Parser, TestMe.

• bzimport set Reference to bz332.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Sep 3 2004, 3:23 AM

mulukhiyya-lj wrote:

Partial screenshot of problem in IE 6/WinXP

originally taken by Brion; copied from SF.net.

Attached:

mulukhiyya-lj wrote:

Another kind of effect also exists with this bug. I just remembered :)

Example
http://ja.wikipedia.org/w/wiki.phtml?title=%E4%BB%99%E5%8F%B0%E5%B8%82&action=edit&oldid=805477

For some reason the end of the wikitext in the edit box is broken, so there are no buttons etc., and
it seems there is nothing to do but revert it. As I suspected, if the original edits ran into such
trouble, they could not be reverted without a sysop. Do, do, do, dōdeshō?

plugwash wrote:

I'm not convinced that the utf-8 tag is being used correctly here

utf8 This keyword tags bugs that would automatically be fixed if all wikis
without exception would use UTF-8.

it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
this tag is supposed to mean

another possible fix would be to parse for broken utf-8 at output time (Which
may be easier than trying to find all places where strings are chopped)

mulukhiyya-lj wrote:

(In reply to comment #3)

I'm not convinced that the utf-8 tag is being used correctly here

utf8 This keyword tags bugs that would automatically be fixed if all wikis
without exception would use UTF-8.

it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
this tag is supposed to mean

You are right; therefore I was TOO wrong and incomparably SLOW! I am sorry for
my poor comprehension, and thank you for the correction. Now I understand. Or, at
least, I hope so.

By the way, I have just tried writing some naive code out of interest, but I cannot
tell how useful it is.

mulukhiyya-lj wrote:

A naive code to solve similar problems

attachment cleanup_utf8_end.php ignored as obsolete

mulukhiyya-lj wrote:

A naive code to solve similar problems (revised)

Attached:

cleanup_utf8_end.php (1 KB)

robchur wrote:

*** Bug 5401 has been marked as a duplicate of this bug. ***

ayg wrote:

Is this still an issue?

mulukhiyya_soup wrote:

Just a few months ago, an automated tool on the Wikimedia Toolserver seemed to stumble over this bug (malformed XML or something similar?). Sorry, my memory of that case is a bit hazy...

gangleri wrote:

screen dump - special Recentchanges - deletion event text should truncate at UTF-8 character boundaries · 01.jpg

(In reply to comment #8)

Is this still an issue?

I just wanted to create a new report with the summary

[[special:Recentchanges]] - deletion event text should truncate at UTF-8 character boundaries

� the Unicode Character REPLACEMENT CHARACTER U+FFFD
http://www.fileformat.info/info/unicode/char/fffd/index.htm
HTML Entity (decimal) &#65533;
HTML Entity (hex) &#xfffd;
UTF-8 (hex) 0xEF 0xBF 0xBD (efbfbd)
shows up in [[yi:special:Recentchanges]].

It does not show up in [[yi:special:Logs/delete]].

How does this relate to
bug 12359 Deletion summary lengths problems ?

Best regards Reinhardt [[user:Gangleri]]

references:

[[yi:special:Version]] shows

MediaWiki: 1.12alpha (r30286)
PHP: 5.1.4 (apache)
MySQL: 4.0.29-nightly-20070112-wikimedia-log

Attached:

special_Recentchanges_-_deletion_event_text_should_truncate_at_UTF-8_character_boundaries_·_01.jpg (704×1 px, 192 KB)

Some recent examples:

Description cut off after the first byte of a character: http://commons.wikimedia.org/wiki/Image:Banka_mydlana.jpg?uselang=pl

hexdump:

At offset 0x3a46 one can see a 0xC5 byte standing alone.

Another one:

http://pl.wikipedia.org/w/index.php?useskin=monobook&title=Grafika%3ABialyMarszWloclawek.jpg&redirect=no

Here it is at offset 0x2704, again a cut-off character starting with 0xC5.

Just found out that in the recordUpload2() function:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/filerepo/LocalFile.php?revision=38312&content-type=text%2Fplain

The $comment parameter is inserted as-is into image.img_description and oldimage.oi_description. Both fields are TINYBLOBs, so they get cut off at the 255th byte.

An additional check should be introduced there, and existing database entries should be cleaned up.

*** Bug 11087 has been marked as a duplicate of this bug. ***

Removing testme – still present in the current trunk (see e.g. http://cs.wikipedia.org/w/index.php?diff=2989674), raising severity to at least minor (we are generating invalid UTF-8!), adding a dependency on the tracking bug.

fran wrote:

Fixed in r40837. All output to the browser will now be scanned for invalid forms per the rules in RFC 3629; invalid forms will be replaced with �. :)
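
The idea of scanning output and substituting U+FFFD can be sketched in
Python as follows. This is not the code from r40837; sanitize_output is a
hypothetical name, and the sketch relies on Python's strict UTF-8 decoder to
reject ill-formed sequences.

```python
def sanitize_output(raw: bytes) -> bytes:
    """Replace every ill-formed UTF-8 sequence in raw with U+FFFD."""
    # Decoding with errors="replace" substitutes U+FFFD for invalid input;
    # re-encoding then yields well-formed UTF-8.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


page = b"<em>(summary cut here \xc5)</em>"
print(sanitize_output(page).decode("utf-8"))
```

Note that every byte of the page passes through the decoder, so the cost
grows with the size of the output.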

Hold on for a second… OK, this is a solution to the “breaks display” part of this bug, and it is a nice improvement of the general behavior of MW. But still, shouldn’t we, in the first place, do the string cutoffs properly?

There is absolutely no reason why the history should display any � characters. Or should I open a new bug for that?

a) the data should be stored correctly in the first place

b) this post-op scan on all output looks like it will perform abominably.

This needs to be reverted.

And of course c) it duplicates existing code for UTF-8 fixups. :)

Reverted r40837, r40839, r40840 in r40861.

Adding testme. Please test with Internet Explorer 8 and note the result here.

Bryan.TongMinh wrote:

*** Bug 19712 has been marked as a duplicate of this bug. ***

It seems we're still generating upload summaries with badly truncated UTF-8. See for example http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

Bug 28649 (fixed in r95456) is related to this bug. Some truncation bugs were also fixed in r62387. Are there any truncation bugs left?

(In reply to comment #21)

It seems we're still generating upload summaries with badly truncated UTF-8.
See for example
http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

This was fixed in r103362 (well, for new uploads anyway; uploads before this revision would still be affected).

I'm not aware of any more examples of this bug.

Tested in IE; it seems to have no issues any more. Closing.

• Phabricator_maintenance removed a subscriber: • wikibugs-l-list. Jul 30 2016, 10:53 AM

Restricted Application added a subscriber: TerraCodes. Jul 30 2016, 10:53 AM

• Phabricator_maintenance removed a parent task: T2640: [DO NOT USE] Internet Explorer (IE) issues on Windows (tracking) [superseded by #Browser-Support-Internet-Explorer]. Jul 30 2016, 10:54 AM

	F1256: special_Recentchanges_-_deletion_event_text_should_truncate_at_UTF-8_character_boundaries_·_01.jpg
	Nov 21 2014, 6:48 PM
