
Dealing with non-Unicode-aware browsers
Closed, Resolved · Public

Description

Author: plugwash

MediaWiki has a setting for blacklisting browsers that can't deal with Unicode properly. Currently this setting only lists IE for the Mac. Furthermore, all the blacklist does is show a warning, which is liable to be missed or ignored. This leads to fairly frequent bad edits that mangle Unicode characters.
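
For context, a rough sketch of what such a blacklist check looks like; the setting name $wgBrowserBlackList matches the MediaWiki setting referred to above, but the entry and the helper function here are illustrative assumptions, not the shipped code:

<?php
// Illustrative sketch only: the blacklist is a list of user-agent regexes,
// and anything that matches is treated as unable to edit Unicode safely.
$wgBrowserBlackList = array(
    '/MSIE.*Mac_PowerPC/', // IE for the Mac, the one entry mentioned above
);

function isUnicodeCompliantBrowser( $userAgent ) {
    global $wgBrowserBlackList;
    foreach ( $wgBrowserBlackList as $pattern ) {
        if ( preg_match( $pattern, $userAgent ) ) {
            return false; // known to mangle non-ASCII text on save
        }
    }
    return true;
}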

To deal with this issue we need two new features:
1: store the user agent with every request so that problem browsers can be identified.
2: provide an alternative means of editing (possibly based on entities or UTF-7 or something) for those browsers which are incapable of handling Unicode. (UTF-7 would probably be easy but would be ugly as hell.)


Version: 1.5.x
Severity: normal
OS: Mac System 9.x
Platform: Macintosh
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65297

Details

Reference
bz2676

Related Objects

Event Timeline

bzimport raised the priority of this task to High. (Nov 21 2014, 8:38 PM)
bzimport set Reference to bz2676.

plugwash wrote:

OK, I have a plan.

The process described below applies only to browsers that are known to be non-Unicode-aware.

This idea was inspired by the x-codo system used on the eo Wikipedia.
1: use strtr to add an extra leading 0 to existing hexadecimal entities in the page text
2: replace characters outside the 7-bit ASCII range with hexadecimal entities with no leading zeros
3: send the resulting text to the old browser for editing
4: get the edited text back
5: replace hexadecimal entities with no leading 0s with the characters they represent
6: remove a leading 0 from every hexadecimal entity in the page.

Rationale: this process will
1: not affect parts of the text the user doesn't edit
2: keep the text in the edit box valid wiki code that can be copied and pasted to/from other wikis without issues.

I'm going to try to implement this, but I'm new to PHP and PHP doesn't seem too friendly to this type of text-processing work.
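
A minimal sketch of steps 1, 2, 5, and 6 in modern PHP (function names are hypothetical, preg_replace stands in for the strtr call mentioned in step 1, and mb_ord()/mb_chr() stand in for whatever codepoint helpers the real patch ends up using; this is not the committed code):

<?php
// Steps 1+2: armor the text before sending it to a non-Unicode-aware browser.
function armorUnicode( $text ) {
    // 1: pad existing hex entities with an extra leading zero so they can't
    //    collide with the entities generated in step 2.
    $text = preg_replace( '/&#x([0-9a-fA-F]+);/', '&#x0${1};', $text );
    // 2: turn every character outside the 7-bit ASCII range into a hex
    //    entity with no leading zeros.
    return preg_replace_callback(
        '/[\x{0080}-\x{10FFFF}]/u',
        function ( $m ) {
            return sprintf( '&#x%X;', mb_ord( $m[0], 'UTF-8' ) );
        },
        $text
    );
}

// Steps 5+6: de-armor the text when the browser posts the edit back.
function dearmorUnicode( $text ) {
    // 5: hex entities with no leading zero become real characters again.
    $text = preg_replace_callback(
        '/&#x([1-9a-fA-F][0-9a-fA-F]*);/',
        function ( $m ) {
            return mb_chr( hexdec( $m[1] ), 'UTF-8' );
        },
        $text
    );
    // 6: strip the extra leading zero added in step 1.
    return preg_replace( '/&#x0([0-9a-fA-F]+);/', '&#x${1};', $text );
}

Plain ASCII passes through both directions byte-for-byte, which is what keeps unedited parts of the page unchanged.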

That's kind of sick, but might work. :)

Suggestion: use preg_replace_callback(), it tends to simplify these things nicely.

Take a look also at the existing UTF-8 support code in includes/normal (and the
Sanitizer code for interpreting character references to UTF-8). This includes some
simple helper functions for translating between numeric codepoints and UTF-8
characters.
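
For instance, step 5 of the de-armoring could be a single preg_replace_callback() over the zero-less references; a sketch, assuming the codepointToUtf8() helper that lives in includes/normal:

// Sketch: decode hex references that carry no leading zero back into UTF-8
// characters, using the includes/normal codepoint helper.
$text = preg_replace_callback(
    '/&#x([1-9a-fA-F][0-9a-fA-F]*);/',
    function ( $m ) {
        return codepointToUtf8( hexdec( $m[1] ) );
    },
    $text
);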

plugwash wrote:

OK, status update.

The conversion routines are written and tested.

The text is converted for users of Mac IE (since I don't have access to a Mac myself, I'm using Firefox's user-agent selector extension to test).

I still have to figure out how to make the conversion on save; once that's working, a patch will follow.

plugwash wrote:

Implementation of the workaround for non-Unicode browsers, as described.

attachment unicodeworkaround.patch ignored as obsolete

Created attachment 752
Updated version of plugwash's patch

Here's plugwash's patch with my changes as I'm committing it:

  • Reformatted to match other code.
  • Removed some duplicate code to use preexisting UTF-8 functions.
  • Arranged functions to extract checks from the mainline code.
  • Added phpdoc comments.
  • Adjusted the warning message a bit.

Attached:

Applied to CVS HEAD and installed on Wikimedia.

Seems to work correctly with IE 5.2/Mac as tested on my machine.

plugwash wrote:

I think you made a slight mistake in the comments:

+ * Filter an output field through a Unicode de-armoring process if it
+ * came from an old browser with known broken Unicode editing issues.

Shouldn't that be:

+ * Filter an output field through a Unicode armoring process if it is
+ * going to an old browser with known broken Unicode editing issues

Ah, the dangers of cut and paste ;)

Comment typo fixed.

plugwash wrote:

also not urgent since the browser is pretty uncommon now but the blacklist entry
for netscape 4.x should cover all platforms and all 4.x versions not just 4.78
for linux

ui2t5v002 wrote:

This is great, but Unicode-unaware *browsers* aren't the only problem. A lot of people want to work in Unicode-unaware text editors as well, and this makes it difficult for them. They'd have to fake out the server into thinking they had an old browser or something. I have a different proposal:

  1. Convert all HTML entities (named or Unicode numbers or whatever) into plain Unicode characters in the wikisource.

  2. Provide an option in the editing interface to view the source in either "plain Unicode" format (with actual characters) or "plain text" format (with entities) on a per-edit basis.

2.a. When editing in "plain text" mode, all the bad characters (non-ASCII?) will be converted into named HTML entities if possible (&mdash; and the like), or into numbered HTML entities if not possible (&#8212; and the like); see the sketch after this list.

2.b. The default editing format will be selectable in preferences.
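
A sketch of the 2.a conversion (the function name is hypothetical; PHP's get_html_translation_table() supplies the named-entity map):

<?php
// Sketch of 2.a: prefer a named entity where one exists, fall back to a
// numeric character reference otherwise.
function toEditableEntities( $text ) {
    // Character => entity map (e.g. the em dash maps to "&mdash;").
    $table = get_html_translation_table( HTML_ENTITIES, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
    return preg_replace_callback(
        '/[\x{0080}-\x{10FFFF}]/u',
        function ( $m ) use ( $table ) {
            return $table[$m[0]] ?? sprintf( '&#x%X;', mb_ord( $m[0], 'UTF-8' ) );
        },
        $text
    );
}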

gangleri wrote:

(In reply to comment #12)

Hi Omegatron!

Your request should be covered by Bug 4012: "feature request: add a flexible magic character conversion to the built-in editor". You request a kind of "convert all UTF-8 characters" setup, as described there.

Best regards, reinhardt [[user:gangleri]]

michael wrote:

I don't think the current scheme protects Unicode non-breaking space characters (U+00A0). I often enter these into en.wikipedia by typing alt-space in Safari, and they work fine and survive most edits; the wikitext is ''much'' cleaner than when it's littered with a bunch of &nbsp;s.

But once in a while, some other editor's browser will convert them all to
plain spaces. Should U+00A0 be added to the list of characters
protected from old browsers?

plugwash wrote:

"I don't think the current scheme protects Unicode non-breaking space
characters (U+00A0)."
it does, all non-ascii characters are protected, the problem is that as of right
now the bad browser list is very limited (and if the "bad browser" is a plugin
or similar that doesn't affect the headers in any way or the user is
copy/pasting into a seperate editor there isn't much we can do).
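
For example, with the armoring sketch earlier in this thread, a non-breaking space is caught like any other non-ASCII character:

// U+00A0 is inside the protected range, so it survives the round trip:
echo armorUnicode( "100\u{00A0}mm" ); // prints 100&#xA0;mm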

"I often enter these into en.wikipedia by typing
alt-space in Safari, and they work fine and survive most edits—wikitext
is ''much'' cleaner than littered with a bunch of  ."
cleaner maybe but unless you are using a specialist editor that highlights non
breaking spaces virtually impossible to edit correctly.

michael wrote:

Thanks for the reply.

I don't see non-breaking spaces as a problem. I only enter them where it's good practice and recommended by the MOS,
e.g., in unit expressions such as "100 mm". If they need to be found for some reason, wikitext can be pasted into
practically any text editor or word processor for more sophisticated processing.