
Problem with 0x200B ZERO WIDTH SPACE in page titles
Open, Low, Public

Description

Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1295/
Reported by: ganz-ru
Created on: 2011-02-15 20:40:15
Subject: Problem with Tibetan script
Original description:
There is a heated edit war here: http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running an old Python version add an incorrect Tibetan interwiki link, while a bot running version 2.7.1 does it correctly.


Version: unspecified
Severity: normal
See Also:
https://sourceforge.net/p/pywikipediabot/bugs/1295
https://bugzilla.wikimedia.org/show_bug.cgi?id=27446

Details

Reference
bz55246

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:29 AM
bzimport set Reference to bz55246.
bzimport added a subscriber: Unknown Object (????).

My crystal ball suggests:
* Wikitanvirbot is running from the toolserver, which has a patched 2.7.1 without unicode bug
* TXiKiBoT is running an old version of pywikipediabot (there is no python version in the edit summary) on python 2.6.5+

Conclusion: TXiKiBoT should be blocked until its owner fixes his/her setup.

Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of them have versions newer than 2.6.5.

I'm sorry, I meant: all of them have versions older than 2.6.5.

I see, indeed.

Could you post the output of

import query
print query.json.__file__

for your bot? I cannot reproduce the bug, so I suspect it might be due to a buggy json package. I'll do some package sniffing to check this further.

If I did it right, the output is 4 files:
__init__.pyc
decoder.pyc
encoder.pyc
scanner.pyc

Do you need them?
I'm sorry, I'm not a Python programmer.

I meant the output you get if you type those lines into the Python interpreter, but I've been poking around some more. It may or may not be JSON related after all. I think it has to do with some very old code called 'getall', which gets batches of pages.

Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm almost afraid to touch that piece of code. I'll see if I can whip up a test you can run, though, to confirm my suspicions.

In the meantime, could you post the output of version.py?

Thanks.

Version.py:
Pywikipedia [http] trunk/pywikipedia (r8948, 2011/02/13, 09:19:56)
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok

I'll be glad to help if you write the test code.

Ok. There are three things playing a role here.

1) the 'correct' page name ends in a 0x200B ZERO WIDTH SPACE. This makes no sense, other than to annoy people.
2) the XML parser strips spaces around titles, including the 0x200B ZERO WIDTH SPACE.
3) Mediawiki does *not* do this

So, first of all, I will rename the article, so it no longer has the 0x200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML bug, so someone else may fix it. However, given that bots are fighting each other over it, I suspect this is a small change somewhere in the Python APIs, or a default setting that changed.

Lastly, maybe we should broaden the discussion into mediawiki-tech -- should page titles be allowed to have unicode whitespace characters embedded, especially if they are invisible?
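For anyone who wants to check a title for such characters before raising this, here is a minimal sketch. It assumes Python 3, and the helper name invisible_chars and the sample title are made up for illustration; they are not part of pywikipedia or MediaWiki. It flags every character whose Unicode general category is Cf (format) or one of the Z* separator categories, except the ordinary space.

```python
import unicodedata


def invisible_chars(title):
    """List (index, code point, name) for format/separator characters.

    The plain ASCII space is skipped because it is expected in titles.
    """
    found = []
    for index, ch in enumerate(title):
        category = unicodedata.category(ch)
        if ch != ' ' and (category == 'Cf' or category.startswith('Z')):
            name = unicodedata.name(ch, 'UNKNOWN')
            found.append((index, 'U+%04X' % ord(ch), name))
    return found


# Hypothetical title ending in a ZERO WIDTH SPACE.
print(invisible_chars('Example\u200b'))
```

An empty list means the title contains nothing beyond visible characters and ordinary spaces, as far as these two category checks can tell.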

Except I don't have move privileges on that wiki. Added a comment on the user talk page of the guy who created the page.

Stripping is done in xmlreader.py:194. Calling strip() seems to remove the U+200B character indeed.
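One possible workaround, sketched under assumptions: pass an explicit character set to strip(), so the result no longer depends on what the running interpreter considers whitespace. The helper name trim_title and the chosen character set are hypothetical and not taken from xmlreader.py; the sketch is written for Python 3.

```python
# Characters to trim from both ends of a raw title.
# Assumption: only plain ASCII whitespace should be removed.
ASCII_WHITESPACE = ' \t\r\n'


def trim_title(raw):
    """Trim ASCII whitespace only; leave U+200B and friends untouched."""
    return raw.strip(ASCII_WHITESPACE)


# Hypothetical title: surrounding spaces and newline go, U+200B stays.
cleaned = trim_title('  Example\u200b \n')
print(cleaned == 'Example\u200b')
```

Whether titles should be trimmed here at all is a separate question; this only makes the trimming behave the same on every Python version.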

I agree that this symbol in titles is absolutely useless. Not only for bots, but for ordinary users too, since it can break their copy-paste operations.

If you can start a discussion on mediawiki-tech, please do.

Thank you.

JAn Dudik moved the page, so the problem should be fixed for now. Keeping this open (it's a bug in pywikipedia, after all).

Related:

Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u''

Python 2.7.1 (r271:86832, Jan 4 2011, 13:57:14)
[GCC 4.5.2] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u'\u200b'

\u200b is technically not whitespace, so strip() probably should not delete it.
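The claim can be checked against the interpreter's own Unicode database. The following is a minimal sketch for Python 3 (the sessions above are Python 2); the helper name describe is made up for illustration. It reports a character's general category, whether the interpreter treats it as whitespace, and whether a bare strip() leaves it alone.

```python
import unicodedata


def describe(ch):
    """Return (general category, isspace(), survives a bare strip())."""
    return (unicodedata.category(ch), ch.isspace(), ch.strip() == ch)


# ZERO WIDTH SPACE versus an ordinary space.
print(describe('\u200b'))
print(describe(' '))
```

On Python 3, U+200B is reported as category Cf (format), not as a space separator, which matches the 2.7.1 behaviour shown above.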

Of course, pwb should not be stripping page titles in the first place.

Aaand http://bugs.python.org/issue10567 is related to that.

In essence: bots running Python < 2.7 were technically doing the wrong thing, but this went unnoticed because no one used the interwiki link to the Tibetan Wikipedia, and all bots did the same wrong thing. Now there are bots running 2.7+, from the toolserver, and the bug surfaced.

  • Group: --> confirmed
  • Priority: 7 --> 2
  • Bug 55227 has been marked as a duplicate of this bug.