
additional space in langlinks data causes crash
Closed, Resolved, Public

Description

When an interwiki link contains a space between the namespace and the page name, the bot crashes:

pwb.py interwiki -async -family:wiktionary -cleanup -continue

...
Retrieving pages from wiktionary:fr.

WARNING: loadpageinfo: Query on [[fr:Categorie: Abreviations en italien]] returned data on 'Categorie:Abreviations en italien'
Dump cs (wiktionary) written.
Traceback (most recent call last):
  File "D:\Py\rewrite\pwb.py", line 178, in <module>
    run_python_file(fn, argv, argvu)
  File "D:\Py\rewrite\pwb.py", line 75, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "D:\Py\rewrite\scripts\interwiki.py", line 2646, in <module>
    main()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2621, in main
    bot.run()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2365, in run
    self.queryStep()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2338, in queryStep
    self.oneQuery()
  File "D:\Py\rewrite\scripts\interwiki.py", line 2334, in oneQuery
    subject.batchLoaded(self)
  File "D:\Py\rewrite\scripts\interwiki.py", line 1305, in batchLoaded
    if not page.exists():
  File "D:\Py\rewrite\pywikibot\page.py", line 564, in exists
    return self.site.page_exists(self)
  File "D:\Py\rewrite\pywikibot\site.py", line 2288, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
<type 'exceptions.AttributeError'>
CRITICAL: Waiting for 1 network thread(s) to finish. Press ctrl-c to abort

Because the dump file cannot be changed (https://bugzilla.wikimedia.org/show_bug.cgi?id=72943 ),
I modified this page:
https://cs.wiktionary.org/w/index.php?title=Kategorie:Italské_zkratky&diff=531814&oldid=522026

So anyone who wants to reproduce this must edit another page.


Version: core-(2.0)
Severity: major

Details

Reference
bz73124

Event Timeline

bzimport raised the priority of this task from to Needs Triage. (Nov 22 2014, 3:56 AM)
bzimport set Reference to bz73124.
bzimport added a subscriber: Unknown Object (????).

The hint here is "Query on [[fr:Categorie: Abreviations en italien]] returned data on 'Categorie:Abreviations en italien'". That is only a warning in _update_page, and it is raised because of Site.sametitle.
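To illustrate how a mere title-mismatch warning can end in the AttributeError from the traceback, here is a minimal stand-in in plain Python. It is not pywikibot code: FakePage, load_page_info, update_page and the page id 12345 are hypothetical, and it assumes the loader skips any result item whose title does not compare equal to the requested title.

```python
class FakePage:
    """Hypothetical stand-in for a page object; not pywikibot's Page."""

    def __init__(self, title):
        self.title = title
        # _pageid is deliberately not set here; only update_page sets it.


def update_page(page, item):
    """Copy the page id from an API result item onto the page."""
    page._pageid = item["pageid"]


def load_page_info(page, results):
    """Update `page` from `results`, skipping items whose title differs."""
    messages = []
    for item in results:
        if item["title"] != page.title:  # naive, space-sensitive comparison
            messages.append(
                "Query on [[%s]] returned data on '%s'"
                % (page.title, item["title"])
            )
            continue  # the page is never updated for this item
        update_page(page, item)
    return messages


def page_exists(page):
    """Same shape as the failing check shown in the traceback."""
    return page._pageid > 0


# Requested title keeps the space; the returned title does not.
page = FakePage("Catégorie: Abréviations en italien")
results = [{"title": "Catégorie:Abréviations en italien", "pageid": 12345}]
messages = load_page_info(page, results)

try:
    page_exists(page)
    crashed = False
except AttributeError:
    # _pageid was never assigned because the only result was skipped.
    crashed = True
```

In this sketch the only result is skipped with a warning, so the later existence check touches an attribute that was never assigned.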

I set up a test case:

https://en.wikipedia.org/wiki/User:John_Vandenberg/test is:

fooo

[[fr:Catégorie: Pantonyme]]

https://pt.wikipedia.org/wiki/Usu%C3%A1rio:John_Vandenberg/test is:

fooo

[[en:User:John Vandenberg/test]]

Then:

$ python pwb.py interwiki -page:"Usuário:John_Vandenberg/test" -family:wikipedia -lang:pt

NOTE: Number of pages queued is 0, trying to add 50 more.
Retrieving 1 pages from wikipedia:pt.
[[pt:Usuário(a):John Vandenberg/test]]: [[pt:Usuário(a):John Vandenberg/test]] gives new interwiki [[en:User:John Vandenberg/test]]
Retrieving 1 pages from wikipedia:en.
WARNING: [[pt:Usuário(a):John Vandenberg/test]] is in namespace 2, but [[fr:Catégorie: Abréviations en italien]] is in namespace 14. Follow it anyway? ([y]es, [n]o, [a]dd an alternative, [g]ive up) y
[[pt:Usuário(a):John Vandenberg/test]]: [[en:User:John Vandenberg/test]] gives new interwiki [[fr:Catégorie: Abréviations en italien]]
Retrieving 1 pages from wikipedia:fr.
WARNING: preloadpages: Query returned unexpected title'Catégorie:Abréviations en italien'
WARNING: loadpageinfo: Query on [[fr:Catégorie: Abréviations en italien]] returned data on 'Catégorie:Abréviations en italien'
Dump pt (wikipedia) appended.
Traceback (most recent call last):
  File "pwb.py", line 178, in <module>
    run_python_file(fn, argv, argvu)
  File "pwb.py", line 75, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "scripts/interwiki.py", line 2646, in <module>
    main()
  File "scripts/interwiki.py", line 2621, in main
    bot.run()
  File "scripts/interwiki.py", line 2365, in run
    self.queryStep()
  File "scripts/interwiki.py", line 2338, in queryStep
    self.oneQuery()
  File "scripts/interwiki.py", line 2334, in oneQuery
    subject.batchLoaded(self)
  File "scripts/interwiki.py", line 1305, in batchLoaded
    if not page.exists():
  File ".../pywikibot/page.py", line 564, in exists
    return self.site.page_exists(self)
  File ".../pywikibot/site.py", line 2306, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
<type 'exceptions.AttributeError'>
CRITICAL: Waiting for 1 network thread(s) to finish. Press ctrl-c to abort

Applying this patch fixes the problem for me, so I'll review it now:

https://gerrit.wikimedia.org/r/#/c/151809/

https://gerrit.wikimedia.org/r/172108/ is also needed because of a recently introduced bug.

API langlinks data retains the space.

https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=User:John%20Vandenberg/test

{
    "query": {
        "pages": {
            "40071800": {
                "pageid": 40071800,
                "ns": 2,
                "title": "User:John Vandenberg/test",
                "langlinks": [
                    {
                        "lang": "fr",
                        "*": "Cat\u00e9gorie: Pantonyme"
                    }
                ]
            }
        }
    }
}
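As a quick check of the claim that the API data retains the space, the response above can be parsed with the standard library alone. This is only a sketch: extract_langlinks is a hypothetical helper, and the response text is copied from the output shown above rather than fetched live.

```python
import json

# Response body as shown above (raw string so \u00e9 reaches the JSON parser).
response_text = r'''
{
    "query": {
        "pages": {
            "40071800": {
                "pageid": 40071800,
                "ns": 2,
                "title": "User:John Vandenberg/test",
                "langlinks": [
                    {"lang": "fr", "*": "Cat\u00e9gorie: Pantonyme"}
                ]
            }
        }
    }
}
'''


def extract_langlinks(response):
    """Return (lang, title) pairs for every page in a langlinks response."""
    links = []
    for page in response["query"]["pages"].values():
        # Pages without language links have no "langlinks" key.
        for link in page.get("langlinks", []):
            links.append((link["lang"], link["*"]))
    return links


langlinks = extract_langlinks(json.loads(response_text))
lang, title = langlinks[0]
# The title still contains the space after the namespace separator.
has_space_after_colon = ": " in title
```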

api.py update_page uses pywikibot.Link.langlinkUnsafe to create a Link object, and that doesn't strip the space.

>>> import pywikibot
>>> s = pywikibot.Site()
>>> l = pywikibot.Link.langlinkUnsafe('fr', 'Catégorie: Pantonyme', source=s)
>>> l.title
' Pantonyme'

The crash has been fixed by the improved Site.sametitle(), but the underlying bug in pywikibot.Link.langlinkUnsafe still exists.
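One way the remaining issue could be addressed is to trim whitespace around the namespace separator before the link text is split. The sketch below is plain Python and is not the pywikibot implementation: normalize_langlink_title and its known_namespaces parameter are hypothetical, and it assumes the caller can supply the target site's namespace names.

```python
def normalize_langlink_title(text, known_namespaces):
    """Trim whitespace around the namespace separator of a link title.

    `known_namespaces` is a hypothetical set of namespace names for the
    target site. A prefix that is not a known namespace is left alone,
    because a colon can also be part of an ordinary page title.
    """
    prefix, sep, rest = text.partition(":")
    if sep and prefix.strip() in known_namespaces:
        return prefix.strip() + ":" + rest.strip()
    return text.strip()


fr_namespaces = {"Catégorie"}  # assumed subset, for illustration only

fixed = normalize_langlink_title("Catégorie: Pantonyme", fr_namespaces)
untouched = normalize_langlink_title("Foo: Bar", fr_namespaces)
```

The namespace check is the important design choice here: stripping after every colon would silently rename main-namespace pages whose titles legitimately contain ": ". The sketch compares namespace names case-sensitively, which a real fix would probably need to relax.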

Xqt raised the priority of this task from Low to Medium. (May 28 2017, 11:55 AM)
Xqt claimed this task.
Xqt subscribed.

I am closing this because I can no longer reproduce the bug. Please reopen if you have additional hints and are working with the current Pywikibot release.