
Page generators not working with Wikidata
Closed, Resolved, Public

Description

Author: sofardamngood

Description:
The page generator has been broken for Wikidata scripts like harvest_template or claimit since February. The revision at http://git.wikimedia.org/blob/pywikibot%2Fcore.git/b9ddecb363a1c208b507dbfe5bc0774dfb7cd253/pywikibot%2Fpagegenerators.py is the last working version.

A command like
python pwb.py claimit -family:wikipedia -lang:en -transcludes:'Infobox video game' P19 Q30
is supposed to create a generator with pages transcluding the template on the given Wikipedia, but the current version of pagegenerators ignores the arguments and tries to fetch the pages from wikidatawiki instead, which of course fails.

More information about this bug is available here: https://de.wikipedia.org/w/index.php?title=Benutzer_Diskussion:Xqt&oldid=129234287#Page_generators


Version: core-(2.0)
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=72120

Details

Reference
bz63800

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:15 AM
bzimport set Reference to bz63800.

My guess is that you have your user-config.py family & lang set to wikidatawiki.

The problem is that these scripts, and many others, instantiate the GeneratorFactory class from pagegenerators.py before calling pywikibot.handleArgs(). handleArgs is where the command-line family & lang are parsed and set up. GeneratorFactory instantiates a default site object in its constructor, so handleArgs must have completed before the GeneratorFactory is instantiated.

If your user-config.py defaults to wikidatawiki, the generator factory creates its generators against wikidatawiki, regardless of the -family and -lang given on the command line.
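To make the ordering problem concrete, here is a minimal toy model. It is only a sketch: `config`, `handle_args` and `ToyGeneratorFactory` are hypothetical stand-ins that mimic the behaviour described above, not the real pywikibot API, and the ("wikidata", "wikidata") default is an assumption about what such a user-config.py would contain.

```python
# Toy model only: these names mimic, but are not, the pywikibot API.
# Assumed stand-in for the family/lang defaults from user-config.py.
config = {"family": "wikidata", "lang": "wikidata"}


def handle_args(args):
    """Consume global -family:/-lang: arguments and return the rest."""
    remaining = []
    for arg in args:
        if arg.startswith("-family:"):
            config["family"] = arg[len("-family:"):]
        elif arg.startswith("-lang:"):
            config["lang"] = arg[len("-lang:"):]
        else:
            remaining.append(arg)
    return remaining


class ToyGeneratorFactory:
    def __init__(self):
        # The default site is captured once, at construction time.
        self.site = (config["family"], config["lang"])


argv = ["-family:wikipedia", "-lang:en", "-transcludes:Infobox video game"]

early = ToyGeneratorFactory()    # built before the arguments are parsed
local_args = handle_args(argv)   # global arguments take effect here
late = ToyGeneratorFactory()     # built after the arguments are parsed

print(early.site)  # ('wikidata', 'wikidata')
print(late.site)   # ('wikipedia', 'en')
```

In this model the factory built too early stays bound to the configured default, which matches the symptom in the report.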

It looks like this regression was caused by bug 54540 / https://gerrit.wikimedia.org/r/#/c/112436/

Before that change, a default site object was instantiated for each argument that the GeneratorFactory parsed.

There are three ways I can see to fix this:

  1. The GeneratorFactory obtains a default site object for each argument again, if a site object wasn't provided in the constructor.
  2. Change all the scripts to call pywikibot.handleArgs() before instantiating a GeneratorFactory (i.e. the same as delete.py).
  3. bot.handleArgs is called transparently from the GeneratorFactory constructor, or the Site constructor, with nonGlobalArgs cached to be processed later when bot.handleArgs is called a second time.

Note that there has been the possibility of pagegenerators pulling in pages from multiple wikis using args:

... -family:wikipedia -lang:nl -transcludes:'Taxobox' -lang:en -transcludes:'Taxobox'

However it also means the following are not identical:

... -family:wikipedia -lang:nl -transcludes:'Taxobox'
... -transcludes:'Taxobox' -lang:en -family:wikipedia
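The order dependence can be sketched with a small toy parser. This is an illustration under assumptions, not pywikibot code: `pages_per_argument` is a hypothetical helper that resolves each generator argument against whatever site has been selected so far, and the ("wikidata", "wikidata") default is assumed.

```python
# Toy model: names are hypothetical stand-ins, not the pywikibot API.
def pages_per_argument(args, default=("wikidata", "wikidata")):
    """Resolve each generator argument against the site selected so far."""
    family, lang = default
    requests = []
    for arg in args:
        if arg.startswith("-family:"):
            family = arg.split(":", 1)[1]
        elif arg.startswith("-lang:"):
            lang = arg.split(":", 1)[1]
        elif arg.startswith("-transcludes:"):
            # The site is looked up at the moment the argument is seen.
            requests.append((family, lang, arg.split(":", 1)[1]))
    return requests


multi = pages_per_argument([
    "-family:wikipedia", "-lang:nl", "-transcludes:Taxobox",
    "-lang:en", "-transcludes:Taxobox",
])
globals_first = pages_per_argument(
    ["-family:wikipedia", "-lang:nl", "-transcludes:Taxobox"])
globals_last = pages_per_argument(
    ["-transcludes:Taxobox", "-lang:en", "-family:wikipedia"])

print(multi)          # nl and en Wikipedia, one request each
print(globals_first)  # nl Wikipedia
print(globals_last)   # falls back to the default site
```

With per-argument lookup, a generator argument placed before -lang/-family silently uses the default site.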

Options 2 and 3 above would prevent that hack from working, but would make the order of global arguments irrelevant.

Option 3 makes the global arguments effective in any script which doesn't currently call handleArgs. There are no scripts in core that this would apply to, but scripts in the wild may break as a result if they have 're-purposed' a global argument name. (The three scripts in core which don't call handleArgs also don't use page generators.)

Option 2 looks to be the most efficient and the best at producing self-documenting code. I'm happy to do any of the options, or other options I haven't thought about.

Change 135287 had a related patch set uploaded by John Vandenberg:
Bug 63800: Call handleArgs before GeneratorFactory

https://gerrit.wikimedia.org/r/135287

Change 135287 merged by jenkins-bot:
Bug 63800: Call handleArgs before GeneratorFactory

https://gerrit.wikimedia.org/r/135287