
GWToolset fails to upload files and throws no warning
Closed, Declined | Public | Bug Report

Description

This happens on both Wikimedia Commons and on the Beta cluster.

GWToolset happily goes through all four steps, but no files from this dataset are uploaded.

See XML attached.


Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=68506

Details

Reference
bz68285

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 3:32 AM
bzimport set Reference to bz68285.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 15984
XML file to be processed by the GWtoolset, failing.

Attaching culprit XML file


(In reply to dan from comment #2)

is this the mediawiki template you’re mapping to?
http://commons.wikimedia.beta.wmflabs.org/wiki/Template:MH_IDF-Ingestion

Yes. And it’s that one on the production site:
https://commons.wikimedia.org/wiki/Template:Ingestion-MH_IDF

local test

when i first ran the upload i ran into a duplicate file issue. this error is only visible in the runJobs log or, in my case, in the console output when i ran maintenance/runJobs.php. i had to comment out the LocalSettings.php config value $wgUseInstantCommons = true; in order to run the batch successfully. i don’t think this would be an issue on commons or commons beta, but if there are duplicates on either of those servers, GWToolset would unfortunately fail silently at the moment. I’m hoping to add the use of Echo in order to report batch upload status, but that is currently something for the future.

categories

one thing i noticed that GWToolset didn’t handle well was the use of one XML element to represent a series of categories. the preview page reports an error, and while the resulting output seems to work, i think it would be better to improve the GWToolset parsing of such an XML element or, for now, to create multiple XML elements, each containing its own category. so instead of

<categories>[[Category:1901 photographs]]
[[Category:Photographs by Eugène Atget]]
[[Category:Paris XIXe arrondissement]]</categories>

use

<category>1901 photographs</category>
<category>Photographs by Eugène Atget</category>
<category>Paris XIXe arrondissement</category>
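The transposition suggested here could be scripted when pre-processing the XML file. The following is only a rough sketch, not part of GWToolset: the `<record>` and `<title>` element names and the `split_categories` helper are hypothetical, and it assumes the bundled element contains plain `[[Category:...]]` links.

```python
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] (optionally with a |sortkey) and captures "Name".
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def split_categories(record: ET.Element) -> None:
    """Replace each <categories> child with one <category> child per link."""
    for bundled in list(record.findall("categories")):
        names = CATEGORY_LINK.findall(bundled.text or "")
        position = list(record).index(bundled)
        record.remove(bundled)
        for offset, name in enumerate(names):
            category = ET.Element("category")
            category.text = name.strip()
            record.insert(position + offset, category)


# Hypothetical record layout, used only for illustration.
record = ET.fromstring(
    "<record><title>Example</title>"
    "<categories>[[Category:1901 photographs]]\n"
    "[[Category:Photographs by Eugène Atget]]\n"
    "[[Category:Paris XIXe arrondissement]]</categories></record>"
)
split_categories(record)
print(ET.tostring(record, encoding="unicode"))
```

Each resulting `<category>` element holds a bare category name, which is the shape the suggestion above asks for.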

current action to take

since the XML subset did work for me locally, i’m wondering if some of the recent core and GWToolset patches may have resolved the issue you were experiencing. i suggest trying a small subset on commons beta again and reporting back the result.

This might be a slight tangent, but chipping in my category experience...

I have been adding multiple categories to several batch upload projects now by flexibly adding categories as a separated list of elements at the time when I generate the XML file. Slightly naff early example I happen to have to hand:

<cat_1>Brussels</cat_1>
<cat_2>Images uploaded by Fæ</cat_2>
<cat_3>Belgium</cat_3>

These are then added as categories at GWT "step 2" under "Item specific categories". Problems are avoided by ensuring that a record with the maximum number of categories appears at the top of the file, so is used as an exemplar for mapping.

There is no particular benefit in bundling these as a parent single element, so for the moment I would recommend new GWT users design their xml as a flat layout of elements in this way. Anything other than a flat structure is far more likely to cause unexpected issues during upload.
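One way to make sure the record with the most categories sits at the top of the file is to reorder the records while generating the XML. This is a minimal sketch under assumptions: the `<records>`/`<record>` wrapper names and both helper functions are hypothetical, and only the `cat_N` naming comes from the example above.

```python
import xml.etree.ElementTree as ET


def count_categories(record: ET.Element) -> int:
    """Count the non-empty cat_N children of a record."""
    return sum(
        1
        for child in record
        if child.tag.startswith("cat_") and (child.text or "").strip()
    )


def move_fullest_record_first(root: ET.Element) -> None:
    """Move the record with the most categories to the top of the file."""
    records = list(root)
    if not records:
        return
    fullest = max(records, key=count_categories)
    root.remove(fullest)
    root.insert(0, fullest)


# Hypothetical flat layout, used only for illustration.
root = ET.fromstring(
    "<records>"
    "<record><cat_1>Belgium</cat_1></record>"
    "<record><cat_1>Brussels</cat_1><cat_2>Images uploaded by Fæ</cat_2>"
    "<cat_3>Belgium</cat_3></record>"
    "</records>"
)
move_fullest_record_first(root)
print([count_categories(r) for r in root])
```

The order of the remaining records is left unchanged; only the exemplar record is moved.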

(In reply to dan from comment #4)

when i first ran the upload i ran into a duplicate file issue. this error is
only visible in the runJobs log or in my case, in the console output when i
ran maintenance/runJobs.php. i had to comment out the LocalSettings.php
config value $wgUseInstantCommons = true; in order to run the batch
successfully. i don’t think this would be an issue on commons or commons
beta, but if there are duplicates on either of those servers, GWToolset
would unfortunately fail silently at the moment. I’m hoping to add the use
of Echo in order to report batch upload status, but that is currently
something for the future.

I see. I opened bug 68577 to track the duplicate file issue.

categories

one thing i noticed that GWToolset didn’t handle well was the use of one XML
element to represent a series of categories.

Hmmm, I used that for nearly 3000 files without any problem though.

i think it would be
better to possibly improve the GWToolset parsing of such an XML element, or
for now, create multiple XML elements, each containing their own category.
so instead of

<categories>[[Category:1901 photographs]]
[[Category:Photographs by Eugène Atget]]
[[Category:Paris XIXe arrondissement]]</categories>

use

<category>1901 photographs</category>
<category>Photographs by Eugène Atget</category>
<category>Paris XIXe arrondissement</category>

You mean with the same XML element? Is that acceptable XML-wise?

(In reply to Fæ from comment #5)

This might be a slight tangent, but chipping in my category experience...

I have been adding multiple categories to several batch upload projects now
by flexibly adding categories as a separated list of elements at the time
when I generate the XML file. Slightly naff early example I happen to have
to hand:

<cat_1>Brussels</cat_1>
<cat_2>Images uploaded by Fæ</cat_2>
<cat_3>Belgium</cat_3>

These are then added as categories at GWT "step 2" under "Item specific
categories".

Yeah, I tried that at the beginning but decided that...

Problems are avoided by ensuring that a record with the maximum
number of categories appears at the top of the file, so is used as an
exemplar for mapping.

this was just too annoying to do. :-þ

There is no particular benefit in bundling these as a parent single element,
so for the moment I would recommend new GWT users design their xml as a flat
layout of elements in this way. Anything other than a flat structure is far
more likely to cause unexpected issues during upload.

(Just for the record (no pun intended), I am only using flat structure)

Problems are avoided by ensuring that a record with the maximum
number of categories appears at the top of the file, so is used as an
exemplar for mapping.

this was just too annoying to do. :-þ

Ah, an easy alternative I have used is to fix your own maximum and just add the same number of categories to all records, leaving some of them empty. Using <cat_3></cat_3> seems to work just fine, and does not leave a messy false entry on the image page. My solution was just a way of leaving the maximum flexible, but it does mean that you spend more time on pre-processing the xml file.
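The fixed-maximum approach could be automated along these lines. This is a sketch only: the `<record>` element name and the `pad_categories` helper are hypothetical, and the maximum of three is arbitrary.

```python
import xml.etree.ElementTree as ET


def pad_categories(record: ET.Element, maximum: int) -> None:
    """Add empty cat_N elements so every record has cat_1 .. cat_<maximum>."""
    present = {child.tag for child in record}
    for n in range(1, maximum + 1):
        tag = f"cat_{n}"
        if tag not in present:
            ET.SubElement(record, tag)


# Hypothetical record with a single category, used only for illustration.
record = ET.fromstring("<record><cat_1>Brussels</cat_1></record>")
pad_categories(record, maximum=3)

# short_empty_elements=False writes <cat_3></cat_3> rather than <cat_3 />.
print(ET.tostring(record, encoding="unicode", short_empty_elements=False))
```

With every record carrying the same set of elements, any record can serve as the exemplar for mapping.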

(In reply to Jean-Fred from comment #6)

Hmmm, I used that for nearly 3000 files without any problem though.

yes, it does work, but GWToolset doesn’t parse the single element well and places the wikitext as:

[[Category:Category:1901 photographs Category:Photographs by Eugène Atget Category:Paris XIXe arrondissement]]

instead of as:

[[Category:1901 photographs]]
[[Category:Photographs by Eugène Atget]]
[[Category:Paris XIXe arrondissement]]

i think it would be
better to possibly improve the GWToolset parsing of such an XML element, or
for now, create multiple XML elements, each containing their own category.
so instead of

<categories>[[Category:1901 photographs]]
[[Category:Photographs by Eugène Atget]]
[[Category:Paris XIXe arrondissement]]</categories>

use

<category>1901 photographs</category>
<category>Photographs by Eugène Atget</category>
<category>Paris XIXe arrondissement</category>

You mean with the same XML element? Is that acceptable XML-wise?

yes, having more than one element of the same name is valid XML, so you can do the above transposition of one <categories /> element into multiple <category /> elements without issue. and then refer to that element in the item specific categories section drop down. then GWToolset will "properly" place the wikitext as:

[[Category:1901 photographs]]
[[Category:Photographs by Eugène Atget]]
[[Category:Paris XIXe arrondissement]]

i ran the batch upload locally using the attached XML, which contained 29 items. many were duplicates, but the following had titles that were longer than the 240 byte title limit. if you want, shorten the titles and upload them in a new version of the XML file and i'll test it locally ...

This title evaluates to 245 bytes in length.
Abbaye_Saint-Martin-des-Champs_(ancienne),_Conservatoire_National_des_Arts_et_Métiers,_Musée_National_des_Techniques_-_Eglise._Chevet,_Fenêtre_du_1e_étage,_côté_droit..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00017548.jpg

This title evaluates to 242 bytes in length.
Eglise_Saint-Martin_-_Vitrail,_baie_5_(ensemble),_Anne_de_Montmorency,_épouse_de_Guy_de_Laval_et_de_Rochefort,_et_ses_filles._Guy_de_Laval_et_de_Rochefort_et_saint_Guy...._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00005410.jpg

This title evaluates to 244 bytes in length.
Eglise_Saint-Martin_-_Vitrail,_baie_4_(ensemble),_Charles_de_Villiers,_évêque_de_Beauvais,_ambassadeur_à_la_cour_de_Charles_Quint._Vierge_à_l'Enfant._Le_pape_Adrien_VI..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00005393.jpg

This title evaluates to 245 bytes in length.
Eglise_Saint-Martin_-_Vitrail,_baie_7_(partie_inférieure),_Guillaume_de_Montmorency,_fondateur_de_l'église,_accompagné_de_ses_fils_(Jean,_Anne,_François_et_Philippe)_et..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00005420.jpg

This title evaluates to 244 bytes in length.
Abbaye_Saint-Martin-des-Champs_(ancienne),_Conservatoire_National_des_Arts_et_Métiers,_Musée_National_des_Techniques_-_Eglise._Façade_sud,_Fenêtre_du_1e_étage,_côté..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00017550.jpg

This title evaluates to 241 bytes in length.
Eglise_Saint-Martin_-_Vitrail,_baie_9_(partie_inférieure),_Anne_Pot,_épouse_de_Guillaume_de_Montmorency,_accompagnée_de_ses_filles_(Louise,_Marie_et_Anne)_et_de_leur..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00005430.jpg

This title evaluates to 242 bytes in length.
Abbaye_Saint-Martin-des-Champs_(ancienne),_Conservatoire_National_des_Arts_et_Métiers,_Musée_National_des_Techniques_-_Eglise._Trumeau_d'une_baie_du_1e_étage,_vers_la..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00017551.jpg

This title evaluates to 243 bytes in length.
Abbaye_Saint-Martin-des-Champs_(ancienne),_Conservatoire_National_des_Arts_et_Métiers,_Musée_National_des_Techniques_-_Eglise._Clocher,_Baie_du_rez-de-chaussée,_côté..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00017547.jpg

This title evaluates to 241 bytes in length.
Abbaye_Saint-Martin-des-Champs_(ancienne),_Conservatoire_National_des_Arts_et_Métiers,_Musée_National_des_Techniques_-_Eglise._Façade_ouest,_Contrefort,_au_niveau_du..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00017552.jpg

This title evaluates to 241 bytes in length.
Abbaye_Saint-Martin-des-Champs_(ancienne),_Conservatoire_National_des_Arts_et_Métiers,_Musée_National_des_Techniques_-_Eglise,_Porte_murée_à_la_base_sud_du_clocher_..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00016776.jpg

This title evaluates to 241 bytes in length.
Eglise_Saint-Martin_-_Vitrail,_baie_12_(ensemble),_Guillaume_Gouffier,_seigneur_de_Bonnivet,_accompagné_de_ses_fils_(Artus,_Guillaume,_Adrien,_Louis,_Pierre_et_Aymar)_..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00005449.jpg

This title evaluates to 242 bytes in length.
Eglise_Saint-Martin_-_Vitrail,_baie_11_(ensemble),_Nativité._Gaspard_de_Coligny_et_son_saint_patron._Louise_de_Montmorency,_fille_de_Guillaume_et_d'Anne_Pot,_épouse_de..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00005441.jpg

This title evaluates to 244 bytes in length.
Chapelle_des_Dames-de-Saint-Chaumont_ou_des_Filles_de_l'Union_Chrétienne_(ancienne)_-_Chapelle_des_Dames-de-Saint-Chaumont_ou_des_Filles_de_l'Union_Chrétienne_(ancienne)..._-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00004658.jpg
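Overlong titles like these could be caught before upload by measuring their encoded length. The sketch below assumes the limit is counted in UTF-8 bytes; how GWToolset itself measures a title (for example, whether the file extension is included) is not stated here, so the figures it reports may differ from those above. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

TITLE_BYTE_LIMIT = 240  # the limit quoted in the comment above


def title_byte_length(title: str) -> int:
    """Length of the title in bytes, assuming UTF-8 encoding."""
    return len(title.encode("utf-8"))


def titles_over_limit(
    titles: Iterable[str], limit: int = TITLE_BYTE_LIMIT
) -> List[Tuple[str, int]]:
    """Return (title, byte length) for every title longer than the limit."""
    return [
        (title, title_byte_length(title))
        for title in titles
        if title_byte_length(title) > limit
    ]


# Accented characters such as "é" take two bytes each in UTF-8, so a title
# can exceed the limit while containing fewer than 240 characters.
samples = ["a" * 240, "é" * 121]
for title, size in titles_over_limit(samples):
    print(f"{size} bytes: {title[:20]}...")
```

Measuring bytes rather than characters matters for this dataset because the French titles contain many accented letters.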

dan-nl set Security to None.
Aklapper changed the subtype of this task from "Task" to "Bug Report". Feb 15 2022, 9:39 PM
Aklapper removed a subscriber: wikibugs-l-list.