
Import should always use original wiki's namespace names in log entries and trim namespaces it doesn't know in the target title to allow manual choice
Closed, ResolvedPublic

Description

There is a problem with transwiki importing; see the Latin Wikisource import log (http://la.wikisource.org/wiki/Specialis:Acta/import). Most log entries give the wrong source namespace, using the local name for the namespace instead of the original one, so the backlinks don't work. For example, for several imports from the fr.ws page namespace, the proper namespace is "page" on fr, but the log recorded "pagina", the Latin namespace name. The same occurred with page, user, and template imports from English Wikisource. However, a transwiki import of the template "{{Page}}" from the fr.ws template namespace correctly recorded the source as "fr:Modèle:Page" but placed it at the wrong local page, "Formula:Modèle:Page". It should have been imported from "fr:Modèle:Page" to the local page "Formula:Page".


Version: unspecified
Severity: enhancement
URL: https://test.wikipedia.org/w/index.php?title=Special:Log/import&dir=prev&offset=20121009224105&limit=6&type=import&user=
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=10342
https://bugzilla.wikimedia.org/show_bug.cgi?id=7240

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:55 PM
bzimport set Reference to bz30723.
bzimport added a subscriber: Unknown Object (MLST).

Increasing importance to "high": this is really a rather serious matter, as it means that logs do not give correct information on the source of an imported page. For the text of works this is trivial, but for templates etc. it could have licensing implications. It also means that transwiki importing is essentially broken, as many imports go to the wrong place.

This is a MediaWiki, not a Wikimedia bug, and part of the general issue of "metadata" (not) generated or considered by import, see also bug 5770.

This problem also exists at mul.ws; it was only just discovered because transwiki import didn't work there at all until recently. See the log at mul: http://wikisource.org/wiki/Special:Log/import

The import logs on mul.source and la.source show different problems.
The underlying limitation is that MediaWiki doesn't load the localised namespace names for all languages and can't possibly know the names of extra namespaces and local namespace aliases (which are wiki-specific configuration), nor can it use namespace IDs to translate, as the same ID can be used for different things (even across different Wikisources).
I've done some tests on test.wiki to (hopefully) show the problem better: https://test.wikipedia.org/w/index.php?title=Special:Log/import&dir=prev&offset=20121009224105&limit=6&type=import&user= ; note that "Hilfe" is defined locally as an extra namespace for "Help" in German (separate from the local Help:).

I think we can't expect Special:Import to be able to resolve all namespace issues, but it should definitely avoid creating pages like [[Template:Vorlage:SperrSchrift]] and let the importer fix things.
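To make the requested behaviour concrete, here is a minimal sketch in Python (not MediaWiki's actual PHP code). It assumes the source wiki's namespace names and IDs are available to the importer, and that only the built-in IDs 0 to 15 can be trusted to mean the same thing on both wikis. The function name `plan_import` and its parameters are hypothetical.

```python
CORE_NAMESPACE_MAX = 15  # assumption: IDs 0..15 are MediaWiki's built-in namespaces


def plan_import(source_title, interwiki, source_namespaces, local_names, chosen_ns=""):
    """Return (log_source, local_title) for one imported page.

    source_namespaces: namespace name on the source wiki -> namespace ID
    local_names:       namespace ID -> namespace name on the target wiki
    chosen_ns:         namespace picked manually by the importer ("" = main)
    """
    prefix, sep, rest = source_title.partition(":")
    # The log entry keeps the title exactly as the source wiki spells it.
    log_source = f"{interwiki}:{source_title}"

    ns_id = source_namespaces.get(prefix) if sep else None
    if ns_id is None:
        # No namespace the source wiki declares: the whole title is page text.
        base, local_ns = source_title, chosen_ns
    elif 0 <= ns_id <= CORE_NAMESPACE_MAX and ns_id in local_names:
        # Built-in namespace: swap the source name for the local name.
        base, local_ns = rest, local_names[ns_id]
    else:
        # Namespace the target cannot interpret: trim it, use the manual choice.
        base, local_ns = rest, chosen_ns

    local_title = f"{local_ns}:{base}" if local_ns else base
    return log_source, local_title


fr_namespaces = {"Modèle": 10, "Page": 104}  # illustrative values
la_names = {10: "Formula", 104: "Pagina"}    # illustrative values

print(plan_import("Modèle:Page", "fr", fr_namespaces, la_names))
print(plan_import("Page:Foo.djvu/1", "fr", fr_namespaces, la_names, chosen_ns="Pagina"))
```

With these inputs the template lands at "Formula:Page" while the log still says "fr:Modèle:Page". The page from the extra namespace is trimmed to "Foo.djvu/1" and placed wherever the importer chose.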

*** Bug 5770 has been marked as a duplicate of this bug. ***

How about taking a stepwise approach like:

  1. Show a list of all namespace names in the import and in the local wiki, with an automatically generated mapping suggestion.
  2. Allow the importer to adjust the mapping.
  3. Do the final import.
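The three steps above could be sketched roughly as follows. This is only an illustration in Python, not an existing MediaWiki interface: `suggest_mapping` and `apply_overrides` are hypothetical names, and the rule "trust built-in IDs 0 to 15, otherwise fall back to an exact name match" is an assumption made for the example.

```python
def suggest_mapping(source_namespaces, local_names):
    """Step 1: propose a source-name -> local-name mapping.

    source_namespaces: namespace name in the import -> namespace ID
    local_names:       namespace ID -> namespace name on the local wiki
    A value of None means "no safe guess, the importer must choose".
    """
    by_local_name = {name.casefold(): name for name in local_names.values()}
    suggestion = {}
    for name, ns_id in source_namespaces.items():
        if 0 <= ns_id <= 15 and ns_id in local_names:
            # Built-in namespaces are assumed to share IDs across wikis.
            suggestion[name] = local_names[ns_id]
        elif name.casefold() in by_local_name:
            # Same name exists locally, e.g. an identically named extra namespace.
            suggestion[name] = by_local_name[name.casefold()]
        else:
            suggestion[name] = None
    return suggestion


def apply_overrides(suggestion, overrides):
    """Step 2: the importer's manual choices win over the suggestion."""
    merged = dict(suggestion)
    merged.update(overrides)
    return merged


source = {"Modèle": 10, "Page": 104, "Portail": 100}  # illustrative values
local = {10: "Formula", 104: "Pagina", 100: "Portail"}  # illustrative values

suggested = suggest_mapping(source, local)
final = apply_overrides(suggested, {"Page": "Pagina"})
print(suggested)
print(final)  # step 3 would run the import using this mapping
```

The ID of an extra namespace is deliberately ignored here, because the same ID can mean different things on different wikis. "Page" therefore stays unmapped until the importer picks "Pagina".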

The downsides:
A) An uploaded file has to be preserved over some time including possibly multiple data submissions by the importer.
B) The import file has to be read twice. It has to be read and analyzed in its entirety during the first scan already, since the list of original namespaces at the beginning of the file does not cover namespace name aliases that may occur embedded in page data. Those need to be part of the mapping, however.

The good sides:

  • Most flexible.
  • Often used mappings can be preserved and automagically be recalled by the import process.
  • Step 1 could, incidentally, reveal some statistics to the importer, making it possible to skip importing implausible data.
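A first-pass scan of this kind might look like the sketch below. It is a simplified illustration in Python: the sample XML only imitates the shape of a MediaWiki export (a `<namespaces>` list followed by `<page>` elements), the function name `scan_dump` is hypothetical, and it inspects page titles only, not aliases embedded in page text.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_tag(element):
    """Strip an XML namespace such as '{http://...}title' down to 'title'."""
    return element.tag.rsplit("}", 1)[-1]


def scan_dump(stream):
    """Collect declared namespaces and count the title prefixes actually used."""
    declared = {}
    prefix_counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = local_tag(elem)
        if tag == "namespace":
            declared[elem.text or ""] = int(elem.get("key"))
        elif tag == "title":
            prefix, sep, _rest = (elem.text or "").partition(":")
            prefix_counts[prefix if sep else ""] += 1
        elif tag == "page":
            elem.clear()  # free memory once a page has been handled
    # Prefixes that look like namespaces but are not in the declared list.
    undeclared = {p for p in prefix_counts if p and p not in declared}
    return declared, prefix_counts, undeclared


SAMPLE = """<mediawiki>
  <siteinfo>
    <namespaces>
      <namespace key="0" />
      <namespace key="10">Vorlage</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Vorlage:SperrSchrift</title></page>
  <page><title>Hilfe:Import</title></page>
  <page><title>Beispiel</title></page>
</mediawiki>"""

declared, counts, undeclared = scan_dump(io.BytesIO(SAMPLE.encode("utf-8")))
print(declared, dict(counts), undeclared)
```

Prefixes that end up in `undeclared` (here "Hilfe") are the ones the importer would need to map by hand. The per-prefix counts are the kind of statistics that could reveal implausible data before anything is imported.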

Purodha, what you're asking is an entirely new import/convert interface, even more obscure than the current one and to be built from scratch. Please open another bug for that.

Oh, I did not mean to make this much fuss :-)
Reported as bug 41969

I'm not really sure that this is an enhancement. Yes, it technically changes an existing function, but the existing function *does not* function: it leaves faulty log entries which arguably violate our license and, as I note in comment 1, it essentially means that transwiki import doesn't work as designed and frequently puts imports in the wrong namespace or a pseudo-namespace.

If this technically really depends on bug 41969 (which is low priority), this also needs to be low priority.

> Increasing importance to "high" this is really a rather serious matter as it
> means that logs do not give correct information on the source of an imported
> page.

To me this seems to affect not urgency directly, but maybe severity.

Bug 40192 seems similar to this one.

(In reply to comment #11)

> Bug 40192 seems similar to this one.

Yes, they could be considered duplicates but the proposed solution in this bug is slightly more general.

Change 149293 had a related patch set uploaded by TTO:
Proper namespace handling for WikiImporter

https://gerrit.wikimedia.org/r/149293

Patch-For-Review

Change 149293 had a related patch set (by TTO) published:
Proper namespace handling for WikiImporter

https://gerrit.wikimedia.org/r/149293

Patch-For-Review

Change 149293 merged by jenkins-bot:
Proper namespace handling for WikiImporter

https://gerrit.wikimedia.org/r/149293

TTO lowered the priority of this task from Low to Lowest. Jan 6 2015, 1:27 AM
TTO added a subscriber: TTO.

This is 90% resolved. The only remaining issue is that it is not possible to individually choose namespace mappings for each page in a dump import. However, that is an extremely low priority issue.

TTO set Security to None.
TTO removed a subscriber: Unknown Object (MLST).
TTO claimed this task.
In T32723#956842, @TTO wrote:

The only remaining issue is that it is not possible to individually choose namespace mappings for each page in a dump import.

Actually, I guess that is T43969: Import should allow mapping of namespace names and aliases. I don't think we need this task open any longer.