
Add canonical namespaces and aliases to XML dumps
Open, Medium, Public, Feature

Description

The XML dump contains a siteinfo header with a <namespaces> tag that is very useful for processing the text in the dumps. It looks something like this:

<mediawiki ...snip... >
  <siteinfo>
    <sitename>Վիքիպեդիա</sitename>
    <base>http://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB</base>
    <generator>MediaWiki 1.23wmf15</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Մեդիա</namespace>
      <namespace key="-1" case="first-letter">Սպասարկող</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Քննարկում</namespace>
      <namespace key="2" case="first-letter">Մասնակից</namespace>

  ...snip...

    </namespaces>
  </siteinfo>
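
For illustration, here is a minimal Python sketch of how a dump consumer might read this header today. The sample string is an abridged copy of the header above, and the function name `read_dump_namespaces` is hypothetical. Real dumps declare an XML namespace on the `<mediawiki>` element; this sketch assumes it has been left out to keep the lookup simple.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the header shown above (assumption: no xmlns on <mediawiki>).
SAMPLE_HEADER = """<mediawiki>
  <siteinfo>
    <sitename>Վիքիպեդիա</sitename>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Մեդիա</namespace>
      <namespace key="-1" case="first-letter">Սպասարկող</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Քննարկում</namespace>
      <namespace key="2" case="first-letter">Մասնակից</namespace>
    </namespaces>
  </siteinfo>
</mediawiki>"""


def read_dump_namespaces(xml_text):
    """Return {namespace id: local name} from a dump's siteinfo header."""
    root = ET.fromstring(xml_text)
    names = {}
    for ns in root.iter("namespace"):
        # The main namespace (key="0") is an empty element, so text is None.
        names[int(ns.get("key"))] = ns.text or ""
    return names


if __name__ == "__main__":
    for ns_id, name in sorted(read_dump_namespaces(SAMPLE_HEADER).items()):
        print(ns_id, repr(name))
```

Only the local (content-language) names are available this way, which is the gap this task describes.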

Regretfully, this header does not include canonical namespace names or namespace aliases. However, an API request for "meta=siteinfo" does include these bits. For example, the call for http://hy.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases returns the following XML:

<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>

  ...snip...

    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>
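
As a rough sketch of what a consumer currently has to do online, the following Python collects every known name per namespace ID from an API response like the one above. The sample string is an abridged copy of that response, and the function name `names_by_id` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the API response shown above.
SAMPLE_API = """<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>
    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>"""


def names_by_id(xml_text):
    """Return {namespace id: set of local, canonical and alias names}."""
    root = ET.fromstring(xml_text)
    result = {}
    for ns in root.findall("./query/namespaces/ns"):
        names = result.setdefault(int(ns.get("id")), set())
        if ns.text:
            names.add(ns.text)              # local name
        if ns.get("canonical"):
            names.add(ns.get("canonical"))  # canonical name
    for ns in root.findall("./query/namespacealiases/ns"):
        result.setdefault(int(ns.get("id")), set()).add(ns.text)  # alias
    return result


if __name__ == "__main__":
    for ns_id, names in sorted(names_by_id(SAMPLE_API).items()):
        print(ns_id, sorted(names))
```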

The XML dump should be updated to include this important metadata about namespaces.


Version: 1.23.0
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=40010

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:07 AM
bzimport set Reference to bz62109.
bzimport added a subscriber: Unknown Object (MLST).

What would be the use case of having this information in the dump?

(In reply to Jesús Martínez Novo (Ciencia Al Poder) from comment #1)

> What would be the use case of having this information in the dump?

As I understand it, the XML dumps are targeted for offline use.

(In reply to Aaron Halfaker from comment #0)

> Regretfully, this header does not include canonical namespace names or
> namespace aliases. However, an API request for "meta=siteinfo" does include
> these bits.

This sounds as though people trying to re-use the dumps need to go online to get this information. I think this is a perfectly reasonable enhancement request.

I'm marking this ticket with the "easy" keyword because it shouldn't be very difficult to add this additional information to the XML dumps. The most challenging part here is figuring out whether it's the PHP or the Python maintenance scripts that generate these particular dumps. The actual output logic can probably be cribbed from the MediaWiki API.

Re. use case,

One common activity when processing wiki dumps is to extract historical link information -- something that can't be done with pagelinks. Let's say I'm processing an enwiki dump and I encounter the following link:

[[WP:Foo]]

Without knowing that "WP" is an alias of ns=4 ("Project"/"Wikipedia") I'd have to assume that "WP:Foo" is the title of an ns=0 article.

This is a problem for canonical namespace names too. The following link would reference the same page:

[[Project:Foo]]
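
A minimal Python sketch of the lookup this use case needs. The table is hand-written from the enwiki example above (ns=4 with local name "Wikipedia", canonical name "Project" and alias "WP"); the function name `split_link_target` is hypothetical, and the normalization is deliberately simplified (case-insensitive prefix match, underscores treated as spaces, no interwiki handling).

```python
# Assumed lookup for enwiki, built by hand from the example above.
NAMESPACE_LOOKUP = {
    "wikipedia": 4,  # local name
    "project": 4,    # canonical name
    "wp": 4,         # alias
}


def split_link_target(target, lookup):
    """Split a wikilink target into (namespace id, page title).

    Falls back to the main namespace (0) when the prefix is unknown.
    """
    prefix, sep, rest = target.partition(":")
    if sep:
        key = prefix.strip().replace("_", " ").lower()
        if key in lookup:
            return lookup[key], rest.strip()
    return 0, target.strip()


if __name__ == "__main__":
    print(split_link_target("WP:Foo", NAMESPACE_LOOKUP))       # (4, 'Foo')
    print(split_link_target("Project:Foo", NAMESPACE_LOOKUP))  # (4, 'Foo')
    # Without alias data, the same link looks like a main-namespace title.
    print(split_link_target("WP:Foo", {}))                     # (0, 'WP:Foo')
```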

What processing are you talking about? Do you have any script that handles the dump, other than importDump.php?

And what about interwiki links? Would you assume that [[commons:Foo]] would be also a page in the main namespace?

> And what about interwiki links? Would you assume that [[commons:Foo]] would be also a page in the main namespace?

WikiTeam always saves the siteinfo data in its dumps: without such information, it can be impossible a few months or years from now to discover what a piece of wikitext was supposed to mean. Parser functions are the worst offenders, but things like wikilinks are an easy win: I think pretty much anything returned by the siteinfo API module is worth including in the XML dump, in principle.

Note that the GCI task only relates to namespace aliases.

As for the namespaces themselves, I'd like to know what information should be provided in the dump. Currently we provide:

  • Namespace ID
  • Name in wiki's content language
  • Case sensitivity

The API meta=siteinfo module provides the following additional info:

  • Canonical name. This is only necessary for non-core namespaces, but I suppose for consistency's sake, we need to give it for all namespaces.
    • Should the canonical name be included even when it is identical to the local name?
    • Actually, from the perspective of a dump, the canonical name is surely just another alias. Could we just include it in the list of namespace aliases? That would make life simpler for importers.
  • Whether it has subpages. I first thought this very unlikely to be useful, but it should be included, per Nemo.
  • Whether it is a content namespace. Not needed.
  • Whether it is a "nonincludable" namespace. This is a wacky feature, probably very rarely used, but it can affect page appearance slightly when enabled. Is it worth worrying about in a dump? I doubt it.
  • Default content model. Since content model is explicitly specified for each page, we don't need this in the <siteinfo> section.

> Actually, from the perspective of a dump, the canonical name is surely just another alias. Could we just include it in the list of namespace aliases?

I can't think of a reason to avoid doing so. At worst, someone will add an alias that a future or past release of MediaWiki will bark at, but that's easy to fix without information loss.

> Whether it has subpages. Very unlikely to be useful.

Well, in theory this information is necessary to be able to parse the relative titles, like [[../]] and [[/]].
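
To make the dependency concrete, here is a simplified Python sketch of relative-link resolution. It only approximates the idea and does not attempt to reproduce MediaWiki's exact rules; the function name `resolve_relative` and the example titles are hypothetical. The point is that the result depends on whether the current namespace has subpages enabled.

```python
def resolve_relative(target, current_title, has_subpages):
    """Resolve a relative link target against the current page title.

    Simplified: handles "/Sub" and leading "../" segments only. A bare "/"
    is returned unchanged because its handling is not specified here.
    """
    if not has_subpages:
        # Without subpages, "/Sub" and "../" are treated as literal titles.
        return target
    if target.startswith("/"):
        sub = target[1:].rstrip("/")
        return current_title + "/" + sub if sub else target
    parts = current_title.split("/")
    rest = target
    while rest.startswith("../"):
        if len(parts) <= 1:
            return target  # no parent page to climb to
        parts.pop()
        rest = rest[3:]
    if rest != target:  # at least one "../" was consumed
        rest = rest.rstrip("/")
        return "/".join(parts + ([rest] if rest else []))
    return target


if __name__ == "__main__":
    print(resolve_relative("../", "Help:A/B", True))   # Help:A
    print(resolve_relative("../C", "Help:A/B", True))  # Help:A/C
    print(resolve_relative("/C", "Help:A", True))      # Help:A/C
    print(resolve_relative("/C", "A", False))          # /C
```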

Change 261269 had a related patch set uploaded (by Georggi199):
Export: Added namespace aliases to siteinfo

https://gerrit.wikimedia.org/r/261269

Reedy set Security to None.

Change 261912 had a related patch set uploaded (by Georggi199):
Export: Added check for subpages

https://gerrit.wikimedia.org/r/261912

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:23 PM
Aklapper removed a subscriber: wikibugs-l-list.