XML dump contains gender-specific namespaces that break search indexing of those namespaces
Closed, Resolved, Public

Description

Author: rainman

Description:
Currently Lucene doesn't support the gender-specific namespaces, which appear in XML dumps although they don't appear in the header. Could we have the XML dumps just use the canonical namespaces again, or add the non-canonical ones to the header?

Please note this completely breaks User namespace indexing, and makes user pages appear as main pages in search!!!


Version: unspecified
Severity: major

Details

Reference
bz32376

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:02 AM
bzimport set Reference to bz32376.

We should add them to the XML dumps.

What? Dumps don't have canonical namespace names or ids?

(In reply to comment #2)
> What? Dumps don't have canonical namespace names or ids?

There is a patch on bug 30513 to add namespace IDs. Right now people have to parse the namespace out of each title using the namespace map at the beginning of the dump, and that namespace map is missing gendered namespaces.
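To illustrate the failure mode described above, here is a minimal sketch of how a dump consumer might split a title using only the namespace map from the header. The map contents, the `split_title` helper, and the fallback to namespace 0 are assumptions for illustration, not the actual Lucene indexer code.

```python
# Hypothetical namespace map, as a consumer might build it from the
# namespace list in the dump header (name -> ID). Gendered variants
# such as "Benutzerin Diskussion" are missing, as reported in this bug.
HEADER_NAMESPACES = {
    "Diskussion": 1,
    "Benutzer": 2,
    "Benutzer Diskussion": 3,
}


def split_title(title, namespaces):
    """Return (namespace_id, page_name) for a full page title."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix], rest
    # Unknown prefix: the whole title is treated as a main-namespace page.
    return 0, title


canonical = split_title("Benutzer Diskussion:MrsMyer", HEADER_NAMESPACES)
gendered = split_title("Benutzerin Diskussion:MrsMyer", HEADER_NAMESPACES)
```

Under these assumptions the canonical title resolves to namespace 3, while the gendered title falls through to namespace 0, which matches the symptom of user pages showing up as main pages in search.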

rainman wrote:

Test case: try exporting User_talk:MrsMyer on de.wp, then look at the title of the exported page.

My German is a bit rustic, but I think this illustrates your point:

http://de.wikipedia.org/w/index.php?title=Benutzerin_Diskussion:MrsMyer&printable=yes

In the URL, the title starts with Benutzerin_Diskussion, while the exported page shows Benutzer Diskussion.

Rev. http://www.mediawiki.org/wiki/Special:Code/MediaWiki/82029 and Rev. http://www.mediawiki.org/wiki/Special:Code/MediaWiki/97461 (part of MW 1.18) introduced gender-sensitive namespaces.

The XML dump file contains both Benutzerin and Benutzer. Updating the <namespace> tag with both variants is probably the cleanest solution.

(In reply to comment #6)
> The XML dump file contains both Benutzerin and Benutzer. Updating the
> <namespace> tag with both variants is probably the cleanest solution.

While that's a good idea, your patch on bug 30513 for adding a namespace tag/field would also make this problem go away.

I think the right behavior is probably what was suggested by the reporter, to use the canonical namespace names in the dump. I'm not opposed to including the variants in the siteinfo along with which gender they go with, but that's secondary.

r103945 switches the export to canonical form.

I'd like to list the aliases and all, but we'll need to adjust the <siteinfo> format to make sure it doesn't explode on anything.

Removed dependency on bug 30513 -- that can remain independently open.
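The idea of exporting titles in canonical form can be pictured with the small sketch below. It is not the code from r103945; the alias table and the `canonical_title` helper are hypothetical and only show the intended effect on a title.

```python
# Hypothetical alias table: gendered namespace name -> canonical name.
GENDERED_ALIASES = {
    "Benutzerin": "Benutzer",
    "Benutzerin Diskussion": "Benutzer Diskussion",
}


def canonical_title(title, aliases):
    """Rewrite a gendered namespace prefix to its canonical form."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in aliases:
        return aliases[prefix] + ":" + rest
    # No known gendered prefix: leave the title untouched.
    return title


exported = canonical_title("Benutzerin Diskussion:MrsMyer", GENDERED_ALIASES)
unchanged = canonical_title("Hauptseite", GENDERED_ALIASES)
```

With titles normalized this way, a consumer that only knows the namespaces listed in the dump header can resolve every title without needing the gendered variants.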

rainman wrote:

Resolved in r104124

rainman wrote:

*** Bug 32629 has been marked as a duplicate of this bug. ***