Dump page titles for other namespaces
Closed, Resolved, Public

Description

Currently the only page titles available separately are namespace 0: all-titles-in-ns0.gz

Apart from this, most other titles are available in pages-articles.xml.bz2, except for User pages and Talk pages, which are available in pages-meta-current.xml.bz2.

The articles and meta-current dumps are typically a couple of orders of magnitude larger than the all-titles-in-ns0 dump.

The only ways to get complete lists of page titles are to download and process these two enormous dump files or to make excessive use of the API.
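To illustrate what "download and process" involves today, here is a minimal sketch that streams titles out of one of the XML dumps without loading it into memory. It assumes the dump follows the MediaWiki XML export layout with `<page>`, `<title>` and `<ns>` elements (older exports may lack `<ns>`, in which case the namespace comes back as `None`); the function names are hypothetical and chosen only for this example.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace-uri}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream):
    """Yield (namespace, title) for each <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = None
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = int(child.text)
        yield ns, title
        # Discard the page's revision text so memory use stays low.
        elem.clear()


def iter_titles_from_dump(path):
    """Open a .xml.bz2 dump (path is an assumption) and stream its titles."""
    with bz2.open(path, "rb") as stream:
        yield from iter_titles(stream)
```

Even with streaming, the whole compressed file still has to be downloaded and decompressed just to read the titles, which is the cost this request is trying to avoid.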

  • We could dump a page title list to accompany each of pages-articles.xml.bz2 and pages-meta-current.xml.bz2
  • We could dump a page title list for all namespaces.
  • We could dump a page title list for all pages not already covered by all-titles-in-ns0.gz
  • We could dump a page title list for each namespace.

For my current purpose I already need to process pages-articles.xml.bz2, so I only lack page titles for User and Talk pages. A dump of the titles for those namespaces would be enough for me, but it might not be the best option for other potential users of the data.


Version: unspecified
Severity: enhancement

Details

Reference
bz19542

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:38 PM)
bzimport set Reference to bz19542.

I'd like to hear from other users of the dumps about what would be most useful. Starting a thread on wikitech-l and xmldatadumps-l about this.

All titles are available in page.sql.gz if needed.

It's not in as convenient a format; I'm assuming that's the reason for the specific request. However I'd love to hear from the people on the bug (and from other users of the dumps).
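As a rough illustration of why the SQL dump is less convenient, the sketch below pulls namespace and title pairs out of the `INSERT` statements with a regular expression. It assumes each row starts with `page_id`, `page_namespace`, `page_title` in that order, and it only undoes simple backslash escapes such as `\'` and `\\`; it is not a general SQL parser, and the function names are hypothetical.

```python
import gzip
import re

# Assumption: every row tuple begins (page_id, page_namespace, 'page_title', ...
ROW_START = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)',")


def unescape(value):
    """Drop the backslash from simple escapes such as \\' and \\\\."""
    return re.sub(r"\\(.)", r"\1", value)


def titles_from_sql_line(line):
    """Yield (namespace, title) for each row in one INSERT statement."""
    if not line.startswith("INSERT INTO"):
        return
    for match in ROW_START.finditer(line):
        yield int(match.group(2)), unescape(match.group(3))


def iter_titles_from_page_sql(path):
    """Stream (namespace, title) pairs from a gzipped page table dump."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as lines:
        for line in lines:
            yield from titles_from_sql_line(line)
```

Titles in the page table are stored without their namespace prefix and with underscores in place of spaces, so a consumer would still need to map namespace numbers (for example 1 for Talk and 2 for User) back to names.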

Related URL: https://gerrit.wikimedia.org/r/66666 (Gerrit Change I7f53f1eb2f4396d6fc9f80625919a6c745bfa21f)

https://gerrit.wikimedia.org/r/66666 (Gerrit Change I7f53f1eb2f4396d6fc9f80625919a6c745bfa21f) | change APPROVED and MERGED [by ArielGlenn]

This is live for some projects and will be live for all after the next deployment. Closing.