Dump the article titles lists (all-titles-in-ns0.gz) every day
Closed, Resolved, Public

Description

Author: Wiki.Melancholie

Description:
The dump process loops with a period of roughly 2-3 weeks for smaller wikis and roughly 1-2 months for big ones like [[en:]]!

This means that the very helpful all-titles-in-ns0 lists can be up to 2 months old, totally outdated!

For enwiki it takes less than 30 seconds (26 sec in March) to dump this list, according to http://download.wikimedia.org/enwiki/20080312/

For all wikis combined this would mean about 1 minute or so. Is it possible to dump the all-titles-in-ns0 lists on a daily basis?

It would be very helpful to be able to analyse db dumps with up-to-date lists! Another reason is a potential feature in analysing User:Midom's stats, see [[User_talk:Henrik#Most_popular_nonexistent_articles.3F]].
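
(For reference: producing this list amounts to a single filtered scan of the page table. Below is a minimal Python sketch of what that step looks like, assuming the pymysql driver and placeholder connection details; the production job's actual query and setup are not shown here and may differ.)

    import gzip

    import pymysql  # assumed client library; any MySQL driver works

    # Placeholder connection details -- not the production setup.
    conn = pymysql.connect(host="db-host", user="reader", password="...", db="enwiki")
    try:
        with conn.cursor() as cur, gzip.open("all-titles-in-ns0.gz", "wt", encoding="utf-8") as out:
            # page_namespace = 0 restricts the scan to articles; the ORDER BY
            # keeps the output sorted like the published file.
            cur.execute(
                "SELECT page_title FROM page WHERE page_namespace = 0 ORDER BY page_title"
            )
            for (title,) in cur:
                if isinstance(title, bytes):  # page_title is varbinary in MediaWiki
                    title = title.decode("utf-8")
                out.write(title + "\n")
    finally:
        conn.close()

Since it is one pass over an indexed table, even enwiki finishes in well under a minute, which is consistent with the 26 seconds reported above.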


Version: unspecified
Severity: enhancement
URL: http://download.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz

Details

Reference
bz13693

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:03 PM
bzimport set Reference to bz13693.

mathias.schindler wrote:

Please note that you can access this list via http://download.wikipedia.org/sitemap/

Wiki.Melancholie wrote:

?
The sitemaps haven't even been updated since 2007-Dec-27 ;-)

This bug was recently made dependent on Bug 14415 -- Dump the article titles lists (all-titles-in-ns0.gz) unsorted.

Was this an error? Is making this file available more frequently really dependent on the sort order the file utilizes?

I'm being bold and removing the dependency but please restore it if the link was in fact intentional.

(In reply to comment #0)

Apologies for the really late pickup on this, but we're just now moving through all the data dump issues. Think you could elaborate a bit on what your daily use case is?

We haven't received many requests for a daily list and are therefore leaning toward making this available at two-week intervals. How does that fit into your use cases?

Wiki.Melancholie wrote:

It's just that the list for enwiki can currently be months old, which makes it not very usable for handling live content (e.g. with pywikipedia bots). A regular two-week scheme would of course be much better (if it is really regular, not interrupted). But the coolest thing would be up-to-date title lists, so we are not forced to use the API for this (reverting bots, stats, missing articles etc.).
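
(To make that use case concrete, here is a minimal sketch of the local-lookup pattern described above: load the gzipped title list into a set once, then test titles without any API round-trips. The file name and the candidate titles are illustrative assumptions.)

    import gzip

    # Load the title list once; the dump has one title per line,
    # with spaces written as underscores.
    with gzip.open("enwiki-latest-all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
        existing = {line.rstrip("\n") for line in f}

    # Hypothetical candidates, e.g. drawn from Midom's page-view stats.
    candidates = ["Main_Page", "Some_nonexistent_article"]
    missing = [t for t in candidates if t not in existing]
    print(missing)  # the "most popular nonexistent articles" would be among these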

http://dumps.wikimedia.org/other/pagetitles/

These will be dumped on a daily basis. We don't plan to keep them forever; maybe about 30 days' worth before they get tossed. Enjoy.

Also, p.s.: we might move the location around depending on whether more daily things get dumped.

And a final p.s.: anyone not on the xmldatadumps-l list should get on it, because announcements and discussions happen there that could affect users of the dumps.
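
(For anyone scripting against the new location, a hedged fetch-and-check sketch follows. The per-day file name below is an assumption; check the directory index for the actual naming, especially since, per the p.s. above, the location itself may move.)

    import gzip
    import shutil
    import urllib.request

    BASE = "http://dumps.wikimedia.org/other/pagetitles/"
    NAME = "20111201-all-titles-in-ns0.gz"  # hypothetical file name; see the index page

    # Download the daily list...
    with urllib.request.urlopen(BASE + NAME) as resp, open(NAME, "wb") as out:
        shutil.copyfileobj(resp, out)

    # ...and do a quick sanity check on it.
    with gzip.open(NAME, "rt", encoding="utf-8") as f:
        print(sum(1 for _ in f), "titles")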