
Create -latest alias for dumps
Closed, ResolvedPublic

Description

On toolserver the dumps were stored as "enwiki-latest-pages-articles.xml" for example. This allowed users to hardcode the path without worrying about the date.

It would be nice if there were symlinks so that we can just hardcode a path without trying to figure out which date is the latest and then using that one.


Version: unspecified
Severity: normal

Details

Reference
bz45646

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:14 AM
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz45646.

Changing to infrastructure, as this is something labs-wide which I have no access to (it was previously assigned to the bots project).

  • Bug 56093 has been marked as a duplicate of this bug.

Can we please get some progress on this? It shouldn't be rocket science. (A basic 20-line Python script could probably achieve the goal, if we cannot find a nicer way.)

I'm thinking this is easiest done around the dumps process itself. Ariel, thoughts?

I've looked at that recently ("it should be so easy!"), but the dumps process at present isn't as straightforward as you would think.

I assumed it would be enough to replace "rsync DIR" with "rsync DIR && ln -s ...", but the reality (cf. http://git.wikimedia.org/tree/operations%2Fpuppet.git; "download::gluster" is the class that feeds /public/datasets/public) is much more complicated: a list of files (!) to sync is produced on the remote, that list is rsync'ed to local and then fed to several (!) rsync workers (limited by count of files and size to transfer) that do the actual copying. This process runs continuously, so there is no obvious point to hook into.

This complexity is probably due to the requirement to sync more than 4 TBytes of data :-). I *think* it would be possible to have a cron job that just sets the symlinks without upsetting rsync too much, but I'm definitely not sure about that :-).

I'm adding platonides to the CC list and asking for their input; they run the /dumps project on the toolserver.

The dumps script does this at time of dump creation, leaving other symlinks in the directory untouched. Someone would have to write a short script that goes through and updates links after each rsync completes.
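A rough sketch of what such a link-updating script might look like, in Python as suggested earlier in the thread. This is not the dumps project's actual code. It assumes each wiki directory contains dated subdirectories named YYYYMMDD holding files named <wiki>-<date>-<rest>, and that links should land in a "latest" subdirectory. The function name update_latest_links is made up for illustration. It does not try to detect whether rsync has finished filling the newest directory, so that check would still have to be added.

```python
import os
import re
from pathlib import Path

# Assumption: dated dump directories are named YYYYMMDD.
DATE_DIR = re.compile(r"^\d{8}$")


def update_latest_links(wiki_dir):
    """Point <wiki>/latest/<wiki>-latest-* at files in the newest dated directory.

    Returns the names of the links that were created or replaced.
    Other entries already present in latest/ are left untouched.
    """
    wiki_dir = Path(wiki_dir)
    wiki = wiki_dir.name
    dated = sorted(
        d.name for d in wiki_dir.iterdir()
        if d.is_dir() and DATE_DIR.match(d.name)
    )
    if not dated:
        return []
    newest = dated[-1]  # YYYYMMDD names sort chronologically as strings

    latest_dir = wiki_dir / "latest"
    latest_dir.mkdir(exist_ok=True)

    prefix = f"{wiki}-{newest}-"
    updated = []
    for dump_file in sorted((wiki_dir / newest).iterdir()):
        if not dump_file.name.startswith(prefix):
            continue
        link = latest_dir / f"{wiki}-latest-{dump_file.name[len(prefix):]}"
        target = os.path.join("..", newest, dump_file.name)  # relative target
        # Create the link under a temporary name, then rename it into place,
        # so readers never see a missing link while it is being swapped.
        tmp = link.with_name(link.name + ".tmp")
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()
        os.symlink(target, tmp)
        os.replace(tmp, link)
        updated.append(link.name)
    return updated
```

Something like this could be run from cron once per wiki directory, along the lines of the cron idea above; whether that interacts badly with the continuous rsync workers is exactly the open question raised there.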

coren lowered the priority of this task from Medium to Lowest. Jan 8 2015, 9:08 PM
coren set Security to None.
yuvipanda raised the priority of this task from Lowest to Medium. Feb 27 2015, 2:17 PM

This was about the copies of dumps rsynced to labs; now those are made available directly from the web server, which should have the appropriate -latest links there too. Are folks still seeing a problem?

Adding @Bstorm for a definitive answer to this question: the web server has the 'latest' links. Do labs instances/toolforge see those?

If this looks right to you, then yes.

bstorm@tools-sgebastion-07:/public/dumps/public/zawiktionary/latest$ ls
zawiktionary-latest-abstract.xml.gz               zawiktionary-latest-md5sums.txt                                       zawiktionary-latest-protected_titles.sql.gz-rss.xml
zawiktionary-latest-abstract.xml.gz-rss.xml       zawiktionary-latest-pagelinks.sql.gz                                  zawiktionary-latest-redirect.sql.gz
zawiktionary-latest-all-titles.gz                 zawiktionary-latest-pagelinks.sql.gz-rss.xml                          zawiktionary-latest-redirect.sql.gz-rss.xml
zawiktionary-latest-all-titles.gz-rss.xml         zawiktionary-latest-page_props.sql.gz                                 zawiktionary-latest-sha1sums.txt
zawiktionary-latest-all-titles-in-ns0.gz          zawiktionary-latest-page_props.sql.gz-rss.xml                         zawiktionary-latest-siteinfo-namespaces.json.gz
zawiktionary-latest-all-titles-in-ns0.gz-rss.xml  zawiktionary-latest-page_restrictions.sql.gz                          zawiktionary-latest-siteinfo-namespaces.json.gz-rss.xml
zawiktionary-latest-categorylinks.sql.gz          zawiktionary-latest-page_restrictions.sql.gz-rss.xml                  zawiktionary-latest-sites.sql.gz
zawiktionary-latest-categorylinks.sql.gz-rss.xml  zawiktionary-latest-pages-articles-multistream-index.txt.bz2          zawiktionary-latest-sites.sql.gz-rss.xml
zawiktionary-latest-category.sql.gz               zawiktionary-latest-pages-articles-multistream-index.txt.bz2-rss.xml  zawiktionary-latest-site_stats.sql.gz
zawiktionary-latest-category.sql.gz-rss.xml       zawiktionary-latest-pages-articles-multistream.xml.bz2                zawiktionary-latest-site_stats.sql.gz-rss.xml
zawiktionary-latest-change_tag.sql.gz             zawiktionary-latest-pages-articles-multistream.xml.bz2-rss.xml        zawiktionary-latest-stub-articles.xml.gz
zawiktionary-latest-change_tag.sql.gz-rss.xml     zawiktionary-latest-pages-articles.xml.bz2                            zawiktionary-latest-stub-articles.xml.gz-rss.xml
zawiktionary-latest-externallinks.sql.gz          zawiktionary-latest-pages-articles.xml.bz2-rss.xml                    zawiktionary-latest-stub-meta-current.xml.gz
zawiktionary-latest-externallinks.sql.gz-rss.xml  zawiktionary-latest-pages-logging.xml.gz                              zawiktionary-latest-stub-meta-current.xml.gz-rss.xml
zawiktionary-latest-geo_tags.sql.gz               zawiktionary-latest-pages-logging.xml.gz-rss.xml                      zawiktionary-latest-stub-meta-history.xml.gz
zawiktionary-latest-geo_tags.sql.gz-rss.xml       zawiktionary-latest-pages-meta-current.xml.bz2                        zawiktionary-latest-stub-meta-history.xml.gz-rss.xml
zawiktionary-latest-imagelinks.sql.gz             zawiktionary-latest-pages-meta-current.xml.bz2-rss.xml                zawiktionary-latest-templatelinks.sql.gz
zawiktionary-latest-imagelinks.sql.gz-rss.xml     zawiktionary-latest-pages-meta-history.xml.7z                         zawiktionary-latest-templatelinks.sql.gz-rss.xml
zawiktionary-latest-image.sql.gz                  zawiktionary-latest-pages-meta-history.xml.7z-rss.xml                 zawiktionary-latest-user_groups.sql.gz
zawiktionary-latest-image.sql.gz-rss.xml          zawiktionary-latest-pages-meta-history.xml.bz2                        zawiktionary-latest-user_groups.sql.gz-rss.xml
zawiktionary-latest-iwlinks.sql.gz                zawiktionary-latest-pages-meta-history.xml.bz2-rss.xml                zawiktionary-latest-wbc_entity_usage.sql.gz
zawiktionary-latest-iwlinks.sql.gz-rss.xml        zawiktionary-latest-page.sql.gz                                       zawiktionary-latest-wbc_entity_usage.sql.gz-rss.xml
zawiktionary-latest-langlinks.sql.gz              zawiktionary-latest-page.sql.gz-rss.xml
zawiktionary-latest-langlinks.sql.gz-rss.xml      zawiktionary-latest-protected_titles.sql.gz

These are symlinks to the latest dump.

Do those links resolve to actual files (in the case of the non-RSS ones)? If so, then this can be closed, which would be awesome.

They look good (not broken), and I can zless the ones I tried as a sample. I'll close it.
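For anyone wanting to repeat that check across a whole directory rather than sampling by hand, a minimal Python sketch follows. The function name broken_links is hypothetical. It only reports dangling symlinks; it does not verify that a file decompresses cleanly the way zless does.

```python
from pathlib import Path


def broken_links(latest_dir):
    """Return the names of symlinks in latest_dir whose target is missing."""
    broken = []
    for entry in sorted(Path(latest_dir).iterdir()):
        # exists() follows the link, so it is False for a dangling symlink.
        if entry.is_symlink() and not entry.exists():
            broken.append(entry.name)
    return broken
```

Called on a path such as /public/dumps/public/zawiktionary/latest, an empty list would mean every link resolves.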