
Mirror more Kiwix downloads directories
Closed, Resolved | Public | Feature

Description

The WMF hosts a mirror of the ZIM files we generate at Kiwix at
http://dumps.wikimedia.org/kiwix/. This is of great value to us.

For the past 6 months, we have been giving priority on our web site to
"portable packages": big zip files containing Kiwix, a ZIM file, and a
full-text index. They are much easier to use and are well appreciated by
Windows and Linux users.

Our problem is that our traffic is growing and a significant part of
it is generated by these portable packages. These zip files are available at http://download.kiwix.org/portable/ and it would be great if this directory could also be mirrored.

This directory is larger than the zim directory, roughly 1.4 times its size: currently 239G for the "zim" directory and 329G for the "portable" directory.

We are also reorganizing the directory structure to create thematic sub-directories, so the directories which should be mirrored in addition to the current "zim/0.9/" one would be:
zim/wikipedia
zim/wikisource
zim/wiktionary
zim/wikivoyage
...


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 2:19 AM
bzimport set Reference to bz55503.

No feedback on this. This is not really urgent, but would it at least be possible to also mirror "zim/wikipedia" in addition to "zim/0.9/"? New Wikipedia ZIM files are stored there instead of in "zim/0.9" (which is still necessary for legacy purposes).

Ariel: Could you answer comment 1?

@Ariel, it would be really great if the rsync configuration could be adapted now. More and more ZIM files are in the new hierarchy.

After a short chat on IRC with Kelson, it turns out that they are now looking at about 2.5TB, which we don't have to spare. This is a fine time to get more storage in any case, as we have been close to the edge on dataset1001 for some time now. I'll be adding a ticket for that shortly.

In the meantime I've updated the kiwix rsync job to pull from 'wikipedia' instead of the obsolete '0.9' directory for now.

This was held up due to directory permissions but is running now.

excerpt from discussion on irc:

(06:01:59 PM) Kelson: andre: apergos: we currently working on a solution (with wmflabs) to create, each month, new version of all our project. This should be ready in the next month. So next step is to prepare a nice page summarize everything (for example @http://dumps.wikimedia.org/kiwix/)
(06:02:27 PM) Kelson: andre: apergos: but to do that we need to have the snapshots available
(06:03:48 PM) apergos: ok. I just need an estimate of the total space, so I can make sure we have or can get capacity in a timely fashion
(06:05:12 PM) apergos: how many snapshots would you want us to keep?
(06:05:17 PM) Kelson: apergos: it's moving but we talk about ~2.5 TB
(06:06:54 PM) Kelson: apergos: I don't plan to keep trace of old snapshots - so don't keep more than one "old" snapshot to let the started downloads finish correctly and then delete them
(06:07:57 PM) apergos: well we won't get 3t of more storage, we'll get a chunk, if I have anything to say about it that is
(06:08:35 PM) Kelson: apergos: for now, http://dumps.wikimedia.org/kiwix is a "slave" of http://download.kiwix.org... but if everything works well... we might think to change the way it works. The master being then download.wikimedia.org
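
The retention rule Kelson describes in the excerpt (keep the current snapshot plus at most one "old" one so that downloads already in progress can finish) could be sketched roughly as follows. This is only an illustration: the function name, the `keep_old` parameter, and the assumption that snapshots are identified by sortable "YYYY-MM" strings are all hypothetical and do not come from the actual mirror setup.

```python
def snapshots_to_delete(snapshot_ids, keep_old=1):
    """Return the snapshot ids that may be deleted.

    Assumption (hypothetical): ids such as "2015-03" sort chronologically
    when sorted as plain strings. The newest snapshot is always kept,
    together with `keep_old` older ones.
    """
    if keep_old < 0:
        raise ValueError("keep_old must be >= 0")
    ordered = sorted(snapshot_ids)
    keep = 1 + keep_old  # newest snapshot + the allowed "old" ones
    if len(ordered) <= keep:
        return []
    return ordered[:-keep]


if __name__ == "__main__":
    # With three monthly snapshots, only the oldest is a deletion candidate.
    print(snapshots_to_delete(["2015-01", "2015-03", "2015-02"]))
```

Keeping exactly one old snapshot doubles the peak storage needed per project only briefly, which matches the concern about total space raised in the same conversation.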

https://phabricator.wikimedia.org/T93118 is not a blocker, but it is needed for the final configuration; in the short term we can proxy through dataset1001 to the kiwix dump creation box

Are T91853: Hardware for HTML / zim dumps and T93113: deploy francium for html/zim dumps actually related? They don't mention giving 3 TB for Kiwix ZIM files. If there is no space for mirroring any further directory, this seems blocked on T93118#1149493

The current approach is to share francium as both a processing AND storage solution. I have a few concerns about whether the francium RAID system can deliver adequate performance for both at the same time, but let's try it like this and then see if it needs any improvement.

The current approach (serving off of francium) is meant to be temporary.

ArielGlenn raised the priority of this task from Low to Medium. Sep 29 2015, 11:10 AM
ArielGlenn set Security to None.

Since we have the new array in place for some time now, let's revisit this and see how much more we can serve from WMF servers.

@Ariel
Great, please let me know if you need something from my side.

@ArielGlenn Any chance this ticket could be implemented some time? It also seems that the Wikimedia mirror has not synced for at least a month, because all the Wikipedia ZIM files there are older than a month.

Nice to see this task come back after being idle so long.
The servers which mirror externally generated bundles like these belong to the WMCS team as of several years ago. So I'm looping them in. Those servers are currently up for a refresh due in part to space being tight. They will be able to talk about capacity and expected growth over the next few years.

For the mirroring problem you describe, would you mind opening a separate task about that? I'd prefer to keep this one just for the space discussion.

@ArielGlenn Thank you for your feedback. I have created another task here: https://phabricator.wikimedia.org/T299993

@ArielGlenn Thank you for putting WMCS in the loop. On what timeline should this refresh happen? I guess nothing will be done on this task until the refresh is complete.

Here are the current disk usages:

$ du -sh wiki*
46G	wikibooks
343G	wikihow
38G	wikinews
2.3T	wikipedia
7.5G	wikiquote
258G	wikisource
20G	wikiversity
9.7G	wikivoyage

I don't know details myself but the relevant task is T286588

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:23 PM
Aklapper removed a subscriber: Mholloway.

@ArielGlenn Hi, I'm already coming back to you! T299993 is hidden from me and I have no visibility on it. Is it already implemented? If not, what timeline are we looking at?

A related topic is the location of http://dumps.wikimedia.org/kiwix/. It's in the US, but more and more downloads come from Asia, with China being the top downloading country. There is also the war in Ukraine, which has led to many downloads from Russia. See https://slate.com/technology/2022/03/russia-wikipedia-download-kiwix.html. We have many mirrors in Europe (with ISPs and universities) but nothing in Asia. I'm quite pessimistic about getting a free mirror there in the near future. I wonder if the WMF could also (or instead) mirror the ZIM files in eqsin?

I don't see why it's hidden, I was able to see it as a not-logged-in user. It's closed/resolved.

Location isn't something I control. There are just the two backing hosts, both in the US; you might talk with the WMCS team about it.

@ArielGlenn Sorry, I meant T286588

You can follow along with the install at T302981

@ArielGlenn It seems that T302981 has just been implemented. Does that mean you have no blocker anymore for this task?

I don't even have ownership now :-) WMCS coordinates all the space needs, so you should talk to them about what you need and how much can be accommodated. When the new boxes arrive, there ought to be something that can be done!

@ArielGlenn Can you please reassign the ticket? I have no clue who, concretely, WMCS is.

Unassigning (if I understand correctly); this is already tagged with Cloud-Services

Aklapper added a subscriber: ArielGlenn.

I wouldn't assign it to a specific person there. But you could maybe ping @nskaggs to raise awareness (team manager). Or I could... just did, heh :-)

Looping in @Andrew. @Kelson note that yes, we are installing new, more capable machines that have more capacity than in years past. Once they are up and running, we can explore mirroring this additional data.

On the question of the dumps location, I can confirm it is still in the US. That's unlikely to change unfortunately. You can read more at https://wikitech.wikimedia.org/wiki/Data_centers.

Thank you!

@Kelson Can you clarify how much additional space would be needed now? I saw the description of around .5T, but also your chart showing 3T total worth of data. Either way, I believe the new hosts can support this additional data. T302981 tells the progress of the new hosts, but in short, they aren't yet ready for syncing.

@Kelson: Could you please answer the last comment? Thanks in advance!

@nskaggs @Aklapper I'm not sure I fully understand the question. The answer was posted earlier this year at https://phabricator.wikimedia.org/T57503#7655695 (but you should not mirror "wikihow"). So just leave out the "wikipedia" folder (which is already mirrored) and you will have the answer, which is a bit less than 0.5 TB.
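
As a rough check of the "a bit less than 0.5 TB" figure, the sizes from the `du` listing earlier in this thread can be summed after dropping "wikipedia" (already mirrored) and "wikihow" (not to be mirrored). This is only a sketch under two assumptions: the G/T suffixes printed by `du -sh` are treated as powers of 1024, and wiktionary is left out because it does not match the `wiki*` glob and so has no size in that listing. The function names are illustrative.

```python
# Sizes copied from the `du -sh wiki*` output posted earlier in this task.
DU_OUTPUT = """\
46G wikibooks
343G wikihow
38G wikinews
2.3T wikipedia
7.5G wikiquote
258G wikisource
20G wikiversity
9.7G wikivoyage
"""

# Assumption: du -sh suffixes are binary units (1T = 1024G).
UNIT_IN_GIB = {"G": 1.0, "T": 1024.0}

# wikipedia is already mirrored; wikihow should not be mirrored.
EXCLUDED = {"wikipedia", "wikihow"}


def parse_du(text):
    """Map directory name -> size in GiB from `du -sh`-style lines."""
    sizes = {}
    for line in text.splitlines():
        size, name = line.split()
        sizes[name] = float(size[:-1]) * UNIT_IN_GIB[size[-1]]
    return sizes


def additional_space_gib(text, excluded=EXCLUDED):
    """Sum the sizes of every listed directory that is not excluded."""
    return sum(s for name, s in parse_du(text).items() if name not in excluded)


if __name__ == "__main__":
    print(f"{additional_space_gib(DU_OUTPUT):.1f} GiB of additional data")
```

The six remaining directories add up to about 379G, which leaves some headroom below 0.5 TB for the unlisted wiktionary directory.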

The new boxes are installed and storage should no longer be an issue. What is needed to proceed with this request?

@nskaggs I'm not sure whether this question is for me, but in case it is, could you please change the ZIM mirroring rsync command to:

  • pull from master.download.kiwix.org instead of download.kiwix.org
  • sync the following directories in addition to wikipedia: wikibooks, wikinews, wikiquote, wikisource, wikiversity, wiktionary, wikivoyage
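
The two requested changes can be illustrated with a small sketch that builds one rsync invocation per project. The flags, rsync module name, and destination path are copied from the command reported as running on clouddumps1001 later in this task; the Python wrapper itself, including the names `rsync_command` and `PROJECTS`, is hypothetical and is not the actual kiwix-rsync-cron.sh script from operations/puppet.

```python
import shlex

# Values taken from the rsync command quoted later in this task.
RSYNC_HOST = "master.download.kiwix.org"
RSYNC_MODULE = "download.kiwix.org"
DEST_ROOT = "/srv/dumps/xmldatadumps/public/other/kiwix/zim"

# wikipedia plus the sister projects requested above.
PROJECTS = [
    "wikipedia", "wikibooks", "wikinews", "wikiquote",
    "wikisource", "wikiversity", "wiktionary", "wikivoyage",
]


def rsync_command(project, bwlimit=40000):
    """Build the rsync argument list for one project directory."""
    source = f"{RSYNC_HOST}::{RSYNC_MODULE}/zim/{project}/"
    dest = f"{DEST_ROOT}/{project}/"
    return ["/usr/bin/rsync", "-rlptq", "--delete",
            f"--bwlimit={bwlimit}", source, dest]


if __name__ == "__main__":
    # Print the commands instead of running them; this is a dry illustration.
    for project in PROJECTS:
        print(shlex.join(rsync_command(project)))
```

Building the command as an argument list rather than a single string avoids shell quoting problems if a directory name ever contains unusual characters.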

Change 848441 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] dumps: switch kiwix download host to master.download.kiwix.org

https://gerrit.wikimedia.org/r/848441

Change 848444 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] dumps: add sister projects to kiwix dumps rsync

https://gerrit.wikimedia.org/r/848444

Change 848441 merged by Dzahn:

[operations/puppet@production] dumps: switch kiwix download host to master.download.kiwix.org

https://gerrit.wikimedia.org/r/848441

Mentioned in SAL (#wikimedia-operations) [2022-10-28T20:37:55Z] <mutante> clouddumps1001 - puppet run after merging gerrit:848441 for kiwix, changed ferm status from "stopped" to "running". manually ran 'sudo systemctl start kiwix-mirror-update' T57503

Change 848444 merged by Dzahn:

[operations/puppet@production] dumps: add sister projects to kiwix dumps rsync

https://gerrit.wikimedia.org/r/848444

Mentioned in SAL (#wikimedia-operations) [2022-10-28T20:42:50Z] <mutante> clouddumps* - deployed gerrit:848444 - as kind of expected it fails - most likely the project dirs are not automatically created before rsync runs the first time - T57503
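
If the guess in the log entry above were the cause (the follow-up patch in this task fixes a syntax error instead, so it may well not have been), one way to rule it out would be to create each project directory before the first sync. The sketch below is hypothetical: the function name and its arguments are illustrative and do not come from the puppet code.

```python
import os
import tempfile


def ensure_project_dirs(dest_root, projects):
    """Create dest_root/<project> for each project; return the full paths.

    Directories that already exist are left untouched.
    """
    paths = []
    for project in projects:
        path = os.path.join(dest_root, project)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than the real mirror path.
    with tempfile.TemporaryDirectory() as root:
        for path in ensure_project_dirs(root, ["wikibooks", "wikinews"]):
            print(path, os.path.isdir(path))
```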

Change 850588 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] dumps: fix syntax error in kiwix-rsync-cron.sh

https://gerrit.wikimedia.org/r/850588

Change 850588 merged by Dzahn:

[operations/puppet@production] dumps: fix syntax error in kiwix-rsync-cron.sh

https://gerrit.wikimedia.org/r/850588

I deployed the changes above plus a small bugfix follow-up, and started the sync service manually.

actual command now running on clouddumps1001...

├─2969709 /usr/bin/rsync -rlptq --delete --bwlimit=40000 master.download.kiwix.org::download.kiwix.org/zim/wikibooks/ /srv/dumps/xmldatadumps/public/other/kiwix/zim/wikibooks/
Dzahn changed the task status from Open to In Progress. Oct 28 2022, 10:58 PM
Dzahn claimed this task.