
Let Internet Archive's Wayback machine archive etherpads
Closed, Resolved, Public

Description

We all make heavy use of web.archive.org and we're expanding it ([[mw:Archived Pages]]), so let's use it also for Etherpad.
Akosiaris tells me the current robots.txt is just the default, so this is IMHO a trivially desirable change.

Hopefully, adding this should be enough (https://webarchive.jira.com/browse/HER-1):

User-agent: ia_archiver
Allow: /
Allow: /p/

But once deployed it's easy to check with their new live-retrieving/on-demand saving feature.
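Before deployment, the proposed rules can also be sanity-checked locally. The following is only a minimal sketch using Python's standard `urllib.robotparser`; the pad name `example-pad`, the `SomeOtherBot` agent, and the "default" block (`User-agent: *` / `Disallow: /`) are assumptions made for illustration, not the confirmed contents of the live robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Rules proposed in this report.
PROPOSED_RULES = """\
User-agent: ia_archiver
Allow: /
Allow: /p/
"""

# ASSUMPTION: a catch-all block that blocks every other crawler.
# The real default robots.txt may differ.
ASSUMED_DEFAULT = """\
User-agent: *
Disallow: /
"""


def allows(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `rules` permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    combined = PROPOSED_RULES + "\n" + ASSUMED_DEFAULT
    pad = "http://etherpad.wikimedia.org/p/example-pad"  # hypothetical pad
    print("ia_archiver allowed:", allows(combined, "ia_archiver", pad))
    print("SomeOtherBot allowed:", allows(combined, "SomeOtherBot", pad))
```

This only checks how one parser reads the rules; whether IA's crawler interprets them the same way still has to be confirmed with the on-demand saving feature.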

More background from #wikimedia-tech:
akosiaris> [...] I must say etherpad.wikimedia.org never was intended for permanent storage. Preservation of a pad is up to the people interested in preserving that pad in another format. The software is well known to corrupt pads (hopefully the latest issues are resolved with 1.3.0 but we never know when others might show up) and restoring a pad from database backups is nigh to impossible. [...]
Nemo_bis> akosiaris: that's what I'm saying :) if we don't plan to make archives, let's let others do so


Version: unspecified
Severity: enhancement
URL: http://etherpad.wikimedia.org/robots.txt

Details

Reference
bz56893

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 2:28 AM)
bzimport set Reference to bz56893.
bzimport added a subscriber: Unknown Object (MLST).

Commenting just to make something clear: changing the robots.txt will not make the Internet Archive automagically archive pads, because no links exist for any spider to follow. Pads whose links have been posted in various places might get archived, but whether that happens depends entirely on IA's spider implementation. The "no links" problem can be solved by having a page that lists all pads. That in turn could possibly be done with one of the various pad listing plugins, but last we checked none of them were production quality.

Some more info can be found here:
https://bugzilla.wikimedia.org/show_bug.cgi?id=30240

Yes, I plan to list or submit all publicly known URLs myself later.
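As a rough illustration of what "submitting" known URLs could look like, the sketch below turns a list of pad URLs into on-demand save requests. It assumes the Wayback Machine's public save address has the form `https://web.archive.org/save/<url>`; the pad names are hypothetical, and the code only builds the request URLs without contacting any server.

```python
from typing import Iterable, List

# ASSUMPTION: prefix of the Wayback Machine's on-demand save address.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_request_urls(pad_urls: Iterable[str]) -> List[str]:
    """Build one save-request URL per distinct, non-empty pad URL."""
    seen = set()
    requests = []
    for url in pad_urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        seen.add(url)
        requests.append(SAVE_PREFIX + url)
    return requests


if __name__ == "__main__":
    # Hypothetical pad names, for illustration only.
    known_pads = [
        "http://etherpad.wikimedia.org/p/example-pad",
        "http://etherpad.wikimedia.org/p/another-pad",
        "http://etherpad.wikimedia.org/p/example-pad",  # duplicate
    ]
    for request in save_request_urls(known_pads):
        print(request)
```

Actually fetching each generated URL (at a polite rate) would be a separate step and is left out here.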

Do we know this approach will work?

(In reply to comment #3)
> Do we know this approach will work?

What do you mean? This bug is currently "Let Internet Archive's Wayback machine archive etherpads", not "Ensure Internet Archive's Wayback machine has copies of all etherpads". As long as retrieval works, this bug can be closed. Improving crawl coverage beyond their usual performance will be a separate effort.

Quite true. I was looking at the forest and forgot about the tree. Anyway, I'll submit a patchset to implement this.

I believe this is fixed by https://gerrit.wikimedia.org/r/#/c/117845/

I will now close this ticket; feel free to reopen it.