
Add exception in robots.txt to allow the Internet Archiver to index action=raw
Closed, Declined
Public

Description

Presently, robots.txt has Disallow: /w/. Thus, the raw wikitext of pages isn't accessible via the Internet Archive; see e.g. https://web.archive.org/web/20140307111730/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

This is in contrast to sites like WikiIndex, which allow it: https://web.archive.org/web/20131021230044/http://wikiindex.org/index.php?title=Welcome&action=raw
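
To illustrate why the Disallow: /w/ rule blocks raw wikitext, here is a minimal sketch using Python's standard urllib.robotparser. Only the Disallow: /w/ line comes from this report; the "User-agent: *" group it sits in and the "ia_archiver" crawler token are assumptions made for the example, and the rest of the live robots.txt is not reproduced.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt: only "Disallow: /w/" is taken from this report.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ia_archiver" is an assumed user-agent token for the Internet Archive crawler.
AGENT = "ia_archiver"
RAW_URL = "http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw"
ARTICLE_URL = "http://en.wikipedia.org/wiki/Main_Page"

# The raw URL lives under /w/, so it is blocked; /wiki/ URLs are not.
raw_allowed = parser.can_fetch(AGENT, RAW_URL)
article_allowed = parser.can_fetch(AGENT, ARTICLE_URL)
print("raw:", raw_allowed, "article:", article_allowed)
```

Any URL served through /w/index.php, including action=raw, falls under the same prefix, which is why a compliant crawler never fetches it.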

We should allow the Internet Archiver to index these pages so that the raw wikitext will be available for future generations, even if the page goes away. See [[mw:Manual:Robots.txt#Allow_indexing_of_raw_pages_by_the_Internet_Archiver]].
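
As a rough sketch of what such an exception could look like, the snippet below models a crawler-specific rule group with a wildcard Allow pattern for action=raw. The pattern "/*&action=raw", the longest-match precedence, and the helper names are assumptions for illustration, not a quote of the manual or of the deployed robots.txt. It uses its own small matcher because urllib.robotparser does plain prefix matching and would not interpret the "*".

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best = None
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            key = (len(pattern), directive == "Allow")
            if best is None or key > best[0]:
                best = (key, directive)
    return best is None or best[1] == "Allow"

# Hypothetical rule group for the archive crawler only.
IA_RULES = [
    ("Disallow", "/w/"),
    ("Allow", "/*&action=raw"),
]

for path in (
    "/w/index.php?title=Main_Page&action=raw",
    "/w/index.php?title=Main_Page&action=edit",
    "/wiki/Main_Page",
):
    print(path, is_allowed(IA_RULES, path))
```

Under these assumptions only the raw view escapes the /w/ block; other index.php actions stay disallowed.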


Version: wmf-deployment
Severity: enhancement

Details

Reference
bz62494

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 2:57 AM
bzimport set Reference to bz62494.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to Nathan Larson from comment #0)

> We should allow the Internet Archiver to index these pages so that the raw
> wikitext will be available for future generations, even if the page goes
> away.

We already regularly dump the DBs and push those dumps to Internet Archive.

What else does action=raw get us?

(In reply to jeremyb from comment #1)

> We already regularly dump the DBs and push those dumps to Internet Archive.
>
> What else does action=raw get us?

I guess it depends; is there a way to get wikitext of individual pages without downloading the whole dump, assuming the page is no longer on-wiki?

(In reply to jeremyb from comment #1)

> We already regularly dump the DBs and push those dumps to Internet Archive.
>
> What else does action=raw get us?

Integration in the Wayback Machine. Not sure it's worth it though.

(In reply to Nathan Larson from comment #2)

> I guess it depends; is there a way to get wikitext of individual pages
> without downloading the whole dump, assuming the page is no longer on-wiki?

Server-side bzgrep? :P Probably no.

I'd recommend we decline this request.

Aside from action=raw indexing being questionable, it wouldn't actually provide any sort of realistic coverage because we don't link to it anywhere. The index would be limited to anomalies where a user linked to a raw URL directly.

As for indexing "even if the page goes away", isn't that exactly what the Internet Archive's Wayback Machine is for?

Isn't action=raw legacy and deprecated? T39745 makes it seem so.

Mdann52 claimed this task.
Mdann52 subscribed.

Per comments above.