The robots.txt rules are unnecessarily restrictive. Since Bugzilla is being deprecated and only a portion of its content is being migrated to Phabricator, it's essential that we let third parties such as archivers preserve the rest. All crawlers, or at least ia_archiver (the Wayback Machine), should be allowed to crawl:
- (1) any content which
- (2) doesn't specifically cause load issues and
- (3) is not being semantically migrated to Phabricator.
Ideally we'd drop requirement (3), but let's start somewhere.
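For the minimal variant, a dedicated stanza for ia_archiver would be enough; crawlers use the most specific User-agent group that matches them, so the Wayback Machine would ignore the catch-all rules. A sketch (the current catch-all rules are not reproduced here):

    # Give the Internet Archive's crawler blanket access;
    # an empty Disallow value matches nothing, i.e. every path may be fetched.
    User-agent: ia_archiver
    Disallow:

    User-agent: *
    # (keep whatever Disallow lines are currently here)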
Example URLs which shouldn't be disallowed:
- /page.cgi?id=voting/bug.html*
- /duplicates.cgi*
- /report.cgi* (unless it causes load issues)
- /weekly-bug-summary.cgi*
- /describecomponents.cgi*
In fact, is there any reason not to allow everything except the following?
- /show_bug.cgi
- /showdependencytree.cgi
- /query.cgi
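Under that proposal, the whole file could shrink to something like this sketch. Since robots.txt rules are prefix matches, each Disallow line also covers query strings such as /show_bug.cgi?id=...:

    User-agent: *
    # The only pages to keep crawlers away from, per the list above:
    Disallow: /show_bug.cgi
    Disallow: /showdependencytree.cgi
    Disallow: /query.cgi

Everything not listed, including all of the example URLs above, would then be crawlable by default.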
Version: wmf-deployment
Severity: enhancement
URL: http://web.archive.org/save/https://bugzilla.wikimedia.org/duplicates.cgi