
robots.txt should let search engines index tools.wmflabs.org
Closed, Declined (Public)

Description

http://tools.wmflabs.org/robots.txt:

User-agent: *
Disallow: /

Tools are obscure and hard enough to find without forbidding search engines from doing their job as they do on Toolserver...

(In reply to bug 59118 comment 5)

Set up robots.txt as a temporary measure to:

User-agent: *
Disallow: /

Version: unspecified
Severity: normal
URL: http://tools.wmflabs.org/robots.txt
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=61133

Details

Reference
bz61132

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:55 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz61132.

Do you mean the tools themselves (e. g. https://tools.wmflabs.org/wikilint/) or the index (just https://tools.wmflabs.org/)?

The first is a WONTFIX; for the second I haven't found a solution yet. Do you have an idea?

Why would the first be a WONTFIX?
For the second see the docs,

Allow: /$

is supposed to work (at least with Google).
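A minimal sketch of such a robots.txt, assuming Google's "$" end-of-URL anchor and longest-match precedence (other crawlers may ignore the anchor or read the Allow line differently):

User-agent: *
Allow: /$
Disallow: /

With Google's rules, only the bare front page matches the Allow line, and everything else falls under the Disallow.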

(In reply to comment #2)

Why would the first be a WONTFIX?

Because there are tools that are linked from every wiki page, and any spider accessing them brings the house down. As tools are created and updated without any review by admins, and wiki edits are not monitored either, blacklisting them after the meltdown doesn't work.

So unlimited spider access is not possible.

For the second see the docs,

Unfortunately, there is no specification for robots.txt; that's the core of the problem.

Allow: /$

is supposed to work (at least with Google).

According to [[de:Robots Exclusion Standard]], it works with Googlebot, Yahoo! Slurp and msnbot. And the other spiders? Will they read it in the same way or as "/"? How do we whitelist "/?Rules"?

(In reply to comment #3)

(In reply to comment #2)

Why would the first be a WONTFIX?

Because there are tools that are linked from every wiki page

Blacklist them, then?

https://toolserver.org/robots.txt has:

User-agent: *
Disallow: /~magnus/geo/geohack.php
Disallow: /~daniel/WikiSense
Disallow: /~geohack/
Disallow: /~enwp10/
Disallow: /~cbm/cgi-bin/

and any spider accessing them brings the house down. As tools are created and updated without any review by admins, and wiki edits are not monitored either, blacklisting them after the meltdown doesn't work.

So unlimited spider access is not possible.

Nobody said unlimited. This works on Toolserver, it's not inherently impossible. It's unfortunate that migration implies such usability regressions, because then tool developers will try to postpone migration as long as possible and we'll have little time.

For the second see the docs,

Unfortunately, there is no specification for robots.txt; that's the core of the problem.

Not really, there is a specification but everyone has extensions. I meant Google's, as I said.

msnbot. And the other spiders? Will they read it in the same way or as "/"?

You'll find out with experience.

How do we whitelist "/?Rules"?

Mentioning it specifically, no?
However, while I can understand blocking everything except the root page, whitelisting individual pages is rather crazy, and I don't see how /?Rules would be more interesting than most other pages. It would be a horrible waste of time to go hunt them down; you could just as well snail-mail a printout of web pages on demand.
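For what it's worth, a sketch of what "mentioning it specifically" could look like, again assuming Google-style longest-match handling; how other spiders interpret the literal "?" and the "$" anchor is exactly the open question:

User-agent: *
Allow: /$
Allow: /?Rules
Disallow: /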

(In reply to comment #4)

[...]

and any spider accessing them brings the house down. As tools are created and updated without any review by admins, and wiki edits are not monitored either, blacklisting them after the meltdown doesn't work.

So unlimited spider access is not possible.

Nobody said unlimited. This works on Toolserver, it's not inherently impossible. It's unfortunate that migration implies such usability regressions, because then tool developers will try to postpone migration as long as possible and we'll have little time.

I haven't met a tool developer who postpones migration because of robots.txt (or cares about that at all, because their tools are linked from Wikipedia). No one even asked to change robots.txt. Who are they?

If tool developers guarantee that a specific tool is resistant to spiders, we can whitelist that (even automated à la ~/.description).

[...]

msnbot. And the other spiders? Will they read it in the same way or as "/"?

You'll find out with experience.

[...]

Why would we take that risk with only marginal benefit gained? "Experience" means a lot of people yelling.

(In reply to comment #5)

I haven't met a tool developer who postpones migration because of robots.txt

Why would you meet them? People unaware of this obscure dark corner of the internet called tool labs, hidden from the rest of the WWW, will never find their way to us.

(In reply to comment #6)

I haven't met a tool developer who postpones migration because of robots.txt

Why would you meet them? People unaware of this obscure dark corner of the internet called tool labs, hidden from the rest of the WWW, will never find their way to us.

That's why I asked you: Who postpones migration to Labs because of robots.txt?

(In reply to comment #7)

That's why I asked you: Who postpones migration to Labs because of robots.txt?

Sorry, it's not my job to go ask dozens or hundreds of tool owners why they've not yet migrated their tools.

Missed this:

(In reply to comment #5)

Why would we take that risk with only marginal benefit gained? [...]

Ah, right, marginal benefit. I had forgotten that Tool Labs was only built as a monument to computer science; having people finding and using tools and pages useful for them is just an accessory, a marginal benefit.

(In reply to comment #9)

(In reply to comment #7)

That's why I asked you: Who postpones migration to Labs because of robots.txt?

Sorry, it's not my job to go ask dozens or hundreds of tool owners why they've not yet migrated their tools.

Then why do you claim that it is related to robots.txt?

Missed this:

(In reply to comment #5)

Why would we take that risk with only marginal benefit gained? [...]

Ah, right, marginal benefit. I had forgotten that Tool Labs was only built as a monument to computer science; having people finding and using tools and pages useful for them is just an accessory, a marginal benefit.

This bug isn't about "people finding and using tools and pages useful for them", but robots.txt. If you want to increase the visibility of the available tools at Tools, you can set up a mirror at a more prominent wiki very easily. The code for https://tools.wmflabs.org/ is at http://git.wikimedia.org/blob/labs%2Ftoollabs.git/master/www%2Fcontent%2Flist.php.

I need robots.txt-esque access for my tool, http://tools.wmflabs.org/wmukevents, which is a calendar feed. For users to be able to add it to their Google calendars, the Google Calendar bot needs to be able to access it. Unfortunately, the Google Calendar bot uses the same user agent as the regular Google spider.

That said, I mentioned this to Coren a while back, he twiddled some levers (can't recall precisely what) and now it WORKSFORME, so perhaps I've misremembered the problem on some level.
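For the record, a hedged guess at what a whitelist entry for the feed could look like; this is only a sketch of the general robots.txt approach, not necessarily the lever that was actually pulled:

User-agent: *
Allow: /wmukevents
Disallow: /

Since the Google Calendar fetcher uses the same user agent as the regular Google spider, a Googlebot-specific group would behave the same way for this case.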

Ah, right, marginal benefit. I had forgotten that Tool Labs was only built as a monument to computer science; having people finding and using tools and pages useful for them is just an accessory, a marginal benefit.

Google is smart enough to do its job even without robots.txt:

https://encrypted.google.com/search?q=gerrit%20patch%20uploader

Sorry, that should have read 'Google is smart enough to do its job even when blocked by robots.txt'

Closing as WONTFIX for the general case. Individual tool owners are welcome to request a whitelisting of their tool so long as they have properly validated that a bot spidering them cannot cause issues.

In particular, tools that return pages with dynamic content that is (or may be) expensive to generate on the database, and that contain further internal links, generally send spiders into a loop and consume a great deal of resources, impacting all other tools.

Meh. Ok, will host my stuff elsewhere. I'd like it to be found and used. :)