
option to protect pages from being indexed by search engines
Closed, ResolvedPublic

Description

Author: inbox

Description:
Add an option to the protect tab which allows administrators to "protect" a page
from being indexed by search engines (by adding a <meta name="robots"
content="noindex, nofollow"> tag). This would be useful for pages which contain
sensitive information but are not in a namespace that is excluded from indexing by default.
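
For reference, the mechanics are simple: when a page carries the proposed flag,
MediaWiki only needs to emit the robots meta tag into the page's <head>. A minimal
sketch of what an implementation could look like; isNoindexProtected() is a
hypothetical helper standing in for whatever per-page flag the option would set,
while the BeforePageDisplay hook and OutputPage::addMeta() are existing MediaWiki APIs:

  # LocalSettings.php -- illustration only, not the actual patch
  $wgHooks['BeforePageDisplay'][] = function ( $out, $skin ) {
      // isNoindexProtected() is a hypothetical helper for the
      // per-page "protect from indexing" flag proposed above.
      if ( isNoindexProtected( $out->getTitle() ) ) {
          // Emits <meta name="robots" content="noindex,nofollow">
          $out->addMeta( 'robots', 'noindex,nofollow' );
      }
      return true;
  };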


Version: unspecified
Severity: enhancement
Keywords: patch, need-review

Details

Reference
bz9415

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:40 PM
bzimport set Reference to bz9415.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

Too prone to abuse.

inbox wrote:

Do you mean too prone to abuse as in administrators who will abuse this feature
(how?) or search engines etc. not respecting the <meta name="robots"
content="noindex, nofollow"> tag and thereby creating a false sense of security?

Admins who don't "like" revisions can easily just remove them from major search
engine results.

Why would we need this? Either you delete the page or you leave it. As for
discussion pages/AfD, if you don't want outsiders, perhaps an extension to
remove whole namespaces from indexes might be an idea.
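
Namespace-wide exclusion of that sort is expressible in MediaWiki core via the
$wgNamespaceRobotPolicies setting; a minimal LocalSettings.php sketch, with the
choice of namespaces purely illustrative:

  # LocalSettings.php
  # Keep crawlers out of all talk and project pages site-wide.
  $wgNamespaceRobotPolicies = array(
      NS_TALK         => 'noindex,nofollow',
      NS_PROJECT      => 'noindex,nofollow',
      NS_PROJECT_TALK => 'noindex,nofollow',
  );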

inbox wrote:

I cannot readily imagine a way to abuse this feature (removing an article from
search engine results is not as effective as, say, completely deleting the
article, and there is usually enough peer review among administrators to
successfully counter any hypothetical abuse), but given the terseness of your
answer I suspect this has been discussed before. Could you perhaps give me a
pointer to a relevant feature request/mailing list discussion/...?

Removal from search engines is a good way to suppress pages you don't like, POV
or whatever, from much of the public. I suppose if it were logged, there could
be some oversight. But why? What is the use of making a page harder to get to
but still accessible? That's a broken, inconsistent CMS.

If you want to stop outsiders from flooding a project discussion, why do it
selectively? Why not take the whole project/whatever namespace out of the index?

I just can't see a use for this.

inbox wrote:

The concrete example leading to this feature request can be found at
http://lists.wikimedia.org/pipermail/wikien-l/2007-March/066466.html. Assuming
the action is properly logged etc., I still do not see how this is more prone to
abuse than giving administrators the ability to protect or delete pages.

robchur wrote:

Individual communities abuse features all the time. Decisions leading to hiding
content from the general public need to be made with Board or equivalent level
approval, not a five second straw poll on the English Wikipedia.

Such a feature would not have an immediate effect owing to the nature of search
engine spidering schedules, and the fact that not all spiders will bother to
honour the tag.

cannon.danielc wrote:

Enable editing of robot metadata per page, via Special:Protect

Unfortunately, I did not notice this bug until after I finished writing this, when Simetrical pointed it out. I talked to Tim Starling earlier today, who just finished adding in $wgArticleRobotPolicies, and he stated that adding a user interface component to allow setting the robot policies per article would be a fine idea, so long as the implementation was "relatively elegant". This is about as elegant as I could get it, though it may need some cleaning up around the edges.

The primary incentive was that requests to modify robots.txt (or now, thanks to Tim, $wgArticleRobotPolicies) have been steadily on the rise, and there are very, very few people on Wikimedia capable of fulfilling these requests (who certainly have better things to do with their time). Unfortunately, Google's cache has recently become of increasing use (or misuse) to individuals attempting to dig out private information that momentarily appears on pages before it is oversighted or deleted. As such, it has become necessary to fulfill requests to hide pages that are prone to being oversighted, or that for other reasons should not be entirely public and cached.
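
For reference, the per-page setting Tim added maps full page titles to robot
policies in LocalSettings.php; a minimal sketch, with the page titles as
placeholders:

  # LocalSettings.php
  # Per-page robot policies, keyed by the page's full title.
  $wgArticleRobotPolicies = array(
      'Main Page'                => 'noindex,nofollow',
      'Talk:Some sensitive page' => 'noindex',
  );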

The concerns about abuse I view as valid, but they are of far less concern than the potential for abuse offered by allowing sysops to delete pages, let alone allowing oversights to remove revisions with no public record whatsoever of the removal. In an attempt to curb this abuse, my implementation allows modifying the robot policies of pages only to users with the "editrobots" permission, allotted by default to sysops but reassignable to bureaucrats or, if need be, to oversights. As this is an operation that will need to be performed very rarely, there should be no problem restricting it to a much smaller user group, such as oversights; it is, however, important that it be allowed to a larger group than it currently is.
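
Reassigning that right is a one-line change per group in LocalSettings.php; a
minimal sketch using the patch's "editrobots" permission, with the choice of
groups purely illustrative:

  # LocalSettings.php
  # Restrict 'editrobots' to the oversight group instead of sysops.
  $wgGroupPermissions['sysop']['editrobots']     = false;
  $wgGroupPermissions['oversight']['editrobots'] = true;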

Anyway, I hope that you will reconsider or at least review the patch; if it's not accepted, I won't be heartbroken, but I do think it has the potential to be quite useful. I also have it running on a live install at https://amidaniel.com/testwiki if you want to try it out.


robchur wrote:

For the page to disappear from a search engine, the engine's crawler will have to revisit the page, which might not happen for some time. This reduces the utility of such an option, because a short-term removal might not have an effect at all.

cannon.danielc wrote:

Committed by Raymond as r23166.

(In reply to comment #11)

Committed by Raymond as r23166.

Reverted by Brion with r23226:

There are issues with putting robots stuff into the current protection system, so we're backing this out to prevent another backwards-compatibility disaster when it's done in a more reliable way. :)

See also Bug 8473 about $wgArticleRobotPolicies being too weak to act on Special:*.

ayg wrote:

If the patch for bug 8068 remains checked in, this is probably no longer necessary.

(In reply to comment #14)

If the patch for bug 8068 remains checked in, this is probably no longer necessary.

The patch for the above bug has been live for four weeks --> closing this bug as FIXED