
Change default robot indexing behaviour for several namespaces
Closed, Invalid (Public)

Description

Author: lar

Description:
Change the default for the following namespaces:

User: User_talk: Wikipedia: WP: (1) Wikipedia_talk: WT: (2)

1 - or more generally, Project: and if there is a shortcut namespace, the default for that too
2 - or more generally, Project_talk: and if there is a shortcut namespace, the default for that too

to not be indexed by bots (effectively, I believe, by including <meta name="robots" content="noindex, nofollow"> in the rendered page), across all WMF projects.

The rationale is to reduce the amount of project-specific material (dirty laundry, if you like), unlikely to be of general interest to readers, that is exposed publicly. As our own internal search engines improve, the need for external search is lessened.

I believe that the following bugs may be related, but I was unable to find this specific one. If it's a dup, my apologies.

8068 Magic word to add noindex to a page's header
9415 option to protect pages from being indexed by search engines
10052 Add class="robots-nocontent" in footer to avoid search engine to index it
11720 Google (and others) is indexing data dumps

(8068 and 9415 allow variability by page. This bug asks to change the default for certain namespaces, but if it were implemented along with 8068 or 9415, one could override the default on a page-by-page basis.)


Version: unspecified
Severity: enhancement

Details

Reference
bz13864

Event Timeline

bzimport raised the priority of this task from to Medium. (Nov 21 2014, 10:13 PM)
bzimport set Reference to bz13864.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

I guess this is for the English Wikipedia? In that case a change to robots.txt would be simplest. If by "default" you mean default for all wikis unless opting out, that would require a different approach, probably, unless we want to include every single localization of all of the above (1000 lines? more? kept up-to-date how?) in robots.txt.

The major objection to this is that Wikipedians use Google to find project-related pages on Wikipedia. That they also show up in searches by the general public is arguably not great, but in practice it seems like the former usage outweighs the importance of the latter. Although as you say, Wikipedia's built-in search is steadily improving.

lar wrote:

If you want to discuss this further somewhere else, please suggest somewhere and we can take it there... but I'm actually thinking of asking for this to apply across all WMF wikis, not just en:wp. I think en:wp is where the problem (of dirty laundry visibility) is manifesting itself first, but in the long run, yes, all wikis unless opted out. I don't know enough about implementation details to suggest how to implement it cleanly. Perhaps something that generates the appropriate file by examining project configurations? (Presumably the Project: -> Wikipedia: and User: -> Usuario: etc. mappings that result from namespace localisations are kept in a mapping table somewhere in the wiki? Otherwise, how does the SOFTWARE know that Usuario (or Benutzer or whatever) is User:?)

I agree that Wikipedians do use Google to find project-related pages on Wikipedia. And maybe a project debate is required to secure community approval for the change, but this bug addresses the mechanics/underpinnings... I'll save making the BLP and "do no harm" arguments for that discussion, I guess. I'm just hopeful that as the internal search continues to improve, that becomes less of a factor. And it does seem to be getting better by leaps and bounds of late... I actually started using it again, it's that good.
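The idea raised above, generating the robots.txt entries from each project's namespace configuration instead of maintaining every localisation by hand, could be sketched roughly as follows. This is only an illustration under stated assumptions: the LOCALISED_NAMESPACES table, the robots_txt_rules function, and the /wiki/ article path are hypothetical names chosen for the example, not anything MediaWiki or the WMF configuration actually exposes, and the localised names in the table are illustrative.

```python
from urllib.parse import quote

# Hypothetical table of localised namespace names per wiki. A real generator
# would read these from each wiki's own configuration, not a hard-coded dict.
LOCALISED_NAMESPACES = {
    "en": ["User", "User_talk", "Wikipedia", "Wikipedia_talk"],
    "es": ["Usuario"],   # illustrative
    "de": ["Benutzer"],  # illustrative
}


def robots_txt_rules(namespaces, article_path="/wiki/"):
    """Return robots.txt lines that disallow crawling of the given namespaces.

    article_path is an assumption about the wiki's URL layout.
    """
    lines = ["User-agent: *"]
    for name in sorted(set(namespaces)):
        # Percent-encode non-ASCII characters; the trailing ':' keeps the
        # prefix from also matching, for example, "User_talk:" under "User:".
        lines.append(f"Disallow: {article_path}{quote(name, safe='')}:")
    return lines


if __name__ == "__main__":
    for wiki, names in LOCALISED_NAMESPACES.items():
        print(f"# {wiki}")
        print("\n".join(robots_txt_rules(names)))
```

Note that a robots.txt Disallow rule stops compliant crawlers from fetching the pages, which is not quite the same mechanism as emitting a noindex meta tag in the rendered page.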

jeluf wrote:

Discussion about changes concerning all projects should take place on meta. Please come back when there's consensus for this decision in the community at large.

lar wrote:

Please separate evaluation of the technical aspects of doing this, which merit discussion (see Simetrical's comments; there are apparently several different possible approaches, as well as interrelationships with the other bugs that I mentioned in the opening description), from evaluation of community consensus. If the facility to do this is difficult, it doesn't matter what consensus is or isn't. If the facility to do this is easy, it could well be enabled, regardless of which wikis do or don't decide to use it.

jeluf wrote:

Removed the "shell" keyword and the "site request" component since this is not a request for a specific change but a general discussion topic.

WilyDoppelganger wrote:

Can this be implemented so that each wiki can choose a subset of namespaces they don't want indexed? Maybe Special:Noindex could host a list like:
Talk
User talk
Wikipedia talk
and so forth, naming the various namespaces where pages are noindex'd. Then projects can decide for themselves which pages to noindex or not, and if meta later comes to some overall policy, it could be discussed then. Localising the discussion makes discussion possible.

Yes, namespaces can indeed be excluded from indexing per-wiki by customizing $wgNamespaceRobotPolicies. This has existed for some time, and no technical changes are required to implement a community decision.

I'm INVALIDing this pending general community decision.
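To make the behaviour discussed in this thread concrete, here is a small model of a per-namespace robot policy with an optional per-page override (the combination suggested in the description alongside bugs 8068 and 9415). It is a sketch only, not MediaWiki's implementation: $wgNamespaceRobotPolicies is the real setting named above, but NAMESPACE_ROBOT_POLICIES, DEFAULT_POLICY, and robots_meta_tag are hypothetical names invented for this example.

```python
from html import escape

# Hypothetical per-wiki policy table: namespace name -> robots directive.
# It only models the idea of a per-namespace default.
NAMESPACE_ROBOT_POLICIES = {
    "User": "noindex,nofollow",
    "User_talk": "noindex,nofollow",
    "Project": "noindex,nofollow",
    "Project_talk": "noindex,nofollow",
}

# Assumed fallback for namespaces with no explicit policy.
DEFAULT_POLICY = "index,follow"


def robots_meta_tag(namespace, page_override=None):
    """Return the robots meta tag for a page in the given namespace.

    A page-level override, if given, wins over the namespace default.
    """
    policy = page_override or NAMESPACE_ROBOT_POLICIES.get(namespace, DEFAULT_POLICY)
    return f'<meta name="robots" content="{escape(policy, quote=True)}">'


if __name__ == "__main__":
    print(robots_meta_tag("User"))
    print(robots_meta_tag("Main"))
    print(robots_meta_tag("User", page_override="index,follow"))
```

Keeping the table per wiki matches the point made in the thread: each project can decide which namespaces to exclude without any change to the software.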