Prevent search engines from indexing the user namespace in German Wikipedia
Closed, Resolved (Public)

Description

Author: jonathan.haas

Description:
Following community discussion (see http://de.wikipedia.org/wiki/Wikipedia:Meinungsbilder/Indizierung_von_Benutzerseiten ) please disallow search engines from indexing the user namespace for the German Wikipedia by adding

NS_USER => 'noindex,follow'

to $wgNamespaceRobotPolicies accordingly.


Version: unspecified
Severity: enhancement

Details

Reference
bz36181

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:20 AM
bzimport set Reference to bz36181.
bzimport added a subscriber: Unknown Object (MLST).

beau wrote:

You can add entries to the page MediaWiki:Robots.txt.
There are already similar lines:

# Benutzerdiskussionsseiten (user talk pages)
Disallow: /wiki/Benutzer_Diskussion:
Disallow: /wiki/Benutzer_Diskussion%3A
Disallow: /wiki/User_talk:
Disallow: /wiki/User_talk%3A

Please reopen if you need help with this.

jonathan.haas wrote:

This will prevent them from being indexed by adding the magic word __INDEX__ to the page source, right? Changing $wgNamespaceRobotPolicies seems to be generally favored by the community and was also done in similar cases (for example, see bug 16247).

(Can't reopen for some reason)

No, this will alter robots.txt for dewiki.

jonathan.haas wrote:

So altering robots.txt (or MediaWiki:Robots.txt, which is the same thing as far as I know) will still allow individual pages to be indexed by adding __INDEX__?

Right, if you want to whitelist separate pages it won't work. Reopened.

(In reply to comment #6)
> Right, if you want to whitelist separate pages it won't work. Reopened.

Couldn't you add whitelisted pages to MediaWiki:Robots.txt?

http://en.wikipedia.org/wiki/Robots_exclusion_standard#Allow_directive

I realize this is not as scalable as __INDEX__, but maybe this feature could be added?

(In reply to comment #7)
> I realize this is not as scalable as __INDEX__, but maybe this feature could be
> added?

Actually, that is probably something a bot could do, right? Watch for new __INDEX__ uses and add them to MW:Robots.txt.

jonathan.haas wrote:

Sure, we could (although we would probably need to give a bot admin rights and I'm not sure we want that). But why not use $wgNamespaceRobotPolicies directly? Is there some technical problem I should know of? According to http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php (not sure if I'm looking at the right file) there are already a lot of namespaces there in addition to the robots.txt one.

Please add NS_USER => 'noindex,follow' to the dewiki part of wgNamespaceRobotPolicies in InitialiseSettings.php.
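For readers unfamiliar with the file, the requested change might look roughly like the fragment below. This is a sketch, not the merged patch: the `$wgConf->settings` wrapper and the per-wiki `'dewiki'` key layout are assumptions about how InitialiseSettings.php groups its values, and any existing dewiki entries are omitted.

```php
<?php
// Fragment only: the real settings array contains many other keys.
// Assumed layout: per-wiki overrides keyed by database name.
// NS_USER is a namespace constant defined by MediaWiki core.
$wgConf->settings = [
    'wgNamespaceRobotPolicies' => [
        'dewiki' => [
            // Existing dewiki entries (not shown in this thread) would stay here.
            // Do not index user pages, but still follow the links on them.
            NS_USER => 'noindex,follow',
        ],
    ],
];
```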

That is the easiest way, and it is already used for other namespaces on dewiki and on other wikis. There is no reason to use a harder technical approach when this easy one exists.

Thanks.

alexander_berlin wrote:

Just to make this clear: the community decision was made on the premise that it is possible to opt in to indexing (e.g. via __INDEX__).

It is not acceptable to implement any solution that does not meet this requirement!

Another week has gone by; please comment on the status.
Thanks.

Another week has passed. Could a shell user or operator please add a comment, or change the status of this bug if nobody is available to fix it or if you think it is already fixed? Thanks for a response.

Line added to InitialiseSettings.php with https://gerrit.wikimedia.org/r/#/c/9469/

Now it needs someone to merge and deploy.

(In reply to comment #11)
> Just to make this clear: the community decision has been made under the premise
> that it is possible to opt in indexing (e.g. by __INDEX__)
>
> It is not acceptable to implement any solution that does not provide this
> requirement!

Overriding with __INDEX__ is still possible. Tested on my local wiki.
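A local check along these lines could be set up as follows. This is a sketch under assumptions: a stock single-wiki LocalSettings.php, with `$wgExemptFromUserRobotsControl` left at its default so that the `__INDEX__` magic word stays effective outside content namespaces. The thread does not say how the test wiki was actually configured.

```php
<?php
// LocalSettings.php on a test wiki (assumed setup, not taken from the thread).

// Default robot policy for the user namespace: keep pages out of search
// indexes, but let crawlers follow the links on them.
$wgNamespaceRobotPolicies[NS_USER] = 'noindex,follow';

// Left at its default (null), this exempts only content namespaces from the
// __INDEX__ / __NOINDEX__ magic words, so user pages can still opt in.
$wgExemptFromUserRobotsControl = null;
```

With this in place, a user page whose wikitext contains `__INDEX__` should be served with an indexable policy, while other user pages should carry a `noindex,follow` robots meta tag.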

Deployed by Reedy today. Thanks :)