
Magic word to add noindex to a page's header
Closed, Resolved · Public

Description

Author: marten_berglund

Description:
On user pages (and maybe some other namespaces as well) it should be possible to
use a magic word, something like NOGOOGLE, in order to make the Google robot
not index that page. For instance, on my user page I have a set of subpages,
sandboxes where I experiment, test, and draft what could later become real
Wikipedia articles. I don't want Google to index these pages, yet they currently
appear early in Google's search results.

On an HTML page, the solution to this is to add the line
<pre>

<meta name="robots" content="noindex,nofollow">

</pre>

Could someone implement something like NOGOOGLE for users who
don't want their user pages indexed?


Version: unspecified
Severity: enhancement

Details

Reference
bz8068

Related Objects

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 9:31 PM
bzimport set Reference to bz8068.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

No. Namespaces which robots are asked not to index can be configured; however, in
this case, if it's public, then it's indexable. A NOINDEX-type magic word
has been discussed before and rejected, simply because it's subject to abuse and
misunderstanding.
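
For illustration only, the per-namespace configuration mentioned above can be expressed in LocalSettings.php; this sketch assumes the $wgNamespaceRobotPolicies setting (which may not have existed in the MediaWiki version this comment refers to):

<pre>

# Hedged sketch, not site policy: ask robots to skip the User namespaces.
# Assumes the $wgNamespaceRobotPolicies setting is available; the policy
# strings match the values used in the meta tag above.
$wgNamespaceRobotPolicies = array(
    NS_USER      => 'noindex,nofollow',
    NS_USER_TALK => 'noindex,nofollow',
);

</pre>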

Google are quite quick at re-crawling bits of Wikipedia content, so if a draft
page has moved to the article space, they'll reflect it within a few days, usually.

marten_berglund wrote:

But let's say that the magic word NOINDEX has an effect only on subpages
belonging to the User namespace, and nowhere else. For instance, only on pages
like: http://xx.wikipedia.org/wiki/User:N_N/a_subpage.

Is that a possible compromise?

robchur wrote:

No, it's up to the people who manage the web site to determine what is and is
not indexed by search engines, and Wikimedia wikis generally have everything
indexed bar pages such as VfD/AfD/whatever the trendy TLA for deletion debates
is, which external viewers don't typically understand.

There is _no reason_ to disable indexing of your user page or any other page in
that namespace. What you are posting to a public web site is public. If you
don't want anyone else to be able to read it or edit it or whatever, _don't post
it_.

Reopening this, as we're considering this or something similar as an improvement over lots of manual editing of the global robots.txt.

cohesion wrote:

We frequently get complaints via OTRS from people who want various logs removed because they malign their companies, etc. Those logs usually serve a purpose, but they aren't really content. Just one example: http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spam/LinkReports

Having a NOINDEX magic word is probably the best strategy if we want to differentiate what content ought to appear in search engines in more than a very crude way. Routinely editing robots.txt is no solution, and I consider it undesirable to simply block out very broad categories of material (such as everything that is not an article).

Bryan.TongMinh wrote:

I looked into the code, but it appears that $wgOut->setRobotPolicy is called at the very beginning of Article::view. That is a lot of lines before the page content is parsed and magic words are evaluated. Does anybody have an idea how to do this?

It should be possible to call it again to override it with specific data. You'd have to do this when pulling wiki output out of the ParserOutput object (otherwise the parser cache will always eat everything).
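
As a rough sketch of that approach (not a tested patch; hasNoIndexFlag() below is a hypothetical accessor used only for illustration):

<pre>

# Rough sketch: re-apply the robot policy once the ParserOutput is available
# (whether freshly parsed or fetched from the parser cache), overriding the
# earlier setRobotPolicy call at the top of Article::view.
# hasNoIndexFlag() is hypothetical; the real flag/accessor may differ.
$parserOutput = $wgParser->parse( $text, $this->mTitle, $options );
if ( $parserOutput->hasNoIndexFlag() ) {
    $wgOut->setRobotPolicy( 'noindex,nofollow' );
}

</pre>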

  • Bug 14209 has been marked as a duplicate of this bug.

ayg wrote:

Fixed in r37973. I patterned the code after NEWSECTIONLINK, and it seems to work fine.
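
For anyone following the same pattern, the general shape (a paraphrase, not the exact r37973 diff) is roughly:

<pre>

# Outline of the double-underscore magic word pattern (as with NEWSECTIONLINK);
# method names may differ slightly from the committed revision.

# 1. Declare the magic word and its wikitext form in MessagesEn.php, e.g.:
#      'noindex' => array( 1, '__NOINDEX__' ),

# 2. In the parser, when the double underscore is stripped from the page,
#    record the request on the ParserOutput:
if ( isset( $this->mDoubleUnderscores['noindex'] ) ) {
    $this->mOutput->setIndexPolicy( 'noindex' );
}

# 3. When the page is output, the recorded policy is applied, producing
#    <meta name="robots" content="noindex,nofollow"> in the page header.

</pre>

Editors can then place __NOINDEX__ anywhere in the wikitext of a page they don't want indexed.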