
Remove noindex meta tag from HTML of logged in users
Closed, ResolvedPublic

Description

Author: mathias.schindler

Description:
Logged-in users of (German-language) Wikipedia pages are served the following line in the HTML:

<meta name="robots" content="noindex,nofollow" />

This line is meaningless for logged-in users and can be removed without any downside.
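
For illustration only, a minimal sketch of the requested change, assuming the policy is set on a MediaWiki OutputPage object ($out is a stand-in for whatever object the extension actually uses, not its real variable name):

<pre>
// Hypothetical sketch, not the extension's actual code: skip the
// restrictive robot policy for logged-in viewers, since crawlers
// normally browse logged out anyway.
if ( $out->getUser()->isAnon() ) {
        $out->setRobotPolicy( 'noindex,nofollow' );
}
</pre>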


Version: unspecified
Severity: trivial

Details

Reference
bz27173

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:14 PM
bzimport set Reference to bz27173.

I don't understand: what advantage would users see if this were removed?

mathias.schindler wrote:

A 50 byte size reduction per page (before compression).

Those pages are indeed not suitable for robot indexing, and a 50-byte size reduction is not significant.
We could shave off more bytes by removing the script variables or comments, or by applying HTML5 minification techniques, but you would need to explain why those few bytes make a difference.

wikipedia wrote:

I support the issue Mathias has described above and therefore the removal of the meta robots tag for logged-in users. I do not care at all about 50 bytes more or less, and that is not the point; the point is that the meta robots tag brings with it real risks. These risks are visible in the current debate about Google.de not listing German Wikipedia articles under some circumstances.

Basically, there are two possible scenarios, which I describe below. When I say "Googlebot" I mean any search engine crawler; I use the Google crawler only because of its current topicality.

1st: Googlebot crawls pages as an anonymous user (not sending cookie headers). This is the standard scenario we assume right now; we do not know of any search engine bots that crawl while logged in. Therefore, sending robots information to logged-in users is pointless, since they are generally not crawlers and never read it. In this scenario the meta information is simply obsolete.

2nd: Googlebot crawls pages as a logged-in user. In this case the robots information is meaningful for logged-in users as well, and it could then be the reason (or one reason among others) for the Google <-> Wikipedia problem existing right now. If this scenario applies, the robots information should be removed temporarily just to make sure it is NOT responsible for the problems. In other words: it is plausible that the robots information and the Google problem are related, and to fix the problem as quickly as possible the robots information should be disabled, at least for a while.

As you can see, both cases lead me to urge removing the robots information, at least for a couple of weeks. As soon as Google lists all Wikipedia articles again and both the MediaWiki and Google engineers have found the cause of the problem, they can decide whether these meta tags are reasonable and should be added back.

However, according to statements from Wikimedia, neither Googlebot nor any other search engine crawler logs in. If that is true, there is no need for these meta tags: they are NEVER read by crawlers and are therefore nothing more than source code waste.

reachouttothetruth wrote:

The Googlebot indexing problem (bug 27155) is a problem on Google's end. I don't see why anything has to be done here. Removing the robot indexing policy would result in a bunch of useless pages being indexed, but if the search engine isn't going to display them in the search results then all we've done is waste resources.

Search engine indexing bots ''shouldn't'' be indexing while logged in. But if someone does write a search engine bot that logs in for some reason, it should follow the same indexing policies as all other search engine bots. If the robots policy is removed for logged-in users, then such a bot would get different indexing instructions than those that don't log in. Why would we grant an exception to the robot indexing policies simply because the bot logs in?

wikipedia wrote:

(In reply to comment #5)

The Googlebot indexing problem (bug 27155) is a problem on Google's end.

This is not proven yet - not at all.

(In reply to comment #5)

But if
someone does write a search engine bot that logs in for some reason, it should
follow the same indexing policies as all other search engine bots.

Exactly, and the site-wide robots indexing policy *is not* and *should not be* set via META tags. The META tags were introduced as a FlaggedRevs feature to prevent unflagged (!) revisions of pages from being indexed. So, if anything, only unflagged revisions should carry the NOINDEX,NOFOLLOW META tag, but for logged-in users *all* pages (flagged and unflagged) have it. This is a bug. Bugs should be fixed.
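
To make the intended rule concrete, here is a rough sketch (an assumption for illustration, not FlaggedRevs' real code): the policy should depend on whether the displayed revision is flagged, not on whether the viewer is logged in.

<pre>
// Hypothetical sketch of the intended behaviour, not FlaggedRevs code.
// $shownRevisionIsFlagged stands for "the revision being displayed has
// been reviewed"; how that is determined is left out here.
if ( $shownRevisionIsFlagged ) {
        // Reviewed content may be indexed regardless of who views it.
        $out->setRobotPolicy( 'index,follow' );
} else {
        // Only unreviewed (unflagged) revisions should stay out of indexes.
        $out->setRobotPolicy( 'noindex,nofollow' );
}
</pre>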

PDD is right, this is a bug in FlaggedRevs. Changing component.

So, the real story behind this bug is that Google is miscrawling Wikipedia and someone thought this line was at fault. It is not.
If that were the reason, all Wikipedias would be affected, not only dewiki, and no Wikipedia article would be listed at all. And if Googlebot were logged in, cached pages would contain the Google user name at the top.

(In reply to comment #9)

If that were the reason, all Wikipedias would be affected, not only dewiki.

Erm, are you commenting here without actually having looked into the matter? The META tag bug affects dewiki, huwiki and plwiki *only*, so of course it can't have any effect on "all Wikipedias", no matter what that effect might be...

These are two separate bugs: one about the Google issue, and one about the useless "noindex,nofollow" on flagged (and not only unflagged) revisions. The latter is this one, [[bugzilla:27173]]; the former is [[bugzilla:27155]].

FWIW, the indexing problem is an issue on Google's end, not ours (bug 27155 tracks that).

We've actually been serving noindex,nofollow to logged-in users in FlaggedRevs for quite some time now (the code in this regard hasn't changed in about a year). I think Googlebot's problems just raised people's awareness, and this tag made for a plausible initial guess at the cause.

Whether or not we should serve noindex,nofollow to logged in users is debatable, and I guess this bug serves that purpose.

wikipedia wrote:

(In reply to comment #12)

Whether or not we should serve noindex,nofollow to logged in users is
debatable, and I guess this bug serves that purpose.

Chad got the point. Mentioning Google was just an example, nothing else. Regardless of whether crawlers do their job logged in or not, the meta tags are pointless and have to be removed - or can anyone explain to me why a crawler must not index an article version that has been checked ("flagged")?!

This happens because in FlaggedArticleView.php, setRobotPolicy() has the following check:

<pre>
if ( !$this->pageOverride() && $this->article->isStableShownByDefault() ) {
        // set noindex
}
</pre>

In this check, $this->pageOverride() returns false for stable versions for logged-in users, yet true for stable versions for non-logged-in users.

pageOverride() returns false for logged-in users because of the following check:

<pre>

$config = $this->article->getVisibilitySettings();
# Does the stable version override the current one?
if ( $config['override'] ) {
        if ( $this->showDraftByDefault() ) {
                return ( $wgRequest->getIntOrNull( 'stable' ) === 1 );
        }
        # Viewer sees stable by default
        return !( $wgRequest->getIntOrNull( 'stable' ) === 0 );
}

</pre>

Ergo, pageOverride() does not account for usergroup settings when viewing stable pages; it only takes into account user settings, page settings and URL overrides.

(In reply to comment #14)

Ergo, pageOverride() does not account for usergroup settings when viewing stable pages; it only takes into account user settings, page settings and URL overrides.

Yes, it does check that. That's what showDraftByDefault() does.

The real cause is that logged-in users see the current version by default, even if it is synced with the stable version. Try logging in and adding ?stable=1 to the page URL (the noindex goes away). The two versions are almost the same, except that the stable version has filetimestamp=X added to thumbnail links. In rare cases, the current version might also use newer versions of Commons files (a feature of bug 15748).

One way to get these indexed would be to have setRobotPolicy() check for this scenario (viewing the draft when the stable version is synced with it).
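
For illustration, such a sync check could look roughly like this (stableVersionIsSynced() is an assumed helper name; this is a sketch of the idea, not the change that was actually committed):

<pre>
// Hypothetical sketch: keep noindex only when the draft being viewed
// actually differs from the stable version.
if ( !$this->pageOverride()
        && $this->article->isStableShownByDefault()
        && !$this->article->stableVersionIsSynced() // assumed helper
) {
        $out->setRobotPolicy( 'noindex,nofollow' );
}
</pre>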

I was doing some refactoring yesterday to make the code easier to read. I'll deal with this after finishing that.

Sync check (per comment #15) added in r81874.