
Prevent zero.wikipedia.org from being indexed by search engines
Closed, Declined; Public

Description

https://www.google.com/search?q=site:en.zero.wikipedia.org

Searching for certain terms sometimes returns a zero.wikipedia.org URL as the first result. This should be avoided, because users on unsupported carriers (or non-mobile users) have no direct way to jump to the standard site.


Version: unspecified
Severity: major

Details

Reference
bz48856

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:34 AM
bzimport added a project: ZeroPortal.
bzimport set Reference to bz48856.
bzimport added a subscriber: Unknown Object (MLST).

https://bugzilla.wikimedia.org/show_bug.cgi?id=35233

Unfortunately, without a cache flush these pages will stay in search results for some time, but they will disappear eventually.

*** This bug has been marked as a duplicate of bug 35233 ***

I don't understand how Google was able to index these pages. Isn't the page content restricted to specific IP addresses? How was Googlebot able to index the page content?

While it's difficult to test, it appears that pages such as http://en.zero.wikipedia.org/wiki/Britney_Spears do not have a "noindex" directive within them.

I can't speak to the IP address content restriction, although personally I don't understand it and think that, at the very least, there should be a link rather than the current broken experience (what if someone shares a link, for example?). I also believe that if people share a zero link that is the same as a normal Wikipedia page link, it will boost the Wikipedia page's ranking.

Anyway, in terms of indexing, when Google finds a page, the first thing it should do is look for the canonical link tag [1]. If it finds one, then instead of indexing the current page (the zero one in this case) it boosts the non-zero page's ranking.

If you look at the cached versions of these indexed pages, the canonical link tag is not there, so they got indexed (see bug 35233). When the HTML for these pages is rewritten, it will include the canonical URL, and the pages will disappear from search results without any further work.
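For anyone who wants to check a cached or freshly fetched copy of a page for the tag, here is a minimal sketch in Python. It only shows one way to look for a canonical link tag in an HTML string; the names `CanonicalLinkParser` and `find_canonical` are made up for this example and are not part of MobileFrontend or the Zero extension.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<link ... />) are also routed here by HTMLParser.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, so split before checking.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

A page for which `find_canonical` returns `None` is in the state described above: nothing tells the crawler which URL is the preferred one.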

As far as I know we don't set a noindex directive, and I don't believe we should. People share links, and might share a zero link via some other service (which may also be free of data charges), so I think we should improve the experience for users who land on such a page and are not on zero. Instead of wiping the content, we should either redirect automatically or show a different banner linking to the original content.

So with this in mind, MZ, should a new bug be created, or do we still want to brute-force this via a noindex?

[1] http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394

This appears to be related to Gerrit changeset 64629 (I63a4542c9792e4979f2a9668d0a5c858f21f591b).

(In reply to comment #3)

So with this in mind MZ should a new bug be created [...]

I filed bug 48921 to track the issue I think you were describing.

We will be setting up a no-index rule for zero.wikipedia.org requests. The business team confirmed that zero.wikipedia.org pages are not supposed to be in the Google index. I would prefer a re-crawl of the site first, to help get Google's existing canonical links updated. But eventually, the business team wants zero.wikipedia.org out of the search index completely.
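To confirm later that a no-index rule is actually being served, something like the following Python sketch could help. It assumes the rule would show up either as an `X-Robots-Tag` response header or as a `<meta name="robots">` tag; this thread does not say which mechanism the Gerrit change uses, so treat both checks as assumptions. `is_noindex` and `RobotsMetaParser` are hypothetical names, and the header parsing is simplified (it ignores user-agent-specific forms such as `googlebot: noindex`).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )


def is_noindex(headers, html):
    """Return True if the header mapping or the HTML carries a noindex directive."""
    header_tokens = set()
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-robots-tag":
            header_tokens.update(
                token.strip().lower() for token in value.split(",") if token.strip()
            )
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_tokens or "noindex" in parser.directives
```

Here `headers` is a plain dict of response headers and `html` is the response body, both obtained however is convenient (for example from a saved curl run).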

MZMcBride, as to why the pages were indexed, from what I can tell:

  • At some point a code change resulted in article content other than the "Sorry" warning being echoed into the <language>.zero.wikipedia.org pages below the warning (making them, if I understand correctly, on par with <language>.m.wikipedia.org pages sans the warning).
  • With the fulltext content from each <language>.zero.wikipedia.org page, Google's crawlers were able to discover more links.
  • In the absence of a canonical link for each <language>.zero.wikipedia.org page, Google's algorithms wouldn't have had a perfect, non-heuristic means of identifying the pages as being the same. The heuristics seem to have correctly classified a number of pages as dupes, but not all of them, judging by a site:en.zero.wikipedia.org Google search, for example.

My Gerrit change #64113 was introduced to stop content from being echoed below the "Sorry" warning. This in concert with Jon's Gerrit change #61809 will allow the Google index to self-correct, although as you note, my Gerrit change #64629 provides the means to have no indexing whatsoever.

<cross-posted from mailing list>

Update:
We've added an enhancement to Wikipedia Zero so that if a user who isn't on a participating carrier network navigates to a Wikipedia Zero page on <language>.zero.wikipedia.org, such as http://en.zero.wikipedia.org/wiki/Muse_%28band%29 , the user will be presented an option to visit the canonical URL of the article. If clicked, the canonical URL should get the user to the mobile or desktop version of the page, based on device type.

We're hoping that by next week the Google index will be refreshed so as to correctly mark the <language>.zero.wikipedia.org pages as duplicate pages in the omitted section. Upon confirmation of as much, the current plan is to introduce https://gerrit.wikimedia.org/r/#/c/69420/ to prevent indexing of <language>.zero.wikipedia.org altogether.

<cross-posted from mailing list>

Okay, looks like the index of zero.wikipedia.org pages in Google has shrunk by some 20 million entries. Nonetheless, a number of really old pages (e.g., going back to 6-May-2013) are still in the Google index with article text. I'll set a reminder to check on the Google index again in 30 days, and hopefully we can finally put the no-index rules in place at that time.

The good news is that many of the pages are now correctly suppressed in natural search as non-canonical pages. In other words, a user would need to go through omitted results or do a site:<domain> search to see them.

Ineligible zerodot pageview attempts are now redirected to Special:ZeroRatedMobileAccess. So even the robots.txt-defined ineligible pages on zerodot are bound to fall out of specific search engine query results (e.g., site:zero.wikipedia.org is currently a way to google for pages that have been previously indexed with content "below the [warning] fold").

DFoy set Security to None.

Does Zero set a canonical URL?

Could someone please answer this question? ^

Yes, it does.

$ curl 'https://en.zero.wikipedia.org/wiki/660s' -H'x-cs: ON' -s | grep canonical
<link rel="canonical" href="https://en.wikipedia.org/wiki/660s" />
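Generalizing from that single curl result, the canonical URL looks like the same URL with the `zero` host label removed. The Python sketch below encodes that guess so it can be compared against what a page actually declares; the rule is an assumption drawn from this one example, not a documented mapping, and `strip_zero_label` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_zero_label(url):
    """Drop the 'zero' label from <language>.zero.wikipedia.org URLs.

    Assumption: the canonical host is the same host without 'zero',
    as in the en.zero.wikipedia.org/wiki/660s example. Other URLs are
    returned unchanged.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    if len(labels) == 4 and labels[1:] == ["zero", "wikipedia", "org"]:
        new_host = ".".join([labels[0]] + labels[2:])
        return urlunsplit(parts._replace(netloc=new_host))
    return url
```

Comparing `strip_zero_label(url)` with the href printed by the curl command gives a quick sanity check for any other article.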

Please verify if you can find a genuine search result for zero.wikipedia.org with a query that doesn't explicitly force inurl: or site:.

See https://phabricator.wikimedia.org/T67402#1571061 and https://phabricator.wikimedia.org/T48424#1673536 for why.

Krinkle lowered the priority of this task from High to Low. Sep 25 2015, 4:34 AM

Please verify if you can find a genuine search result for zero.wikipedia.org with a query that doesn't explicitly force inurl: or site:.

Could probably use a mechanize script à la https://gist.github.com/nemobis/7718061 to search "X wikipedia" on google.tn and friends where X is the title of a top article?
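Whatever does the fetching, the check itself could be as small as the Python sketch below. It deliberately leaves out the search request and only covers building the "X wikipedia" query and filtering a list of result URLs for zero hosts; `build_query`, `zero_hits` and the sample input are hypothetical, and the result URLs are assumed to have been collected separately.

```python
from urllib.parse import urlsplit


def build_query(title):
    """Build a plain query with no inurl: or site: operator."""
    return f"{title} wikipedia"


def zero_hits(result_urls):
    """Return the result URLs whose host is zero.wikipedia.org or a subdomain of it."""
    hits = []
    for url in result_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host == "zero.wikipedia.org" or host.endswith(".zero.wikipedia.org"):
            hits.append(url)
    return hits
```

An empty list from `zero_hits` for the top articles would suggest there is no genuine zero.wikipedia.org result left for ordinary queries.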

Dzahn subscribed.

Declining per T187716#4852639, since Wikipedia Zero no longer exists.