Mobile sites being indexed by search engines
Closed, Resolved, Public

Description

Sites such as http://en.m.wikipedia.org are currently being indexed by search engines such as Google.

I don't believe having these mobile sites indexed is necessary or appropriate.

I'd like to see the sites marked as "no index", via robots.txt or a <meta> tag or whatever other reliable method is available. I thought this was already the case, but it clearly is not: https://www.google.com/search?q=site%3Aen.m.wikipedia.org.


Version: unspecified
Severity: normal

Details

Reference
bz35233

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:19 AM
bzimport set Reference to bz35233.

The mobile site previously had "<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />". This was removed in r113378.

preilly wrote:

We removed the NOINDEX, NOFOLLOW at Google's request. They want to index the mobile site for their mobile search index.

alexsm333 wrote:

Does the mobile site support NOINDEX and MediaWiki:Robots.txt? There is a complaint here: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User_talk_pages_not_NOINDEXed_for_mobile_site

(In reply to comment #3)

> Does the mobile site support NOINDEX and MediaWiki:Robots.txt? There is a
> complaint here:
> http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User_talk_pages_not_NOINDEXed_for_mobile_site

The NOINDEX part is covered by bug 35425. And it looks like https://en.m.wikipedia.org/robots.txt loads properly.

Reopening.

Indexing for the purpose of measuring is fine, but seeing m.wikipedia.org search results in Google's non-mobile search seems confusing and misleading to our users. But that's what's currently happening, and we need to stop it.

Patrick or Tomasz, can you send me the communication that's occurred so far with Google so I understand where they're coming from, and give me any additional background that may be helpful?

The noindex, nofollow has been restored for now.

If you see Google search results which include m., please report them here.

We'll try to comply with Google's request in a way that doesn't affect other crawlers, but only once they've confirmed that they can fully exclude m. from Google search results.

My own experience, specifically with Google: I saw m. results come up for ordinary searches a few days ago, but now I find that much harder to reproduce. However, I can reliably get them both for "site:m.wikipedia.org" and for some searches which pick up unique m. content. For example, the mobile site includes the phrase "Disable images", so the word "disable", in combination with some searches, brings up m. results. (Example: search for "disable gnu general public license" and you'll see a simple.m result.)

So it looks like they're trying aggressively to filter but not always succeeding.

Google's recommendation is to use rel="canonical", i.e. to allow crawlers to crawl the mobile site but to signal that the content is substantially identical to the desktop version. They're recommending for it to be crawler-visible to pick up mobile-optimized pages and serve those directly to users of Google mobile search.

I'm fine with giving this a go. Supposedly Bing, Yahoo! and Google all support rel="canonical" to filter duplicate content.

We'll still see increased crawler traffic relative to noindex but this should help to reliably exclude m. pages from the desktop index.
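
To make the rel="canonical" idea concrete, here is a minimal Python sketch of how a mobile page's canonical link could be derived. It is not the MobileFrontend implementation. The function names are hypothetical, and it assumes the desktop host is simply the mobile host with the "m" label dropped (en.m.wikipedia.org becomes en.wikipedia.org), which may not hold for every site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def desktop_url(mobile_url: str) -> str:
    """Return the assumed desktop URL by dropping the 'm' host label.

    Assumption: en.m.wikipedia.org maps to en.wikipedia.org. Real sites
    may need a per-site mapping instead.
    """
    parts = urlsplit(mobile_url)
    labels = [label for label in parts.hostname.split(".") if label != "m"]
    # Keep scheme, path and query; drop any fragment.
    return urlunsplit((parts.scheme, ".".join(labels), parts.path, parts.query, ""))


def canonical_link(mobile_url: str) -> str:
    """Build the <link> element a mobile page would emit in its <head>."""
    href = escape(desktop_url(mobile_url), quote=True)
    return f'<link rel="canonical" href="{href}">'


print(canonical_link("http://en.m.wikipedia.org/wiki/Main_Page"))
```

With this in the page head, crawlers can still fetch the mobile page but are told which URL should represent the content in the index.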

(In reply to comment #9)

> Google's recommendation is to use rel="canonical", i.e. to allow crawlers to
> crawl the mobile site but to signal that the content is substantially identical
> to the desktop version. They're recommending for it to be crawler-visible to
> pick up mobile-optimized pages and serve those directly to users of Google
> mobile search.
>
> I'm fine with giving this a go. Supposedly Bing, Yahoo! and Google all support
> rel="canonical" to filter duplicate content.
>
> We'll still see increased crawler traffic relative to noindex but this should
> help to reliably exclude m. pages from the desktop index.

When I look at the page source of http://en.m.wikipedia.org/ currently, I notice two things:

<meta name="robots" content="noindex,nofollow"/>

and...

<link rel="canonical" href="http://en.wikipedia.org/wiki/Main_Page" >

So what is needed to resolve this bug? Simply removing the noindex/nofollow HTML output (presumably this is behind a PHP configuration variable)?
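
For anyone who wants to repeat this check on other pages, the following Python sketch extracts both hints from a page's HTML using only the standard library. The class and function names are made up for illustration, and the sample markup is the two tags quoted above; it does not fetch anything over the network.

```python
from html.parser import HTMLParser


class RobotsHints(HTMLParser):
    """Collect the robots meta directives and the canonical link, if any."""

    def __init__(self):
        super().__init__()
        self.robots = []       # e.g. ["noindex", "nofollow"]
        self.canonical = None  # href of <link rel="canonical">, if present

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.robots = [d.strip().lower() for d in content.split(",") if d.strip()]
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")


def read_hints(html: str):
    """Return (robots directives, canonical href) found in the given HTML."""
    parser = RobotsHints()
    parser.feed(html)
    parser.close()
    return parser.robots, parser.canonical


sample = (
    '<meta name="robots" content="noindex,nofollow"/>'
    '<link rel="canonical" href="http://en.wikipedia.org/wiki/Main_Page" >'
)
robots, canonical = read_hints(sample)
print(robots, canonical)
```

A page that reports both "noindex" and a canonical href is in the mixed state described in this comment.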

Related URL: https://gerrit.wikimedia.org/r/61809 (Gerrit Change I1790f38880458588b9ccc5c2d5e0fa67ff00e386)

https://gerrit.wikimedia.org/r/61809 (Gerrit Change I1790f38880458588b9ccc5c2d5e0fa67ff00e386) | change APPROVED and MERGED [by Jdlrobson]

*** Bug 48856 has been marked as a duplicate of this bug. ***

Reopening this bug to avoid filing a duplicate. I came across this for mediawiki.org during a Google search. Perhaps we just didn't get all WMF sites?

Replication:

Google search for Wikimedia bugzilla groups [oh the irony]
https://www.google.com/search?q=wikimedia+bugzilla+groups&oq=wikimedia+bugzilla+groups

For me the 4th option was:
User:AKlapper (WMF)/BugzillaAdminPolicy - MediaWiki
https://m.mediawiki.org/wiki/User:AKlapper.../BugzillaAdminPolicy

As stated in:
https://bugzilla.wikimedia.org/show_bug.cgi?id=48856#c1 (dated 2013-05-28 00:22:05 UTC)

This will remain the case without a cache flush.

I believe our caches persist for 6 weeks, so please reopen this bug if you still notice this behaviour at the end of July.

(In this particular example, the page in the search result in question hasn't been edited since Apr 23, 2013, so it will still be served from cache.)

*** Bug 50400 has been marked as a duplicate of this bug. ***