
Google ignores canonical url in search results (index.php?curid=, index.php?title=, etc.)
Closed, Resolved (Public)

Description

This was supposedly fixed (bug 16865; r45360).

And though MediaWiki is indeed outputting "noindex", Google appears to be ignoring it and as such is indexing duplicate content.

A few examples:

https://www.google.com/search?q=inurl:curid+site:mediawiki.org

  1. Discussion - MediaWiki www.mediawiki.org/?curid=84252 Mar 29, 2012 Hi! Searching for the shortest urls for wikis using scripts other then Latin was a longtime nightmare. urls using the "wgArticleId" from ...
  2. Link to - MediaWiki www.mediawiki.org/?curid=84277 Mar 30, 2012 mw.config.set({,,,, wgPageName":"Ernst_Lossa","wgTitle":"Ernst Lossa", "wgCurRevisionId":99548829,"wgArticleId":2809853, ...} ...) nr.

https://www.google.com/search?q=inurl:curid+site:wikipedia.org

  3. [edit] Notes - Wikipedia en.wikipedia.org/wiki/index.html?curid=7490642&action=render My Brightest Diamond is the project of singer–songwriter and multi-instrumentalist Shara Worden. The band has released three studio albums, 2006's Bring Me ...
  4. Wikipedia, the free encyclopedia en.wikipedia.org/wiki?curid= The 1950 Atlantic hurricane season was the first year in the Atlantic hurricane database (HURDAT) in which storms were given names by the United States Air ...
  5. Wikipedia simple.wikipedia.org/?curid= This is the front page of the Simple English Wikipedia. Wikipedias are places where people work together to write encyclopedias in different languages. We use ...
  6. Table tennis at the 2004 Summer Paralympics - Wikipedia, the free ... en.wikipedia.org/wiki/index.html?curid=1011065 Table Tennis at the 2004 Summer Paralympics was staged at the Galatsi Olympic Hall from September 18 to September 27. Competitors were divided into ten ...
  7. Upper Eastside - Wikipedia, the free encyclopedia en.m.wikipedia.org/wiki/index.html?curid=19698600 A MiMo restaurant on Biscayne Boulevard in the Upper Eastside. The Upper Eastside is famous for its post war MiMo architecture, and is home to the MiMo ...
  8. Robert Loggia - Wikipédia fr.wikipedia.org/wiki/?curid=899678 Translate this page Vous pouvez partager vos connaissances en l'améliorant (comment ?) selon les recommandations des projets correspondants. Robert Loggia est un acteur et ...

What I found is that:

  • The ones from mediawiki.org are LiquidThreads pages. LQT apparently overrides this logic from Article.php and as such is not outputting "robots => noindex". So those are a flaw on our end.
  • #3 has action=render. That's never supposed to be indexed (separate bug?), but the way it is used circumvents some of our defenses. #3 accesses an article by the name of "index.html", but then overrides it with curid and tacks on action=render. Basically doing: en.wikipedia.org/wiki/Some_page_name?curid=7490642&action=render
  • #4 and #5 have an empty curid
  • #6 and #7 are more examples of this odd "index.html" title
  • #8 is like the ones on mediawiki.org except that these are not from LQT and are actually outputting "noindex". This is the main problem.
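
One way to check which of these signals a given page actually emits is to parse its HTML and read the robots meta tag and the canonical link. The following is a minimal sketch using only Python's standard library. The helper name `head_signals` and the sample markup are hypothetical, not taken from MediaWiki's output.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects the robots meta content and the rel=canonical href, if present."""

    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.robots = a.get("content")
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")


def head_signals(html):
    """Return (robots_content, canonical_href) for an HTML document string."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.robots, parser.canonical


# Hypothetical markup for a page that sends noindex but no canonical link.
sample = (
    '<html><head><meta name="robots" content="noindex,follow"/>'
    "<title>Example</title></head><body></body></html>"
)
robots, canonical = head_signals(sample)
```

A page such as #8 would be expected to yield a robots value containing "noindex", while the LiquidThreads pages would yield none.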

Though it is somewhat outside the scope of this bug, I think we should:

  • Always output rel=canonical when viewing a regular page (that is, not a Special page, not a non-view action, and no diff or oldid). This covers any url, no matter how weirdly constructed, with:
    • /?title=
    • /w?title=
    • /w/index.php?title=
    • any of the above with curid instead of title
    • any of the above via /wiki/
    • any of the above with action=view

      Right now we're only doing rel=canonical on redirects, which makes no sense to me. It is perfectly fine to output rel=canonical on the canonical page itself.
  • Always output noindex when viewing a page without rel=canonical, i.e. any wiki page with action=view that is not a simple view of the latest version of an article, e.g. with diff or oldid.
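
The two rules proposed above can be sketched as a small decision function. This is an illustration in Python, not MediaWiki's actual implementation: the function name `head_policy`, the English-only "Special:" prefix check, and the /wiki/ path handling are all simplifying assumptions. Cases the proposal leaves unspecified (Special pages, non-view actions) return neither signal.

```python
from urllib.parse import urlsplit, parse_qs


def head_policy(url):
    """Return which of rel=canonical / noindex the proposal would emit for url."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    action = query.get("action", ["view"])[0]

    # Assumption: the title comes from ?title= or, failing that, a /wiki/ path.
    title = query.get("title", [""])[0]
    if not title and parts.path.startswith("/wiki/"):
        title = parts.path[len("/wiki/"):]

    is_special = title.startswith("Special:")  # simplified, English-only check
    has_revision_params = "diff" in query or "oldid" in query

    if action == "view" and not is_special and not has_revision_params:
        # Plain view of the latest version, however the url was constructed.
        return {"canonical": True, "noindex": False}
    if action == "view" and not is_special:
        # Viewing a page, but via diff or oldid: not the canonical view.
        return {"canonical": False, "noindex": True}
    # Special pages and non-view actions are not covered by the proposal.
    return {"canonical": False, "noindex": False}
```

Under these assumptions, `head_policy("https://fr.wikipedia.org/wiki/?curid=899678")` asks for rel=canonical, while a url carrying `oldid` or `diff` asks for noindex instead.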

Version: 1.20.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=26115
https://bugzilla.wikimedia.org/show_bug.cgi?id=63891

Details

Reference
bz46424

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:37 AM
bzimport set Reference to bz46424.
bzimport added a subscriber: Unknown Object (MLST).

Filed a separate bug for action=render, bug 63891.

Krinkle renamed this task from "Urls with curid query indexed by Google" to "Google ignores canonical url in search results (index.pho?curid=, index.php?title=, etc.)". Jun 4 2015, 6:56 PM
Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).

Updating to also mention title query parameter.

https://www.google.co.uk/search?q=navigation+popups

  1. Wikipedia:Tools/Navigation popups - Wikipedia, the free ... en.wikipedia.org/?title=Wikipedia:Tools/Navigation_popups
  2. Wikipedia talk:Tools/Navigation popups - Wikipedia, the free ... en.wikipedia.org/?title=Wikipedia_talk:Tools/Navigation_popups
  3. Wikipedia:Tools/Navigation popups/FAQ - Wikipedia, the ... en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups/FAQ

Nemo_bis renamed this task from "Google ignores canonical url in search results (index.pho?curid=, index.php?title=, etc.)" to "Google ignores canonical url in search results (index.php?curid=, index.php?title=, etc.)". Jun 17 2015, 9:41 AM
Krinkle claimed this task.

This was fixed as part of T67402.

As for the example queries in the opening post, those are irrelevant. From what I can gather, Google is well aware of the canonical url but still allows users to find non-canonical urls if and only if their search query did not match the canonical variant of a web page.

The pages returned by queries like https://www.google.com/search?q=inurl:curid+site:mediawiki.org can only be found when forcing it with inurl:curid. When looking for keywords of the destination pages themselves, the canonical variant (and only the canonical variant) is shown in the search results.

Read more at https://phabricator.wikimedia.org/T67402#1571061.