Canonical URL should include language variant
Closed, Resolved (Public)

Description

Acceptance criteria (AC):

  • Example page: https://zh.wikipedia.org/wiki/汉语 (shortened to the /wiki/汉语 format below)

Language variant in URL | Action | link rel="canonical" | link rel="alternate" hreflang="zh-Hant-TW" | Notes
non-specified | view | /wiki/汉语 | /zh-tw/汉语 |
zh-tw | view | /zh-tw/汉语 | /zh-tw/汉语 |
non-specified | edit | /wiki/汉语 | /zh-tw/汉语 | See T67402
zh-tw | edit | /zh-tw/汉语 | /zh-tw/汉语 | for consistency
non-specified | history | /w/index.php?title=汉语&action=history | (without HTML meta alternate tag) | See Func's cmt.
zh-tw | history | /w/index.php?title=汉语&action=history | (without HTML meta alternate tag) | See Func's cmt.
non-specified | info | /w/index.php?title=汉语&action=info | (without HTML meta alternate tag) | See Func's cmt.
zh-tw | info | /w/index.php?title=汉语&action=info | (without HTML meta alternate tag) | See Func's cmt.

I think we don't need to append the variant parameter to the history or info page; variants only make sense when editing or viewing the page.
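
The acceptance criteria above can be read as a small rule: view and edit keep the variant from the URL in the canonical link, while history and info ignore it and emit no alternate tag. A minimal Python sketch of that reading follows. The function name `expected_links` and its parameters are hypothetical and exist only for illustration; this is not MediaWiki code.

```python
from typing import Optional, Tuple

# Actions for which the table keeps the URL variant in the canonical link.
VARIANT_AWARE_ACTIONS = {"view", "edit"}


def expected_links(title: str, action: str,
                   url_variant: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (canonical, zh-tw alternate) as listed in the table.

    url_variant is the variant code found in the request URL, or None
    when the URL does not specify one. (Hypothetical helper.)
    """
    if action in VARIANT_AWARE_ACTIONS:
        # Variant-neutral requests use the /wiki/ prefix.
        prefix = url_variant if url_variant else "wiki"
        return f"/{prefix}/{title}", f"/zh-tw/{title}"
    # history and info: variant is ignored and no alternate tag is emitted.
    return f"/w/index.php?title={title}&action={action}", None


print(expected_links("汉语", "view", "zh-tw"))
print(expected_links("汉语", "history", "zh-tw"))
```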

NOTE: Maybe we should prefer to use the BCP 47 code for language variant path?
NOTE: Maybe we would like to introduce new configuration option(s) to exclude some of the language variants?

Language variants currently point to the same canonical URL. For example, on this page:

http://zh.wikipedia.org/zh-tw/%E6%B1%89%E8%AF%AD

...there is a rel="canonical" pointing to
http://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD

This rel="canonical" link asks search engines to index the Simplified Chinese page to represent the content on both pages, instead of separately indexing the Simplified Chinese and Traditional Chinese pages. Similar rel="canonical" links are found on all zh-TW pages. Google is reporting that we see a similar problem on other Chinese (e.g. zh-SG) and Serbian content pages.

(This may be caused by the fix to bug 48402 (T50402: rel=canonical of https pages should point to http).)


Version: 1.23.0
Severity: normal
Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:45 AM)
bzimport set Reference to bz52429.

If I understand the semantic meaning of rel="canonical" correctly, what it does now is the expected behavior.

http://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD is not "the Simplified Chinese page", but a page converted automatically based on the request (preferences for logged-in users, Accept-Language for anonymous users). We want these links to show up in Google search results instead of links specifying a particular variant.

However, Google does not seem to respect it and indexes links to pages in every variant, so we have to work around it: https://zh.wikipedia.org/w/index.php?title=MediaWiki:Gadget-variant-link-fix.js

Currently Google seems to mainly index /zh/ links instead of /wiki/'s (which is unexpected). /zh-tw/ or /zh-cn/'s are not indexed as expected though.

RobLa: So should this still be high priority wrt Liangent's comment 1 here?

If still high priority:
Tim: Do you plan to work on this at some point?

(In reply to Andre Klapper from comment #3)

RobLa: So should this still be high priority wrt Liangent's comment 1 here?

If still high priority:
Tim: Do you plan to work on this at some point?

I guess Tim is just the default CC, but this issue actually does not seem to be Wikimedia-specific.

Change 154240 had a related patch set uploaded by Tim Starling:
Don't send rel=canonical to variant-neutral page

https://gerrit.wikimedia.org/r/154240

Change 154240 merged by jenkins-bot:
Don't send rel=canonical to variant-neutral page

https://gerrit.wikimedia.org/r/154240

All patches mentioned in this report were merged or abandoned - is there more work left to do here (if yes: please reset the bug report status to NEW or ASSIGNED), or can you close this ticket as RESOLVED FIXED?

(In reply to Rob Lanphier from comment #0)

Language variants currently point to the same canonical URL. For example, on this page:

http://zh.wikipedia.org/zh-tw/%E6%B1%89%E8%AF%AD

...there is a rel="canonical" pointing to
http://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD

Now has:

<link rel="alternate" hreflang="zh" href="/zh/%E6%B1%89%E8%AF%AD" />
[...]
<link rel="alternate" hreflang="zh-TW" href="/zh-tw/%E6%B1%89%E8%AF%AD" />
<link rel="alternate" hreflang="x-default" href="/wiki/%E6%B1%89%E8%AF%AD" />
[...]
<link rel="canonical" href="http://zh.wikipedia.org/zh-tw/%E6%B1%89%E8%AF%AD" />

But I'm not sure this is properly fixed in general, because this is still an issue:

(In reply to fireattack from comment #2)

Currently Google seems to mainly index /zh/ links instead of /wiki/'s (which is unexpected). /zh-tw/ or /zh-cn/'s are not indexed as expected though.

The two URLs for "zh" version don't agree on which is canonical:

/zh/ says

<link rel="alternate" hreflang="zh" href="/zh/%E6%B1%89%E8%AF%AD" />
[...]
<link rel="canonical" href="http://zh.wikipedia.org/zh/%E6%B1%89%E8%AF%AD" />

/wiki/ says

<link rel="alternate" hreflang="zh" href="/zh/%E6%B1%89%E8%AF%AD" />
[...]
<link rel="canonical" href="http://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD" />
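
The disagreement described above can be checked mechanically. The following is a rough Python sketch, using only the standard library, that collects the canonical and hreflang alternate links from a page's HTML and compares two pages. The class name `LinkCollector`, the helper `collect`, and the two sample strings are illustrative assumptions; the sample markup is copied from the snippets quoted in this comment.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect rel=canonical and rel=alternate hreflang links (sketch)."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed here as well.
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower()
        if rel == "canonical":
            self.canonical = a.get("href")
        elif rel == "alternate" and a.get("hreflang"):
            self.alternates[a["hreflang"]] = a.get("href")


def collect(html: str) -> LinkCollector:
    parser = LinkCollector()
    parser.feed(html)
    return parser


# Markup as quoted above for the /zh/ and /wiki/ URLs.
ZH_PAGE = (
    '<link rel="alternate" hreflang="zh" href="/zh/%E6%B1%89%E8%AF%AD" />'
    '<link rel="canonical" href="http://zh.wikipedia.org/zh/%E6%B1%89%E8%AF%AD" />'
)
WIKI_PAGE = (
    '<link rel="alternate" hreflang="zh" href="/zh/%E6%B1%89%E8%AF%AD" />'
    '<link rel="canonical" href="http://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD" />'
)

zh, wiki = collect(ZH_PAGE), collect(WIKI_PAGE)
# Both pages advertise the same "zh" alternate but different canonicals.
print(zh.alternates["zh"] == wiki.alternates["zh"], zh.canonical == wiki.canonical)
```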

Created attachment 16795
Google search in Italian for [[zh:汉语]]

If I search a Latin alphabet string of that article I manage to get 4 variants from Google after asking to show me duplicate pages as well. None of them is /wiki/

Searching '"漢語,又称中文、华语" site:wikipedia.org' yielded two results including zh.wap.wikipedia.org/zh-tw/汉语 but that's another bug.

Attached:

汉语.png (768×1 px, 107 KB)

Krinkle renamed this task from "Language variants currently point to the same canonical URL" to "Canonical URL should include language variant". (Jul 7 2015, 10:34 AM)
Krinkle set Security to None.

I don't know why the title of this page reads "Canonical URL should include language variant" (because it shouldn't). But anyway, I'm here to report the exact weird behavior mentioned above:

Created attachment 16795
Google search in Italian for [[zh:汉语]]

If I search a Latin alphabet string of that article I manage to get 4 variants from Google after asking to show me duplicate pages as well. None of them is /wiki/

Searching '"漢語,又称中文、华语" site:wikipedia.org' yielded two results including zh.wap.wikipedia.org/zh-tw/汉语 but that's another bug.

If you search Google for "汉语 维基百科", the first result is https://zh.wikipedia.org/zh/汉语, and all the other results below also use /zh/.

This is NOT optimal because it shows the page in its original variant instead of the user's preferred one (in my case, zh-cn).

However, if you search for "汉语 维基百科 site:wikipedia.org", the first result, like the other links, becomes https://zh.wikipedia.org/wiki/汉语. This is optimal because /wiki/ links automatically switch to the language variant that the user wants.

I have no idea what causes this strange situation (it may even be Google's fault), but it needs to be fixed. It's quite annoying that users need to manually change the language variant after arriving from a Google result.

So I think this bug should be "Canonical URL should be /wiki/ links, but somehow Google doesn't honor it".

Change 609513 had a related patch set uploaded (by VulpesVulpes825; owner: VulpesVulpes825):
[mediawiki/core@master] Write language varaint link as child element rather than individual entry in sitemap

https://gerrit.wikimedia.org/r/609513

VulpesVulpes825 subscribed.

As T198965#4438037 suggests, fixing Sitemap will not solve this issue unless T87140 gets implemented. Hence removing myself as the assignee of this task.

I think we shouldn't be setting the canonical URL to /wiki on Chinese variant pages. The localized pages have their own titles. Setting the canonical URL to /wiki makes them all show the same title in Google search results.

As a Taiwanese user, it always bothers me to see Simplified Chinese titles in my search results.

I would suggest we remove the canonical URL.

I'm happy to help make the code changes if we agree on doing this.

Kindly refer to this discussion on Search Console Help

The combination of such subtle differences with correct implementation of hreflang and country targeted folders should exclude any necessity for canonicals anyway.

https://support.google.com/webmasters/thread/130615008?hl=en&msgid=130651877

Change 879579 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] Make meta canonical URL variant-language-aware

https://gerrit.wikimedia.org/r/879579

In T54429#7981483, @YLJ wrote:

I would suggest we remove the canonical url.

Kindly refer to this discussion on Search Console Help

The combination of such subtle differences with correct implementation of hreflang and country targeted folders should exclude any necessity for canonicals anyway.

https://support.google.com/webmasters/thread/130615008?hl=en&msgid=130651877

I think we do need a canonical URL set, since there are multiple ways to access a variant: the /zh-*/ subpage and the ?variant=zh-* query string.

@Winston_Sung We may need to use subpage as the canonical since links in the variants tab use subpage syntax.
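
To illustrate why a single canonical form is useful here, the sketch below maps both access forms (the /zh-*/ subpage and the ?variant=zh-* query string) onto the subpage form. It is a Python illustration only, not how MediaWiki implements it. The function name `variant_subpage_form` and the `/wiki/` article prefix are assumptions.

```python
from urllib.parse import parse_qs, unquote, urlsplit

ARTICLE_PREFIX = "/wiki/"  # assumed article path prefix


def variant_subpage_form(url: str) -> str:
    """Map either variant access form to /<variant>/<title> (sketch)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    variant = query.get("variant", [None])[0]
    path = unquote(parts.path)
    if variant is None:
        # Already path-based, or variant-neutral: leave the path alone.
        return path
    if "title" in query:
        # /w/index.php?title=...&variant=...
        title = query["title"][0]
    elif path.startswith(ARTICLE_PREFIX):
        # /wiki/<title>?variant=...
        title = path[len(ARTICLE_PREFIX):]
    else:
        raise ValueError(f"cannot find a title in {url!r}")
    return f"/{variant}/{title}"


for u in (
    "https://zh.wikipedia.org/zh-tw/汉语",
    "https://zh.wikipedia.org/wiki/汉语?variant=zh-tw",
    "https://zh.wikipedia.org/w/index.php?title=汉语&variant=zh-tw",
):
    print(variant_subpage_form(u))
```

All three inputs resolve to the same path, which is the property a canonical link is meant to express.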

@Func :

Due to technical limitations (as you mentioned, there should be only one canonical URL for each variant), it is controlled by the $wgVariantArticlePath configuration option.

Due to technical limitations (there should be only one canonical for one variant), it is controlled by the $wgVariantArticlePath configuration option.

Of course there is only one canonical; I mean the subpage form should be used for WMF sites. Yes, so you should make use of it and keep things consistent.

Oh, I now understand what you meant.

As long as the configuration option $wgVariantArticlePath is set properly on WMF sites (to the language-code directory form, i.e. the "subpage"), it will be used there.
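
As a rough illustration of the point above, the Python sketch below expands a path template of the kind $wgVariantArticlePath holds, where $1 stands for the page title and $2 for the variant code. The function name `variant_url` is hypothetical, and the fallback to a ?variant= query string when the option is unset is an assumption made for this example rather than a statement about MediaWiki's exact behaviour.

```python
from typing import Optional
from urllib.parse import quote


def variant_url(title: str, variant: str,
                variant_article_path: Optional[str]) -> str:
    """Build a variant URL from a "/$2/$1"-style template (sketch)."""
    if variant_article_path:
        # $2 is the variant code, $1 is the percent-encoded title.
        return (variant_article_path
                .replace("$2", quote(variant))
                .replace("$1", quote(title)))
    # Assumed fallback when no template is configured.
    return f"/w/index.php?title={quote(title)}&variant={quote(variant)}"


print(variant_url("汉语", "zh-tw", "/$2/$1"))
print(variant_url("汉语", "zh-tw", None))
```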

@Func :

I think you have misunderstood the code.

The part you mentioned is the parameter for $this->getTitle()->getCanonicalURL() , which is independent from the URL format configuration.

See Title::getCanonicalURL( $params )

Jdlrobson claimed this task.
Jdlrobson subscribed.

The canonical behaviour described in the table in the description is working as requested.

In terms of ?action=(info|history|edit) the link[hreflang] will always point to the article page.

As far as I can see, the remaining potential ask (from the original task description) is to drop all link[hreflang] tags on ?action=info, ?action=history and ?action=edit pages. After chatting with a Google representative, there doesn't seem to be any benefit in dropping those link tags.

If you feel there is any follow up work to be done here, please create a new ticket with detailed description of what is required and I will forward that question to Google.

Change 879579 merged by jenkins-bot:

[mediawiki/core@master] OutputPage: Fix the behavior for canonical URL and alternate URLs

https://gerrit.wikimedia.org/r/879579