Automatically add anchor for original (English on Wikimedia) version of heading title
Closed, Resolved | Public | 4 Estimated Story Points | Feature

Description

If the source page, e.g. [[Foo]] contains this wikitext:

<translate>
== Bar ==

Lorem ipsum dolor sit amet.
</translate>

then you can link to it like this: [[Foo#Bar]].

If a translation, e.g. [[Foo/xx]] has translated "== Bar ==" as "== Something else ==" then the text will look something like this:

== Something else ==

Some translated version of <Lorem ipsum dolor sit amet> here.

It would be useful to make [[Foo#Bar]] automatically point to [[Foo#Something_else]], if it is feasible.
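
A minimal sketch of the requested behaviour, in Python for illustration only. It assumes a simplified ID scheme in which spaces become underscores; MediaWiki's real anchor encoding handles more cases. The names `source_anchor_id` and `add_source_anchor` are hypothetical and are not part of the Translate extension.

```python
import html


def source_anchor_id(heading_text: str) -> str:
    """Simplified ID (assumption): trim, then replace spaces with underscores."""
    return heading_text.strip().replace(" ", "_")


def add_source_anchor(source_heading: str, translated_wikitext: str) -> str:
    """Prepend an empty span that carries the source heading's ID."""
    anchor_id = html.escape(source_anchor_id(source_heading), quote=True)
    return f'<span id="{anchor_id}"></span>\n{translated_wikitext}'


print(add_source_anchor("Bar", "== Something else =="))
```

With such an anchor in place, [[Foo/xx#Bar]] would have a target even though the visible heading is "Something else".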

Event Timeline

Pols12 subscribed.

Adding I18n since this issue breaks most of Special:MyLanguage links with anchor fragment.

Yes, when a translation admin sees a link to a section in the same page, they usually think to add an empty span.
But they can’t know whether someone else links to that section from another page or talk page.

Currently, each time you want to link to a section, you have to ensure a manual anchor exists.

Nikerabbit set the point value for this task to 2 (Mar 31 2022, 11:12 AM).
abi_ changed the point value for this task from 2 to 4 (Jun 29 2022, 11:52 AM).
Nikerabbit raised the priority of this task from Lowest to Medium (Aug 17 2022, 7:28 AM).

While doing this, could the language of untranslated headings also be marked in the table of contents when it’s marked up at the actual place (i.e. syntax version 2, no nowrap attribute)? Currently the final markup is

<div lang="en" dir="ltr" class="mw-content-ltr">
== Headline ==
</div>

and the parser identifies just the == Headline == part as a heading, so the language markup isn’t present in the TOC. This causes the same issues in the TOC that syntax version 2 solved elsewhere (correct pronunciation by screen readers, avoiding bidirectionality issues, using appropriate fonts etc.).

Change 826995 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Add anchor for source version of heading on translation pages

https://gerrit.wikimedia.org/r/826995

Change 826995 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Add anchor for source version of heading on translation pages

https://gerrit.wikimedia.org/r/826995

Tested on translatewiki.net. I went to https://translatewiki.net/wiki/Translating:MediaWiki/Basic_glossary:_Tips_for_translators#Should_I_transliterate_a_term_from_English_(or_another_language_that_is_familiar_to_computer_users_in_my_community)_or_translate_it_to_a_native_word_in_my_language? and then changed the URL manually to include /lt before the anchor, and the page did scroll to the correct section (insofar as it could, since it hit the page bottom anyway). It also works when adding Special:MyLanguage in the URL. I did not test the redirect page case, but I expect it to work as well.

One potential issue I see is that the span tag gets wrapped in a <p> element. Fortunately it seems it doesn't change the visual appearance of the page.

Can also confirm by looking at the rendered page source at: https://translatewiki.net/wiki/Translating:MediaWiki/Basic_glossary:_Tips_for_translators/lt?action=edit

<languages />
Šiame puslapyje pateikiami patarimai pradedantiesiems "MediaWiki" ir jos [[Translating:MediaWiki/Basic glossary|pagrindinio terminų žodyno]] vertėjams.

Taip pat turėtumėte perskaityti šiuos puslapius:
* [[Localisation guidelines]]
* [[Translating:MediaWiki]]

<span id="How_much_time_does_it_take_to_translate_this_glossary?"></span>
== Kiek laiko užtrunka išversti šį terminų žodyną? ==

Jei turite patirties naudojantis "MediaWiki" svetainėmis ir jūsų kalboje yra nusistovėjusi "Wiki" redagavimo terminologija, tai užtruks apie dvi dienas.

[...]

I did not test complicated cases like duplicate heading names or headings with various special characters or Unicode.

If the heading contains any non-ASCII character, MediaWiki core adds a second dot-encoded anchor tag inside the h2 tag. It would have been nice to hook the parser to insert the Translate anchor in the same place.

However, many thanks for this nicely designed workaround, which will save us so much time and restore many broken section links for visitors! ♥️

The dot-encoded anchor is a legacy thing. It was necessary back in the HTML4 times (maybe in XHTML as well), but in HTML5 unencoded anchors work well. The dot-encoded anchors were kept, probably to avoid breaking existing anchor links. Since there are no existing anchors here, there’s nothing to break.
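
For readers unfamiliar with the legacy form, here is a rough Python approximation of dot-encoding: the text is percent-encoded as UTF-8 and each `%` is then written as a dot. This is a sketch under assumptions, not MediaWiki's actual implementation; the function name `legacy_dot_encoded_id` is hypothetical, and the choice of characters left unencoded (letters, digits, `_`, `.`, `-`, `~`, `:`) is a simplification.

```python
from urllib.parse import quote


def legacy_dot_encoded_id(heading_text: str) -> str:
    """Approximate the legacy HTML4-style anchor for a heading (assumption)."""
    underscored = heading_text.strip().replace(" ", "_")
    # Percent-encode as UTF-8, keeping ":" readable, then swap "%" for ".".
    percent_encoded = quote(underscored, safe=":")
    return percent_encoded.replace("%", ".")


print(legacy_dot_encoded_id("WikiData: [1] officeholders"))
print(legacy_dot_encoded_id("Šiame"))
```

Plain ASCII headings come out unchanged under this scheme, which is why the second anchor only appears for headings with other characters.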

Translation pages do not get these anchors immediately, only when they are next updated for any reason. We may consider running a script to refresh all translation pages eventually.

Note that marking the page for translation again is not sufficient to update translation pages: a real change has to be made.

No, it doesn’t have to: this update was done after marking a page for translation without any changes. (In your case, something may have gone wrong in the job queue.)

However: this is also an example of how things can go terribly wrong, with the only solution being switching to HTML markup, which suppresses the extra anchors. (Switching to HTML was actually desirable in this case, in order to suppress the broken section edit links when the template is transcluded, but in most cases it would be undesirable and unnecessary extra work to fix the translations.)


Another example, which doesn’t appear to be broken, but strictly speaking it is, is https://www.mediawiki.org/wiki/Wikimedia_Apps/hu#Android: both English and Hungarian headings are Android (which makes sense, as the name of Google’s OS is written the same in both languages), which results in the same ID appearing twice in the document, which is invalid HTML. This second example is pretty easy to avoid: just compare the original and the translated heading, and if they’re the same, skip adding our anchor.
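
The duplicate-ID condition described above can be checked mechanically. The following Python sketch collects every `id` attribute from an HTML fragment and reports those that occur more than once. The class and function names are hypothetical, and the sample markup is a hand-written approximation of the rendered page, not a copy of it.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute seen in start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def duplicate_ids(html_text: str) -> list:
    """Return the sorted list of IDs that appear more than once."""
    collector = IdCollector()
    collector.feed(html_text)
    collector.close()
    counts = Counter(collector.ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hand-written sample: source anchor and translated heading share the same ID.
sample = (
    '<p><span id="Android"></span></p>'
    '<h2><span class="mw-headline" id="Android">Android</span></h2>'
)
print(duplicate_ids(sample))
```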

No, it doesn’t have to: this update was done after marking a page for translation without any changes. (In your case, something may have gone wrong in the job queue.)

None of the 3 marking actions (on 14th, 15th and 18th September) have created any edit by FuzzyBot on translation pages. There is an issue with that page or wiki.

I checked the logs and found that these translation pages were not updated due to the following error:

{
  "message": "edit-no-change",
  "params": [],
  "type": "warning"
}

I checked the markup of the page, and noticed that it uses the following syntax to mark the headings:

== <translate><!--T:2--> Who we are</translate> ==

This markup is not supported by this feature. The feature identifies headings by the == markers inside the translation unit. Since the markers sit outside the translate tags on this page, we cannot identify the headings, and hence cannot generate anchors for them.
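
To illustrate why the placement of the translate tags matters, here is a hedged Python sketch of heading detection on the text of a translation unit. The regular expression and the name `heading_text` are assumptions for illustration; the pattern actually used by the extension is not shown in this thread.

```python
import re
from typing import Optional

# Assumed pattern: the same run of "=" signs opens and closes the line.
HEADING_RE = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def heading_text(unit_text: str) -> Optional[str]:
    """Return the heading text if the unit is a wikitext heading, else None."""
    match = HEADING_RE.match(unit_text.strip())
    return match.group(2) if match else None


# Unit text when the == markers are inside <translate>: detected.
print(heading_text("== Bar =="))
# Unit text when the == markers are outside <translate>: not detected.
print(heading_text(" Who we are"))
```

In the second case the unit contains only the heading's words, so nothing in it reveals that it is a heading.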

Change 832827 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Avoid adding heading anchor if translation and definition are same

https://gerrit.wikimedia.org/r/832827

I've submitted a fix to handle headings with the same definition and translation: https://gerrit.wikimedia.org/r/832827

I've created two issues to track smaller work that was identified while working on this task:

  1. T318070: Ensure untranslated heading are wrapped in the table of contents
  2. T318067: Handle duplicate heading names when generating anchors

Change 832827 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Avoid adding heading anchor if translation and definition are same

https://gerrit.wikimedia.org/r/832827

Another example, which doesn’t appear to be broken, but strictly speaking it is, is https://www.mediawiki.org/wiki/Wikimedia_Apps/hu#Android: both English and Hungarian headings are Android (which makes sense, as the name of Google’s OS is written the same in both languages), which results in the same ID appearing twice in the document, which is invalid HTML. This second example is pretty easy to avoid: just compare the original and the translated heading, and if they’re the same, skip adding our anchor.

This issue has been fixed. Thanks for the report.

I'm going to mark this task as resolved now. Sub tasks have been created for issues reported and features requested.

this update […] is also an example for how things can go terribly wrong, with the only solution being switching to HTML markup, which suppresses the extra anchors.

This hasn’t been fixed, and this is not just an annoyance/nice-to-have, but quite badly broken output that cannot even be easily worked around. I think it should be fixed before we can call this ticket “done”.

Also https://www.mediawiki.org/wiki/Help:Extension:Translate/Page_translation_administration/de is broken (screenshot attached: image.png).

This is a different issue from the above, where wikitext parsing is creating a link and breaking the span syntax. Here I am seeing spans added where they shouldn't be added. It seems the nowrap attribute is being ignored.

https://www.mediawiki.org/w/index.php?title=Help:Extension:Translate/Page_translation_administration/fi&diff=next&oldid=5502590&diffmode=source this diff causes the table rendering to fail in mysterious ways. Not sure why though, but should be possible to fix by avoiding the span tag and newline.

However: this is also an example for how things can go terribly wrong, with the only solution being switching to HTML markup, which suppresses the extra anchors. (Switching to HTML was actually desirable in this case, in order to suppress the broken section edit links when the template is transcluded, but in most cases it would be undesirable and unnecessary extra work to fix the translations.)

I'm not sure how to address this issue. The Translate extension does not see the content of the transcluded template and hence cannot generate a proper id for the span tag. To avoid breakage we should disable generation of the span if we detect that the heading has a template.

Not generating an anchor is one option, another is emitting something like

<span id="{{anchorencode:{{Q|{{{1<noinclude>|Q11696</noinclude>}}}}} officeholders}}"></span>

deferring the call to Parser::guessSectionNameFromWikiText to after template expansion. This could be done unconditionally, but maybe it’s better to do it conditionally, only when there may be issues, since it looks uglier.

Thanks, I understood the problem, and knew what should be done, but did not know how to do it. This should work well.

It may not make sense to generate the id based on the expanded value because the value could change per language (e.g. if it is a translatable template or using {{int}}), defeating the purpose of stable ids.

Change 843511 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Do not add anchor heading if translate tag has nowrap attribute

https://gerrit.wikimedia.org/r/843511

Change 843512 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Disable anchor generation for headings having templates

https://gerrit.wikimedia.org/r/843512

I created a template: Template:Hello with the following content:

WikiData: [https://www.wikidata.org/wiki/{{{1}}}]

Then embedded it into a translatable page:

<translate>
== {{Hello|Q11696}} officeholders ==
</translate>

The existing code and usage of {{anchorencode:}} resulted in the following HTML:

For the source page:

<h2>
	<span id="WikiData:_.5B1.5D_officeholders"></span>
	<span class="mw-headline" id="WikiData:_[1]_officeholders">WikiData: <a rel="nofollow" class="external autonumber" href="https://www.wikidata.org/wiki/Q11696">[1]</a> officeholders</span>
</h2>

Translation page:

<p>
	<span id="WikiData:_[https://www.wikidata.org/wiki/Q11696]_officeholders"></span>
</p>
<h2>
	<span id="WikiData:_.5B1.5D_officeholders_-_ES"></span>
	<span class="mw-headline" id="WikiData:_[1]_officeholders_-_ES">WikiData: <a rel="nofollow" class="external autonumber" href="https://www.wikidata.org/wiki/Q11696">[1]</a> officeholders - ES</span>
</h2>

My knowledge of templates is fairly limited so maybe I'm missing something.

I'm planning to disable anchor generation for headings having templates in order to avoid breaking formatting and create a separate task to improve this behavior.

It may not make sense to generate the id based on the expanded value because the value could change per language (e.g. if it is a translatable template or using {{int}}), defeating the purpose of stable ids.

It may indeed cause issues if the user linking to the section is not aware of this issue, but I wouldn’t say it doesn’t make sense: if one is aware of this issue, it’s still easier to create a working link to it than to the actual heading (just transclude the same potentially language-dependent template, and it’ll work unless the UI language changes while following the link – which is not impossible, but quite unlikely).

Translation page:

<p>
	<span id="WikiData:_[https://www.wikidata.org/wiki/Q11696]_officeholders"></span>
</p>
<h2>
	<span id="WikiData:_.5B1.5D_officeholders_-_ES"></span>
	<span class="mw-headline" id="WikiData:_[1]_officeholders_-_ES">WikiData: <a rel="nofollow" class="external autonumber" href="https://www.wikidata.org/wiki/Q11696">[1]</a> officeholders - ES</span>
</h2>

It looks like the parser can’t create proper anchors for autonumbered links – everything else ([[internal]], [https://example.org explicitly labelled], https://example.org/plain) works. It actually makes sense: the displayed text depends on the context (auto-incremented throughout the page), so even if it added a link text in the ID, it wouldn’t be guaranteed to be stable (if an earlier translation unit contains a similar link in the translation but not in the source language, the numbers differ). On the other hand, these wrong anchors don’t cause extra issues: they’re not stable, but don’t output visible garbage.
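
The instability of autonumbered links can be shown with a small simulation in Python. This is not the parser's code: it is a sketch that assumes unlabelled bracketed links are numbered in order of appearance on the page, and the name `render_autonumbered` is hypothetical.

```python
import re

# Unlabelled external link: a bracketed URL with no label after it (assumption).
AUTONUMBERED_LINK = re.compile(r"\[https?://[^\s\]]+\]")


def render_autonumbered(wikitext: str) -> str:
    """Replace each unlabelled external link with [1], [2], ... in page order."""
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[{counter}]"

    return AUTONUMBERED_LINK.sub(replace, wikitext)


heading = "WikiData: [https://www.wikidata.org/wiki/Q11696] officeholders"
# Heading alone on the page: the link is number 1.
print(render_autonumbered(heading))
# An earlier unlabelled link shifts the number shown in the heading.
print(render_autonumbered("See [https://example.org]. " + heading))
```

The same heading renders with a different number depending on what precedes it, so an ID derived from the displayed text could not be stable across languages.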

When templates are involved, we don’t know at the time the translation page is created whether this will be the case, so we have to accept either false positives or false negatives. Considering that other cases involving templates may be stable, and that no serious issues arise, I don’t think anchors should be disabled in these cases.

It's not just templates, other wiki syntax like links can break the span parsing. My suggestion is to use wfEscapeWikiText to avoid that.

In any case, avoiding broken rendering in pages should be the highest priority.

After digging into this a bit more, I realize that I did not understand the issue. Normal links and templates work fine. It seems that something in the Q template on wikidata is special, where it actually creates an HTML link that breaks parsing.

abi_ changed the task status from Open to In Progress (Oct 18 2022, 12:57 PM).

Hmm, I should have made that clear. The rendering issues will be caused if a template generates HTML content such as a link.

After discussing this some more with Niklas, we're planning to let translation admins use the nowrap attribute to disable anchor generation for such headings instead of stopping the anchor generation for all headings containing templates.

There will still be incorrect anchors generated in some cases as mentioned here but they should not break the rendering.
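
Putting the decisions from this thread together, the rule for whether to emit a source anchor might be sketched in Python as follows. This is an illustration under assumptions, not the extension's code: the function name `should_add_source_anchor` and the plain string comparison are hypothetical.

```python
def should_add_source_anchor(
    source_heading: str, translated_heading: str, nowrap: bool = False
) -> bool:
    """Decide whether to emit an empty span with the source heading's ID."""
    if nowrap:
        # Translation admins opt out per translate tag via the nowrap attribute.
        return False
    if source_heading.strip() == translated_heading.strip():
        # Same text would yield the same ID twice, which is invalid HTML.
        return False
    return True


print(should_add_source_anchor("Bar", "Something else"))
print(should_add_source_anchor("Android", "Android"))
print(should_add_source_anchor("Bar", "Something else", nowrap=True))
```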

Change 843512 abandoned by Abijeet Patro:

[mediawiki/extensions/Translate@master] Disable anchor generation for headings having templates

Reason:

As per: https://phabricator.wikimedia.org/T62544#8324869

https://gerrit.wikimedia.org/r/843512

The rendering issues will be caused if a template generates HTML content such as a link.

What about using htmlspecialchars() or something like that?

Change 843511 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Do not add anchor heading if translate tag has nowrap attribute

https://gerrit.wikimedia.org/r/843511

Change 844531 had a related patch set uploaded (by Pols12; author: Pols12):

[mediawiki/extensions/Translate@master] TranslationUnit: Support anchors starting with hash # character

https://gerrit.wikimedia.org/r/844531

Change 844531 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] TranslationUnit: Support anchors starting with hash # character

https://gerrit.wikimedia.org/r/844531

Nikerabbit changed the subtype of this task from "Task" to "Feature Request".

The rendering issues will be caused if a template generates HTML content such as a link.

What about using htmlspecialchars() or something like that?

We would need to be able to call this after the Parser has expanded the template. I don't think this is possible.

Another example, which doesn’t appear to be broken, but strictly speaking it is, is https://www.mediawiki.org/wiki/Wikimedia_Apps/hu#Android: both English and Hungarian headings are Android (which makes sense, as the name of Google’s OS is written the same in both languages), which results in the same ID appearing twice in the document, which is invalid HTML. This second example is pretty easy to avoid: just compare the original and the translated heading, and if they’re the same, skip adding our anchor.

This issue is caused by headings containing templates that expand to HTML, which results in an incorrect id and breaks the rendering. We've provided a workaround: wrapping the heading in a translate tag with the nowrap attribute skips automatic generation of the linkable anchor.

We've updated the Page translation administrator documentation to reflect the changes made as part of this task.