Draft a computer-assisted translation system for Wikidata labels/descriptions
Open, Low, Public, Feature

Description

Wikidata-repo only handles label and description translations item by item. To make translators more efficient, there should be a convenient way to get a list of labels/descriptions to translate.

Evaluate whether Extension:Translate, with some modifications, could be the platform of choice, or whether a custom solution is needed.


Version: unspecified
Severity: enhancement

Details

Reference
bz62695

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:53 AM
bzimport set Reference to bz62695.
bzimport added a subscriber: Unknown Object (MLST).

It would be a good start to sort those lists by number of sitelinks, be able to select the source language, and provide the input field next to each element. Going beyond that would be like replicating Extension:Translate's functionality. Maybe there is a way to make both work together?
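
A rough sketch of the list described above, in Python for illustration only. The item dictionaries, the `sitelinks` field, and the name `translation_worklist` are hypothetical and do not reflect Wikibase's actual data model or API; the sample sitelink counts are invented.

```python
from typing import List, NamedTuple


class WorklistRow(NamedTuple):
    item_id: str
    source_label: str   # text shown to the translator next to the input field
    sitelinks: int      # used only for ordering


def translation_worklist(items: List[dict], source_lang: str,
                         target_lang: str) -> List[WorklistRow]:
    """List items that have a source_lang label but no target_lang label,
    ordered by sitelink count, highest first."""
    rows = []
    for item in items:
        labels = item.get("labels", {})
        if source_lang in labels and target_lang not in labels:
            rows.append(WorklistRow(item["id"], labels[source_lang],
                                    item.get("sitelinks", 0)))
    return sorted(rows, key=lambda row: row.sitelinks, reverse=True)


# Made-up sample data; the sitelink counts are invented for the example.
SAMPLE_ITEMS = [
    {"id": "Q1", "labels": {"en": "universe"}, "sitelinks": 200},
    {"id": "Q2", "labels": {"en": "Earth", "lt": "Žemė"}, "sitelinks": 300},
    {"id": "Q3", "labels": {"en": "life"}, "sitelinks": 250},
]

for row in translation_worklist(SAMPLE_ITEMS, "en", "lt"):
    print(row.item_id, row.source_label, row.sitelinks)
```

With this sample, Q2 is skipped because it already has a Lithuanian label, and the remaining rows come out with the higher sitelink count first.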

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

For descriptions, automated translation makes little sense (and manual translation even less). Alternative:
http://magnusmanske.de/wordpress/?p=265
(or, more likely, a WMF-supported offspring of that)

@Magnus I like your idea, but I don't agree with your conclusion that manual translation does not make sense. I have yet to see these tools available in as many languages as Wikipedia or MediaWiki are.

In the same way you created Reasonator to fill a gap, I think it makes sense to support manual translation until something else can take over.

@Nikerabbit 13M items x 287 languages = about 3.7 billion descriptions to fill in manually. For simple one-line descriptions, a few minutes of developer time per language would suffice.

Manual descriptions make sense where the notability (for lack of a better term) of the item is not easily expressed via statements. But people typing "Dutch painter (1750-1800)" in 287 languages is, quite frankly, idiotic if one can do the same from the statements. And statements improve over time. Someone adds a parent in a statement, or an award that painter won, or an employer, or... Then 287 manual descriptions will have to be updated, just to be on par with an automatic description. Madness!
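
To make the idea concrete, here is a minimal sketch of generating such a one-line description from statements. It is not the code behind the tool linked above. The statement keys (`nationality`, `occupation`, `born`, `died`), the `TERMS` and `PATTERNS` tables, and `auto_description` are all hypothetical simplifications, and real languages need more than a word-order pattern (gender, case, agreement).

```python
from typing import Dict, Optional

# Hypothetical per-language vocabulary, keyed by language-neutral values.
TERMS: Dict[str, Dict[str, str]] = {
    "en": {"dutch": "Dutch", "painter": "painter"},
    "de": {"dutch": "niederländischer", "painter": "Maler"},
    "fr": {"dutch": "néerlandais", "painter": "peintre"},
}

# Hypothetical per-language word order; this is the "few minutes per language" part.
PATTERNS: Dict[str, str] = {
    "en": "{nationality} {occupation}",
    "de": "{nationality} {occupation}",
    "fr": "{occupation} {nationality}",
}


def auto_description(statements: dict, lang: str) -> Optional[str]:
    """Build a one-line description, or return None if the language or a
    needed term is not covered (a case left for manual description)."""
    terms = TERMS.get(lang)
    pattern = PATTERNS.get(lang)
    if terms is None or pattern is None:
        return None
    try:
        text = pattern.format(
            nationality=terms[statements["nationality"]],
            occupation=terms[statements["occupation"]],
        )
    except KeyError:
        return None
    born, died = statements.get("born"), statements.get("died")
    if born is not None and died is not None:
        text += f" ({born}-{died})"
    return text


painter = {"nationality": "dutch", "occupation": "painter",
           "born": 1750, "died": 1800}
for lang in ("en", "de", "fr"):
    print(lang, auto_description(painter, lang))
```

Because the text is derived on demand, a new or corrected statement changes the output in every covered language at once, with no stored descriptions to update.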

At the National Institute for Occupational Safety and Health we are working on incorporating our datasets into Wikidata. One of the benefits of this is that it allows for translation of our content with less effort, since the relationships ("X causes Y") are language neutral and people only need to translate the terms ("X" and "Y").

We are looking into working with international partners to translate Wikidata entities of interest to us. To monitor progress, and to provide a list of terms that need to be translated, I created this Wikidata translation dashboard: http://tools.wmflabs.org/niosh/wdtranslations.html (takes a while to load due to lousy JavaScript). That report is based off of a very specific Wikidata query (the items in our area of interest -- in this case, chemical hazards), plus the labels for entities featured on those items. This tool also compares coverage of a language to English, since that is the baseline in our specific case. Other than that, you could think of this as a starting point for developing a Wikidata translation tool.
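
For readers who want the gist of the comparison, a minimal sketch of measuring one language's label coverage against an English baseline follows. This is not the dashboard's actual code; the function name `coverage_vs_baseline`, the input shape, and the placeholder item ids are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def coverage_vs_baseline(labels_by_item: Dict[str, Dict[str, str]],
                         lang: str,
                         baseline: str = "en") -> Tuple[float, List[str]]:
    """Return (fraction of baseline-labelled items that also have a label
    in lang, sorted ids of the items still missing one)."""
    baseline_ids = sorted(item_id for item_id, labels in labels_by_item.items()
                          if baseline in labels)
    if not baseline_ids:
        return 0.0, []
    missing = [item_id for item_id in baseline_ids
               if lang not in labels_by_item[item_id]]
    covered = len(baseline_ids) - len(missing)
    return covered / len(baseline_ids), missing


# Placeholder ids and labels, for illustration only.
SAMPLE_LABELS = {
    "X1": {"en": "benzene", "es": "benceno"},
    "X2": {"en": "asbestos"},
    "X3": {"en": "lead", "es": "plomo"},
    "X4": {"es": "sílice"},   # no English label, so outside the baseline
}

fraction, todo = coverage_vs_baseline(SAMPLE_LABELS, "es")
print(f"{fraction:.0%} covered, still to translate: {todo}")
```

The list of missing ids doubles as the list of terms to hand to translators.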

Amire80 added a subscriber: Ijon.

I couldn't find a task like this one (even though I moved it to a different column once), so I created T212323. I have now closed it as a duplicate of this one. Thanks, @Micru.

Almost five years since the creation of this task, such a tool is still needed. Let me suggest a slightly different focus, however.

Translating labels is important. Not just that, but translating a lot of labels is important. They can be used as values to fill in Wikipedia articles (in infoboxes, but not only there), they can be used as titles of fields in infoboxes and tables, etc. Having a lot of them translated is a necessary demonstration of Wikidata's potential to be better integrated with Wikipedia and other wikis. Wiki editors often bring up the fact that many labels are not translated as a reason to not use Wikidata, because too often it shows a Q number, a label in a fallback language, or nothing at all instead of actual text in the wiki's language. Sadly, they are quite right: translating the label in a central place is more efficient than translating it in each infobox, but currently doing it in Wikidata is not as convenient as it should be. That's what this task is really about.

Translating labels can be semi-automated in many cases, but not all. Many languages written in the Latin alphabet can share the same label for names of people and places, but not always; for example, Lithuanian, Azerbaijani, and some other languages are written in the Latin alphabet, but with their own spelling rules. And languages with other alphabets need manual intervention. This can be somewhat automated, too, but the results will still require human verification before they can really be trusted.
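
A minimal sketch of that rule for names of people and places, assuming the source label is already in the Latin alphabet. The two language sets are illustrative, not a complete or authoritative classification, and `suggest_name_label` is a hypothetical helper, not part of any existing tool.

```python
from typing import NamedTuple, Optional

# Illustrative sets only (assumption): target languages written in the Latin
# alphabet, and the subset that respells foreign names by its own rules.
LATIN_SCRIPT = {"en", "de", "fr", "es", "lt", "az"}
RESPELLS_NAMES = {"lt", "az"}   # Lithuanian, Azerbaijani


class Suggestion(NamedTuple):
    text: str
    needs_review: bool


def suggest_name_label(source_label: str,
                       target_lang: str) -> Optional[Suggestion]:
    """Suggest copying a Latin-script name label as-is, or return None when
    the label has to be written by a person."""
    if target_lang not in LATIN_SCRIPT:
        return None   # other alphabet: manual work or transliteration
    if target_lang in RESPELLS_NAMES:
        return None   # same alphabet, but different spelling rules
    # Even a plain copy is only a suggestion until a human confirms it.
    return Suggestion(source_label, needs_review=True)


for lang in ("fr", "lt", "ru"):
    print(lang, suggest_name_label("Ada Lovelace", lang))
```

Every suggestion is flagged for review, so the automation only fills in the input field and leaves the decision to save with the translator.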

Translating descriptions is less important than translating labels. Descriptions were originally made for disambiguation, and from Wikidata's own point of view, when disambiguation is not needed, neither are descriptions. They are also used (somewhat controversially) in the mobile display of Wikipedia articles and in some search results, but these things are less essential than labels. As with labels, translating many of them could be automated, but not all of them. Above all, however, the discussion about translating descriptions is not that important—if a good translation tool is built for labels, it can also be used for descriptions.

Multiplying millions of labels by hundreds (or thousands) of languages and saying that translating all of that by hand is pointless doesn't mean much. 100% of labels will never be translated, to any language. But a good tool that conveniently shows labels by topics will help translate close to 100% of the important ones. Examples of what I call "topics":

  • labels that are most frequently loaded in the German Wikipedia and aren't translated yet
  • labels for which an automatic translation can be suggested, but which require human verification
  • Nuclear scientists
  • Settlements in Borneo

... and so on

Once you break it down by topics, translating piece by piece becomes feasible. Translating a billion labels in sets of ~50 is far more convenient than doing them one by one on item pages, as is done now. This task is mostly about user experience, really.
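
As a small illustration of the "sets of ~50" idea, here is a sketch that cuts any topic's list of untranslated items into fixed-size batches. The helper name `batches` and the default size of 50 are assumptions for the example; how the topic list itself is produced is left open.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int = 50) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical topic: 120 placeholder ids standing in for untranslated items.
topic_items = [f"item-{n}" for n in range(120)]
for number, batch in enumerate(batches(topic_items), start=1):
    print(f"page {number}: {len(batch)} labels to translate")
```

Each batch corresponds to one page a translator can finish in a sitting, which is the user-experience point made above.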

Terminator and Tabernacle are supposed to help with this, but the last time I tried them they just didn't work at all: they showed a form for selecting info, but submitting the form didn't load any data, so I can't even tell what the actual translation experience is. In addition, the forms themselves were too complex for users who can be good label translators but aren't into SPARQL, and they aren't mobile-friendly (admittedly, neither is the Translate extension, and that should be fixed, too). Finally, their user interface has to be translated in https://tools.wmflabs.org/tooltranslate , which is also broken, so it cannot actually be done (I guess I'm biased, but I never understood Magnus's rationale for not using translatewiki for his tools). If these issues are fixed, perhaps it will be possible to mark this task as resolved.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM

This will be superseded by T303677.

That task appears to be only about descriptions, and about their automatic generation where possible.

This task here is about a system for organized manual translation of labels. It doesn't look like T303677 supersedes it.