Transliteration of Crimean Wiki
Closed, Resolved (Public)

Description

Author: timming

Description:
The Crimean Tatar language has two writing systems, Cyrillic and Latin. The translator is written in PHP. How can this translator be included in the Crimean Tatar Wiki, so that users can switch from one script to the other, as in the Kazakh Wiki (http://kk.wikipedia.org)?

An example of the translator is here: http://medeniye.org/transliterator/

Thank you in advance.


Version: unspecified
Severity: enhancement
URL: http://crh.wikipedia.org/

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:48 PM
bzimport set Reference to bz21582.
bzimport added a subscriber: Unknown Object (MLST).

I assume this asks for variants to be enabled?

Eugene.Zelenko wrote:

Don Alessandro (http://ru.wikipedia.org/wiki/%D0%A3%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA:Don_Alessandro) tried to create support for multiple writing systems for the Crimean Tatar language, based on the Kazakh converter.

I don't know the current status. Developers' help is probably needed to update the code to current MediaWiki and to fix any other problems.

timming wrote:

Crimean Tatar writing system convertor

attachment crh_kir_lat.rar ignored as obsolete

timming wrote:

(In reply to comment #2)
Yes, actually, he asked me to work on it.

I've attached the converter. It is written in PHP. Is it possible to update it to current MediaWiki code?

alessandro_gor wrote:

Really. I've asked Timming to help me with this problem.
So, nearly two months have passed, is there any progress?

halan wrote:

This program is important for many languages of the peoples of Russia and the former USSR whose alphabets have changed. I ask any experts who are able to help to pay attention to this problem.

ilnarshaidullov wrote:

Totally agree with you. Today, many representatives of these peoples live abroad and use Latin, Cyrillic, and other scripts, so this is a very big problem. The sooner we resolve it, the sooner these projects will develop.

Removing the shell keyword. I'm not exactly sure what it's here for.

Note that the patch is in the wrong format: all patches should be in (uncompressed) unified diff format. Some of us don't have a rar decompression program installed (and apparently there aren't any open source ones?). Furthermore, just providing the PHP files doesn't really give the context of where they are supposed to go.

To summarize, please provide patches in unified diff format.

amdf00 wrote:

OK, I understand. The patch should be rewritten, to be something like http://svn.wikimedia.org/doc/LanguageTg_8php_source.html

So, I'll remove the shell keyword.

(In reply to comment #9)

OK, I understand. The patch should be rewritten, to be something like
http://svn.wikimedia.org/doc/LanguageTg_8php_source.html

That too, but more importantly, the patch should be generated by the diff program (svn diff if you're using svn, but really they're both the same program). For example (pulling a random example from bugzilla), see http://bug-attachment.wikimedia.org/attachment.cgi?id=2634 and notice how it has + signs on lines that are new and - signs on lines that are removed. See [[Unified_diff#Unified_format]].

Thanks.

timming wrote:

(In reply to comment #10)
Maybe I didn't understand you well, but this is not a patch, this is a transliteration program example. And the question was if someone could make a patch for Wikipedia engine out of this example.

So, should I make a diff of files that are not a patch?

Correct me if I am wrong. Thank you.

PS. Adding an archive in ZIP format.

(In reply to comment #11)

(In reply to comment #10)
Maybe I didn't understand you well, but this is not a patch, this is a
transliteration program example. And the question was if someone could make a
patch for Wikipedia engine out of this example.

So, should I make a diff of files that are not a patch?

Correct me if I am wrong. Thank you.

PS. Adding an archive in ZIP format.

Sorry, someone added the patch keyword, and I assumed it was patch. The whole unified diff thing really only applies to patches (well I suppose you could make a diff for this entirely new program, but that does seem kind of silly).

Cheers.

alessandro_gor wrote:

I've tried to change our script using Kazakh one as a model. Here is the result - http://crh.wikipedia.org/wiki/User:Don_Alessandro/Translit Is there anybody who can help me to diff and test it? Unfortunately I have no opportunity to install MediaWiki on my localhost now.

timming wrote:

Transliterator in ZIP format

Attached:

Created attachment 7785
Don_Alessandro's code from comment 13

(In reply to comment #13)

I've tried to change our script using Kazakh one as a model. Here is the result

who can help me to diff and test it? Unfortunately I have no opportunity to
install MediaWiki on my localhost now.

I tested it on my personal install (sorry, it's not publicly available).

Got a fatal error:

Notice: Undefined variable: mCyrl2Latn in /var/www/w/phase3/languages/classes/LanguageCrh.php on line 69

Fatal error: Cannot access empty property in /var/www/w/phase3/languages/classes/LanguageCrh.php on line 69

I presume that's due to the extra dollar sign on line 69. I got rid of the dollar sign, and got a whole bunch of warnings:

Notice: Undefined variable: crh2Latn in /var/www/w/phase3/languages/classes/LanguageCrh.php on line 60

Notice: Undefined variable: crh2Cyrl in /var/www/w/phase3/languages/classes/LanguageCrh.php on line 61

Warning: array_merge() [function.array-merge]: Argument #1 is not an array in /var/www/w/phase3/includes/StringUtils.php on line 294

Warning: array_merge() [function.array-merge]: Argument #1 is not an array in /var/www/w/phase3/includes/StringUtils.php on line 294

Warning: strtr() [function.strtr]: The second argument is not an array in /var/www/w/phase3/includes/StringUtils.php on line 324

Warning: strtr() [function.strtr]: The second argument is not an array in /var/www/w/phase3/includes/StringUtils.php on line 324

However, I do get the variant drop-down menu, with two entries (Cyrillic and Latin). Unfortunately, clicking on either of them causes the page contents to totally disappear (probably related to all the warnings).
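For readers who have not run into it, the extra dollar sign mentioned above is a common PHP slip: `$this->$name` is a variable-property lookup, so PHP first reads a local variable and then uses its value as the property name. A minimal sketch follows; the class and method names are made up for illustration and are not taken from the patch.

```php
<?php
// Hypothetical stand-in for a converter class; not the code from the patch.
class TranslitDemo {
	public $mCyrl2Latn = [ 'а' => 'a' ];

	public function correct() {
		// Ordinary property access: no dollar sign after the arrow.
		return $this->mCyrl2Latn;
	}

	public function byName() {
		// Legitimate use of the same syntax: the local variable holds
		// the property name as a string.
		$name = 'mCyrl2Latn';
		return $this->$name;
	}

	public function broken() {
		// Stray dollar sign: the local variable $mCyrl2Latn does not
		// exist, so PHP looks up a property with an empty name.
		return $this->$mCyrl2Latn;
	}
}
```

Calling `broken()` is consistent with the "Undefined variable" notice and the "Cannot access empty property" fatal error quoted above, although the exact messages depend on the PHP version.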

I'm also unclear on how includes/CrhConversion.php is supposed to get included (but I'm totally unfamiliar with the lang convert stuff, so there might be magic I just don't know about).

Attached is a unified diff of the stuff at http://crh.wikipedia.org/wiki/User:Don_Alessandro/Translit (as of whatever time it is right now), but with the fatal error on line 70 of LanguageCrh.php fixed.

Cheers.

p.s. All in all, considering you never tested your code, and (I assume) you're rather new at this, having only a single fatal error is actually pretty good.

attachment crh-langconvert.patch ignored as obsolete

alessandro_gor wrote:

Don Alessandro's code from comment 13 with some bugs fixed

Attached:

alessandro_gor wrote:

Thanks a lot!

I've fixed one little bug, maybe it will work now... Could you please test this: https://bugzilla.wikimedia.org/attachment.cgi?id=7789&action=diff

(In reply to comment #18)

Thanks a lot!

I've fixed one little bug, maybe it will work now... Could you please test
this: https://bugzilla.wikimedia.org/attachment.cgi?id=7789&action=diff

Thanks a lot for your efforts on this, apologies for the delay in getting this applied to mediawiki.

I used the latest patch given and, to verify whether it actually works, tried some test cases, but I could not get them to pass.

I tried this combination: "алейкум" => "aleyküm"
I get "preg_replace(): Compilation failed: assertion expected after (?( at offset 30"
at the preg_replace call in the regsConverter method. This happens for Latin to Cyrillic and for the reverse.
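PCRE reports "assertion expected after (?(" when a pattern contains the start of a conditional group, `(?(`, that is not followed by a valid condition. One way to find which rule triggers it is to compile every pattern on its own before running the converter. The sketch below does that; the function name and the two patterns are invented for illustration and are not the patterns from the patch.

```php
<?php
// Return every pattern that PCRE refuses to compile.
function findBrokenPatterns( array $patterns ) {
	$broken = [];
	foreach ( $patterns as $pattern ) {
		// preg_match() returns false when compilation fails; the @
		// only hides the warning that PHP would otherwise print.
		if ( @preg_match( $pattern, '' ) === false ) {
			$broken[] = $pattern;
		}
	}
	return $broken;
}

// Invented examples, not taken from the patch.
$patterns = [
	'/(?<=[аеиоу])е/u', // valid: lookbehind for a Cyrillic vowel
	'/(?(?i)a|b)/u',    // invalid: "(?(" must be followed by a condition
];
$broken = findBrokenPatterns( $patterns );
```

Running a check like this over the converter's rule table would narrow the failure down to individual rules, which is usually quicker than reading the offset in the error message.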

I am not sure how you tested the transliteration. I used phpunit and created a class wiki/tests/phpunit/LanguageCrhTest with the following lines.

<?php
class LanguageCrhTest extends MediaWikiTestCase {
	private $lang;

	protected function setUp() {
		parent::setUp();
		$this->lang = Language::factory( 'Crh' );
	}

	protected function tearDown() {
		unset( $this->lang );
		parent::tearDown();
	}

	function testTranslate() {
		// assertEquals() takes the expected value first, then the actual one.
		$this->assertEquals( 'алейкум', $this->lang->mConverter->convertTo( 'aleyküm', 'crh-cyrl' ) );
		$this->assertEquals( 'aleyküm', $this->lang->mConverter->convertTo( 'алейкум', 'crh-latn' ) );
	}
}

and from phpunit folder,
php phpunit.php languages/LanguageCrhTest.php

Could you please follow this and expand the above test class with more test cases and attach here? From the errors I am getting I assume that some of the regular expressions constructed are not working. But the online tool http://medeniye.org/transliterator/index.php gave me correct results.

sumanah wrote:

Alexander, I changed the keyword "need-review" to "reviewed" -- if you have time to upload a revised patch, please change that keyword back when you do. Thanks!

Can someone comment on this?

(In reply to Ata from comment #21)

Can someone comment on this?

What exactly is your question (as I guess you don't want some random comment for the sake of a comment)?
It looks like the current situation is described in comment 19.

(In reply to Andre Klapper from comment #22)

It looks like the current situation is described in comment 19.

Something was not working -- back in 2011. I wonder if smth has changed since then.

Ahonc raised the priority of this task from Low to Medium. Jan 28 2015, 1:04 AM
Ahonc set Security to None.

is there a way to resolve this? today is this project's birthday and we still cannot do anything... :(

@Ahonc What's the reason that you raised priority (T23582#997301)?

Aklapper lowered the priority of this task from Medium to Low. Jan 17 2017, 1:35 PM

I'm working on this at the Vienna Hackathon.

Unfortunately, I didn't get as far as I would have liked on this at the hackathon. It took a while to figure out how to set it up properly, and the existing code here didn't quite work in the current framework—a lot has changed in the last 6 or 7 years—and I was seeing obviously incorrect results.

I'm going to keep working on this from time to time as one of my 10% projects because I want to learn how to do this—though if anyone else wants to work on this in a more focused way, they can claim the task.

So I'm making some progress. I've managed to refactor parts of the original to work and to work more efficiently with the current framework. Progress is slow because I only work on it now and then, but I was having such a good time yesterday that I kept working on it today.

A quick question for someone familiar with Crimean Tatar. In the parallel texts I've found online, the Cyrillic text uses guillemets («x») and the Latin uses curly quotes (“x”). Should we try to convert between them, or just leave them as they are? I know that some wikis prefer straight quotes ("x"). Trying to convert straight quotes to guillemets is also possible, but would not be 100% accurate with the straightforward approach.
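To make the trade-off concrete, here is a small sketch of what converting quotation marks could look like. The helper names are hypothetical, and this is not how the converter in the patch handles quotes.

```php
<?php
// Hypothetical helpers; the actual converter may treat quotes differently.
function quotesToLatn( $text ) {
	// Guillemets to curly quotes: each character has one counterpart.
	return strtr( $text, [ '«' => '“', '»' => '”' ] );
}

function quotesToCyrl( $text ) {
	// Curly quotes to guillemets: the reverse mapping is equally safe.
	return strtr( $text, [ '“' => '«', '”' => '»' ] );
}
```

Straight quotes are the hard case: the same character opens and closes a quotation, so a plain lookup table like this cannot tell which guillemet to use, and the sketch leaves them untouched.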

I've got a working prototype of the transliteration module! On a corpus of parallel texts I found online, it is >99% accurate going from Cyrillic to Latin, but only about 92% accurate going from Latin to Cyrillic, which is the more important direction for crh_wiki.

I've done the kind of review I usually do when I make changes to language analyzers in Elasticsearch, so there is a lot of data in a write up on MediaWiki.

Speakers of Crimean Tatar who want to help but don't want to read all the details can review the tables with likely problem transliterations in them. Search the page for "Speaker review notes:", which includes what information we need for each table. For the really long lists, reviewing just the top 20 or 30 would have a big impact in overall accuracy!

I suspect there are patterns that the current transliteration is missing from Latin to Cyrillic, but maybe not. I'd prefer not to put in hundreds of more exceptions, but we could try, or just put in the top few dozen and greatly improve overall accuracy.

I also have a list of words from 500 Crimean Tatar Wikipedia articles, with their Latin and Cyrillic transliterations. It's very, very long—more than 5600 individual words.

If speakers want more to review, I can get a larger parallel corpus, or take a larger sample from the Crimean Tatar Wikipedia.

@Ata, can you recommend anyone else to review the transliterations and make suggestions of specific exceptions or general patterns to add to improve the transliteration?

I probably won't work on this again until after Wikimania in mid-August, but after that I will clean up the code a bit and submit a patch for review from the Language Engineering folks, to see if there is anything I've missed on the engineering side.

We are making progress, slowly but surely!

Change 372479 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] [WIP] Crimean Tatar Transliteration

https://gerrit.wikimedia.org/r/372479

Hi! In one week I will be able to join and help check whether everything is OK with the transliteration. Right now I'm on vacation and don't have my computer with me.

So I'm making some progress. I've managed to refactor parts of the original to work and to work more efficiently with the current framework. Progress is slow because I only work on it now and then, but I was having such a good time yesterday that I kept working on it today.

A quick question for someone familiar with Crimean Tatar. In the parallel texts I've found online, the Cyrillic text uses guillemets («x») and the Latin uses curly quotes (“x”). Should we try to convert between them, or just leave them as they are? I know that some wikis prefer straight quotes ("x"). Trying to convert straight quotes to guillemets is also possible, but would not be 100% accurate with the straightforward approach.

Really, in Cyrillic script «x» is used, but it is possible to leave them as they are. It is not a big mistake.

@DonAlessandro—thanks for the great feedback! I'll review it and if there aren't any more questions, I'll incorporate it into the code, upload a new patch, and then start pestering people for review.

In the mean time, can you put the GPL 2+ license on the main page for your original code, just so there aren't any licensing problems?

Thanks!

Thanks again, @DonAlessandro ! That looks good to me.

I'm going to run some more parallel texts—also from medeniye.org, by the way—to look for other problems and to assess the changes from the comments you made. (That was a lot to review, by the way, so I really appreciate it!)

I've run another 25K-word parallel corpus through the conversion, and Cyrillic-to-Latin is holding steady at 99.6% agreement (vs earlier 99.7%), but Latin-to-Cyrillic has lost about 2/3 of its errors at 97.4% (vs earlier 92.3%). More details on MediaWiki.
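The "about 2/3 of its errors" figure follows from the agreement percentages quoted above: going from 92.3% to 97.4% agreement means the error rate dropped from 7.7% to 2.6%. A small sketch of the arithmetic, with a hypothetical function name:

```php
<?php
// Fraction of errors eliminated, given agreement percentages before
// and after a change. Hypothetical helper for illustration only.
function errorReduction( $oldAgreementPct, $newAgreementPct ) {
	$oldErrors = 100 - $oldAgreementPct;
	$newErrors = 100 - $newAgreementPct;
	return ( $oldErrors - $newErrors ) / $oldErrors;
}

// (7.7 - 2.6) / 7.7, i.e. roughly two thirds of the errors are gone.
$reduction = errorReduction( 92.3, 97.4 );
```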

Next, get someone to review the patch. Maybe @Amire80 or @cscott could review or recommend someone?

Anyone have any idea how to get the patch reviewed?

So after the patch is merged, next steps would be:

  • Changing $wgLanguageCode of crhwiki to be 'crh' instead of 'crh-Latn'
  • Making apache rules for the language variant paths in modules/mediawiki/files/apache/sites/main.conf (in puppet)
  • Changing $wgVariantArticlePath

I think that's it, but I'm not sure if there are other things for the variant setup that need to be done.
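On a standalone MediaWiki install, the first and last items in this list correspond roughly to the settings below. This is a sketch only: the Wikimedia cluster sets these through its own config repository rather than a LocalSettings.php file, the path format shown is an assumption, and in MediaWiki core the variant path setting is spelled `$wgVariantArticlePath`.

```php
<?php
// LocalSettings.php-style sketch; values are illustrative assumptions.

// Plain 'crh' rather than 'crh-latn', so that the converter can offer
// both the crh-latn and crh-cyrl variants.
$wgLanguageCode = 'crh';

// $1 is the page title and $2 the variant code, giving paths such as
// /crh-cyrl/Page_title. The web server must route these paths too.
$wgVariantArticlePath = '/$2/$1';
```

The web server rewrite rules (the second item in the list) are a separate piece and are not shown here.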

Change 372479 merged by jenkins-bot:
[mediawiki/core@master] Crimean Tatar Transliteration

https://gerrit.wikimedia.org/r/372479

Thanks so much, @Bawolff!

If no one else wants to work on getting this enabled, I'll keep working on it. I'm going to need some help for the puppet parts, so I'll try to talk @Gehel into making this one of his 10% projects, but he's probably going to be unavailable for a couple of weeks, at least.

@cscott, @liangent, @SPQRobin: do any of you know of anything else that needs to be done other than what Brian listed?

Change 396282 had a related patch set uploaded (by Tjones; owner: Tjones):
[operations/mediawiki-config@master] Updates to enable transliteration for crhwiki

https://gerrit.wikimedia.org/r/396282

Change 396283 had a related patch set uploaded (by Tjones; owner: Tjones):
[operations/puppet@production] Updates to enable short URLs for transliteration for crhwiki

https://gerrit.wikimedia.org/r/396283

I've posted an announcement to Crimean Tatar Village Pump (translation assistance to Russian or Crimean Tatar much appreciated), and the patches I think are needed to enable the transliteration on-wiki are linked in the two previous messages.

I had a look at the patch above from @TJones. A few notes:

  • the patch adds aliases for all domains except wikidata.org and wikimedia.org on beta (seems to make sense), but only for wikipedia.org on prod. What is the actual intent?
  • we need to split the patch to deploy beta first and production later (I'm taking care of that)
  • @TJones do you have a test procedure to validate the results? (I assume yes, so we can probably deploy that together, with you doing the testing.)

Change 398832 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Updates to enable short URLs for transliteration for crhwiki production

https://gerrit.wikimedia.org/r/398832

Thanks, @Gehel—did you mean a plan to test the config before deployment? Unfortunately I don't. Verifying it after deployment is easy, though.

the patch adds aliases for all domain except wikidata.org and wikimedia.org on beta (seems to make sense), but only for wikipedia.org on prod, what is the actual intent?

So, there is only a crh language Wikipedia. There isn't any crh language Wikibooks, etc. For the other languages with the language converter, the short URLs in production seem to be enabled only for wikis that actually exist, whereas beta seems to do it for all projects, even if they don't exist.

It should be noted, that none of the current beta projects use the crh language.

So, there is only a crh language Wikipedia. There isn't any crh language Wikibooks, etc. For the other languages with the language converter, the short URLs in production seem to be enabled only for wikis that actually exist, whereas beta seems to do it for all projects, even if they don't exist.

According to the SiteMatrix, Serbian (sr) doesn't have a Wikiversity or Wikivoyage, and Chinese (zh) doesn't have a Wikiversity. But Serbian is configured for Wikiversity (but not Wikivoyage) and Chinese is configured for Wikiversity. With that inconsistency, it's hard to tell what's supposed to happen.

It should be noted, that none of the current beta projects use the crh language.

I added those because that's what you seemed to suggest in your comments.

The best logic I can make of it is to either (i) enable everything that's available wherever it can be enabled, so it will magically work if that wiki comes into existence, or (ii) only enable things where they exist to keep the configs simple.

We don't currently do either—and I'm happy with either—I just need to know what to do! I've asked some of the maintainers to look at the patches but haven't heard back yet from them. Maybe they will make things clearer.

Another wrinkle—languages other than Serbian and Chinese that use language converter don't have the short URLs enabled, so if there's no resolution to that particular problem, I may abandon those patches and remove the dependency on the mediawiki config, just to get the transliteration into production. Pretty URLs are nice, but not necessary if they cause massive delays. (I had hoped to piggyback off this effort to enable short URLs for the others, but maybe the lesson instead is to not worry about short URLs.)

Pretty URLs are nice, but not necessary if they cause massive delays. (I had hoped to piggyback off this effort to enable short URLs for the others, but maybe the lesson instead is to not worry about short URLs.)

Also keep in mind that we can always come back to doing the short urls after the main transliteration stuff gets enabled. There's no requirement to do those from the get-go.

Change 396283 merged by Gehel:
[operations/puppet@production] Updates to enable short URLs for transliteration for crhwiki - beta

https://gerrit.wikimedia.org/r/396283

The previous patch enables the Apache short-URL config on the beta cluster. It doesn't actually do much since the Crimean Tatar transliteration isn't enabled and there are no Crimean Tatar projects in the beta cluster. But it did let me make sure that the Apache config didn't break anything (the Chinese language conversion still works as expected on the beta cluster). I'll be trying to get the other configs into an upcoming SWAT deploy soon.

Change 396282 merged by jenkins-bot:
[operations/mediawiki-config@master] Updates to enable transliteration for crhwiki

https://gerrit.wikimedia.org/r/396282

Unfortunately, the config used to enable the transliteration didn't quite work in the test environment during today's SWAT. The transliteration was enabled, but clicking on the link caused a "page not found" error. Manually altering the URL to the correct syntax did work (the main page came up in Cyrillic!), so it's the URL configuration that isn't working. I'm going to study the Kazakh wiki config more closely and try to figure out what the right config is. I'm not going to worry about the short URLs until I get the main transliteration enabled.

Change 405048 had a related patch set uploaded (by Tjones; owner: Tjones):
[operations/puppet@production] Revert "Updates to enable short URLs for transliteration for crhwiki - beta"

https://gerrit.wikimedia.org/r/405048

Change 405048 merged by Gehel:
[operations/puppet@production] Revert "Updates to enable short URLs for transliteration for crhwiki - beta"

https://gerrit.wikimedia.org/r/405048

Change 408540 had a related patch set uploaded (by Tjones; owner: Tjones):
[operations/mediawiki-config@master] Updates to enable transliteration for crhwiki

https://gerrit.wikimedia.org/r/408540

I had a remnant of the short URL config, but the short URLs weren't enabled, so it didn't work. This time I've only disabled the config that sets the language for crhwiki to crh-latn, which should have the effect of the language being set to plain crh, which should enable the transliteration options. I will try to have it go out in tomorrow's SWAT deploy (Feb 7, 14:00–15:00 UTC)

Change 408540 merged by jenkins-bot:
[operations/mediawiki-config@master] Updates to enable transliteration for crhwiki

https://gerrit.wikimedia.org/r/408540

Mentioned in SAL (#wikimedia-operations) [2018-02-07T14:30:43Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:408540|Updates to enable transliteration for crhwiki (T23582)]] (duration: 01m 11s)

The transliteration is enabled, and seems to be working fine. Woo hoo! What a long strange trip it's been.

I've left a message on the Crimean Tatar Village Pump announcing it, and giving a brief overview of how it all works.

If there are any technical problems with the transliteration, please open a new ticket, and feel free to subscribe me, though I may only be able to look at new tickets as a volunteer or in my 10% project time.

We know there are still a few transliteration errors (especially from Latin to Cyrillic, which is harder). I'd be happy to update the exceptions list from time to time if people can collect the errors somewhere.

The short URLs (like Serbian and Chinese have) are not enabled. Many other languages with transliteration don't have them enabled, so I am willing to declare success without them. I suggest closing this ticket and opening a new one for the short URLs if they are desired.

Thank you for a great job! But we still have a problem: the script does not work correctly. I have opened a new ticket here: https://phabricator.wikimedia.org/T186727

Change 398832 abandoned by Gehel:
Updates to enable short URLs for transliteration for crhwiki production

Reason:
Trey isn't pursuing short URLs anymore, we won't need this patch any time soon (if ever)

https://gerrit.wikimedia.org/r/398832

I'm closing this ticket because we finally got the transliteration in production. It has some bugs, but those are covered under separate issues. If anyone wants to push for short URLs that should be a separate ticket.