
Localize captcha images
Open, HighPublic

Description

The captcha software should generate captchas in languages other than English at non-English projects, depending on the locale. I've seen some generated captchas at the Vietnamese Wikipedia that would definitely confuse Vietnamese-speakers (can't remember the words exactly), because of things like r's and n's smooshed up right next to each other, so it looks like an m, except to an English user who happens to know a word that has "rn" instead. The user might have to *guess* because the English words really don't follow Vietnamese spelling rules. We've recently had users complaining to the sysops of not being able to read captcha images, presumably for this reason.

An advantage to localizing the captchas would be that it might reduce the impact of spambots at non-English projects. As far as I know, there isn't yet a captcha-defeating bot that understands Vietnamese or Basque or Quechua.

For now, I'm only proposing localizing for most languages that use the Latin alphabet, because requiring users to respond to a captcha in Thai or Arabic would exclude a lot of legitimate interwiki users. And users of other scripts tend to have the means of entering in Latin-based characters. Also, for languages that use diacritical marks, we should generate the words with or without the marks (not sure which) and modify [[MediaWiki:Captcha-createaccount]], asking the user to enter in the word without diacritical marks of any kind.

Once Latin-based alphabets are out of the way, it'd be a good idea to localize for other writing systems as well, but provide a Latin-based alternative, per Neil Harris' suggestion [1].

These localized captcha strings should *not* be stored in the MediaWiki: namespace, nor anywhere easily accessible to the public, because bot writers could easily write language-aware bots using such information. For wordlists, we could start by using open-source lexicons, such as OpenOffice.org's [2]. We should also contact ambassadors of non-English projects, asking them for help compiling sufficiently long lists of their own.

[1] http://mail.wikimedia.org/pipermail/wikien-l/2006-March/042263.html
[2] http://lingucomponent.openoffice.org/spell_dic.html


Version: unspecified
Severity: normal
URL: https://www.mediawiki.org/wiki/CAPTCHA
See Also:

Event Timeline


mimouni.mohamed wrote:

If the captcha source code is in PHP, the functions that generate images
from character strings only support ANSI encoding. That means that
Arabic characters, for example, cannot be displayed.

adziura+wiki wrote:

I think that Polish users of Wikipedia want localized captcha images. It is
better for new users.

eip wrote:

This would be very useful in Russian Wikipedia too. Of course the words have to be in Cyrillic alphabet.

Removed URL since it was not relevant to this bug (probably due to a rebuilding of the archives)

I am surprised that it came up only now, but now there is demand for this in the Hebrew Wikipedia, too.

Created attachment 7775
(naive) patch to make captcha.py work with unicode

(In reply to comment #1)

If the captcha source code is in PHP, the functions that generate images
from character strings only support ANSI encoding. That means that
Arabic characters, for example, cannot be displayed.

With some very minor changes to the script this is not true. For example, I just generated a bunch of Hebrew captchas (just taking random words off the main page of [[he:]]) and some Ukrainian captchas (because it was the only non-Latin language that had a word list that's just an apt-get away).

My very minor changes included disabling the regex check that words don't match /[^a-z]/. Presumably other languages would need equivalent checks, and checks to avoid words with diacritical marks (since those would, I presume, be hard to see in captchas).
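A minimal sketch of what such an equivalent check could look like, assuming Python 3. The function name `is_plain_word` and the length limits are hypothetical and not taken from captcha.py; the idea is only to replace the ASCII-only `/[^a-z]/` test with a Unicode-aware one that also rejects letters carrying diacritics.

```python
import unicodedata


def is_plain_word(word, min_len=4, max_len=8):
    """Return True if `word` is made only of lowercase letters without diacritics.

    Hypothetical helper: the name and the length limits are assumptions.
    """
    if not min_len <= len(word) <= max_len:
        return False
    for ch in word:
        # str.isalpha() is Unicode-aware, so Hebrew or Cyrillic letters pass,
        # while digits, punctuation and stand-alone combining marks do not.
        if not ch.isalpha() or ch != ch.lower():
            return False
        # A letter that decomposes into base + combining mark carries a diacritic.
        # Caveat: this also rejects letters such as Cyrillic "й".
        if len(unicodedata.normalize("NFD", ch)) > 1:
            return False
    return True
```

The decomposition test is deliberately blunt. A real word list would probably need per-language exceptions for letters that are technically "base plus mark" but are ordinary letters of the alphabet.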

p.s. I don't know python, so the very minor changes in my example might not be the "proper python way".

Bug 19229 has been marked as a duplicate of this bug.

a.d.bergi wrote:

(In reply to comment #0)

For now, I'm only proposing localizing for most languages that use the Latin
alphabet, because requiring users to respond to a captcha in Thai or Arabic
would exclude a lot of legitimate interwiki users.

We also could use the uselang attribute (and user setting) instead of locale, then this wouldn't be a problem. But I guess the bigger problem then is to find a captcha generator for exotic alphabets.

xenondwb wrote:

I have an idea how this problem could be solved:

MediaWiki should have a default stock of words for when the wiki doesn't contain enough words (e.g. fewer than 500 words).
Then, every time a captcha should be displayed, a script fetches a random article and picks two random words from it. These words will be in the target language, because they come from articles in the language the user wants.
Then a script would place those two words onto an image, make them a bit unreadable and display them to the user.
The user would now have the task to solve the captcha.
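The word-picking step described above could be sketched roughly as follows, assuming Python 3 and that the article text has already been fetched as a plain string. The function name, the 4 to 8 letter limit and the `min_pool` threshold are hypothetical; rendering and distorting the image are left out.

```python
import random
import re

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters only


def pick_captcha_words(article_text, fallback_words, count=2, min_pool=500, rng=random):
    """Pick `count` distinct words from an article, topping up from a fallback list."""
    pool = {w.lower() for w in WORD_RE.findall(article_text) if 4 <= len(w) <= 8}
    if len(pool) < min_pool:
        # Too few candidates in the article: mix in the default word list.
        pool |= {w.lower() for w in fallback_words}
    if len(pool) < count:
        raise ValueError("not enough words to build a captcha")
    # sorted() makes the result reproducible for a seeded random generator.
    return rng.sample(sorted(pool), count)
```

For example, `pick_captcha_words(text, default_words, rng=random.Random(42))` returns two distinct lowercase words; with a small article most of them will come from the fallback list, which is exactly the predictability concern raised for small wikis.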

But there are some problems:

  • As mentioned: if the wiki doesn't have enough words, it can't create truly random captchas. So a default stock of words should probably be included, but this could be a design problem.
  • Non-Unicode characters would also be a problem. It should probably be coded from scratch, instead of merging five million totally different existing solutions.
  • For big pages this could be a performance problem.

And the biggest problem: it would take some time to create all this new code. Also, I don't know if that would really be better than the existing solution.

And there is one design consideration: this would only be a good solution for big wikis, because there it would be hard to predict the words selected for the captcha, which could be possible on smaller wikis.

sumanah wrote:

Adding i18n keyword.

everton137 wrote:

Hi, while working for WMF for the Wikipedia Education Program, I've seen a lot of new editors, most of them students, facing a lot of difficulties while editing the CAPTCHA in English.

I think this is a very important issue for Wikipedia in other languages. I've changed its importance to "high".

It's a shame that even single implementations are very backlogged.

Does the developer team really think that the Vector skin and a WYSIWYG editing interface will be the most relevant things for editor retention?

Somewhere I've recently said that the language barrier was solved on Wikimedia, resting only the non-Wikipedia projects issue. But unfortunately I was very wrong.

On the bug opening, the Wikimedia paid staff was very small. Now it's a bit larger. But still no single word from any tech-guys, neither the volunteers one...

[[:m:User:555]]

(In reply to comment #12)
Yes, WYSIWYG, article feedback and MoodBar are far more important than some key-issues. The reason? Here it is: Jimmy and the remaining board and Sue are native English speakers so it isn't prioritized. It's not what they are seeing when they are editing Wikipedia. We prefer designing a nice new en.wp main page investing thousands of dollars into questionable campus ambassadors, ...

So even lots of other simple bugs will be never fixed.

I'm terribly sorry to see the delay with this :(.

Well, just to be clear, we've not designed a nice new en.wp page - that's a community decision! - and of the 10 board members, half are ESL speakers. Localisation and services to non-enlang projects are things we're focusing more and more on; we've got a dedicated internationalisation team, for example.

On the rest of your examples - I think there's some confusion here as to who does what. Localisation and bug-fixing the "core" software is divided between the internationalisation team and the "Platform" sub-department of Engineering. Things like the visual editor or the feedback tool are the responsibility of the Features Engineering team. So there isn't really one set of things being prioritised by staffers over the other, because they're each handled by different sets of people :).

A more likely issue is that, well, things get lost in Bugzilla :(. Furthermore, there are a lot more bugs than there are developer hours to deal with them - take a look at https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=365 to see what I mean. Compared to the profile of the software, we really don't have a massive engineering team overall - and that's not down to the board, that's down to our comparatively small budget organisation-wide, which they can't really do anything about.

However! if you'll look above you'll see that Sumana (our awesome Engineering Community Manager) has added the localisation keyword, which should bring this problem to the attention of the localisation team, and I'm going to do my best to make sure they're reached - either to deal with the request, provide some kind of ETA on dealing with it or, if they can't solve the issue, explain what the problem is. They're great people, and I'm confident as both a staffer and a long-term editor that this will get resolved one way or another :).

Ni!

Thanks for the very informative message Oliver.

I think it is good for us to get really upset when a bug that *affects the
experience of every single new editor* in many Wikipedias has had no
meaningful progress after 6 years since being reported, despite several
comments here and even face-to-face to staff members.

At the same time, it is important for us to get the facts straight about who is
responsible for what, like you described.

However, "bugs get lost" is also not a good explanation for what goes on here.
It's not even an explanation at all.

There are only 17 Mediawiki bugs with equal or more votes than this one, and
that number only grows to 25 if considering every product on this bugzilla:
https://bugzilla.wikimedia.org/buglist.cgi?votes_type=greaterthaneq&query_format=advanced&list_id=132785&votes=24&resolution=---&resolution=LATER&resolution=DUPLICATE&product=MediaWiki

Some of those 17 don't even count as they are already solved or have equivalent
functionality implemented, but some partial issue keeps them from going away.

Yet some of those are, similar to this one, also in a completely stalled state
for no good reason, despite a lot of people contributing to point out how
important they are and suggest solutions. Red interwiki links is probably my
favorite (Bug #11).

Wikimedia's tech team needs to improve how they prioritize work based
on community input.

And the board is also at fault for not requiring or developing themselves a
clear policy about that.

My impression is that they might be comfortable relying mainly on commissioned
studies of usability and participation, overlooking that most of those are
statistically questionable or based on unrealistic assumptions. Not meaning
they are not useful, they are useful and necessary, just limited. They won't
reveal the whole story by themselves, and sometimes not even the crucial facts.

So here we are, despite continuous community input, six years into a relatively
simple bug that affects every single new editor of Wikipedia in several
languages.

Thanks again Oliver for replying and looking after, and Sumana, now let us
hope the right people get to read this.

Hugs,

Ni!

This is actually one of my prime concerns; that we prioritise primarily based on "how big a deal, technically, a bug is" rather than the potential impact on the community. Bugzilla has one metric, and it's largely used for technical importance. But I'm confident the new Bugmeister, whomever they will be, can start making progress in this area :). At the moment we're without a bugmeister completely (which may go some way to explaining how even highly-voted bugs are falling through the cracks, although I appreciate this is older than the bugmeister position).

Adding bug 32695 as blocker because it might be the solution, by fetching the correct Wikisource.

sumanah wrote:

(In reply to comment #16)

Wikimedia's tech team needs to improve how they prioritize work based
on community input.

Yes, the WMF absolutely does need to do better at incorporating community input into our work prioritization. Guillaume Paumier, Rob Lanphier, and I presented a talk about this a few weeks ago: https://wikimania2012.wikimedia.org/wiki/Submissions/Transparency_and_collaboration_in_Wikimedia_engineering and I know Oliver and other folks have talked about and worked on it as well, but there's a ways to go.

On that more general topic, I strongly recommend that you join the wikitech-ambassadors mailing list https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors and bring up your concerns from comment # 16, so we can talk about them in a group that includes more community members and Foundation folks, including Guillaume and Rob.

But on this particular issue (localising CAPTCHAs) I'm cc'ing Alolita Sharma, Director of Engineering (Internationalization and R&D), and Siebrand Mazeland, Product Manager of the Localisation team, hoping for their input.

Thanks, Al-Scandar Solstag!

Hmm I like the idea of comment 9.

Some issues:
*Swear words - people get angsty when "fuck", etc is in their captcha. (This is probably a minor consideration)
*complex characters - Unicode characters in and of themselves are not a problem. (Some wikis have words not in their native script, but that's the minority, and can be resolved with a "request new captcha") More concerning is Diacritics. Diacritics are small, and may be hard to see when messed with by the captcha algorithm (although a native speaker might know what the word is and be able to fill in the diacritics). I'm doubtful that a captcha of ɓ b will look very different.

However, with that said, perhaps we should just do some testing to see if that's really an issue. Maybe it's less of an issue to a non-native speaker than English captchas are.

*Actual coding - we'd need to be able to generate captchas from php, presumably in real time. Not a major issue, but requires coding efforts. (Or I suppose we could get the word list once, and generate the captchas one off with the current script)


We should also evaluate the effectiveness of our captchas. The captcha program was written a while ago. Since then there's been advances in getting text out of images. Lots of third party wikis report captchas not being all that effective against spam. Perhaps our captchas aren't actually doing anything.

(In reply to comment #20)

We should also evaluate the effectiveness of our captchas. The captcha program
was written a while ago. Since then there's been advances in getting text out
of images. Lots of third party wikis report captchas not being all that
effective against spam. Perhaps our captchas aren't actually doing anything.

AFAIK it's already proven to be completely broken, see http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/056078.html (maybe while implementing the proposed new method we could also get it to use the right dictionaries).
There's quite a chance that our captchas are discouraging only good faith editors, especially non-English speaking.

As part of an email conversation related to this topic, I made some mockups to illustrate some captcha ideas that could be less problematic for non-English speakers, improve the general UX, and rely on images from Commons.

Based on tagging parts of a panorama picture with the appropriate word (in the UI language or Basic English words).

Based on finding from a set of similar images the ones that fit a specific criteria (with an image describing also the criteria).

Based on finding the image that is different from a set of images.

These captchas will probably generate new problems on the technical side, require adjustments to reduce the chance of a machine solving them, or may just be unfeasible to generate, but I wanted to provide these ideas in case anybody else can use them as a base, improve on any technical weaknesses they may have, and make them at least as hard for a machine to solve as text-based captchas are.

A page at Mediawiki has been created to gather ideas and feedback: https://www.mediawiki.org/wiki/Requests_for_comment/CAPTCHA

As others have said on the mailing list, I fear such captchas would not only be easier for bots to solve than the current solution (once they've had a little time to adjust), but also would be harder to localize unless the number of such captcha challenges were extremely small.

Note that some users may not have appropriate keyboard to enter the captcha in their language. Aside from captcha generation in various languages, fuzzy comparison with the answer is needed as well.
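One way such a fuzzy comparison could look, as a sketch only: normalize both strings, then accept answers within a small edit distance. The function names and the `max_errors=1` tolerance are assumptions, and any tolerance makes the captcha somewhat easier for bots as well, so the threshold would need testing.

```python
import unicodedata


def normalize(text):
    """Trim, apply Unicode NFC and casefold so equivalent inputs compare equal."""
    return unicodedata.normalize("NFC", text.strip()).casefold()


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def answer_matches(given, expected, max_errors=1):
    """Accept the answer if it is within `max_errors` edits of the expected word."""
    return edit_distance(normalize(given), normalize(expected)) <= max_errors
```

With this, a user who cannot type "í" and enters "dobyti" for "dobytí" is still accepted, because the two differ by a single substitution.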

fyi there is a proposal from the Language team for a mentored project about

Multilingual, usable and effective captchas
http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Multilingual.2C_usable_and_effective_captchas

I have some reservations about featuring that project to Google Summer of Code or Outreach Program for Women participants, but I'm willing to be proven wrong. Reasons:

  • Unclear buy-in from the community or the maintainers. The whole CAPTCHA topic is messy, with several discussion threads, a RFC, a prototype, and potential plans. We don't have a clear plan for captchas. There hasn't been enough feedback about captchas based purely on images without any text, as this project proposes.
  • Bug 32695 - Review and Deploy Wikicaptcha. It is already there, waiting for feedback.
  • I'm not a CS anything and I could be perfectly wrong, but the project feels too ambitious for three months, both with the amount of work required and the skills needed.

With all this I see the risk of failure bigger than wished for a GSOC project, either because students will most likely lack the time/skills or because even a complete GSOC project would have a hard time ending up merged in our codebase.

Feedback welcome.

The key word in the gsoc proposal that I like is research. My problem with most captcha proposals is that they promote someone's pet idea without any citations to back up their theory.

This does seem to be much more research oriented than most gsoc projects.

Sure, research is great. But before proposing someone to do a 3 month research on this subject I would like to have confidence that this research is welcome and there is an interest from the MediaWiki / ConfirmEdit maintainers in changing the status quo.

Reading the feedback in various channels, it is easier to find disbelief in captchas as a solution altogether.

aalekh1993 wrote:

Over a period of a few months there has been active development of "Multilingual, usable and effective captchas" for GSoC 2014. But currently it seems that there is no primary technical mentor for the project. Therefore I ask all members to please consider becoming part of this project as primary technical mentor.

Let's move the GSoC 2014 discussion to

Bug 62960 - Prototype CAPTCHA optimized for multilingual and mobile

Change 121255 had a related patch set uploaded by Nemo bis:
Make captcha.py produce images in arbitrary language

https://gerrit.wikimedia.org/r/121255

Plans for the ultimate solution are being discussed at bug 62960.

In the meanwhile, as workaround, we're testing making images in all languages with words taken from Wiktionary. For technical details please read and comment on https://gerrit.wikimedia.org/r/121255
You can see samples at https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas

In my testing the images seem rather good, mainly depending on the availability of a good font. DejaVu is a well known high quality font covering most languages and DejaVuSans-Bold seems to work well for the languages it covers: https://sourceforge.net/p/dejavu/code/HEAD/tree/trunk/dejavu-fonts/langcover.txt

Caveats:

  • We still have to handle RTL languages. Results probably don't make any sense now.
  • We've not yet made the blacklist multilingual but it's not too hard, ignore the bad words if any.
  • We still have to figure out how to exclude confusable words. It's not impossible, there is a Unicode library for that (but not for python perhaps). See bug 63216.
  • Of 165 languages for which Amgine gave me "big" dictionaries, 20 were not in DejaVu and for 10 I used FreeSerif instead. Those are lower quality. We may end up using [[mw:ULS]] font repo with some hacks, if many languages need it; or we could just skip them: I wonder if a captcha in e.g. Gujarati or Japanese will ever make sense.
  • Security fixes demanded by http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf will be in a separate patch. They're several small things that someone familiar with PIL can do easily enough in the existing code. One of them is "printing" each letter separately with some aspect variations, which may solve some problems with ligatures too.
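To make the "confusable words" caveat concrete, here is a deliberately small Python sketch. It does not use the Unicode confusables data or any real library; the rewrite table is a hypothetical, hand-written stand-in covering a few Latin look-alikes such as the "rn" versus "m" case from the original report.

```python
# Hypothetical, hand-written table; real confusable data would be far larger.
CONFUSABLE_REWRITES = [("rn", "m"), ("vv", "w"), ("cl", "d")]


def skeleton(word):
    """Map a word to a form in which look-alike sequences are merged."""
    for src, dst in CONFUSABLE_REWRITES:
        word = word.replace(src, dst)
    return word


def drop_confusable(words):
    """Keep only words whose skeleton is not shared with a different word."""
    by_skeleton = {}
    for w in words:
        by_skeleton.setdefault(skeleton(w), set()).add(w)
    return [w for w in words if len(by_skeleton[skeleton(w)]) == 1]
```

For the list `["modern", "modem", "burn", "tree"]` this drops both "modern" and "modem", since they become indistinguishable once "rn" is read as "m", and keeps the other two.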

(In reply to Nemo from comment #32)

Plans for the ultimate solution are being discussed at bug 62960.

In the meanwhile, as workaround, we're testing making images in all
languages with words taken from Wiktionary. For technical details please
read and comment on https://gerrit.wikimedia.org/r/121255
You can see samples at
https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas

In my testing the images seem rather good, mainly depending on the
availability of a good font. DejaVu is a well known high quality font
covering most languages and DejaVuSans-Bold seems to work well for the
languages it covers:
https://sourceforge.net/p/dejavu/code/HEAD/tree/trunk/dejavu-fonts/langcover.txt

I think zh-* (Chinese) variants are mistakenly included. They are not claimed to be covered by the font, and many substitute squares (tofus) appear in your samples.

(In reply to Yusuke Matsubara from comment #33)

I think zh-* (Chinese) variants are mistakenly included.

Indeed; deleted. If someone thinks a captcha in CJK locales makes sense and/or has ideas on how to support them, please share.

(In reply to Siebrand Mazeland from comment #35)

ISO code "got" does not make sense, tofu at
https://www.dropbox.com/sh/i2af7xvn4y593gc/a6Kz0eSXZ4/captchas/got#f:image_5edb52ac_e04341dd3d25c8f8.png

Right. Maybe https://www.gnu.org/software/freefont/coverage.html lies? I'm getting more and more inclined to only use DejaVu. For the languages it doesn't support we'd need to ensure native speakers like the font (e.g. by using ULS fonts) but it's also quite hard to design image distortions that make sense with those scripts.
If you know one of the following languages please speak up!

  • bn Bengali
  • chr Cherokee
  • gu Gujarati
  • hi Hindi (Devanagari script)
  • mr Marathi (Devanagari script)
  • sa Sanskrit (Devanagari script)
  • ml Malayalam
  • si Sinhala/Sinhalese
  • ta Tamil
  • th Thai 1%

Missing in FreeFont too:

  • am Amharic
  • bo Tibetan
  • ja Japanese
  • km Central Khmer
  • kn Kannada
  • ko Korean
  • my Burmese (Myanmar)
  • pa Panjabi/Punjabi
  • te Telugu
  • ug Uyghur 87%
  • ur Urdu 92%

Sorry for the double message; another idea I had is that some of those languages don't have an OCR, as Wikisource folks painfully know (for instance Malayalam). Maybe for such languages we could just disable distortions, given bots are unlikely to parse them on their own anyway.
Cf. http://finereader.abbyy.com/recognition_languages/

I went through the pictures for CAPTCHAs in Hindi. They're mostly understandable, except that in a few of the images it's impossible to distinguish a character. Hindi has quite a few similar-looking characters differing just by a small line or a dot.

For example, the middle character is not-recognizable in https://www.dropbox.com/sh/i2af7xvn4y593gc/050a6S-21C/captchas/hi#lh:null-image_76947daa_e5d5575a79755d28.png

But mostly they read just fine.

I must say the Czech (cs) version is better than I’d expect. The only issue seems to be diacritics: especially the difference between i/í is practically indistinguishable after the distortion. For most words, you can probably tell from context, but in some cases, both versions would make correct words (e.g. https://www.dropbox.com/sh/i2af7xvn4y593gc/bSYQyGEMBH/captchas/cs#lh:null-image_6d80659d_e7c8421a61605559.png can be both “dobyti” and “dobytí”). Removing all words with “í” would probably be enough, ignoring the difference between “í” and “i” would be perfect, but I guess having some (low) nonzero expected error rate would be acceptable as well.
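A narrower alternative to removing every word with "í" would be to drop only the words that collide with another word once diacritics are ignored. A rough Python sketch, with hypothetical function names, assuming the word list is already loaded as a list of strings:

```python
import unicodedata
from collections import defaultdict


def strip_marks(word):
    """Remove combining marks after canonical decomposition (í -> i)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_diacritic_collisions(words):
    """Drop words that become identical to another word without diacritics."""
    groups = defaultdict(set)
    for w in words:
        # NFC first, so the same word in two encodings is not counted twice.
        groups[strip_marks(w)].add(unicodedata.normalize("NFC", w))
    return [w for w in words if len(groups[strip_marks(w)]) == 1]
```

With "dobyti" and "dobytí" both in the list, both are removed, while unambiguous words with diacritics stay in. This only helps where the surviving words can be resolved from context, as described above.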

For Vietnamese, 27 of the images contain a piece of tofu instead of a second word; 2 images contain more than one piece. It’s odd, because this font clearly supports the Vietnamese half of Latin Extended Additional. The high distortion is problematic and probably unnecessary, because Vietnamese OCR is still pretty rudimentary, with little support for diacritics. As it is, though, a different font may help with many of the following legibility challenges:

ú or ủ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/v_jkVCy5Xg/captchas/vi/image_6876ca11_cc6e08a95ea5b935.png

ẽ or ế?
https://www.dropbox.com/sh/i2af7xvn4y593gc/rfc7TwizAo/captchas/vi/image_432bfc9d_d02d9707bcb0a02b.png

If I didn’t know this font used two-story a’s, I’d see ã instead of ỗ:
https://www.dropbox.com/sh/i2af7xvn4y593gc/CGUSde4hfC/captchas/vi/image_5cbb4b12_976cd14e4e332a23.png

ú or ứ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/Svdiq4ZLS5/captchas/vi/image_c90d4c3d_6b4e7a877b3e79dc.png

d or đ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/Ah4yviImWT/captchas/vi/image_ae9020dd_0a618ab7494104fd.png

Tofu is because of things like [[wikt:裘]] and [[wikt:意見]] being in the dictionary. As with Malayalam issues reported on mailing list, I'm unsure how to handle such "extraneous" "words" for all languages; though in this and the Serbian's case we could "just" check the dictionary is in the main language's script (if we know the language code...).
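As a rough illustration of the script check, the following Python sketch uses the first word of each character's Unicode name as a stand-in for its script. That is an approximation, not the real Unicode Script property, and the `ALLOWED_SCRIPTS` table is a hypothetical example rather than actual configuration.

```python
import unicodedata

# Hypothetical mapping from language code to acceptable scripts.
ALLOWED_SCRIPTS = {
    "vi": {"LATIN"},
    "hr": {"LATIN"},
    "ru": {"CYRILLIC"},
    "sr": {"CYRILLIC", "LATIN"},
}


def scripts_in(word):
    """Collect approximate script names, e.g. "LATIN", "CYRILLIC", "CJK"."""
    scripts = set()
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.combining(ch):
            continue  # combining marks inherit the script of their base letter
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def word_fits_language(word, lang):
    """True if the word uses exactly one script and that script is allowed."""
    scripts = scripts_in(word)
    return len(scripts) == 1 and scripts <= ALLOWED_SCRIPTS.get(lang, set())
```

Under these assumptions "裘" is rejected for `vi`, a Cyrillic word is rejected for `hr` but accepted for `sr`, and a word mixing Latin and Cyrillic letters is rejected everywhere. Words containing hyphens or apostrophes are rejected too, which may or may not be desirable.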

About vi, I was reading earlier this morning on Gentium: «version of the font with redesigned diacritics (flatter ones) to make it more suitable for use with stacking diacritics, and for languages such as Vietnamese». http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_faq&_sc=1#5d25a5da
How many languages have such complex diacritics and is there some generic enough font? I doubt we can exclude words with diacritics, we'd only have 500 left out of thousands in vi's case. I'm uploading a new attempt with Arimo font, please check if it's any better.

(In reply to Nemo from comment #41)

Tofu is because of things like [[wikt:裘]] and [[wikt:意見]] being in the
dictionary. As with Malayalam issues reported on mailing list, I'm unsure
how to handle such "extraneous" "words" for all languages; though in this
and the Serbian's case we could "just" check the dictionary is in the main
language's script (if we know the language code...).

Yep, that’s what’s required for Vietnamese then.

About vi, I was reading earlier this morning on Gentium: «version of the
font with redesigned diacritics (flatter ones) to make it more suitable for
use with stacking diacritics, and for languages such as Vietnamese».
<http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_faq&_sc=1#5d25a5da>
How many languages have such complex diacritics and is there some generic
enough font?

Among Latin alphabets that we’ll be displaying, Vietnamese is a bit of a special case for stacking diacritics. GentiumAlt’s flatter diacritics allow it to fit Vietnamese on a standard-height line at the cost of some legibility. If anything, we need more exaggerated diacritics that can survive the distortions.

I doubt we can exclude words with diacritics, we'd only have
500 left out of thousands in vi's case.

Right, the whole point of this exercise is to include the diacritics. :-)

Hi,
Bengali (bn) text is not displaying properly. All the conjuncts are misplaced, which is why almost none of the images represents a real word. In some images (example: https://www.dropbox.com/sh/i2af7xvn4y593gc/7fTaoyiaSb/captchas/bn#lh:null-image_a060ec4f_d15f04bc689bb980.png) parts of the characters are missing because of the padding/border.

I am not sure whether it is a problem with the font or not. But if you can tell me the name of the font, I can test that.

Nasir Khan Saikat

(In reply to Nemo from comment #41)

I'm uploading a new attempt with
Arimo font, please check if it's any better.

Yes, it’s better. The only severe ambiguity I ran into was:

h or n? Knowing the word, it’s n, but it sure looks like h:
https://www.dropbox.com/sh/i2af7xvn4y593gc/-GesxDHeX9/captchas/vi-arimo/image_03be064f_70c0338194b8dca2.png

Another issue for Vietnamese: the ̃ and ̉ diacritics can look like each other when stacked over ̂ and distorted. The southern dialect merges the two tones into ̉, so southerners won’t always be able to rely on the words they know to resolve the ambiguity. I’ve asked the Vietnamese Wikipedia community for feedback on this issue: [[vi:Wikipedia:Thảo luận#Việt hóa các hình CAPTCHA]].

Finally, many Vietnamese Wikipedia users rely on an IME script embedded via a gadget, but gadgets are disabled at [[Special:UserLogin/signup]]. We’d need to port the (rather complex) IME to ULS to keep the signup form accessible. Otherwise, as others have mentioned on the mailing lists, there will have to be an option to fall back to an English CAPTCHA.

Suggestion regarding Bengali and similar: they do not have to be distorted as much. This because OCR for these scripts is less developed than OCR for Latin alphabet, and I doubt spammers will be willing to bother so much for relatively small Wikipedias. If we notice that the captchas are being ignored, more distortion could be added.

sumanah wrote:

Also see comments at http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/thread.html#644 about Swedish, French, Bengali, Romanian, and Catalan.

Croatian (hr) is _completely_ broken. For example, 80% of them (or so) is completely (or half) in Cyrillic. Some older people will be able to read it, but almost nobody in Croatia will be able to enter Cyrillic text, since that is not an official script here.

Serbian (sr) is strange too. I think both Latin and Cyrillic are official there, but isn't it strange to ask for people to change input method (from Latin to Cyrillic, and vice versa) in the middle of captcha, like here?

https://www.dropbox.com/sh/i2af7xvn4y593gc/ocKv1yBPuf/captchas/sr#lh:null-image_e373536e_2ecbd37b76d67185.png

The above is not the only example.

Bosnian (ba) has completely Cyrillic CAPTCHAs, as far as I can see, but according to Wikipedia "Standard Bosnian uses a Latin alphabet."[1]

Željko

1: https://en.wikipedia.org/wiki/Bosnian_language

ba is not Bosnian https://translatewiki.net/wiki/Portal:Ba

Thanks for these comments, but we're already aware of the mixed/wrong script issues: it was the first thing people brought to our knowledge, no need for more examples.
http://thread.gmane.org/gmane.org.wikimedia.mediawiki.i18n/846

As previously said (see comment 41), we'll rely on the ICU interface to Unicode data to remove mixed script and (where possible) secondary/wrong scripts for each language. Problems with the source dictionary (en.wiktionary.org) should be dealt by editing said wiki.

So far, the general sentiment from the Vietnamese Wikipedia community has been that the added difficulty of distinguishing diacritics vastly outweighs any readability improvements from using actual Vietnamese words instead of English words or random letters. Moreover, there is skepticism that the wiki even has a problem with CAPTCHA-solving bots. These are gut feelings rather than hard data, of course, but I can imagine a couple changes that would mitigate the community's concerns:

1a. Minimize or eliminate distortions in Vietnamese. High-quality OCR solutions like Google's already have enough difficulty with clear, undistorted Vietnamese text.
1b. Alternatively, strip diacritics *before* display and accept diacritic-less input. There would likely be no change in difficulty for bots, but Vietnamese users would still be able to employ their knowledge of Vietnamese spelling patterns.

2. Provide an option to solve a standard English CAPTCHA. (Not sure what the default should be.) Many websites that require CAPTCHAs offer some alternative for accessibility; Vietnamese CAPTCHAs with diacritics would be insurmountable to those with declining eyesight.
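Option 1b could look roughly like the sketch below, which uses only Python's standard `unicodedata` module. The function names `strip_diacritics` and `answers_match` are hypothetical, and the explicit handling of "đ" is an assumption about what Vietnamese users would expect to type.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks so that e.g. tone and vowel marks disappear."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # "đ"/"Đ" have no canonical decomposition, so they are mapped by hand.
    return without_marks.replace("\u0111", "d").replace("\u0110", "D")

def answers_match(expected, typed):
    """Compare the captcha word and the user's input, ignoring diacritics and case."""
    return strip_diacritics(expected).casefold() == strip_diacritics(typed).casefold()
```

The same function would be applied to the word before rendering the image, so the picture never shows diacritics, and again to the input, so a user who types them out of habit is not penalized.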

IMHO, for Thai language, the pictures are very blurred. Although some can be guessed easily, the rest needs a lot of effort. In some cases, it is impossible to determine the correct word at all.

Results from [[th:WP:HELPDESK#CAPTCHA]] from Thai Wikipedia: S: 0, O: 6, N: 0

Comments:

Nullzero: See the above comment

G(x): Too hard to read

Taweetham: (1) Too hard to read (2) Contains swear words (3) Not convenient for interwiki users (4) Thai language is complex. He doesn't know whether or not the software will generate words which are impossible to enter

BlackKoro: Unable to read

Lerdsuwa: Can't distinguish between "ท" and "ห", "ล" and "ส"

Aristitleism: (1) Very hard to read (2) Contains swear words (3) Contains some obsolete characters which no one uses anymore, such as "ฦ". It is also hard to find these obsolete characters on a Thai keyboard.

(In reply to Minh Nguyễn from comment #52)

Moreover, there is skepticism that the wiki even has a problem with CAPTCHA-solving bots. These are gut feelings rather than hard data, of course, but I can imagine a couple changes that would mitigate the community's concerns:

Could some hard data be found on whether or not the Vietnamese Wikipedia has ever had any problems with CAPTCHA-solving bots?

Serbian (sr) is strange too. I think both Latin and Cyrillic are official there, but isn't it strange to ask for people to change input method (from Latin to Cyrillic, and vice versa) in the middle of captcha, like here?

https://www.dropbox.com/sh/i2af7xvn4y593gc/ocKv1yBPuf/captchas/sr#lh:null-image_e373536e_2ecbd37b76d67185.png

The above is not the only example.

As I said before,

Note that some users may not have appropriate keyboard to enter the captcha in their language. Aside from captcha generation in various languages, fuzzy comparison with the answer is needed as well.

They may not have a keyboard to enter things in English, for that matter, no?

They may not have a keyboard to enter things in English, for that matter, no?

I never saw a computer without an English, or at least a Latin keyboard.

Perhaps an example of how I imagine this should work for Serbian is in order:

  • Captchas could be in Cyrillic, or Latin, or either, but not mixed Cyrillic and Latin.
  • Words with ŠĐŽČĆ characters should be excluded. This ensures that even users who have only English keyboard can easily type the captcha.
  • Cross-alphabet homographs with different meanings should also be excluded (e.g. Latin PECA, which looks the same as Cyrillic РЕСА (Latin RESA)). This isn't as scary as it sounds :)
  • Perhaps a note should be left somewhere saying that either script could be used to fill in the captcha.
  • When a user enters the captcha, both the captcha and the user input should be converted to Latin and compared.
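The last point, together with the ŠĐŽČĆ exclusion, can be sketched as follows. This is only an illustration under assumptions: the mapping is the standard Serbian Cyrillic to Latin correspondence, and the function names `to_latin`, `typable_on_english_keyboard` and `captcha_matches` are hypothetical. Homograph exclusion is not covered because it needs a dictionary.

```python
# Serbian Cyrillic -> Latin (Gaj's alphabet), lowercase only.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def to_latin(text):
    """Lowercase the text and convert any Cyrillic letters to Latin."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())

def typable_on_english_keyboard(word):
    """True if the Latin form has no š, đ, ž, č or ć (i.e. it is plain ASCII)."""
    return to_latin(word).isascii()

def captcha_matches(expected, typed):
    """Compare captcha and user input after converting both to Latin."""
    return to_latin(expected.strip()) == to_latin(typed.strip())

# A Cyrillic captcha answered in Latin is accepted.
accepted = captcha_matches("вода", "voda")
```

Because both sides are normalized to Latin before comparison, the user can answer in whichever script their keyboard is currently set to, regardless of the script shown in the image.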

Change 121255 had a related patch set uploaded (by Nemo bis):
Make captcha.py produce images in arbitrary language

https://gerrit.wikimedia.org/r/121255

Words with ŠĐŽČĆ characters should be excluded. This ensures that even users who have only English keyboard can easily type the captcha.

T65216 would address this issue by performing diacritic folding, similar to how MediaWiki’s search engine performs diacritic folding when matching page titles.

There are already some websites that use Chinese-based captchas, but personally I think it is a really, really bad idea. As someone who can read and write Chinese and is a native speaker of one of the Chinese languages, I find typing Chinese text into a computer very complicated, depending on the environment and the input methods available. In the worst case, I would have to use Google Translate or Google Search to find the matching characters on the internet and then copy them over in order to finish a localized captcha challenge. Please DO NOT implement such a troublesome thing.

FYI: My current opinion on this is that we should drop the wordlist thing. I think most of the time combining two words results in a string that is not recognizable as a word, and it probably helps computers more than humans.

That doesn't solve the broader problem of different scripts, but would at least be a bit more equal to Latin-script-based languages.

This task has had High priority for more than a decade, so I doubt it still matches "Someone is working or planning to work on this task soon." from the meaning of the priority levels. Should we downgrade it?