
Implement, Review and Deploy Wikicaptcha
Closed, Declined; Public

Description

Author: sumanah (@sumanah)

Description:
Idea: Write a version of reCAPTCHA (for use by ConfirmEdit) that uses document images that have been processed by MediaWiki's ProofreadPage extension for WikiSource. In other words, a CAPTCHA that feeds data to ProofreadPage to augment its OCR processing. Some existing code to build on: http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/thread.html#56121 (Neil Harris & #ConfirmEdit)
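A minimal sketch of how such a challenge could be scored, assuming the reCAPTCHA-style pairing of one control word (whose text is already known) with one unknown word cut from a scanned page. The names `check_response`, `consensus` and the in-memory `votes` store are hypothetical; none of them exist in ConfirmEdit or ProofreadPage.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: unknown-word id -> tally of human transcriptions.
votes = defaultdict(Counter)


def check_response(control_answer, control_truth, unknown_id, unknown_answer):
    """Pass the challenge only if the control word was typed correctly.

    When it was, record the answer given for the unknown word as one vote.
    """
    if control_answer.strip().lower() != control_truth.strip().lower():
        return False
    votes[unknown_id][unknown_answer.strip()] += 1
    return True


def consensus(unknown_id, min_votes=3):
    """Return the leading transcription once it has min_votes votes, else None."""
    if not votes[unknown_id]:
        return None
    text, count = votes[unknown_id].most_common(1)[0]
    return text if count >= min_votes else None
```

Only transcriptions that reach the (assumed) vote threshold would be candidates for feeding back into the OCR text; how that feedback reaches Wikisource is left open here.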


Version: unspecified
Severity: enhancement
URL: https://wikimania2012.wikimedia.org/wiki/Submissions/Wikicaptcha:_a_ReCAPTCHA-like_solution_for_Wikisource
See Also:

Details

Reference
bz32695

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 12:03 AM
bzimport set Reference to bz32695.
bzimport added a subscriber: Unknown Object (MLST).

This has been discussed a few times and a proof of concept was produced: http://lists.wikimedia.org/pipermail/wikisource-l/2011-February/000939.html
If I remember correctly, starting from a properly mapped DjVu it's not so difficult to identify the words which need to be checked, extract the corresponding (portion of) image and put the new text back in the DjVu.
It's way less obvious how to translate the activity on a Page: to the corresponding DjVu page and vice versa.

sumanah wrote:

Alex, is wikicaptcha, in its current form, ready for a deployment review? Or is it still in an experimental/prototype phase? It would probably be good to clarify that in the README at https://github.com/CristianCantoro/wikicaptcha .

Am cc'ing Andrea Zanni (Aubrey).

Thanks for working on this!

sumanah wrote:

Alex, it looks like WikiCAPTCHA awaits a design review https://www.mediawiki.org/wiki/WMF_Project_Design_Review_Process before we can move forward with deploying it on Wikimedia sites. Just wanted to let you know. Thanks.

This is a very nice idea! What is the status? Would a Google Summer of Code project help getting a MediaWiki extension running and polished, ready to be used in any MediaWiki enabled site?

Another question would be whether this extension would be put to use on Wikimedia sites.

If the idea makes sense and there is at least one mentor available I would like to push it as a candidate to

http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects

and move it to https://www.mediawiki.org/wiki/Summer_of_Code_2013#Project_ideas

(In reply to comment #3)

Alex, it looks like WikiCAPTCHA awaits a design review https://www.mediawiki.org/wiki/WMF_Project_Design_Review_Process before we can move forward with deploying it on Wikimedia sites. Just wanted to let you know. Thanks.

The code looks to be an early prototype. I only did a five minute read through but it looks to be a proof of concept, not a feature complete implementation.

Open questions about this whole idea:
*how would data propagate back to Wikisource.
*is this even effective as a captcha
**the dataset used to generate the images is publicly available. It is unclear that the dataset is large enough to stop someone from simply downloading the entire thing.
**an attacker could add entries to the dataset. I'm not sure how exploitable that is, but it's something that is concerning
**it's unclear this will actually prevent spam. Computers do not get bored. Even with 1% getting through, it would not be effective. This is using texts that OCR software marked as low confidence, which sounds significantly weaker than what reCAPTCHA does according to Wikipedia, and I've heard rumours that reCAPTCHA is not entirely effective. (Not sure if this is true).

To answer to the open questions:

(In reply to comment #5)

*how would data propagate back to Wikisource.

I don't see that it is practically possible to propagate data back to Wikisource. Rather, this would be used to perform initial OCR for Wikisource, perhaps primarily for works where machine-based OCR would be ineffective.

*is this even effective as a captcha

I don't see that it would be any less effective than the current captcha.

**the dataset used to generate the images is publicly available. It is unclear that the dataset is large enough to stop someone from simply downloading the entire thing.

The actual dataset used on Wikipedia doesn't need to be publicly available.

**an attacker could add entries to the dataset. I'm not sure how exploitable that is, but it's something that is concerning

I don't see how an attacker could add entries to the dataset. The actual dataset used on Wikipedia would probably be tightly controlled.

**its unclear this will actually prevent spam. Computers do not get bored. Even with 1% getting through, it would not be effective. This is using texts that

I don't see that it would be any less effective than the current captcha.

(In reply to comment #6)

**its unclear this will actually prevent spam. Computers do not get bored. Even with 1% getting through, it would not be effective. This is using texts that

I don't see that it would be any less effective than the current captcha.

Anything less than the current 25 % failure would be an improvement, though over 1 % a captcha is considered broken (according to the paper on [[mw:CAPTCHA]]).
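To make the 1 % threshold concrete, here is a small worked calculation. It reads the 25 % and 1 % figures quoted above as the share of automated attempts that get through; the attempt counts are invented for illustration only, and the function names are hypothetical.

```python
def expected_bypasses(attempts, pass_rate):
    """Expected number of automated attempts that get past the CAPTCHA."""
    return attempts * pass_rate


def chance_of_at_least_one(attempts, pass_rate):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - pass_rate) ** attempts


if __name__ == "__main__":
    # Hypothetical volume: 10,000 automated attempts, and a burst of 100 tries.
    for rate in (0.25, 0.01):
        print(rate, expected_bypasses(10_000, rate), chance_of_at_least_one(100, rate))
```

Even at a 1 % pass rate, a bot that simply retries is very likely to get through within a few hundred attempts, which is the point behind "computers do not get bored".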

This is a low priority roadmap feature, the Product and Design teams would welcome community support.

Please contact me for design review when prototype is ready to review by UX team.

I'm exploring a new and IMHO interesting path: to ignore the DjVu text layer and to parse the abbyy.xml file instead (both to extract the bare text layer and some interesting parameters). This file (really heavy and discouraging at first glance) is published by Internet Archive in its file download area.

The interesting thing is that this heavy file contains both the coordinates of words and an interesting 'wordPenalty' parameter, something like an "uncertainty score" for the whole word; there is also a character-by-character "certainty score".
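A minimal sketch of reading such a file, under stated assumptions: characters are stored as `charParams` elements carrying `l`, `t`, `r`, `b` coordinates, a `wordStart` flag, a `charConfidence` score, and the `wordPenalty` on the first character of each word. Only `wordPenalty` is named in this thread; the other attribute names and the tiny hand-written sample are assumptions, and real abbyy.xml files are namespaced and far larger.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample (assumed layout, not a real Internet Archive file).
SAMPLE = """<document><page><line>
<charParams l="10" t="5" r="18" b="20" wordStart="true" wordPenalty="0" charConfidence="95">d</charParams>
<charParams l="19" t="5" r="22" b="20" wordStart="false" charConfidence="93">i</charParams>
<charParams l="23" t="5" r="27" b="20" wordStart="false"> </charParams>
<charParams l="28" t="4" r="36" b="20" wordStart="true" wordPenalty="14" charConfidence="40">L</charParams>
<charParams l="37" t="8" r="44" b="21" wordStart="false" charConfidence="35">o</charParams>
</line></page></document>"""


def extract_words(xml_text):
    """Group charParams elements into words with a bounding box and scores."""
    words, current = [], None
    for el in ET.fromstring(xml_text).iter():
        # endswith() tolerates a namespace prefix on the tag name.
        if not el.tag.endswith("charParams"):
            continue
        char = el.text or ""
        if char.strip() == "":
            current = None  # whitespace closes the current word
            continue
        l, t, r, b = (int(el.get(k)) for k in ("l", "t", "r", "b"))
        if current is None or el.get("wordStart") == "true":
            current = {
                "text": "",
                "box": [l, t, r, b],
                "penalty": int(el.get("wordPenalty", "0")),
                "min_conf": 100,
            }
            words.append(current)
        current["text"] += char
        box = current["box"]
        # Grow the word's box so it covers every character seen so far.
        box[0], box[1] = min(box[0], l), min(box[1], t)
        box[2], box[3] = max(box[2], r), max(box[3], b)
        current["min_conf"] = min(current["min_conf"], int(el.get("charConfidence", "100")))
    return words


uncertain = [w for w in extract_words(SAMPLE) if w["penalty"] > 0]
```

The `box` of each uncertain word is what a later step would use to crop the matching region out of the page scan.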

I'm sharing scripts with http://www.mediawiki.org/wiki/User:Rtdwivedi, who is MUCH more skilled than me, since the idea is to upload the text layer from the abbyy.xml file and to wrap uncertain words in a span tag, making them easy to fix with VisualEditor. A test output of the extraction scripts can be seen on any page of http://it.wikisource.org/wiki/Indice:Ricordi_di_Londra.djvu, where words with a wordPenalty > 0 are red; unfortunately VisualEditor doesn't currently run on Wikisource, but you can test the resulting code with VisualEditor in a Wikipedia sandbox.
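The span-wrapping step could look roughly like this. It is a sketch only: the function name `mark_uncertain` and the CSS class `ocr-uncertain` are hypothetical, and the actual scripts shared with Rtdwivedi may work quite differently.

```python
def mark_uncertain(words, threshold=0, css_class="ocr-uncertain"):
    """Join (text, word_penalty) pairs, wrapping uncertain words in a span.

    A word counts as uncertain when its penalty is above `threshold`,
    matching the "wordPenalty > 0" rule used for the red words.
    """
    parts = []
    for text, penalty in words:
        if penalty > threshold:
            parts.append('<span class="{}">{}</span>'.format(css_class, text))
        else:
            parts.append(text)
    return " ".join(parts)


line = mark_uncertain([("Ricordi", 0), ("di", 0), ("Lonclra", 14)])
```

Styling the assumed class in red would reproduce the highlighting described above.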

I presume that similar scripts could extract lists of uncertain words and their images from the abbyy.xml file and the related scans, and feed a CAPTCHA engine.

My suggestion is to ask Rtdwivedi for comments; personally I consider myself curious, bold and sometimes lucky, but very far from a "programmer".

ellydwivedi2093 wrote:

Hi everyone,

As Alessandro said, the words taken from the DjVu layer for the CAPTCHA should be chosen on the basis of their confidence level. The confidence level of words would be decided by the ProofreadPage extension itself. Words with a high penalty would be used for the CAPTCHA. I would also suggest not using the words in their complete form, but mixing two high-penalty words together. At present, the ProofreadPage extension doesn't have the facilities to do so. The spell checker (which would use the word penalty) would be implemented after the integration with VisualEditor has been done.
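One possible reading of this suggestion, sketched under assumptions: words are dicts with a `penalty` field, a word qualifies when its penalty reaches a made-up cutoff, and a challenge shows two such words together. `pick_challenge` and `min_penalty` are hypothetical names; ProofreadPage offers nothing like this today, as noted above.

```python
import random


def pick_challenge(words, min_penalty=10, rng=None):
    """Pick two distinct high-penalty words to show together, or None."""
    rng = rng or random.Random()
    candidates = [w for w in words if w["penalty"] >= min_penalty]
    if len(candidates) < 2:
        return None  # not enough uncertain words to build a challenge
    return tuple(rng.sample(candidates, 2))


pool = [
    {"text": "Lonclra", "penalty": 14},
    {"text": "di", "penalty": 0},
    {"text": "ricorcli", "penalty": 22},
    {"text": "uiaggio", "penalty": 11},
]
challenge = pick_challenge(pool, rng=random.Random(0))
```

With two unverified words there is no known answer to check the response against, so how such a challenge would be validated remains an open question.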

Hello, this is a quasi-automated-but-not-really message:

I am reviewing all tracking bugs for extensions to review and deploy to WMF servers. See the list here:
https://bugzilla.wikimedia.org/showdependencytree.cgi?id=31235&hide_resolved=1

The [[mw:Review queue]] page lists the steps necessary to complete the review. I have copied them below and done some initial filling out based on what I can easily glean from this bug and any obvious linked sources. If I miss something or state something false, please do correct me.

Also, if you haven't yet done so, please review the information on and linked to from:
https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment

TODO/Check list

Extension page on mediawiki.org: no?
Bugzilla component: no?
Extension in Gerrit: in github, please transfer to gerrit
Design Review: not yet done (see comment 8)
Architecture/Performance Review: some
Security Review: no?
Screencast (if applicable): no
Community support: seems to be the initial beginnings of (at least some of the tech community)

Other than the obvious things above that are 'no's, what else can I/WMF help with here to move it along?

Is there a "working" prototype where the functionality can be tested (without setting up a development environment) so that Design can evaluate it?

(In reply to comment #12)

Is there a "working" prototype that the functionality can be testing somewhere (without setting up a development environment) that Design can evaluate.

I'm not exactly sure why a design review would be needed at this stage. The design is probably going to look very much like the current captcha, since the proposal is mostly about replacing the backend, not the front end.

/me still thinks my questions in comment 5 aren't sufficiently answered. I'd like answers to the tune of "we know this will be a good idea because of X", not "we think we couldn't possibly do worse than the current system, because the current system sucks so much" (which I wouldn't bet on). Heck, I'd even settle for a concrete description (something that could actually be evaluated) of what the folks working on this even plan to do.

I don't know if that was clear enough, but there isn't any developer working on this project. The contributions Cristian and Alex can make are what they already did and mentioned: making a proof of concept and investigating specifications for interaction with Wikisource, DjVu and so on.

vladjohn2013 wrote:

Hi, this project is still listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Multilingual.2C_usable_and_effective_captchas

Should this project be still listed in that page? If not, please remove it. If it still makes sense, then it could be moved to the "Featured projects" section if it has community support and mentors.

aalekh1993 wrote:

Hello,
I have been a frequent contributor to MediaWiki and, as part of that, I am looking for a project for the upcoming Google Summer of Code 2014. As mentioned in bug 32695 and the prototype developed by Pginer, my idea for the project is:

""Develop a CAPTCHA service using Wikimedia Commons images. The service will consist of a custom question built from two random keywords, 1 and 2. Once random keyword 1 is selected, a question will be generated along with images fetched from the Commons database; there will also be a few images for keyword 2 that are not related to the question. We can also make use of image annotations as mentioned here (https://commons.wikimedia.org/wiki/File:Vitraux_de_la_basilique_Notre-Dame,_Genève_23.jpg).""

I therefore request you all to comment on my idea for the project, as I am really interested in working on this challenging but amazing project :) .

Eagerly Waiting for your reply.

(In reply to Aalekh Nigam from comment #17)

I therefore request you all to place comment into my idea regarding the project as i am really interested to work for this challenging but amazing project :) .

This should probably go to a wiki page where you explain your idea and where people could comment. Bugzilla might not be the best place for a lengthy discussion. Feel free to paste a link here as a comment.

Also, Aalekh, this bug is about Wikisource (scanned books) images, a CAPTCHA from Commons images would need a separate bugzilla report.

aalekh1993 wrote:

Actually this was a simple idea for a way to handle the project. Since Commons is part of Wikimedia, my idea is that it might act as a database for the various captcha options, as mentioned by pginer in http://pauginer.tumblr.com/post/33445896205/captcha-ideas

Aalekh, your proposal is still missing in Google Melange. Please submit it there as a draft linking to your wiki page. In any case, we will evaluate your proposal in mediawiki.org. Thank you!

aalekh1993 wrote:

Over the past few months there has been active development of "Multilingual, usable and effective captchas" for GSoC 2014. But currently it seems that there is no technical or primary mentor for the project. Therefore I request all members to please consider becoming part of this project as primary technical mentor.

Let's move the GSoC 2014 discussion to

Bug 62960 - Prototype CAPTCHA optimized for multilingual and mobile

I'm not very familiar with DjVu and Wikisource in general, but I'm interested in bringing this project forward.
I got https://github.com/CristianCantoro/wikicaptcha copied to Gerrit as http://git.wikimedia.org/summary/?r=labs/tools/wikicaptcha, but now I'm not so sure it should be under labs/tools.
If the idea has not been superseded, I'm going to work on it after https://gerrit.wikimedia.org/r/180741/.

Great! Let us know here how you can be helped.

I didn't claim this task because that "review and deploy" scares me :-)

Great! Let us know here how you can be helped.

I'd like to hear anyone's opinion about whether Tool Labs is the appropriate place for hosting the project; I probably didn't think of that when requesting the Gerrit repository.
But most importantly, I'd like to know where and at which stage the DjVu files are supposed to be created/hosted/processed, and we need a workflow design on how to import 'suggestions' into Wikisource.
And, of course, code review is welcome ;-)

He7d3r added a project: ProofreadPage.
He7d3r set Security to None.
He7d3r added a subscriber: sumanah.
He7d3r renamed this task from Review and Deploy Wikicaptcha to Implement, Review and Deploy Wikicaptcha. Mar 28 2015, 2:33 PM
He7d3r added a subscriber: sumanah.

I'd like to hear anyone's opinion about whether Tool Labs is the appropriate place for hosting the project; I probably didn't think of that when requesting the Gerrit repository.

I'm not sure exactly what "the project" means here (more on this below). In general, no production service can live on Labs. For example, if we wanted to replace the CAPTCHA being used at https://en.wikipedia.org/wiki/Special:CreateAccount, it would have to live on Wikimedia production servers. It could not live on Labs. (Though keeping a development copy of the code on Labs would be perfectly acceptable, of course.)

But most importantly, I'd like to know where and at which stage the DjVu files are supposed to be created/hosted/processed, and we need a workflow design on how to import 'suggestions' into Wikisource.

Before investing a lot of time coding, I suggest drafting a request for comments on mediawiki.org that outlines the hard and soft requirements, possible implementations and workflows, and potential technical concerns.

Ricordisamoa's questions seem to be for Wikisource more than for devs. An RfC might be useful once it's clear what would be acceptable for Wikisource, not before.

As for the answers: I don't know. Discussion is easier when there is an idea of how to make the sync feasible. It's possible that standards such as METS may be useful.

I would rate this very low in Wikisource's priorities. There are tools that we need; we should not be diverted by some scheme to recreate an ill-defined wikicaptcha purpose.

I would rate this very low in Wikisource's priorities.

Well, the Italian Wikisource disagrees.

@Ricordisamoa can you tell us what the status of this is? is the code usable? Is there a demo somewhere? How complete is the current implementation, what are the known issues?

@Ricordisamoa can you tell us what the status of this is? is the code usable? Is there a demo somewhere? How complete is the current implementation, what are the known issues?

I did not author the original implementation. I didn't look at the code, with the exception of conventions and style.
With the goal of deploying to production, even a full rewrite with more PHP might be worth it. But we need manpower.

Updating assignee to reflect reality.

This RFC seems to be stalled. If there is currently no interest in driving this further, it should for now be removed from the RFC work board.
If there is interest in continuing the RFC process, please let us (TechCom) know who will be working on this RFC, who commits to implementing it if approved, and in what time frame.

Removing from TechCom-RFC workboard as it remains stalled.

Username_Needed changed the task status from Open to Stalled. Jan 18 2019, 9:45 AM
Username_Needed added a subscriber: Username_Needed.

Still seems to be stalled, marking it as such

T94186 has been declined and there seems to be no interest in this specific implementation. Reflecting status.
See e.g. instead T250227 for other approaches, and T241921 for broader discussion on this convoluted topic.