
Merge proofread text back into Djvu files
Open, Low, Public

Description

Author: vladjohn2013

Description:
Merge proofread text back into Djvu files

Wikisource, the free library, has an enormous collection of Djvu files and proofread texts based on those scans.

w:DjVu files include a hidden text layer. Typically a DjVu file starts out with a text layer consisting of w:OCR output, which Wikisource uses as the initial version of the transcription. Wikisource contributors then fix the OCR errors and save the corrections on the Wikisource project as wikitext until the transcription is accurate and complete. A tool is needed to create a new DjVu file containing the accurate and complete Wikisource transcription.
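For context, the hidden text layer that DjVuLibre works with is stored as nested s-expressions, which djvused can dump (print-txt) and write back (set-txt). A minimal sketch of the format (coordinates are xmin/ymin/xmax/ymax with the origin at the bottom-left corner of the page; the values here are invented for illustration):

```
(page 0 0 2550 3300
  (line 300 2900 2200 2980
    (word 300 2900 595 2980 "Merge")
    (word 620 2900 1450 2980 "proofread")))
```

A replacement layer in this shape could then be written back with something along the lines of `djvused -e 'select 1; set-txt page1.txt; save' book.djvu` (untested here, shown only to indicate the workflow).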

There are existing tools being worked on that extract the accurate and complete Wikisource transcription, typically exporting it as EPUB. However, they likely discard much of the information needed to recreate a DjVu file, most importantly the (x, y) position of each piece of text. They may also discard the page numbers.

There is some previous work on merging the proofread text as a blob into pages, and on finding similar words to be used as anchors for text re-mapping. Tools exist that work with the w:hOCR data, for instance hOCR.js by @Alex_brollo (the gadget author who has worked most with the DjVu layers), and Pywikibot's djvutext.py.

The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.
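Tesseract can report per-word bounding boxes and confidence values, for example via its TSV output mode. A minimal sketch of reading such output with only the standard library (the sample rows are inlined and invented so the sketch runs without Tesseract installed; the column layout matches what `tesseract page.png out tsv` produces):

```python
import csv
import io

# Invented sample mimicking Tesseract's TSV columns; a real run would
# read the file produced by `tesseract page.png out tsv`.
SAMPLE_TSV = """level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext
5\t1\t1\t1\t1\t1\t300\t2900\t300\t80\t96\tMerge
5\t1\t1\t1\t1\t2\t620\t2900\t420\t80\t91\tproofread
"""

def words_with_boxes(tsv_text):
    """Yield (text, (left, top, width, height), confidence) per word row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["level"] == "5":  # level 5 rows are words in Tesseract's hierarchy
            box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            yield row["text"], box, float(row["conf"])

words = list(words_with_boxes(SAMPLE_TSV))
```

These (word, box, confidence) triples are exactly the data that would need to be carried along when re-mapping the proofread text back into the DjVu layer.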

Project proposed by Micru. I have found an external mentor who can help with Tesseract; now I'm looking for a mentor who can provide assistance with MediaWiki.

URL: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
Skills: knowledge of the DjVu file format desirable; knowledge of how to build a web API on Unix; knowledge of Python; knowledge of the hOCR file format.
Mentors:

Details

Reference
bz57807

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:34 AM
bzimport added a project: ProofreadPage.
bzimport set Reference to bz57807.
bzimport added a subscriber: Unknown Object (MLST).

vladjohn2013 wrote:

This proposal has been listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects and we are filing a report to gather community feedback and share updates.

CCing Micru and Aarti, who are proposing this project. I'm not sure who Aubrey is:

"Aubrey can be a mentor providing assistance regarding Wikisource"

Created attachment 16531
simple DjVu test file

A simple .djvu test file whose embedded text layer is LINE-based instead of WORD-based. Provided to illustrate XML file generation (testme.xml).


Created attachment 16532
resulting XML file

Command line used to generate the XML file:

C:\Program Files (x86)\DjVuLibre>djvutoxml.exe testme.djvu testme.xml

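The XML that djvutoxml emits nests the hidden text as HIDDENTEXT > PAGECOLUMN > REGION > PARAGRAPH > LINE > WORD elements, with each word's box in its coords attribute. A minimal sketch of reading that structure with Python's standard library (the sample document is hand-written to mimic the nesting; real djvutoxml output carries a DOCTYPE and more attributes, and the coords value ordering should be checked against the DjVuLibre documentation):

```python
import xml.etree.ElementTree as ET

# Hand-written miniature of djvutoxml's element nesting (illustrative only).
SAMPLE_XML = (
    '<DjVuXML><BODY><OBJECT height="3300" width="2550">'
    '<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH><LINE>'
    '<WORD coords="300,2980,600,2900">Merge</WORD>'
    '<WORD coords="620,2980,1040,2900">proofread</WORD>'
    '</LINE></PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>'
    '</OBJECT></BODY></DjVuXML>'
)

def extract_lines(xml_text):
    """Return one list of (word, coords-string) pairs per LINE element."""
    root = ET.fromstring(xml_text)
    return [[(w.text, w.get("coords")) for w in line.iter("WORD")]
            for line in root.iter("LINE")]

lines = extract_lines(SAMPLE_XML)
```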

(In reply to vladjohn2013 from comment #0)

Merge proofread text back into Djvu files

. . . The idea is to create an
export tool that will get word positions and confidence levels using
Tesseract and then re-map the text layer back into the DjVu file. If
possible, word coordinates should be kept.

Isn't some of that already possible using DjVuLibre's built in DjVu-to-XML scheme? (See attachments)

As far as I can tell, this method was once feasible and pursued, then "abandoned" some 7+ years ago in favour of the current plain-text dump approach, apparently due to resource issues at the time. Most of the related bits still seem (to me) to be in place, going by what is found in https://git.wikimedia.org/tree/mediawiki%2Fcore

/includes/media/DjVu.php and;
/includes/media/DjVuImage.php

It seems (again, to me) that the first step on the path to making the proposal a reality is to see if it's still possible to actually generate an XML file from a DjVu file using the current state of MediaWiki et al. as it stands today. I know this is possible on a vanilla x86 local install of the DjVuLibre software package (refer to the attachments again)... but all that online server, Linux, Debian, Ubuntu stuff is beyond me, and something along those lines is what is in play here.

So: Can anyone successfully generate the DjVuLibre defined XML derivative from a .DjVu file using just the available mediawiki regime/scheme in place?

Wikimedia will apply to Google Summer of Code and Outreachy on Tuesday, February 17. If you want this task to become a featured project idea, please follow these instructions.

@Micru, @Aubrey, are you still interested in pursuing this task as a GSoC/Outreachy project?

This is a message posted to all tasks under "Re-check in September 2015" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

jayvdb set Security to None.
jayvdb updated the task description. (Show Details)

This is the last call for Possible-Tech-Projects missing mentors. The application deadline for Outreachy-Round-11 is 2015-11-02. If this proposal doesn't have two mentors assigned by the end of Thursday, October 22, it will be moved as a candidate for the next round.

Interested in mentoring? Check the documentation for possible mentors.

Just to let you know that I'm presently working on a different, but related, problem: building a "wikisource-like" djvu text editor. First results are very encouraging; here is a screenshot of my "djvu python ajax editor".

The first djvu with a text layer completely edited by this draft tool is this one: Atlantide (Mario Rapisardi).djvu.

Presently the goal of this tool is to edit the djvu text layer before it is used by Wikisource, but its engine can edit the djvu text layer while preserving its structure, so I can see that it will be possible to use it to re-upload text coming from Wikisource: I'll explore this as soon as I add some needed properties to the editor. It's only a matter of matching the djvu text against $(".pagetext").text() with some "aligning" tool.

As previously mentioned, this task is moved to 'Recheck in February 2016' as it doesn't have two mentors assigned to it as of today, October 23, 2015. The project will be included in the discussion of the next iteration of GSoC/Outreachy, and is excluded from #Outreachy-11. Potential candidates are discouraged from submitting proposals to this task for #Outreachy-11 as it lacks mentors in this round.

NOTE: This task is a proposed project for Google-Summer-of-Code (2016) and Outreachy-Round-12: GSoC 2016 and Outreachy round 12 are around the corner, and this task is listed as a Possible-Tech-Projects candidate. Projects listed for the internship programs should have a well-defined scope within the timeline of the event, a minimum of two mentors, and should take a senior developer about 2 weeks to complete. Interested in mentoring? Please add your details to the task description if not done yet. Prospective interns should go through the Life of a successful project doc to find out how to come up with a strong proposal.

Core code for this task already exists at https://github.com/wikisource/ocr-tools ; how to use the core code still needs to be documented. It will need to be glued into a container that can run on Tool Labs and provide a web interface and an API to start a task. The core code uses a Needleman-Wunsch algorithm to do a global alignment between an already existing hOCR-like text layer in a djvu file and the existing proofread text available on Wikisource. One of the tasks needed is to generate such a text layer using Tesseract if needed.
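For readers unfamiliar with the alignment step, it can be sketched in a few lines. This is an illustrative Needleman-Wunsch global alignment over word lists with made-up scoring values, not the actual ocr-tools implementation:

```python
def align(ocr_words, proofread_words, match=2, mismatch=-1, gap=-2):
    """Return a list of (ocr_word_or_None, proofread_word_or_None) pairs."""
    n, m = len(ocr_words), len(proofread_words)
    # score[i][j] = best score aligning the first i OCR words
    # with the first j proofread words
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if ocr_words[i-1] == proofread_words[j-1] else mismatch
            score[i][j] = max(score[i-1][j-1] + s,
                              score[i-1][j] + gap,
                              score[i][j-1] + gap)
    # Traceback from the bottom-right corner to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if ocr_words[i-1] == proofread_words[j-1] else mismatch):
            pairs.append((ocr_words[i-1], proofread_words[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((ocr_words[i-1], None))    # word dropped by proofreaders
            i -= 1
        else:
            pairs.append((None, proofread_words[j-1]))  # word added by proofreaders
            j -= 1
    return pairs[::-1]

pairs = align(["Thc", "quick", "brwn", "fox"],
              ["The", "quick", "brown", "fox"])
```

Each aligned pair lets the corrected word inherit the (x, y) box of the OCR word it replaces, which is the information an EPUB export throws away.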

I see the skills required include EPUB knowledge, and many micro-tasks are linked to this task; all of that seems unrelated to this task. For the skills needed, knowledge of the DjVu format, some knowledge of how to build a web API on Unix, and knowledge of Python should be enough.

Thanks a lot! Please fix the description accordingly. :)

@jayvdb, @Aubrey, you were listed as mentors of this project for Outreachy-11. A new round of GSoC '16 and Outreachy 12 has started. Are you willing to mentor the project in this round?

Just to let you know briefly the "state of the art" of my attempts:

  1. I have a rough but running "djvu editor" (based on a local server-client Python application; editing is done in a simple HTML page with JS tools, somewhat similar to the Wikisource nsPage edit environment);
  2. I'm trying some DIY tricks to align the djvu text layer with the Wikisource edited text, using the same "djvu editor" GUI and base scripts;
  3. I'm also testing something deeply different, i.e. uploading Wikisource code (raw or parsed into HTML) into a metadata text field of the djvu page.

Please consider that my skills are very limited: I can't publish code on GitHub, nor can I implement the server-client trick on Tool Labs; I'd be happy to zip and send the whole thing to anyone interested.

Per Brewster, the Internet Archive also has a routine to extract the OCR:

get .txt files from the _djvu.xml (which come from the abbyy.xml)

Worth checking whether this code is available in their public git repositories.

P.S.: https://github.com/rajbot/AbbyyToDjvuXml is linked from http://raj.blog.archive.org/2011/03/17/how-to-serve-ia-style-books-from-your-own-cluster/#comment-17944

Anyone willing to push this project for the current round of the Outreachy internship (Dec 6 to March 6)? The application period has already started and remains open until Oct 17.

Due to a lack of interest in mentoring this project, I am removing the OPP tag for now.