Author: vladjohn2013
Description:
Merge proofread text back into Djvu files
Wikisource, the free library, has an enormous collection of Djvu files and proofread texts based on those scans.
w:DJVU files include a text layer. Typically a DjVu file begins with a text layer that consists of w:OCR text, which Wikisource uses as the initial version of the transcription. Wikisource contributors then 'fix' the OCR errors and save the corrections onto the Wikisource project as wikitext, and eventually the transcription is accurate & completed. A tool is needed to create a new DjVu file with the accurate & complete Wikisource transcription.
There are existing tools being worked on that extract the accurate & complete Wikisource transcription, typically exporting it as EPUB. However they likely discard a lot of useful information that is needed to recreate a DJVU file, most importantly the (x,y) positions of each piece of text. They may also discard the page numbers.
There is some previous work about merging the proofread text as a blob into pages, and also about finding similar words to be used as anchors for text re-mapping. Tools exist which work with the w:hOCR data, for instance hOCR.js by @Alex_brollo (the gadget author who worked most with the DjVu layers), and Pywikibot 's djvutext.py.
The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.
Project proposed by Micru. I have found an external mentor that could give a hand on Tesseract, now I'm looking for a mentor that would provide assistance on Mediawiki.
URL:https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
Skills: knowledge of DjVu file type desirable, knowledge of how to build a web api on unix, knowledge of python, knowledge of hocr file format.
Mentors: