
Port djvutext.py to core
Closed, Resolved (Public)

Description

Might be non-trivial due to the dependency on a DjVu reading program ('djvused'), part of djvulibre-bin
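For context, the dependency is on djvused's small scripting commands: 'n' prints the page count, and 'select N; print-txt' dumps a page's hidden text layer. A minimal sketch of wrapping those two commands from Python (the helper names are hypothetical, and djvused must be installed for the subprocess calls to work):

```python
import subprocess

def djvused_args(djvu_path, command):
    # Build the argument list for a djvused invocation; djvused is the
    # scripting tool shipped in djvulibre-bin.
    return ["djvused", djvu_path, "-e", command]

def page_count(djvu_path):
    # 'n' is the djvused command that prints the number of pages.
    out = subprocess.check_output(djvused_args(djvu_path, "n"))
    return int(out)

def page_text(djvu_path, page):
    # 'select N; print-txt' dumps page N's hidden text layer as an
    # s-expression (words with their bounding boxes).
    cmd = "select {}; print-txt".format(page)
    return subprocess.check_output(djvused_args(djvu_path, cmd)).decode("utf-8")
```

The subprocess calls are the part that makes the script non-trivial to deploy: they fail with FileNotFoundError on any host where djvulibre-bin is not installed.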


Version: core-(2.0)
Severity: enhancement

Event Timeline

bzimport raised the priority of this task from to High. (Nov 22 2014, 3:14 AM)
bzimport set Reference to bz64853.
bzimport added a subscriber: Unknown Object (????).

I do not think this is really needed. With support for "preload" (see https://bugzilla.wikimedia.org/show_bug.cgi?id=58963), if an Index page is present on Wikisource, one can get page text via the API even if the page has not been created yet. This is what I usually do on en.WS.

Change 132816 had a related patch set uploaded by Mpaa:
Bug 64853 - Port djvutext.py to core

https://gerrit.wikimedia.org/r/132816

As I said above, I cannot see a use case. If one has a DjVu and wishes to upload the text to Wikisource, they will first create an Index linked to that DjVu.
Once that is done, the text can be fetched with the patch above directly from the site.

No dependencies, no errors, and pagination is handled by the Proofread extension.

jayvdb lowered the priority of this task from High to Low. (Dec 5 2014, 4:09 AM)
jayvdb set Security to None.
jayvdb removed a subscriber: Unknown Object (????).

I think it would still be useful to have a script which batch uploaded the OCR text for the entire work using the preload functionality added by @Mpaa.
But I would like to hear the opinions of active contributors from across all the Wikisources on that.
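Such a batch uploader would iterate over every Page: of an Index, fetch the preloaded OCR text, and save it where the page does not yet exist. A minimal sketch of the title enumeration, assuming the conventional 'Page:<file>/<number>' naming used by the Proofread extension (the file name in the example is hypothetical):

```python
def page_titles(index_file, total_pages):
    # Enumerate the 'Page:<file>/<number>' titles that the Proofread
    # extension uses for the individual pages of a DjVu-backed Index.
    return ["Page:{}/{}".format(index_file, n)
            for n in range(1, total_pages + 1)]

# A batch job would then, for each title, construct the page object in
# pywikibot, read the preloaded OCR text served by the Proofread
# extension, and save it back if the page does not exist yet.
```

For example, page_titles("Some book.djvu", 3) yields the three titles Page:Some book.djvu/1 through Page:Some book.djvu/3.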

Actually, one possible use case is when Page:s have been created from an old version of a DjVu file, a new DjVu with an improved text layer then becomes available, and the Page: content needs to be overwritten. Maybe quite a remote use case ...

...or not so remote if T34695 ever approaches reality. In general, the issue of syncing Wikisource <-> their DjVu source <-> the source digital library is IMHO the biggest open question in the Wikisource model.

This is bigger than enWS, and more than a policy matter at one site.

I believe that there is still a place for the script.

  1. The script should be considered more versatile than an English Wikisource tool. If the other WSes (or indeed other external sites) wish to utilise it, then go for it.
  2. If enWS received a perfect DjVu book that hadn't relied on OCR, then why wouldn't we want to have a bot do all the work?

I cannot speak for the internals of the Proofread application, or pywikibot, so you will need to determine that level of complexity.

> Actually one possible use case is when Pages have been created from an old version of djvu file, then a new djvu with an improved text layer and Page is available and content needs to be overwritten. Maybe quite a remote use case ...

That's bordering on laughable. In my early WS days, not knowing any better, I myself blindly bot-created entire thousand-page-plus Index:es of generally worthless OCR'd text, and I can attest that almost everyone with any time under their belt has done the same at some point. Skipping the deletion step currently needed to replace inferior OCR text after a source-file replacement would be useful, regardless of how many people actually go back and fix these 'poor decision' cases.

Change 210808 had a related patch set uploaded (by Mpaa):
Added DjVu class and djvutext.py in core

https://gerrit.wikimedia.org/r/210808

Change 210808 merged by jenkins-bot:
Added DjVuFile class and djvutext.py in core

https://gerrit.wikimedia.org/r/210808

jayvdb assigned this task to Mpaa.

Change 224199 had a related patch set uploaded (by John Vandenberg):
Add djvulibre-bin to travis apt package list

https://gerrit.wikimedia.org/r/224199

Change 224199 merged by jenkins-bot:
Add djvulibre-bin to travis apt package list

https://gerrit.wikimedia.org/r/224199