
Text layer of DjVu files doesn't appear in Page namespace due to higher memory consumption after upgrade to Ubuntu 12.04
Closed, Resolved, Public

Description

Author: zorktha

Description:
The text layer of DjVu files doesn't appear when creating a new page in the Page namespace.

Example : https://fr.wikisource.org/wiki/Livre:Barr%C3%A8s_-_Une_journ%C3%A9e_parlementaire_-_com%C3%A9die_de_m%C5%93urs_en_trois_actes_%281894%29.djvu

The problem appeared a few days ago, just after the last code update.


Version: wmf-deployment
Severity: normal

Details

Reference
bz42466

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:00 AM
bzimport set Reference to bz42466.
bzimport added a subscriber: Unknown Object (MLST).

The same trouble has been reported on en.ws, e.g. http://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Anyone_having_trouble_pulling_text_layers.3F

There has been no change to the extension regarding the text layer for a while. Does anyone know if something changed in the text layer extraction code in MediaWiki, or in something related to caching the text layer with the file metadata?

I increased the priority to high/major, as no one can work on fr.ws with any newly uploaded DjVu.

baltoslavic wrote:

This seems to affect the English, French, and Latin Wikisource projects, so presumably it is affecting *ALL* Wikisource projects. For DjVu files newly uploaded to Commons (since the update), we cannot access the text layer component of the file.

Normally, uploaded DjVu files contain a text layer that appears in the edit window when editing in the Page namespace, but for the past week, no such text has been appearing whenever anyone edits. This feature is used on Wikisource to convert the DjVu into wikitext, so without access to the text layer, work on all the Wikisource projects will grind to a halt. For example, the English Wikisource has had to abandon its plans for the December collaboration, because we won't be able to pull the text from the file we were going to upload.

Until this bug is corrected, all Wikisource projects will be unable to begin any new texts from DjVu source files.

After some investigation, this bug is not caused by ProofreadPage but by core or DjVuLibre.

As written before, there have been no recent changes to

mediawiki/core/includes/media/DjVu.php
mediawiki/core/includes/media/DjVuImage.php
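For context, as later comments confirm, the extraction path these files implement comes down to shelling out to DjVuLibre's djvutxt and keeping whatever text it prints. Below is a minimal sketch of such a call, not the actual DjVuImage.php code; the function name and error handling are illustrative only.

    <?php
    // Minimal sketch, not the actual DjVuImage.php code: pull the hidden text
    // layer out of a DjVu file by shelling out to DjVuLibre's djvutxt, which
    // prints the plain-text layer to stdout.
    function extractDjvuTextLayer( string $djvuPath ): ?string {
        $cmd = 'djvutxt ' . escapeshellarg( $djvuPath ) . ' 2>/dev/null';
        $text = shell_exec( $cmd );
        // Empty output means djvutxt failed or the file has no text layer.
        return is_string( $text ) && $text !== '' ? $text : null;
    }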

(In reply to comment #3)

this bug is not caused by ProofreadPage but by core or djvulibre.

(In reply to comment #1)

Does anyone know if something changed in the text layer extraction code in MediaWiki,
or in something related to caching the text layer with the file metadata?

(In reply to comment #2)

Until this bug is corrected, all Wikisource projects will be unable to begin
any new texts from DjVu source files.

Tentatively blocking bug 38865.

(In reply to comment #2)

Until this bug is corrected, all Wikisource projects will be unable to begin
any new texts from DjVu source files.

Well, not to minimize this bug, but that's not true; it's only that they won't be able to rely on the text layer. Frequently, the layer is such crap, especially on older texts, that this has no practical effect. Furthermore, we have our own OCR tool that can be used on the fly with a gadget, implemented as a button above the edit box, that is turned on by default (i.e. IPs can use it). For example, I just generated https://fr.wikisource.org/wiki/Page:De_la_D%C3%A9monomanie_des_Sorciers_%281587%29.djvu/141 using that tool. Considering that's a 16th-century work, that's about as good as I'd expect from the text layer associated with the DjVu. Furthermore, the text layer can be copy-pasted in, or even botted in. The text layer on this particular work is about equal to what the tool generated, and presumably the folks at IA were able to optimize ABBYY FineReader 8.0 for the language and type, unlike the built-in tool, which I think still uses Tesseract.

This isn't critical either: there is no internal data loss, which is part of our definition of critical. It's just a loss of function. The text layer is still there in the file on Commons.

I'm not saying this isn't an important bug; I'm saying that if you're a wikisourcerer, don't feel tied to the text layer that comes with a DjVu or PDF.

baltoslavic wrote:

(In reply to comment #5)

Well, not to minimize this bug but that's not true, it's only that they won't
be able to rely on the text layer. Frequently, the layer is such crap,
especially on older texts, that this has no practical effect. Furthermore, we
have our own OCR tool...

The OCR text layer that comes along with a file uploaded from a source such
as the Internet Archive is far superior to what gets generated by our OCR tool.
If we have to rely on our own OCR tool, it will greatly increase the work that
has to be done cleaning up problems generated by the OCR. Our own tool is
prone to far more stupid mistakes.

And we don't do many of the "older texts" (17th century and earlier), at least
not on the English Wikisource.

The community also feels (has expressly stated and agrees) that we should not
work on newly uploaded texts until the bug is corrected, because we can't judge
the accuracy of a match against the text layer, nor can we spot problems in a
page to text match for edited files. As one member put it:

"[Our OCR tool is] only intended for one-off use when a single page is missing
text or has a very poor text-layer and not for every page in a work that already
has a text-layer."

And another:

"As history as taught us many times before - if you work a file while in a state
of error, you might not like the result of your misplaced efforts once the error is resolved."

So, whatever you might think about it, the problem is choking off community work.
The relevant discussion thread is in Wikisource's Scriptorium under the headers
"Anyone having trouble pulling text layers?" and "Index text pages".

Match and Split appears to function.

Match and Split doesn't rely on MediaWiki but extracts the text layer directly from the DjVu.

My only point was that it is still finding the layer; the data isn't gone, it's just not being found by MediaWiki.
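As a rough illustration of what extracting the text layer directly from the file looks like outside MediaWiki: djvutxt accepts a page selector, so a tool such as Match and Split can pull the text of one scan page at a time. This is only a hypothetical sketch, not the bot's actual code.

    <?php
    // Hypothetical sketch: fetch the text layer of a single page straight from
    // the DjVu file using djvutxt's --page option.
    function extractDjvuPageText( string $djvuPath, int $page ): string {
        $cmd = sprintf(
            'djvutxt --page=%d %s 2>/dev/null',
            $page,
            escapeshellarg( $djvuPath )
        );
        return (string) shell_exec( $cmd );
    }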

A person who cannot access bugzilla has commented ...

Though the code involved for PDF and DjVu files is completely different when it comes to text dumping, this could mean the bug is not wmf-code-update related at all but DjVuLibre specific -- especially in light of the fact that none of the DjVu-related PHP code has been touched for some time now. Errors after a main update, where none of the affected code has changed in the interim, make me think the existing software has become outdated compared to the common programming applied today.

Also......

Code: DjVuImage.php

Line 295 - possible incorrect file path
    current: "pubtext/DjVuXML-s.dtd"
    suggested: "share/djvu/pubtext/DjVuXML-s.dtd"

and points to the djvu updates.

Might that be a reasonable thing to do anyway? Can it be done and tested on the appropriate test server?

After some tests with the help of phe, we found that the issue is caused by increased memory consumption in the djvutxt version packaged in 3.5.24. So the 100 MB memory limit of the Wikimedia servers made the script fail. The new djvutxt needs at least 300 MB.
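For anyone trying to reproduce this, the key ingredient is the per-command virtual memory cap applied to shell calls, for example via ulimit. The sketch below is illustrative only; the numbers come from this comment, and the wrapper is not the one used in production.

    <?php
    // Rough reproduction sketch: run djvutxt under a virtual-memory ceiling.
    // ulimit -v takes KiB, so 102400 approximates the 100 MB production limit.
    function runWithMemoryLimit( string $cmd, int $limitKiB ): array {
        $wrapped = sprintf( 'ulimit -v %d; %s', $limitKiB, $cmd );
        exec( $wrapped, $output, $exitCode );
        return [ implode( "\n", $output ), $exitCode ];
    }

    // Under ~100 MB the djvutxt shipped with DjVuLibre 3.5.24 fails;
    // with ~300 MB it succeeds.
    list( $text, $status ) = runWithMemoryLimit(
        'djvutxt ' . escapeshellarg( 'Example.djvu' ), 102400
    );
    echo $status === 0 ? $text : "djvutxt exceeded the memory limit\n";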

It looks like the trouble comes from the recent upgrade to Ubuntu 12.04. It's not caused by DjVuLibre itself, as the same version of the tool uses less than 60 MB on a Slackware box. Perhaps it is locale-file related, since locales nowadays use a lot more virtual memory than the old way.

Moving to "Wikimedia" product and removing blocking 38865 as per last comment, as this seems to be related to the server upgrades to Ubuntu 12.04.

Aaron: Do you have an idea who could look into this, by any chance?

I've uploaded a patch that increases the memory limit of the djvutxt call. This solves the bug on my Fedora 16 (with the same version of DjVuLibre as Ubuntu 12.04): https://gerrit.wikimedia.org/r/#/c/36632/

This patch has been deployed on the production cluster. Extraction of the text layer works now. I'm leaving this bug open because it would be interesting to know why djvutxt uses so much memory.

Tpt and I agree it would be a good idea to use a constant. This would allow adjusting the limit during the time needed to track down the issue.

Gerrit change 37495.
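A minimal sketch of the constant idea, assuming the limit lives in one named place so it can be tuned while the root cause is investigated; the constant name and value are illustrative, not the ones from the Gerrit change.

    <?php
    // Illustrative only: a single named constant for the djvutxt memory cap,
    // so the ceiling can be raised or lowered while the issue is tracked down.
    const DJVUTXT_MEMORY_LIMIT_KIB = 300 * 1024; // ~300 MB of virtual memory

    function djvutxtCommand( string $djvuPath ): string {
        return sprintf(
            'ulimit -v %d; djvutxt %s',
            DJVUTXT_MEMORY_LIMIT_KIB,
            escapeshellarg( $djvuPath )
        );
    }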

[Workaround found => removing blocking 38865]

sumanah wrote:

Both of Tpt's changes are now merged; is the problem still affecting Wikisources?

@Sumana Extraction of the text layer works fine on the Wikisources now, but we have kept this bug open because the increase in djvutxt's memory consumption is very strange.

wikisourcerer wrote:

Let's back up a bit before my head explodes....

First - a DjVu is nothing more than a glorified zip file that archives a bunch of other stand-alone [indirect] djvu files - an index "file" within directing the order viewed, any annotations, embedded hyperlinks, shared dictionaries, typical metadata, coordinate mappings of text layers, images, etc., for all the DjVus within it as a single [bundled] DjVu file. The premise behind the DjVu file format is largely mirrored by the Index: and Page: namespaces on Wikisource today.
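For the curious, the bundled structure described above can be inspected with DjVuLibre's djvused, whose ls command lists the component files inside a DjVu. A small sketch, with the file name being just an example:

    <?php
    // Sketch: list the component files bundled inside a DjVu via djvused.
    // The listing includes page components, shared dictionaries, and so on.
    echo shell_exec( 'djvused ' . escapeshellarg( 'Example.djvu' ) . ' -e ls' );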

Why it was treated like an image file rather than an archive file from day one around here I'll never quite understand (I can peek at a single .jpg or .txt file compacted within a .zip file without having to extract/deflate the entire .zip archive to do it, and it doesn't re-classify the .zip file as a pic or a doc file just because I can... So???... WtF???... but I digress).

The point I'm trying to make is DjVus were never meant to be anything more than a quick and easy, compact alternative to PDF files (a hack). THAT is why there will always be issues ....

https://bugzilla.wikimedia.org/show_bug.cgi?id=8263#c3

https://bugzilla.wikimedia.org/show_bug.cgi?id=9327#c4

https://bugzilla.wikimedia.org/show_bug.cgi?id=24824#c10

https://bugzilla.wikimedia.org/show_bug.cgi?id=30751#c3

https://bugzilla.wikimedia.org/show_bug.cgi?id=21526#c16

https://bugzilla.wikimedia.org/show_bug.cgi?id=28146#c4

https://bugzilla.wikimedia.org/show_bug.cgi?id=30906#c0

<<< and I'm sure there are more; it's my 1st day; sorry >>>

... with the current "plain text dump" approach over the never fully developed extract & parse approach. An XML of the text layer generated via OCR is how Archive.org does it, and that is how we should be doing it too. Once the text is in XML form, you can wipe it from the DjVu file on Commons (leaving nothing but the image layers to pull thumbnails from) until, at the very least, it's fixed up by the Wikisource/Wikibooks people, if not just by bot, for reinsertion if need be.

Someone needs to revisit DjVuImage.php and finish off the extract & convert/parse to/from XML development portion [DjVuLibre?] abandoned or whatever because "it was too slow" six years ago. The current bloated text dump will still be there to fall back on.
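A hedged sketch of what that XML route could look like with tools DjVuLibre already ships: djvutoxml emits a DjVuXML document whose hidden text layer can then be parsed page by page. The element names follow the DjVuXML-s DTD mentioned earlier; the function name and file name are illustrative only.

    <?php
    // Sketch of an extract-and-parse approach using DjVuLibre's djvutoxml.
    // The LINE elements of the DjVuXML hidden-text layer carry the OCR text.
    function djvuTextLayerAsXml( string $djvuPath ): ?DOMDocument {
        $xml = shell_exec( 'djvutoxml ' . escapeshellarg( $djvuPath ) . ' 2>/dev/null' );
        if ( !is_string( $xml ) || $xml === '' ) {
            return null;
        }
        $doc = new DOMDocument();
        // Only the tree is needed, so warnings about the external DTD are ignored.
        return @$doc->loadXML( $xml ) ? $doc : null;
    }

    // Example: dump the hidden text layer, one line of OCR text per line.
    $doc = djvuTextLayerAsXml( 'Example.djvu' );
    if ( $doc !== null ) {
        foreach ( $doc->getElementsByTagName( 'LINE' ) as $line ) {
            echo trim( $line->textContent ), "\n";
        }
    }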

@George Orwell III
Yes, you are right, but I'm not sure this bug is the best place for that comment, as the topic of the bug is a very specific problem (the increased memory consumption of the djvutxt version in Ubuntu 12.04).
I think you should open a new bug about the XML text layer and copy/paste your comment there.

sumanah wrote:

What is the current status regarding memory consumption? Has it come down to a sustainable and serviceable level?

I am not seeing this behaviour, can we call this resolved, or at least no longer an issue?

jayvdb claimed this task.
jayvdb subscribed.

I am not seeing this behaviour, can we call this resolved, or at least no longer an issue?

Probably.