
Store DjVu, PDF extracted text in a structured table instead of img_metadata
Closed, Resolved (Public)

Description

When DjVu files contain text layers, we currently extract these and store them in the file's metadata blob, so they're available to extensions like ProofreadPage, which can use them.

Unfortunately this *massively* increases the size of the file object -- which contains the uncompressed serialized metadata blob in memory -- leading to errors like T32751, running out of memory when loading a bunch of file objects at once in an API request.

In addition it's a bit awkward to access the text from other places; things like search indexing (T8421) would benefit from having a more standardish place to get at extracted text, and this could also be used for other file formats.
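As a rough illustration of the idea (not an actual MediaWiki schema), the sketch below keeps extracted text in its own per-page table, so a file object can be loaded without dragging the text along. The table name `file_text`, its columns, and the helpers `store_text` and `get_page_text` are all hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema: one row per (file, page), kept apart from the
# file's metadata blob so loading a file object never pulls in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_text (
    ft_file_name TEXT NOT NULL,
    ft_page      INTEGER NOT NULL,
    ft_text      TEXT NOT NULL,
    PRIMARY KEY (ft_file_name, ft_page)
)
"""


def store_text(conn, file_name, pages):
    """Store the extracted text of each page (pages is a list of strings)."""
    conn.executemany(
        "INSERT OR REPLACE INTO file_text VALUES (?, ?, ?)",
        [(file_name, n, text) for n, text in enumerate(pages, start=1)],
    )


def get_page_text(conn, file_name, page):
    """Return the text for one page, or None if nothing was extracted."""
    row = conn.execute(
        "SELECT ft_text FROM file_text WHERE ft_file_name = ? AND ft_page = ?",
        (file_name, page),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_text(conn, "Example.djvu", ["Title page", "Chapter one"])
```

A consumer such as a search indexer or ProofreadPage could then ask for a single page with `get_page_text(conn, "Example.djvu", 2)` instead of unserializing the whole metadata blob.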


Version: 1.20.x
Severity: normal

Details

Reference
bz30906
Title | Reference | Author | Source Branch | Dest Branch
Extract Kafka config to env var | repos/data-engineering/eventutilities-python!21 | gmodena | T329061-remove-hardcoded-kafka-parameters | main

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:57 PM
bzimport set Reference to bz30906.
bzimport added a subscriber: Unknown Object (MLST).

Changing deps from bug 6421 (DjVu-only) to bug 21062 (also notes PDF etc), so we cover wider space.

Perhaps (as an interim solution) we shouldn't be loading file metadata unless a method is called that specifically needs it. I imagine most of the time you don't need the metadata (otoh, maybe you need it more nowadays, now that we check whether JPEGs need to be rotated).
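A minimal sketch of that interim idea, assuming a hypothetical `LazyFile` class and a caller-supplied loader function (neither is MediaWiki's actual API): the metadata is fetched only on first access and cached afterwards.

```python
class LazyFile:
    """File object that defers loading its metadata until it is needed."""

    def __init__(self, name, load_metadata):
        self.name = name
        self._load_metadata = load_metadata  # hypothetical loader callback
        self._metadata = None
        self._loaded = False

    @property
    def metadata(self):
        # Load on first access only; later accesses reuse the cached value.
        if not self._loaded:
            self._metadata = self._load_metadata(self.name)
            self._loaded = True
        return self._metadata


calls = []


def fake_loader(name):
    """Stand-in for an expensive metadata fetch; records each call."""
    calls.append(name)
    return {"pages": 2}


f = LazyFile("Example.djvu", fake_loader)
```

Creating many such objects in one API request costs nothing metadata-wise until some code path actually reads `f.metadata`.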

Tgr set Security to None.
brion renamed this task from "Store DjVu extracted text in a structured table instead of img_metadata" to "Store DjVu, PDF extracted text in a structured table instead of img_metadata". Oct 7 2016, 10:30 PM

Hmm. While DjVu and PDF (and, I think, TIFF) have explicit text layers, any kind of image can in principle contain text and could benefit from a structured way to store the OCR output as actual text. We have oodles of images-of-text in JPEG, PNG, etc. formats in addition to the "book" formats (DjVu, PDF). Even book scans are not infrequently uploaded as a couple hundred JPEGs gathered in a category (if we're lucky).

This would make all those images content-searchable, which, combined with structured data for the description page, would be an extremely powerful tool!

Krinkle closed this task as Resolved.
Krinkle assigned this task to Ladsgroup.