Store DjVu, PDF extracted text in a structured table instead of img_metadata
Closed, Resolved, Public

Assigned To

Authored By

	• brion
	Sep 14 2011, 10:34 PM

Description

When DjVu files contain text layers, we currently extract them and store them in the file's metadata blob, so that they are available to extensions like ProofreadPage, which can use them.

Unfortunately this *massively* increases the size of the file object -- which contains the uncompressed serialized metadata blob in memory -- leading to errors like T32751, running out of memory when loading a bunch of file objects at once in an API request.

In addition it's a bit awkward to access the text from other places; things like search indexing (T8421) would benefit from having a more standardish place to get at extracted text, and this could also be used for other file formats.
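As a rough illustration of the proposal, the sketch below keeps per-page text in its own table so that loading a file row never pulls the extracted text into memory. It uses Python's sqlite3 for brevity; the `file_text` table, its `ft_*` columns, and the helper functions are hypothetical names chosen for this example, not MediaWiki's actual schema or API.

```python
import sqlite3


def create_schema(conn):
    # "image" stands in for the existing file table; "file_text" is a
    # hypothetical structured table holding one row per page of text.
    conn.executescript(
        """
        CREATE TABLE image (
            img_name TEXT PRIMARY KEY,
            img_metadata TEXT NOT NULL
        );
        CREATE TABLE file_text (
            ft_name TEXT NOT NULL,
            ft_page INTEGER NOT NULL,
            ft_text TEXT NOT NULL,
            PRIMARY KEY (ft_name, ft_page)
        );
        """
    )


def store_file(conn, name, metadata, pages):
    # The metadata blob stays small; page text goes to the separate table.
    conn.execute(
        "INSERT INTO image (img_name, img_metadata) VALUES (?, ?)",
        (name, metadata),
    )
    conn.executemany(
        "INSERT INTO file_text (ft_name, ft_page, ft_text) VALUES (?, ?, ?)",
        [(name, number, text) for number, text in enumerate(pages, start=1)],
    )


def load_metadata(conn, name):
    # Loading a file object touches only the small metadata column.
    row = conn.execute(
        "SELECT img_metadata FROM image WHERE img_name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def load_page_text(conn, name, page):
    # Text is fetched on demand, one page at a time.
    row = conn.execute(
        "SELECT ft_text FROM file_text WHERE ft_name = ? AND ft_page = ?",
        (name, page),
    ).fetchone()
    return row[0] if row else None
```

With a layout along these lines, a search indexer or ProofreadPage could read page text through one well-defined query, while an API request that loads many file objects would only read the metadata column.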

Version: 1.20.x
Severity: normal

Details

Reference: bz30906


Related Objects

Status	Subtype	Assigned	Task
Resolved	Feature	AnneT	T10738 Improve media (image) search display
Duplicate		None	T15370 Search media (images, videos, sounds, etc) by relevant metadata
Invalid		None	T43037 [DO NOT USE] PDF related bugs and enhancements (tracking)
Resolved		None	T8421 Extract embedded text from DjVu and PDF documents for search
Resolved		None	T8422 Extract embedded text from PDF documents for search
Resolved		• Deskana	T23061 Add uploaded file text and metadata from files to fulltext search set
Resolved		None	T23062 Interface to add more data/text fields for Lucene search engine (eg uploaded file text and metadata)
Declined		None	T32751 Allowed memory size exhausted when using API
Resolved		Ladsgroup	T32906 Store DjVu, PDF extracted text in a structured table instead of img_metadata

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 21 2014, 11:57 PM

• bzimport added a project: MediaWiki-File-management.

• bzimport set Reference to bz30906.

• bzimport added a subscriber: Unknown Object (MLST).

• brion created this task. Sep 14 2011, 10:34 PM

Changing deps from bug 6421 (DjVu-only) to bug 21062 (also notes PDF etc), so we cover wider space.

Perhaps, as an interim solution, we shouldn't load file metadata unless a method is called that specifically needs it. I imagine that most of the time you don't need the metadata (on the other hand, maybe you need it more nowadays, since we check whether JPEGs need to be rotated).
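The interim idea in the comment above, deferring the metadata load until something actually asks for it, could be sketched as follows. This is only an illustration under assumptions: `LazyFile` and the `load_metadata` callback are hypothetical names, not MediaWiki classes.

```python
class LazyFile:
    """File object that fetches its metadata only on first access."""

    def __init__(self, name, load_metadata):
        self.name = name
        # Callable that takes a file name and returns its metadata.
        self._load_metadata = load_metadata
        self._metadata = None
        self._loaded = False

    @property
    def metadata(self):
        # The first access triggers the load; later accesses reuse the result.
        if not self._loaded:
            self._metadata = self._load_metadata(self.name)
            self._loaded = True
        return self._metadata
```

Creating many such objects in one request costs nothing metadata-wise; only callers that read `.metadata` (for example, a rotation check) pay for the load.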

• Gilles added a project: Multimedia. Nov 24 2014, 3:38 PM

GOIII subscribed. Feb 22 2015, 5:47 PM

Aklapper added a project: All-and-every-Wikisource. Mar 10 2015, 4:15 PM

Related to T96360

aaron mentioned this in T99263: Store Pdf extracted text in a structured table instead of img_metadata. May 15 2015, 5:42 PM

Nemo_bis subscribed. May 17 2015, 10:02 PM

AuFCL subscribed. May 24 2015, 3:11 AM

Tgr updated the task description. Jul 17 2015, 9:18 AM

Tgr set Security to None.

Restricted Application added subscribers: Matanya, Aklapper. Jul 17 2015, 9:18 AM

Tgr mentioned this in T105791: Chunked upload file fails to reconstitute.... Jul 17 2015, 9:34 AM

zhuyifei1999 subscribed. Jul 17 2015, 10:21 AM

Yann subscribed. Jul 17 2015, 8:46 PM

Tgr mentioned this in T94562: Chunked/stashed uploads fail for some pdf and djvu files: "No specifications provided to ArchivedFile constructor.". Sep 1 2015, 7:03 PM

Tgr mentioned this in T107704: Failed 97MB PDF chunked upload to Commons: "No specifications provided to ArchivedFile constructor.". Sep 1 2015, 9:44 PM

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:34 PM

Restricted Application added a subscriber: Steinsplitter. Sep 4 2015, 6:34 PM

Krinkle mentioned this in T589: RFC: image and oldimage tables. Oct 4 2016, 9:50 PM

• brion renamed this task from Store DjVu extracted text in a structured table instead of img_metadata to Store DjVu, PDF extracted text in a structured table instead of img_metadata. Oct 7 2016, 10:30 PM

Restricted Application added a project: Commons. Oct 7 2016, 10:30 PM

• brion mentioned this in T147296: img_metadata queries for PDF files saturates s4 replicas. Oct 7 2016, 10:31 PM

AuFCL unsubscribed. Nov 21 2016, 7:36 PM

Restricted Application added a subscriber: Poyekhali. Nov 21 2016, 7:36 PM

Liuxinyu970226 removed a parent task: T37925: [DO NOT USE. Please use the Wikisource project] Wikisource related bugs and enhancements (tracking). Dec 23 2016, 12:10 PM

MarkAHershberger unsubscribed. Dec 23 2016, 2:28 PM

Xover subscribed. Oct 23 2019, 6:10 AM

Hmm. While DjVu and PDF (and, I think, TIFF) have explicit text layers, any kind of image can in principle contain text and could benefit from a structured way to store the OCR as actual text. We have oodles of images-of-text in JPEG, PNG, etc. formats in addition to the "book" formats (DjVu, PDF). Even book scans are not infrequently uploaded as a couple hundred JPEGs gathered in a category (if we're lucky).

This would make all those images content-searchable, which, combined with structured data for the description page, would be an extremely powerful tool!
