Extract embedded text from PDF documents for search
Closed, Resolved, Public

Description

PDF files may contain a machine-readable form of the text contained
in the represented document. It could be useful to extract this
text and include it in the search index for the file's description
page.

I'm pretty sure there are open-source tools for extracting text
data from PDFs, but I haven't looked into them.


Version: 1.7.x
Severity: enhancement

Details

Reference
bz6422

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:16 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz6422.
bzimport added a subscriber: Unknown Object (MLST).

The PdfHandler extension extracts text using the 'pdftotext' utility if $wgPdftoText is set.

Currently the extracted text is stored in the metadata blob and isn't available for search, but it may be used by Extension:ProofreadPage.
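
For readers who want to see what the extraction step amounts to, here is a minimal sketch in Python that calls the 'pdftotext' command-line tool directly. It is not PdfHandler's code (PdfHandler is a PHP extension); the function names build_pdftotext_command and extract_pdf_text are hypothetical, and the sketch assumes 'pdftotext' is installed and on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, binary: str = "pdftotext") -> List[str]:
    """Build the argument list for one pdftotext run (hypothetical helper)."""
    # Passing "-" as the output file asks pdftotext to write to stdout.
    return [binary, pdf_path, "-"]


def extract_pdf_text(pdf_path: str, binary: str = "pdftotext") -> Optional[str]:
    """Return the embedded text of a PDF, or None if the tool is missing or fails."""
    if shutil.which(binary) is None:
        # The utility is not installed or not on the PATH.
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # Replace undecodable bytes rather than raising on odd encodings.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_pdf_text("example.pdf") on a PDF with embedded text should return that text as a string; a scanned PDF without a text layer will typically yield little or nothing.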

dchandler wrote:

@Brian: Thanks so much for posting this. I have desperately been trying to add the capability of searching within PDFs. I'm definitely a non-expert though and can generally only install extensions or make modifications that are well-documented.

Have you already implemented this on a wiki, or do you know anyone who has? I've seen it suggested that FileIndexer (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer) may be another approach. Do you have any advice on which approach is easier to implement for a non-expert? Do you think that the Extension:ProofreadPage method might be easier or more stable than using the other extension?

Do you know of any step-by-step guides to doing this with pdftotext and ProofreadPage?

Thanks so much in advance for any suggestion or guidance you have.

dr.trigon wrote:

As mentioned in bug 6421 (comment #3), DrTrigonBot could do text extraction and store the result in a dedicated wiki page so that it is accessible to search. But since PdfHandler does text extraction as well, this should not be needed.

As I see it, we have everything needed:
1.) text extraction (PdfHandler or DrTrigonBot)
2.) indexing for search (see bug 6421)
...so as I understand it, we should be able to finish this and close the ticket/bug, or am I wrong? Could somebody comment on this?

Thanks and Greetings
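
To illustrate how the two pieces listed above fit together (extracted text, then indexing for search), here is a toy inverted index in Python. It is only a sketch of the general idea, not how MediaWiki's search backend works; the names tokenize, build_index and search are hypothetical, and the input is assumed to be a mapping from file page titles to text that has already been extracted.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of page titles whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title, text in pages.items():
        for word in tokenize(text):
            index[word].add(title)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the titles that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Intersect the title sets of all query words.
    return set.intersection(*(index.get(word, set()) for word in words))
```

With an index built from {"File:A.pdf": "Annual report 2006", "File:B.pdf": "Quarterly report"}, a search for "annual report" matches only File:A.pdf, while "report" matches both files.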

I don't think there's anything left to do here; we index PDF/DjVu data in the new search.