
pdftotext should be poppler version not xpdf version on wikisource
Closed, Resolved, Public

Description

Author: lars

Description:
[[Commons:File:Иннокентий Анненский - Царь Иксион, 1902.pdf]]
or
http://commons.wikimedia.org/wiki/File:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
in the new version uploaded March 10, 2012,
is a PDF/A file with page images and an OCR text layer, generated
by ABBYY FineReader OCR software.

The program pdftotext extracts the OCR text layer, which for the
first page begins: "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n".
(This text contains a few OCR errors, such as the initial "Д", which
is a misinterpreted "А", but this is entirely normal.)

The pdftotext output, piped through "od -c" begins:
0000000 320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
0000020 320 271 . \n \n 320 242 320 240 320 220 320 223
0000040 320 225 320 224 320 206 320 257 \n 320 222 320
0000060 252 320 277 321 217 321 202 320 270 320 224 321 242 320
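As a cross-check, the octal bytes in the first row of the dump can be decoded by hand. The following is a minimal Python sketch, not part of the original report; the variable names are illustrative. It shows that those bytes are valid UTF-8 for the first word of the OCR text.

```python
# First 18 bytes of the "od -c" dump above, written as octal tokens.
octal_dump = (
    "320 224 320 275 320 275 321 263 320 275 "
    "321 201 320 272 321 226 320 271"
)

# Convert each octal token to a byte value, then decode the bytes as UTF-8.
raw = bytes(int(token, 8) for token in octal_dump.split())
first_word = raw.decode("utf-8")

print(first_word)
```

If the decode succeeds, the pdftotext output itself is well-formed UTF-8, which suggests the letters are lost somewhere other than in this local extraction step.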

However, when the ProofreadPage extension tries to extract the text,
using the PdfHandler, the text passes through UtfNormal::cleanUp()
(line 140 of source file extensions/PdfHandler/PdfHandler.image.php),
and only the period, newline, some hyphens and digits come through.
Try this at the Russian Wikisource, by clicking the red-linked page numbers,
http://ru.wikisource.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf

Pages are correctly split on \f (form feed).
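For illustration, splitting on form feed can be sketched as below. This is a hypothetical Python helper, not the PdfHandler code; it assumes pdftotext ends each page with a form feed, so the final empty piece is dropped.

```python
def split_pages(extracted: str) -> list[str]:
    """Split pdftotext output into per-page strings on form feed."""
    pages = extracted.split("\f")
    # Assumption: every page ends with "\f", so the last piece is empty.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


sample = "page one\n\fpage two\n\f"
print(split_pages(sample))
```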


Version: unspecified
Severity: normal

Details

Reference
bz35122

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:11 AM
bzimport set Reference to bz35122.
bzimport added a subscriber: Unknown Object (MLST).

lars wrote:

I should add that I run Ubuntu Linux 11.10, where pdftotext -? says:
pdftotext version 0.16.7
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC

The above version successfully extracts the text.

A different version, which fails to extract letters, is included in xpdf 3.02, which says:
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
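The two banners above differ in a way that can be checked mechanically. The following Python sketch is an assumption-laden illustration, not an official detection method: the function names are hypothetical, and the only rule used is that the Poppler banner mentions "Poppler" while the xpdf banner credits only Glyph & Cog.

```python
import subprocess


def classify_banner(banner: str) -> str:
    """Guess the pdftotext implementation from its version banner."""
    if "Poppler" in banner:
        return "poppler"
    if "Glyph & Cog" in banner:
        return "xpdf"
    return "unknown"


def installed_pdftotext() -> str:
    """Run "pdftotext -v" and classify whatever it prints.

    Stdout and stderr are merged because the banner may go to either.
    """
    result = subprocess.run(
        ["pdftotext", "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return classify_banner(result.stdout)
```

Calling installed_pdftotext() on a given host would then report which of the two variants is on the PATH, provided a pdftotext binary is installed there.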

pdftotext version 3.02 from the xpdf-3.02 package produces garbage, mainly spaces, dots and other ASCII punctuation.

I've updated the summary to make it clearer what is needed. Let me know if I have that right and I'll open an RT ticket.

Yes, it's fine. The xpdf version thing is just our theory. We have no idea which version of pdftotext is running really.

(In reply to comment #5)
> Yes, it's fine. The xpdf version thing is just our theory. We have no idea
> which version of pdftotext is running really.

reedy@fenari:~$ pdftotext -v
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC

Interesting that the latest Ubuntu doesn't have pdftotext from xpdf;
lucid has it in xpdf-utils.

A newer version of pdftotext is available from poppler-utils, although its version numbers are lower (now at 0.18.4):

http://packages.ubuntu.com/precise/poppler-utils

http://poppler.freedesktop.org/

Usually you have to get rid of xpdf to use poppler.

mah@lucid:~$ xpdf -v
xpdf version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
mah@lucid:~$ pdftotext -v
pdftotext version 0.12.4
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
mah@lucid:~$ dpkg -l xpdf xpdf-reader poppler-utils xpdf-utils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                           Version
+++-==============================-==============================-
ii  poppler-utils                  0.12.4-0ubuntu5.2
ii  xpdf                           3.02-2ubuntu1.1
ii  xpdf-reader                    3.02-2ubuntu1.1
un  xpdf-utils                     <none>

https://rt.wikimedia.org/Ticket/Display.html?id=2631

beau wrote:

*** Bug 34540 has been marked as a duplicate of this bug. ***

beau wrote:

*** Bug 32064 has been marked as a duplicate of this bug. ***

This was fixed when MediaWiki boxes were upgraded to Ubuntu Precise (which happened a few months ago). Faidon checked that on a Precise box poppler-utils is indeed installed instead of xpdf-utils.

Closing as FIXED.

Examples in bug 34540 and bug 32064 still show foreign characters as �. Is there any chance that the fix isn't deployed yet? Or are these other bugs not really duplicates?

(In reply to comment #13)
> Examples in bug 34540 and bug 32064 still show foreign characters as �.
> Any chance that the fix isn't deployed yet? Or these other bugs are not
> duplicates really?

I don't know the implementation details of this functionality, but I'd be surprised if the text extraction wasn't cached. Hence if the text was extracted before this bug report was fixed, the text should still be wrong.
And now somebody please correct me if I'm wrong.

Yeah, action=purge on the file seems to have fixed it. Pikne, do you confirm?
As for what is a duplicate and what is not: we can assume that poppler-utils has, or will have, bugs that xpdf doesn't. The only way to know is to run pdftotext locally on the files you have problems with and see where the problem lies.
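For reference, a purge can also be requested through the MediaWiki web API rather than the page URL. The Python sketch below only builds the request and does not send it. The helper name and the file title are hypothetical placeholders, and the endpoint shown is an assumption about which wiki hosts the file.

```python
import urllib.parse
import urllib.request


def build_purge_request(api_url: str, title: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for action=purge on one title."""
    params = {"action": "purge", "titles": title, "format": "json"}
    body = urllib.parse.urlencode(params).encode("utf-8")
    # Supplying a body makes urllib issue a POST instead of a GET.
    return urllib.request.Request(api_url, data=body)


# Hypothetical title; substitute the file whose cached text is stale.
request = build_purge_request(
    "https://commons.wikimedia.org/w/api.php", "File:Example.pdf"
)
print(request.get_method(), request.data.decode("utf-8"))
```

Sending it would be a matter of passing the request to urllib.request.urlopen. Whether a purge also refreshes previously extracted text is taken from the comment above, not verified here.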

Yes, looks fine now. I didn't realize that this sort of thing could be cached too.