Author: lars
Description:
[[Commons:File:Иннокентий Анненский - Царь Иксион, 1902.pdf]]
or
http://commons.wikimedia.org/wiki/File:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
in the new version uploaded on March 10, 2012,
is a PDF/A file with page images and an OCR text layer, generated
by the ABBYY FineReader OCR software.
The program pdftotext extracts the OCR text layer, which for the
first page begins: "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n".
(This text contains a few OCR errors, such as the initial "Д", which
is a misinterpreted "А", but this is entirely normal.)
The pdftotext output, piped through "od -c", begins:
0000000 320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
0000020 320 271 . \n \n 320 242 320 240 320 220 320 223
0000040 320 225 320 224 320 206 320 257 \n 320 222 320
0000060 252 320 277 321 217 321 202 320 270 320 224 321 242 320
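As a cross-check that these bytes are well-formed UTF-8, here is a minimal Python sketch that rebuilds them from the dump and decodes them strictly. The helper name od_tokens_to_bytes is made up for this illustration. Spaces are not visible in the dump as pasted, so they are absent from the decoded text, and the dump is cut in the middle of a character, which is why an incremental decoder is used.

```python
import codecs

# Tokens copied from the "od -c" dump above, without the offset column.
OD_DUMP = r"""
320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
320 271 . \n \n 320 242 320 240 320 220 320 223
320 225 320 224 320 206 320 257 \n 320 222 320
252 320 277 321 217 321 202 320 270 320 224 321 242 320
"""

def od_tokens_to_bytes(dump):
    """Turn od -c tokens (octal bytes, literal chars, \\n) back into bytes."""
    out = bytearray()
    for token in dump.split():
        if token == r"\n":
            out.append(0x0A)                   # od's escape for newline
        elif len(token) == 3 and token.isdigit():
            out.append(int(token, 8))          # three-digit octal byte
        else:
            out.extend(token.encode("ascii"))  # printable ASCII such as "."
    return bytes(out)

raw = od_tokens_to_bytes(OD_DUMP)

# Strict decoding raises on any malformed sequence; final=False only
# holds back the incomplete last byte where the dump was cut off.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
text = decoder.decode(raw, final=False)
print(repr(text))
```

If the decode succeeds without raising, the input handed to the normalization step is valid UTF-8, which suggests the loss happens after pdftotext has run.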
However, when the ProofreadPage extension tries to extract the text
using the PdfHandler, the text passes through UtfNormal::cleanUp()
(line 140 of source file extensions/PdfHandler/PdfHandler.image.php),
and only the period, newlines, some hyphens and digits come through;
the Cyrillic letters are all lost.
To reproduce, click the red-linked page numbers in the index at the Russian Wikisource:
http://ru.wikisource.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
Pages are correctly split on \f (form feed).
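For reference, the page split can be sketched in Python as follows. This is only an illustration, not the PdfHandler code; the function name split_pages is hypothetical, and it assumes pdftotext ends every page, including the last one, with a form feed.

```python
def split_pages(text):
    """Split extracted text into pages on form feed (\\f)."""
    pages = text.split("\f")
    # Assumption: a trailing form feed leaves one empty element at the end.
    if pages and pages[-1] == "":
        pages.pop()
    return pages

sample = "page one\n\fpage two\n\f"
print(split_pages(sample))
```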
Version: unspecified
Severity: normal