
Install PdfHandler extension
Closed, Resolved, Public

Description

Author: jodeldi

Description:
Please install the PdfHandler extension on Wikimedia servers.

The handler supports PDF files, extracts pages, generates thumbnails from the extracted pages, and displays PDF files in a multipage view like DjVu. It works together with ProofreadPage, and embedding images with [[Image:Bla.pdf|page=5]] works too. The extension was tested by Raymond and me. A preview is available at http://www.xarax.eu/wiki/Bild:110.pdf. The extension would be useful for proofreading on all Wikisource projects, and also for generating image thumbnails on all Wikimedia projects.


Version: unspecified
Severity: enhancement
URL: http://www.mediawiki.org/wiki/Extension:PdfHandler

Details

Reference
bz11215

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:51 PM
bzimport set Reference to bz11215.

joergens.mic wrote:

I support this wish. The arguments can be seen above.

gmaxwell wrote:

The example given is a scanned document. Why are these files not using djvu instead? Djvu files are usually substantially smaller (especially if the pdf is not using the patented jbig compression), we have pre-existing support for them, and the djvu viewer is nicer for high resolution images.

joergens.mic wrote:

1 Because we typically get PDF files from academic libraries; most archives and libraries don't even know this LizardTech format.

2 Most programs support PDF; only some support DjVu.

3 PDF is a de facto standard worldwide; DjVu is not.

4 Downloading and reading a PDF is possible for everybody who owns a PC, because Acrobat Reader is on every PC. For DjVu you have to find the appropriate programs.

5 In a PDF file the transcription of the text (OCR) is sometimes even embedded; I don't know whether such things are possible in DjVu.

6 PDF is a format that is accepted by Commons; we should be able to use it in a useful manner.

I hope the 6 points above will give you the answers to your question.

jeluf wrote:

According to http://www.mediawiki.org/wiki/Extension:PdfHandler#Bugs_and_enhancements the extension can't handle already uploaded images. That's a show stopper.

Please fix and re-open this bug afterwards.

jodeldi wrote:

With current MediaWiki (1.12alpha (r28001)) the extension works with already uploaded PDF files.

thomasV1 wrote:

I support the activation of this extension on Wikisource.

We all agree that the Djvu format is technically more
appropriate for scanned documents.

However, the process of creating Djvu files is too difficult for most contributors.
As a result, contributors do not provide scanned images of their wikisource
documents; they only provide OCR-ed text, and they keep the pdf on their own
hard disk, which makes collaborative verification impossible.

Activating this extension on Wikisource would solve that problem.

GIF files are also not desirable when PNG is more appropriate, but we do not prohibit GIF files from being viewed.

I agree that this is a useful improvement to the wikisource infrastructure, for the reasons given by others in comment #3 and comment #6. Wikisource guidelines should recommend DJVU over PDF, and provide information to assist in the learning curve.

ipork wrote:

I support this extension and I agree completely with ThomasV.

francescogabrielli wrote:

Strong support! Accurimbono

Please install this extension. Thanks, Yann

Joshua_Sherurcij wrote:

I strongly support the need for this extension at Wikisource, to improve attempts at collaborative effort and proofreading, as well as hosting documents that would otherwise be unhosted. -Joshua

Software requirements for deployment:

  • Install ghostscript and xpdf-utils on image scalers (gs, pdfinfo)
  • Install xpdf-utils on app servers (pdfinfo)

In general I think it seemed to be working ok when I last tested and tweaked it, so once the software deps are in we can throw it on test.wikipedia.org.
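As a rough illustration of why the two tools are needed on those hosts, here is a minimal Python sketch of the kind of calls involved: pdfinfo to read the page count, gs to render a single page as a thumbnail. This is not the extension's actual code. The function names, the 150 dpi default, and the JPEG output device are assumptions chosen for the example; only the pdfinfo and gs command-line options are standard.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Return the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def thumbnail_command(pdf_path, page, out_path, dpi=150):
    """Build a Ghostscript command that renders one page to a JPEG.

    The resolution and output device are illustrative assumptions.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=jpeg",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-r{dpi}",
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def render_page(pdf_path, page, out_path):
    """Check the page exists (pdfinfo), then render it (gs)."""
    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    if page > parse_page_count(info.stdout):
        raise ValueError("requested page is beyond the end of the document")
    subprocess.run(thumbnail_command(pdf_path, page, out_path), check=True)
```

Under these assumptions, render_page("Bla.pdf", 5, "page5.jpg") would need both gs and pdfinfo on the image scalers, while an app server that only reads metadata needs pdfinfo alone.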

  1. I agree with ThomasV's comments.
  2. I think converting PDF files into DjVu files requires time and effort that could instead go to actual proofreading.
  3. I prefer DjVu when it can be done, but lots of documents won't be converted that fast. Checking whether a modification is vandalism is very difficult for the many PDFs that exist in libraries if we have insufficient tools, or no tools at all, to use them. So I ask for these tools too. Zeph.

There is a problem with the result: if I select the "Version imprimable" option in the left menu of http://fr.wikisource.org/wiki/Du_contrat_social/Texte_entier the plain text that I get is correct, but if I select the "Version PDF" option the text is cut into pieces and is no longer understandable at all. Zeph.

(In reply to comment #15)

There is a problem with the result: if I select the "Version imprimable" option
in the left menu of
http://fr.wikisource.org/wiki/Du_contrat_social/Texte_entier the plain text
that I get is correct, but if I select the "Version PDF" option the text is cut
into pieces and is no longer understandable at all. Zeph.

PDF export (Collection extension -> mwlib) is totally unrelated. This bug is about installing support for inline display of uploaded PDF files in pages.

mike.lifeguard+bugs wrote:

(In reply to comment #17)

I support it

Please don't add useless comments like this one. Commenting on bugs is for offering technical information related to solving the bug. Anything else simply lowers the SNR, making life difficult for everyone. Please CC yourself if you want to follow progress, and vote if all you have to offer is "I support it."

thomasV1 wrote:

I no longer support the activation of this.
I changed my mind for the following reason:

Djvu files may contain a text layer, which can store the result of an OCR. This
text layer can be extracted and provide a starting point for corrections on the
wiki. Soon this will be done automatically, when a page is edited for the first
time, without the need to use a robot (see latest changes to ProofreadPage).

The text layer is not supported by the PDF format. Users who start to work on a
PDF project might later realize that they want a djvu file, because they need to
start from an OCR. This will force them to rename all the pages. This is messy.

In contrast, if they start from a djvu file, it will always be possible to add
or improve the text layer, by uploading a new version of the file. Moving pages
around is not needed.

In addition, the last year has shown that the community has learned to create and
handle djvu files, which is the appropriate format for scans.

mike.lifeguard+bugs wrote:

(In reply to comment #19)

I no longer support the activation of this.

Sorry, this is not the place to put such comments. Discussion and consensus-building belongs on the wiki, not on the bug. The bug is for technical implementation of the request. However, to briefly address your comments, there is no reason that other wikis cannot make use of this extension, even if Wikisource prefers djvu. Furthermore, there is no reason we cannot support both, and every reason we should do so.

(In reply to comment #19)

I no longer support the activation of this.
I changed my mind for the following reason:

Djvu files may contain a text layer, which can store the result of an OCR. This
text layer can be extracted and provide a starting point for corrections on the
wiki.

So can PDF.
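For illustration only, a minimal Python sketch of pulling the text layer for one page out of either format. It assumes pdftotext (from xpdf-utils) and djvutxt (from DjVuLibre) are installed. The helper names are hypothetical, and a scanned PDF only yields text if an OCR layer was embedded in it.

```python
import subprocess


def text_layer_command(path, page):
    """Build the command that prints the text layer of one page to stdout."""
    lower = path.lower()
    if lower.endswith(".pdf"):
        # pdftotext: -f/-l select the first/last page, "-" writes to stdout
        return ["pdftotext", "-f", str(page), "-l", str(page), path, "-"]
    if lower.endswith(".djvu"):
        # djvutxt prints to stdout when no output file is given
        return ["djvutxt", f"--page={page}", path]
    raise ValueError("unsupported file type: " + path)


def extract_text_layer(path, page):
    """Run the matching tool and return the extracted text."""
    result = subprocess.run(
        text_layer_command(path, page),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```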

IIRC we fixed this up to stick it on the Usability wiki when we moved it to shared infrastructure... if the PDF rendering is working on the scaler boxes we should be free to enable it generally.

thomasV1 wrote:

PdfHandler seems to have been installed.