Page MenuHomePhabricator

Improved version of PdfHandler
Closed, DeclinedPublic

Description

I want to suggest some improvements for PdfHandler:

  • Respect page rotation (fix for Bug 22194 - pdfinfo doesn't respect it) - it's very simple to get page information from PHP code, so I've done it.
  • Direct rendering without ImageMagick rescaling step, also fixes possible memory/tempfile/performance issues on high DPIs.
  • Ghostscript options for better rendering (-dUseCropBox -dTextAlphaBits=4 -dGraphicsAlphaBits=4)
  • Thumbnails now link to correct individual pages of PDF.

I have commit access to extensions now, am I allowed to create my own branch for review? Or where should I commit my work for review? (The most dubious of all this is probably the multiple page thumbnail feature)


Version: unspecified
Severity: enhancement

Details

Reference
bz31775

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 11:53 PM
bzimport set Reference to bz31775.
bzimport added a subscriber: Unknown Object (MLST).

Forgotten another one:

  • Do not escape $wgPdfProcessor to allow specifying 'nice -n 20 gs' for lower ghostscript process priority, for example

Yes, you can create your own branch. Or even make the changes in trunk if they are straightforward enough (I recommend you separate the changes on its own revision, don't perform a big commit fixing a dozen different things).

$wgPdfProcessor needs to be escaped. What if I have gs at C:\Program files\Ghostview\gs.exe ?

I think that you can specify it already escaped in the configuration ?) I.e. it's still possible to use it.
The problem is that ghostscript can create relatively big load on the server, it's very useful to run it niced, but escaping forbids it. :(
On the other hand, one could use shell script with "nice -n 20 gs $*" instead of gs itself...
...

I discovered that it was too naive for me to think I can easily implement page size getting which works correct on all PDFs, so I returned to using pdfinfo, but had to install a patched version on all our servers.

It's a pain to separate the changes after they are already done :( So I've committed a single r100426 to the branch... Please give me some review? :)

Made a few comments on the rev -- I'd recommend breaking this into several separate bugs and treating them separately.

The page rotation looks like it requires patching third-party programs -- this would seem to mean that it won't work with a default poppler-utils install, and would make deploying it a lot harder.

There are a bunch of introductions of register_globals vulnerabilities -- do not want. :)

I'm a bit unsure what this multiple-thumbnail thing is for; it's not even mentioned in your descriptions. Not sure if this is a desirable thing, and if it is I'm not sure how much difficulty it would cause with compatibility. Thumbnails are rendered via a 404 handler on major installs like Wikimedia, and we need consistent naming that's understood on the back and front end; this may not work with that.

What's the benefits of the direct rendering from ImageMagick? Are there any reasons that wasn't done in the first place, such as file size, memory usage, etc?

What does "Thumbnails now link to correct individual pages of PDF." actually mean?

(In reply to comment #5)

The page rotation looks like it requires patching third-party programs -- this
would seem to mean that it won't work with a default poppler-utils install, and
would make deploying it a lot harder.

I've sent a bug to poppler's bugtracker: https://bugs.freedesktop.org/show_bug.cgi?id=41867
Hope the patch will be included some day. :)

There are a bunch of introductions of register_globals vulnerabilities -- do
not want. :)

Oops, thank you very much! I've understood at last, why so much extensions disallow setting parameters before inclusion... =) although register_globals is off on every "self-respecting" web server :)

I'm a bit unsure what this multiple-thumbnail thing is for; it's not even
mentioned in your descriptions. Not sure if this is a desirable thing, and if
it is I'm not sure how much difficulty it would cause with compatibility.
Thumbnails are rendered via a 404 handler on major installs like Wikimedia, and
we need consistent naming that's understood on the back and front end; this may
not work with that.

For example, insert all slides of a presentation to a wiki page...
I agree it's a sort of specific feature.

What's the benefits of the direct rendering from ImageMagick? Are there any
reasons that wasn't done in the first place, such as file size, memory usage,
etc?

Increased performance, reduced memory usage and no temp file problems.
What about temp files: we've had an issue when handling some PDFs (I can find an example if you want) - gs was creating several ~1Gb files on /tmp and then crashing. So /tmp became full some day :)
And all gs-based pdf viewers suffer - opening such PDFs takes ~4 min on modern hardware :) at the same time, Adobe Reader displays them normally.
The problem exists mostly when these PDFs are rendered with high DPI setting. So when using imagemagick, you must lower this setting to solve the problem, but this also affects the quality of other images.

What does "Thumbnails now link to correct individual pages of PDF." actually
mean?

This means [[File:Something.pdf|page=2]] will link to /wiki/File:Something.pdf?page=2 - not just /wiki/File:Something.pdf

Reverted !isset. r101066 :)
There was also r100601...

I have a question about 404 handler:
Is it http://www.mediawiki.org/wiki/Extension:WebStore ?
How does it reduce the load? I mean, when the article which includes an image is requested, the browser is anyway expected to download the thumbnail?

And another one about file names:
Multiple page rendering allows to efficiently generate thumbnails for the whole PDF (rather than calling gs multiple times). But imagehandler checks for the existence of ONE file. What would be the correct way to make it compatible with multiple files?

(In reply to comment #7)

I have a question about 404 handler:

Or is it purposed to ease distributing the load between different groups of backends?

No.
See the comments on http://www.mediawiki.org/wiki/Special:Code/MediaWiki/100535

I think you will find there pointers to pretty much everything on 404 handler.

In the meantime, libpoppler bug that I've filled (https://bugs.freedesktop.org/show_bug.cgi?id=41867) was fixed by developers in a different way - pdfinfo will now also show page rotation in addition to the page size. The patch is included into trunk and it will be in poppler 0.20. So we should update PdfHandler to work with it :)

I've added the support for newer poppler in r103313 to PdfHandler trunk, also posted a comment to bug 22194.

(In reply to comment #5)

I'd recommend breaking this into several
separate bugs and treating them separately.

Yes! Only one issue per report, please, otherwise it's easy to lose overview, like I did here.

Vitaliy: What's the status of this request? What's still needed to get this fixed?

Please break this into several separate tasks if still applicable. See https://www.mediawiki.org/wiki/How_to_report_a_bug : "Include only one problem per task".
Closing this task as non-actionable as it covers a lot of different issues so it is impossible to properly track any progress via patches or the task status.