
Switch from jpeg to png for thumbnailing pdfs
Open, MediumPublic

Description

From bug 36580:

<Robin_Watts> That looks a lot like you're rendering to JPEG - the ringing artifacts etc.

<chrisl> hexmode: the "heavily compressed" effect is, as Robin_Watts mentioned, because it's jpeg compressed - the solution is: don't use jpeg.....

Version: unspecified
Severity: normal

Details

Reference
bz36597

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:28 AM
bzimport set Reference to bz36597.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 10534
using png

Switching to png output instead of jpeg using the command in bug 36580 comment 4 results in smaller file size, as well:

$ gs -sDEVICE=png16m -sOutputFile=after_gs.png -dFirstPage=1 -dLastPage=1 -r150 -dBATCH -dNOPAUSE -q Welcome2WP_English_082310.pdf

gives me a file size of 59394 instead of 143024.

Attached:

after_gs.png (1×1 px, 58 KB)

Created attachment 10535
downscaled png

after convert, compare to attachment #10524 (https://bugzilla.wikimedia.org/attachment.cgi?id=10524)

Attached:

after_convert.png (599×422 px, 54 KB)

The Ghostscript default quality for the jpeg device is 75 (-dJPEGQ=75). It would be better to first try a saner value such as -dJPEGQ=95 and compare output size and quality between png and jpg before switching to png. Switching to png could have a huge impact on Wikisource, which relies on PDF files.
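
A minimal sketch of the comparison suggested here, assuming Python is used to drive Ghostscript. It only builds the two command lines (png16m, and jpeg at -dJPEGQ=95) from the flags already quoted in this report; the helper name and the output file name after_gs_q95.jpg are made up for illustration.

```python
from typing import List, Optional


def gs_first_page_command(pdf: str, out: str, device: str,
                          jpeg_quality: Optional[int] = None,
                          dpi: int = 150) -> List[str]:
    """Build a Ghostscript command that renders page 1 of `pdf` to `out`."""
    cmd = [
        "gs",
        f"-sDEVICE={device}",
        f"-sOutputFile={out}",
        "-dFirstPage=1",
        "-dLastPage=1",
        f"-r{dpi}",
        "-dBATCH",
        "-dNOPAUSE",
        "-q",
    ]
    if jpeg_quality is not None:
        # -dJPEGQ is only meaningful for the jpeg devices.
        cmd.append(f"-dJPEGQ={jpeg_quality}")
    cmd.append(pdf)
    return cmd


# Hypothetical output names; the input PDF is the one used in this report.
png_cmd = gs_first_page_command("Welcome2WP_English_082310.pdf",
                                "after_gs.png", "png16m")
jpg_cmd = gs_first_page_command("Welcome2WP_English_082310.pdf",
                                "after_gs_q95.jpg", "jpeg", jpeg_quality=95)
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine with gs installed, after which os.path.getsize on the two outputs gives the size comparison.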

questpc wrote:

png is better for low-color pages, jpg is better for wide-color pages.

Copy paste from a comment I made on Gerrit change #6802 :


We could add a parameter to the thumb syntax to let the user choose the rendered format. Something like:

[[File:foo.pdf|thumb|png]]
[[File:foo.pdf|thumb|jpg]]

And have the default set by a global configuration variable such as $wgPdfThumbOutputFormat or something. Would get us the best of both worlds :-]

That is definitely an easy change to the current patchset I will be more than happy to review it :)
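
The override-plus-default logic proposed above could be sketched roughly as follows. This is illustrative Python rather than MediaWiki code; the function name is hypothetical, and falling back to the site default for unknown values is an assumption, not something decided in this report.

```python
from typing import Optional

# Formats discussed in this report.
ALLOWED_FORMATS = ("png", "jpg")


def resolve_thumb_format(requested: Optional[str], site_default: str = "jpg") -> str:
    """Return the thumbnail format for one [[File:...|thumb|...]] use.

    `requested` is the optional per-use parameter; `site_default` stands in
    for a global setting such as the suggested $wgPdfThumbOutputFormat.
    """
    if site_default not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported site default: {site_default!r}")
    if requested is None:
        return site_default
    fmt = requested.strip().lower()
    # Assumption: unknown values silently fall back to the site default.
    return fmt if fmt in ALLOWED_FORMATS else site_default
```

With this shape, [[File:foo.pdf|thumb|png]] maps to resolve_thumb_format("png"), and a plain [[File:foo.pdf|thumb]] maps to resolve_thumb_format(None).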

questpc wrote:

It is better to count the number of colors in each PDF page, because one PDF file may combine low-color text pages and colorful illustrations. Or, if calculating the color range is too expensive, one may compress the same page to both lossless png and 95% jpeg and choose whichever image is smaller. For wide-color images, lossless png will be MUCH larger than high-quality 95% jpeg.
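
The "encode both and keep the smaller" fallback could look something like this in Python. It is only a sketch: it assumes the page has already been rendered to PNG and to 95% JPEG bytes elsewhere, and preferring PNG on a tie (because it is lossless) is an added assumption.

```python
from typing import Tuple


def pick_smaller_encoding(png_bytes: bytes, jpeg_bytes: bytes) -> Tuple[str, bytes]:
    """Return (extension, data) for whichever encoding of a page is smaller."""
    # Assumption: on equal size, keep the lossless PNG.
    if len(png_bytes) <= len(jpeg_bytes):
        return "png", png_bytes
    return "jpg", jpeg_bytes
```

Because the choice is made per page, a PDF mixing text pages and photographic pages would end up with thumbnails in both formats.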

I have abandoned Gerrit change 6802 pending a proper design choice which should be happening in this bug report.

[Patch in Gerrit got reviewed (and abandoned), hence resetting keyword]

Changing state back to 'new' since the previous patch was abandoned some time ago.

I'm interested in following up on this bug, particularly for PDFs of vectorial graphs generated from data analysis software (like R or Mathematica), which researchers (including myself) routinely upload to Commons.

The quality of JPEG thumbnails for these PDF graphs is abysmal compared to the thumbnails for a native PNG file.

Original files:
https://commons.wikimedia.org/wiki/File:Active_Editors_arwiki.pdf
https://commons.wikimedia.org/wiki/File:Active_Editors_arwiki_2.png

Thumbnails:
https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Active_Editors_arwiki.pdf/page1-1004px-Active_Editors_arwiki.pdf.jpg
https://upload.wikimedia.org/wikipedia/commons/0/06/Active_Editors_arwiki_2.png

The only other option for vectorial plots to avoid these compression artifacts is to upload them as SVG (which renders as PNG). However in many cases PDF is the default export option and the most common format for scientific media people will consider donating to Commons.

The thumb syntax parameter proposed above is T3316: Allow image thumb output format to be specified, by the way.

So?
Where is that JPEG lover among Wikimedia techs who obstructs fixing the bug for 4 years?
Look at https://commons.wikimedia.org/w/index.php?title=File%3AResourceLoader_Wikimania_2011.pdf&page=3 for example – several respected guys have their produce presented on par with miscellaneous rubbish from Wikimedia Commons, excretions of such users of Microsoft Paint or Adobe Photoshop who know nothing about formats.

Ī̲ don’t expect admins to proceed with the desired change in any foreseeable future, but fortunately a simple DHTML tweak can bring PNGs – see https://meta.wikimedia.org/wiki/User:Incnis_Mrsi/PDF-PNG.js

> So?
> Where is that JPEG lover among Wikimedia techs who obstructs fixing the bug for 4 years?

There is no obstruction, just status quo and lots of other things to work on and no one speaking up about this ticket for 4 years, which screams 'no priority' to me. I'd like to point out that constructive criticism is going to be more helpful than posturing.

> There is no obstruction, just status quo and lots of other things to work

The only “work” to do was to change the value of [[ https://www.mediawiki.org/wiki/Extension:PdfHandler#Configuration | $wgPdfOutputExtension ]] from jpg to png. In that case, perhaps, no one would even remember the possibility of PDF → JPEG transformation, as PNG has been universally supported in browsers since c. 1998, and no “huge impact on wikisource” would ever occur except the disappearance of dirt. Instead, certain users opted to argue about “wide-color images”, increasing JPEG quality from 75% to 95%… how often do Wikimedians wrap color photography or painting into PDF?

> The only “work” to do was to change the value of $wgPdfOutputExtension from jpg to png

And unintended side effects would not occur? Problems of having to regenerate thousands of images guaranteed not to be a scalability issue? There would be 0 users asking for the opposite change in the week after? No outraged debates with the community where the foundation has to pull 6 people off other projects to do crisis management because of an unpredicted event?

Even a three character change can have far reaching consequences. What looks simple and obvious to some isn't always so. This goes wrong often enough when Developers and/or Foundation DO think it through, that such things cannot be ignored/skipped/assumed even for small 'obvious' things like this.

> Problems of having to regenerate thousands of images guaranteed not to be a scalability issue?

For a huge site like Wikimedia? Some thousands (or even myriads) of gs(1) invocations? Insignificant.

> There would be 0 users asking for the opposite change in the week after?

You can count on PDF-JPG.js by me, a script doing the replacement in reverse, from (hopefully default) PNG to JPEG. Should Ī̲ craft it in advance?

I would recommend post-processing PNG thumbs by reducing them to 256 colors. This usually gives a very significant reduction in file size at little cost to quality for drawing and text content. In ImageMagick, this is done by prefixing the output file name with PNG8:. Its handling of transparency is not good, but documents don't have that issue.
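
As a rough sketch of that post-processing step, the ImageMagick invocation can be built like this in Python. The helper name and the output file name are hypothetical, and the classic convert binary is assumed (ImageMagick 7 installs it as magick).

```python
from typing import List


def png8_command(src: str, dst: str, binary: str = "convert") -> List[str]:
    """Build an ImageMagick command that rewrites `src` as a 256-color PNG."""
    # The PNG8: prefix on the output name requests an 8-bit palette PNG.
    return [binary, src, f"PNG8:{dst}"]


# after_gs.png is the attachment from this report; the output name is made up.
cmd = png8_command("after_gs.png", "after_gs_256.png")
```

The list can be handed to subprocess.run(cmd, check=True) where ImageMagick is installed.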