Problematic book export to PDF/ODF for bidi documents
Closed, ResolvedPublic

Description

Author: itay_is_me

Description:
PDF and ODF exports of a bidi document, part of a Hebrew book from the Hebrew Wikibooks

When trying to export a collection of bidi documents (downloading the collection), there are problems with both the PDF and the ODF files:

  • Text in PDF files is mirrored (ordered from left to right instead of right to left). For example, instead of תכנות מתקדם, it is written םדקתמ תונכת.
  • Headers inside the document are displayed as blank squares (probably an unsuitable font was used).
  • ODF files are marked as LTR documents instead of RTL.
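
The mirroring in the first bullet is what one would expect if the renderer draws the stored (logical) character order from left to right without any bidi reordering. A minimal Python sketch, assuming a single pure right-to-left run with no embedded LTR text, digits or combining marks; `to_visual_rtl_run` is a hypothetical helper and not part of the export software:

```python
# Logical (storage) order of the example title from the report, written with
# escapes so the source itself contains no right-to-left text.
LOGICAL = "\u05ea\u05db\u05e0\u05d5\u05ea \u05de\u05ea\u05e7\u05d3\u05dd"


def to_visual_rtl_run(logical: str) -> str:
    """Return the left-to-right drawing order for a pure RTL run.

    Assumption: the run contains no LTR text, digits or combining marks.
    Real text needs the full Unicode Bidirectional Algorithm.
    """
    return logical[::-1]


visual = to_visual_rtl_run(LOGICAL)
# The first stored letter (tav, U+05EA) ends up last, i.e. drawn rightmost.
print([f"U+{ord(c):04X}" for c in visual])
```

A renderer that only draws left to right would need to receive `visual` rather than `LOGICAL` for the title to read correctly.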

Version: unspecified
Severity: enhancement
OS: Windows NT
Platform: PC

Details

Reference
bz17766

Event Timeline

bzimport raised the priority of this task from to Lowest. Nov 21 2014, 10:32 PM
bzimport added projects: Collection, I18n.
bzimport set Reference to bz17766.

Yeah, RTL is not currently working in the PDF export... additionally, character shaping doesn't happen for Arabic script.

The ODF bit might actually be an easier fix, if it mainly comes to marking the document language/direction... though embedded LTR bits might be a problem.
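
For reference, the OpenDocument format expresses direction through the `style:writing-mode` attribute (for example `rl-tb` for right-to-left, top-to-bottom). A hedged Python sketch that only builds such a style fragment; how the ODF writer behind the Collection extension assembles its styles is not shown in this report, so the helper name and the place where the fragment would be attached are assumptions:

```python
import xml.etree.ElementTree as ET

# Namespace of the ODF "style" vocabulary.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
ET.register_namespace("style", STYLE_NS)


def rtl_paragraph_properties() -> ET.Element:
    """Build a <style:paragraph-properties> element marked right-to-left.

    Hypothetical helper: a real writer would place this inside a paragraph
    style and would still have to handle embedded LTR runs separately.
    """
    props = ET.Element(f"{{{STYLE_NS}}}paragraph-properties")
    props.set(f"{{{STYLE_NS}}}writing-mode", "rl-tb")
    return props


fragment = ET.tostring(rtl_paragraph_properties(), encoding="unicode")
print(fragment)
```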

rpschirmer wrote:

While creating a PDF version, the following error has been occurring for the past few days:

POST request failed
From Wikipedia, the free encyclopedia
Jump to: navigation, search

The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed (Empty reply from server).

Return to the page Wikipedia:Hauptseite.

(In reply to comment #2)

While creating a PDF version, the following error has been occurring for the past few days:

POST request failed
From Wikipedia, the free encyclopedia
Jump to: navigation, search

The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed
(Empty reply from server).

Return to the page Wikipedia:Hauptseite.

This has also been filed as bug 18816

*** Bug 23893 has been marked as a duplicate of this bug. ***

Can someone from PediaPress clarify the root cause of this issue? After two and a half years I'd like to get this resolved sooner rather than later, and it is not clear to me where the issue is coming from.

I believe it needs support for bidirectional text and complex scripts in the underlying ReportLab library that does the PDF output.

Some googling indicates there are at least some RTL/bidi patches around, some of which may or may not have been merged upstream. Last I saw, though, using fribidi for Arabic shaping was sufficient only for the Arabic language, not for other languages written in Arabic script such as Farsi and Pashto.

OK. Since PediaPress does not appear to be at all interested in getting this fixed, we should just disable the Collection extension wherever it does not work because of missing script support. I have checked the Arabic and Hebrew Wikipedias, and Collection is not enabled there, so that appears to be fine.

We hid it on fa.wp with JavaScript.

volker.haas wrote:

The problem is not that PediaPress is not interested in fixing this. The problem is that the framework we use to generate PDFs (reportlab) does not support right-to-left languages properly. I have been in discussion with the developers and with some Israelis (mainly for QA), and we have made slow progress. It now looks as if all major issues have been ironed out and the PDF export is ready for Hebrew at least. Check the sample that I rendered ~2 days ago [1]; beware that all contents are random since I do not speak or read Hebrew.
So far I have feedback from Hebrew speakers who think that the quality is good and that the PDF export is ready for production use for Hebrew. As a matter of fact, we contacted the WMF a week ago to inform them that we think the PDF export is ready to go live on the Hebrew Wikipedia, and the Collection extension is scheduled to be activated there later today.

Please keep in mind that the PDF export is open source software and everybody can contribute. I have already spent a substantial amount of time improving RTL support, but as someone who neither speaks nor reads any of these languages, I make progress slowly.

Please contribute!

My last tests indicated that there are still some problems for Arabic. If anybody is willing to contribute, I'd be happy to accept patches.

[1] http://pediapress.com/files/he/sample_1.pdf

volker.haas wrote:

This bug report specifically mentioned one article [1]. I fixed the remaining issue (wrong direction in source nodes) with [2]. I updated the render servers. The article now looks correct to me (at least the squiggly lines in my browser closely resemble the ones in the PDF).

I am closing this ticket; please open new ones for specific Hebrew issues.

[1] http://he.wikibooks.org/wiki/%D7%AA%D7%9B%D7%A0%D7%95%D7%AA_%D7%9E%D7%AA%D7%A7%D7%93%D7%9D_%D7%91-Java/%D7%96%D7%A8%D7%9E%D7%99%D7%9D
[2] http://code.pediapress.com/git/mwlib.rl?p=mwlib.rl;a=commit;h=bde6f53e9159a8869a8b1a6d529d657fd2ca3d5a

I still see wrong direction in the source nodes for that page.

Preformatted sections are showing left-aligned, but in a variable-width font instead of a monospace one.

The final template on the page seems to render as a tiny table containing only '-', '"', and '-' respectively in each cell.

Other than that it looks about right; the bidi layout generally seems to match what I see in a modern browser. Hebrew is simpler than Arabic though as it doesn't have the glyph shaping / ligature things, so that needs to be clearly tested as well.

volker.haas wrote:

(In reply to comment #12)

I still see wrong direction in the source nodes for that page.

The wrong direction in the source nodes is a caching issue. To make sure that the latest version of the software is used, you need to create a (one-article) collection rather than render via the "download as PDF" link. That is unfortunate and I don't know how to fix it, but the problem resolves itself pretty quickly...

Preformatted sections are showing left-aligned, but in a variable-width font
instead of a monospace one.

The problem is that the text in question is not recognized as a preformatted section at all... I'll investigate; it seems like a mwlib parser issue.

The final template on the page seems to render as a tiny table containing only
'-', '"', and '-' respectively in each cell.

I'll check that. To be honest, I think that table should probably not be printed at all, since it seems to be some navigational template.

Other than that it looks about right; the bidi layout generally seems to match
what I see in a modern browser. Hebrew is simpler than Arabic though as it
doesn't have the glyph shaping / ligature things, so that needs to be clearly
tested as well.

volker.haas wrote:

(In reply to comment #13)

(In reply to comment #12)

Preformatted sections are showing left-aligned, but in a variable-width font
instead of a monospace one.

The problem is that the text in question is not recognized as a preformatted
section at all... I'll investigate; it seems like a mwlib parser issue.

This is now fixed with http://code.pediapress.com/git/mwlib?p=mwlib;a=commit;h=dc8311e85de779d991fe34d7f09879006801a998

Great. Does Wikimedia need to do anything to get this deployed, or is this all on your end?

I've filed a minor additional bug with the page footer in Hebrew as bug 30223.

Arabic support however seems to still be very problematic.

I tried exporting a random page from ar.wikibooks.org:
https://secure.wikimedia.org/wikibooks/ar/wiki/%D8%B3%D9%84%D9%81%D9%86%D9%8A_3_%D8%AC%D9%86%D9%8A%D9%87:_%D8%A7%D9%84%D8%A5%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA_%D9%88%D8%A7%D9%84%D9%85%D8%AC%D8%AA%D9%85%D8%B9_%D9%81%D9%8A_%D9%85%D8%B5%D8%B1

The text positioning is completely wrong; instead of being right-justified things seem to float somewhere in the middle. This may indicate incorrect handling of font metrics?

There also appear generic box characters in a large number of places (I think where a zero-width non-joiner appears, which happens *a lot* on that page).
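
One way to check the zero-width non-joiner guess is to list the invisible format characters in the page text, since a font without glyphs for them may draw a fallback box. A minimal Python sketch using only the standard library; `find_format_chars` and the sample word are illustrative assumptions and are not taken from the exported page:

```python
import unicodedata


def find_format_chars(text):
    """Return (index, Unicode name) for every format character (category Cf)."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# A Persian word containing U+200C ZERO WIDTH NON-JOINER, written with escapes.
SAMPLE = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
hits = find_format_chars(SAMPLE)
print(hits)
```

Note that the non-joiner still controls how neighbouring letters join, so a renderer would presumably have to apply it during shaping and then skip drawing it, rather than strip it from the text beforehand.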

Other than the boxes, shaping looks more or less ok but I can't read Arabic myself so I could easily be missing some additional details.

PDF attachment to follow.

Created attachment 8884
Arabic page export from ar.wikibooks.org

See above comment describing rendering bugs.

I can confirm that the directionality on source nodes has been fixed with a forced reload of the Hebrew page. The font fix for <pre> sections I assume hasn't been installed yet on the generator server.

Created attachment 8888
Screenshot of misplaced hebrew nukta vs web rendering

Hebrew also has failures -- when combining characters for vowel markings (nukta) are used, they do not combine, but stack up as separate characters along the line.

attachment Screen Shot 2011-08-05 at 2.06.07 AM.png ignored as obsolete

Created attachment 8889
Fixed screenshot

Not sure what broke on the previous image.

Attached:

screen_shot.png (294×782 px, 65 KB)

volker.haas wrote:

(In reply to comment #15)

Great. Does Wikimedia need to do anything to get this deployed, or is this all
on your end?

Deployment of the rendering software is done by me. Activation of the Collection extension is done by WMF.

volker.haas wrote:

I investigated the 'nukta issue'.

Some preliminary remarks:

We are using a Python library which implements the bidi algorithm. This algorithm basically reorders characters from their logical ordering (the order of storage) to their visual ordering. The library uses the fribidi C library. Details can be found at [1].

Based on the tests I have done, I believe the fribidi library screws up when reordering:

word investigated:
חַיְפַא

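Logical ordering (this is how the string is stored):
ח 1495
1463 (combining mark, patah)
י 1497
1456 (combining mark, sheva)
פ 1508
1463 (combining mark, patah)
א 1488

ERRONEOUS transformation by fribidi:
א 1488
פ 1508
1463 (combining mark, patah)
י 1497
1456 (combining mark, sheva)
ח 1495
1463 (combining mark, patah)

Correct transformation (manually transformed):
א 1488
1463 (combining mark, patah)
פ 1508
1456 (combining mark, sheva)
י 1497
1463 (combining mark, patah)
ח 1495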

I checked the manual transformation in the PDF and the result is as expected (same as in the browser).

Minimal example in Python:

First install pyfribidi: easy_install pyfribidi

Then run python (or ipython):

In [35]: import pyfribidi2

In [36]: text = unicode('חַיְפַא', 'utf-8')

In [37]: bidi_trans = lambda t: pyfribidi2.log2vis(t, base_direction=pyfribidi2.RTL)

In [38]: for c in bidi_trans(text): print c, ord(c)
   ....:
א 1488
פ 1508
1463
י 1497
1456
ח 1495
1463

To me it looks as if the fribidi library needs to be fixed. Help is welcome ;)

[1] http://pypi.python.org/pypi/pyfribidi/0.10.0

Hmmm.... the fribidi transformation actually looks legit to me; the combining characters should appear after their base characters in the stream, same as in Latin ("e", "combining acute" -> renders like "é")

It looks like fribidi deliberately switched *to* keeping the combining characters logically after their base characters some years ago; here's some old threads on the subject:

http://www.mail-archive.com/linux-utf8@nl.linux.org/msg01710.html
http://lists.freedesktop.org/archives/fribidi/2002-March/000067.html

It may be that something specific about how the underlying PDF library handles fonts and combining characters could be incorrectly pushing them to the right of their base characters; or it may simply not support combining characters and so is inserting them visually in the logical order as if they were their own letters...?

Excuse me, in my last report I made some mistakes: LTF => RTF

wikimedia wrote:

FriBidi has an option to NOT reorder non-spacing marks. The problem is, for correct rendering of complex text FriBidi is not enough. You can add more heuristics, but it will never be the real thing.
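
The two orderings discussed in this thread can be reproduced with the Python standard library alone. This is a minimal sketch for a single pure right-to-left run, not the full bidi algorithm and not fribidi's implementation; both function names are hypothetical:

```python
import unicodedata

# The word from the investigation above, in logical (storage) order:
# het, patah, yod, sheva, pe, patah, alef.
WORD = "\u05d7\u05b7\u05d9\u05b0\u05e4\u05b7\u05d0"


def reverse_plain(text):
    """Reverse code point by code point; marks end up before their base."""
    return text[::-1]


def reverse_keep_marks_after_base(text):
    """Reverse base-plus-marks clusters; marks stay after their base."""
    clusters = []
    for ch in text:
        # A non-zero combining class identifies a combining mark.
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


plain = [ord(c) for c in reverse_plain(WORD)]
clustered = [ord(c) for c in reverse_keep_marks_after_base(WORD)]
print(plain)
print(clustered)
```

`clustered` corresponds to the sequence reported as fribidi's output, and `plain` to the manually transformed sequence that rendered as expected in the PDF. Which one a renderer needs depends on whether it expects marks before or after their base glyph in visual order.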

(In reply to comment #24)

Hi, I checked it on fa.wikipedia and fa.wikibooks; these are some samples:
1-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=e600cdf555951c7b&writer=rl&return_to=%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86%3A%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87+%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C
2-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=8070f397f7a74170&writer=rl&return_to=%D8%A7%D8%AE%D9%84%D8%A7%D9%82+%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C
3-http://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=c232c1e277b29a56&writer=rl&return_to=%DA%A9%D8%A7%D9%81%DB%8C%E2%80%8C%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA
They have some bugs:
1-all of them have the LTF problem
2-case 1,3 have problem with ل.
3-in case 3 it has problem with اً in معمولاً
4-in ar.wikibooks.org they don't have center direction! Where can we change the PDF text direction to LTF?
5-In the last page ﻫﺎﻫﺎ ﻭ ﻣﺸﺎﺭﮐﺖﻣﻨﺎﺑﻊ ﻣﻘﺎﻟﻪ is incorrect it must be
مشارکت‌ها و منابع مقاله‌ها
6-Infobox in case 3 has problem it must be like
http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D9%81%DB%8C_%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA

۱-http://fa.wikibooks.org/wiki/%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86:%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87_%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C
۲-http://fa.wikibooks.org/wiki/%D8%A7%D8%AE%D9%84%D8%A7%D9%82_%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C

volker.haas wrote:

(In reply to comment #23)

Hmmm.... the fribidi transformation actually looks legit to me; the combining
characters should appear after their base characters in the stream, same as in
Latin ("e", "combining acute" -> renders like "é")

It looks like fribidi deliberately switched *to* keeping the combining
characters logically after their base characters some years ago; here's some
old threads on the subject:

http://www.mail-archive.com/linux-utf8@nl.linux.org/msg01710.html
http://lists.freedesktop.org/archives/fribidi/2002-March/000067.html

It may be that something specific about how the underlying PDF library handles
fonts and combining characters could be incorrectly pushing them to the right
of their base characters; or it may simply not support combining characters and
so is inserting them visually in the logical order as if they were their own
letters...?

Thanks for the info Brion. If you are right then the error is indeed in the "low level rendering" done by the PDF framework. I started investigating the reportlab source code...

To everyone else:

  • could you please always provide minimal examples (shortest possible markup example which exposes some problem)
  • clearly describe what you expect vs. what you get
  • open separate tickets for separate problems
  • keep problems related to different scripts in different tickets (right now I am focusing on Hebrew; after that I'll start with Arabic)

Thanks!

volker.haas wrote:

Thanks for the info regarding the fribidi library, Behdad! If I read the relevant part of the Unicode spec correctly it has to be expected that some software can't deal with reordered non-spacing marks [1]. Therefore it seems valid to not reorder them. I made the necessary change in pyfribidi [2] and mwlib [3].

The issue Brion raised originally should be fixed. I tested the following for correctness [4]

I am closing this ticket now. Please open specific tickets for other issues.

[1] section 5.13 in http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
[2] http://pypi.python.org/pypi/pyfribidi
[3] http://code.pediapress.com/git/?p=mwlib.ext/.git;a=commit;h=e4bba86023c78ac800362dfe59edfecb2ff3adbb
[4] http://he.wikipedia.org/w/index.php?title=%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:Volker.haas&oldid=10980308

(In reply to comment #30)

Fixed?
I tested it on a random page on the Persian Wikipedia:
http://fa.wikipedia.org/wiki/%D9%BE%D9%84%D8%A7%D8%B3%D9%85%D8%A7%DB%8C_%DA%A9%D9%88%D8%A7%D8%B1%DA%A9_%DA%AF%D9%84%D9%88%D8%A6%D9%88%D9%86

The output is still LTR and has a problem with the character ل.

the problem is with لا not ل
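
The difference between ل and لا matters because lam followed by alef is normally drawn as a single ligature, which requires a shaping step. A small Python sketch, using only the standard library, of the relationship between the two stored letters and the Unicode presentation-form ligature; it illustrates the mapping only and is not how the PDF renderer performs shaping (`to_base_letters` is a hypothetical helper):

```python
import unicodedata

LAM = "\u0644"                # ARABIC LETTER LAM
ALEF = "\u0627"               # ARABIC LETTER ALEF
LAM_ALEF_ISOLATED = "\ufefb"  # presentation-form ligature


def to_base_letters(text):
    """Map Arabic presentation forms back to base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Stored text uses two letters; a shaper has to choose one ligature glyph.
print(unicodedata.name(LAM_ALEF_ISOLATED))
print(to_base_letters(LAM_ALEF_ISOLATED) == LAM + ALEF)
```

The same normalization can help when comparing PDF text that was extracted as presentation forms against the original wiki text.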

Another bug is the very large file size: the PDF version of the article "تهران" ("Tehran") is 19 MB!
Is there a way to make the PDFs smaller? Almost all readers of the Persian Wikipedia are Iranian, and Iranian law does not allow ordinary people to have Internet connections faster than 128 kb/s :(

(In reply to comment #32)
(In reply to comment #30)
Please open a new issue for this. Comment 32 is not related to this issue, and comment 30 is an issue that is a lot smaller than what we originally started off with.

(In reply to comment #33)

(In reply to comment #32)
(In reply to comment #30)
Please open a new issue for this. Comment 32 is not related to this issue, and
comment 30 is an issue that is a lot smaller than what we originally started
off with.

I filed bug 23893, which was marked as a duplicate of this bug. Please reopen this bug or bug 23893.

(In reply to comment #33)

(In reply to comment #32)
(In reply to comment #30)
Please open a new issue for this. Comment 32 is not related to this issue, and
comment 30 is an issue that is a lot smaller than what we originally started
off with.

I opened [https://bugzilla.wikimedia.org/show_bug.cgi?id=30326 bug 30326]

*** This bug has been marked as a duplicate of bug 30326 ***