Parsoid: Export as LaTeX
Closed, DeclinedPublic

Description

[Suggestion from community.]

Not sure if we'd want to do this?


Version: unspecified
Severity: enhancement

Details

Reference
bz37933

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 12:26 AM
bzimport added a project: Parsoid-DOM.
bzimport set Reference to bz37933.

We are building a highly marked-up HTML DOM (see for example http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary and http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata, or http://parsoid.wmflabs.org for live output), which should be relatively easy to convert to LaTeX with existing tools. If additional information in the DOM is needed, then please let us know here!

Also changing the title to Parsoid since I imagine this to be more about the actual conversion than a button to start it.

Just a clarification: We don't currently plan to work on this ourselves, but would be happy to support somebody taking this on. A quick test using the pandoc tool (http://johnmacfarlane.net/pandoc/, apt-get install pandoc) looks quite promising:

pandoc -s -r html http://parsoid.wmflabs.org/en:Foo -o Foo.tex

The output could likely be improved by making use of the extra information contained in the Parsoid DOM, for example by adding a separate HTML flavor to pandoc. If Haskell is not your favorite language, there seem to be other html -> latex converters around, including at least one in Java: https://www.google.de/search?num=100&hl=en&ie=UTF-8&oe=UTF-8&q=convert%20html%20latex
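
To make the idea of using the extra DOM information concrete, here is a minimal Python sketch, not a real converter. It assumes that Parsoid marks internal links with rel="mw:WikiLink" and external links with rel="mw:ExtLink" (as described in the DOM spec pages linked above), and it handles only a handful of tags. The names ParsoidToLatex, escape_latex and parsoid_html_to_latex are made up for this illustration; the \href output assumes the LaTeX hyperref package.

```python
from html.parser import HTMLParser

# Small, deliberately incomplete set of LaTeX text-mode escapes.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
}


def escape_latex(text):
    """Escape characters that are special in LaTeX text mode."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


class ParsoidToLatex(HTMLParser):
    """Hypothetical sketch: maps a few HTML tags to LaTeX commands."""

    # tag -> (text emitted on open, text emitted on close)
    SIMPLE = {
        "b": (r"\textbf{", "}"),
        "i": (r"\emph{", "}"),
        "h2": (r"\section{", "}\n\n"),
        "p": ("", "\n\n"),
    }

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.closers = []  # stack of (tag, closing text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SIMPLE:
            opener, closer = self.SIMPLE[tag]
            self.out.append(opener)
            self.closers.append((tag, closer))
        elif tag == "a":
            rel = (attrs.get("rel") or "").split()
            href = attrs.get("href") or ""
            if "mw:ExtLink" in rel:
                # Assumption: external links become hyperref \href links.
                url = href.replace("%", r"\%").replace("#", r"\#")
                self.out.append(r"\href{" + url + "}{")
                self.closers.append((tag, "}"))
            else:
                # Internal wiki links (mw:WikiLink) keep only their text.
                self.closers.append((tag, ""))

    def handle_endtag(self, tag):
        # Only close tags we opened; anything else is ignored.
        if self.closers and self.closers[-1][0] == tag:
            self.out.append(self.closers.pop()[1])

    def handle_data(self, data):
        self.out.append(escape_latex(data))


def parsoid_html_to_latex(html):
    parser = ParsoidToLatex()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        '<h2>Intro</h2><p>See <a rel="mw:WikiLink" href="./Foo">Foo</a> and '
        '<a rel="mw:ExtLink" href="http://example.org/">this site</a>.</p>'
    )
    print(parsoid_html_to_latex(sample))
```

The point of the sketch is only that the rel attribute lets a converter treat internal and external links differently without guessing from the URL; a real tool would need far more complete tag coverage, escaping, and handling of unclosed tags.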

Mass-moving bugs into the new 'Parsoid' product.

Closing as wontfix, as we don't plan to tackle this as part of the Parsoid project.

dirk.hunniger wrote:

I actually solved this problem. I used the same basic technologies as pandoc. The Debian package is available as mediawiki2latex in Debian sid. The command-line tool takes a URL to a wiki page and writes a PDF file generated with LaTeX. The LaTeX source tree, including processed images, can also be exported.
Yours, Dirk Hünniger

(In reply to comment #6)

So is this based on HTML or wikitext?

dirk.hunniger wrote:

It's your choice. The standard mode is HTML, but if you provide the -m command-line option it is based on wikitext.

(In reply to comment #8)

The HTML mode sounds great. Would it be hard to take advantage of the extra metadata the Parsoid HTML5+RDFa [1] offers?

[1]: http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec

dirk.hunniger wrote:

Yes, you can do that. You just need to find someone to implement it, since I lack the time at the moment and will likely continue to for the foreseeable future. The second point is that I cannot see any advantage in using this data: from what I can see at a glance, everything useful is already in the normal HTML, and I am already using all of that.

dirk.hunniger wrote:

Actually, C. Scott Ananian is currently working on this bug. Maybe you should update the status and assignee of this bug. C. Scott Ananian is, in particular, a developer on the Parsoid project.

https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOfflineContentGenerator%2Flatex_renderer

@Dirk: This bug is about implementing this in Parsoid, which we don't plan to do.

The LaTeX renderer in the Collection extension leverages Parsoid HTML, but is a separate project. This makes it useful independently of Parsoid, for example once we switch to HTML storage.