
Create additional components for PDF Rendering
Closed, Declined (Public)

Description

I would propose adding two components:

  • "Wikimedia/PDF Rendering" (or "MediaWiki extensions/PDF Rendering") for the new renderer
  • "Wikimedia/mwlib PDF Rendering" (or "MediaWiki extensions/mwlib PDF Rendering") for the old mwlib-based one (to recategorize bugs).

I would like bugs related only to the "Collection" extension to be kept separate from rendering issues.

The reason is I am still willing to help with the extension itself (the bookmarking tool), and sometimes I can help with the mwlib renderer
(running one for myself), but not necessarily with the new one.

Therefore I'd like to be on the default CC list for "Collection" only, if possible.


Version: wmf-deployment
Severity: normal

Details

Reference
bz69603

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:33 AM
bzimport set Reference to bz69603.

I'm basically fine with this - whatever suits developers.
@C. Scott?

I would like bugs related only to the "Collection" extension to
be kept separate from rendering issues.

...I'd have no idea how a reporter or triager would be able to distinguish them, and would appreciate a link explaining how to debug. :-/

Well, all the bugs seem to be named "New PDF renderer", so somehow they know :)

I just wonder how folks know that the "Collection" extension is the *right* place; the special page is called "Book".

http://en.wikipedia.org/wiki/Help:Books/Feedback refers to #pediapress on IRC (where some folk do come) and code.pediapress.com bugtracker.

We currently have the option to pick either the mwlib renderer or the OCG LaTeX renderer.

This is the #1 question that needs to be asked anyway for any bug regarding rendering, if not stated by the reporter.

We can sort it out differently (I don't mind having a keyword, maybe), but I'd like to opt out of notifications for "new renderer" bugs, so I think a component is the better place.

I don't have a strong opinion on this. The plan is to deprecate the mwlib renderer over time, so I guess the idea was that we wouldn't need two different PDF rendering bugzilla components in the long run. But there will be ZIM and other backends for the Collection extension, so having bug reports split by backend seems reasonable.

Where exactly is the codebase for the "new" PDF renderer located?

https://git.wikimedia.org/summary/?r=mediawiki/extensions/Collection/OfflineContentGenerator.git

https://git.wikimedia.org/summary/?r=mediawiki/extensions/Collection/OfflineContentGenerator/bundler.git

plus a few more under "OfflineContentGenerator".

I personally regret that we have it under "Extension:Collection": this extension is just a bookmarking service that hands over the product ("a collection") to other services, renderers or otherwise. But this was discussed way back in November 2013 (https://www.mediawiki.org/w/index.php?title=Git%2FNew_repositories%2FRequests%2FEntries&diff=822347&oldid=821692).

Actually, Extension:Collection could be just something like pinboard.in for MediaWiki installations.

Created attachment 16333
Tab-separated value spreadsheet to sort Collection bugs

Please also see bug 30511 comment 7 for support for this bug :)

We also have bug 31552 which (used to?) track various PDF issues.

There are some bugs specific to ZIM (bug 29817, bug 29849, bug 30199), we might have a component or a tracking bug for these.

We might also need a tracking bug for complex script handling, since lots of mwlib bugs might need to be re-tested with OCG. There is also bug 35568, which has already been confirmed in OCG as well.

In the attachment I have assigned bugs up to 56564 to new components:

  • "Collection" - for bookmarking/extension PHP/JS/CSS bugs
  • "mwlib renderer" - for bugs reported against mwlib (mostly PDF) rendering
  • "mwlib renderer for ZIM" - for bugs reported specifically against ZIM generation
  • "content fetch" - temporary name for a category of recent bugs related to switching to the Wikimedia API to render wikitext instead of the older Export method. This was necessary due to the switch to Lua and is a pretty fundamental change to the whole mwlib approach
  • "OfflineContentGenerator" - bugs reported/confirmed against the "new PDF renderer"

Andre: we need to get this done before Wednesday, since we're going to turn on the new PDF service more broadly and I want to make sure that new bugs filed go into the "right" spot (whatever that is).

I think it would actually be best to create a new top-level component, called "Offline content service". (Unfortunately this won't help much with discoverability for those who want to file bugs against "Download as PDF", but at least it's technically accurate.)

Underneath this top-level component would be the components, "PDF renderer", "ZIM renderer", "ePUB renderer", "Bundler", and "Service". We might also have a "mwlib" component, although my understanding is that there is a non-WMF bugtracker which is better suited. Any mwlib-specific bugs should be transferred to the external bugtracker and closed. But I'd be willing to add a "mwlib" component for these issues if it would aid transition.

The existing "Collection" component under extensions would be used only for issues related to the extension itself (which communicates with the OCG service, or with an external mwlib service), as Marcin wants.

(In reply to C. Scott Ananian from comment #7)

Andre: we need to get this done before Wednesday

Alright, let's see. Sorry, this got off the radar.

I think it would actually be best to create a new top-level component,
called "Offline content service". (Unfortunately this won't help much with
discoverability for those who want to file bugs against "Download as PDF",
but at least it's technically accurate.)

Underneath this top-level component would be the components, "PDF renderer",
"ZIM renderer", "ePUB renderer", "Bundler", and "Service".

I assume that reflects the code architecture? Is there some scheme so a non-developer could in theory understand how things are supposed to work?

Regarding the ZIM renderer, there is already an openZIM product: https://bugzilla.wikimedia.org/describecomponents.cgi?product=openZIM - how does this differ?

What would the component descriptions be?
I have no idea what "Bundler" and "Service" do and how I could tell...
In general, see https://www.mediawiki.org/wiki/Bug_management/Project_Maintainers#To_add_a_project_or_component for required info.

We might also have a "mwlib" component, although my understanding is that there is a
non-WMF bugtracker which is better suited. Any mwlib-specific bugs should
be transferred to the external bugtracker and closed.

Currently the description for "Collection" at https://bugzilla.wikimedia.org/describecomponents.cgi?product=MediaWiki%20extensions says:

Page collection extension for PDF/EPUB/ODT/OpenZIM/Okawix generation via PediaPress's mwlib backend (Homepage). Primarily for the Special:Book frontend. Note: There is an upstream bug tracker at http://pediapress.com/code/ maintained by PediaPress folks separately.

...which boils down to https://github.com/pediapress

But I'd be willing to
add a "mwlib" component for these issues if it would aid transition.

I'd prefer to point people to https://github.com/pediapress instead and explain how they can realize that they should go there (proposals?).

The existing "Collection" component under extensions would be used only for
issues related to the extension itself (which communicates with the OCG
service, or with an external mwlib service), as Marcin wants.

Yes, the architecture consists of a WMF extension (Collection) which talks to a service (mw-ocg-service) which schedules jobs and communicates with the extension. Jobs start with the bundler (mw-ocg-bundler) which spiders the requested articles and fetches all needed resources (images, stylesheets, text). The bundle is then given to one of several different backends. Right now there is a PDF backend and a plaintext backend. ZIM and ePub are next on the roadmap.

The ZIM backend *uses* the OpenZIM/zimwriter component, but it is a separate piece of code. Its primary job is to rewrite the html and organize the resources into a standalone tree (which it then gives to zimwriter).
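
The flow described above (the service receives a job, the bundler gathers resources, then one backend renders the bundle) can be sketched roughly as follows. This is only an illustrative model, not the actual mw-ocg-service or mw-ocg-bundler code: `Bundle`, `bundle`, `render_pdf`, `render_text`, `BACKENDS` and `run_job` are all hypothetical names, and the "rendering" just returns a label.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Bundle:
    """Hypothetical stand-in for the bundler output: articles plus their resources."""
    articles: List[str]
    resources: Dict[str, List[str]] = field(default_factory=dict)


def bundle(titles: List[str]) -> Bundle:
    """Hypothetical bundler step: record which resources each article needs.

    A real bundler would spider the articles and fetch images, stylesheets
    and text; here we only invent placeholder file names.
    """
    resources = {title: [f"{title}.html", f"{title}.css"] for title in titles}
    return Bundle(articles=list(titles), resources=resources)


def render_pdf(b: Bundle) -> str:
    """Placeholder PDF backend: returns a label instead of a real document."""
    return f"PDF({len(b.articles)} articles)"


def render_text(b: Bundle) -> str:
    """Placeholder plaintext backend."""
    return f"TEXT({len(b.articles)} articles)"


# Only the two backends mentioned as existing; ZIM and ePub would be added here later.
BACKENDS: Dict[str, Callable[[Bundle], str]] = {
    "pdf": render_pdf,
    "text": render_text,
}


def run_job(titles: List[str], fmt: str) -> str:
    """Hypothetical service step: bundle first, then hand the bundle to one backend."""
    if fmt not in BACKENDS:
        raise ValueError(f"no backend for format {fmt!r}")
    return BACKENDS[fmt](bundle(titles))
```

In this model a bug in `bundle` would affect every output format, while a bug in `render_pdf` would affect only PDFs, which is the distinction the component split is meant to capture.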

So here are my proposed components and (updated) descriptions. I'm going to start simple with just two components under the OCG product; the others can be added as the need arises.


MediaWiki extensions/Collection:

Page collection extension for creating offline formats (PDF/EPUB/ZIM/plaintext) using the Offline Content Generator service ([[mw:OCG]]). Primarily for the Special:Book frontend.

The Collection extension can also talk to pediapress; bugs for the pediapress backend should be filed at http://pediapress.com/code/.

OCG: (new product)
The Offline Content Generator (OCG) service creates offline formats, such as PDF, ePub, ZIM, and even plaintext, from collections of MediaWiki articles. Issues with the "Download as PDF" sidebar link can be filed here. ([[mw:OCG Homepage]])

OCG/General:
Issues with OCG that do not fit into the other components (or if you are unsure).

OCG/PDF renderer:

Issues with OCG related to PDF rendering.

The default cc field for all OCG bugs should be ocg-team@wikimedia.org. No bug voting, please.

(In reply to Marcin Cieślak from comment #0)

I would propose adding two components:

  • "Wikimedia/PDF Rendering" (or "MediaWiki extensions/PDF Rendering") for the new renderer

Aaargh, this was perfect, why did it become a product?!
If it really needs to be a product, it should be

  • Product: Collection
    • Component: General (Collection extension)
    • Component: new PDF renderer/rdf2latex/whatever

I don't want to search everything across two products, it's messy.

(In reply to Marcin Cieślak from comment #6)

Created attachment 16333 [details]
Tab-separated value spreadsheet to sort Collection bugs

Does this mean you're already working on the retriaging? I'm currently working on https://etherpad.wikimedia.org/p/BugTriage-mwlib; the status of the bugzilla components looks too unstable for me to do anything.

At the very least, the "General/Unknown" component should be renamed to something else; otherwise it's very hard to search the OCG product together with the Collection component without also including the General/Unknown component in MediaWiki extensions.

Example custom query currently needed: https://bugzilla.wikimedia.org/buglist.cgi?f1=blocked&f2=OP&f3=component&f4=product&f5=CP&j2=OR&list_id=346828&o1=equals&o3=equals&o4=equals&query_format=advanced&resolution=---&v1=745&v3=Collection&v4=OCG
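
To make it easier to see what that URL asks for, here is a small sketch that decodes its custom-search parameters using only the Python standard library. It assumes the usual Bugzilla advanced-search convention, where `fN`/`oN`/`vN` form a field/operator/value triple, `OP` and `CP` open and close a group, and `jN=OR` joins the group with OR; the helper name `clauses` is made up for this example.

```python
from urllib.parse import parse_qs, urlsplit

QUERY_URL = (
    "https://bugzilla.wikimedia.org/buglist.cgi?f1=blocked&f2=OP&f3=component"
    "&f4=product&f5=CP&j2=OR&list_id=346828&o1=equals&o3=equals&o4=equals"
    "&query_format=advanced&resolution=---&v1=745&v3=Collection&v4=OCG"
)

# Each parameter appears once in this URL, so keep only the first value.
params = {key: values[0] for key, values in parse_qs(urlsplit(QUERY_URL).query).items()}


def clauses(params):
    """Walk f1, f2, ... and collect either group markers or (field, op, value) triples."""
    result = []
    n = 1
    while f"f{n}" in params:
        field_name = params[f"f{n}"]
        if field_name in ("OP", "CP"):
            # Group open/close markers carry no operator or value.
            result.append(field_name)
        else:
            result.append((field_name, params[f"o{n}"], params[f"v{n}"]))
        n += 1
    return result


for clause in clauses(params):
    print(clause)
```

Read under those assumptions, the query is: open bugs that block bug 745 AND (component is Collection OR product is OCG).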

We're about to migrate to Phabricator. I think it's not worth fixing issues with how bugzilla manages search. The "General/Unknown" component is the standard for every product in bugzilla, AFAIK.

I'm happy with the current situation. Most "new" bugs are with the OCG production. The collection extension is mature, issues shouldn't be brought to Marcin's attention unless they are actually with the extension (the UI of adding pages to the collection).

I would like to re-close this bug.

Having reviewed the bugs (see the attachment) I would propose to rename the component to "PDF generation" or maybe "PDF and offline formats generation" to make sure people searching for "PDF" find it there first.

Since we have some bugs already that are reproducible in both mwlib and the ocg renderer I would suggest to use tracking bugs to mark bugs as confirmed against some particular renderer. We have already a tracking bug for ZIM, I also wonder if we should not have a separate bug for rendering of RTL and some other complex scripts (it's a whole class of problems there on its own).

So, component shouldn't be renderer-specific imho.

Nemo: yes, I did work on triage, actually. I have even re-tested a few bugs and commented on them accordingly. I have used the existing Bugzilla bugs as a starting point, not the 'lost' PediaPress ones (so I think my work complements yours, in a way).

Re: Phabricator: we need to split those bugs away anyway and assign to Phabricator projects, which are a bit more flexible - you can have a bug in multiple projects, so we don't need tracking bugs for that.

I'm not opposed to renaming it "PDF and offline formats generation" but I looked at the list of products at https://bugzilla.wikimedia.org/enter_bug.cgi and it seemed to me that the left hand side was all uniformly short. Does the right-hand-side text convey "PDF generation" adequately?

(Also, there's some messaging going out in Tech News and with the ambassadors this weekend around the new service, including how to report bugs, so I don't think we should rename the product *right now* regardless. But in a week or two we can tweak stuff if we feel it would be helpful.)

(In reply to Marcin Cieślak from comment #14)

Having reviewed the bugs (see the attachment) I would propose to rename the
component to "PDF generation" or maybe "PDF and offline formats generation"
to make sure people searching for "PDF" find it there first.

Usually Collection is presented to the public as the [[m:Book tool]], is that clear enough? Remember we also have the openZIM product.

Since we have some bugs already that are reproducible in both mwlib and the
ocg renderer I would suggest to use tracking bugs to mark bugs as confirmed
against some particular renderer. We have already a tracking bug for ZIM,

Agreed, please do. As an alternative, a tag in whiteboard could be used. Either way, can you please apply on bugzilla the classification you made in your spreadsheet? I already saw some bugs I'd like to reclassify, but it's clunky to edit outside bugzilla.

I also wonder if we should not have a separate bug for rendering of RTL and
some other complex scripts (it's a whole class of problems there on its own).

We should reuse bug 745 and similar generic trackers, see e.g. link in comment 12.

(In reply to C. Scott Ananian from comment #13)

We're about to migrate to Phabricator. I think it's not worth fixing issues
with how bugzilla manages search.

IMHO the opposite: we need a functioning bugzilla in the coming two weeks for https://www.mediawiki.org/wiki/Bug_management/Triage/201410 , so there's no better moment to fix this.

I think we're going with the current components for now; after the Phabricator migration we can look at this again.

I propose to close this bug.

Yes, we can definitely improve on the current setup in Phabricator. Thank you for the discussion anyway; it was useful to me.