
Allow post-parsed documents to be available offline and saved into IndexedDB and made queryable
Closed, Declined (Public)

Description

USAGE SCENARIOS:

  1. Go to a wiki article and click to add that page to one's IndexedDB storage and/or offline cache, keyed to that article title (a rough sketch follows this list).
  2. Go to a wiki category and click to add all pages under that category to one's IndexedDB storage and/or offline cache, keyed to that category.
  3. Use an auto-archive mode so that every visited page is archived, keyed to its article title and category, and/or stored in the offline cache.
  4. Use offline HTML5 to access the entire MediaWiki app, with any such saved documents available while offline. Ideally, edits could also be cached and optionally submitted once connectivity was re-established. Using the API for Ajax storage would fit naturally here, since such an offline tool would already be a modern app and would not need the page refreshes required by PHP form submissions; on some wikis this could even allow real-time editing.
  5. Provide an HTML5 SharedWorker API to allow cross-domain access to the IndexedDB database, so that other applications could request the user's permission to access their locally stored files and perhaps act as query engines or browsers of this data.
  6. Although a third-party API (especially with bug 28700 implemented) could do this once item #5 above was in place, it would be convenient for MediaWiki to have its own default offline query page, letting users write XQueries (e.g., using the newly unfolding JavaScript-based XQIB library, http://xqib.org) or evaluate arbitrary jQuery-based statements against IndexedDB-archived items, including the ability to query an entire collection (category) or all documents.
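As a rough illustration of scenario 1 (a sketch only, not an existing MediaWiki feature), the parsed HTML of the current article could be saved into IndexedDB keyed by its title. The database name "wikiOffline", the "pages" store, and the "byCategory" index are illustrative names; mw.config and the #mw-content-text container are client-side hooks MediaWiki already exposes.

var openReq = indexedDB.open('wikiOffline', 1);
openReq.onupgradeneeded = function (e) {
  var db = e.target.result;
  // One record per article, keyed by title; the index lets whole categories be fetched later.
  var store = db.createObjectStore('pages', { keyPath: 'title' });
  store.createIndex('byCategory', 'category', { unique: false });
};
openReq.onsuccess = function (e) {
  var db = e.target.result;
  db.transaction('pages', 'readwrite').objectStore('pages').put({
    title: mw.config.get('wgTitle'),                          // current article title
    category: (mw.config.get('wgCategories') || [])[0] || '', // first category, if any
    html: document.getElementById('mw-content-text').innerHTML, // the parsed article body
    saved: Date.now()
  });
};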

One might be able to search, for example, for

$('collection("Shakespeare") div[class=ironic]')

...to find all passages in the locally-stored Wikisource Shakespeare category/collection which were marked up as being ironic.

Or one could perform searches (and even joins using XQuery, JSLinq, https://github.com/nkallen/jquery-database etc.) to query HTML tabular data on Wikipedia's reference tables, for example:

$('doc() td:contains("345")'); // doc() being the current document context, or collection() being the current collection/category

...or search through all of one's Wikibooks texts, mash up SVG documents on Commons, or perform complicated transformations (without taxing or waiting for a server) which, for example, join all letters from a certain author and put them in date order, provided the markup specified such dates.
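Of course, collection() is not a real jQuery selector; purely as an illustration, a hypothetical helper along the following lines could emulate it against pages archived as in the earlier sketch (same illustrative database, store, and index names):

function queryCollection(category, selector, callback) {
  indexedDB.open('wikiOffline', 1).onsuccess = function (e) {
    var db = e.target.result;
    var index = db.transaction('pages').objectStore('pages').index('byCategory');
    var matches = [];
    index.openCursor(IDBKeyRange.only(category)).onsuccess = function (ev) {
      var cursor = ev.target.result;
      if (!cursor) return callback(matches);
      // Parse the stored HTML into a detached tree and run the jQuery selector against it.
      matches = matches.concat($(selector, $('<div>').html(cursor.value.html)).get());
      cursor.continue();
    };
  };
}

// The Shakespeare example from above would then become:
queryCollection('Shakespeare', 'div[class=ironic]', function (nodes) {
  console.log(nodes.length + ' ironic passages found offline');
});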

Though I used familiar jQuery examples, I think XQuery, besides being a standard, not only sidesteps the security problems of safely evaluating raw user-supplied jQuery, but also offers the more powerful XPath expressions (though admittedly it is not as conveniently extensible as allowing arbitrary JavaScript).

For places and countries with poor internet connectivity, such offline capability would still allow rich interaction with the wiki, and might even become the germ of decentralized, distributed wikis.


Version: unspecified
Severity: enhancement
Whiteboard: aklapper-moreinfo

Details

Reference
bz28706

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 11:26 PM
bzimport set Reference to bz28706.
bzimport added a subscriber: Unknown Object (MLST).

I might also point out that offline apps do not need to be used offline: they can still access remote resources, but they let both the user and the site benefit from caching, since all static files can be cached indefinitely (until the cache.manifest is changed and an update forced). This would be particularly useful for a site as frequently visited as Wikipedia, and gives even more reason to implement bug 17595, in particular the suggestions at http://www.princexml.com/howcome/2009/wikipedia/infobox/ such as avoiding inline styles, which only increase bandwidth and make skinning less pleasant.
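For concreteness, a minimal cache.manifest along these lines (file names purely illustrative) would let browsers keep the static skin and script files indefinitely, re-downloading them only when the manifest itself changes; each page would opt in via <html manifest="cache.manifest">:

CACHE MANIFEST
# v1 -- bump this comment to force clients to re-fetch the cached files
CACHE:
/skins/vector/main.css
/resources/startup.js
/skins/common/images/poweredby_mediawiki_88x31.png

NETWORK:
# everything else (article content, api.php, uploads) still goes to the network
*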

And, probably needless to say, it would be ideal to have a mechanism that optionally auto-updates the local storage (if not keeps a local history) when a new version becomes available online.
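One possible shape for that, as a sketch only and assuming a revision ID was saved alongside each stored page: when online, ask the standard MediaWiki web API for the latest revision ID and refresh the local copy if it differs.

function refreshIfStale(stored) {
  $.getJSON('/w/api.php', {
    action: 'query', prop: 'revisions', titles: stored.title,
    rvprop: 'ids', format: 'json'
  }, function (data) {
    // The API keys its results by page ID, so take the first (and only) page returned.
    var page = data.query.pages[Object.keys(data.query.pages)[0]];
    if (page.revisions && page.revisions[0].revid !== stored.revid) {
      // A newer revision exists online; re-download and overwrite the stored copy here.
    }
  });
}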

Sorry, I had not seen notification of the comment from Andre. No, this need is not met by those projects.

And thank you for your engagement on this!

My proposal here is to use the emerging web standards for offline applications to let MediaWiki users access as many features as possible offline, including viewing previously saved content.

There are two main differences from the projects you mentioned:

  1. The user would not need to download ALL of Wikipedia's content for offline browsing. They could download just the pages (or perhaps categories) of interest to them, configure the software to store every page once visited, or optionally download the whole site at once. (Keeping things in sync would be very nice, but no doubt impractical for a site with as many changes as Wikipedia, unless perhaps it were confined to updating the page histories of interest rather than all content.)
  2. There would be no need for additional software beyond a (modern) browser. While I understand it is a goal of Wikipedia to support all browsers with any significant user base, I believe that taking advantage of this emerging standard before it is implemented in all browsers is compelling enough: users with supporting browsers could benefit from this capability today, while non-supporting browsers would simply continue to use Wikipedia as before (i.e., without offline capability). If implemented in this standard manner, users of other browsers would also gain these features as their browsers are upgraded.

The specific web technologies required include:

  1. Caching manifests, which would allow Wikimedia's servers to send the HTML, CSS, JavaScript, and image files behind the outward-facing core of Wikipedia to the user's browser, so that those files continue to work offline when reading the pages stored in step #2.
  2. IndexedDB offline database storage. This would store the pages chosen by the user, e.g. by clicking links like "download this category for offline use" or "download this page for offline use" (see the sketch after this list). Ideally, users could also optionally download entire page histories in addition to the most current copy.
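As a sketch of the "download this category for offline use" link in step #2, the existing web API already provides the needed pieces: list=categorymembers to enumerate the pages and action=parse to obtain their HTML. Here db is assumed to be the open IndexedDB database from the earlier sketch, and the store names remain illustrative.

function downloadCategory(db, category) {
  $.getJSON('/w/api.php', {
    action: 'query', list: 'categorymembers',
    cmtitle: 'Category:' + category, cmlimit: 500, format: 'json'
  }, function (data) {
    data.query.categorymembers.forEach(function (member) {
      // Fetch the parsed HTML of each member page and save a local copy of it.
      $.getJSON('/w/api.php', { action: 'parse', page: member.title, format: 'json' },
        function (parsed) {
          db.transaction('pages', 'readwrite').objectStore('pages').put({
            title: member.title,
            category: category,
            html: parsed.parse.text['*'],
            saved: Date.now()
          });
        });
    });
  });
}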

Since a complete implementation would be a vast undertaking, I would propose a phased approach such as the following:

  1. Implement caching manifests, so that supporting browsers could permanently cache the HTML/CSS/JavaScript/image files used by Wikipedia, providing faster performance for users and lighter demand on the servers. Browsers would never need to ask Wikipedia for new copies of these permanently stored files after first obtaining them, beyond pinging Wikimedia's servers (when online) to check whether any updated files needed to be downloaded. Browsers already cache Wikipedia's files in a similar manner, but they do not reserve permanent space for them, so they periodically have to re-download the files, slowing down visits and adding demand on the servers.
  2. Implement IndexedDB storage for viewing content pages specified by the user (explicitly chosen individual pages, pages stored whenever visited, an entire category of pages, the entire site, etc.). This might start with downloading only the most current revision, later adding the option to download the entire page history for offline viewing.
  3. Implement offline search in a manner similar to the current online search capabilities (a rough sketch follows the last item of this list), but with a view toward supporting more sophisticated searches such as described in step #5 below.
  4. Move MediaWiki away from PHP/MySQL to a standard server-side JavaScript solution to allow code sharing between the server and the client, since the more features implemented in JavaScript (in the right way), the easier it becomes to support the same capabilities offline. This would also enhance performance for users who had cached the content.

For example, currently if one wishes to compare two revisions and be shown the "diffs", one has to request this of the server, with network connections being the largest performance bottleneck (especially where internet connectivity is poor). If implemented in JavaScript, this functionality could run offline.
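As a sketch of what such client-side diffing could look like (no particular library assumed), a plain longest-common-subsequence comparison of two stored revisions is only a couple of dozen lines of JavaScript:

function diffLines(oldText, newText) {
  var a = oldText.split('\n'), b = newText.split('\n');
  // Build the longest-common-subsequence length table from the bottom-right corner up.
  var lcs = [];
  for (var i = a.length; i >= 0; i--) {
    lcs[i] = [];
    for (var j = b.length; j >= 0; j--) {
      if (i === a.length || j === b.length) lcs[i][j] = 0;
      else if (a[i] === b[j]) lcs[i][j] = lcs[i + 1][j + 1] + 1;
      else lcs[i][j] = Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table to emit removed ("-"), added ("+") and unchanged ("  ") lines.
  var out = [], x = 0, y = 0;
  while (x < a.length && y < b.length) {
    if (a[x] === b[y]) { out.push('  ' + a[x]); x++; y++; }
    else if (lcs[x + 1][y] >= lcs[x][y + 1]) { out.push('- ' + a[x]); x++; }
    else { out.push('+ ' + b[y]); y++; }
  }
  while (x < a.length) out.push('- ' + a[x++]);
  while (y < b.length) out.push('+ ' + b[y++]);
  return out.join('\n');
}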

Likewise, it would even be possible to save up user edits so that, when the user was online again, the system could ask whether they wished to submit those edits back to the server.

Of course, the longer the user had been offline, and the higher the traffic of the MediaWiki site they were using (e.g., Wikipedia), the more likely they would be to run into conflicts (pointing perhaps to the desirability of a better merge capability).
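A rough sketch of that edit-queuing idea, assuming an additional "edits" object store (created with { autoIncrement: true }) alongside the "pages" store, and MediaWiki's bundled mw.Api client for token handling; conflict detection and clean-up of submitted edits are omitted:

function queueEdit(db, title, newText, summary) {
  db.transaction('edits', 'readwrite').objectStore('edits')
    .put({ title: title, text: newText, summary: summary, queued: Date.now() });
}

window.addEventListener('online', function () {
  indexedDB.open('wikiOffline', 1).onsuccess = function (e) {
    var db = e.target.result;
    db.transaction('edits').objectStore('edits').openCursor().onsuccess = function (ev) {
      var cursor = ev.target.result;
      if (!cursor) return;
      var edit = cursor.value;
      // Submit through the normal edit API; a real version would ask the user first
      // and delete each queued record once the server has accepted it.
      new mw.Api().postWithToken('edit', {
        action: 'edit', title: edit.title, text: edit.text, summary: edit.summary
      });
      cursor.continue();
    };
  };
});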

One other benefit of this language shift might be to nudge MediaWiki developers toward more user-friendly (Ajax-based) designs that do not always force the user to wait for an entire page refresh.

  5. Facilitate a "distributed wiki" or decentralized wiki model. The improvements of #4 could be progressively enhanced to let users make "forks" (i.e., their own independent versions) of the content, so they could store and view their own versions of pages of interest to them (e.g., with their own notes added to wiki content), and perhaps even submit a modified version to a server of their choice if they wished to publish their fork. While the benefits of Wikipedia often come from the community working together, a technology that facilitated such a distributed model would also make it easy to share Wikipedia-based content (including small portions) within a community of expertise. This would be useful not only for Wikipedia content but for the MediaWiki software in general.
  6. With the move toward WYSIWYG editing and client-side wiki-language processing facilitated by #4, the XHTML output could also be shared with users, making it predictably queryable by power users. Something like a jQuery plugin could let such power users find content within a particular category, merge it with content from another page or category, and then display the output. This capability would not need to be enabled on the server, since such arbitrary queries could be demanding on server performance, but that would not be an obstacle for users running the queries on their own machines.
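Finally, as a very basic sketch of the offline search from step #3 (same illustrative store names as before): scan every stored page and return the titles whose text contains the search term. A real implementation would want an index, but even this naive scan runs without any server round-trip.

function offlineSearch(term, callback) {
  var results = [];
  indexedDB.open('wikiOffline', 1).onsuccess = function (e) {
    e.target.result.transaction('pages').objectStore('pages')
      .openCursor().onsuccess = function (ev) {
        var cursor = ev.target.result;
        if (!cursor) return callback(results);
        // Strip the markup and do a plain case-insensitive text match.
        var text = $('<div>').html(cursor.value.html).text();
        if (text.toLowerCase().indexOf(term.toLowerCase()) !== -1) {
          results.push(cursor.value.title);
        }
        cursor.continue();
      };
  };
}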

Correction: "with a view toward supporting more sophisticated searches
such as described in step #5 below" should end with "in step #6 below".

You can create your own selection of articles in the ZIM format using the "Special:Book" page.

Regarding the tech:
We get this type of remark from time to time, advocating a NoSQL DB system to store WP offline. The problem is that, as far as I know, none of them is as efficient as ZIM in terms of compression, access speed, and resource consumption. It would be interesting to get some benchmarks of IndexedDB with a big corpus of Wikipedia content (with pictures) to see whether IndexedDB could be a valid choice at all for storing large amounts of Wikipedia content offline.

Regarding the incremental update:
This is in any case challenging, and pretty independent of the storage backend IMO. It is already on the openZIM roadmap; I want to see it implemented in 2014.

Regarding offline editing:
Even more challenging, and directly linked to the development of the visual editor. IMO it is also pretty independent of the storage backend. I doubt we will see such a feature working well soon.

IMO, this ticket is not a feature request: it is a new three-year strategy program for the MediaWiki team ;) Less ambitious but more realistic: a few people are interested in a pure JavaScript ZIM reader (so that all browsers would easily have ZIM support), and some of them have already started to work on that.

Thank you for your informative response.

Yes, it surely is beyond a simple feature request! Your existing, positive, immediate and realistic initiatives are of course most appreciated by those of us in the community, and I do not at all wish for this proposal to detract from them. My feeling, though, is that spelling out some long-term (even if tentative) vision, with potential broad lines of action such as your mention of exploring IndexedDB, might help build momentum and excite contributors or would-be contributors with the possibilities the team cherishes. While IndexedDB, for example, might not prove optimal, I think it is hard to dispute that such an integrated, offline-capable web app system would be more ideal, and that the obstacles ought to be at least EVENTUALLY surmountable, if not via IndexedDB then via some other technology (and it really does appear that Mozilla, Microsoft, and the other browser vendors are planning to go this route).

In addition to the IndexedDB benchmarking goal, I would also think that the first proposed step (caching manifests) would be a very practical one to explore now, as these allow "offline" caching of non-dynamic application files only, not of content. One would not even need to declare all possible cacheable files in the caching manifest immediately, since the immediate-term goal would be performance optimization rather than genuine offline functionality.

Thank you for all your hard work on what is already such a preeminently ambitious and successful system.

(In reply to comment #7)

IMO, this ticket is not a feature request: it is a new three-year strategy
program for the MediaWiki team ;)

(In reply to comment #8)

Yes, it surely is beyond a simple feature request!

This request should probably be split into manageable subtasks, as it seems to contain many different issues that would need to be solved first, as can be seen in comment 5.
One report per issue (also see https://www.mediawiki.org/wiki/How_to_report_a_bug ) is highly recommended, as bug reports without a clear scope could otherwise never be considered "fixed".

I'm undecided whether to close this request as WORKSFORME (parts already available), WONTFIX (unlikely to happen via implementing the exact solution proposed here), or INVALID (way more than one issue in a report), but I think I will go for the first.

OK, fair enough, thank you. Sometimes, though, I think a bigger request can serve as a tracking bug for related changes (and it also spares me the effort of first working out which lines of action you might be willing to start as independent bugs, which are related enough to be treated as one bug, etc.).

I've added the caching manifest request as bug 45980 and the IndexedDB investigation/implementation as bug 45981.