
Deploy extension Memento on Wikipedia sites
Closed, Declined · Public

Description

Author: azaroth42

Description:
MediaWiki currently does not support access to prior states of its articles using the Memento time-based content negotiation paradigm [1,2,3]. Memento allows browsers to request resources at a particular point in time, and TimeGates then redirect the browser to the version that was accessible at the requested time. As MediaWiki maintains all versions, it is the most authentic and knowledgeable source of this information, compared to (for example) the Internet Archive's very sparse collection of articles. MediaWiki is better placed to provide this information efficiently and accurately than a third-party system.

Users wish to see versions of resources both before and after certain events, for example one might wish to see the page about Michael Jackson both before and after his death, or follow the evolution of the description of the TSA's approach to air travel security since 2001.

Editors can also benefit from Memento access to see where the hot spots of activity are, and the differences before and after edit wars.

By exposing Memento TimeGates, MediaWiki allows for time-series analysis of its resources, either by extracting information from the article (text mining, data extraction, etc.) or from the upcoming data platform. As the information may change many times, allowing fine-grained access is extremely valuable compared to the DBpedia implementation [4].

A prototype extension which implements Memento for MediaWiki is available: http://www.mediawiki.org/wiki/Extension:Memento
A working browser plugin for Firefox is also available: http://bit.ly/memfox

Resources:

[1] https://datatracker.ietf.org/doc/draft-vandesompel-memento/
[2] http://www.mementoweb.org/
[3] http://arxiv.org/abs/0911.1112
[4] http://arxiv.org/abs/1003.3661
[5] http://www.mediawiki.org/wiki/Extension:Memento
[6] http://bit.ly/memfox


Version: unspecified
Severity: enhancement

Details

Reference
bz34778

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 12:09 AM
bzimport set Reference to bz34778.
bzimport added a subscriber: Unknown Object (MLST).

azaroth42 wrote:

Clarifications:

  • Memento is valuable for regular users and content editors, as well as bots, cyborgs, and other software agents. Giving software agents HTTP access to versions lets them easily perform the kinds of analysis that human users and editors do. Software agents are the primary audience of the third use case given, time-series analysis of the information in the articles.
  • The extension referenced has been tested in several installs around the world, but not in a Wikipedia-scale installation of MediaWiki. As such we describe it as a prototype, but it must be noted that it is stable and tested.

azaroth42 wrote:

Additional Use Cases:

  • Users find it time-consuming to navigate the many history pages to find the article at a particular point in time. Memento makes this a single click, saving the user time and frustration.

  • Users wish to navigate between old versions of articles at a particular point in time. As the browser sends the requested timestamp, the user can click on internal links within the wiki and be taken straight to the version of the article at that time. Without Memento, the user will be taken to the current version and must then page through the history to find the appropriate version. This also integrates with external web archives to display resources linked to from articles that are outside of MediaWiki's control.

As this is a request to deploy the Memento extension in production on Wikipedia, I'm filing it under the corresponding product category. Not sure if History/Diff is the most relevant component.

Extensions that are deployed need to be reviewed. I've added it to [[mw:Review_queue]].

The code also needs to go into our SVN repository.

This extension currently involves patches to the MediaWiki core, which is generally a big no-no.

It also doesn't follow some of Wikimedia's coding conventions (a sketch of idiomatic alternatives follows this list). For example:

  • Using $_SERVER directly, instead of looking at the $wgServer variable, etc. Another example:
    if ( !stripos( $_SERVER['REQUEST_URI'], '?' ) && !stripos( $_SERVER['REQUEST_URI'], 'Special:TimeGate' ) ) {
  • Using $dbr->query instead of $dbr->select
  • Hard-coded strings that should be i18n-ized
  • Things like $page_namespace_id = constant( "NS_" . strtoupper($namespace) ); which make assumptions that may not be true in many setups
  • Using echo in places where $wgOut->disable() has not been called.
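
For illustration, a rough sketch of the idiomatic counterparts to the patterns flagged above (the message key and variable names are placeholders, not taken from the extension):

global $wgRequest;
// Instead of reading $_SERVER['REQUEST_URI'] and scanning it with stripos():
$requestUri = $wgRequest->getRequestURL();
$oldid = $wgRequest->getVal( 'oldid' );              // a query-string parameter, if present

// Instead of $dbr->query() with hand-built SQL:
$dbr = wfGetDB( DB_SLAVE );
$row = $dbr->selectRow(
	'revision',
	array( 'rev_id', 'rev_timestamp' ),
	array( 'rev_page' => $pageId ),              // $pageId assumed known here
	__METHOD__
);

// Instead of hard-coded, untranslatable strings:
$label = wfMessage( 'memento-timegate-label' )->text();   // hypothetical message key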

hariharshankar wrote:

Thanks for your comments. I will fix the plugin to incorporate these suggestions.

Clarification on patching MediaWiki core:
The patch to the core of MediaWiki is needed only when older versions of templates are to be fetched by the plugin. This patch is not mandatory for this plugin to work.

(In reply to comment #6)

This extension currently involves patches to core mediawiki, which is generally
a big no no.

It also doesn't follow some of Wikimedia's coding conventions. For example:

*Using $_SERVER directly, instead of looking at $wgServer variable, etc.
Another example:
if ( !stripos( $_SERVER['REQUEST_URI'], '?' ) && !stripos(
$_SERVER['REQUEST_URI'], 'Special:TimeGate' ) ) {

*Using $dbr->query instead of $dbr->select

*Hard coded strings that should be i18n-ized

*Things like: $page_namespace_id = constant( "NS_" . strtoupper($namespace) );
which make assumptions that may not be true in many setups

*Using echo in places where $wgOut->disable() has not been called.

The patch to the core of mediawiki is needed only when older versions of
Templates are to be fetched by the plugin. This patch is not mandatory for this
plugin to work.

My mistake. To be honest I just skimmed the extension page rather quickly. (The perennial "extensions never get looked at" thread came back up on the mailing list, and I was curious about which extensions were in the review queue.)

hariharshankar wrote:

We have worked on the extension code so that it meets the MediaWiki coding conventions. The newer version of the code can be downloaded from http://www.mediawiki.org/wiki/Extension:Memento . We have also updated the extension documentation page to reflect the new changes.

Vulnerable to register_globals

$mmScriptPath defined but not used.
Useless statement $historyuri;
No need of mmSetupExtension() for setting a hook.
Usage of $wgTitle will fail on recent MediaWiki
stripos() is not the way to check if a variable was set in the query string
explode() is not how you retrieve a variable from the query string
You're changing the default timezone, overriding whatever the user might have configured.
HTML injection building links
Hardcoded names of Special pages
You're fetching the whole list of revisions for each page, that can be a very expensive operation, retrieving several thousands of rows. Try requesting just what you need.

This is not suitable for deployment at this point. I recommend you reach out to some developers about how to code this properly.

azaroth42 wrote:

Dear Platonides,

Thank you for the comments on the latest revision.

Could you please provide pointers to the best practices regarding retrieving the title, and how best to parse URLs without stripos() and explode()?

Regarding the timezone, we're not changing it other than to follow the RFC specification that all timestamps in HTTP headers MUST be in GMT, "without exception". This does not change the UX in the page. Please see:
http://tools.ietf.org/html/rfc2616#page-20

Could you please confirm what you mean by "HTML injection building links"? We do not change the HTML of the returned history page.

If there is a better way to discover the names of the Special pages (timegate and timemap) generated within the extension, please let us know and we'll update the extension.

We do request only the parts of the history list that are required for the different operations. The timegate needs the closest match, first, last, previous and next. The timemap is a serialization of the set of versions of the resource, and thus requires the entire history list.

Many thanks!

(In reply to comment #11)

Could you please provide pointers to the best practices regarding retrieving
the title, and how best to parse URLs without stripos() and explode()?

RequestContext.

Regarding the timezone, we're not changing it other than to follow the RFC
specification that all timestamps in HTTP headers MUST be in GMT, "without
exception". This does not change the UX in the page. Please see:
http://tools.ietf.org/html/rfc2616#page-20

Then prepare your headers in a way that doesn't affect the UI timezone.

If there is a better way to discover the names of the Special pages (timegate
and timemap) generated within the extension, please let us know and we'll
update the extension.

SpecialPage::getTitle(), getTitleFor()
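
For example, a minimal sketch, assuming the special pages are registered as 'TimeGate' and 'TimeMap':

$timeGateTitle = SpecialPage::getTitleFor( 'TimeGate', $objTitle->getPrefixedURL() );
$timeMapTitle  = SpecialPage::getTitleFor( 'TimeMap', $objTitle->getPrefixedURL() );
$timeGateUrl   = $timeGateTitle->getFullURL();   // no hard-coded "Special:TimeGate" strings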

We do request only the parts of the history list that are required for the
different operations. The timegate needs the closest match, first, last,
previous and next. The timemap is a serialization of the set of versions of
the resource, and thus requires the entire history list.

Then it's not going to be deployed on WMF, because requesting thousands of revisions is not an option with our scale.

By the way, is there evidence of interest in this feature from Wikimedia community?

Another issue:

$xares = $dbr->select( "revision", array('rev_id', 'rev_timestamp'), array("rev_page=$pg_id"), __METHOD__, array("ORDER BY"=>"rev_id DESC") );

rev_id is not guaranteed to always behave like you want it to, sort by rev_timestamp.
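
A sketch of what that might look like, bounded as well since the result-set size comes up below (the limit value is illustrative only):

$xares = $dbr->select(
	'revision',
	array( 'rev_id', 'rev_timestamp' ),
	array( 'rev_page' => $pg_id ),
	__METHOD__,
	array( 'ORDER BY' => 'rev_timestamp DESC', 'LIMIT' => 500 )   // sort by timestamp, cap the rows
);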

(Expanding on comment #12)

Then it's not going to be deployed on WMF, because requesting thousands of
revisions is not an option with our scale.

Even caching isn't going to help, because we have pages with *hundreds of thousands* of revisions. Memcached's object size limit is 1 MB; revision information for such pages won't even fit into it. And Special:TimeMap will get tired of serving megabytes and megabytes of data in such cases.

(In reply to comment #11)

Dear Platonides,

Thank you for the comments on the latest revision.

Could you please provide pointers to the best practices regarding retrieving
the title, and how best to parse URLs without stripos() and explode()?

The request object has a getVal() method.

Regarding the timezone, we're not changing it other than to follow the RFC
specification that all timestamps in HTTP headers MUST be in GMT, "without
exception". This does not change the UX in the page. Please see:
http://tools.ietf.org/html/rfc2616#page-20

Of course, so you either change the default and set it back to what it was before or, better, use a function that doesn't need the default timezone to be switched.
I think your mmConvertTimestamp() function could be replaced with a call to wfTimestamp() with TS_RFC2822 output.
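
Roughly, and assuming $rev is a Revision object (the header name follows the Memento drafts):

$httpDate = wfTimestamp( TS_RFC2822, $rev->getTimestamp() );      // no default-timezone switching
$wgRequest->response()->header( "Memento-Datetime: $httpDate" );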

Could you please confirm what you mean by "HTML injection building links"? We
do not change the HTML of the returned history page.

You're handcrafting many URLs, such as
$first['uri'] = $alturi . "?title=" . $title . "&oldid=" . $oldestRevID;

This is horrible practice. It would lead to HTML injection if output in HTML; in HTTP headers, the server might be tricked into redirecting to an attacker's website (maybe not possible with the broken way you read them, but still...).
Look at wfExpandUrl() and wfAppendQuery().
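
A sketch of the safer construction, reusing the variable names from the snippet above:

// Let the framework do the escaping instead of concatenating strings:
$first['uri'] = wfAppendQuery(
	wfExpandUrl( $wgScript ),
	array( 'title' => $title, 'oldid' => $oldestRevID )
);
// Or, if a Title object is at hand:
$first['uri'] = $objTitle->getFullURL( array( 'oldid' => $oldestRevID ) );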

If there is a better way to discover the names of the Special pages (timegate
and timemap) generated within the extension, please let us know and we'll
update the extension.

(Answered by MaxSem)

We do request only the parts of the history list that are required for the
different operations. The timegate needs the closest match, first, last,
previous and next. The timemap is a serialization of the set of versions of
the resource, and thus requires the entire history list.

This unbounded query is retrieving all the revisions for the page.
$xares = $dbr->select( 'revision', array('rev_id', 'rev_timestamp'), array("rev_page=$pg_id"), __METHOD__, array('DISTINCT', 'ORDER BY'=>'rev_id DESC') );

Suppose we were visiting https://en.wikipedia.org/wiki/Main_Page which has 4104 revisions. Can you justify why you need all of them instead of just 3 or 4?
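
For the TimeGate case, each of the handful of rows that are actually needed can be fetched with its own bounded query, for example (a sketch; $dbr, $pg_id and the requested datetime as above):

// Closest memento at or before the requested datetime: one row, not the whole history.
$reqTs = $dbr->timestamp( $requestDatetime );
$memento = $dbr->selectRow(
	'revision',
	array( 'rev_id', 'rev_timestamp' ),
	array( 'rev_page' => $pg_id, 'rev_timestamp <= ' . $dbr->addQuotes( $reqTs ) ),
	__METHOD__,
	array( 'ORDER BY' => 'rev_timestamp DESC' )
);
// first/last/next/prev can be fetched the same way by flipping the comparison and the ORDER BY.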

Also, it'd be helpful if you provided a public repository of the extension. You can request it to be hosted with the other MediaWiki extensions in our repository. That would help later for deployment.

(In reply to comment #14)

Suppose we were visiting https://en.wikipedia.org/wiki/Main_Page which has 4104
revisions. Can you justify why you need all of them instead of just 3 or 4?

[[WP:ANI]] is 620k revs.

sumanah wrote:

Rob, in addition to these comments from experienced MediaWiki developers, you'll find this guide helpful:

https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment

hariharshankar wrote:

We have updated the plugin code to incorporate all but one of the changes that were suggested. We need your opinion on how to fix the timezone issue that the plugin introduces. We have found three ways to fix this issue and would appreciate input on which is the best approach.

  1. Use the DateTimeZone class http://www.php.net/manual/en/class.datetimezone.php so that every time function the plugin uses works in the GMT timezone. The drawback with this approach is that the class is available only in PHP versions above 5.2.0. Even though this would work on MediaWiki servers, a lot of other wikis use older PHP versions.
  2. Save the default timezone set for a wiki using the getTimezone() function, use setTimezone('GMT') when the plugin is invoked, and then set it back to the default timezone when the plugin is finished. This approach works with all versions of PHP.
  3. Have the plugin check whether the DateTimeZone class is available and use it; otherwise fall back to method 2 above.

Please advise us on what the best approach would be.

Note: The new plugin code is not available for download yet.

(In reply to comment #17)

The drawback with this approach is that
this class is available in PHP version > 5.2.0. Even though this would work in
Mediawiki servers, a lot of other wikis use older PHP versions.

All supported MediaWiki versions require 5.2.3, while the next release will require 5.3, so this is not a problem.
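
For what it's worth, the GMT header values can also be produced without touching the process-wide default timezone at all, e.g. (a sketch in plain PHP, with $unixTs being the revision's Unix timestamp):

// Option 1 from above, without date_default_timezone_set():
$dt = new DateTime( '@' . $unixTs );              // '@' means a Unix epoch value, always UTC
$dt->setTimezone( new DateTimeZone( 'GMT' ) );
$httpDate = $dt->format( 'D, d M Y H:i:s' ) . ' GMT';

// Or, on any supported PHP version:
$httpDate = gmdate( 'D, d M Y H:i:s', $unixTs ) . ' GMT';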

hariharshankar wrote:

We have updated the plugin to incorporate the changes suggested above.

The major fix is in the timemaps, where we were previously trying to fetch all the revisions of an article. Now, we have introduced paged timemaps, where an optional configuration parameter can be set to limit the number of revisions that are fetched. If this variable is not set, the number of revisions defaults to 500.

Adding the link to the bug report so I don't need to go looking for it each time: http://mementoweb.org/tools/wiki/memento.zip

It would be easier to check if the changes were available in some repository.

  • No need for mmSetupExtension(); set $wgHooks in the global scope (a combined sketch follows this review).

$title = preg_replace('/ /', "_", $objTitle->getPrefixedText());

Used in several places. preg_replace() here is overkill, as str_replace() would do it. But in this case you'd just want $objTitle->getPrefixedURL()

  • Coding conventions: tabs instead of spaces, spaces inside brackets...
  • No need to make the SELECT rev_id, rev_timestamp FROM revision a DISTINCT one, as that combination of fields is always unique (the DB server is probably realising this and ignoring the DISTINCT, but there's no need to add it).
  • Generation of the Link header is wrong, e.g. in the way it is generating the article URL. You're not doing any raw-URL escaping, so some specially crafted article names could confuse Memento clients.
  • Minor: give the author names in $wgExtensionCredits as an array.
  • Unneeded space indenting in timegate/timegate.alias.php, timegate/timegate.i18n.php, timemap/timemap.alias.php, timemap/timemap.i18n.php and at the top of timemap/timemap.php
if( stripos($par, $wgServer.$waddress) == 0 ) {
    $title = preg_replace( "|".$wgServer.$waddress."|", "", $par );

Wrong check and wrong regex.

$dbr->begin();

There is never a corresponding commit() or rollback().
You're calling exit at the bottom; the PHP driver might be closing the transaction or even the connection, but don't rely on it.
(several places)

  • wfLoadExtensionMessages() is deprecated since 1.15, expected to be removed in 1.20 (just remove that line).
  • Variable $wgMementoReqDateTime defined as global in execute(), set as local variable in tgParseRequestDateTime() and never used.
  • tgParseRequestDateTime() should use wfTimestamp() instead of strtotime() -> date()
  • When you have a title object, you don't need a manual query to revision table to get the latest revision, just use getLatestRevID().
  • You have repeated code for selecting the first/prev/next of a revision. I think it could be abstracted in a single function.
  • TimeGate methods use a tg prefix that isn't really needed (you're scoped by the class name).
  • At TimeGate::tgGetMementoForResource if there's no revision for the given memento, it will merrily use the undefined variable $memRevUnixTS generating wrong SQL. Maybe you wanted to abort with an error message if there's no suitable memento?

(even with the $oldestRevUnixTS / $recentRevUnixTS, there could be a race condition)

It's in better shape than the previous version :)
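
Pulling a few of the points above together, a rough sketch (the hook name and variable names are illustrative, not prescriptive):

// Register the hook in the global scope instead of mmSetupExtension():
$wgHooks['ArticleViewHeader'][] = 'MementoHooks::onArticleViewHeader';

// Underscored, URL-safe page name without preg_replace():
$title = $objTitle->getPrefixedURL();

// Prefix check and stripping without stripos()/preg_replace():
$prefix = $wgServer . $waddress;
if ( strpos( $par, $prefix ) === 0 ) {
	$title = substr( $par, strlen( $prefix ) );
}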

Also note that WikiPage has a getOldestRevision() method...

hariharshankar wrote:

  • Most of the suggestions above have been implemented.
  • getOldestRevision() and getLatestRevID() were not used for now because the plugin also needs the latest and oldest timestamps along with the revision IDs. Is there a function that gives us these timestamps as well?
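
For what it's worth, one way this might be done with the core classes mentioned above (a sketch, not tested against the extension):

$page = WikiPage::factory( $objTitle );

$oldestRev = $page->getOldestRevision();          // a Revision object (or null)
if ( $oldestRev ) {
	$oldestId = $oldestRev->getId();
	$oldestTs = $oldestRev->getTimestamp();
}

$latestRev = Revision::newFromId( $page->getLatest() );
if ( $latestRev ) {
	$latestId = $latestRev->getId();
	$latestTs = $latestRev->getTimestamp();
}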

sumanah wrote:

Harihar, thanks for your update and for incorporating these revisions.

Per https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment , I'm cc'ing Howie Fung and Brandon Harris to help get a design review for this proposed functionality.

Also, to get your extension deployed, you will need to move your extension from GitHub to our Git system, which is hosted at gerrit.wikimedia.org . Instructions: https://www.mediawiki.org/wiki/Git/New_repositories .

(In reply to comment #22)

  • getOldestRevision() and getLatestRevID() was not used for now because the

plugin also needs latest and oldest timestamp with the revid. Is there a
function that gives us these timestamps as well?

Were you able to find this information on your own?

Is there any kind of consensus anywhere to get this deployed? I've never heard of this until just now.

(In reply to comment #24)

Is there any kind of consensus anywhere to get this deployed? I've never heard
of this until just now.

No there is not.

Quite frankly, I think consensus that people want this extension should be established before resources are spent on improving it.

Agreed, a thread on https://en.wikipedia.org/wiki/Village_pump_(technical) would be a good place to start gauging consensus.

Max, it's a redirect.

Here's a summary from a discussion from wikimedia-dev and further notes on the next steps for the Memento maintainers:

  • start gauging community consensus (explaining the benefits of Memento support for editors and readers), see c24-27 above (BTW I imagine bots and third-party apps would also be among potential target users, correct?)
  • get access to gerrit.wikimedia.org to prepare for code review, see c23 above
  • if possible, get us some estimates on the target user base and the current state of browser support

(In reply to comment #28)

  • if possible get us some estimates on the target user base and the current

state of browser support

Umm, non-existent? (Some plugins, nothing native.) We are talking about non-standard HTTP headers. (From what I can tell, a draft RFC, and even if they do manage to get an RFC published, it honestly seems unlikely to be adopted by the browser community.) Really, we're talking about something on the level of the "browser edit button", minus the links-to-wiki ideology.

That said it would be kind of a cool thing, provided the effort was minimal.

Most interest from the community I imagine would be more about doing bug 851, which has a fair bit of interest.

(In reply to comment #23)

Also, to get your extension deployed, you will need to move your extension from
GitHub to our Git system, which is hosted at gerrit.wikimedia.org .
Instructions: https://www.mediawiki.org/wiki/Git/New_repositories .

Repo was just created, but not with direct pushing permissions (since this is looking for WMF deployment, direct pushing is not allowed).

hariharshankar wrote:

The memento extension code has been moved to the git repository and is waiting for review.
https://gerrit.wikimedia.org/r/29812

I recommend rejecting this.

Asking users to install a Firefox extension to make navigation easier is not how I imagine a secure and user-friendly web would work.

Perhaps if this were supported by unmodified browsers, it would be more attractive for us. The browsers have a long history of introducing features in advance of their use on the web, so I don't think it's a "chicken-and-egg" problem.

TimeGate responses, as specified by the Internet-Draft, appear to be effectively uncacheable with currently used HTTP proxy software. We have no way to remove resources from a cache with a finer granularity than a URI. So when the page is changed, we would have the choice of either:

  • Purging the TimeGate URI when the page is changed, in which case all versions of that resource would be simultaneously purged, reducing the hit rate for rarely-accessed old revisions, or
  • Not purging it, in which case responses for recent Accept-Datetime values would become stale. Also, there would be no way to purge revisions which are removed from the database by RevisionDelete.

Additionally, the definition of the Vary header in the Internet-Draft appears to conflict with the definition in HTTP (RFC 2616), as implemented by MediaWiki, PHP, Squid, etc. It's unclear what the "negotiate" value is for or how it will interact with the Vary header values that MediaWiki must send to HTTP proxy servers.

The Internet-Draft seems to unnecessarily overspecify server and client behaviour. For example, depending on the server software, it may be difficult to implement the requirement that TimeGates respond to request methods other than GET and POST with an HTTP 405 code.

(In reply to comment #29)

That said it would be kind of a cool thing, provided the effort was minimal.

I don't think the effort would be minimal. The code quality is poor, and would suffer from a high rate of bit rot due to poor integration with the MediaWiki core. For example, mmAcceptDateTime() assumes $_GET['oldid'] will have a certain interpretation by the MediaWiki core, and sends header values corresponding to this interpretation, regardless of what MediaWiki decides to actually do with that parameter. The assumption is already incorrect and will become more incorrect over time.

azaroth42 wrote:

Hi Tim,

The Memento team has carefully analyzed your feedback. We hope our
response below can convince you to change your opinion regarding
Memento support in Wikipedia, and we would very much appreciate further
communication regarding the matter.

Many thanks!

Rob

> Problem 1:

Asking users to install a Firefox extension to make navigation easier
is not how I imagine a secure and user-friendly web would work.
Perhaps if this were supported by unmodified browsers, it would be
more attractive for us. The browsers have a long history of
introducing features in advance of their use on the web, so I don't
think it's a "chicken-and-egg" problem.

Response:

It's difficult to argue with this point. We would obviously much
prefer native adoption by browsers over a plug-in solution. But,
without a plug-in, there would be no way to demonstrate the cross-site
time travel capability introduced by Memento. Also, it is hard to see
what incentives browser manufacturers have to natively implement
Memento's datetime negotiation as long as there is no critical mass of
servers supporting it. Failed attempts to get the attention of Mozilla
and Opera support this consideration, but if you have experience otherwise,
then any assistance you might give would be greatly appreciated. At this
point, Memento enjoys growing adoption by web archives (Internet Archive, British Library Web Archive, UK National Archives) and it has the unanimous support of the International Internet Preservation Consortium. Adoption by
Wikipedia could help build the essential critical mass that, we think,
could give us the momentum to credibly approach browser manufacturers.
Given Wikipedia's track record as an early adopter of innovative
technologies (as emphasized by editors in the RFC discussion re
Memento support), we were hopeful to have your support in working
towards establishing that critical mass.

> Problem 2:

TimeGate responses, as specified by the Internet-Draft, appear to be
effectively uncacheable with currently used HTTP proxy software. We
have no way to remove resources from a cache with a finer granularity
than a URI. So when
the page is changed, we would have the choice of either:

  • Purging the TimeGate URI when the page is changed, in which case all

versions of that resource would be simultaneously purged, reducing the
hit rate for
rarely-accessed old revisions, or

  • Not purging it, in which case responses for recent Accept-Datetime

values would become stale. Also, there would be no way to purge
revisions which are removed from the database by RevisionDelete.

Response:

We very much share the concern of cacheability, as exemplified by the
Memento protocol responses for Original Resources and Mementos.
However, when it comes to TimeGates, the situation regarding caching
deserves some further consideration:

  • RFC 2616 states, as quoted below, that 302 responses are by default not cached:

"A response received with any other status code (e.g. status codes 302
and 307) MUST NOT be returned in a reply to a subsequent request
unless there are cache-control directives or another header(s) that
explicitly allow it."

  • Caching 302 responses from a TimeGate will yield marginal benefit, if any:
  • Datetime negotiation values exist on a continuum, unlike e.g. media type negotiation, for which values reside in a discrete set. In the latter case, the chances that a cache has an entry for a specific value out of the (small) discrete set are significant. In the TimeGate case, the chances are dramatically lower, if not negligible, given the size of the value space. For example, when taking only day granularity into account, the value space for Wikipedia has a cardinality of over 3650 (365 days * 10+ years). Adding hours, minutes, and seconds to the value space brings this cardinality to over 365*10*24*60*60, i.e. more than 315 million. The chances of a cache hit become very small.
  • The overhead on the server resulting from not caching TimeGate responses remains limited, as responses only contain headers without a representation in the body. Please see for example http://www.mementoweb.org/guide/rfc/ID/#a200-step4-http

> Problem 3:

Additionally, the definition of the Vary header in the Internet-Draft
appears to conflict with the definition in HTTP (RFC 2616), as
implemented by MediaWiki, PHP, Squid, etc. It's unclear what the
"negotiate" value is for or how it will interact with the Vary header
values that MediaWiki must send to HTTP proxy servers.

Response:

We see no conflict with the Vary definition of RFC 2616 as it states
the following about the field names used in Vary:
"The field-names given are not limited to the set of standard
request-header fields defined by this specification."
Furthermore, the "negotiate" value for Vary has become widely used
since its introduction in RFC 2295, which details Transparent Content
Negotiation. The "negotiate" value is used by default by Apache servers
for negotiated responses.

However, we agree that the "negotiate" value serves no real purpose
without the corresponding Negotiate request header and can be regarded as a
remnant of the early days of Memento during which RFC 2295 was a
significant inspiration. We are most willing to remove this value from
the Vary header in the Memento protocol and hence also from the
MediaWiki plugin.

> Problem 4:

The Internet-Draft seems to unnecessarily overspecify server and
client behaviour. For example, depending on the server software, it
may be difficult to implement the requirement that TimeGates respond
to request methods other than GET and POST with an HTTP 405 code.

Response:

The concern regarding HTTP 405 is fair and we would be most willing to
remove this requirement from the specification. Other feedback
regarding instances of overspecification would be very welcome, as we
could take it into account when wrapping up the Internet Draft. From
our perspective, we have tried to clearly detail a variety of existing
and anticipated situations in a consistent manner, trying to write a
specification that really helps implementers. But, in our enthusiasm,
we may have gone overboard, indeed.

> Problem 5:

(In reply to comment #29)

That said it would be kind of a cool thing, provided the effort was minimal.

I don't think the effort would be minimal. The code quality is poor,
and would suffer from a high rate of bit rot due to poor integration
with the MediaWiki core. For example, mmAcceptDateTime() assumes
$_GET['oldid'] will have a certain interpretation by the MediaWiki
core, and sends header values corresponding to this interpretation,
regardless of what MediaWiki decides to actually do with that
parameter. The assumption is already incorrect and will become more
incorrect over time.

Response:

This comment regarding poor software quality comes as a big surprise
as we have invested very significant resources to improve the initial
code base, through many iterations, in response to feedback from
MediaWiki people. This is the first time we have heard about the false
assumption re mmAcceptDateTime(). Our developer Harihar Shankar states
the following in this regard:
"I am determining if the current resource is a version of an article
by looking at the URL and checking if there is "oldid" in it. This is
definitely not the best way to do it, but I looked extensively in
their documentation and I could not find a better alternative. This
issue has not been brought up by the code reviewers so far."
We would be very interested in learning what the appropriate approach
is. And we are interested in hearing about other problems with the
code. In both cases, we will be most happy to make required changes to
bring the code to the desired quality level.

azaroth42 wrote:

Dear all,

We've tried to take the feedback from this bug into account and have released a new version of the Internet Draft that makes things easier to implement, with more implementation patterns for content management systems like wikis. It is also much shorter, defining only the necessary aspects rather than everything that might be nice to have.

The new draft is: http://tools.ietf.org/html/draft-vandesompel-memento-06

I hope this further reduces the concerns for the extension.
Thanks in advance for any further comments.

Hello Rob,

I'm Greg Grossmeier, Release Manager for the Wikimedia Foundation (basically, the manager for deployments of MediaWiki and extensions to the servers that host all WMF projects).

I just wanted to take a moment and say thank you for your effort on this extension thus far. You and your team have put a lot of good faith effort into it and I/we appreciate that.

Unfortunately, at this time, we're in the same boat as Mozilla and Opera: we need to see a tangible use case supported by a large (absolute, not necessarily percentage) number of users. I, at least, generally agree with what you are attempting to do with Memento (I have a Library Science degree and worked on metadata with the W3C and Schema.org while at Creative Commons), but the time needed to do this right at the WMF is too high for us right now given the expected payoff; we're as time- and budget-constrained as any other non-profit, and there are currently higher-priority items in our queue that directly benefit the Wikimedia community.

No reason this couldn't change in the future, but it would need to be something along the lines of at least one major browser supporting Memento.

Thanks for your understanding,

Greg

(In reply to comment #34)

Also, it is hard to see
what incentives browser manufacturers have to natively implement
Memento's datetime negotiation as long as there is no critical mass of
servers supporting it. Failed attempts to get the attention of Mozilla
and Opera support this consideration, but if you have experience otherwise,
then any assistance you might give would be greatly appreciated.

As I said, the browser manufacturers have a long history of implementing features in advance of their use on the web. For example, the lead taken by Firefox and Opera in the introduction of various HTML 5 features.

If you want to get Mozilla's attention, you could start by filing a bug: https://bugzilla.mozilla.org/enter_bug.cgi

  • Caching 302 responses from a TimeGate will yield marginal benefit, if any:

Indeed. The high cardinality of TimeGate requests is a problem for efficient implementation. It is possible to imagine a protocol for retrieval of historical revisions which would not have this problem.

This comment regarding poor software quality comes as a big surprise
as we have invested very significant resources to improve the initial
code base, through many iterations, in response to feedback from
MediaWiki people.

The comments above show that the code quality started out being terrible. It has improved greatly. Now, it is only poor. It still has some way to go before it is acceptable for WMF deployment (even if it was something we wanted).

This is the first time we hear about the false
assumption re mmAcceptDateTime(). Our developer Harihar Shankar states
the following with this regard:
"I am determining if the current resource is a version of an article
by looking at the URL and check if there is "oldid" in it. This is
definitely not the best way to do it, but I looked extensively in
their documentation and I could not find a better alternative. This
issue has not been brought up by the code reviewers so far."
We would be very interested in learning what the appropriate approach
is. And we are interested in hearing about other problems with the
code. In both cases, we will be most happy to make required changes to
bring the code to the desired quality level.

If the necessary interfaces really are missing, then the developer's response should be to introduce them. But I think using an ArticleViewHeader hook and calling getOldID() on the Article object passed to the hook would be a reasonable way to do it. Then the hook will only be triggered on actual views of ordinary wiki pages, and the oldid will be the same one used by Article.php, which would be an improvement.

$wgRequest should not be used at all, nor "new RequestContext". You can get what information you need from the Article methods. Instead of $wgOut, you can get an OutputPage object from $article->getContext()->getOutput(), and instead of $wgRequest, you can use $article->getContext()->getRequest().

Nothing should ever call exit(), including Special:TimeGate and Special:TimeMap. You can use OutputPage::disable() to customise the output.
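
Sketching what that could look like (the class and method names here are placeholders, not from the extension):

$wgHooks['ArticleViewHeader'][] = 'MementoHooks::onArticleViewHeader';

class MementoHooks {
	public static function onArticleViewHeader( &$article, &$outputDone, &$pcache ) {
		$oldID = $article->getOldID();             // 0 when viewing the current revision
		$context = $article->getContext();
		$out = $context->getOutput();              // instead of $wgOut
		$request = $context->getRequest();         // instead of $wgRequest
		// ... compute and emit the Memento headers from $oldID here ...
		return true;                               // let normal page rendering continue
	}
}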

This seems cool. Any further updates on implementation by moz or other browsers? Who are the major implementers of memento today?

(In reply to comment #38)

Any further updates on implementation by moz or other browsers?

I guess it's best if you asked Moz for that. :)

hvdsomp wrote:

(In reply to comment #38)

This seems cool. Any further updates on implementation by moz or other
browsers? Who are the major implementers of memento today?

Thanks for asking. This allows me to provide a general update regarding Memento activity:

  • We received funding from the Andrew W. Mellon Foundation to develop a more solid Memento MediaWiki add-on, taking into account the feedback received during the discussion of this bug report. This work is currently ongoing. As soon as a version is available we will share it, here and on the MediaWiki Developers list, to solicit further feedback. We remain hopeful that Wikipedia and MediaWiki installations will consider implementing it.
  • A recent release of the Wayback software is already operational at the Internet Archive. Some web archives (e.g. British Library, UK National Archives) run Wayback versions that are compliant with previous versions of the Memento protocol. It is expected that these web archives, as well as other web archives that run a pre-Memento Wayback version, will migrate to the new version in the months to come.
  • We have not yet further pursued native browser support for Memento, mainly because (contrary to what Tim suggests) we feel that chances to achieve it are rather low as long as there is no broader server-side Memento support. Memento has very significant support in the web archiving community (see above, e.g. Internet Archive). But we feel support outside of the web archiving community, e.g. by CMS with solid versioning approaches, is essential too. This is why we are keen on Wikipedia/MediaWiki support. Anyhow, we are currently working on two separate Chrome plug-in implementations. A major goal of that work is to determine how to minimize the footprint required for Memento support in the browser as a means of maximizing chances of possible native adoption.
  • More information is available via the Memento site [http://mementoweb.org]

Greetings,

Herbert Van de Sompel on behalf of the Memento team

Change 29812 abandoned by Hashar:
(bug 34778) Extension Memento: Initial Submit

Reason:
Seems the extension is stalled https://www.mediawiki.org/wiki/Extension:Memento and there is not any will to have it deployed. Thus abandoning change.

Feel free to resubmit a new change with the current code if there is any.

https://gerrit.wikimedia.org/r/29812

Change 32237 abandoned by Hashar:
(bug 34778) Extension Memento: Improvements after previous review.

Reason:
Seems the extension is stalled https://www.mediawiki.org/wiki/Extension:Memento and there is not any will to have it deployed. Thus abandoning change.

Feel free to resubmit a new change with the current code if there is any.

https://gerrit.wikimedia.org/r/32237

Change 32238 abandoned by Hashar:
(bug 34778) Extension Memento: Improvements after review.

Reason:
Seems the extension is stalled https://www.mediawiki.org/wiki/Extension:Memento and there is not any will to have it deployed. Thus abandoning change.

Feel free to resubmit a new change with the current code if there is any.

https://gerrit.wikimedia.org/r/32238

Change 32239 abandoned by Hashar:
(bug 34778) Extension Memento: Improvements after review

Reason:
Seems the extension is stalled https://www.mediawiki.org/wiki/Extension:Memento and there is not any will to have it deployed. Thus abandoning change.

Feel free to resubmit a new change with the current code if there is any.

https://gerrit.wikimedia.org/r/32239

Clarification needed:

  1. Is this work really shut down for ever, or just stalled awaiting some action?
  2. Does this mean there will not be a deployed Memento extension on WP?
  3. Is there an alternative being pursued toward the same end?

Clarification needed:

  1. Is this work really shut down for ever, or just stalled awaiting some action?
  2. Does this mean there will not be a deployed Memento extension on WP?
  3. Is there an alternative being pursued toward the same end?

In the context of Wikipedia:

  1. Never is a long time, but probably
  2. I think it is unlikely to be deployed unless either there is a groundswell of support from ordinary Wikipedia users, or RFC 7089 becomes implemented by a large number of providers.
  3. No.

However, that doesn't mean the extension can't still be developed. Feel free to make the extension better. Wikipedia isn't the only wiki in the universe.

Thanks for the timely response.

  1. Pity
  2. As far as providers go, a peek at http://timetravel.mementoweb.org/about/ shows a current list that's already pretty impressive, including many nations' archives, the Internet Archive, and GitHub! There are two browser extensions out to provide clients on Chrome, two more for Firefox, and of course there's always https://github.com/ukwa/mementoweb-webclient
  3. If the main obstacle is just that new versions of templates create changes to old versions of wikitext, the solution is simple: just archive the rendered HTML rather than the wikitext. Even a PDF archive would be better than nothing. That's what the UK national archive does, and archive.is too. As I understand it, they don't do so for WP mainly because they are expecting that WP intends to do its own.

We have fine support in the backend for rendering a page with old versions of templates. We don't actually expose that anywhere (outside of things flagged revs does), but it's not like that would be all that hard to do, and it is really a very separate issue from the question of this extension. Problems are more likely to come up with images that have subsequently been deleted and with changes to site CSS.

We have fine support in the backend for rendering a page with old versions of templates.

Not for deleted templates though. Sometimes a heavily used template is moved to a new name and the redirect from the old name is deleted to force people to use the new name... (On the Italian Wikipedia alone there would be millions of such template calls.)