Page MenuHomePhabricator

Add Revision Diff functionality to API
Closed, ResolvedPublic

Description

Author: overlordq

Description:
Patch to hook it into the API

The current methods of getting a diff:

  1. Requesting both revisions, running diff algo on it.
  2. Scraping the diff html output

are rather sub par. A handier/nicer way would be to add this option to the api that returns a nice clean format.


Version: 1.12.x
Severity: enhancement

attachment apidiff.patch ignored as obsolete

Details

Reference
bz13209

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:05 PM
bzimport set Reference to bz13209.

overlordq wrote:

The actual functionality

It's horrible horrible code that is probably more complex then it should be, and is most likely horribly inefficient.

The only thing going for it, is that it works.

http://www.thedarkcitadel.com/w/api.php?action=query&prop=diff&titles=Main%20Page
or
http://www.thedarkcitadel.com/w/api.php?action=query&prop=diff&titles=Rainbow%20Tables

attachment ApiQueryDiff.php ignored as obsolete

Bryan.TongMinh wrote:

Don't assume text to be located in the text table, since they are not. You should use Revision::getRevisionText.

This bug was fixed once and then reverted because of performance issues.

(In reply to comment #3)

This bug was fixed once and then reverted because of performance issues.

To elaborate on this a little further: there are several problems I ran into when implementing API diffs:

  • Diffs are cached, so for performance reasons you wanna use the diff cache. However, the diff cache contains prefab HTML.
  • Diff generation in PHP is slow as hell, so we use a C++ program for that (wikidiff2). However, wikidiff2 outputs (you guessed it) HTML.

Caching diffs rather than HTML is:

  • A huge job, because it requires rewriting large parts of wikidiff2 and MediaWiki's diff handling code
  • Memory-intensive, as arrays take truckloads of memory in PHP
  • A performance hit, as the diff has to be processed and HTML generated on every diff view, rather than once and for all

matthew.britton wrote:

I would settle for an API query that wrapped the already-generated HTML representation of the diff in <api><diff> ... </diff></api> tags and output that. It would still be more efficient at both ends than getting the same information through index.php. Any chance of this?

(In reply to comment #5)

I would settle for an API query that wrapped the already-generated HTML
representation of the diff in <api><diff> ... </diff></api> tags and output
that. It would still be more efficient at both ends than getting the same
information through index.php.

It wouldn't: with the right parameter mix (I don't know the exact mix, others do) you can get just the diff from index.php .

matthew.britton wrote:

Reopening this; I understand that providing machine-readable diffs is not feasible for performance reasons. However, providing an API query that outputs the HTML diffs (already generated by MediaWiki) suitably wrapped would not, as suggested above, be merely redundant to an existing UI query. It is something that would be useful for two key reasons:

  • Machine-readable output in the event of errors, something that the UI does not always, or even usually, provide
  • It would be relatively simple to implement an API query format that, unlike the UI, could return several diffs in the same request. This would make fetching many diffs easier on the servers, something that is done in the process of dealing with vandalism for example.

(In reply to comment #7)

Reopening this; I understand that providing machine-readable diffs is not
feasible for performance reasons. However, providing an API query that outputs
the HTML diffs (already generated by MediaWiki) suitably wrapped would not, as
suggested above, be merely redundant to an existing UI query. It is something
that would be useful for two key reasons:

  • Machine-readable output in the event of errors, something that the UI does

not always, or even usually, provide

  • It would be relatively simple to implement an API query format that, unlike

the UI, could return several diffs in the same request. This would make
fetching many diffs easier on the servers, something that is done in the
process of dealing with vandalism for example.

I guess those are good arguments for implementing this. I'll do it soon, but note that its usefulness is severely limited because the diff format depends on the wiki's configuration. On a wiki using MW's built-in diff functionality (slow!) or wikidiff2 (WMF uses this), you'll get an HTML diff polluted with <table>, <tr>, <td>, <ins> and <del> tags. But if that's what you want, you can get it :)

alexsm333 wrote:

The parameters mentioned by Roan in #6 are
&action=render&diffonly=yes

This outputs just HTML diff.
(Need to import /skins-1.5/common/diff.css to make it look better)

I don't really see why adding <api><diff> is needed at all.

(In reply to comment #9)

The parameters mentioned by Roan in #6 are
&action=render&diffonly=yes

This outputs just HTML diff.
(Need to import /skins-1.5/common/diff.css to make it look better)

I don't really see why adding <api><diff> is needed at all.

Because it'd allow fetching lots of diffs at the same time. I'll have to ask Brion about this, though, I think requesting lots of uncached diffs may blow up stuff.

matthew.britton wrote:

(In reply to comment #8)

On a wiki using MW's built-in diff functionality
(slow!) or wikidiff2 (WMF uses this), you'll get an HTML diff polluted with

<table>, <tr>, <td>, <ins> and <del> tags. But if that's what you want, you can get it :)

Sometimes the intended purpose is display to the user anyway, in which case it's not a problem. For those cases where it isn't, it's not *that* difficult to parse the HTML and get the information into a more machine-friendly format.

(I have, of course, considered fetching the two revisions, then producing the diff locally. But when the change is small relative to the size of the page, which is often, and the user is on a slow connection and/or has a slow machine, this option takes much longer, and alas there are still many people out there on dial-up. :(

Obviously if this blows stuff up then forget it, Brion has the last word on that. :)

matthew.britton wrote:

(In reply to comment #10)

Because it'd allow fetching lots of diffs at the same time. I'll have to ask
Brion about this, though, I think requesting lots of uncached diffs may blow up
stuff.

Oh, and by "lots" I didn't have in mind hundreds. Even if it was limited to, say, five, it would be useful.

extractRowInfo($row) is bogus -- it's assuming that text.old_text contains the final wikitext of the revision, which is wrong. You *must* fetch text via the Revision class, or you'll break on compressed text, legacy encodings, cur row proxies, batch compression, or external storage.

(In reply to comment #13)

extractRowInfo($row) is bogus -- it's assuming that text.old_text contains the
final wikitext of the revision, which is wrong. You *must* fetch text via the
Revision class, or you'll break on compressed text, legacy encodings, cur row
proxies, batch compression, or external storage.

Yes, the patch is fundamentally flawed, we know about that. I'll mark it obsolete.

The real question we have for you though is if you see any harm in clients being able to pull 50 (500 for privileged users) diffs in one request, considering they might have to be generated at that point. Is 50/500 diff generations in one request evil enough to deny this feature?

Well, they can pull 50 or 500 diffs in individual requests if they like already. :)

Without wikidiff2 present, diff generation is hideously slow, however, making lots of hits a bit of a danger to begin with. This is why for instance the RSS feed attempts to avoid doing diffs it thinks might be too big or slow (though it doesn't do a very good job of this iirc).

In short: I hate diffs, they suck. ;)

Considerations:

  • Do we really need to request multiple diffs at once, or do we just want an API point for requesting *a* diff in a consistent fashion?
  • Do we care that the result is going to be a big wad of HTML which will change depending on the version of MediaWiki, the diff backend installed, etc?
  • What's the actual intention of fetching the diff anyway?

(In reply to comment #15)

Well, they can pull 50 or 500 diffs in individual requests if they like
already. :)

Well yeah, but that'll at least load-balance somewhat nicely.

Without wikidiff2 present, diff generation is hideously slow, however, making
lots of hits a bit of a danger to begin with. This is why for instance the RSS
feed attempts to avoid doing diffs it thinks might be too big or slow (though
it doesn't do a very good job of this iirc).

In short: I hate diffs, they suck. ;)

I second that.

Considerations:

  • Do we really need to request multiple diffs at once, or do we just want an

API point for requesting *a* diff in a consistent fashion?

  • Do we care that the result is going to be a big wad of HTML which will change

depending on the version of MediaWiki, the diff backend installed, etc?

For answers to these questions, read comment #7, #8, #10 and #11.

  • What's the actual intention of fetching the diff anyway?

That doesn't seem to be mentioned anywhere.

matthew.britton wrote:

(In reply to comment #15)

  • Do we really need to request multiple diffs at once, or do we just want an

API point for requesting *a* diff in a consistent fashion?

  • Do we care that the result is going to be a big wad of HTML which will change

depending on the version of MediaWiki, the diff backend installed, etc?

  • What's the actual intention of fetching the diff anyway?

It seems that the only person who actually wants this is me, so I guess don't bother too much about it if it's a lot of work. Things work OK how they are, at least nothing has fallen over yet.

But in my case, the answers to these questions:

  • Ideally multiple diffs, but single ones would still mean simpler client-side code and cleaner error handling.
  • Not really, since it's going to be presented to the user anyway.
  • The intended use is for recent changes patrolling on the English Wikipedia. Users need to review diffs fairly constantly at short intervals over a reasonable period of time, and this accounts for the bulk of web requests associated with rc patrolling. I am assuming -- quite possibly incorrectly -- that it would be easier on the servers if the client was asking for, say, 2 diffs every 6 seconds rather than 1 diff every 3 seconds. If that assumption is complete bollocks then please disregard anything about multiple diffs. :) The trade-off here is minimizing the delay between the user wanting the diff and having it, while also not being excessive with web requests. I keep said delay close to zero by pre-loading diffs a few in advance of the one they're actually looking at, hence it would be easy to fetch them two or three at a time.

matthew.britton wrote:

Another things I should clarify is that all the diffs that such requests might generate would have been generated anyway, just possibly in separate requests. And at least for en.wikipedia RC patrolling, if an edit looks even in the least suspicious then it is almost guaranteed that a diff is going to be generated for it fairly quickly because there is always at least someone patrolling.

(In reply to comment #18)

Another things I should clarify is that all the diffs that such requests might
generate would have been generated anyway, just possibly in separate requests.
And at least for en.wikipedia RC patrolling, if an edit looks even in the least
suspicious then it is almost guaranteed that a diff is going to be generated
for it fairly quickly because there is always at least someone patrolling.

That applies to your tool, but the API isn't restricted to your tool. If it's *possible* to blow stuff up, your tool might not use that 'feature', but someone else might.

(In reply to comment #17)

It seems that the only person who actually wants this is me, so I guess don't
bother too much about it if it's a lot of work. Things work OK how they are, at
least nothing has fallen over yet.

It's not that much work, but it's more than just a few lines, which is why I wanna get Brion's approval before I implement this.

But in my case, the answers to these questions:

  • Ideally multiple diffs, but single ones would still mean simpler client-side

code and cleaner error handling.

Single diffs are definitely possible; I'll write a module that returns them later, and leave the possibility of multiple diffs open until Brion's cast his verdict.

I am assuming -- quite possibly incorrectly --
that it would be easier on the servers if the client was asking for, say, 2
diffs every 6 seconds rather than 1 diff every 3 seconds. If that assumption is
complete bollocks then please disregard anything about multiple diffs. :)

That makes sense. Retrieval of cached diffs is fast compared to request overhead, generating diffs isn't (AFAIK). But since most diffs are in cache your assumption is valid (although it doesn't matter much in a worst-case scenario, but then those scenarios are rare).

The
trade-off here is minimizing the delay between the user wanting the diff and
having it, while also not being excessive with web requests. I keep said delay
close to zero by pre-loading diffs a few in advance of the one they're actually
looking at, hence it would be easy to fetch them two or three at a time.

That makes sense as well.

Created attachment 5664
Patch that adds diff functionality to prop=revisions

The attached patch adds diff functionality to prop=revisions. It can:

  • diff all listed revs to a given revid (rvdiffto)
  • diff all listed revs to the previous rev of the same page (rvdifftoprev)
  • diff all listed revs to the top rev of the same page (rvdifftotop)
  • show diff size (bytes), diff lines and revids involved (rvdiffprop)

Side effects of this patch:

  • all calls to getText() (get text if publicly available) changed to revText() (get text if available to the current user)
  • the revisions array in the output now uses revids rather than 0,1,2... as keys (technically a breaking change, although sane clients won't even notice)

Performance considerations:

  • difftoprev and difftotop are only enabled when listing multiple revisions of the same page
    • this means all revisions have the same top revision so it has to be fetched only once
  • difftoprev diffs to the previous/next revision IN THE LIST, which may not be the previous/next revision if rvuser/rvexcludeuser is used to filter stuff. Also, the last/first revision doesn't get a difftoprev
    • this allows difftoprev to use only the revision texts already fetched
  • when diffs are generated, at most 50 revisions (500 for users with the apihighlimits right) are returned
    • this means at most 150/1500 diffs are generated in one request
      • 50/500 of these (diffto) have a low cache hit probability (diffto revision may not even belong to the same page)
      • 50/500 of these (difftoprev) have a high cache hit probability
      • 50/500 of these (difftotop) have a medium cache hit probability

Attached:

Reassigning to Brion as I need him to review the patch.

If the diffs *are* in fact cached, then fetching a lot at once is cheap. If they're not, I really don't want a single request to be able to multiply an expensive operation times 1500...

My general preference would be to limit to one-at-a-time fetching.

Perhaps if there's a strong use case for it, multiple fetches could be conditionally enabled, or could fetch "only if cached", but I'm not sure how valuable this is.

(In reply to comment #22)

If the diffs *are* in fact cached, then fetching a lot at once is cheap. If
they're not, I really don't want a single request to be able to multiply an
expensive operation times 1500...

I completely understand. Of course that limit could be reduced, but I'm guessing it would have to be too low to be useful (would it?).

My general preference would be to limit to one-at-a-time fetching.

Perhaps if there's a strong use case for it, multiple fetches could be
conditionally enabled, or could fetch "only if cached", but I'm not sure how
valuable this is.

Not trivial to implement either: the current diff backend doesn't support such conditional fetches.

I'll think about one-at-a-time fetching and only-if-cached fetching some more later, but right now I just wanna get some sleep :P

matthew.britton wrote:

(In reply to comment #22)

If the diffs *are* in fact cached, then fetching a lot at once is cheap. If
they're not, I really don't want a single request to be able to multiply an
expensive operation times 1500...

My general preference would be to limit to one-at-a-time fetching.

Perhaps if there's a strong use case for it, multiple fetches could be
conditionally enabled, or could fetch "only if cached", but I'm not sure how
valuable this is.

For me, multiple fetches would be useful, but I would only be using small values of "multiple".

How about not granting a higher limit to users with the 'apihighlimits' right for this particular request, and/or reducing the limit on the number of revisions? If the revision limit was lowered to, say, 10 (with no higher limit for bots), that would mean a maximum of 30 diffs generated, which may or may not be still too high but is certainly lower than 1500 :)

Fetching only if cached wouldn't be much use in my case (as it would be mostly used with suspicious recent changes which are inevitably going to be requested by someone at some time but may not have been yet).

It's not vital, just useful to have.

(In reply to comment #20)

  • the revisions array in the output now uses revids rather than 0,1,2... as

keys (technically a breaking change, although sane clients won't even notice)

To be honest, this would break almost all of my scripts, as this changes the json output from an array of revisions ([...]) to an object ({...}). If you still want to do this, please consider adding some parameter like &indexrevids like you did for the pageids.

Bryan.TongMinh wrote:

(In reply to comment #20)

  • the revisions array in the output now uses revids rather than 0,1,2... as

keys (technically a breaking change, although sane clients won't even notice)

It would because this means that the type changes from list to dictionary, which does differ in sane languages.

(In reply to comment #25)

(In reply to comment #20)

  • the revisions array in the output now uses revids rather than 0,1,2... as

keys (technically a breaking change, although sane clients won't even notice)

To be honest, this would break almost all of my scripts, as this changes the
json output from an array of revisions ([...]) to an object ({...}). If you
still want to do this, please consider adding some parameter like &indexrevids
like you did for the pageids.

Sure.

Does this mean you can't iterate over an object in JavaScript in a sane fashion (i.e. without knowing the keys beforehand)? Isn't there something like a foreach statement?

(In reply to comment #27)

(In reply to comment #25)

(In reply to comment #20)

  • the revisions array in the output now uses revids rather than 0,1,2... as

keys (technically a breaking change, although sane clients won't even notice)

To be honest, this would break almost all of my scripts, as this changes the
json output from an array of revisions ([...]) to an object ({...}). If you
still want to do this, please consider adding some parameter like &indexrevids
like you did for the pageids.

Sure.

Oh wait, I just realized I don't absolutely need this change; I'll just have to create yet another array to keep track of which revision of which page has which revid.

matthew.britton wrote:

Any progress on this? Feel free to scrap the whole multiple-diff thing if there are still reservations about it; just single would be fine.

(In reply to comment #23)

I'll think about one-at-a-time fetching and only-if-cached fetching some more
later, but right now I just wanna get some sleep :P

I decided to go with only-if-cached fetching, which I implemented in r47890. This implementation will let you pull an unlimited number of cached diffs and at most $wgAPIMaxUncachedDiffs (1 by default) uncached diffs.