
Performance review of PubSubHubbub extension
Closed, Declined · Public

Description

Author: sebastian.brueckner

Description:
The PubSubHubbub extension (https://www.mediawiki.org/wiki/Extension:PubSubHubbub) needs to be reviewed before deployment on Wikidata.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=67118

Details

Reference
bz67117

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 3:41 AM)
bzimport set Reference to bz67117.
bzimport added a subscriber: Unknown Object (MLST).

A few things:

  • This seems to be missing an ArticleRevisionVisibilitySet handler
  • http_post() needs to handle $wgHTTPProxy
  • The maximum job attempts for the queue will have to be set very high to avoid update losses (maxTries)
  • NewRevisionFromEditComplete and other hooks trigger before COMMIT, so the jobs should probably be delayed (using 'jobReleaseTimestamp'). Moving them post-COMMIT is not an option since a network partition could cause nothing to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless). See the sketch after this list.
  • We tend to run lots of jobs for one wiki at a time. http_post() could benefit from some sort of singleton on the curl handle instead of closing it each time. See http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-php-curl-library.
  • createLastChangesOutput() should use LIMIT+DISTINCT instead of a "break" statement. Also, I'm not sure how well that works. There can only be one job in the queue for hitting the URL that returns this result, but it only covers the last 60 seconds of changes. Also, it selects rc_timestamp but does not use it now. Is it OK if the Hub misses a bunch of changes from this (e.g. are the per-Title jobs good enough)?
  • It's curious that the hub is supposed to talk back to a special page; why not an API page instead?
  • The Link headers also go there. What is the use of these? Also, since they'd take 30 days to apply to all pages (the Varnish cache TTL), it would be a pain to change them. They definitely need to be stable.
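
For illustration, a minimal sketch of what the delayed enqueue could look like. The job type name, hub URL, and the 10-second margin are assumptions, not the extension's actual code; note that 'jobReleaseTimestamp' only takes effect on a queue backend that supports delayed jobs (the Redis-backed queue does; the plain DB queue does not), and maxTries for the job type can be raised via $wgJobTypeConf so transient hub failures do not drop updates:

```php
<?php
// Illustrative only: enqueue the notification job with a short delay so it
// runs after the hook's surrounding transaction has committed.

class PubSubHubbubHooks {
	public static function onNewRevisionFromEditComplete(
		$wikiPage, $revision, $baseID, $user
	) {
		$title = $wikiPage->getTitle();

		$params = array(
			'hubUrl' => 'https://pubsubhubbub.example.org/', // placeholder hub
			// Fixed delay; ~10 seconds is very safe under non-broken replication.
			// Requires a job queue backend that supports delayed jobs.
			'jobReleaseTimestamp' => time() + 10,
		);

		JobQueueGroup::singleton()->push(
			new JobSpecification( 'pubSubHubbubNotify', $params, array(), $title )
		);

		return true;
	}
}
```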

Come to think of it, it seems like we need to send the following events to the hub:

  • New revisions
  • Page (un)deletions
  • Revision (un)deletions
  • Page moves

All of the above leave either edit or log entries in recent changes. Page moves only leave one at the old title...though rc_params can be inspected to get the new title. I wonder if, instead of a job per title, there could be a single job that sends all changes since the "last update time" and updates the "last update time" on success. The advantages would be:
a) Far fewer jobs needed
b) All updates would be batched
c) Supporting more hubs is easier since only another job and time position is needed (rather than N jobs for each hub for each title)
Of course I may have missed something.

alexander.lehmann wrote:

Hi Aaron.

Thanks for your notes. Some questions to your notes are inline.

(In reply to Aaron Schulz from comment #1)

A few things:

  • This seems to be missing an ArticleRevisionVisibilitySet handler
  • http_post() needs to handle $wgHTTPProxy
  • The maximum job attempts for the queue will have to be set very high to avoid update losses (maxTries)
  • NewRevisionFromEditComplete and other hooks trigger before COMMIT, so the jobs should probably be delayed (using 'jobReleaseTimestamp'). Moving them post-COMMIT is not an option since a network partition could cause nothing to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).

I'm not sure I understood the problem correctly. How should I set the length of the delay? Should the delay adjust dynamically, or should we set a fixed period of time?

  • We tend to run lots of jobs for one wiki at a time. http_post() could benefit from some sort of singleton on the curl handle instead of closing it each time. See http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-php-curl-library.
  • createLastChangesOutput() should use LIMIT+DISTINCT instead of a "break" statement. Also, I'm not sure how well that works. There can only be one job in the queue for hitting the URL that returns this result, but it only covers the last 60 seconds of changes. Also, it selects rc_timestamp but does not use it now. Is it OK if the Hub misses a bunch of changes from this (e.g. are the per-Title jobs good enough)?

  • It's curious that the hub is supposed to talk back to a special page; why not an API page instead?

The extension provides data in the MediaWiki XML export format; I don't know if the API is intended for that format. Does it make a real difference compared to using a special page?

  • The Link headers also go there. What is the use of these? Also, since they'd take 30 days to apply to all pages (the Varnish cache TTL), it would be a pain to change them. They definitely need to be stable.

They need to be there according to the PubSubHubbub specification [1], but they are stable.

Come to think of it, it seems like we need to send the following events to the hub:

  • New revisions
  • Page (un)deletions
  • Revision (un)deletions
  • Page moves

All of the above leave either edit or log entries in recent changes. Page moves only leave one at the old title...though rc_params can be inspected to get the new title. I wonder if, instead of a job per title, there could be a single job that sends all changes since the "last update time" and updates the "last update time" on success. The advantages would be:
a) Far fewer jobs needed
b) All updates would be batched
c) Supporting more hubs is easier since only another job and time position is needed (rather than N jobs for each hub for each title)
Of course I may have missed something.

Unfortunately, there is no way to identify requests coming from the hubs; theoretically, anyone can call the PubSubHubbub export interface. Therefore, we do not know the "last update time". The hub cannot provide the time, because the resource is identified only by its exact URL.

  • The delay could in theory be based on the current max slave lag, but a simple approach would be to just pick a value. 10 seconds would be very safe under non-broken circumstances.
  • It's not a huge deal, but it would be nice to use the API (which also works with things like OAuth if we wanted hubs with more access). Brad might have an opinion on special page vs API. I don't feel strongly, but we should get it right given the Link header and cache interaction.
  • You could still use the recent changes table to reduce the number of jobs. There could be at most one de-duplicated job per hub that would grab all title changes since the last time and send them to the hub. When the job succeeds (no HTTP errors posting the changed URIs), it could bump the time. The time could be stored in a simple DB table. The job could be delayed to give it time to cover a larger range. The range could be "last time" to present (or a smaller range to limit the number of items, especially for the new-hub case), maybe 5 minutes. It could hopefully batch more titles in one or a few HTTP requests (or use pipelining, or at least some curl_multi). A sketch of this approach follows below.
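
For concreteness, a rough sketch of such a job. The class, job type, the 'push_hub_position' table, and the publish-ping format are invented for illustration; error handling is simplified and this is not the extension's actual code:

```php
<?php
// Batched, de-duplicated per-hub notification job: read a stored position,
// grab all titles changed since then from recentchanges, POST them to the
// hub in one request, and only advance the position on success.

class PubSubHubbubBatchJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'pubSubHubbubBatch', $title, $params );
		// At most one outstanding job per hub (see Job::ignoreDuplicates()).
		$this->removeDuplicates = true;
	}

	public function run() {
		$dbw = wfGetDB( DB_MASTER );
		$hub = $this->params['hubUrl'];

		// Position stored in a small DB table, keyed by hub.
		$lastTime = $dbw->selectField(
			'push_hub_position', 'php_timestamp',
			array( 'php_hub' => $hub ), __METHOD__
		);
		if ( $lastTime === false ) {
			$lastTime = $dbw->timestamp( time() - 300 ); // new hub: start 5 min back
		}

		// All titles changed since the last successful notification,
		// de-duplicated via GROUP BY and bounded to limit batch size.
		$res = $dbw->select(
			'recentchanges',
			array( 'rc_namespace', 'rc_title', 'MAX(rc_timestamp) AS ts' ),
			array( 'rc_timestamp > ' . $dbw->addQuotes( $lastTime ) ),
			__METHOD__,
			array( 'GROUP BY' => 'rc_namespace, rc_title', 'LIMIT' => 500 )
		);

		$urls = array();
		$newTime = $lastTime;
		foreach ( $res as $row ) {
			$title = Title::makeTitle( $row->rc_namespace, $row->rc_title );
			$urls[] = $title->getCanonicalURL();
			$newTime = max( $newTime, $row->ts );
		}

		if ( $urls && !$this->notifyHub( $hub, $urls ) ) {
			return false; // job gets retried later (hence the high maxTries)
		}

		// Only advance the position once the hub accepted the whole batch.
		$dbw->update(
			'push_hub_position',
			array( 'php_timestamp' => $newTime ),
			array( 'php_hub' => $hub ),
			__METHOD__
		);
		return true;
	}

	private function notifyHub( $hub, array $urls ) {
		// One POST for the whole batch instead of one connection per title;
		// how multiple topic URLs are encoded depends on the hub.
		$req = MWHttpRequest::factory( $hub, array( 'method' => 'POST' ) );
		$req->setData( array( 'hub.mode' => 'publish', 'hub.url' => $urls ) );
		return $req->execute()->isOK();
	}
}
```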

(In reply to Aaron Schulz from comment #3)

  • It's not a huge deal, but it would be nice to use the API (which also works with things like OAuth if we wanted hubs with more access). Brad might have an opinion on special page vs API. I don't feel strongly, but we should get it right given the Link header and cache interaction.

OAuth works with special pages too.

It sounds like the idea is that the extension could be configured for general users to use? Then an API endpoint wouldn't necessarily be out of place. On the other hand, is there anything wrong with the existing special page implementation? If not, is there a reason not to just stay with that?

I don't think it works on special pages. I asked Chris about that a few days back.

I tested it before posting comment 4, just to be sure. It does.

You can now test it yourself. Go to http://tools.wmflabs.org/oauth-hello-world/index.php?action=authorize, then go to http://tools.wmflabs.org/oauth-hello-world/index.php?action=testspecial which will dump the response received when hitting Special:MyPage.

The updates use the XML dump format. That's not really compatible with the way the API structures XML output, and if JSON is requested from the API, the dump would have to be encoded as a single string value. That's really annoying.

Also, API modules generally shouldn't mess with HTTP headers, which PuSH requires, as far as I understand.

A special page seems to fit the bill much better on the technical level, though it's a bit awkward conceptually. There are, after all, several special pages that generate non-HTML output.

Hm, perhaps the best way to implement this would be a custom action?...
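
For reference, a minimal sketch of a special page that emits raw XML plus the PuSH discovery Link headers. The page name, hub URL, and the export stub are placeholders, not the extension's actual implementation:

```php
<?php
// Illustrative special page returning non-HTML output with PuSH headers.

class SpecialPubSubHubbubExport extends SpecialPage {
	public function __construct() {
		parent::__construct( 'PubSubHubbubExport' );
	}

	public function execute( $subPage ) {
		$out = $this->getOutput();
		$response = $this->getRequest()->response();

		// Bypass the normal skin/HTML output entirely.
		$out->disable();

		$response->header( 'Content-Type: application/xml; charset=utf-8' );
		// PuSH discovery headers; these must stay stable given Varnish caching.
		$response->header( 'Link: <https://hub.example.org/>; rel="hub"' );
		$response->header(
			'Link: <' . $this->getPageTitle( $subPage )->getCanonicalURL() . '>; rel="self"'
		);

		echo $this->buildExportXml( $subPage );
	}

	private function buildExportXml( $subPage ) {
		// In the real extension this would be the MediaWiki XML export of the
		// requested page(s); stubbed out here.
		return '<mediawiki><!-- export of ' .
			htmlspecialchars( (string)$subPage ) . ' --></mediawiki>';
	}
}
```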

Without looking at the code, we had a few comments on this extension that also relate to performance [1][2].

The basic idea is whether the client (wiki mirror) can delay the poll (e.g. 5-10 minutes) and aggregate multiple edits of the same page into one. OAI had native support for this.

The reason is that in DBpedia Live we had a lot of scaling issues when the database was under constant update (even with edits aggregated using a 5-10 minute poll delay) while using the MW API in parallel to get data for extraction.

I don't know if this is supported by the PubSubHubbub spec, but the client could do some aggregation manually.
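
To illustrate the kind of manual aggregation meant here, a simplified, hypothetical sketch (class and method names are made up, and the buffer is in memory only; a real mirror would persist it in a database or queue):

```php
<?php
// Buffer incoming PuSH notifications and process each page at most once per
// window, after it has been quiet for a while, so bursts of edits collapse
// into a single MediaWiki API fetch per page.

class AggregatingSubscriber {
	/** @var array page key => timestamp of the most recent notification */
	private $pending = array();
	/** @var int */
	private $windowSeconds;

	public function __construct( $windowSeconds = 300 ) { // 5 minute window
		$this->windowSeconds = $windowSeconds;
	}

	// Called for every notification pushed by the hub; later edits of the
	// same page simply overwrite the earlier entry.
	public function onNotification( $pageKey ) {
		$this->pending[$pageKey] = time();
	}

	// Called periodically (e.g. from cron): extract only pages that have not
	// changed again within the window.
	public function flush( callable $processPage ) {
		$cutoff = time() - $this->windowSeconds;
		foreach ( $this->pending as $pageKey => $lastSeen ) {
			if ( $lastSeen <= $cutoff ) {
				$processPage( $pageKey );
				unset( $this->pending[$pageKey] );
			}
		}
	}
}
```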

We also had a question about hub data persistence, but we are not sure which thread is more appropriate for it.

[1] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000533.html
[2] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000539.html

aaron changed the task status from Open to Stalled. (Apr 7 2015, 7:42 PM)
aaron claimed this task.