
Wikimedia setup interfering with API maxage and smaxage parameters
Closed, Resolved, Public

Description

Author: herd

Description:
Please allow the "Cache-Control", "Expires", and "X-Cache*"-type HTTP headers to be modified in API queries via a &maxage URL parameter, similar to what action=raw already supports. This would allow noncritical queries in user scripts, perhaps ones that run on every page load, to avoid putting undue strain on the servers.

The inspiring example is a conceptual user status script that would query the API on every page load of participating users to check their last edit. By giving the query a short squid-level and browser-level cache lifetime, such as 15 minutes, the backend queries would be greatly reduced when the same user, or multiple users, visit that user's page.

This parameter should probably be ignored for any write function or any request sending POST data.
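For illustration, a hypothetical request along these lines (the user name and exact query parameters are made up for this example) is the sort of thing such a script would issue, letting Squid and the browser serve repeat hits for 15 minutes:

    http://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=ExampleUser&uclimit=1&maxage=900&smaxage=900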


Version: unspecified
Severity: normal
URL: http://web-sniffer.net/?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dquery%26meta%3Dsiteinfo%26smaxage%3D900&submit=Submit&http=1.1&gzip=yes&type=GET&uak=0

Details

Reference
bz14402

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:14 PM
bzimport set Reference to bz14402.

Using Squid to cache API requests looks like a good idea, but purging those caches when something changes ranges from extremely tricky to impossible, depending on the type of query. An edit should purge all API requests involving the page, the user, recentchanges, logevents (we shouldn't even bother caching those two, anyway), and possibly other things as well. The problem is that there's a wide variety of possible requests due to the large number of modules and the generator feature, and that it's probably not possible to purge all requests that need to be purged due to the API's dynamic nature. Meta queries like siteinfo are simply impossible to cache, because there's no way to track whether and when the information displayed (based on the interwiki tables and a range of $wg variables) has changed.

A far less problematic approach would be to implement throttling on the client side, by e.g. running the query every 15 minutes and caching the result.

Meta queries like siteinfo are simply impossible to cache, because there's no way to track whether and when the information displayed (based on the interwiki tables and a range of $wg variables) has changed.

It's entirely possible to cache them; they just might return stale results.

Since caching here would be enabled only for requests that specifically ask for it in the URL, clients would have to understand the potential risk and decide for themselves whether that's appropriate for them.

A far less problematic approach would be to implement throttling on the client side, by e.g. running the query every 15 minutes and caching the result.

That's not necessarily feasible for client-side JavaScript, which could benefit from shared caching at a higher level.

(In reply to comment #2)

It's entirely possible to cache them; they just might return stale results.

Since caching here would be enabled only for requests that specifically ask for it in the URL, clients would have to understand the potential risk and decide for themselves whether that's appropriate for them.

When are these caches purged, then? And how, given that there are infinitely many possible URLs that request siteinfo? Squid caching is probably not gonna work because of that; we could implement some generic form of query caching, though. The problem is that that still has to go through the API and the database, kind of defeating the point (in fact, siteinfo queries cached that way will probably be slower than normal ones).

A far less problematic approach would be to implement throttling on the client side, by e.g. running the query every 15 minutes and caching the result.

That's not necessarily feasible for client-side JavaScript, which could benefit from shared caching at a higher level.

True.

When are these caches purged, then?

Never; they would just expire after X seconds. Hence "might return stale results."

For many purposes, it doesn't matter that, say, there's a 0.0000001% chance that the list of namespaces changed in the last couple hours.

(In reply to comment #4)

When are these caches purged, then?

Never; they would just expire after X seconds. Hence "might return stale results."

For many purposes, it doesn't matter that, say, there's a 0.0000001% chance that the list of namespaces changed in the last couple hours.

That's true, but the question remains exactly how we're gonna implement caching this kind of information. The only efficient way of doing so is by using a Squid-like cache that bypasses PHP altogether, since having the API itself fetch the data from some kind of cache (memcached, DB) would probably be slower. A problem with the former, however, is that there are multiple (if not lots of) possibilities for cacheable API requests, some of which might even combine multiple cacheable properties.

This feature req is about HTTP caching, which is accomplished roughly like...

Cache-Control: public, s-maxage=3600, max-age=3600

for public squid caching or...

Cache-Control: private, max-age=3600

for private user-agent caching.

(Note that anything for *public* caching would need to avoid doing authentication, sending cookies, etc.)
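As a minimal sketch of what emitting those headers might look like in PHP (this is not the actual ApiMain.php code; the variable names and the defaulting behavior are assumptions for illustration):

    <?php
    // Sketch only: $smaxage and $maxage are assumed to hold the validated
    // &smaxage/&maxage request parameters, defaulting to 0 when absent.
    if ( $smaxage > 0 ) {
        // Cacheable by shared (Squid) caches and by the browser.
        header( "Cache-Control: public, s-maxage=$smaxage, max-age=$maxage" );
    } else {
        // Cacheable only by the user's own browser.
        header( "Cache-Control: private, max-age=$maxage" );
    }
    // Matching Expires header for HTTP/1.0 caches.
    header( 'Expires: ' . gmdate( 'D, d M Y H:i:s', time() + $maxage ) . ' GMT' );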

herd wrote:

This feature req is about HTTP caching

Indeed. Even just private caching headers would be good, especially for user scripts that make API calls or importScriptURI() requests on page load, where pages might be navigated away from and reloaded via forward/back.

I'd even go so far as to suggest that a 5-minute default (overridable with &maxage=0, of course) for all &callback= requests might be a good idea.

smaxage done in r36347, lemme know if you need the regular maxage as well.

herd wrote:

(In reply to comment #8)

smaxage done in r36347, lemme know if you need the regular maxage as well.

Actually, yes, maxage would be more useful than smaxage (per all my comments). Please include both if possible, thanks!

(In reply to comment #9)

(In reply to comment #8)

smaxage done in r36347, lemme know if you need the regular maxage as well.

Actually, yes, maxage would be more useful than smaxage (per all my comments). Please include both if possible, thanks!

Done in r36349

herd wrote:

Reopening, has no effect. Well, it does have *an* effect, but not the desired effect, and no change to the "Cache-Control" header.

After staring at /api/ApiMain.php trying to figure out why it wasn't working, I've come to three thoughts (bear in mind my PHP is pseudo-PHP):

$expires = $exp == 0 ? 1 : time() + $this->mSquidMaxage;

  1. Shouldn't this be adding $exp to time()? (Or is it?)

    header('Cache-Control: s-maxage=' . $smaxage . ', must-revalidate, max-age=' . $maxage);

  2. Shouldn't "must-revalidate" be conditional?

  3. Reason for reopening: &maxage and &smaxage are definitely being set (as they do affect the "Expires" header), but they have no effect on the "Cache-Control" header, at least on Wikimedia. It can further be observed that the parameters in the header statement in ApiMain are in a different order than what appears in the HTTP headers:

    'Cache-Control: s-maxage=' . $smaxage . ', must-revalidate, max-age=' . $maxage

    Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

Possibly something is overwriting the header() that ApiMain attempts to use?

(In reply to comment #11)

Reopening, has no effect. Well, it does have *an* effect, but not the desired effect, and no change to the "Cache-Control" header.

After staring at /api/ApiMain.php trying to figure out why it wasn't working, I've come to three thoughts (bear in mind my PHP is pseudo-PHP):

$expires = $exp == 0 ? 1 : time() + $this->mSquidMaxage;

  1. Shouldn't this be adding $exp to time()? (Or is it?)

You're right, it should add $exp. I changed $this->mSquidMaxage to $exp in some places and forgot this one. Fixed in r36525.
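In other words, the corrected line (a sketch of what r36525 presumably does, inferred from this exchange rather than from the revision itself) would read:

    $expires = $exp == 0 ? 1 : time() + $exp;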

header('Cache-Control: s-maxage=' . $smaxage . ', must-revalidate, max-age=' . $maxage);

  2. Shouldn't "must-revalidate" be conditional?

On what condition? I have no idea what must-revalidate does...

  3. Reason for reopening: &maxage and &smaxage are definitely being set (as they do affect the "Expires" header), but they have no effect on the "Cache-Control" header, at least on Wikimedia.

"At least on Wikimedia" is the key phrase here. It does set the Cache-Control headers on my local install, so maybe Squid or some other program is interfering here? Also, note that errors (including the API help) will NEVER be cached and will therefore simply ignore &maxage and &smaxage.

It can further be observed that the parameters in the header statement in ApiMain are in a different order than what appears in the HTTP headers:

'Cache-Control: s-maxage=' . $smaxage . ', must-revalidate, max-age=' . $maxage

Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

Possibly something is overwriting the header() that ApiMain attempts to use?

You probably just tested the help screen then. In case of an error (and action=help is a module that always exits with an error), another part of ApiMain overwrites the previously set header with the "no-cache" header you quoted above.

Resolving back to FIXED as it works perfectly for me.

herd wrote:

(In reply to comment #12)

Possibly something is overwriting the header() that ApiMain attempts to use?

You probably just tested the help screen then.

No, I tested a dozen different queries and formats:

Help: http://web-sniffer.net/?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php&submit=Submit&http=1.1&gzip=yes&type=GET&uak=0

Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

Siteinfo: http://web-sniffer.net/?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dquery%26meta%3Dsiteinfo%26maxage%3D900%26smaxage%3D900&submit=Submit&http=1.1&gzip=yes&type=GET&uak=0

Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

RC: http://web-sniffer.net/?url=http%3A%2F%2Fen.wikipedia.org%2Fw%2Fapi.php%3Faction%3Dquery%26list%3Drecentchanges%26maxage%3D900%26smaxage%3D900&submit=Submit&http=1.1&gzip=yes&type=GET&uak=0

Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

etc...

For those URLs, I get the following (on my test wiki):

api.php:

Cache-Control: s-maxage=0, must-revalidate, max-age=0

api.php?action=query&meta=siteinfo:

Cache-Control: s-maxage=900, must-revalidate, max-age=900

api.php?action=query&list=recentchanges&maxage=900&smaxage=900:

Cache-Control: s-maxage=900, must-revalidate, max-age=900

So either WMF needs to update ApiMain.php to r36525 (currently at r36502), or something WMF-specific such as Squid is interfering here.

Reopening (and changing product to "Wikimedia"). ApiMain.php has long since been updated, but something on WMF servers is still interfering with this.

(In reply to comment #15)

Reopening (and changing product to "Wikimedia"). ApiMain.php has long since been updated, but something on WMF servers is still interfering with this.

Changing summary accordingly.

Unassigning for myself, this is a Wikimedia configuration issue.

Assigning to Mark, CC'ing Fred. Squid config needs to be updated to treat /w/api.php differently?