
Monitor effectiveness of HTCP purging
Closed, Resolved · Public

Description

Moved over from bug 43448:

Could we have some sort of monitoring of whether HTCP packets actually cause
things to get purged? Squid/Varnish purging suddenly not working has
happened quite a few times in the past (each time for a different reason), and we
have no monitoring of it. (We don't even have any unit tests on the MW side for
it, as far as I am aware.)

This is bad since:
*Most people don't have Squid/Varnish set up in their dev environment, so
people don't notice on local test wikis.
*The symptoms are gradual, and usually pass unnoticed for some time.
*With the exception of images, the people primarily affected are anons, who are
less likely to know how to report the issue.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=41130
https://bugzilla.wikimedia.org/show_bug.cgi?id=48927

Details

Reference
bz43449

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:09 AM)
bzimport set Reference to bz43449.

FYI, posted on ops@ by Tim Starling six hours ago:

"There is a nagios check to make sure varnishhtcpd is working, but it
only checks to see if the process is still running, it doesn't check
to see if it is actually working."

For the record, in case somebody considers working on this:
<TimStarling> andre__: just purge a URL, request it, and check its Age header
<TimStarling> it should be less than some threshold
<TimStarling> http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2

I spoke with CT about this, and he's going to talk to his team about what can be done here.

Assigning to Mark as just discussed in the Ops/Platform meeting here in SF.

Assigning to Brandon per Roadmap Updates meeting and email thread.

Brandon: Is there any news / progress to share yet?

Yes, I've been implementing a replacement for varnishhtcpd. You can see the evolving initial version at the changeset here: https://gerrit.wikimedia.org/r/#/c/60390/ . I hope to be able to test it in prod in the next few days, and it shouldn't suffer from the perf/loss bugs of the previous implementation.

Stats output still needs implementation there as well (for monitoring the daemon's own reliability as well as other issues like loss of multicast delivery), but we'd rather put the stats work in the fresh new code than attach to the known-failing code.

The replacement daemon was deployed to production today. The initial deployment is just a minimum-change swap of the two pieces of software. Further enhancements (to performance, and logging of stats to spot multicast loss) will come once this has had a little time to stabilize without any loud complaints of being worse than before.

(In reply to comment #9 by Brandon Black)

The replacement daemon was deployed to production today. The initial
deployment is just a minimum-change swap of the two pieces of software.
Further enhancements (to performance, and logging of stats to spot multicast
loss) will come once this has had a little time to stabilize without any loud
complaints of being worse than before.

Great! Is it already possible to judge whether it's stable enough?

Brandon: Were the last 5 weeks enough time to judge whether it's stable enough? (Is this bug report fixed?)

Yes, I think so, although I just fixed a bug in the software yesterday. Still, it's a significant improvement and we've un-deployed the previous software. May as well close this bug and then open further ones as warranted for further changes to our purging architecture? The title of the bug doesn't precisely correlate with what ended up happening anyways (monitoring the success rate, which we still can't really do, and won't ever be able to do with any real accuracy so long as it's plain multicast).

Brandon, are you actually building proper monitoring into this daemon, or do we need to start separate work? I remember Mark making the case that this could be done within Varnish, but I'm still kinda confused as to how we can actually do effective monitoring of Varnish purging from within Varnish.

Bug 49362 is an example of a bug that would be great to have proper monitoring for.

The daemon logs some stats to a file, which we could pick up and graph (but currently do not, yet). These would basically give you the rate of multicast purge requests the daemon's receiving and whether it's failing to process any of them due to some large-impact bug that's overflowing the queue.
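
For illustration, a rough sketch of picking those stats up (the file path and the space-separated key:value counter format are assumptions about vhtcpd's output, not documented behaviour):

```python
# Rough sketch of reading the daemon's stats file for graphing/alerting.
# The path and the "key:value key:value ..." format are assumed here.
STATS_FILE = "/tmp/vhtcpd.stats"

def read_vhtcpd_stats(path=STATS_FILE):
    """Parse counters like 'inpkts_recvd:1234 inpkts_dequeued:1200 queue_overflows:0'."""
    with open(path) as f:
        fields = f.read().split()
    return {key: int(val) for key, val in (item.split(":", 1) for item in fields)}

stats = read_vhtcpd_stats()
# A growing gap between received and dequeued purges, or any queue overflow,
# would indicate the daemon is failing to keep up.
backlog = stats.get("inpkts_recvd", 0) - stats.get("inpkts_dequeued", 0)
if stats.get("queue_overflows", 0) > 0 or backlog > 10000:
    print("WARNING: vhtcpd appears to be dropping or backlogging purges:", stats)
```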

The larger issue that makes that relatively ineffective is that the requests arrive over multicast, which is an unreliable protocol by design. They could be lost in the sender's output buffers, anywhere in the network, or discarded at the receiving cache (local buffering issues) and we'd have no indication that was happening.

Upgrading from multicast is also an expensive proposition in terms of complexity (after all, the reason we're using it is that it's simple and efficient). We've thrown around some ideas about replacing multicast with http://en.wikipedia.org/wiki/Pragmatic_General_Multicast , likely using http://zeromq.org/ as the communications abstraction layer, as a solution to the unreliability of multicast. This would basically give us a reliable sequence-number system with retransmission that's handled at that layer.

That means adding zeromq support to the php that sends the purge requests, adding it to vhtcpd, and most likely also building out a redundant, co-operating set of middleboxes as publish/subscribe multiplexers. I'm not fond of going down this path unless we really see a strong need to upgrade from multicast, though. It smells of too much complexity for the problem we're trying to solve, and/or that there may be a better mechanism for this if we re-think how purging is being accomplished in general.
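
Purely to illustrate the publish/subscribe shape being described (none of this is planned or deployed code; the endpoints and the pyzmq library are assumptions for the sketch):

```python
# Toy pyzmq PUB/SUB sketch of the reliable-purge-feed idea, in one process.
# In the real design the PUB side would be the purging PHP and the SUB side
# vhtcpd (or a middlebox), possibly over an epgm:// (PGM) transport.
import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")          # illustrative endpoint

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "")  # receive every purge message

time.sleep(0.2)  # let the subscription propagate (PUB/SUB "slow joiner" issue)

pub.send_string("PURGE http://upload.wikimedia.org/example.jpg")
print(sub.recv_string())
```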

In any case, I think that would all be outside the scope of this ticket.

I think monitoring should happen outside of the daemon (since, as you say, there's a limit to what we can do with an unreliable protocol).

What I would suggest is some script (perhaps even living on Tool Labs) that does the following (a rough sketch of the last check follows below):
*Find the 10 most recent overwritten files. Get the thumbnail sizes that would be on the image description page, along with the original file asset (from both the European and North American Varnishes). Look at the Age header. If the Age header is longer than the time between the last re-upload and now, yell.
*Pick a test file at random. Request the file at some random size. Do ?action=purge. Sleep for about 10 seconds. Request the file again. Check that the Age header is either not present or < 10.
*For good measure, pick a popular page like [[Wikipedia:Village pump (Technical)]] (and also a redirect like [[WP:VPT]]). Request the page. Check that the Age header is less than the time between now and the last edit. (Or at least, for the redirect case, make sure the difference isn't huge, to give some leeway for the job queue.)
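
A rough sketch of the last of those checks, assuming the standard MediaWiki API and the Python `requests` library (the page title and wiki are illustrative):

```python
# Rough sketch of the third check: the cached copy of a popular page should
# not be older than its last edit. Title and wiki are illustrative.
import calendar
import time
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"
TITLE = "Wikipedia:Village pump (technical)"

# Ask the API when the page was last edited.
r = requests.get(WIKI_API, params={
    "action": "query", "prop": "revisions", "titles": TITLE,
    "rvprop": "timestamp", "rvlimit": 1, "format": "json",
}, timeout=10)
page = next(iter(r.json()["query"]["pages"].values()))
last_edit = page["revisions"][0]["timestamp"]            # e.g. "2013-08-20T12:34:56Z"
since_edit = time.time() - calendar.timegm(time.strptime(last_edit, "%Y-%m-%dT%H:%M:%SZ"))

# Fetch the rendered page through the caches and compare the Age header.
resp = requests.get("https://en.wikipedia.org/wiki/" + TITLE.replace(" ", "_"), timeout=10)
age = int(resp.headers.get("Age", "0"))
if age > since_edit:
    print(f"WARNING: cached copy (Age={age}s) predates the last edit ({since_edit:.0f}s ago)")
else:
    print(f"OK: Age={age}s, last edit {since_edit:.0f}s ago")
```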

How would one "Find the 10 most recent overwritten files" reliably/efficiently? Most of these solutions you're suggesting seem to give us some probabilistic idea that things are working, but don't really solve the problem if a random small percentage of purges are being lost in the pipe somewhere. They'd have to run at pretty high rates to even catch singular failed elements (one varnish not receiving purges, which may or may not have already cached the test file, which you may or may not hit with your check).


(In reply to comment #16)

How would one "Find the 10 most recent overwritten files" reliably/efficiently? Most of these solutions you're suggesting seem to give us some probabilistic idea that things are working, but don't really solve the problem if a random small percentage of purges are being lost in the pipe somewhere. They'd have to run at pretty high rates to even catch singular failed elements (one varnish not receiving purges, which may or may not have already cached the test file, which you may or may not hit with your check).

Very true. However, I'm more concerned with mass failures (the type of thing where doing this once every 6 hours would be sufficient). Massive failures of the purging system have happened several times in the past. Monitoring for that type of failure is, I think, important. (Fine-grained monitoring would be cool too, but seems more difficult.)

How would one "Find the 10 most recent overwritten files" reliably/efficiently?

Via a DB query (or the API): SELECT log_title FROM logging WHERE log_type = 'upload' AND log_action = 'overwrite' ORDER BY log_timestamp DESC LIMIT 10;
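
The API variant might look roughly like this (the `upload/overwrite` action value mirrors the log_action above; querying Commons and using Python's `requests` are illustrative choices):

```python
# Sketch of the API equivalent: the 10 most recent file overwrites from the
# upload log, via the standard list=logevents query.
import requests

r = requests.get("https://commons.wikimedia.org/w/api.php", params={
    "action": "query", "list": "logevents",
    "letype": "upload", "leaction": "upload/overwrite",
    "lelimit": 10, "format": "json",
}, timeout=10)
titles = [event["title"] for event in r.json()["query"]["logevents"]]
print(titles)
```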

There is a patch to set up ganglia monitoring for vhtcpd in gerrit 77975.

Change 77975 had a related patch set uploaded by MZMcBride:
Add ganglia monitoring for vhtcpd.

https://gerrit.wikimedia.org/r/77975

Change 77975 merged by BBlack:
Add ganglia monitoring for vhtcpd.

https://gerrit.wikimedia.org/r/77975

(In reply to comment #22)

Change 77975 merged by BBlack:
Add ganglia monitoring for vhtcpd.

https://gerrit.wikimedia.org/r/77975

With this changeset now merged, I'm a little unclear what's still needed to mark this bug as resolved/fixed. Brian W.: can you clarify?

(In reply to comment #23)

With this changeset now merged, I'm a little unclear what's still needed to mark this bug as resolved/fixed. Brian W.: can you clarify?

I'm going to call this bug closed for now. There are pretty graphs at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=vhtcpd_inpkts_dequeued&s=by+name&c=Upload+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

One thing we were talking about earlier was doing actual tests: a script that either looks at recent re-uploads and checks that the purge succeeded, or specifically purges things and checks whether that worked. We could do that later if this monitoring turns out not to be enough.