
Establish an internal system or a recommended external system for monitoring user-created Toolforge web services
Open, MediumPublic

Description

The ask here is for opt-in monitoring of the web services that can alert a Toolforge maintainer when their web service is down, perhaps with trending as well.

If we don't implement something ourselves, a viable alternative may be to find an external service (or a set of them) that can be documented as options, with no guarantees or endorsement.

Details

Reference
bz51434

Related Objects

Event Timeline

yuvipanda raised the priority of this task from High to Needs Triage.Jan 12 2015, 7:52 AM

We have shinken!

Not going to happen. We will probably end up doing some monitoring as part of the service manifests work, however.

Restricted Application added a subscriber: StudiesWorld.
Matthewrbowker renamed this task from Setup an icinga instance to monitor tools on tool-labs to Implement a system to montior tools on tool-labs.Sep 14 2016, 6:10 PM
Matthewrbowker reopened this task as Open.
Matthewrbowker claimed this task.
Matthewrbowker triaged this task as Medium priority.

I am re-opening this ticket and taking it on in my capacity as a volunteer.

Icinga is not a given as the solution for this, so I've also generalized the title. @yuvipanda wants to look at "prometheus blackbox_exporter + alertmanager".

Aklapper renamed this task from Implement a system to montior tools on tool-labs to Implement a system to monitor tools on tool-labs.Nov 13 2016, 4:12 PM
Aklapper set Security to None.

Okay, after some examination here's what I'd like to propose. @yuvipanda this is subject to your OK.

I currently have a Labs project set up. For the short term, I'd like to set up icinga tied into LDAP. Monitoring would be set up with an email to me, manually configured. This will get something out there, functional and relatively useful.

In the long term, I'd like to create a custom management console and monitoring solution, tied to OAuth and written in PHP. This would use the Nagios monitoring plugins but have a custom front-end interface whereby people with Wikitech accounts could manage monitoring their tools. I have been unable to find a solution that fits that bill. This will take a while to code, but I believe this will be far more sustainable and usable in the long run.

Does this make sense? If not, feel free to contact me on IRC (nick: Matthew_) or post here.

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

General thought #1: There are too many monitoring systems already in place, and adding yet another one further increases maintenance effort, bus factor, etc. Icinga is good as it is also used in production, and tweaking it to use some OAuth backend is probably workable, but having a completely new application opens a can of worms :-). (IMHO; after the experience with Shinken.)

General thought #2: To assess whether it makes sense to put effort into this system or not, we should probably start with defining what is meant by "monitor". What functionality should the system offer that cannot be done better in a different way? (For example, instead of "monitoring" a webservice and alerting its maintainers if it does not respond (if that is a common problem), we could add an option to webservice that specifies a URL path that must return 200 and automatically restart the webservice if not. If the webservice is restarted more than x times per y minutes, we can alert the maintainers via mail. (Or we could leave out the restart and just alert them if it does not respond correctly.) This functionality would live somewhere in webservice, the proxy/Kubernetes, etc. Similarly, we can centralize a nag script that alerts maintainers once a day if a grid job is stuck in error state.)
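The restart-with-a-threshold idea in "general thought #2" can be sketched as a small poll loop. Everything here — the health URL, the `webservice restart` invocation, and the thresholds — is an illustrative assumption, not an existing Toolforge feature:

```python
"""Sketch of the health-check + bounded-restart idea: restart a webservice
when its health URL stops answering 200, and alert the maintainers once it
has been restarted more than `max_restarts` times in `window` seconds."""
import subprocess
import time
import urllib.request


def is_healthy(url, timeout=10):
    """True iff the configured health path answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def decide(healthy, restart_times, now, max_restarts=3, window=600):
    """One poll cycle's decision: 'ok', 'restarted', or 'alert'.

    `restart_times` holds timestamps of recent restarts; entries older
    than `window` seconds are dropped, and once `max_restarts` remain
    we stop restarting and alert instead."""
    if healthy:
        return "ok"
    restart_times[:] = [t for t in restart_times if now - t < window]
    if len(restart_times) >= max_restarts:
        return "alert"
    restart_times.append(now)
    return "restarted"


def poll_forever(url, interval=60):
    """Main loop: restart on failure, escalate when it keeps failing."""
    restarts = []
    while True:
        action = decide(is_healthy(url), restarts, time.time())
        if action == "restarted":
            subprocess.run(["webservice", "restart"], check=False)
        elif action == "alert":
            pass  # mail the tool's maintainers here
        time.sleep(interval)
```

The pure `decide` function keeps the windowing logic separate from the network and process side effects, which also makes the policy easy to test.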

General thought #3: If a monitoring system should be added that is not Icinga, it should probably be part of Striker (https://toolsadmin.wikimedia.org/).

What about the bots? Many of the current (pseudo-)monitoring services are for webservices (check 200, etc.), but for bots, do we have anything at all?

@zhuyifei1999: Depends on the definition of monitoring. If a bot is started by bigbrother and jstart, if it fails, it will be restarted a couple of times, each time with a mail to the maintainers. But this will monitor the process not failing, i. e. if the process is "stuck", it won't notice that type of failure.

But this will monitor the process not failing, i. e. if the process is "stuck", it won't notice that type of failure.

Exactly. Things like T145633: Deadlock can be caused by raising SpamfilterError in site.editpage() happen. Also, when a periodic task submitted via jsub in cron fails, no one knows until someone checks the logs.

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).
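The "check-ping" idea above is essentially a dead man's switch: each bot task pings on a configurable interval, and a checker flags tasks that have gone quiet. A minimal sketch (task names and intervals are illustrative):

```python
"""Dead-man's-switch sketch: bots call ping() periodically; a checker run
from cron calls overdue() and alerts maintainers of any task listed."""
import time


class PingMonitor:
    def __init__(self):
        self.last_ping = {}   # task name -> unix timestamp of last ping
        self.interval = {}    # task name -> max allowed silence (seconds)

    def register(self, task, interval):
        """Declare a task and the longest silence it is allowed."""
        self.interval[task] = interval

    def ping(self, task, now=None):
        """Record a heartbeat from the bot (e.g. after each edit)."""
        self.last_ping[task] = time.time() if now is None else now

    def overdue(self, now=None):
        """Tasks whose last ping is older than their configured interval;
        a task that never pinged at all is also reported."""
        now = time.time() if now is None else now
        return sorted(
            task for task, limit in self.interval.items()
            if now - self.last_ping.get(task, 0) > limit
        )
```

Checking the time of the latest matching edit instead would replace `ping()` with an API query, but the overdue logic stays the same.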

Hi, all. Apologies about the delay, I didn't see emails related to this task for some reason...

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

Thank you for the information. Does that include information about Tools in Tool Labs?

General thought #1: There are too many monitoring systems already in place, and adding yet another one further increases maintenance effort, bus factor, etc. Icinga is good as it is also used in production, and tweaking it to use some OAuth backend is probably workable, but having a completely new application opens a can of worms :-). (IMHO; after the experience with Shinken.)

You do raise a fair point. Another idea I can play with (and I've actually already played with it at work) is providing a front-end interface for Icinga management or something of the sort. Again, the key is a low barrier to entry for tool developers who simply want to know if a tool is up or down. As far as I can tell, there is no set procedure to set up tools to be monitored with Shinken.

General thought #2: To assess whether it makes sense to put effort into this system or not, we should probably start with defining what is meant by "monitor". What functionality should the system offer that cannot be done better in a different way? (For example, instead of "monitoring" a webservice and alerting its maintainers if it does not respond (if that is a common problem), we could add an option to webservice that specifies a URL path that must return 200 and automatically restart the webservice if not. If the webservice is restarted more than x times per y minutes, we can alert the maintainers via mail. (Or we could leave out the restart and just alert them if it does not respond correctly.) This functionality would live somewhere in webservice, the proxy/Kubernetes, etc. Similarly, we can centralize a nag script that alerts maintainers once a day if a grid job is stuck in error state.)

I define a monitor as a software check that determines whether a given piece of software is working correctly.

For a Tool Labs tool (http://tools.wmflabs.org), a webservice check should be sufficient. Icinga provides one in its base package.

For a Labs instance, a host alive check and a ping check is possible right out of the gate. I can do more if people need more.

General thought #3: If a monitoring system should be added that is not Icinga, it should probably be part of Striker (https://toolsadmin.wikimedia.org/).

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

What about the bots? Many of the current (pseudo-)monitoring services are for webservices (check 200, etc.), but for bots, do we have anything at all?

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).

Something can definitely be coded there, Icinga and the Nagios plugins are very flexible. We could also do pings from a bot script or from IRC, or indeed check edit summaries, although the latter will be harder.

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

@bd808

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

The best way to discuss adding something to Striker is in a phab ticket associated with the Striker project.

In the long term, I'd like to create a custom management console and monitoring solution, tied to OAuth and written in PHP. This would use the Nagios monitoring plugins but have a custom front-end interface whereby people with Wikitech accounts could manage monitoring their tools. I have been unable to find a solution that fits that bill. This will take a while to code, but I believe this will be far more sustainable and usable in the long run.

Striker is Python rather than PHP, but it does provide authentication for Labs users. Its current authorization layer only knows about Tool Labs tool membership, but that may be fixable. Wikitech supports OAuth authentication that could be used in a tool or Labs project, but an authorization layer would have to be developed separately.

The universe is full of FLOSS system monitoring tools. Nearly every one of them was started because the author found all other tools lacking and set out to create a better solution rather than improving an existing tool. I can see the utility in making some helper functionality to make configuring an existing monitoring system easier for Labs. I can not see the utility in adding to the total number of monitoring tools available in the universe.

[…]

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

Thank you for the information. Does that include information about Tools in Tool Labs?

AFAIUI: No.

[…]
I define a monitor as a software check that determines whether a given piece of software is working correctly.

For a Tool Labs tool (http://tools.wmflabs.org), a webservice check should be sufficient. Icinga provides one in its base package.

For a Labs instance, a host alive check and a ping check is possible right out of the gate. I can do more if people need more.

For Labs instances we already have an ssh check via Shinken which is effectively alive and ping.

[…]

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).

Something can definitely be coded there, Icinga and the Nagios plugins are very flexible. We could also do pings from a bot script or from IRC, or indeed check edit summaries, although the latter will be harder.

One major problem with any self-serve (monitoring) solution is that users must be treated as potentially hostile. So, for example, you can't just use simple Nagios plugins for webservices, but must check that the URL "belongs" to the user. Similarly, users must not be able to interfere with each other's tools.

When webservices were first introduced, on failure they would just stop working, with the idea that maintainers would then come along, fix any issues and restart the webservice. IIRC users then wanted webservices to restart automatically, because that was all they would do anyway when they encountered a failed webservice.

I assume that bot operators would act in the same way, so I think that a pattern for bots would be more useful, e. g. start the bot with bigbrother, on every edit touch a file ~/.bot-watchdog and have a cron job every hour/day that tests whether ~/.bot-watchdog has been touched in the past x hours and, if not, delete the grid job and let it be restarted by bigbrother.

Hello!

My apologies for the delay.

Based on this information, I'm going to split this task into two parts. The first part will be just for Tool Labs; the second will be for Labs as a whole. I will begin with Tool Labs only, as this appears to be the less involved of the two...

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

@scfc my thought would be just start with monitoring. Automated restart can be handled down the line.

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

Totally possible, yes. As I mentioned in T53434#2909937, open a ticket with the rough ideas and we can iterate from there to figure out what would be needed to create the integration. The trickiest part may be securing authentication between Striker and a Labs project hosting the monitor.

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

There's a tool for this! https://tools.wmflabs.org/gridengine-status/ dumps out a json blob that provides the same information as https://tools.wmflabs.org/?status. The tool that was built for the Precise migration should give you an idea of how you can consume it.

The 'is my webservice up' and 'is my job running' checks are probably a good place to start. Longer term some sort of liveness checks would be more awesome. The need for any of this may magically disappear with a proper Kubernetes based PaaS (T136264: Evaluate Kubernetes based workflow replacement options for SGE) as Kubernetes has built in support for per 'pod' liveness checking, but that's no reason to block trying to find a solution now. I have a feeling that even after we have chosen and deployed a PaaS it will take quite a while to get everyone migrated over to using it.
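An "is my job running" check on top of the gridengine-status JSON dump mentioned above might look like the sketch below. The exact shape of the blob is an assumption here (a mapping of tools to their running jobs); adapt the accessors to the real output:

```python
"""Sketch of an 'is my job running' check against the gridengine-status
JSON dump. The {"tools": {tool: [{"name": ...}, ...]}} layout is an
illustrative assumption, not the documented format."""
import json
import urllib.request

STATUS_URL = "https://tools.wmflabs.org/gridengine-status/"


def fetch_status(url=STATUS_URL):
    """Download and parse the status blob."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def job_running(status, tool, job_name):
    """True if `tool` currently has a job called `job_name` in the dump."""
    jobs = status.get("tools", {}).get(tool, [])
    return any(job.get("name") == job_name for job in jobs)
```

A monitoring cron job would call `fetch_status()` once and then check all of a maintainer's jobs against the same parsed blob.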

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? if that answer is yes, I'll create a task to discuss specifics. This will handle @scfc 's issues with regard to user input.

Totally possible, yes. As I mentioned in T53434#2909937, open a ticket with the rough ideas and we can iterate from there to figure out what would be needed to create the integration. The trickiest part may be securing authentication between Striker and a Labs project hosting the monitor.

Done, see T157847: Preparation for api for community-labs-monitoring

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

There's a tool for this! https://tools.wmflabs.org/gridengine-status/ dumps out a json blob that provides the same information as https://tools.wmflabs.org/?status. The tool that was built for the Precise migration should give you an idea of how you can consume it.

The 'is my webservice up' and 'is my job running' checks are probably a good place to start. Longer term some sort of liveness checks would be more awesome. The need for any of this may magically disappear with a proper Kubernetes based PaaS (T136264: Evaluate Kubernetes based workflow replacement options for SGE) as Kubernetes has built in support for per 'pod' liveness checking, but that's no reason to block trying to find a solution now. I have a feeling that even after we have chosen and deployed a PaaS it will take quite a while to get everyone migrated over to using it.

Sounds good! Thank you for the information.

I'm looking for information on how tools-prometheus-01 and tools-prometheus-02 work. The only documentation I've found was this task and a small section in Wikitech about monitoring in the Kubernetes cluster.

I see both nodes are up and actively collecting metrics. Any help is welcome, and sorry to hijack this task to ask for information, but it seems the solution proposed here was already implemented to some extent.

I've found a presentation that says the Toolforge Prometheus instances were used as a testbed for ideas before implementing the production ones. So I think the main Prometheus page in Wikitech applies then. It doesn't talk a lot about Toolforge but I think it's a starting point. If anyone remembers something that's special/different about it when compared to Production, please let me know.

The Cortex project and community have been very active too; it looks like it could be a good fit for multi-tenant monitoring based on tools we already use.

https://github.com/cortexproject/cortex

Bstorm subscribed.

With the exception of alerting, https://k8s-status.toolforge.org/ and the namespaced dashboards it links to (like https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?var-namespace=tool-pywikibot-testwiki) fulfill a lot of this task (as well as the aforementioned https://sge-status.toolforge.org/).

A self-serve alerting dashboard is pretty far from where we are now. Honestly, production doesn't even have anything approaching that, since it's really just all puppetized config. Accepting that a larger Thanos or Cortex system with Alertmanager for Cloud-VPS is a somewhat different scope that was never really defined in this ticket, I'd like to propose closing this as historical and opening tickets for the things we are currently examining building.

Bstorm renamed this task from Implement a system to monitor tools on tool-labs to Establish an internal system or a recommended external system for monitoring user-created Toolforge web services.Oct 21 2020, 4:19 PM
Bstorm updated the task description. (Show Details)
Bstorm removed a subscriber: JHedden.

Updated the task to reflect the ask more specifically with the option of simply suggesting methods of doing this if we don't find the resources and time to build it (or before we do).

Copying my comment from T278097, and requesting that this be given a higher priority.

I'd just like to add my support to this idea. Many of the tools that run in toolforge are critical parts of the technical infrastructure that keeps the project going. They deserve all the normal logging, alerting and monitoring support that any serious production system has.

I'd love to see something like https://en.wikipedia.org/wiki/Graphite_(software) set up that any tool could easily feed performance data to and tool maintainers could build their own dashboards. There's really no reason for each tool developer to reinvent the wheel on this kind of stuff.

I'd love to see something like https://en.wikipedia.org/wiki/Graphite_(software) set up that any tool could easily feed performance data to and tool maintainers could build their own dashboards. There's really no reason for each tool developer to reinvent the wheel on this kind of stuff.

For what it’s worth you can already do the Graphite part − see e.g. what I did in T279236: Add timings/instrumentation, and the documentation I added at https://wikitech.wikimedia.org/wiki/Statsd#Use_in_Cloud_Services_environment. What I was told back then, however, was more or less “it will probably work for now, but Graphite/statsd will go away eventually”.

As for dashboards, one can always then use https://graphite-labs.wikimedia.org; as for Grafana, see e.g. T295296.

Interesting, I'll give that a look. The last time I asked about this, the answer was essentially, "There is such a system, but it's for official WMF use only, not for toolforge". Which made me sad.

@JeanFred do you have some example code where you use this? I tried

echo -n "spi-tools.test.foo:99|c" | nc -w 1 -u cloudmetrics1001.eqiad.wmnet 8125

and when I went to https://graphite-labs.wikimedia.org/, I expected to see the "spi-tools.test.foo" metric listed in the tree listing, but it's not there. Any suggestions?
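For reference, the statsd plaintext format used by the `nc` one-liner above ("<metric>:<value>|c" for a counter) is simple enough to emit directly from Python. The host and port are copied from that shell line and are Cloud-VPS-specific assumptions:

```python
"""Minimal statsd counter increment over UDP, equivalent to the nc
one-liner above. statsd never acknowledges, so a wrong host or a typo in
the metric name fails silently (which can explain metrics never showing
up in the Graphite tree)."""
import socket


def format_counter(metric, value=1):
    """Build the statsd plaintext payload for a counter increment."""
    return f"{metric}:{value}|c".encode("ascii")


def send_counter(metric, value=1,
                 host="cloudmetrics1001.eqiad.wmnet", port=8125):
    """Fire-and-forget UDP send of one counter sample."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(metric, value), (host, port))
```

Because the transport is fire-and-forget UDP, a missing metric is more likely a routing, naming, or aggregation issue than a send failure.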

@RoySmith mobile atm, but before I forget, is https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-spi-tools-dev useful to you at all?

(You can even set up alerts, if you make a copy of the dashboard for your namespace — just don't pipe them to AlertManager, but if you've got something which will listen to a webhook..)

When I try to log into grafana using my toolsadmin credentials, I get "Invalid username or password"

Might this explain the credential issues? https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Editing_dashboards. You need the right ldap group, all of which requires NDA. I believe T295296 mentions this. I can help with https://wikitech.wikimedia.org/wiki/Volunteer_NDA if this is the only blocker.

We can implement this soon using metricsinfra for the alerts, the HTTP/TCP liveness probes plus Kubernetes data for the monitoring collection, and Grafana dashboards for the trends.