
Establish an internal system or a recommended external system for monitoring user-created Toolforge web services
Open, MediumPublic

Description

The ask here is for opt-in monitoring of the web services that can alert a Toolforge maintainer when their web service is down, perhaps with trending as well.

If we don't implement something ourselves, a viable alternative may be to find an external service (or a set of them) that can be documented as options, with no guarantees or endorsement.

Details

Reference
bz51434

Related Objects

Event Timeline

yuvipanda raised the priority of this task from High to Needs Triage.Jan 12 2015, 7:52 AM

We have shinken!

Not going to happen. We will probably end up doing some monitoring as part of the service manifests work, however.

Restricted Application added a subscriber: StudiesWorld.
Matthewrbowker renamed this task from Setup an icinga instance to monitor tools on tool-labs to Implement a system to montior tools on tool-labs.Sep 14 2016, 6:10 PM
Matthewrbowker reopened this task as Open.
Matthewrbowker claimed this task.
Matthewrbowker triaged this task as Medium priority.

I am re-opening this ticket and taking it on in my capacity as a volunteer.

Icinga is not a given as the solution for this, so I've also generalized the title. @yuvipanda wants to look at "prometheus blackbox_exporter + alertmanager".

Aklapper renamed this task from Implement a system to montior tools on tool-labs to Implement a system to monitor tools on tool-labs.Nov 13 2016, 4:12 PM
Aklapper set Security to None.

Okay, after some examination here's what I'd like to propose. @yuvipanda this is subject to your OK.

I currently have a Labs project set up. For the short term, I'd like to set up icinga tied into LDAP. Monitoring would be set up with an email to me, manually configured. This will get something out there, functional and relatively useful.

In the long term, I'd like to create a custom management console and monitoring solution, tied to OAuth and written in PHP. This would use the Nagios monitoring plugins but have a custom front-end interface whereby people with Wikitech accounts could manage monitoring their tools. I have been unable to find a solution that fits that bill. This will take a while to code, but I believe this will be far more sustainable and usable in the long run.

Does this make sense? If not, feel free to contact me on IRC (nick: Matthew_) or post here.

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

General thought #1: There are too many monitoring systems already in place, and adding yet another one further increases maintenance effort, bus factor, etc. Icinga is good as it is also used in production, and tweaking it to use some OAuth backend is probably workable, but having a completely new application opens a can of worms :-). (IMHO; after the experience with Shinken.)

General thought #2: To assess whether it makes sense to put effort into this system or not, we should probably start with defining what is meant by "monitor". What functionality should the system offer that cannot be done better in a different way? (For example, instead of "monitoring" a webservice and alerting its maintainers if it does not respond (if that is a common problem), we could add an option to webservice that specifies a URL path that must return 200 and automatically restart the webservice if not. If the webservice is restarted more than x times per y minutes, we can alert the maintainers via mail. (Or we could leave out the restart and just alert them if it does not respond correctly.) This functionality would live somewhere in webservice, the proxy/Kubernetes, etc. Similarly, we can centralize a nag script that alerts maintainers once a day if a grid job is stuck in error state.)
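The restart-with-a-threshold idea in "general thought #2" can be sketched as a small poll loop. Everything here — the health URL, the `webservice restart` invocation, and the thresholds — is an illustrative assumption, not an existing Toolforge feature:

```python
"""Sketch of the health-check + bounded-restart idea: restart a webservice
when its health URL stops answering 200, and alert the maintainers once it
has been restarted more than `max_restarts` times in `window` seconds."""
import subprocess
import time
import urllib.request


def is_healthy(url, timeout=10):
    """True iff the configured health path answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def decide(healthy, restart_times, now, max_restarts=3, window=600):
    """One poll cycle's decision: 'ok', 'restarted', or 'alert'.

    `restart_times` holds timestamps of recent restarts; entries older
    than `window` seconds are dropped, and once `max_restarts` remain
    we stop restarting and alert instead."""
    if healthy:
        return "ok"
    restart_times[:] = [t for t in restart_times if now - t < window]
    if len(restart_times) >= max_restarts:
        return "alert"
    restart_times.append(now)
    return "restarted"


def poll_forever(url, interval=60):
    """Main loop: restart on failure, escalate when it keeps failing."""
    restarts = []
    while True:
        action = decide(is_healthy(url), restarts, time.time())
        if action == "restarted":
            subprocess.run(["webservice", "restart"], check=False)
        elif action == "alert":
            pass  # mail the tool's maintainers here
        time.sleep(interval)
```

The pure `decide` function keeps the windowing logic separate from the network and process side effects, which also makes the policy easy to test.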

General thought #3: If a monitoring system should be added that is not Icinga, it should probably be part of Striker (https://toolsadmin.wikimedia.org/).

What about the bots? Many of the current (pseudo-)monitoring services are for webservices (check 200, etc.), but for bots, do we have anything at all?

@zhuyifei1999: Depends on the definition of monitoring. If a bot is started by bigbrother and jstart, if it fails, it will be restarted a couple of times, each time with a mail to the maintainers. But this will monitor the process not failing, i. e. if the process is "stuck", it won't notice that type of failure.

But this will monitor the process not failing, i. e. if the process is "stuck", it won't notice that type of failure.

Exactly. Things like T145633: Deadlock can be caused by raising SpamfilterError in site.editpage() happen. Also, when a periodic task submitted via jsub in cron fails, no one knows until someone checks the logs.

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).
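The "check-ping" idea above is essentially a dead man's switch: each bot task pings on a configurable interval, and a checker flags tasks that have gone quiet. A minimal sketch (task names and intervals are illustrative):

```python
"""Dead-man's-switch sketch: bots call ping() periodically; a checker run
from cron calls overdue() and alerts maintainers of any task listed."""
import time


class PingMonitor:
    def __init__(self):
        self.last_ping = {}   # task name -> unix timestamp of last ping
        self.interval = {}    # task name -> max allowed silence (seconds)

    def register(self, task, interval):
        """Declare a task and the longest silence it is allowed."""
        self.interval[task] = interval

    def ping(self, task, now=None):
        """Record a heartbeat from the bot (e.g. after each edit)."""
        self.last_ping[task] = time.time() if now is None else now

    def overdue(self, now=None):
        """Tasks whose last ping is older than their configured interval;
        a task that never pinged at all is also reported."""
        now = time.time() if now is None else now
        return sorted(
            task for task, limit in self.interval.items()
            if now - self.last_ping.get(task, 0) > limit
        )
```

Checking the time of the latest matching edit instead would replace `ping()` with an API query, but the overdue logic stays the same.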

Hi, all. Apologies about the delay, I didn't see emails related to this task for some reason...

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

Thank you for the information. Does that include information about Tools in Tool Labs?

General thought #1: There are too many monitoring systems already in place, and adding yet another one further increases maintenance effort, bus factor, etc. Icinga is good as it is also used in production, and tweaking it to use some OAuth backend is probably workable, but having a completely new application opens a can of worms :-). (IMHO; after the experience with Shinken.)

You do raise a fair point. Another idea I can play with (and I've actually already played with it at work) is providing a front-end interface for Icinga management or something of the sort. Again, the key is a low barrier to entry for tool developers who simply want to know if a tool is up or down. As far as I can tell, there is no set procedure to set up tools to be monitored with Shinken.

General thought #2: To assess whether it makes sense to put effort into this system or not, we should probably start with defining what is meant by "monitor". What functionality should the system offer that cannot be done better in a different way? (For example, instead of "monitoring" a webservice and alerting its maintainers if it does not respond (if that is a common problem), we could add an option to webservice that specifies a URL path that must return 200 and automatically restart the webservice if not. If the webservice is restarted more than x times per y minutes, we can alert the maintainers via mail. (Or we could leave out the restart and just alert them if it does not respond correctly.) This functionality would live somewhere in webservice, the proxy/Kubernetes, etc. Similarly, we can centralize a nag script that alerts maintainers once a day if a grid job is stuck in error state.)

I define a monitor as a software check that determines whether a given piece of software is working correctly.

For a Tool Labs tool (http://tools.wmflabs.org), a webservice check should be sufficient. Icinga provides one in its base package.

For a Labs instance, a host alive check and a ping check is possible right out of the gate. I can do more if people need more.

General thought #3: If a monitoring system should be added that is not Icinga, it should probably be part of Striker (https://toolsadmin.wikimedia.org/).

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

What about the bots? Many of the current (pseudo-)monitoring services are for webservices (check 200, etc.), but for bots, do we have anything at all?

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).

Something can definitely be coded there, Icinga and the Nagios plugins are very flexible. We could also do pings from a bot script or from IRC, or indeed check edit summaries, although the latter will be harder.

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

@bd808

Could we do management as part of Striker? That would make this very easy, at least on the Tool Labs side. I don't quite know who to ask though...

The best way to discuss adding something to Striker is in a phab ticket associated with the Striker project.

In the long term, I'd like to create a custom management console and monitoring solution, tied to OAuth and written in PHP. This would use the Nagios monitoring plugins but have a custom front-end interface whereby people with Wikitech accounts could manage monitoring their tools. I have been unable to find a solution that fits that bill. This will take a while to code, but I believe this will be far more sustainable and usable in the long run.

Striker is Python rather than PHP, but it does provide authentication for Labs users. Its current authorization layer only knows about Tool Labs tool membership, but that may be fixable. Wikitech supports OAuth authentication that could be used in a tool or Labs project, but an authorization layer would have to be developed separately.

The universe is full of FLOSS system monitoring tools. Nearly every one of them was started because the author found all other tools lacking and set out to create a better solution rather than improving an existing tool. I can see the utility in making some helper functionality to make configuring an existing monitoring system easier for Labs. I can not see the utility in adding to the total number of monitoring tools available in the universe.

[…]

(Labs is slowly moving authoritative information about instances from LDAP to OpenStack, so if that affects your Icinga setup, https://gerrit.wikimedia.org/r/#/c/328611/2/modules/shinken/files/shinkengen is probably interesting to you.)

Thank you for the information. Does that include information about Tools in Tool Labs?

AFAIUI: No.

[…]
I define a monitor as a software check that determines whether a given piece of software is working correctly.

For a Tool Labs tool (http://tools.wmflabs.org), a webservice check should be sufficient. Icinga provides one in its base package.

For a Labs instance, a host alive check and a ping check is possible right out of the gate. I can do more if people need more.

For Labs instances we already have an ssh check via Shinken which is effectively alive and ping.

[…]

I'm looking for either some sort of check-ping system that expects periodic (configurable) pings from a bot script, or a check of the time of the latest edit with a summary matching a regex (many bots have multiple tasks).

Something can definitely be coded there, Icinga and the Nagios plugins are very flexible. We could also do pings from a bot script or from IRC, or indeed check edit summaries, although the latter will be harder.

One major problem with any self-serve (monitoring) solution is that users must be treated as potentially hostile. So, for example, you can't just use simple Nagios plugins for webservices, but must check that the URL "belongs" to the user. Similarly, users must not be able to interfere with each other's tools.

When webservices were first introduced, on failure they would just stop working, with the idea that maintainers would then come along, fix any issues and restart the webservice. IIRC users then wanted webservices to restart automatically, because that was all they would do anyway when they encountered a failed webservice.

I assume that bot operators would act in the same way, so I think that a pattern for bots would be more useful, e. g. start the bot with bigbrother, on every edit touch a file ~/.bot-watchdog and have a cron job every hour/day that tests whether ~/.bot-watchdog has been touched in the past x hours and, if not, delete the grid job and let it be restarted by bigbrother.

Hello!

My apologies for the delay.

Based on this information, I'm going to split this task into two parts. The first part will be just for Tool Labs; the second will be for Labs as a whole. I will begin with Tool Labs only, as this appears to be the less involved of the two...

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

@scfc my thought would be just start with monitoring. Automated restart can be handled down the line.

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? If the answer is yes, I'll create a task to discuss specifics. This will handle @scfc's issues with regard to user input.

Totally possible, yes. As I mentioned in T53434#2909937, open a ticket with the rough ideas and we can iterate from there to figure out what would be needed to create the integration. The trickiest part may be securing authentication between Striker and a Labs project hosting the monitor.

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

There's a tool for this! https://tools.wmflabs.org/gridengine-status/ dumps out a json blob that provides the same information as https://tools.wmflabs.org/?status. The tool that was built for the Precise migration should give you an idea of how you can consume it.

The 'is my webservice up' and 'is my job running' checks are probably a good place to start. Longer term some sort of liveness checks would be more awesome. The need for any of this may magically disappear with a proper Kubernetes based PaaS (T136264: Evaluate Kubernetes based workflow replacement options for SGE) as Kubernetes has built in support for per 'pod' liveness checking, but that's no reason to block trying to find a solution now. I have a feeling that even after we have chosen and deployed a PaaS it will take quite a while to get everyone migrated over to using it.
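An "is my job running" check on top of the gridengine-status JSON dump mentioned above might look like the sketch below. The exact shape of the blob is an assumption here (a mapping of tools to their running jobs); adapt the accessors to the real output:

```python
"""Sketch of an 'is my job running' check against the gridengine-status
JSON dump. The {"tools": {tool: [{"name": ...}, ...]}} layout is an
illustrative assumption, not the documented format."""
import json
import urllib.request

STATUS_URL = "https://tools.wmflabs.org/gridengine-status/"


def fetch_status(url=STATUS_URL):
    """Download and parse the status blob."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def job_running(status, tool, job_name):
    """True if `tool` currently has a job called `job_name` in the dump."""
    jobs = status.get("tools", {}).get(tool, [])
    return any(job.get("name") == job_name for job in jobs)
```

A monitoring cron job would call `fetch_status()` once and then check all of a maintainer's jobs against the same parsed blob.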

@bd808 if I provided some sort of CRUD API, could we use Striker as a front end? if that answer is yes, I'll create a task to discuss specifics. This will handle @scfc 's issues with regard to user input.

Totally possible, yes. As I mentioned in T53434#2909937, open a ticket with the rough ideas and we can iterate from there to figure out what would be needed to create the integration. The trickiest part may be securing authentication between Striker and a Labs project hosting the monitor.

Done, see T157847: Preparation for api for community-labs-monitoring

The question now is what monitoring would need to be implemented for Tool Labs? I understand web (HTTP), but what would need to be implemented for bots? Could I poll the job queue in some way? I'm not very familiar with jsub so...

There's a tool for this! https://tools.wmflabs.org/gridengine-status/ dumps out a json blob that provides the same information as https://tools.wmflabs.org/?status. The tool that was built for the Precise migration should give you an idea of how you can consume it.

The 'is my webservice up' and 'is my job running' checks are probably a good place to start. Longer term some sort of liveness checks would be more awesome. The need for any of this may magically disappear with a proper Kubernetes based PaaS (T136264: Evaluate Kubernetes based workflow replacement options for SGE) as Kubernetes has built in support for per 'pod' liveness checking, but that's no reason to block trying to find a solution now. I have a feeling that even after we have chosen and deployed a PaaS it will take quite a while to get everyone migrated over to using it.

Sounds good! Thank you for the information.

I'm looking for information on how tools-prometheus-01 and tools-prometheus-02 work. The only documentation I've found was this task and a small section in Wikitech about monitoring in the Kubernetes cluster.

I see both nodes are up and actively collecting metrics. Any help is welcome, and sorry to hijack this task to ask for information, but it seems the solution proposed here was already implemented to some extent.

I've found a presentation that says the Toolforge Prometheus instances were used as a testbed for ideas before implementing the production ones. So I think the main Prometheus page in Wikitech applies then. It doesn't talk a lot about Toolforge but I think it's a starting point. If anyone remembers something that's special/different about it when compared to Production, please let me know.

The Cortex project and community have been very active too; it looks like it could be a good fit for multi-tenant monitoring based on tools we already use.

https://github.com/cortexproject/cortex

Bstorm subscribed.

With the exception of alerting, https://k8s-status.toolforge.org/ and the namespaced dashboards it links to (like https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?var-namespace=tool-pywikibot-testwiki) fulfill a lot of this task (as well as the aforementioned https://sge-status.toolforge.org/).

A self-serve alerting dashboard is pretty far from where we are now. Honestly, production doesn't even have anything approaching that, since it's really just all puppetized config. Accepting that a larger Thanos or Cortex system with Alertmanager for Cloud-VPS is a somewhat different scope that was never really defined in this ticket, I'd like to propose closing this as historical and opening tickets for the things we are currently examining building.

Bstorm renamed this task from Implement a system to monitor tools on tool-labs to Establish an internal system or a recommended external system for monitoring user-created Toolforge web services.Oct 21 2020, 4:19 PM
Bstorm updated the task description. (Show Details)
Bstorm removed a subscriber: JHedden.

Updated the task to reflect the ask more specifically with the option of simply suggesting methods of doing this if we don't find the resources and time to build it (or before we do).

Copying my comment from T278097, and requesting that this be given a higher priority.

I'd just like to add my support to this idea. Many of the tools that run in toolforge are critical parts of the technical infrastructure that keeps the project going. They deserve all the normal logging, alerting and monitoring support that any serious production system has.

I'd love to see something like https://en.wikipedia.org/wiki/Graphite_(software) set up that any tool could easily feed performance data to and tool maintainers could build their own dashboards. There's really no reason for each tool developer to reinvent the wheel on this kind of stuff.

I'd love to see something like https://en.wikipedia.org/wiki/Graphite_(software) set up that any tool could easily feed performance data to and tool maintainers could build their own dashboards. There's really no reason for each tool developer to reinvent the wheel on this kind of stuff.

For what it’s worth you can already do the Graphite part − see e.g. what I did in T279236: Add timings/instrumentation, and the documentation I added at https://wikitech.wikimedia.org/wiki/Statsd#Use_in_Cloud_Services_environment. What I was told back then, however, was more or less “it will probably work for now, but Graphite/statsd will go away eventually”.

As for dashboards, one can always then use https://graphite-labs.wikimedia.org; as for Grafana, see e.g. T295296.

Interesting, I'll give that a look. The last time I asked about this, the answer was essentially, "There is such a system, but it's for official WMF use only, not for toolforge". Which made me sad.

@JeanFred do you have some example code where you use this? I tried

echo -n "spi-tools.test.foo:99|c" | nc -w 1 -u cloudmetrics1001.eqiad.wmnet 8125

and when I went to https://graphite-labs.wikimedia.org/, I expected to see the "spi-tools.test.foo" metric listed in the tree listing, but it's not there. Any suggestions?
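For reference, the statsd plaintext format used by the `nc` one-liner above ("<metric>:<value>|c" for a counter) is simple enough to emit directly from Python. The host and port are copied from that shell line and are Cloud-VPS-specific assumptions:

```python
"""Minimal statsd counter increment over UDP, equivalent to the nc
one-liner above. statsd never acknowledges, so a wrong host or a typo in
the metric name fails silently (which can explain metrics never showing
up in the Graphite tree)."""
import socket


def format_counter(metric, value=1):
    """Build the statsd plaintext payload for a counter increment."""
    return f"{metric}:{value}|c".encode("ascii")


def send_counter(metric, value=1,
                 host="cloudmetrics1001.eqiad.wmnet", port=8125):
    """Fire-and-forget UDP send of one counter sample."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(metric, value), (host, port))
```

Because the transport is fire-and-forget UDP, a missing metric is more likely a routing, naming, or aggregation issue than a send failure.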

@RoySmith mobile atm, but before I forget, is https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-spi-tools-dev useful to you at all?

(You can even set up alerts, if you make a copy of the dashboard for your namespace — just don't pipe them to AlertManager, but if you've got something which will listen to a webhook..)

When I try to log into grafana using my toolsadmin credentials, I get "Invalid username or password"

Might this explain the credential issues? https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Editing_dashboards. You need the right ldap group, all of which requires NDA. I believe T295296 mentions this. I can help with https://wikitech.wikimedia.org/wiki/Volunteer_NDA if this is the only blocker.

We can implement this soon using metricsinfra for the alerts, the HTTP/TCP liveness probes plus Kubernetes data for the monitoring collection, and Grafana dashboards for the trends.