CirrusSearch: Improve elasticsearch monitoring
Closed, Resolved (Public)

Description

Right now Icinga emits a huge blob of JSON whenever there is an Elasticsearch problem, which is difficult to read.
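To make the complaint concrete, here is a minimal sketch of a check that reduces the JSON to a single readable status line. It assumes the input is the body of an Elasticsearch `_cluster/health` response; the function name `summarize_health` and the mapping of green/yellow/red to OK/WARNING/CRITICAL are illustrative assumptions, not the check that was actually deployed.

```python
import json

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed mapping from cluster status colour to check state.
STATUS_TO_STATE = {"green": OK, "yellow": WARNING, "red": CRITICAL}


def summarize_health(raw_json):
    """Return (exit_code, one_line_message) for a cluster health response body."""
    try:
        health = json.loads(raw_json)
    except ValueError:
        return UNKNOWN, "UNKNOWN - could not parse cluster health response"
    if not isinstance(health, dict):
        return UNKNOWN, "UNKNOWN - unexpected cluster health response"

    status = health.get("status")
    state = STATUS_TO_STATE.get(status, UNKNOWN)
    message = "{} - cluster {} is {}: {} nodes, {} active shards, {} unassigned".format(
        LABELS[state],
        health.get("cluster_name", "?"),
        status,
        health.get("number_of_nodes", "?"),
        health.get("active_shards", "?"),
        health.get("unassigned_shards", "?"),
    )
    return state, message
```

A wrapper script would print the message and exit with the returned code, so the alert shows one line instead of the raw response.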


Version: unspecified
Severity: normal
See Also:
T62979: CirrusSearch: Move Elasticsearch "search groups" monitoring from cluster level to node level
T64077: CirrusSearch: Add monitoring for slow log

Details

Reference
bz57210

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 2:20 AM)
bzimport added a project: CirrusSearch.
bzimport set Reference to bz57210.
bzimport added a subscriber: Unknown Object (MLST).

Also, we should warn if there are ever fewer than 3 Lucene indexes active per shard.

It'd be nice if this could detect a split brain as well.
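One way a split-brain check could work, sketched under assumptions: ask every node which master it currently sees (for example from the `master_node` field of its own view of the cluster state) and alert if the answers disagree. The function name `detect_split_brain` and the input shape are hypothetical, and how the per-host answers are collected is left out.

```python
from collections import defaultdict

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def detect_split_brain(master_by_host):
    """master_by_host maps each queried host to the master node id it
    reports, or to None if the host could not be read."""
    hosts_by_master = defaultdict(list)
    unreachable = []
    for host, master in sorted(master_by_host.items()):
        if master is None:
            unreachable.append(host)
        else:
            hosts_by_master[master].append(host)

    if len(hosts_by_master) > 1:
        # Different hosts believe in different masters.
        views = "; ".join(
            "{} seen by {}".format(master, ", ".join(hosts))
            for master, hosts in sorted(hosts_by_master.items())
        )
        return CRITICAL, "CRITICAL - possible split brain: " + views
    if not hosts_by_master:
        return UNKNOWN, "UNKNOWN - no host reported a master"

    master = next(iter(hosts_by_master))
    if unreachable:
        return WARNING, "WARNING - master is {}, but could not read: {}".format(
            master, ", ".join(unreachable)
        )
    return OK, "OK - all hosts agree on master {}".format(master)
```

Treating an unreadable host as a warning rather than OK is a design choice made here; a real check might weigh it differently.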

It'd be really nice if this warned on the Elasticsearch cluster as a whole rather than on individual hosts. It should still complain if it can't read a host, but issues that affect the whole cluster should be reported once, not once per host.

From Antoine:
There is a plugin to monitor clusters. Use case, doc, examples at:
http://docs.icinga.org/latest/en/clusters.html
https://www.nagios-plugins.org/doc/man/check_cluster.html

The idea is to create a service that is based on the result of other
services.
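As a rough illustration of that idea (not the check_cluster implementation, whose threshold syntax is richer), a cluster-level service can be derived by counting how many member checks are not OK and comparing the count against thresholds. The name `aggregate_states` and the plain integer thresholds are assumptions for this sketch.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def aggregate_states(states, warn_threshold, crit_threshold):
    """Combine per-host check exit codes into one cluster-level result.

    states: exit codes of the member checks (0 means OK).
    warn_threshold / crit_threshold: number of non-OK members at which
    the aggregate becomes WARNING / CRITICAL.
    """
    total = len(states)
    if total == 0:
        return UNKNOWN, "UNKNOWN - no member checks"

    non_ok = sum(1 for state in states if state != OK)
    if non_ok >= crit_threshold:
        result = CRITICAL
    elif non_ok >= warn_threshold:
        result = WARNING
    else:
        result = OK
    return result, "{} - {} of {} member checks not OK".format(
        LABELS[result], non_ok, total
    )
```

With thresholds of 1 and 2, a single failing host produces one cluster-level warning instead of a separate alert per host.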

Removing from the list of bugs required to reenable Cirrus as it was really for ops and ops doesn't seem to be jumping up and down about it. I'm leaving it filed as NORMAL and I've got the process started. We'll get this, but not before next week.

Is this bug (and its friends in see also) a blocker for expanding Cirrus on the wikis which were already indexed? It would be really nice to make it default on, say, all Wiktionaries or all Wikiquotes and see what happens to the load.

Two years later: Is this still a problem?

Nothing has changed here, so yes.

Deskana claimed this task.

Our monitoring has significantly improved in the past 3.5 years since this task was filed. I assume that this mostly satisfies the intent of this task, so I am closing as resolved.