
Reindexing causes nagios alerts
Closed, Resolved, Public

Description

We have chronic problems with the search indexing process causing spurious alerts and (worse) downtime. As a result, the alerts that we get are kinda useless. Let's fix that.


Version: wmf-deployment
Severity: normal

Details

Reference
bz47141

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:30 AM
bzimport set Reference to bz47141.
bzimport added a subscriber: Unknown Object (MLST).

What kind of errors? I wrote the Nagios plugin that does the Lucene check so I can probably help there.

We're not having many Icinga alerts at the moment, but there are probably regular brief overload events that do not result in an alert.

Basically, the problem is that when a new index is swapped in after an rsync, most of it is not in the kernel cache, so many requests are slow. This results in client-side timeouts, and thus a flood of "zero results" pages delivered to users.
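To make that failure mode concrete, here is a minimal sketch of what "warming" a freshly copied index could look like: read every index file sequentially so the kernel caches its pages before the index is swapped in. This is only an illustration, not lucene-search-2's actual warmup code; the default path and buffer size are made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustration only: sequentially read a freshly rsync'ed index directory so
// its pages land in the kernel page cache before the index is swapped in.
// The default directory below is hypothetical.
public class IndexPrewarm {
    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "/tmp/enwiki.new");
        byte[] buf = new byte[1 << 20]; // 1 MiB read buffer
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                long bytes = 0;
                try (InputStream in = Files.newInputStream(file)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes += n; // data is discarded; the point is to fault pages in
                    }
                }
                System.out.printf("warmed %s (%d bytes)%n", file.getFileName(), bytes);
            }
        }
    }
}
```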

The deployment of PoolCounter for search probably helped with this, since it would avoid using up all search workers, so there would still be a couple left to service Icinga requests. Also, the use of PoolCounter avoids exhausting Apache MaxClients. This removes the need to set an aggressively low client-side timeout, so part of the solution here may be just to increase the timeout, allowing the cluster to ride out short periods of slowness by queueing.
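For readers unfamiliar with PoolCounter: it is a separate counting/lock service, but the idea as it applies to search can be sketched in-process with a fair semaphore. Cap concurrent searches below the number of workers so a few stay free for monitoring probes, and let excess requests queue for a bounded time instead of failing immediately. The pool size and wait values below are illustrative, not production settings.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Single-process sketch of the PoolCounter idea (the real PoolCounter is a
// separate network service): cap concurrent search work below the total
// worker count so some workers stay free for monitoring checks, and let
// excess requests queue briefly instead of timing out right away.
public class SearchPool {
    private final Semaphore slots;
    private final long maxQueueWaitMs;

    public SearchPool(int concurrentSearches, long maxQueueWaitMs) {
        this.slots = new Semaphore(concurrentSearches, true); // fair = FIFO queueing
        this.maxQueueWaitMs = maxQueueWaitMs;
    }

    public <T> T run(Callable<T> search) throws Exception {
        // Wait a bounded time for a slot, riding out short bursts of slowness.
        if (!slots.tryAcquire(maxQueueWaitMs, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("search pool full, shedding request");
        }
        try {
            return search.call();
        } finally {
            slots.release();
        }
    }
}
```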

There is also a "warmup" feature in lucene-search-2 which probably needs to be configured correctly.

Chad is about to commit a patch for logging of client-side timeouts, so that will give us more visibility into the scale of the problem.

ram wrote:

That sounds good. That gives us a way to tell if any fix we implement has made things better, made them worse, or had no effect at all.

ram wrote:

As I mentioned at the platform meeting this week, we are seeing a number of failures in the log files about not being able to connect to the RMI Registry
on various hosts. The exceptions look like this:


2013-04-17 00:00:48,681 [pool-1-thread-64] WARN org.wikimedia.lsearch.interoperability.RMIMessengerClient - Cannot contact RMI registry for host search1020 : error during JRMP connection establishment; nested exception is:
    java.net.SocketTimeoutException: Read timed out
2013-04-17 00:00:48,914 [pool-1-thread-41] WARN org.wikimedia.lsearch.interoperability.RMIMessengerClient - Cannot contact RMI registry for host search1020 : error during JRMP connection establishment; nested exception is:
    java.net.SocketTimeoutException: Read timed out

While a plausible explanation for client-side timeouts is, as Tim suggests, that searches are having to access disk, failing to even connect to the RMI Registry is indicative of a different issue -- the system being so busy that it is not able to accept a connection.
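For anyone wanting to reproduce this by hand, a small probe along the following lines would distinguish "cannot even accept a TCP connection" from "accepts the connection but the registry call stalls". The host, default port, and timeout below are example values only, not anything taken from the production config.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// Illustrative probe: check TCP reachability of the RMI registry port with an
// explicit timeout, then attempt a real registry round trip (roughly where the
// "error during JRMP connection establishment" in the logs above occurs).
public class RmiRegistryProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "search1020";
        int port = Registry.REGISTRY_PORT; // 1099 by default
        int connectTimeoutMs = 2000;       // example value

        // Step 1: raw TCP connect. If this times out, the box is too busy
        // (or unreachable) to accept connections at all.
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            System.out.println("TCP connect to " + host + ":" + port + " ok");
        }

        // Step 2: an actual registry call, which exercises the JRMP handshake.
        Registry registry = LocateRegistry.getRegistry(host, port);
        String[] names = registry.list();
        System.out.println("registry exports " + names.length + " names");
    }
}
```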

So I wrote a script to do some rudimentary log analysis; the results are appended below for search1015 and search1016. The histogram shows the number of these failures on an hour-by-hour basis. Notice that we see a spike around 0500 which only abates around 3 to 4 hours later. This is consistent with the failures being related to rsync'ing new snapshots since snapshots are generated every day at 0430. However, there is a second spike around 0700 as well that is unexplained.

The numbers are way higher for Apr-15 since I asked Peter to drop the log level from WARN to TRACE over the weekend, resulting in substantially larger log files (~11GB versus 370MB normally); I'm guessing that having to log so much data itself caused additional failures.

All these failures involve attempts to reach search1019 or search1020, though why there are many more failures for the latter than the former (counts shown at the end of each histogram) is another item needing investigation.
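The analysis script itself is not attached; a rough sketch of this kind of per-hour and per-host tally might look like the following. Only the message text and timestamp format come from the log excerpt above; the default file name and everything else are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Illustrative tally of "Cannot contact RMI registry" warnings, bucketed by
// hour of day and by target host. Timestamps match the excerpt above
// (yyyy-MM-dd HH:mm:ss,SSS); the default log path is hypothetical.
public class RmiFailureHistogram {
    private static final Pattern LINE = Pattern.compile(
            "^\\d{4}-\\d{2}-\\d{2} (\\d{2}):.*Cannot contact RMI registry for host (\\S+)");

    public static void main(String[] args) throws IOException {
        Path log = Paths.get(args.length > 0 ? args[0] : "lsearchd.log");
        Map<String, Integer> byHour = new TreeMap<>();
        Map<String, Integer> byHost = new TreeMap<>();

        try (Stream<String> lines = Files.lines(log)) {
            lines.forEach(line -> {
                Matcher m = LINE.matcher(line);
                if (m.find()) {
                    byHour.merge(m.group(1), 1, Integer::sum); // hour of day, "00".."23"
                    byHost.merge(m.group(2), 1, Integer::sum); // e.g. search1020
                }
            });
        }

        byHour.forEach((hour, count) -> System.out.printf("%s - : %d%n", hour, count));
        byHost.forEach((host, count) -> System.out.printf("%s: %d%n", host, count));
    }
}
```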

Histograms of RMI Registry failures on search1015

Hr.    Apr-16  Apr-15
 0          0     217
 1          7     142
 2         12     267
 3          3     333
 4         63     504
 5        599    1782
 6        141     684
 7        314     864
 8         34     245
 9         47      40
10         10      64
11         36      21
12         10      74
13         32     371
14         71     447
15         50     666
16         65     827
17        465     451
18         66     589
19        176       7
20         79      16
21         97      18
22        144      28
23         81       3
Total    2602    8660
search1020: 2210
search1019: 390

Histograms of RMI Registry failures on search1016

Hr.    Apr-16  Apr-15
 0          0     208
 1         10     166
 2         23     338
 3          0     351
 4         71     485
 5        621    1677
 6        131     766
 7        325     951
 8         23     241
 9         40      41
10          9      57
11         28      24
12          6      74
13         28     388
14         65     479
15         40     677
16         65     827
17        441     473
18         61     114
19        177      10
20         94      28
21         96      21
22        180      20
23         89       0
Total    2623    8416
search1020: 2239
search1019: 382

ram wrote:

Correction: "second spike around 0700" should read "second spike around 1700", sorry about that.

(In reply to comment #2 by Tim Starling)

Chad is about to commit a patch for logging of client-side timeouts, so that
will give us more visibility into the scale of the problem.

Chad: Has this happened?

(In reply to comment #4 by Ram)

As I mentioned at the platform meeting this week, we are seeing a number of
failures in the log files about not being able to connect to the RMI Registry
on various hosts.

Does anybody know if that's still the case?

Ram is no longer contracting for the WMF. Resetting assignee.

Logging stuff for this went out a long time ago. It's far from perfect, but has allowed us to limp along. Along with other fixes, we're no longer waking up every night trying to keep search from exploding.

I'm marking this FIXED, as enough's been done that we don't need to do anything here at this point.

Cirrus/Elasticsearch has all kinds of new infrastructure with its own monitoring and so forth, so nothing here is really relevant anymore.