Page MenuHomePhabricator

Webstatscollector's pagecounts file shaky between 2014-08-24 14:00 and 2014-08-27 21:00
Closed, ResolvedPublic

Description

TL;DR: For the files between 2014-08-24 14:00 and 2014-08-27 21:00
webstatscollector output are showing some irregularities. This affects
both the pagecounts and projectcount files from

https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/

and all services that process them.


The pagecount file 20140824 14:00 [1] did not show
irregularities. But the file for one day later show a drop of 80% for
a few pages.

As gadolinium (the host that is writing the pagecount files) showed a
high and still increasing process count for no longer needed services
(bug 70053), those services got turned off around 2014-08-26 19:00:00.

But although more resources were freed on gadolinium, its
webstatscollector's collector process degraded further. Since the
service did not recover (UDP Receive Buffers filled up again and
again, disks could not take the write load, and the service gathered
95GB of virtual memory since its last restart), the service got
restarted on 2014-08-28 ~15:32.

Since the restart did not relax the situation either, the service was
put on tmpfs 2014-08-28 ~19:48, which reduced load on the disks, and
made the service work again. The files starting at 2014-08-27 21:00
are good again.

Thanks ottomata for all those fixes!
More details are in the corresponding IRC channel logs [3].

Closer investigation of the files between 20140824 14:00 and
2014-08-27 21:00 is still pending.

[1] https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140824-140000.gz
[2] https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-210000.gz
[3] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140827.txt


Version: unspecified
Severity: normal

Details

Reference
bz70118

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:32 AM
bzimport set Reference to bz70118.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to christian from comment #0)

[...] restarted on 2014-08-28 ~15:32.

That timestamp is meant to read "2014-08-27 ~15:32." (one day before).

[...] put on tmpfs 2014-08-28 ~19:48, which [...]

That timestamp is meant to read "2014-08-27 ~19:48" (one day before).

Any ideas as to the root cause? Do you think this has been an ongoing problem that we haven't noticed?

(In reply to Toby Negrin from comment #2)

Any ideas as to the root cause?

Yes. Bug 70053.
Due to the increasing number of processes, less and less resources
were available for other processes, and at some point it started to
affect other services like webstatscollector.

Do you think this has been an ongoing
problem that we haven't noticed?

We have no official monitoring of webstatscollector.
So we have no way of telling.
Hence, we also have no way of being alerted about such issues in the
first place.

It would be a nice quarterly goal to have proper, official monitoring
in place ;-) That would allow us to answer such questions.

Mhmm ... that said ... no it's not an ongoing problem.
Data up to 2014-08-24 14:00 is within the usual expectations.

(In reply to christian from comment #0)

Closer investigation of the files between 20140824 14:00 and
2014-08-27 21:00 is still pending.

The ending of the interval is crisp:
The last bad file is:

https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-200000.gz

The first file that's good again is:

https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-210000.gz

But due to the lack of granular monitoring, and hard boundaries,
finding a crisp beginning of the interval of bad data did not work
out. The last known to be good file is:

https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140824-140000.gz

The first known to be bad file is:

https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140825-140000.gz