rsync errors to beta cluster, inconsistent state after scap
Closed, ResolvedPublic

Description

The beta cluster is serving old Echo JS code.
e.g.

http://bits.beta.wmflabs.org/static-master/extensions/Echo/modules/overlay/ext.echo.overlay.js

isn't the latest file. This makes user and browser tests invalid.

Erik Bernhardson says failures show up in deployment-bastion:/data/project/logs/scap.log and /var/log/syslog starting Aug 14 00:31:27.

E.g.
Aug 14 06:44:04 deployment-bastion rsyncd[1091]: rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection reset by peer (104)
Aug 14 06:44:04 deployment-bastion rsyncd[1091]: rsync error: error in rsync protocol data stream (code 12) at io.c(1532) [sender=3.0.9]


Version: unspecified
Severity: major

Details

Reference
bz69590

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:32 AM
bzimport added a project: Deployments.
bzimport set Reference to bz69590.

rsync is failing because the root volume is full on deployment-bastion.eqiad.wmflabs

Mukunda: if Antoine doesn't beat you to it, can you take a look into this?

moved /srv-old to /mnt/srv-old and freed up 2.1G

scap has resumed its normal schedule. /var is within 100M of having the same problem.
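
A minimal sketch of a check that would flag partitions close to filling up, as / and /var were here. The function name `full_mounts` and the 90% threshold in the usage line are illustrative assumptions, not something deployed on the beta cluster; the function only parses the POSIX `df -P` output format.

```shell
#!/bin/sh
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent.
full_mounts() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)              # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $6, $5
        }'
}
```

Usage would be `df -P | full_mounts 90`, which prints one line per mount point at or above 90% usage. Mount points containing spaces are not handled.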

I'm still not seeing new code making it from deployment-bastion to deployment-mediawiki01 though, so leaving the bug open

Erik has freed enough space by moving /srv-old which was in the root partition. Thank you!

Labs instances in eqiad have a 2GB /var/ which is often not large enough. There is 1.1GB in /var/log :-/

Top offenders:

538M /var/log/account/
335M /var/log/atop*.log
168M /var/log/diamond/
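
A listing like the one above can be produced with `du` and `sort`. This is a sketch and not necessarily the command that was actually run; `largest_entries` is a hypothetical name, and sizes are reported in KiB rather than the human-readable form shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the N largest entries directly under a directory,
# sizes in KiB, biggest first. $1 = directory, $2 = count (default 3).
largest_entries() {
    du -sk -- "$1"/* 2>/dev/null | sort -rn | head -n "${2:-3}"
}
```

For example, `largest_entries /var/log 3` would list the three largest files or directories under /var/log. Unlike the list above, it does not sum files matching a glob such as atop*.log.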

When diamond got enabled on labs, it had some full debug log being emitted. That was bug 66458 "Service diamond creates 500+ MByte /var/log/diamond/diamond.log". I have manually removed the old large logs.

I removed some archived files from /var/log/account/ but that will fill up quickly again.

Follow up bugs:

  • Bug 69601 - Log files on labs instance fill up disk (/var is only 2GB) (tracking)
    • Bug 69602 - diamond does not compress its logs
    • Bug 69604 - acct (process and login accounting) fill up instances /var/ partition
    • Bug 69605 - atop (monitoring system) logs fill up instances /var/ partition

There's something else going on as well:

ebernhardson@deployment-bastion:~$ dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php 2>/dev/null
deployment-bastion.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
deployment-jobrunner01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
deployment-mediawiki01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
deployment-mediawiki02.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
deployment-rsync01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
deployment-videoscaler01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php

That change was from yesterday, so I looked up a more recent merge. https://gerrit.wikimedia.org/r/#/c/154278/ was merged 15 minutes ago and also didn't rsync all the way out:

ebernhardson@deployment-bastion:~$ dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php 2>/dev/null
deployment-bastion.eqiad.wmflabs: d52e9791a81870af920eb199494d1795  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-jobrunner01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-mediawiki01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-mediawiki02.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-rsync01.eqiad.wmflabs: d52e9791a81870af920eb199494d1795  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-videoscaler01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
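
With more hosts, comparing checksums by eye gets error-prone. Below is a sketch that groups hosts by the checksum they report. It assumes every input line has the form `host: checksum  path`, as in the `dsh -M` transcripts above; `group_by_checksum` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: group hosts by the checksum they report.
# Reads lines of the form "host: md5  path" on stdin.
group_by_checksum() {
    awk '
        {
            host = $1
            sub(/:$/, "", host)             # drop the colon dsh -M appends
            hosts[$2] = (hosts[$2] == "" ? host : hosts[$2] " " host)
        }
        END {
            for (sum in hosts) print sum ": " hosts[sum]
        }' | sort
}
```

Piping the `dsh ... md5sum` output through `group_by_checksum` prints one line per distinct checksum. More than one line means the hosts disagree, which is the situation shown in both transcripts above.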

Reedy: since Bryan is out, can you look into this?

scap uses deployment-rsync01 as a proxy from which the application servers are instructed to pull.

A shortened version of the rsync command executed on deployment-rsync01 is:

rsync01$ rsync ... deployment-bastion.eqiad.wmflabs::common /usr/local/apache/common-local

And:

rsync01$ readlink -f /usr/local/apache/common-local
/usr/local/apache/common-local

That copy is up to date.

rsync01 also has a /srv/common-local directory which is out of date. The most recent file I found is from August 13th 21:13 UTC (there might be one a bit more recent).

I suspect the Apache servers sync from /srv/common-local instead of /usr/local/apache/common-local, or that /usr/local/apache/common-local should be a symlink to /srv/common-local.
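
One way to test that suspicion is to check whether the two paths resolve to the same location. A sketch, assuming a `readlink` that supports `-f` as used above; `same_target` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if both paths resolve to the same
# location, fail (exit 1) if they differ, exit 2 if either cannot be resolved.
same_target() {
    a=$(readlink -f -- "$1") || return 2
    b=$(readlink -f -- "$2") || return 2
    [ "$a" = "$b" ]
}
```

On rsync01, `same_target /usr/local/apache/common-local /srv/common-local || echo "paths diverge"` would report a divergence if /usr/local/apache/common-local is a real directory rather than a symlink, which is what the readlink output above indicates.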

Running puppet on deployment-mediawiki01 :

Error: Could not retrieve catalog from remote server: Error 400 on SERVER:

Duplicate declaration:
File[/usr/local/apache/common-local] is already declared in file
  /etc/puppet/modules/beta/manifests/common.pp:8;
cannot redeclare at
  /etc/puppet/modules/mediawiki/manifests/sync.pp:26 on node i-0000044e.eqiad.wmflabs

And that sounds familiar. So as usual, the issue lies in our configuration management which is not surprising.

The root cause is:

https://gerrit.wikimedia.org/r/#/c/153807/
mediawiki: create common-local directory

merged on Aug 13 22:28

It adds to puppet class mediawiki::sync :

+ file { '/usr/local/apache/common-local':
+ ensure => directory,
+ owner => 'mwdeploy',
+ group => 'mwdeploy',
+ mode => '0775',
+ }

On beta that should be a symbolic link as described in beta::common:

file { '/usr/local/apache/common-local':
    ensure => link,
    # Link to files managed by scap
    target => $::beta::config::scap_deploy_dir,
}

The change causes two issues:

  1. On deployment-rsync01, /usr/local/apache/common-local is no longer a symbolic link. scap updates that directory, but the application servers are instructed to pull from the other one (/srv/common-local), which is therefore stale.
  2. It breaks puppet on the application servers with a duplicate declaration.

Change 154329 had a related patch set uploaded by Hashar:
Revert "mediawiki: create common-local directory"

https://gerrit.wikimedia.org/r/154329

Cherry picked https://gerrit.wikimedia.org/r/154329 on beta cluster puppet master.

On rsync01 I have deleted all the content of /usr/local/apache/common-local and MANUALLY created a symbolic link to /srv/common-local

I then triggered a run of scap on beta via https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

Rerunning Erik Bernhardson's command:

hashar@deployment-bastion:~$ sudo -u mwdeploy dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php 2>/dev/null|cut -d\ -f-2
deployment-bastion.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-jobrunner01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-mediawiki01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-mediawiki02.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-rsync01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-videoscaler01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1

All good.

Lowering priority of the bug since it is hacked/manually fixed. I am leaving it open until the Gerrit change is reviewed / agreed / better solution found.

Change 154329 abandoned by Hashar:
Revert "mediawiki: create common-local directory"

Reason:
The paths have been reworked entirely in both prod and beta. We now use /srv/mediawiki/ and /srv/mediawiki-staging/

https://gerrit.wikimedia.org/r/154329

Did another changeset replace that one and it just wasn't linked or is there still more to do?

The patch I created ( https://gerrit.wikimedia.org/r/#/c/154329/ ) was a revert which I cherry picked on the beta cluster to immediately fix the scap issue.

Some other changes (can't find them) properly resolved the path issues. When abandoning the change, I simply forgot to close the bug :-]