Page MenuHomePhabricator

Some lines in zero, and mobile logs start within other log line (instead of beginning of line)
Closed, ResolvedPublic

Description

Since 2013-11-13 we have been seeing lines like

cp1046.eqiad.wmnet	1008323724	201cp3011.esams.wikimedia.org	6978832710	2013-11-15T19:29:25 [...]

in the mobile and zero logs. There, part of a log line (part past 201
in the third column) got overwritten by a new line
(cp3011.esams[...]). All 16 fields of this new line gets written,
followed by a line feed. The next line is a normal log line again.

For each file we had between 5 to 15 such lines.

We did not see such lines for the sampled-1000 logs.


Version: unspecified
Severity: normal

Details

Reference
bz58764

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:41 AM
bzimport set Reference to bz58764.

Checking for mangled lines which field got overwritten showed that the
probability of a field to get overwritten directly corresponded to the
field length.

~41% of the collisions happened for minutes 0, and 59.
~54% of the collisions happened for minutes 29, and 30.
So ~95% of the collisions happened at the times puppet is run.

It seems puppet injected a new udp2log filter configuration file and
restarted udp2log on oxygen (mobile and zero logs get written there)
upon some puppet runs. The culprit seems to be serializing the hash of
mobile hosts into the filter configuration without sorting them. So
upon each puppet run, the serialized string of mobile hosts stands a
chance of producing the hosts in a different order.

Since, commit 55ee2641ad2d74adf8b9d8353f63b87f6e007348 of the puppet
repo, the list gets sorted before serialization. Hence, the serialized
string should not change between puppet runs, if the mobile hosts do
not change.

After merging the above change, we do not see mangled log lines any
longer. Last mangled log lines are for 2013-12-16 17:30.

All other udp2log generated log files generated on oxygen have been
affected as well (e.g.: edit logs). But the problem is now fixed for
them as well.