
texvc failure due to missing MW cgroup
Closed, Resolved, Public

Description

More and more pages on dewiki show the error "Fehler beim Parsen(Unbekannter Fehler)" (English: Failed to parse(unknown error)).

After a purge, edit or null edit the error is gone and a PNG is shown, but it is also possible that after a purge the PNG is gone and the error is shown.

MathJax works; only the PNG option is affected.

Google has also indexed some of those pages:
https://www.google.de/#q=%22Fehler+beim+Parsen%22+site:de.wikipedia.org
https://www.google.de/#q=%22Failed+to+parse(unknown+error)%22+site:en.wikipedia.org

Please have a look. Thanks.


Version: wmf-deployment
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54367

Details

Reference
bz55709

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 2:34 AM)
bzimport set Reference to bz55709.
bzimport added a subscriber: Unknown Object (MLST).

It's not a thumb (or ops/shell) issue -- the output that I see is e.g.
<dl>
<dd><strong class='error'>Fehler beim Parsen(Unbekannter Fehler): \sigma\frown\psi:=(-1)^{pq}\psi(\sigma\circ\iota_{0\ldots q})\sigma\circ\iota_{q\ldots p}</strong></dd>
</dl>
so clearly it never gets to this point.

I quickly retracted that comment of mine on IRC last (European) night when I saw the code :) A little more investigation happened at #wikimedia-tech -- from what I can see on the SAL after I left, Tim found it was a missing cgroups issue on mw1035, mw1145, mw1078, mw1152, mw1150, which means the "texvc" invocation failed.

Tim apparently fixed the situation manually, so the effects of this bug should no longer be occurring, but we need a proper fix in place so this won't happen again. The cgroup issue is recurring (https://gerrit.wikimedia.org/r/#/c/83067/ was the latest attempt to fix it) and we have yet to find an optimal solution. On the plus side, there were a couple of additions on the logging side because of this...

(In reply to comment #3)

> I quickly retracted that comment of mine on IRC last (European) night when I
> saw the code :) A little more investigation happened at #wikimedia-tech --
> from what I can see on the SAL after I left, Tim found it was a missing
> cgroups issue on mw1035, mw1145, mw1078, mw1152, mw1150, which means the
> "texvc" invocation failed.

Actually, pretty much all of the apaches had the issue. Those 5 apaches didn't even have the cgroup filesystem mounted, indicating that cgconfig hadn't been started. On the rest of the apaches, cgconfig had been started but not mw-cgroup.
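The two states described above (cgroup filesystem not mounted at all, versus mounted but without the mediawiki group) can be told apart mechanically. The following is only a rough sketch: the mount point `/sys/fs/cgroup/memory` and the function name are assumptions for illustration, not something taken from the production setup.

```python
def classify_cgroup_state(mounts_text, group_exists,
                          mount_point="/sys/fs/cgroup/memory"):
    """Classify a host into one of the states described in the comment.

    mounts_text  -- contents of /proc/mounts
    group_exists -- True if <mount_point>/mediawiki is a directory
    mount_point  -- ASSUMED location of the memory controller mount
    """
    mounted = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, ...
        if (len(fields) >= 3 and fields[1] == mount_point
                and fields[2] == "cgroup"):
            mounted = True
            break
    if not mounted:
        # No cgroup filesystem: cgconfig was never started.
        return "cgconfig not started"
    if not group_exists:
        # Filesystem is there, but the mediawiki group was never created.
        return "mw-cgroup not started"
    return "ok"
```

On a host this would be fed the contents of `/proc/mounts` and the result of `os.path.isdir()` on the mediawiki directory under the assumed mount point.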

When this bug occurred just now on mw1109, I ran "initctl log-priority debug" then stopped and started cgconfig a couple of times. The logs showed the started/stopped events going to cgred, but not to mw-cgroup. When I edited mw-cgroup.conf with an irrelevant change, the syslog showed a configuration reload due to inotify, and after that, mw-cgroup was started and stopped as expected. So my suspicion is that at some point, something went wrong with cgconfig or mw-cgroup or both, which, in combination with a bug in upstart, caused mw-cgroup to stop receiving events.

379 out of 391 apaches show the old config when you run "initctl show-config mw-cgroup", i.e. without the "stop on" trigger. This is apparently because upstart config changes only take effect after the job is stopped or restarted.
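A count like the one above could be reproduced by collecting the "initctl show-config mw-cgroup" output from each host and checking for a "stop on" line. This is a minimal sketch; the helper names are hypothetical, and how the output is gathered from the hosts is left out.

```python
def has_stop_trigger(show_config_output):
    """Return True if the job description contains a "stop on" line."""
    for line in show_config_output.splitlines():
        if line.strip().startswith("stop on "):
            return True
    return False


def stale_hosts(outputs_by_host):
    """Return the sorted names of hosts whose loaded config lacks "stop on".

    outputs_by_host -- dict mapping host name to the captured output of
                       "initctl show-config mw-cgroup" on that host
    """
    return sorted(host for host, output in outputs_by_host.items()
                  if not has_stop_trigger(output))
```

Hosts reported by `stale_hosts()` would still be running with the old job definition until the job is stopped or restarted.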

The stop script of mw-cgroup will routinely fail since rmdir() fails with EBUSY if there are any tasks remaining in the cgroup. The cgdelete command can be used instead -- it attempts to move any tasks in the cgroup to the parent cgroup before executing the rmdir(). However, this is still prone to failure if new tasks are added to the cgroup while the cgdelete command is in progress:

cgdelete -r memory:mediawiki

cgdelete: cannot remove group 'mediawiki': Device or resource busy
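One way to narrow (but not close) this race is to repeat the drain-then-rmdir sequence a few times when rmdir() reports EBUSY. The sketch below is an illustration only, not what cgdelete or the mw-cgroup stop script does; the `drain` callable that moves tasks to the parent cgroup is a hypothetical helper supplied by the caller.

```python
import errno
import os
import time


def remove_cgroup(cgroup_dir, drain, rmdir=os.rmdir, attempts=5, delay=0.1):
    """Drain a cgroup and remove it, retrying while rmdir() reports EBUSY.

    drain -- callable(cgroup_dir) that moves the tasks in the group to
             the parent cgroup (HYPOTHETICAL helper, not shown here)
    rmdir -- injectable for testing; defaults to os.rmdir
    Returns the number of attempts used. Re-raises the error if it is
    not EBUSY, or if the group is still busy on the last attempt.
    """
    for attempt in range(1, attempts + 1):
        drain(cgroup_dir)
        try:
            rmdir(cgroup_dir)
            return attempt
        except OSError as exc:
            # EBUSY here means a task joined the group between
            # drain() and rmdir(), which is the race described above.
            if exc.errno != errno.EBUSY or attempt == attempts:
                raise
            time.sleep(delay)
```

A retry only makes failure less likely. As long as something keeps adding tasks to the group, the removal can still fail on every attempt.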

cgclear, which is run from the stop script of the cgconfig upstart job, is also prone to failing for the same reason. This would not be a problem if cgconfig and mw-cgroup were started on boot and never touched again, but of course the cgroup-bin package has a prerm trigger which stops the cgconfig job, causing breakage on upgrade. Says mw1109:

2013-09-09 18:41:49 upgrade cgroup-bin 0.37.1-1ubuntu10 0.37.1-1ubuntu10.1
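To correlate breakage with package upgrades on other hosts, log lines of the shape quoted above can be extracted from the dpkg log. This is a small sketch that assumes every upgrade line has the same six whitespace-separated fields as the one shown; the function name is made up for illustration.

```python
from datetime import datetime


def package_upgrades(dpkg_log_text, package="cgroup-bin"):
    """Return (timestamp, old_version, new_version) for each upgrade line.

    Assumes lines of the form:
        YYYY-MM-DD HH:MM:SS upgrade <package> <old> <new>
    """
    events = []
    for line in dpkg_log_text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[2] != "upgrade":
            continue
        # Tolerate an architecture suffix such as "cgroup-bin:amd64".
        if fields[3].split(":")[0] != package:
            continue
        when = datetime.strptime(fields[0] + " " + fields[1],
                                 "%Y-%m-%d %H:%M:%S")
        events.append((when, fields[4], fields[5]))
    return events
```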

Per usual Debian convention, to fix a bug in cgrulesengd it is necessary to unmount and remount the entire cgroup filesystem.

Change 91115 had a related patch set uploaded by Tim Starling:
Improve logging for wfShellExec()

https://gerrit.wikimedia.org/r/91115

*** Bug 54367 has been marked as a duplicate of this bug. ***

I suggest migrating Apache to upstart and having it stop if mw-cgroup stops.

Change 91115 merged by jenkins-bot:
Improve logging for wfShellExec() and ignore missing cgroup

https://gerrit.wikimedia.org/r/91115
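The general idea of "ignore missing cgroup" can be illustrated as follows. This is not the merged change (MediaWiki is PHP, and the actual mechanism in wfShellExec() is not shown in this thread); it is a Python sketch that assumes the command is wrapped with cgexec from cgroup-bin, with a made-up function name.

```python
import logging
import os

log = logging.getLogger("shell")


def wrap_command(argv, cgroup_dir, group="memory:mediawiki"):
    """Return argv wrapped in cgexec if the cgroup exists, else unchanged.

    cgroup_dir -- directory of the cgroup in the mounted hierarchy
                  (the caller decides where that is)
    group      -- controller:path as passed to "cgexec -g"
    """
    if os.path.isdir(cgroup_dir):
        return ["cgexec", "-g", group] + list(argv)
    # Missing cgroup: log it and run the command unconfined rather
    # than failing the whole invocation.
    log.warning("cgroup %s is missing; running %s without it",
                cgroup_dir, argv[0])
    return list(argv)
```

The trade-off is that a missing cgroup no longer breaks rendering, but the command then runs without the memory limit, so the warning needs to be monitored.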

[Patch merged; resetting bug status]

Looks like this might be resolved? Repeating the Google search shows 30 results; I spot-checked some and they look like real edits discussing the issue.

Umherirrender claimed this task.