[[wikitech:Server_admin_log]] should not rely on IRC for logmsgbot entries
Open, Low, Public, Feature

Description

"Server admin log"[1] is updated by morebots (an irc + mediawiki bridge bot run from the wikitech linode instance).

Via deployment scripts[2] on fenari, logmsgbot outputs to IRC prefixed with "!log " for morebots to pick up and save to the wiki.

If Libera.Chat is not responding well, or if a netsplit separates these two bots from each other (happened again today...), deployments are not logged during that time.

Couple options:

  • Have these bots run on the same server (e.g. both from fenari.wmnet or both from wikitech.linode)
  • Merge them into 1 bot (which would have to run from fenari in order to hook reliably into deployment scripts)
  • Have them communicate (also or only) directly with each other instead of via Libera.Chat (e.g. via some socket between the two servers).

[1] http://wikitech.wikimedia.org/history/Server_admin_log
[2] http://wikitech.wikimedia.org/view/bin
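
The third option (a direct channel between the two bots) could look roughly like the sketch below. It is only an illustration: the function names, the newline-delimited wire format, and the host and port in the usage note are assumptions, not something either bot implements today.

```python
import socket


def write_entry(conn, message):
    """Write one log entry to an already connected socket as a single UTF-8 line."""
    # Collapse all whitespace runs (including newlines) so that one entry
    # always maps to exactly one line on the wire.
    line = " ".join(message.split()) + "\n"
    conn.sendall(line.encode("utf-8"))
    return line


def send_log_entry(message, host, port, timeout=5.0):
    """Open a short-lived TCP connection to the receiving bot and send one entry."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return write_entry(conn, message)
```

A deployment script on fenari might call `send_log_entry("Finished scap", "wikitech.example", 9999)` (hypothetical host and port) and fall back to the IRC "!log " path only if that raises `OSError`.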


Version: unspecified
Severity: enhancement
See Also:
T63544: Add !log entries from #-operations to logstash

Details

Reference
bz44791

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:13 AM
bzimport added a project: Deployments.
bzimport set Reference to bz44791.
bzimport added a subscriber: Unknown Object (MLST).

There's two separate use cases here:

  1. Someone in -operations !log'ing something to the SAL, like "!log I'm going to shutdown tampa, watch out!"
  2. Various scripts/tools on the cluster that log things to the SAL, e.g. scap or git-deploy.

Can we decouple these two use cases? It seems that passing #2 over an IRC server is bad design for all of the reasons stated.

Proposal:

  • Use case #1 (logmsgbot) should log to Logstash directly in addition to the SAL (initially).
  • Use case #2 (scap, trebuchet, etc) should log to Logstash directly.
  • Add support to logmsgbot to announce log entries from Logstash that originated there.
  • We get rid of the wikitech SAL and create a nice looking public Logstash view.
  • Use case #1 no longer logs to SAL.

No more stupidly large edit histories!
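
To make the "log to Logstash directly" bullets a bit more concrete, here is a minimal sketch of the event a script or bot might emit. The field names (`origin`, `type`) and the `should_announce` helper are assumptions made for illustration. The only thing relied on is that Logstash can ingest one JSON object per line, for example through a `tcp` input with the `json_lines` codec.

```python
import json
from datetime import datetime, timezone


def build_event(message, origin, when=None):
    """Serialize one SAL entry as a JSON line.

    origin is "irc" for entries typed by a human in the channel and
    "script" for entries emitted by tools such as scap (assumed values).
    when must be a timezone-aware datetime; it defaults to the current time.
    """
    when = when or datetime.now(timezone.utc)
    event = {
        "@timestamp": when.astimezone(timezone.utc).isoformat(),
        "type": "sal",
        "origin": origin,
        "message": message,
    }
    return json.dumps(event, sort_keys=True) + "\n"


def should_announce(event_line):
    """Return True when the bot should echo the entry into IRC.

    Entries that started in IRC are already visible there, so only
    entries that originated elsewhere need to be announced.
    """
    return json.loads(event_line)["origin"] != "irc"
```

Tagging each entry with where it came from is what lets the bot announce only the entries that did not already appear in the channel.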

(In reply to Greg Grossmeier from comment #1)

  • We get rid of the wikitech SAL and create a nice looking public Logstash view.

This is probably hard given the lack of ACLs in Logstash.

Potentially we could point Wikitech to the new SAL system ( https://tools.wmflabs.org/sal/ ) as implemented via T63544: Add !log entries from #-operations to logstash.

It is just about changing the wikitech sidebar, or adding a new link.

I have updated wm-bot !sal helpers in the ops and releng channels:

#wikimedia-operations:
<hashar> !sal
<wm-bot> https://labsconsole.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need.

#wikimedia-releng
<+hashar> !sal
<wm-bot> https://tools.wmflabs.org/sal/releng

Don't we want log entries to be backed up to wikitech-static and not just be kept in labs though?

Let's think through the failure modes we're trying to guard against:

Bad things

  1. Production wikis all returning blank pages/HHVM everywhere falls over/Varnish dies
  2. All of our datacenter uplinks get backhoe'd at the same time
  3. what else?

Responses

  1. is usually a bad deploy, but either way, toollabs should be up as well (and would be a perfect storm if it wasn't)
  2. I don't think we'd look at the SAL for this :)
  3. ...

Don't we want log entries to be backed up to wikitech-static and not just be kept in labs though?

*If* we decide that the https://tools.wmflabs.org/sal/ is the nicest way to look at SAL data, we should certainly add some backup system for the Elasticsearch system that powers it. One relatively easy option would be to use a tool like https://github.com/taskrabbit/elasticsearch-dump to copy the index to a data file on a regular basis and archive that dump somewhere outside of the stashbot Labs project.
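
A rough sketch of what such a periodic dump could look like, driven from Python. The Elasticsearch URL, the index name `sal`, and the output directory are assumptions; the `--input`, `--output` and `--type=data` flags are the ones the `elasticdump` command from that project documents.

```python
from pathlib import PurePosixPath


def dump_command(es_url, index, out_dir, day):
    """Build the elasticdump argv that copies one index to a dated JSON file.

    day is a datetime.date used to name the dump file.
    """
    out_file = PurePosixPath(out_dir) / f"{index}-{day.isoformat()}.json"
    return [
        "elasticdump",
        f"--input={es_url.rstrip('/')}/{index}",
        f"--output={out_file}",
        "--type=data",
    ]
```

A daily cron job could pass the result to `subprocess.run(..., check=True)` and then copy the file to storage outside the Labs project.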

In T46791#480052, @greg wrote:

There's two separate use cases here:

  1. Someone in -operations !log'ing something to the SAL, like "!log I'm going to shutdown tampa, watch out!"
  2. Various scripts/tools on the cluster that log things to the SAL, e.g. scap or git-deploy.

Yes, we should. I have a straw man proposal here. I think it'll solve the original task issue, and it works in a wikitech and/or Logstash scenario.

  • logmsgbot as it exists goes away. It's replaced with a service (HTTP/RESTful, I'm guessing).
    • This service could post to wikitech, or Logstash, both, or whatever. This protects clients against future backend changes.
  • Scap and other command line tools in production would just call this directly, either skipping IRC or treating the IRC log as supplemental.
  • The !log functionality for humans can remain, but either as a simpler logmsgbot or (more likely/useful) subsumed by wm-bot. Said bot would just call the same service
  • [BONUS] scap [log|sal] could do the same logging as IRC. This is great for A) Avoiding context switching when you're on a machine, and B) Fallback for SAL when IRC is busted
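
As a sketch of the "one service, many backends" idea: callers hand an entry to a single function, and the function fans it out to whatever backends are configured. The backend names and the `record` function are hypothetical; a real service would wrap this in an HTTP handler.

```python
def record(entry, backends):
    """Send entry to every backend and report which ones failed.

    backends maps a name to a callable that takes the entry text.
    A failing backend must not stop delivery to the others.
    """
    failed = []
    for name, send in backends.items():
        try:
            send(entry)
        except Exception:
            failed.append(name)
    return failed
```

Scap, the IRC bot, and a `scap log` command would all call the same function, so swapping wikitech for Logstash (or adding both) only changes the `backends` mapping.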

Some related thoughts/explanations on T156079: autolog scap3 deployments in beta.

wm-bot already does too many things IMO, and Stashbot currently owns everything beyond the message getting placed by a human or a bot on irc. Stashbot is in gerrit and currently running in a Kubernetes pod. When we get a prod k8s environment it shouldn't be too hard to move it from tool labs to prod and add an HTTP and/or TCP API to make automated log submissions easier.

So you're right re stashbot vs wm-bot. Stashbot should do this. Couple of notes from our IRC discussion:

  • An API from stashbot enables us to easily fix the production issue for l10nupdate, scap and other tools
    • And hides logstash/wikitech questions from callers
    • My scap [log|sal] suggestion as a fallback scenario is also easily done (actually that can happen now, it's already abstracted enough)
  • Stashbot can relay entries submitted to the API back to IRC. This lets callers replicate their messages for IRC consumers without having to speak IRC themselves. Relaying should be controlled per request (something like a query parameter), since not every message should go to IRC; the caller should decide.
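
The "caller should decide" part could be as small as the following sketch. The parameter name `irc` and the accepted truthy values are assumptions, not an existing Stashbot API.

```python
from urllib.parse import parse_qs


def wants_irc_relay(query_string, default=False):
    """Decide from a request's query string whether to echo the entry to IRC."""
    values = parse_qs(query_string).get("irc")
    if not values:
        # Parameter absent (or empty): fall back to the service default.
        return default
    # If the parameter is repeated, the last value wins.
    return values[-1].lower() in ("1", "true", "yes")
```

With this, a request ending in `?irc=1` would be relayed to the channel, while one without the parameter would only be stored.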

This was easy. {D567}

Legoktm renamed this task from "[[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries" to "[[wikitech:Server_admin_log]] should not rely on IRC for logmsgbot entries". Sep 13 2021, 8:44 PM
Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM

I don't really see why my team was tagged with this.