[scap] Deploy events aren't showing up in graphite/gdash
Closed, Resolved, Public

Description

They used to show up in gdash when you ticked the "Show Code Deploys" checkbox.

adding "&target=drawAsInfinite(deploy.any)" to the graphite urls doesn't work :/
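For reference, a minimal sketch of how such a URL could be assembled so the extra target is properly encoded. The function name `add_deploy_target` and the metric `some.metric` are hypothetical; only `drawAsInfinite(deploy.any)` comes from the report above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def add_deploy_target(base_url, existing_targets, deploy_metric="deploy.any"):
    """Return a render URL with a drawAsInfinite() target appended.

    Graphite accepts repeated `target` parameters, one per series.
    """
    targets = list(existing_targets) + ["drawAsInfinite(%s)" % deploy_metric]
    # urlencode percent-encodes the parentheses in the function call.
    query = urlencode([("target", t) for t in targets])
    return "%s?%s" % (base_url, query)


# "some.metric" is a placeholder series name, not a real metric.
example = add_deploy_target("https://graphite.wikimedia.org/render", ["some.metric"])
```

A well-formed URL only rules out encoding mistakes; the marker still cannot render if the `deploy.any` series has no data.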


Version: wmf-deployment
Severity: normal
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=6970
https://bugzilla.wikimedia.org/show_bug.cgi?id=66174

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 2:52 AM
bzimport set Reference to bz62667.
bzimport added a subscriber: Unknown Object (MLST).

Change 119339 had a related patch set uploaded by BryanDavis:
Fix MW_STATSD_PORT to point to correct listener

https://gerrit.wikimedia.org/r/119339

Change 119340 had a related patch set uploaded by BryanDavis:
Fix statsd_port value

https://gerrit.wikimedia.org/r/119340

After these patches land and we get some data in graphite again I think we'll need to look at the gdash configuration and update the metric names that it uses to identify deployments as well. deploy2graphite and scap send different metrics to graphite.

Change 119340 merged by jenkins-bot:
Fix statsd_port value

https://gerrit.wikimedia.org/r/119340

Change 119339 merged by Ori.livneh:
Fix MW_STATSD_PORT to point to correct listener

https://gerrit.wikimedia.org/r/119339

See https://gerrit.wikimedia.org/r/#/c/111409/ for the change from carbon to statsd that should have been accompanied by a change to the gdash configuration and port number as well.

When we figure out what all the new deploy metrics are, they should be added to templates/gdash/deploy_addon.erb in operations/puppet.git to fix the deploy marks added to the graphs.

There is some additional problem with the current gdash configuration.

When the "Show Code Deploys" checkbox is active, something is causing the generated graphite URLs to contain an extraordinary number of superfluous ampersands. In one URL I just examined there are 4188 extra ampersands inserted between the deployment metric stanzas and the remainder of the graph description.

When these ampersands are removed from the graphite URL the graph renders (albeit with no deploy markers).
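As a workaround while debugging, the empty query parameters can be stripped mechanically. This is only a sketch of the manual cleanup described above, not a fix for whatever in gdash generates them; `collapse_ampersands` is a hypothetical helper name.

```python
import re


def collapse_ampersands(url):
    """Collapse runs of '&' in the query string into a single separator."""
    head, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, nothing to clean
    # Replace any run of two or more ampersands with one, then drop
    # leading and trailing separators left over at the edges.
    query = re.sub(r"&{2,}", "&", query).strip("&")
    return head + sep + query
```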

For what it's worth, I was seeing graphite urls like that (tons of &s) on Friday the 14th.

The configuration changes now have data being recorded in graphite for scap runs again, but there are three remaining issues:

  1. The metric names have changed. The gdash configuration is looking to add the metrics "deploy.sync-common-file", "deploy.sync-common-all" and "deploy.scap" to the graph. With the change from direct carbon communication to statsd and the changes to scap code, these metric names have changed. "scap.scap.count" should be the equivalent of the old "deploy.scap" metric.
  2. In theory the metrics for "deploy.sync-common-file" and "deploy.sync-common-all" should just need a ".count" added to them, but I'm not currently seeing metrics with those names in graphite at all.
  3. The txstatsd recorded stats for "scap.scap.count" don't look right at all. I would expect graphite to be recording the aggregate sum of the "scap.scap:1|c" calls seen in the last minute which would typically be 0 and occasionally be 1 (or possibly 2 with aborted scaps). Instead it seems to be recording a value of 1.0 every minute with occasional values of 5.0 that are not correlated with other scap logging output. [0]

[0]: https://graphite.wikimedia.org/render?from=23%3A00_20140331&until=00%3A00_20140401&target=scap.scap.count&format=json
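To make the expectation in the third item concrete, here is a rough sketch of the per-minute aggregation one would expect for counter packets. It is a toy model, not how txstatsd is implemented: the function names are hypothetical, sample rates are ignored, and intervals without packets are simply absent from the result, whereas a real server would typically report 0 for them.

```python
from collections import defaultdict


def parse_counter(packet):
    """Split a statsd counter packet such as 'scap.scap:1|c' into (name, value)."""
    name, _, rest = packet.partition(":")
    value, _, kind = rest.partition("|")
    # Anything after a second '|' (e.g. a sample rate) is ignored in this sketch.
    if kind.split("|")[0] != "c":
        raise ValueError("not a counter packet: %r" % packet)
    return name, float(value)


def aggregate_per_interval(events, interval=60):
    """Sum counter values per metric within each flush interval.

    `events` is an iterable of (unix_timestamp, packet) pairs.
    Returns {(name, interval_start): total}.
    """
    buckets = defaultdict(float)
    for ts, packet in events:
        name, value = parse_counter(packet)
        bucket_start = int(ts // interval) * interval
        buckets[(name, bucket_start)] += value
    return dict(buckets)
```

Under this model, two scap runs in the same minute give a value of 2.0 for that minute. A steady 1.0 every minute, as observed, would not come from occasional deploys alone.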

Assigning to Ori in the hope that he can find some time to look into the txstatsd behavior and the missing metrics. Once those issues are fixed it should be pretty easy to correct the gdash configuration.

chasemp subscribed.

I think the assigning missed but I am doing it now.

Dear ori,

A ticket just for you?

Best Wishes,

Chase

:)

chasemp lowered the priority of this task from High to Medium. Mar 11 2015, 9:12 PM

reducing priority to reflect the obvious back burner status

this should be working now in graphite at least

https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1430476877.683&target=drawAsInfinite(deploy.all.count)&target=drawAsInfinite(scap.scap)

@bd808 is scap sending statsd counters? if so the names will change upon next push as we've enabled extended counters in T95703

Change 208085 had a related patch set uploaded (by Filippo Giunchedi):
gdash: adjust deploy metrics

https://gerrit.wikimedia.org/r/208085

Somebody needs to take a look at the statsd events that scap tries to send. They really haven't ever worked properly and it seems quite likely that the packets were implemented incorrectly for txstatsd and possibly for the new replacement as well.

likely you don't need to increment and time the same metric, so https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/log.py;ef15380f0ffc839e64956a4c974ca280f4b660db$349 can be removed and likewise https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/main.py;ef15380f0ffc839e64956a4c974ca280f4b660db$265

other than that it seems fine on the surface, the deploy. metrics are working
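A sketch of what timing without a separate increment could look like, assuming the statsd server derives a count from timer samples (some implementations do; whether the server in use here does is an assumption). `timer_packet` and `timed` are hypothetical names and are not taken from scap's code.

```python
import time
from contextlib import contextmanager


def timer_packet(name, elapsed_ms):
    """Format a statsd timer packet, e.g. 'scap.scap:250|ms'."""
    return "%s:%d|ms" % (name, elapsed_ms)


@contextmanager
def timed(name, send, clock=time.monotonic):
    """Time the enclosed block and emit a single timer packet via `send`.

    No counter packet is sent; the metric is reported once, as a timer.
    `send` is any callable that accepts the packet string.
    """
    start = clock()
    try:
        yield
    finally:
        send(timer_packet(name, round((clock() - start) * 1000)))
```

Usage would be along the lines of `with timed("scap.scap", send_udp): run()`, where `send_udp` stands in for whatever transport function is available.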

Change 208987 had a related patch set uploaded (by BryanDavis):
Update statsd events

https://gerrit.wikimedia.org/r/208987

Change 208987 merged by jenkins-bot:
Update statsd events

https://gerrit.wikimedia.org/r/208987

Change 208085 merged by Filippo Giunchedi:
gdash: adjust deploy metrics

https://gerrit.wikimedia.org/r/208085

Change 209462 had a related patch set uploaded (by Filippo Giunchedi):
gdash: fix deploy addon urls

https://gerrit.wikimedia.org/r/209462

Change 209462 merged by Filippo Giunchedi:
gdash: fix deploy addon urls

https://gerrit.wikimedia.org/r/209462

So two additional issues were identified. One was fixed in https://gerrit.wikimedia.org/r/209462. The other was due to the conversion from timer to simple counter to extended counter: the end result was metrics with the same name existing both as a .wsp file (simple counter) and as a directory (extended counter or timer), with the former masking the latter from showing up. Note that a similar root cause is probably what's causing T98380.
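A quick way to look for this kind of collision is to walk the whisper storage tree and report any name that exists both as `name.wsp` and as a directory `name/` under the same parent. This is a minimal sketch; the function name is hypothetical and the storage root has to be supplied by the caller, since its location is not given in this task.

```python
import os


def find_masked_metrics(root):
    """Return metric paths that exist both as '<name>.wsp' and as '<name>/'."""
    collisions = []
    for dirpath, dirnames, filenames in os.walk(root):
        subdirs = set(dirnames)
        for filename in filenames:
            stem, ext = os.path.splitext(filename)
            # A .wsp file whose stem matches a sibling directory is the
            # file-versus-directory clash described above.
            if ext == ".wsp" and stem in subdirs:
                collisions.append(os.path.join(dirpath, stem))
    return sorted(collisions)
```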