
ULSFO post-move verification
Closed, Resolved (Public)

Description

(happened on Wednesday, July 9th)

  • check with gage whether this work already took place
  • check that during the switchover, hosts were correctly reporting
    • like only the correct hosts going down during the migration
    • the other hosts were picking up the correct traffic
  • check that each host is still reporting the expected number of requests
    • sampled-1000 logs (stat1002 /a/squid/...)
    • mobile-sampled-100 logs (stat1002 /a/squid/...); can be done by plotting requests per host over time (see the sketch after this list)
    • zero logs (stat1002 /a/squid/...)
    • edit logs (stat1002 /a/squid/...)
    • Find out where those files get written, and find a way to cover
      • oxygen
      • gadolinium (unicast)
      • gadolinium (multicast)
      • erbium, if they are not covered by the files above
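
For the per-host request counting mentioned above, a minimal sketch of what such a check could look like, assuming the udp2log sampled log format in which the first three whitespace-separated fields are hostname, sequence number, and an ISO-8601-like timestamp (field positions and the script name are assumptions, not taken from this task):

    # count_per_host_hour.py -- hypothetical helper name, minimal sketch
    # Counts requests per host per hour from a sampled log read on stdin.
    # Assumes udp2log-style lines: "<hostname> <sequence> <timestamp> ..."
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 3:
            continue
        host, hour = fields[0], fields[2][:13]   # "2014-07-09T17" -> one bucket per hour
        counts[(host, hour)] += 1

    for (host, hour), n in sorted(counts.items()):
        print("%s\t%s\t%d" % (host, hour, n))

Fed with e.g. zcat over one of the sampled log files on stat1002, the output can be pivoted into per-host, per-hour request counts and plotted to spot hosts that drop out or pick up traffic at the wrong time.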

Version: unspecified
Severity: normal
Whiteboard: u=Kevin c=General/Unknown p=0 s=2014-07-24

Details

Reference
bz68199

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:26 AM
bzimport set Reference to bz68199.
  • check with gage whether this work already took place

Checked with gage; ops had not checked the network traffic in depth during the switchover.

  • check that during the switchover, hosts were correctly reporting
    • like only the correct hosts going down during the migration
    • the other hosts were picking up the correct traffic

Checked traffic on all hosts (ulsfo, eqiad, and esams) by using data gathered by Christian from the sampled logs. Found that only ULSFO hosts had their traffic go down, and only for the expected period. Also found that only EQIAD hosts had their traffic increase abnormally, and again for the expected period. Overall, I believe that no traffic leaked or increased anywhere outside of what was expected. I will attach pictorial proof and spreadsheets.

Created attachment 16018
the daily traffic to ulsfo and eqiad, by host, during the switchover

Attached:

daily_ulsfo_redirected_to_eqiad.png (1×1 px, 332 KB)

Created attachment 16019
the daily traffic to ulsfo and eqiad, by datacenter, during the switchover

Attached:

daily_ulsfo_redirected_to_eqiad_(totals).png (1×1 px, 138 KB)

Created attachment 16020
the hourly traffic to ulsfo and eqiad, by host, during the switchover

Attached:

hourly_ulsfo_redirected_to_eqiad.png (1×1 px, 483 KB)

Created attachment 16021
the hourly traffic to ulsfo and eqiad, by datacenter, during the switchover

Attached:

hourly_ulsfo_redirected_to_eqiad_(totals).png (1×1 px, 276 KB)

Created attachment 16022
spreadsheet of daily data for ulsfo and eqiad with totals and graph

Attached:

Created attachment 16023
spreadsheet of hourly data for ulsfo and eqiad with totals and graph

Attached:

jgage wrote:

Ok, by examining router interface statistics with LibreNMS, I have confirmed that when traffic from ULSFO ceased, traffic from EQIAD increased by a similar amount.

  • Rather than looking at actual inbound web request traffic, I'm looking at the outbound responses because they should correlate and are much bigger.

I've provided URLs for reference; LibreNMS access may be requested by emailing access-requests@rt.wikimedia.org.

EQIAD:

cr1-eqiad xe 5/3/1 (transit)
    https://librenms.wikimedia.org/graphs/to=1406129520/id=4515/type=port_bits/from=1404228720/
    +700 Mbps
cr1-eqiad xe 4/3/2 (transit)
    https://librenms.wikimedia.org/graphs/to=1405179180/id=6821/type=port_bits/from=1404747180/
    +700 Mbps
cr1-eqiad xe 4/3/1 (peering)
    https://librenms.wikimedia.org/graphs/to=1405154040/id=6820/type=port_bits/from=1404722040/
    +250 Mbps
cr2-eqiad xe 5/3/1 (transit)
    https://librenms.wikimedia.org/graphs/to=1405159680/id=134/type=port_bits/from=1404727680/
    +1000 Mbps
cr2-eqiad xe 5/3/3 (peering)
    https://librenms.wikimedia.org/graphs/to=1405159800/id=136/type=port_bits/from=1404727800/
    +1000 Mbps

ULSFO:

cr1: 0/0/3 (transit)
    https://librenms.wikimedia.org/graphs/to=1405158120/id=7200/type=port_bits/from=1404726120/
    -1800 Mbps
cr2: 0/0/2 (transit)
    https://librenms.wikimedia.org/graphs/to=1405158600/id=7139/type=port_bits/from=1404726600/
    -400 Mbps (maybe)
cr2: 0/0/3 (peering)
    https://librenms.wikimedia.org/graphs/to=1405158480/id=7140/type=port_bits/from=1404726480/
    -1100 Mbps

Increase at EQIAD: roughly 3650 Mbps
Decrease at ULSFO: roughly 3300 Mbps
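(Those per-interface estimates sum to 700 + 700 + 250 + 1000 + 1000 = 3650 Mbps gained at EQIAD, versus 1800 + 400 + 1100 = 3300 Mbps lost at ULSFO.)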

I had to visually estimate the values from the graphs, so this seems like acceptable equivalence.

In addition to the traffic math, I'm not aware of any user reports of service disruption, and our 3rd party monitoring reports 100% availability in all significant categories for that week. Therefore I have high confidence that traffic was successfully rerouted without loss during the migration.

Approximate timeline:
2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD
2014-07-09 17:00 UTC: I arrive at ULSFO
2014-07-09 17:30 UTC: ULSFO becomes unreachable

[servers and routers are moved to a new room, everything is plugged back in]

2014-07-09 21:30 UTC: routers back online
2014-07-09 22:45 UTC: Mark restores traffic to ULSFO
2014-07-10 00:30 UTC: I leave ULSFO

Awesome. Thank you so much, Jeff.

(In reply to Dan Andreescu from comment #2)

Also found
that only EQIAD hosts had their traffic increase abnormally, [...]

I had expected to see amssq47 (esams) being called out, as it picked
up traffic just as ULSFO's went down.

That's just a coincidence. Right?

(In reply to Jeff Gage from comment #9)

Approximate timeline:
2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD

While I see that the timeline is labeled as “approximate”, since we're
looking at numbers at hourly granularity ...

Looking at the graphs, they take the deep downward dive already ~4-5
hours earlier. This earlier time also nicely matches Mark's rerouting
commit [1], which Gerrit shows as merged on 2014-07-09 10:40 UTC.

2014-07-09 22:45 UTC: Mark restores traffic to ULSFO

While that might be right, it is reflected neither in the graphs nor
in the puppet repo.

Looking at the graphs, they start to rise only ~1-2 hours later. This
later time again nicely aligns with the puppet repo. There, Brandon's
(not Mark's) rerouting commits [2] are shown as merged between
2014-07-10 00:38 and 2014-07-10 08:37.

[1] https://gerrit.wikimedia.org/r/#/c/144934/

[2] They are a series of commits between

https://gerrit.wikimedia.org/r/#/c/145182/
https://gerrit.wikimedia.org/r/#/c/145221/

jgage wrote:

Hi,

We should definitely trust the commits over my information source, the Server Admin Log, whose events are manually input: https://wikitech.wikimedia.org/wiki/SAL

22:42 mark: Enabling PAIX BGP sessions on cr2-ulsfo
22:40 mark: Enabling WMF HQ BGP sessions on cr1-ulsfo
22:38 mark: Enabling TiNet transit links on cr1-ulsfo
22:35 mark: Enabling WMF HQ BGP sessions on cr2-ulsfo
22:34 mark: Enabling NTT and HE transit links on cr2-ulsfo

16:17 mark: ulsfo is now offline
16:16 mark: Shutdown NTT BGP sessions on cr2-ulsfo
16:13 mark: Shutdown TiNet BGP sessions on cr1-ulsfo
16:10 mark: Shutdown IXP BGP sessions on cr2-ulsfo
16:10 mark: Shutdown WMF HQ BGP sessions on cr2-ulsfo
16:09 mark: Shutdown WMF HQ BGP sessions on cr1-ulsfo

From the patch we can see that all traffic directed away from ULSFO was sent to EQIAD. Therefore it does seem like any increased traffic to ESAMS would be coincidental. I'll ask Mark to comment on this.

Yeah, amssq47 had been used as a test server before, not receiving any traffic. Brandon reinstalled and put it back in production around that time, so that would explain it.

(In reply to Mark Bergsma from comment #13)

[ Situation around amssq47 ]

Thanks for confirming.


Per host per hour packet loss numbers look good.
(The only host that sticks out there a bit is cp3013, a mobile esams
cache. But esams should not have seen changes from the ULSFO move, and
that host is not super-stable around packet loss anyway. Nothing
concerning, but it is on the border more often than not. As the total
volume of messages looks sound, and other parts of this host's log do
too, I am assuming it's a coincidence.)
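
For context, a rough sketch of how a per-host, per-hour loss figure can be derived from such logs, assuming every line carries a consecutive per-host sequence number as its second field (an assumption about the udp2log format; for sampled logs the expected step would be the sampling factor rather than 1, which this sketch ignores):

    # loss_per_host_hour.py -- hypothetical helper name, minimal sketch
    # Estimates per-host per-hour log-line loss from gaps in sequence numbers.
    # Assumes udp2log-style lines: "<hostname> <sequence> <timestamp> ..."
    import sys
    from collections import defaultdict

    last_seq = {}                # host -> last sequence number seen
    seen = defaultdict(int)      # (host, hour) -> lines received
    missed = defaultdict(int)    # (host, hour) -> lines implied missing by gaps

    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 3:
            continue
        host, seq, hour = fields[0], int(fields[1]), fields[2][:13]
        seen[(host, hour)] += 1
        if host in last_seq and seq > last_seq[host] + 1:
            missed[(host, hour)] += seq - last_seq[host] - 1
        last_seq[host] = seq

    for key in sorted(seen):
        total = seen[key] + missed[key]
        print("%s\t%s\t%.2f%%" % (key[0], key[1], 100.0 * missed[key] / total))

Sequence counter resets (e.g. after a cache restart) would show up as huge gaps and would need to be filtered out in practice.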

Per host per hour total traffic numbers look off at first glance.
But during the ULSFO floor move, a semi-final of the 2014 FIFA World
Cup took place (yay, coincidence!). This caused a traffic spike
during the ULSFO move, which makes the numbers look really skewed.
However, when limiting to various slices of non-soccer traffic, no
spike is visible.
For each individual non-soccer slice, the data looks good.

Per host per hour URLs look good.

Per host per hour per status code numbers look good.

Per host per hour referers look good.

From my point of view, the log data looks good overall.