During a failover test with the Bare Metal Edges we ran into an issue when pulling the plug on 1 of the TOR switches. (TOR-LEFT). During that test all BGPs on both Bare Metal edges went down. So no North-South routing anymore 🙁
The Bare Metal Edges were configured following the design guidelines from VMware:
Indentifying the problem:
So why this behaviour? And what happens when we pull the plug on the other TOR switch (TOR-RIGHT).
After performing the test with the TOR-RIGHT, the BGPs connected to TOR Left stayed established. So it has something to do with switch TOR-LEFT?
After checking the configuration on the TOR-LEFT switch we didn’t identified something that could cause this issue. But what could it be? Edges were configured by VMware guidelines and were identical configuration wise.
So going through the logs was the next step in the process, and i stumbled upon this part in the log file:
2022-10-17T10:37:08.578Z Update device fp-eth0 state to DOWN 2022-10-17T10:37:08.578Z Self Node 00363d34-fcdd-11ea-8e07-e4434ba66042 status changed from Up to Down (RTEP device down)
Can it have something to do with the federated setup (RTEP), is the RTEP only connecting over fp-eth0?
Again i went through the setup but now i also checked the fp-eth0 connections to the switches. On both BareMetal Edges the fp-eth0 was connected to the TOR-LEFT. So when we pulled the plug on that Switch it triggered the RTEP going down, which led to all BGP session going down.
This is expected behavior according to VMware!
The solution to this issue was pretty simple after we identified the cause. We switched the connection on the second Bare Metal Edge, so the pnics connected to TOR LEFT will be on TOR-RIGHT and vice versa. The opposite of the first Bare Metal Edge.