A while ago we ran into an issue after upgrading NSX-T from version 3.1.3.6 to 3.1.3.7: an alarm showed up in the alarms section at one site. I still wanted to write a post about the issue and the solution/workaround:
Time to check the connection!
Log in to the Edges and grab the VRF ID of the RTEP TUNNEL.
Check the BGP and ping between the RTEP ip addresses on both sites.
As you can see, all BGP sessions are established and the ping commands get a reply.
Let’s do another check from Postman:
Open Postman and fire a GET API call to the NSX Manager to grab the edge ID we need in the next API call:
GET https://<nsxmanagerip>/api/v1/transport-nodes/
Just select Basic Auth under the Authorization tab and fill in the Admin credentials.
Hit Send. When the reply shows up in the Body, search for the edge name and note the corresponding id.
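If you prefer a script over Postman, the lookup above can be sketched in Python with only the standard library. This is a minimal sketch, not a tested tool: the helper names `nsx_get` and `find_node_id` are made up for this post, and it assumes the reply carries the nodes in a `results` list with `display_name` and `id` fields, so compare that with the Body you see in Postman.

```python
import base64
import json
import ssl
import urllib.request


def nsx_get(manager, path, user, password, verify_tls=True):
    """Send a GET with Basic Auth to the NSX Manager and return the parsed JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{manager}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skip certificate checks for a self-signed manager certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


def find_node_id(transport_nodes, edge_name):
    """Return the id of the node whose display_name equals edge_name, or None.

    Assumption: nodes are listed under "results" with "display_name" and "id".
    """
    for node in transport_nodes.get("results", []):
        if node.get("display_name") == edge_name:
            return node.get("id")
    return None


# Hypothetical usage (manager address, credentials and edge name are placeholders):
# nodes = nsx_get("<nsxmanagerip>", "/api/v1/transport-nodes/", "admin", "<password>")
# print(find_node_id(nodes, "<edge name>"))
```

Keeping the request and the search in separate functions means the search can be tried against a Body saved from Postman without touching the manager at all.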
Now we use this id to get the RTEP status:
GET https://<nsxmanagerip>/api/v1/transport-nodes/<edgenodeid>/inter-site/bgp/summary
Check the output and the Return Status for issues. As you can see in the example above, the BGP session to one of the peers is established.
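Reading the summary by eye works for a couple of peers, but the same check can be scripted against the JSON reply. The sketch below is only an illustration: the function names are invented for this post, and the key names `neighbor_status`, `connection_state` and `neighbor_address` are assumptions about the reply, which is why they are parameters you can change after comparing them with the Body in Postman.

```python
from urllib.parse import quote


def bgp_summary_path(edge_node_id):
    """Build the inter-site BGP summary path used in this post for one edge node."""
    return f"/api/v1/transport-nodes/{quote(edge_node_id, safe='')}/inter-site/bgp/summary"


def peers_not_established(summary, list_key="neighbor_status",
                          state_key="connection_state",
                          address_key="neighbor_address"):
    """Return (address, state) for every neighbor that is not ESTABLISHED.

    The three key names are assumptions; adjust them to match the real reply.
    """
    problems = []
    for neighbor in summary.get(list_key, []):
        # A missing state is reported as UNKNOWN instead of being silently skipped.
        state = str(neighbor.get(state_key, "UNKNOWN")).upper()
        if state != "ESTABLISHED":
            problems.append((neighbor.get(address_key), state))
    return problems
```

An empty list back from `peers_not_established` means every listed peer reported as established, which matches what we saw in our case while the alarm was still open.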
Solution:
So it seems the issue is a known one in version 3.1.3.7 in a setup with three manager nodes.
Only the manager node that generated the alarm can clear it from its in-memory state, and it does so when it receives the remove-alarm message from the edge node. The alarm was resolved on one of the manager nodes, but the other nodes still showed it and kept it active.
The following workaround will remove the alarm: Restart the proton service on ALL manager nodes.
– SSH to each NSX-T manager node as the admin user.
– Execute the following commands:
stop service proton
start service proton
UPDATE: The issue is fixed in version 3.2.1