Replace NSX-T expiring Local Manager and Global Manager self-signed certificates.

Recently we saw some warnings about expiring certificates in the NSX-T Global Manager and Local Manager.

When we clicked one of the alerts, we got a short description and some API calls we could fire to apply new certificates.

In the Certificates overview (System > Certificates > Certificates), we could see that the certificates issued to the Local Manager and Global Manager were expiring. The certificate IDs also corresponded to the ones in the alert (not the ones in my screenshots).

The API calls mentioned in the alert description are for renewing the certificate of the HTTP service (UI), not the Local/Global Manager certificates.
The VMware Docs don't explain in much detail how to change these certificates; I couldn't find it.

Try it yourself:
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-50C36862-A29D-48FA-8CE7-697E64E10E37.html

The only giveaway I could find was in step 6 (NSX Federation and the service type).

So before we can replace the certificates, we need to create new self-signed certificates for the Local Manager and the Global Manager.

Create CSR on GM/LM:

To create a CSR (Certificate Signing Request) on the Global or Local Manager, go to System > Certificates > CSRs and click "Generate CSR".

For the Global Manager, do this via the Global Manager appliance, and for the Local Manager use the Local Manager appliance, or use the drop-down at the top of the screen to choose between your Global Manager and Local Managers.

Fill in all the fields and hit the GENERATE button. The example below is for the Global Manager; for the Local Manager, just change the word global to local:

Now we can see a new CSR in the list. The next step is to self-sign the Global Manager CSR: select the CSR and under Actions choose "Self Sign Certificate for CSR".

Choose your number of days:

Now we have a new self-signed certificate for the Global Manager in the certificates list. With this certificate we can replace the Principal Identity certificate for the Global Manager.

For the Local Manager certificates, follow the steps mentioned above on the Local Manager appliance.
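If you prefer scripting this step, the CSR can also be self-signed through the API instead of the Actions menu. A minimal sketch with curl; the self_sign action and days_valid parameter are assumptions on my part, so verify them against the NSX-T API guide for your version:

    # Self-sign an existing CSR for 365 days (assumed action and parameter names, check your API guide).
    curl -k -u admin -X POST \
      "https://<global-manager>/api/v1/trust-management/csrs/<csr-id>?action=self_sign&days_valid=365"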

Apply Self-Signed Certificate on the Global Manager

Before we can apply the self-signed certificate to the Global Manager, we need to copy the certificate ID. Click on the ID, then select the whole ID in the pop-up and copy/paste it for later use:
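If you would rather grab the ID from the API, the certificates can also be listed with curl. A minimal sketch, assuming basic auth and that jq is available to filter the output:

    # List all certificates with their IDs and display names (requires jq).
    curl -k -u admin -s "https://<global-manager>/api/v1/trust-management/certificates" \
      | jq '.results[] | {id, display_name}'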

Now we can fire up Postman to apply the certificate via the API:

1. Change the ACTION drop-down to POST
2. Paste the following URL to your Global Manager:

Steps 1 to 4

5. Change to Body, select raw and choose JSON from the drop-down
6. Put the following JSON in the Body:

    {
        "cert_id": "<new cert id>",
        "service_type": "GLOBAL_MANAGER"
    }

Hit the SEND button. If you get back Status 200 OK, the API call was successful:
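The same call can also be made with curl instead of Postman. A minimal sketch; the set_pi_certificate_for_federation action is my assumption for the Federation principal identity endpoint, so verify the exact URL against the VMware doc linked above:

    # Replace the Global Manager principal identity certificate (assumed endpoint, check the docs for your version).
    curl -k -u admin -X POST \
      -H "Content-Type: application/json" \
      -d '{"cert_id": "<new cert id>", "service_type": "GLOBAL_MANAGER"}' \
      "https://<global-manager>/api/v1/trust-management/certificates?action=set_pi_certificate_for_federation"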

Apply Self-Signed Certificate on the Local Manager

This procedure is almost the same as on the Global Manager. Go to the Local Manager appliance and copy the new certificate ID.

Go to Postman:

1. Change the ACTION drop-down to POST
2. Paste the following URL to your Local Manager:
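The request body is the same as for the Global Manager, only with the service type changed to LOCAL_MANAGER, again using the ID of the new Local Manager certificate:

    {
        "cert_id": "<new cert id>",
        "service_type": "LOCAL_MANAGER"
    }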

Check the old expiring certificate

Since we are on version 3.1, we need to check via the API whether the old certificate has been released, so we know it isn't used anymore before we delete it.
Note: In newer versions there is a "Where Used" field in the certificate overview.

Go to Postman:

1. Change the ACTION drop-down to GET
2. Paste the following URL to your Global Manager (use the ID of the expiring certificate):
    https://<global-manager>/api/v1/trust-management/certificates/dee8d78b-5e04-4deb-8d36-6b86f79f058b
   Set Authorization the same as in the previous API calls
3. Select Body and set it to none
4. Hit Send

To see whether the certificate is used by a node, look at the used_by part of the response. If there is a node_id, the certificate is still in use and can't be deleted. If it is empty, you can delete the certificate in the UI. You can do the same check on the new certificate to see if it is now used by the same node.

          "used_by": [
              {
                  "node_id": "515f0642-b1eb-ef39-9acc-82f8760ab0b9",
                  "service_types": [
                      "GLOBAL_MANAGER"
                  ]
              }
          ],
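If you prefer the command line over Postman, the same check can be done with curl. A minimal sketch, assuming basic auth and that jq is installed:

    # Show only the used_by field of the expiring certificate (requires jq).
    curl -k -u admin -s \
      "https://<global-manager>/api/v1/trust-management/certificates/<certificate-id>" | jq '.used_by'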

Sometimes the certificate won't release itself, so let's release the damn thing:

Release a Certificate

Please keep in mind that you should only release the certificate from the node_id if you are absolutely sure; if not, please raise a ticket with VMware Support.

1. Log in to the manager with SSH as the admin user
2. Type st e and enter the root password; you are now at the shell
3. Use the certificate ID and the node_id from the previous step
4. Now use the following API call to release the certificate from the node_id:

    curl -k -X POST -H "Content-Type: application/json" -H 'X-NSX-Username:admin' -H 'X-NSX-Groups:superuser' -d '{"service_type":"<service_type as shown in used_by>","node_id":"<node_id>"}' "http://localhost:7440/nsxapi/api/v1/trust-management/certificates/<certificate-id>?action=release"

    For example:

    curl -k -X POST -H "Content-Type: application/json" -H 'X-NSX-Username:admin' -H 'X-NSX-Groups:superuser' -d '{"service_type":"GLOBAL_MANAGER","node_id":"647defab-13c5-5g62-93e4-da4a345d666"}' "http://localhost:7440/nsxapi/api/v1/trust-management/certificates/ea5771fb-161e-49dd-97d7-c1483d2790666691?action=release"

5. This should do it. You can check the certificate again with the previous step and look at used_by; it should be empty now.
      

If the used_by is empty, you can now safely remove the certificate.

All the steps are almost identical for the Global and Local Manager; just change the service type from GLOBAL_MANAGER to LOCAL_MANAGER, etc.

And the Alerts are now Resolved!

Design Consideration – Bare Metal Edge with 4 pNICs in an NSX-T Federated Environment

During a failover test with the Bare Metal Edges, we ran into an issue when pulling the plug on one of the TOR switches (TOR-LEFT). During that test, all BGP sessions on both Bare Metal Edges went down, so no North-South routing anymore 🙁

Edge Setup:

The Bare Metal Edges were configured following the design guidelines from VMware:
https://nsx.techzone.vmware.com/resource/nsx-t-reference-design-guide-3-0#_Edge_Node_and_1

Identifying the problem:

So why this behaviour? And what happens when we pull the plug on the other TOR switch (TOR-RIGHT)?
After performing the test with TOR-RIGHT, the BGP sessions connected to TOR-LEFT stayed established. So it had something to do with switch TOR-LEFT?

After checking the configuration on the TOR-LEFT switch, we didn't identify anything that could cause this issue. But what could it be? The Edges were configured following the VMware guidelines and were identical configuration-wise.

So going through the logs was the next step in the process, and I stumbled upon this part in the log file:

2022-10-17T10:37:08.578Z Update device fp-eth0 state to DOWN
2022-10-17T10:37:08.578Z Self Node 00363d34-fcdd-11ea-8e07-e4434ba66042 status changed from Up to Down (RTEP device down)

Could it have something to do with the federated setup (RTEP)? Is the RTEP only connecting over fp-eth0?

Cause:

Again I went through the setup, but now I also checked the fp-eth0 connections to the switches. On both Bare Metal Edges, fp-eth0 was connected to TOR-LEFT. So when we pulled the plug on that switch, it triggered the RTEP going down, which led to all BGP sessions going down.

This is expected behavior according to VMware!

Solution:

The solution to this issue was pretty simple after we identified the cause. We switched the connections on the second Bare Metal Edge, so the pNICs that were connected to TOR-LEFT are now on TOR-RIGHT and vice versa, the opposite of the first Bare Metal Edge.