VCF 5.0 – Retry Bring up fails on Generate input for SSH Keys

After a failed VCF bring-up, I wanted to retry the bring-up. Luckily the error I encountered before was resolved, but I ran into another issue during the retry.

Now the issue was with the import of the SSH keys.

Going through some internal resources I stumbled upon the solution: since this is a nested lab environment on top of VCD, you have to reset the MAC address of the ESXi host.

VCF 5.0 LAB Deployment Issue – Failed to migrate vmnics of ESXi host to Distributed vSwitch

During my latest deployment of VCF in my lab environment I ran into the following issue.

Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d

The error is pretty clear: the migration of vmk0 from the standard vSwitch to the Distributed vSwitch failed on esx02. I checked esx01, and on this host the migration was successful.

I tried to migrate vmk0 to the Distributed vSwitch manually, but that also ran into an error in vCenter.
Right-click dvSwitch -> Add and Manage Hosts -> Manage Host Networking -> Select esx02

Click Next and leave the physical adapters as is, click next again.
On the next screen click on “Assign Port Group” next to vmk0.

Click on ASSIGN next to the management portgroup

Click Next, Next, Finish. The task starts running and fails after a few seconds.

Checking the Task Details on the ESX host:

After some investigation and searching internally within VMware resources, I also ran into this blog article: https://mhvmw.wordpress.com/2023/03/17/issue-with-nested-vcf-4-5-deployment-lab-only/

It is a MAC address conflict that occurs when ESXi takes the MAC address of the physical NIC for vmk0.
By deleting and recreating the vmk0 interface you generate a new MAC address for vmk0.

Steps to check, delete and recreate vmk0 interface

Login via DCUI

Enable ESXi Shell

Next, press ALT+F1 to access the ESXi console and log in as root.

Type the command:
esxcli network ip interface list

Make a note of the portgroup, in this case “Management Network”, and then remove vmk0 with the following command:
esxcli network ip interface remove --interface-name=vmk0

When vmk0 is deleted, we can immediately create a new interface with the same name and portgroup. This is done with the following command:
esxcli network ip interface add --interface-name=vmk0 -p "Management Network"

To check if vmk0 is created again type the command:
esxcli network ip interface list

Press ALT+F2 to return to the ESXi DCUI and log in to disable the ESXi Shell.
Now we can configure the IP settings again via the DCUI.

Go to Configure Management Settings -> IPv4 Configuration and set the static IPv4 configuration

Hit Enter then Esc and Yes to restart the management network

Now we can retry the deployment via Cloud Builder. After this, the deployment continued successfully.
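If you want to script the "note the portgroup, then remove and re-add vmk0" part, the logic could look roughly like the sketch below. It only parses text and builds the command strings; it does not run anything on a host. The sample output is trimmed and illustrative (real `esxcli network ip interface list` output has more fields), and the helper names `portgroup_of` and `recreate_commands` are my own, not part of any VMware tooling.

```python
from typing import List, Optional

# Trimmed, illustrative sample of `esxcli network ip interface list` output.
SAMPLE_OUTPUT = """\
vmk0
   Name: vmk0
   MAC Address: 00:50:56:01:02:03
   Enabled: true
   Portset: vSwitch0
   Portgroup: Management Network
   MTU: 1500
"""


def portgroup_of(listing: str, interface: str = "vmk0") -> Optional[str]:
    """Return the portgroup of the given vmkernel interface, or None if not found."""
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("Name:"):
            # Remember which interface the following fields belong to.
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Portgroup:") and current == interface:
            return line.split(":", 1)[1].strip()
    return None


def recreate_commands(interface: str, portgroup: str) -> List[str]:
    """Build the remove/add commands from the post for the given interface."""
    return [
        f"esxcli network ip interface remove --interface-name={interface}",
        f'esxcli network ip interface add --interface-name={interface} -p "{portgroup}"',
    ]


if __name__ == "__main__":
    pg = portgroup_of(SAMPLE_OUTPUT)
    for command in recreate_commands("vmk0", pg):
        print(command)
```

Keep in mind that removing vmk0 drops the management connection, so the commands themselves still have to be run from the console as described above.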

Compute Manager not visible in NSX after Upgrade of the 3 Manager Appliances

After a successful NSX upgrade, the compute manager disappeared once the last step, the upgrade of the management plane, had finished. Let’s see how we can fix that!

When I try to add the vCenter, it says it is already registered. Let’s check with the API.

First, do an API GET in Postman to get the compute manager ID:

Output:

Now we have the compute manager id, we can check if it is registered and up:

Output:


As you can see the compute manager is registered and up, why is it not showing up in the UI?

Solution:

Login with the admin user by SSH, and run the following command.

start search resync inventory

Wait a few seconds and refresh the UI, now the Compute Manager is back!


Replace NSX-T expiring localmanager and globalmanager self-signed Certificates.

Recently we saw some warnings about expiring certificates in the NSX-T Global Manager and Local Manager.

When we clicked one of the alerts we got a small description and some API calls we can fire to apply new certificates.

In the Certificates overview (System > Certificates > Certificates), we could see that the certificates issued to the Local Manager and Global Manager were expiring. The certificate IDs also matched the ones in the alert (not the ones in my screenshots).

The API calls mentioned in the alert description are for renewing the certificate of the HTTP service (UI), not the Local/Global Manager certificates.
The VMware Docs don’t explain in much detail how to change these certificates; at least I couldn’t find it.

Try it yourself:
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-50C36862-A29D-48FA-8CE7-697E64E10E37.html

The only giveaway I could find was in step 6 (NSX Federation and the service type).

So before we can replace the certificates, we need to create new Self Signed Certificates for the Local Manager, and the Global Manager.

Create CSR on GM/LM:

To create a CSR (Certificate Signing Request) on the Global or Local Manager, go to System > Certificates > CSRs and click on “Generate CSR”.

For the Global Manager do this via the Global Manager Appliance and for the Local Manager use the Local Manager Appliance, or use the drop down on the top of the screen to choose between your Global Manager or Local Managers.

Fill in all the fields and hit the GENERATE button. The example below is for the Global Manager; for the Local Manager, just change the word global to local:

Now we can see a new CSR in the list. The next step is to self-sign the Global Manager CSR: select the CSR and under Actions choose “Self Sign Certificate for CSR”.

Choose your number of days:

Now we have a new self-signed certificate for the Global Manager in the certificates list. With this certificate we can replace the Principal Identity certificate for the Global Manager.

For the local manager certificates, follow the steps mentioned above on the local manager appliance.

Apply Self-Signed Certificate on the Global Manager

Before we can apply the self-signed certificate to the Global Manager we need to copy the certificate ID. Click on the ID, then select the whole ID in the pop-up and copy/paste it for later use:

Now we can Fire up Postman to apply the certificate by API:

  1. Change the ACTION drop down to POST
  2. Paste the following url to your Global Manager:

    Step 1 to 4

    5. Change to Body, select raw and from the drop-down choose JSON
    6. put the following JSON in the Body:

    {
    "cert_id": "<new cert id>",
    "service_type": "GLOBAL_MANAGER"
    }

    And hit the SEND button. If you get back Status 200 OK, the API call was successful:

    Apply Self-Signed Certificate on the Local Manager

    This procedure is almost the same as on the Global Manager. Go to the Local Manager Appliance and copy the new certificate id.

    Go to Postman:

    1. Change the ACTION drop down to POST
    2. Paste the following URL, pointing at your Local Manager:

      Check the old expiring certificate

      Since we are on version 3.1, we need to check via the API whether the old certificate has been released, so we know it isn’t used anymore before we delete it.
      Note: In newer versions you have a “Where Used” field in the certificate overview.

      Go to Postman:

      1. Change the action to GET
      2. Paste the following url to your Global Manager: (use the id of the expiring certificate)
        https://<global-manager>/api/v1/trust-management/certificates/dee8d78b-5e04-4deb-8d36-6b86f79f058b
        Set Authorization the same as the previous API Calls
      3. Select Body and set it to none
      4. Hit Send

      In the response, look for the used_by part. When it contains a node_id, the certificate is still in use by that node and can’t be deleted. When used_by is empty, you can delete the certificate in the UI. You can run the same check on the new certificate to see whether it is used by the same node.

          "used_by": [
              {
                  "node_id": "515f0642-b1eb-ef39-9acc-82f8760ab0b9",
                  "service_types": [
                      "GLOBAL_MANAGER"
                  ]
              }
          ],
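The same check is easy to script once you have the JSON response of the certificate GET call. The sketch below is a minimal example that only inspects an already-fetched response; it assumes the `used_by` layout shown above, and the function names `nodes_using` and `safe_to_delete` are hypothetical helpers of mine.

```python
import json
from typing import List, Tuple


def nodes_using(certificate: dict) -> List[Tuple[str, list]]:
    """Return (node_id, service_types) for every node that still uses the certificate."""
    return [
        (entry["node_id"], entry.get("service_types", []))
        for entry in certificate.get("used_by", [])
        if entry.get("node_id")
    ]


def safe_to_delete(certificate: dict) -> bool:
    """A certificate is only a delete candidate when no node references it."""
    return not nodes_using(certificate)


if __name__ == "__main__":
    # Response fragment as shown in the post.
    response = json.loads("""
    {
        "used_by": [
            {
                "node_id": "515f0642-b1eb-ef39-9acc-82f8760ab0b9",
                "service_types": ["GLOBAL_MANAGER"]
            }
        ]
    }
    """)
    print(nodes_using(response))
    print("safe to delete:", safe_to_delete(response))
```

I treat a missing `used_by` key the same as an empty list here; if in doubt, verify in the UI before deleting anything.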

      Sometimes a certificate won’t release itself, so let’s release the damn thing:

      Release a Certificate

      Please keep in mind that you should only release the certificate from the node_id if you are absolutely sure. If not, please raise a ticket with VMware Support.

      1. Log in to the manager with SSH as the admin user.
      2. Type st e and enter the root password; you are now at the shell.
      3. Use the certificate id and the node_id from the previous step.
      4. Use the following API call to release the certificate from the node_id:
        curl -k -X POST -H "Content-Type: application/json" -H 'X-NSX-Username:admin' -H 'X-NSX-Groups:superuser' -d '{"service_type":"API","node_id":"<node_id>"}'  "http://localhost:7440/nsxapi/api/v1/trust-management/certificates/<certificate-id>?action=release"
        For example, for a Global Manager certificate:
        curl -k -X POST -H "Content-Type: application/json" -H 'X-NSX-Username:admin' -H 'X-NSX-Groups:superuser' -d '{"service_type":"GLOBAL_MANAGER","node_id":"647defab-13c5-5g62-93e4-da4a345d666"}'  "http://localhost:7440/nsxapi/api/v1/trust-management/certificates/ea5771fb-161e-49dd-97d7-c1483d2790666691?action=release"
      5. This should do it. You can check the certificate again with the previous step and look for used_by, which should be empty now.
      

      If the used_by is empty, you can now safely remove the certificate.

      All the steps are almost identical for the Global and Local Manager; just change the service type from GLOBAL_MANAGER to LOCAL_MANAGER, etc.

      And the Alerts are now Resolved!

Design Consideration – For Bare Metal Edge with 4 PNIC in a NSX-T Federated Environment

During a failover test with the Bare Metal Edges we ran into an issue when pulling the plug on one of the TOR switches (TOR-LEFT). During that test all BGP sessions on both Bare Metal Edges went down, so there was no North-South routing anymore 🙁

Edge Setup:

The Bare Metal Edges were configured following the design guidelines from VMware:
https://nsx.techzone.vmware.com/resource/nsx-t-reference-design-guide-3-0#_Edge_Node_and_1

Identifying the problem:

So why this behaviour? And what happens when we pull the plug on the other TOR switch (TOR-RIGHT)?
After performing the test with TOR-RIGHT, the BGP sessions connected to TOR-LEFT stayed established. So does it have something to do with switch TOR-LEFT?

After checking the configuration on the TOR-LEFT switch, we didn’t identify anything that could cause this issue. But what could it be? The Edges were configured according to the VMware guidelines and were identical configuration-wise.

So going through the logs was the next step in the process, and I stumbled upon this part in the log file:

2022-10-17T10:37:08.578Z Update device fp-eth0 state to DOWN
2022-10-17T10:37:08.578Z Self Node 00363d34-fcdd-11ea-8e07-e4434ba66042 status changed from Up to Down (RTEP device down)

Could it have something to do with the federated setup (RTEP)? Is the RTEP only connecting over fp-eth0?

Cause:

Again I went through the setup, but this time I also checked the fp-eth0 connections to the switches. On both Bare Metal Edges, fp-eth0 was connected to TOR-LEFT. So when we pulled the plug on that switch, the RTEP went down, which led to all BGP sessions going down.

This is expected behavior according to VMware!

Solution:

The solution to this issue was pretty simple after we identified the cause. We swapped the connections on the second Bare Metal Edge, so the pNICs that were connected to TOR-LEFT are now on TOR-RIGHT and vice versa, the opposite of the first Bare Metal Edge.

RTEP down in Global Manager NSX-T UI

A while ago, after we upgraded NSX-T from version 3.1.3.6 to 3.1.3.7, we ran into an issue in the alarms section at one site. I still wanted to write a post about the issue and the solution/workaround:

Time to check the connection!
Log in to the Edges and grab the VRF ID of the RTEP TUNNEL.

Check the BGP and ping between the RTEP ip addresses on both sites.

As you can see all BGPs are established and the ping commands give a reply.
Let’s do another check from Postman:

Open Postman and fire a GET API call at the NSX Manager to grab the edge ID we need in the next API call:
API GET call:

https://<nsxmanager ip>/api/v1/transport-nodes/

Just select Basic Auth under the Authorization tab and fill in the Admin credentials.

Hit Send. When you get a reply in the Body, search for the edge name and the corresponding id.

Now we use this id to get the RTEP status:

GET https://<nsxmanagerip>/api/v1/transport-nodes/<edgenodeid>/inter-site/bgp/summary

Check the output and the return status for issues. As you can see in the example above, the BGP session to one of the peers is established.
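If you do this more often, the "search for the edge name, copy the id, build the next URL" part can be scripted. The sketch below works on an already-fetched transport-nodes response and makes no network call. It assumes the list response carries its entries under `results`, each with a `display_name` and an `id`; the sample data, the edge names and the manager hostname are made up for illustration.

```python
from typing import Optional


def find_edge_id(transport_nodes: dict, edge_name: str) -> Optional[str]:
    """Return the id of the transport node whose display_name matches edge_name."""
    for node in transport_nodes.get("results", []):
        if node.get("display_name") == edge_name:
            return node.get("id")
    return None


def bgp_summary_url(manager: str, edge_id: str) -> str:
    """Build the inter-site BGP summary URL used in the post."""
    return f"https://{manager}/api/v1/transport-nodes/{edge_id}/inter-site/bgp/summary"


if __name__ == "__main__":
    # Made-up sample in the assumed shape of GET /api/v1/transport-nodes/
    sample = {
        "results": [
            {"id": "11111111-aaaa-bbbb-cccc-000000000001", "display_name": "edge-01"},
            {"id": "11111111-aaaa-bbbb-cccc-000000000002", "display_name": "edge-02"},
        ]
    }
    edge_id = find_edge_id(sample, "edge-02")
    print(bgp_summary_url("nsxmanager.example.local", edge_id))
```

The returned URL can then be pasted into Postman with the same Basic Auth settings as before.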

Solution:

So it seems this is a known issue in version 3.1.3.7 in a setup with 3 manager nodes.

Only the manager node that generated the alarm can clear it from memory, which it does when it receives the remove-alarm message from the edge node. The alarm was resolved on one of the manager nodes, but the other nodes still showed it and kept it active.

The following workaround will remove the alarm: restart the proton service on ALL manager nodes.

– SSH with the admin user to the NSX-T manager nodes
– Execute the following commands:

stop service proton
start service proton

UPDATE: The issue is fixed in version 3.2.1

Upload MUB file fails with ansible-for-nsxt

While trying to upgrade the NSX-T environment via Ansible, we stumbled upon the issue that we couldn’t upload the MUB file to the NSX Manager.

Ansible Task:

- name: Upload upgrade file to NSX-T manager coordinator node from file
  vmware.ansible_for_nsxt.nsxt_upgrade_upload_mub:
    hostname: "{{ coordinator }}"
    username: "{{ username }}"
    password: "{{ password }}"
    validate_certs: False
    file: "{{ nsx_upgrade_mub_file }}"

This gave us the following error message:

fatal: [127.0.0.1]: FAILED! => {
    "changed": true,
    "invocation": {
        "module_args": {
            "file": "/root/hypervisor/upgrade_bundle/VMware-NSX-upgrade-bundle-3.2.1.1.0.20115686.mub",
            "hostname": "manager1",
            "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
            "port": 443,
            "timeout": 9000,
            "url": null,
            "username": "admin",
            "validate_certs": false
        }
    },
    "msg": "Error: string longer than 2147483647 bytes"

So it looks like the upload can’t handle files over 2 GB.

Honestly, my Python skills are a bit rusty, so I asked one of the developers in our team to help me out and see if we could get this fixed.

The 2 GB+ file size is the issue. You can find multiple references to the error, usually referring to httplib, urllib or ssl.
One solution is to use a streaming upload.
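To make the number in the error message less mysterious: 2147483647 is 2**31 - 1, the largest value of a signed 32-bit integer, which suggests the whole file was handed over as one buffer. The sketch below is not the module fix itself, just a small illustration of the idea behind streaming: check whether a payload exceeds that limit and, if so, read it in chunks. The names `MAX_SINGLE_BUFFER`, `needs_streaming` and `read_in_chunks` are my own.

```python
import io

# Largest signed 32-bit integer; the byte count from the error message.
MAX_SINGLE_BUFFER = 2**31 - 1


def needs_streaming(size_in_bytes: int) -> bool:
    """True when a payload is too large to be sent as one buffer."""
    return size_in_bytes > MAX_SINGLE_BUFFER


def read_in_chunks(file_obj, chunk_size: int = 1024 * 1024):
    """Yield the file content piece by piece instead of reading it all at once."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    print(MAX_SINGLE_BUFFER)
    # Tiny in-memory stand-in for the bundle, just to show the chunking.
    fake_bundle = io.BytesIO(b"abcdefg")
    for piece in read_in_chunks(fake_bundle, chunk_size=3):
        print(piece)
```

A streaming multipart encoder does essentially this for you while also adding the multipart framing, so the upload never has to hold the complete bundle in a single buffer.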

This is what we did to make the upload work.

Install the requests-toolbelt package

Edit nsxt_upgrade_upload_mub.py
NOTE: This will break the URL upload!
Add:

import requests
from requests_toolbelt.multipart import encoder
from requests.auth import HTTPBasicAuth 

Replace line 140 – 174 with:

        # Stream the bundle from disk as multipart data instead of loading it into memory at once
        session = requests.Session()
        with open(file_path, 'rb') as src_file:
            body = encoder.MultipartEncoder({
                "file": (src_file.name, src_file, "application/octet-stream")
            })
            headers = {"Prefer": "respond-async", "Content-Type": body.content_type}
            resp = session.post(mgr_url + endpoint, auth=HTTPBasicAuth(mgr_username, mgr_password),
                                timeout=None, verify=False, data=body, headers=headers)
            # The bundle id is hard-coded to 'latest' instead of being read from the response
            bundle_id = 'latest'  # resp['bundle_id']
            headers = dict(Accept="application/json")
            headers['Content-Type'] = 'application/json'
            try:
                # Poll the upload status until it reports SUCCESS or FAILED
                wait_for_operation_to_execute(mgr_url,
                    '/upgrade/bundles/%s/upload-status' % bundle_id,
                    mgr_username, mgr_password, validate_certs,
                    ['status'], ['SUCCESS'], ['FAILED'])
            except Exception as err:
                module.fail_json(msg='Error while uploading upgrade bundle. Error [%s]' % to_native(err))
            module.exit_json(changed=True, ip_address=ip_address,
                message='The upgrade bundle %s got uploaded successfully.' % module.params[mub_type])
        session.close()

NOTE: This will break the URL upload!

The response will show:

changed: [127.0.0.1] => {
    "changed": true,
    "invocation": {
        "module_args": {
            "file": "/upgrade_bundle/VMware-NSX-upgrade-bundle-3.2.1.1.0.20115686.mub",
            "hostname": "nsxmanager",
            "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
            "port": 443,
            "timeout": 9000,
            "url": null,
            "username": "admin",
            "validate_certs": false
        }
    },
    "ip_address": "<ip>",
    "message": "The upgrade bundle /upgrade_bundle/VMware-NSX-upgrade-bundle-3.2.1.1.0.20115686.mub got uploaded successfully."
}

Solution is also added to a github BUG report:

https://github.com/vmware/ansible-for-nsxt/issues/416

All Kudos to my colleague for fixing the issue!

VMware Usagemeter issue sending data after upgrade to 4.5.0.1

This week I upgraded a Usage Meter appliance from 4.5.0.0 to 4.5.0.1 with the in-place upgrade method.
The Usage Meter 4.5.0.1 patch release rolled out on May 23rd. This release addresses a major issue found in Usage Meter 4.5. For more information, I refer to the following blog:

https://blogs.vmware.com/cloudprovider/2022/05/usage-meter-4-5-0-1-why-is-it-needed-and-how-to-upgrade-to-it.html

The release notes can be found at:
https://docs.vmware.com/en/vCloud-Usage-Meter/4.5.0.1/rn/vmware-vcloud-usage-meter-4501-release-notes/index.html

We followed the upgrade guide provided by VMware:
https://docs.vmware.com/en/vCloud-Usage-Meter/4.5/Getting-Started-vCloud-Usage-Meter/GUID-AE5A81E1-097A-4EED-9A8E-8BF7E0B378A4.html?hWord=N4IghgNiBcIJYDsC0AHCYDGBTABAVxQHMAlAQQBEBREAXyA

After the reboot and a check that the upgrade was successful, we tried to send a test update to Usage Insight. The test send failed with the following error:

In the notifications we can find the following messages:

After a search through the VMware Knowledge Base I came across this article:

https://kb.vmware.com/s/article/82023

To test connectivity to vCloud Usage Insight by using the curl command:

curl -kv https://ums.cloud.vmware.com/um/api/v1/ping --proxy 192.168.1.50:8080

The response came back with HTTP status code 200, so that was OK.

Now we check the vCloud Usage Meter registration status in vCloud Usage Insight by using the curl command:

curl -kv "https://ums.cloud.vmware.com/um/api/v1/upload/registration?um=<UM UUID>" --proxy 192.168.1.50:8080

This response also came back with HTTP status code 200, so AOK, so far for the checks…

Let’s get in touch with GSS

As this was the advice all along in the first error message ….

The GSS engineer stated that there was an issue with the nginx JVM settings when using a proxy.
We had to add a line to the nginx.conf in the following directory, but before we change anything, let’s make a snapshot of the system in case we ruin everything.

Remark: Please contact GSS if you want support editing files, and always make a backup or in this case a snapshot before changing settings

Edit nginx.conf and add the following line somewhere around line 58.

This line points the proxy configuration at a dummy file. We are setting a dummy proxy configuration because we are hitting a known issue, which will be fixed in a future release.

jvm_options "-Dproxy_config=/tmp/vami-file";

Go down one dir to /opt/vmware/cloudusagemetering

Stop the NGINX service with the following command:

./scripts/stop.sh GW

Now start the service again:

Start NGINX service

Now get the status of all services:

You see a lot of errors, these can be ignored.

Now go to the VAMI UI and reset the proxy again:

Test

Now go to the Settings Page in the Usage Meter UI and Send an update to Usage Insight.

It works! You can also check the usagemeter in Insight: Go to https://ums.cloud.vmware.com/ui/
The last update should be after the issue was fixed!

Remark: Please contact GSS if you want support editing files, and always make a backup or in this case a snapshot before changing settings

NSX-T IP Discovery & Realized Bindings Issues

Last week I was at the VMware Tech Summit in Cork, Ireland. I attended a session about NSX-T troubleshooting. During this session a lot of issues came up that I had dealt with in the last year or so.

One of the issues was about the IP bindings of a VM to a segment. In our case a tenant manually edited the IPv4 address on the network interface of the VM, and after this the connection to this VM dropped.

DFW checked, routing checked, all was OK. After some digging we found out that the VM had 2 IP addresses in the realized bindings section on the logical switch. This view can be found in the Manager UI of the segment port.

How can you find these Realized bindings and Fix it?

To get to the right port, go to Segments and look for the segment to which the VM is connected. Once you have found it, click on the number you see beneath Ports.

This opens a window where you can find all ports connected to the segment. Copy the Segment Port Name.

Now search in the Search Bar for this Segment Port Name, and click the one with Resource Type Logical Ports.

This takes you to the Manager UI of this Logical port. You can always go through Networking -> then Switch to Manager UI in the upper right corner -> Select logical switches -> Search through the list for the right port.

Select Address Bindings. Here you can see the Auto Discovered Bindings, both with the current IP of the VM: one learned from VMware Tools and the other by ARP Snooping. But if you take a close look at the Realized Bindings, you can see a different IP learned by ARP Snooping. This was the original IP the VM had when it connected.

This can cause connection problems! In our case the whole routing was messed up and the traffic went out via the wrong Uplink.

We can fix this quickly with moving the entry with the old IP address to the Ignore Bindings:

It will take a few seconds to update the Realized Bindings with the new IP address learned by ARP Snooping.

After this the connection came up and the tenant was happy!
But this was nothing more than a quick fix. What if all tenants go mad and start manually changing their IP addresses in the OS…….

So why does the IP address stay in the Realized Bindings section and keep causing carnage?

By default, the discovery methods ARP snooping and ND snooping operate in a mode called trust on first use (TOFU). In TOFU mode, when an address is discovered and added to the realized bindings list, that binding remains in the realized list forever.

Can we modify that mode? Yes we can!

In NSX-T we use several profiles, one of those is the IP Discovery profile. This profile can be found in the Policy UI under Segments -> Segment Profiles

Create a new IP Discovery Profile and disable the TOFU setting. When you do this, TOFU changes to Trust On Every Use (TOEU). In TOEU mode, the discovered IP addresses are placed in the binding list and deleted when they expire. DHCP snooping and VMware Tools always operate in TOEU mode.
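To picture the difference between the two modes, here is a deliberately simplified toy model in Python. It is not how NSX-T implements IP discovery; it only mimics the behaviour described above: in TOFU mode a realized binding never goes away, in TOEU mode it is removed when it expires. The class name and the example addresses are made up.

```python
class RealizedBindings:
    """Toy model of a realized bindings list for one segment port."""

    def __init__(self, trust_on_first_use: bool):
        self.trust_on_first_use = trust_on_first_use
        self.addresses = []

    def discover(self, ip: str) -> None:
        """An address seen by ARP snooping is added to the list."""
        if ip not in self.addresses:
            self.addresses.append(ip)

    def expire(self, ip: str) -> None:
        """TOFU keeps the binding forever; TOEU drops it once it expires."""
        if self.trust_on_first_use:
            return
        if ip in self.addresses:
            self.addresses.remove(ip)


def simulate(trust_on_first_use: bool) -> list:
    """Tenant changes the VM IP from 192.0.2.10 to 192.0.2.20; the old one ages out."""
    bindings = RealizedBindings(trust_on_first_use)
    bindings.discover("192.0.2.10")  # original IP when the VM connected
    bindings.discover("192.0.2.20")  # IP after the manual change in the OS
    bindings.expire("192.0.2.10")    # old address is no longer seen
    return bindings.addresses


if __name__ == "__main__":
    print("TOFU:", simulate(True))
    print("TOEU:", simulate(False))
```

In the TOFU run the stale address stays in the list, which is the situation we had to clean up by hand with Ignore Bindings; in the TOEU run only the current address remains.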

Now we need to adjust the segment to use the new IP Discovery Profile. Go to the segment and click Edit.
Under Segment Profiles select the new TOFU profile, click Save and then Close Editing.

Now when a tenant manually changes the IP of the network interface, the old IP learned the first time by ARP Snooping is no longer present in the Realized Bindings section.

vSAN 7.0 Resync Throttling

When we upgraded to vSAN 7, immediately we faced an annoying error message in Skyline health which complained about the Resync Throttling.

This setting has been deprecated since 6.7 and can only be set via PowerCLI. Below are a few lines of code that set the value to 0 on all hosts in the cluster. After a vSAN host is patched, and sometimes even after a reboot, we see the warning again. Just run the script again and the warning is gone.

$vc = "yourVC"
Connect-VIServer $vc
$hosts = Get-VMHost
foreach ($esx in $hosts) {
    Write-Host "Displaying and Updating VSAN ResyncThrottle value on $esx"
    # Set the deprecated resync throttle back to 0
    Get-AdvancedSetting -Entity $esx -Name VSAN.DomCompResyncThrottle | Set-AdvancedSetting -Value 0 -Confirm:$false
    # Show the value after the change
    Get-AdvancedSetting -Entity $esx -Name VSAN.DomCompResyncThrottle
}

Please run scripts at your own risk! If you are not comfortable please contact VMware Support.

You can also check this KB article for another solution regarding the resync Throttling

https://kb.vmware.com/s/article/83742