When I added a host to an existing vSAN cluster via SDDC Manager, the task failed with the following error: “Found zero ssd devices for SSD cache tier”
To quickly fix this we need to mark the cache disk on the ESXi host as SSD. You can check the current value with the vdq -q command. As you can see in the picture below, the disk I want to use for the cache tier is marked with a value of “0”, so it is not recognized as an SSD drive.
In the past you had to mark the disk as SSD with SATP claim rules, but in versions 7.x and 8.x there is a new and simpler command to do this. Run the following ESXCLI command with the storage device ID and the -M option set to true (or false to revert the change) to mark the device as an SSD.
esxcli storage hpp device set -d naa.6000c299027de72c68de829e23455e88 -M true
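After running the command you can verify the result with vdq -q again (optionally limited to one device with -d); the IsSSD value for the disk should now show “1”. The device ID below is just the one from the example above:
vdq -q -d naa.6000c299027de72c68de829e23455e88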
In my lab I tried to deploy Aria Operations for Networks 6.12.1 (AON/vRNI) from Aria Suite Lifecycle 8.16 (ASLC/vRLCM). Before the deployment of AON I had successfully deployed other products:
Attempted to deploy Aria Operations for Networks 6.12.1, but it failed with the LCMVSPHERECONFIG1000016 error.
-----------------------------------------------------------------------------------------------------------
java.io.IOException: com.vmware.vim.binding.vmodl.fault.SystemError
at com.vmware.vrealize.lcm.drivers.vsphere65.vlsi.utils.ExceptionMappingUtils.mapAndThrowImportVAppExceptions(ExceptionMappingUtils.java:78)
at com.vmware.vrealize.lcm.drivers.vsphere65.deploy.impl.BaseOvfDeploy.importOvf(BaseOvfDeploy.java:713)
at com.vmware.vrealize.lcm.plugin.core.vsphere.tasks.DeployOvfTask.execute(DeployOvfTask.java:251)
at com.vmware.vrealize.lcm.automata.core.TaskThread.run(TaskThread.java:62)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
-----------------------------------------------------------------------------------------------------------
Error Code: LCMVSPHERECONFIG1000016 IO Exception occurred while performing the operation. Check the logs for more information. Unexpected ioexception occurred.
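The error itself doesn't say much, so the next stop is the engine log on the Aria Suite Lifecycle appliance. A simple way to follow it while retrying the request (assuming the default log location):
tail -f /var/log/vrlcm/vmware_vrlcm.log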
After a failed deployment of VCF 5.0, I was left with a vSAN datastore on the first host in the cluster, and this was blocking a retry of the deployment.
In this state the vsanDatastore cannot be deleted; if I try to delete it, the option is greyed out.
To delete the datastore and the partitions of the disks, we first need to SSH into the host and look at the vSAN cluster configuration.
We need the Sub-Cluster Master UUID, so copy it to the clipboard. To leave the cluster, the command is:
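For reference, a sketch of what this typically looks like on the host: esxcli vsan cluster get shows the Sub-Cluster Master UUID, and esxcli vsan cluster leave removes the host from the vSAN cluster.
esxcli vsan cluster get
esxcli vsan cluster leave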
During a new lab deployment of VCF 5.0 I ran into a small issue running the validation.
I deployed the hosts up front and made them available and unique before the validation, then ran the following commands to regenerate the certificates and restart the services:
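A sketch of the usual ESXi commands for this, run over SSH on each host (the standard certificate regeneration procedure, assuming nothing environment-specific):
/sbin/generate-certificates
/etc/init.d/hostd restart
/etc/init.d/vpxa restart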
After a failed VCF bring-up, I wanted to retry the bring-up. Luckily the error I encountered before was resolved, but I ran into another issue during the retry.
This time the issue was with the import of the SSH keys.
Going through some internal resources I stumbled upon the solution: since this is a nested lab environment on top of VCD, you have to reset the MAC address of the ESXi host.
During my latest deployment of VCF in my lab environment I ran into the following issue.
Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01. Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d
The error is pretty clear: the migration of vmk0 from the standard vSwitch to the Distributed vSwitch failed on esx02. I checked esx01, and on that host the migration was successful.
Trying to manually migrate vmk0 to the distributed vSwitch also ran into an error in vCenter. Right-click the dvSwitch -> Add and Manage Hosts -> Manage Host Networking -> Select esx02.
Click Next and leave the physical adapters as is, then click Next again. On the next screen click on “Assign Port Group” next to vmk0.
Click on ASSIGN next to the management portgroup.
Next, Next, Finish… the task runs and fails after a few seconds.
The cause is a MAC address conflict: ESXi takes the MAC address of the physical NIC for vmk0. By deleting and recreating the vmk0 interface you generate a new MAC address for vmk0.
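You can see the conflict on the host itself by comparing the MAC address of vmk0 with the MAC addresses of the physical NICs, both with standard ESXCLI commands:
esxcli network ip interface list
esxcli network nic list
If the MAC address listed for vmk0 matches one of the vmnics, you are looking at the conflict described above.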
Steps to check, delete and recreate vmk0 interface
Login via DCUI
Enable ESXi Shell
Press Alt+F1 to access the ESXi console and log in as root.
Type the command: esxcli network ip interface list
Make a note of the portgroup, in this case “Management Network”, and then remove vmk0 with the following command: esxcli network ip interface remove --interface-name=vmk0
When vmk0 is deleted, we can immediately create a new interface with the same name and portgroup. This is done with the following command: esxcli network ip interface add --interface-name=vmk0 -p "Management Network"
To check that vmk0 has been created again, type the command: esxcli network ip interface list
Press Alt+F2 to go back to the ESXi DCUI and log in to disable the ESXi Shell. Now we can configure the IP settings again via the DCUI.
Go to Configure Management Network -> IPv4 Configuration and set the static IPv4 configuration (this can also be done from the shell with ESXCLI, see the sketch after these steps).
Hit Enter, then Esc, and Yes to restart the management network.
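If you prefer to stay in the shell instead of the DCUI, the static IPv4 configuration can also be set with ESXCLI; the IP address, netmask and gateway below are placeholders:
esxcli network ip interface ipv4 set -i vmk0 -I <ip-address> -N <netmask> -t static
esxcli network ip route ipv4 add -n default -g <gateway>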
Now we can try to redeploy via Cloud Builder; after this the deployment went on successfully.
After a successful upgrade of NSX, right after the last step (the upgrade of the management plane), the compute manager disappeared. Let’s see how we can fix that!
When I try to add the vCenter again it says it is already registered, so let’s check with the API.
First do an API GET in Postman to get the compute manager ID:
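A GET like the following should list the compute managers and their IDs (the manager address is a placeholder):
GET https://<nsx-manager>/api/v1/fabric/compute-managers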
Output:
Now that we have the compute manager ID, we can check if it is registered and up:
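This is again a standard NSX API call; the ID placeholder is the compute manager ID returned by the previous request:
GET https://<nsx-manager>/api/v1/fabric/compute-managers/<compute-manager-id>/status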
Output:
As you can see, the compute manager is registered and up, so why is it not showing up in the UI?
Solution:
Log in with the admin user via SSH and run the following command.
start search resync inventory
Wait a few seconds and refresh the UI, now the Compute Manager is back!
Recently we saw some warnings about expiring certificates in the NSX-T Global Manager and Local Manager.
When we clicked one of the alerts we got a small description and some API calls we can fire to apply new certificates.
In the Certificates overview (System > Certificates > Certificates), we could see that the certificates issued to the Local Manager and Global Manager were expiring. The certificate IDs also corresponded to the ones in the alert (not the ones in my screenshots).
The API calls mentioned in the alert description are for renewing the certificate of the HTTP service (UI), not the Local/Global Manager certificates. The VMware docs don’t explain in good detail how to change these certificates; I couldn’t find it.
The only giveaway I could find was in step 6 (NSX Federation and the service type).
So before we can replace the certificates, we need to create new self-signed certificates for the Local Manager and the Global Manager.
Create CSR on GM/LM:
To create a CSR (Certificate Signing Request) on the Global or Local Manager go to System > Certificates > CSRs and click on “Generate CSR“.
For the Global Manager do this via the Global Manager appliance and for the Local Manager use the Local Manager appliance, or use the drop-down at the top of the screen to choose between your Global Manager or Local Managers.
Fill in all the fields and hit the GENERATE button; the example below is for the Global Manager. For the Local Manager just change the word global to local:
Now we can see a new CSR in the list. The next step is to self-sign the Global Manager CSR: select the CSR and under Actions choose “Self Sign Certificate for CSR”.
Choose your number of days:
Now we have a new Self-Signed certificate for the Global Manager in the certificates list, with this certificate we can replace the Principal Identity certificate for the Global Manager.
For the local manager certificates, follow the steps mentioned above on the local manager appliance.
Apply Self-Signed Certificate on the Global Manager
Before we can apply the self-signed certificate to the Global Manager we need to copy the certificate ID: click on the ID, then select the whole ID in the pop-up and copy it for later use:
Now we can fire up Postman to apply the certificate via the API:
Change the ACTION drop-down to POST
Paste the URL for your Global Manager (see the example call after these steps)
Set Authorization the same as in the previous API calls
Select Body and set it to none
Hit Send
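The call itself is the federation certificate API that step 6 of the docs hints at. A sketch of what I believe the request looks like for the Global Manager; for a Local Manager, change service_type to LOCAL_MANAGER and run it against the Local Manager appliance. The FQDN and certificate ID are placeholders:
POST https://<global-manager-fqdn>/api/v1/trust-management/certificates/<certificate-id>?action=set_pi_certificate_for_federation&service_type=GLOBAL_MANAGER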
If the certificate is used by a node, look for the used_by part. When there is a node_id, the certificate is still in use and can’t be deleted; if it is empty, you can delete the certificate in the UI. You can do the same check on the new certificate to see if it is now used by that node.
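Checking the used_by field is a simple GET against the certificate (the manager address and certificate ID are placeholders):
GET https://<nsx-manager>/api/v1/trust-management/certificates/<certificate-id>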
Sometimes a certificate won’t release itself, so let’s release the damn thing:
Release a Certificate
Please keep in mind that you should only release the certificate from the node_id if you are absolutely sure; if not, please raise a ticket with VMware Support.
Log in to the manager via SSH with the admin user.
Then type st e, enter the root password, and you are now at the root shell.
Use the certificate id and the node_id from the previous step:
Now use the following API call to release the certificate from the node_id: curl -k -X POST -H "Content-Type: application/json" -H 'X-NSX-Username:admin' -H 'X-NSX-Groups:superuser' -d '{"service_type":"API","node_id":"<node_id>"}' "http://localhost:7440/nsxapi/api/v1/trust-management/certificates/<certificate-id>?action=release"
This should do it. You can check the certificate again with the previous step and look for used_by; this should be empty now.
During a failover test with the Bare Metal Edges we ran into an issue when pulling the plug on one of the TOR switches (TOR-LEFT). During that test all BGP sessions on both Bare Metal Edges went down, so no North-South routing anymore 🙁
So why this behaviour? And what happens when we pull the plug on the other TOR switch (TOR-RIGHT)? After performing the test with TOR-RIGHT, the BGP sessions connected to TOR-LEFT stayed established. So it has something to do with switch TOR-LEFT?
After checking the configuration on the TOR-LEFT switch we didn’t identify anything that could cause this issue. But what could it be? The Edges were configured according to VMware guidelines and were identical configuration-wise.
So going through the logs was the next step in the process, and I stumbled upon this part in the log file:
2022-10-17T10:37:08.578Z Update device fp-eth0 state to DOWN
2022-10-17T10:37:08.578Z Self Node 00363d34-fcdd-11ea-8e07-e4434ba66042 status changed from Up to Down (RTEP device down)
Can it have something to do with the federated setup (RTEP)? Is the RTEP only connecting over fp-eth0?
Cause:
Again I went through the setup, but now I also checked the fp-eth0 connections to the switches. On both Bare Metal Edges, fp-eth0 was connected to TOR-LEFT. So when we pulled the plug on that switch, it triggered the RTEP going down, which led to all BGP sessions going down.
This is expected behavior according to VMware!
Solution:
The solution to this issue was pretty simple after we identified the cause. We swapped the connections on the second Bare Metal Edge, so the pNICs that were connected to TOR-LEFT are now on TOR-RIGHT and vice versa, the opposite of the first Bare Metal Edge.
A while ago we ran into an issue in the alarms section at one site, after we did the upgrade from NSX-T version 3.1.3.6 to 3.1.3.7. I still wanted to do a post about the issue and the solution/workaround:
Time to check the connection! Log in to the Edges and grab the VRF ID of the RTEP tunnel.
Check the BGP sessions and ping between the RTEP IP addresses on both sites.
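For reference, roughly the Edge CLI commands I would use for this check; the VRF ID and the remote RTEP IP address are placeholders:
get logical-routers
vrf <vrf-id>
get bgp neighbor summary
ping <remote-rtep-ip>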
As you can see all BGPs are established and the ping commands give a reply. Let’s do another check from Postman:
Open Postman and fire a GET API call to the NSX Manager to grab the edge ID we need in the next API call:
https://<nsxmanager ip>/api/v1/transport-nodes/
Just select Basic Auth under the Authorization tab and fill in the Admin credentials.
Hit Send; when you get a reply in the Body, search for the edge name and the corresponding ID.
Now we use this id to get the RTEP status:
GET https://<nsxmanagerip>/api/v1/transport-nodes/<edgenodeid>/inter-site/bgp/summary
Check the output and the Return Status for issues; as you can see in the example above, the BGP session to one of the peers is established.
Solution:
So it seems this issue is known in version 3.1.3.7 in a setup with 3 manager nodes.
Only the manager node that generated the alarm can clear it from memory, and only when it receives the remove-alarm event from the edge node. The alarm was resolved on one of the manager nodes, but the other nodes were still showing it and keeping it active.
The following workaround will remove the alarm: restart the proton service on ALL manager nodes.
SSH with the admin user to the NSX-T manager nodes and execute the following commands:
stop service proton
start service proton
UPDATE: The issue is fixed in version 3.2.1.