VCF 5.0 – Remove vSAN datastore after VCF deployment failed

After a failed deployment of VCF 5.0, i was left with a vSAN Datastore on the first host in the cluster, and this was blocking a retry of the deployment.

In this state the vsanDatastore is unable to be deleted. If I try to delete it, the option is greyed out.

To delete the datastore and the partitions of the disks, we first need to SSH into the host and get the cluster.

We need the Sub-Cluster Master UUID, copy it to the clipboard. To leave the cluster the command is:

esxcli vsan cluster leave -u <Sub-Cluster Master UUID>

The ESXi UI now shows that the datastore is gone.

Let’s check the cli, if there still is a vSAN cluster or storage:

esxcli vsan cluster list
esxcli vsan storage list

No vSAN clusters or storage is configured on the host, after this the retry was successfully and the VCF management cluster was deployed.

VCF 5.0 – Error connecting to ESXi Host SSL certificate common name doesn’t match ESXi FQDN.

During a new lab deployment of VCF 5.0 i ran in to an small issue running the validation.

I deployed the hosts up front and made them available and unique before the validation. ran the following command to regenerate the certs and restart the services:

/sbin/generate-certificates
/etc/init.d/hostd restart && /etc/init.d/vpxa restart

Next i wanted to see what Common Name (CN) was on the certificate:

As you can see the Common Name (CN) contains only the hostname.

Next i changed the hostname on the ESXi hosts to have the complete FQDN.

Checked the hostname on the cli, and regenerated the certificates again:

After the regeneration of the certificate on the Host, you have to restart the services:

/etc/init.d/hostd restart && /etc/init.d/vpxa restart

Check the Certificate in the browser and now the Common Name had the FQDN in it, and the validation finished successfully.


VCF 5.0 LAB Deployment Issue – Failed to migrate vmnics of ESXi host to Distributed vSwitch

During my latest deployment of VCF in my lab environment I ran in to the following issue.

Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d

The error is pretty clear, the migration of vmk0 from the standard vSwitch to the Distributed vSwitch failed on esx02. I checked esx01 and on this host the migration was successfull.

I tried to manually migrating the vmk0 to the distributed vSwitch also ran in to an error in vCenter.
Right-click dvSwitch -> Add and Manage Hosts -> Manage Host Networking -> Select esx02

Click Next and leave the physical adapters as is, click next again.
On the next screen click on “Assign Port Group” next to vmk0.

Click on ASSIGN next to the management portgroup

Next, Next, Finish…..Task is running and fails after a few seconds.

Checking the Task Details on the ESX host:

After some investigation and searching internally within VMware resources and also ran into this blog article: https://mhvmw.wordpress.com/2023/03/17/issue-with-nested-vcf-4-5-deployment-lab-only/

it is a MAC address conflict when the esxi takes the mac of the physical nic for vmk0. 
By deleting and recreating the vmk0 interface you generate a new MAC address for vmk0.

Steps to check, delete and recreate vmk0 interface

Login via DCUI

Enable ESXi Shell

Next, Click ALT+F1 to access ESXi console and login as root.

Type the command:
esxcli network ip interface list

Make a note of the portgroup, in this case “Management network” and then remove the vmk0 with the following command:
esxcli network ip interface remove –-interface-name=vmk0

When vmk0 is deleted, we can immediately create a new interface with the same name and portgroup. This is done by the following command:
esxcli network ip interface add -–interface-name=vmk0 -p “Management Network”

To check if vmk0 is created again type the command:
esxcli network ip interface list

Click ALT+F2 to access ESXi DCUI and login to disable the ESXi shell.
Now we can configure the IP settings again via the DCUI

Go to Configure Management Settings -> IPv4 Configuration and set the static IPv4 configuration

Hit Enter then Esc and Yes to restart the management network

Now we can try to redeploy via cloudbuilder, after this the deployment went on succesfully

Test Case – Cloud Director NSX-T Role Based Access

Recently i got the question if in one Cloud Director Tenant (Organization) Granular Role Based Access Control and separation of rights can be configured between multiple teams within that Organization.

Details:

In our Test case Team-A is responsible for Org VDC A and can only manage and view the Edge GW resources (networks, Edge Gateways) within that VDC-Group. Team-B responsible for Org VDC B and can manage and view the all resources in all VDC-Groups, Except the Edge GW in Org VDC A. This Edge GW can only be managed by the Team-A.

Also because of some Tenant requirements the T0 (VRF) is also split between Internet and Customer specific. You can read more about this setup in an this post.

Requirements:

  • Separation of Rights between Org VDCs
  • Shared networks between Org VDCs
  • Team-A Can only manage the Edge GW in ORG VDC A
  • Team-B can manage all resources in both ORG VDCs except the Edge GW in ORG VDC A

Setup:

  • One Provider VDC (vCenter)
  • One Organization in Cloud Director (Tenant1)
  • Two Org VDC connected to the same Provider VDC (ORG VDC A & ORG VDC B)
  • Two Data Center Groups (VDC Group A & VDC Group B)
  • Two Edge GW (Edge A connected to VDC Groupp A & Edge B connected to VDC Group B)
  • Tenant Acces Role Team-A
  • Tenant Acces Role Team-B

Datacenter Groups

From version 10.2, VMware Cloud Director supports Data Center Group networking backed by NSX-T Data Center.

A Data Center Group acts as a Cross-VDC router that provides centralized networking administration, egress point configuration, and east-west traffic between all networks within the group. 

Using Data Center Groups, we can share organization networks across various ORG VDCs. To do so we first group the virtual data centers, then create a VDC network that is scoped to the Data Center Group.
A data center group can contain between one and 16 virtual data centers that are configured to share multiple egress points. 

We need to created 2 Data Center Groups and connect them to the participating VDC & Edges

VDC Group A -> ORG VDC A (24-2 in picture below) & Edge A
VDC Group B -> ORG VDC B (24 in picture below) & Edge B

Roles

By default, organization VDCs are shared with all users and groups that have a role which includes the Allow Access to All Organization VDCs right.

As an Organization Administrator, you can limit the access to each of the organization VDCs in your organization to specific users and groups.

Our organization has multiple organization VDCs and we want to have them managed separately, so create a custom role that would function as an organization VDC administrator and assign it to specific users or groups within your organization, providing them with access only to a specific VDC’s compute and networking resources.

For Team-B we can use the predefined role Organization Administrator. In this role the following right is allowed: Allow Access to All Organization VDCs

This permission is exactly what is tells you it does, giving you permission to ALL Organizations VDC in the Organization. So with this role we are able to manage VM’s, networks, etc in all the Organization VDC’s.

Exactly what we need for Team-B.

For Team-A we need to make a new role with more granular permissions. Create a new role and exclude the Allow Access to All Organization VDCs right. Set the rest of the permissions to view and manage the Edges and Networks.

Publish both Roles to the Tenant and create 2 users in the Tenant.

Limit Access to ORG-VDC

Now we need to limit the access to the Org VDC, on the Virtual Data Center dashboard screen, click the card of the virtual data center that you want to limit access to.

Under Settings, click Sharing, The list of users and groups within the organization that have access to the VDC appears. To change the access settings to the organization VDC, click Edit.

Select Specific Users and Groups, From the Users list, select the users that you want to provide with access to the VDC, same procedure if you are using groups.

So for ORG VDC A select Team-A, Team-B already has access to all ORG VDC because of the Allow Access to All Organization VDCs right.

To share the VDC with the selected users and groups, click Share.
At this moment Team-A can only view and manage Edges and Networks in ORG VDC A, and Team B can view and manage all resources in both ORG VDCs, also the Edge in ORG VDC A. if we need to get that sorted out we need to created roles and groups for every thinkable Resource set (Edge Admin, VM Admin, DFW Admin, etc) in every ORG VDC.

But another requirement in this Test Case is the Shared Network between ORG VDCs.
For this requirement it is needed to add the other VDC to the participating VDCs in the Data Center Group.

As soon as you configure this Team-A (which can only see the Edge-A in ORG VDC A), immediately can see the Edge-B under ORG VDC B. A no go in our case. The rights are distributed very horizontal. So as soon as you have multiple participating VDC in your Data Center group the team that was restricted to viewing and managing the resources in ORG VDC A, can now also view and manage the resources it is permitted to in ORG VDC B.

Conclusion

For this Test Case the outcome was negative as we needed shared networks between ORG VDCs.
If the sharing of networks is not needed, you set a very Granular RBAC model, but keep in mind when you set the Allow Access to All Organization VDCs right in a role, the users/groups that have this role are allowed to show all resources they are eligible to in All Organization VDCs

Cloud Director NSX-T – Tenant Design Scenarios with 2 Edge Gateways

Half way december i switched to another team at my current employer, and got my hands dirty with Cloud director, NSX-T and AVI. As this was my first real hands-on with VMware Cloud Director.
I was given the task to investigate some scenarios in which a tenant is given a second Edge Gateway, for the separation of Internet and Customer Networks traffic.

Before describing the scenarios, I’ll assume you have basic NSX-T, routing and VCD knowledge.

Current Tenant Setup

In the current setup a tenant is given:

  • Cloud Director (10.3): Organization (example: Tenant1)
  • Cloud Director: Org VDC
  • NSX-T (3.2): VRF based tier-0 gateway for both Internet VRF and Customer VRF connected to Parent T0
  • Cloud Director NSX-T: 1 Edge Gateway (T1)
  • Clloud Director NSX-T: 1 VDC Group
  • If Load balancing is used: dedicated AVI Service Engine Group

Because there is a 1:1 relation between the T0 and the Edge Gateway (dedicated T0), route advertisement of connected tenant networks is available.

The traffic for both the customer networks and Internet is flowing through the same Tenant Edge gateway and VRF based tier-0 gateway.

Scenario 1 – 2 Edge Gateways Connected to 1 Shared T0.

The first scenario I tested is based on a Shared T0 (in VCD) and 2 Edge Gateways (1 for Internet and 1 for Customer Networks).

Setup:

  • Cloud Director: Organization (example: Tenant1)
  • Cloud Director: Org VDC
  • NSX-T: VRF based tier-0 gateway for both Internet VRF and Customer VRF connected to Parent T0 (parent T0 is shared between 5 Tenants)
  • Cloud Director NSX-T: Set T0 to shared
  • Cloud Director NSX-T: two Edge Gateway (T1) connected to the same T0 (shared)
  • Cloud Director NSX-T: two VDC Groups (Data Center Groups)
  • If Load balancing is used: Shared AVI Service Engine Group (dedicated as described in current setup can also be used).

The downside of a Shared T0 is that route advertising of Tenant network connected to the Edge Gateway (T1) isn’t available to the T0.

Tenant VMs can connect to the internet or customer networks by using NAT and firewall rules.
SNAT rules need to be created for outbound traffic and DNAT rules inbound traffic.
Only use private IP spaces can be used for tenant networks connected to the Edge Gateway (T1).

Also because we use two Edge Gateways in an Org VDC we need to create 2 VDC Groups (Data Center Groups), because there is a restriction of 1 Edge Gateway per VDC Group (Data Center Group), two Datacenter Groups also mean 2 Distributed Firewalls to manage.

If you need connection between a VM in VDC Group 1 and a VM in VDC Group 2, you can create a network that will be spanning across both VDC groups.

In this scenario we still have the BGPs for both Internet and Customer configured on 1 VRF based tier-0 gateway, so this is not completely dedicated. And it is giving a lot of extra effort configuring the SNAT and DNAT.

We also worked out a scenario if the tenant is using AVI. In the normal setup the tenat is given a dedicated Service Engine Group. We also discovered the option to share the Service Engine Group between the Internet and Customer Networks. This option is a valid solution as the AVI per Default configures separates the traffic based on VRF in the Service Engines.

Scenario 2 – 2 Edge GW with dedicated T0

For the 2nd scenario we decoupled also the VRF Based T0 and creating again a 1:1 relation (dedicated T0) between the Edge Gateway and the VRF Based T0’s. Because of this 1:1 relationship route advertisements are available to the T0.

Setup:

  • Cloud Director: Organization (example: Tenant1)
  • Cloud Director: Org VDC
  • NSX-T: VRF based tier-0 gateway for Internet VRF connected to Parent T0
  • NSX-T: VRF based tier-0 gateway for Customer VRF connected to Parent T0
  • Cloud Director NSX-T: Set T0 to be dedicated
  • Cloud Director NSX-T: two Edge Gateway (T1) connected to the dedicated T0
  • Cloud Director NSX-T: two VDC Groups (Data Center Groups)
  • If Load balancing is used: Dedicated AVI Service Engine Group (because no design Change is needed for current model)

If you need connection between a VM in VDC Group 1 and a VM in VDC Group 2, you can created a network that will be spanning across both VDC groups.

The downside of the Scenario is that there are a lot of extra resources needed, and these resources will be billed to the tenant.

Conclusion

We still wanted a Design that met the customer requirements and also is easy to implement in the current setup. Which already in use by several tenants and was the default tenant setup when the platform was initially designed.

In Scenario 1 the Shared setup of the shared VRF based T0 doesn’t met the Customer requirements of separating Customer Networks and Internet traffic. Also the lack of the route advertisements was an issue for the Tenant.

In Scenario 2 where we separated all parts of the setup, all the Tenant requirements were met. The downside of this Scenario is that because all parts of the setup are decoupled, the resources a tenant needs using this setup are doubled. This means:

  1. Double Billing
  2. Double Resources, check the Config Maxs
  3. Increases operational effort

Scenario 2 is basically a copy of the current scenario, so the initial setup will cost less time, where Scenario 1 will have some design changes (shared T0, no route advertisments, SNAT/DNAT).

This design was based upon some requirements of a tenant, i tried to take in account that we also van use this for other tenants with the same requirements.

I loved investigating these scenarios, and discovering Cloud Director! Watch out for the next post about a Test Case with RBAC on Cloud Director!

Thank you for reading!

I also would like to mention the following blog article that helped me creating this post, Thanks Daniël!

https://blog.zuthof.nl/2020/06/30/cloud-director-nsx-t-inter-tenant-routing/


VMware Usagemeter issue sending data after upgrade to 4.5.0.1

This week i upgraded an usagemeter from 4.5.0.0 to 4.5.0.1 with the inplace upgrade method.
Usage Meter 4.5.0.1 patch release rolled out on May 23rd. This release addresses a major issue found in Usage Meter 4.5. For more information i refer to the following blog:

https://blogs.vmware.com/cloudprovider/2022/05/usage-meter-4-5-0-1-why-is-it-needed-and-how-to-upgrade-to-it.html

The release notes can be found at:
https://docs.vmware.com/en/vCloud-Usage-Meter/4.5.0.1/rn/vmware-vcloud-usage-meter-4501-release-notes/index.html

We followed the upgrade guide provided by VMware:
https://docs.vmware.com/en/vCloud-Usage-Meter/4.5/Getting-Started-vCloud-Usage-Meter/GUID-AE5A81E1-097A-4EED-9A8E-8BF7E0B378A4.html?hWord=N4IghgNiBcIJYDsC0AHCYDGBTABAVxQHMAlAQQBEBREAXyA

After the reboot and check if the upgrade was successful we tried to send a test update to the Usage Insight. The test send of data failed with thw following error:

In the notifications we can find the following messages:

After a search through the VMware Knowlegde Base I came across this article:

https://kb.vmware.com/s/article/82023

To test connectivity to vCloud Usage Insight by using the curl command:

curl -kv https://ums.cloud.vmware.com/um/api/v1/ping --proxy 192.168.1.50:8080

The response came back with HTTP status code 200, so that was OK.

Now we check the vCloud Usage Meter registration status in vCloud Usage Insight by using the curl command:

curl -kv https://ums.cloud.vmware.com/um/api/v1/upload/registration?um=<UM UUID> --proxy 192.168.1.50:8080

This response also came back with HTTP status code 200, so AOK, so far for the checks…

Let’s get in touch with GSS

As this was the advice all along in the first error message ….

The GSS engineer stated that there was an issue with the nginx jvm settings when using a proxy.
We had to add a line to the nginx.conf in the following directory, but before we change anything lets make a snapshot of the system in case we ruin everything.

Remark: Please contact GSS if you want support editing files, and always make a backup or in this case a snapshot before changing settings

Edit the nginx.conf and add the follwing line somewhere around line 58,

This line set a dummy file for proxy configuration., we are setting a dummy proxy configuration as we are hitting a known issue and this will be fixed in a future release.

jvm_options "-Dproxy_config=/tmp/vami-file";

Go down one dir to /opt/vmware/cloudusagemetering

Stop NGINX service with the follwing command:

./scripts/stop.sh GW

Now start the service again:

Start NGINX service

Now get the status of all services:

You see a lot of errors, these can be ignored.

Now go to the VAMI UI and reset the proxy again:

Test

Now go to the Settings Page in the Usage Meter UI and Send an update to Usage Insight.

It works! You can also check the usagemeter in Insight: Go to https://ums.cloud.vmware.com/ui/
The last update should be after the issue was fixed!

Remark: Please contact GSS if you want support editing files, and always make a backup or in this acse a snapshot before changing settings