NSX-T Upgrade Issue – Blank Screen in UI

During an upgrade of NSX-T from 3.1.3.6 to 3.1.3.7, I came across an issue in the UI.
When I clicked the Upgrade button, the screen stayed blank and showed no data. Sometimes, after a wait of half an hour or more, the screen came through and I could proceed with the upgrade.

This is of course not the way it should work, so I wanted to get rid of the issue.

Check Manager Cluster Nodes

First I wanted to check whether all the cluster nodes were stable and the services were running fine, so I ran the following command on all three cluster nodes:

get cluster status

All servers seemed to be running fine and didn't show any anomalies. Next I checked whether an old upgrade plan was stuck:

get node upgrade status
% Post reboot node upgrade is not in progress

No luck with that either, so next I tested starting the upgrade from another manager node.

For that to be possible, I needed to execute the following command on the manager node we want to become the orchestrator node:

set repository-ip

But after testing all nodes, no luck at all. The UI still showed a blank screen on the Upgrade page.

Time to Get Support (Cause)

We raised an SR with VMware and got feedback within a few hours.
The issue was probably caused by an inconsistent Corfu DB, possibly triggered by an action we performed in the past: the redeployment of a Manager Node after a failure.

You can identify a possibly inconsistent Corfu DB by a high and increasing EPOCH number in /var/log/corfu/corfu-compactor-audit.log:

2022-05-27T10:53:35.446Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for fc2ada82-3ef8-335a-9fdb-c35991d3960c, entries(0), cpSize(1) bytes at snapshot Token(epoch=2888, sequence=1738972197) in 65 ms

2022-05-27T11:05:21.346Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for fc2ada82-3ef8-335a-9fdb-c35991d3960c, entries(0), cpSize(1) bytes at snapshot Token(epoch=2921, sequence=173893455) in 34 ms
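To see the epoch progression at a glance, you can pull the epoch values straight out of the log. A minimal sketch, assuming the log format shown above (adjust the pattern if your version logs differently):

grep -o 'epoch=[0-9]*' /var/log/corfu/corfu-compactor-audit.log | tail -n 20

A healthy cluster keeps the epoch more or less stable; a steadily climbing value, like the jump from 2888 to 2921 in the two entries above, is the red flag.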

Solution

Redeploy the manager nodes one by one.

So here we go:

First we need to retrieve the UUID of the node we want to detach from the cluster.

get cluster status
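The output lists every cluster group (DATASTORE, CLUSTER_BOOT_MANAGER, CONTROLLER, MANAGER, POLICY, HTTPS) with its members, and the member UUIDs are what we need here. Abridged, it looks roughly like this (all values below are made-up placeholders):

Cluster Id: 9a2f6c1e-8b3d-4e5f-a1b2-c3d4e5f6a7b8
Overall Status: STABLE

Group Type: MANAGER
Group Status: STABLE
Members:
    UUID                                  FQDN                   IP             STATUS
    5c1d3f6e-0a2b-4c8d-9e1f-7a6b5c4d3e2f  nsxmgr-01.corp.local   192.168.10.11  UP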

Next, from another cluster node, run the command to detach the failed node from the cluster:

detach node failed_node_uuid
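For example, using the made-up UUID from the sketch above:

detach node 5c1d3f6e-0a2b-4c8d-9e1f-7a6b5c4d3e2f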

The detach process might take some time. When it finishes, get the status of the cluster and check that only two nodes are present in the cluster.

get cluster status

The Manager Node is now detached, but the VM is still present in the vSphere inventory. Power it down and delete the VM. You can of course keep it, but we are going to deploy a new node with the exact same parameters, FQDN and IP, so in that case it is best to disconnect its network interfaces.

Now we can deploy a new Manager Node. We can do this in two ways.

1. From the UI

We can use this method if a compute manager is configured on which the Manager Node can be deployed.

Navigate to System > Configuration > Appliances and click Add NSX Appliance.

Fill in the hostname, IP/DNS and NTP settings, and choose the deployment size of the appliance.
In our case this is Large. Click Next.

Next, fill in the configuration details for the new appliance and hit Next.

Then fill in the credentials, enable SSH and root access, and hit Install Appliance.

Now be patient while the appliance is deployed in the environment.

When the new appliance has deployed successfully, wait until all services become stable and all lights are green, then check the cluster status on the CLI of the managers with:

get cluster status

If all services are stable and running on every node, you can detach the next one in line and start over until all appliances are redeployed.

2. Deploy with OVA

If you can't deploy the new appliance from the UI, you can build it using the OVA file. Download the OVA file from the VMware website:

https://customerconnect.vmware.com/downloads/details?downloadGroup=NSX-T-3137&productId=982

and start the OVA deployment in vCenter.
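If you prefer the command line over the vCenter wizard, the same deployment can be scripted with VMware's ovftool. This is only a sketch: the OVA filename, the vi:// target and the datastore/network/credential values are placeholders, and the exact property and deployment option names can differ per OVA version, so probe them first by running ovftool against the OVA without any options:

ovftool --name=nsxmgr-01 --deploymentOption=large \
  --datastore=DS01 --network="Management" \
  --acceptAllEulas --powerOn --diskMode=thin \
  --allowExtraConfig --X:injectOvfEnv \
  --prop:nsx_hostname=nsxmgr-01.corp.local \
  --prop:nsx_ip_0=192.168.10.11 \
  --prop:nsx_netmask_0=255.255.255.0 \
  --prop:nsx_gateway_0=192.168.10.1 \
  --prop:nsx_dns1_0=192.168.10.2 \
  --prop:nsx_ntp_0=192.168.10.2 \
  --prop:nsx_passwd_0='SuperSecret1!' \
  --prop:nsx_cli_passwd_0='SuperSecret1!' \
  --prop:nsx_isSSHEnabled=True \
  --prop:nsx_allowSSHRootLogin=True \
  nsx-unified-appliance-3.1.3.7.ova \
  'vi://administrator@vsphere.local@vcenter.corp.local/DC01/host/Cluster01'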

Select the compute resource:

Review the details and go on to the configuration part:

Select the appropriate deployment size:

Select the Storage where the appliance needs to land:

Next select the management network:

And customize the template by filling in the passwords for the accounts, IP details, etc.

Hit Next and review the configuration before you deploy the appliance!

When the OVA has deployed successfully, power on the VM and wait until it has booted completely; an extra reboot can be part of this.

Log in to a cluster node of the NSX Manager cluster and run the following command to get the cluster thumbprint. Save this thumbprint; we need it later on.

get certificate api thumbprint

And run the get cluster config command to get the cluster ID:

get cluster config

Now open an SSH session to the new node and run the join command to join the new node to the existing cluster.

join <Manager-IP> cluster-id <cluster-id> username <Manager-username> password <Manager-password> thumbprint <Manager-thumbprint>
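For example, with made-up placeholder values (use your own manager IP, cluster ID, credentials and the thumbprint you saved earlier):

join 192.168.10.11 cluster-id 9a2f6c1e-8b3d-4e5f-a1b2-c3d4e5f6a7b8 username admin password 'SuperSecret1!' thumbprint <saved-thumbprint>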

When the join operation is successful, wait for all services to restart.

You can check the cluster status in the UI: select System > Appliances and check whether all services are up.

Check the cluster config on the manager nodes by running:

get cluster config
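The same status is also exposed through the NSX-T REST API, which can be handy for scripting the checks between node redeployments. A quick sketch with curl (hostname and user are placeholders; -k skips certificate verification, so only use it in a lab):

curl -k -u admin 'https://nsxmgr-01.corp.local/api/v1/cluster/status'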

Conclusion

When you have an inconsistent Corfu DB, in some cases the redeployment of all manager nodes can be the solution. Be aware that you should detach only one node at a time, redeploy its replacement, and then move on to the next. Always keep two or more nodes in the cluster to keep it healthy.

Resize an NSX-T Edge VM

Recently I needed to resize the Edge VMs because we had a memory issue, and the upgrade to NSX-T version 3.1.3.7 did not solve it. The already deployed Edge VMs were Medium size, which was not enough for the current load. The Edge VMs in the cluster are only used for T1, have no stateful services, and are only being used for DR.

The current NSX version has four form factors, from Small to Extra Large. In this post we are going from a Medium-sized Edge Node VM to a Large-sized Edge Node VM.
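For reference, the approximate resources per form factor in NSX-T 3.x (double-check the VMware documentation for your exact version):

Small: 2 vCPU, 4 GB RAM, 200 GB disk
Medium: 4 vCPU, 8 GB RAM, 200 GB disk
Large: 8 vCPU, 32 GB RAM, 200 GB disk
Extra Large: 16 vCPU, 64 GB RAM, 200 GB disk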

The resize can also be done from an Edge Node VM to bare metal or vice versa, although it is not advised to mix them in the same edge cluster.

Create a New Edge Node VM

Navigate to System > Fabric > Nodes > Edge Transport Nodes > Add Edge Node

Fill in the new name and FQDN, and choose the Large form factor:

Click Next and fill in all credential details:

Click Next again and fill in the deployment details:

In the next screen, configure the node settings such as IP, gateway, DNS and NTP, and select the correct management interface. Please check that the new Edge VM is added to DNS before deploying.

In the following screen we will configure the NSX switches (N-VDS); keep the N-VDS names identical to the other Edge VMs. We encountered a bug when selecting the interfaces during the configuration.

We have distributed virtual port groups configured as trunks for the uplinks of the Edge Node VMs, but for some reason these port groups were not showing up in the list. As a workaround we selected a VLAN-backed segment as a temporary fix; later on we will reset the uplinks to the right interfaces.

Fill in the rest of the fields; the key is to keep the configuration identical to the Edge Node VM we are going to replace.

Hit the Finish button and wait until the node status shows Up and the configuration state shows Success. The Edge Node VM is now deployed successfully.

Reconfigure Uplinks

Now we have to reconfigure the uplinks of the N-VDS to the correct distributed virtual port groups.

Select the newly deployed Edge Node VM and hit Edit.

As you can see, the uplinks are not configured to an interface. Select interfaces for both uplinks:
Uplink-1 > EdgeVM-Uplink-1-Trunk01
Uplink-2 > EdgeVM-Uplink-1-Trunk02

But when we hit Save, we get an error:

Again we have a workaround for this issue: log in to another node in the NSX Manager cluster and try again to configure the right distributed virtual port groups on the uplinks of the Edge Node VM.

Hit Save; this time the config will be saved correctly.
Before proceeding, compare the new Edge Node VM with the one you will replace.

Put the Current Edge Node VM in NSX Maintenance Mode

Navigate to System > Fabric > Nodes > Edge Transport Nodes

Select the Edge Node VM you want to replace with the new Large form factor Edge Node VM.

Under Actions, choose "Enter NSX Maintenance Mode" and wait until the Edge Node VM is placed in NSX maintenance mode. This will shut down the dataplane services on the Edge Node and trigger a failover for any active SR on the Edge Node VM. This can cause a small disruption for the T1s. Please check the connections after the failover.

Replace Edge Cluster Member

Now we are going to replace the Edge Node VM we put in NSX maintenance mode with the freshly deployed Large Edge Node VM.

Navigate to System > Fabric > Nodes > Edge Clusters and select the edge cluster. Click Actions and select "Replace Edge Cluster Member".

Choose the Edge Node VMs:

Hit Save. Within a few seconds all configuration is migrated, and the new Large-sized Edge Node VM should replace the Medium-sized Edge Node VM and become part of the edge cluster.

Check whether all the DR and SR constructs of the T0 and T1 gateways are present. SSH into the new Edge Node VM and run "get logical-routers":
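Abridged and with made-up names and IDs, the output looks roughly like this; each T1 served by this Edge Node should show a DISTRIBUTED_ROUTER_TIER1 entry, plus a SERVICE_ROUTER_TIER1 entry where an SR is placed on this node:

nsx-edge-01> get logical-routers
Logical Router
UUID                                  VRF  LR-ID  Name         Type
1a2b3c4d-5e6f-4a8b-9c0d-1e2f3a4b5c6d  1    1025   DR-T1-GW01   DISTRIBUTED_ROUTER_TIER1
2b3c4d5e-6f7a-4b9c-0d1e-2f3a4b5c6d7e  2    2049   SR-T1-GW01   SERVICE_ROUTER_TIER1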

Check whether the Edge Node is available for the T1s. Navigate to Networking > Tier-1 Gateways and select a T1 that is running on the new Edge Node VM. Select Auto-Allocated.

Now you can see whether the new Edge Node VM is running as active or standby for the selected T1. If the new Edge Node VM is standby, you can trigger a failover by putting the active node in maintenance mode.

Traceflow can be used to check whether the connection flows are indeed running over the new Edge Node VM:

Check that all the tunnels are up and the Edge Node VM has no active alarms. The replacement was successful!