Resize an NSX-T Edge VM

Recently I needed to resize our Edge VMs because we had a memory issue and the upgrade to NSX-T version 3.1.3.7 did not solve it. The deployed Edge VMs used the Medium form factor, which was not enough for the current load. The Edge VMs in the cluster are only used for T1 gateways, have no stateful services, and are only used for DR.

The current NSX version has four form factors, from Small to Extra Large. In this post we go from a Medium Edge Node VM to a Large Edge Node VM.

The resize can also be done from an Edge Node VM to bare metal or vice versa, although it is not advised to mix them in the same Edge cluster.

Create a New Edge Node VM

Navigate to System > Fabric > Nodes > Edge Transport Nodes > Add Edge Node

Fill in the new name and FQDN, and choose the Large form factor:

Click Next and fill in all credential details:

Click Next again and fill in the deployment details:

In the next screen, configure the node settings such as IP, gateway, DNS and NTP, and select the correct management interface. Please check that the new Edge VM has been added to DNS before deploying.

In the following screen we configure the NSX switches (NVDS). Keep the NVDS names identical to those on the other Edge VMs. We encountered a bug when selecting the interfaces during this configuration.

We have distributed virtual port groups configured as trunks for the uplinks of the Edge Node VMs.
For some reason these port groups were not showing up in the list. As a workaround we selected a VLAN-backed segment as a temporary fix; later on we will reset the uplinks to the right interfaces.

Fill in the rest of the fields. The key is to keep the configuration identical to the Edge Node VM we are going to replace.

Hit the Finish button and wait until the Node Status shows Up and the Configuration State shows Success. The Edge Node VM is then deployed successfully.
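If you would rather not watch the UI, the waiting step can be scripted. Below is a minimal sketch of such a wait loop, not an official NSX tool. The function name wait_until_ready, the "Failed" state and the fake responses are assumptions for illustration only. How the two values are actually fetched (reading them off the UI, or querying the NSX Manager API) is left to the caller-supplied fetch function.

```python
import time


def wait_until_ready(fetch, attempts=30, interval=10.0, sleep=time.sleep):
    """Poll fetch() until it reports ("Up", "Success").

    fetch is a caller-supplied function returning a tuple
    (node_status, config_state). Returns the number of polls needed.
    """
    for attempt in range(1, attempts + 1):
        node_status, config_state = fetch()
        if node_status == "Up" and config_state == "Success":
            return attempt
        # "Failed" is an assumed label for a failed deployment.
        if config_state == "Failed":
            raise RuntimeError("Edge node configuration failed")
        sleep(interval)
    raise TimeoutError("Edge node did not become ready in time")


# Hypothetical sequence of states, standing in for real lookups.
responses = iter([
    ("Down", "In Progress"),
    ("Up", "In Progress"),
    ("Up", "Success"),
])
attempts_needed = wait_until_ready(lambda: next(responses), sleep=lambda s: None)
print(attempts_needed)
```

Passing sleep as a parameter keeps the loop testable without real waiting; in real use you would leave the default time.sleep in place.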

Reconfigure Uplinks

Now we have to reconfigure the Uplinks of the NVDS to the correct distributed virtual port groups.

Select the newly deployed Edge Node VM and hit Edit.

As you can see, the uplinks are not mapped to an interface. Select interfaces for both uplinks:
Uplink-1 > EdgeVM-Uplink-1-Trunk01
Uplink-2 > EdgeVM-Uplink-1-Trunk02

But when we hit SAVE, we get an error:

Again there is a workaround for this issue: log in on another node in the NSX Manager cluster and try again to map the right distributed virtual port groups to the uplinks of the Edge Node VM.

Hit SAVE. This time the config will be saved correctly.
Before proceeding, compare the new Edge Node VM with the one you will replace.
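The comparison can be done by eye, but a small script makes it harder to miss a leftover from the workaround. This is only a sketch: the dictionaries are typed in by hand, and the keys (name, form_factor, nvds_name, uplink_1, uplink_2) and the values other than the two trunk port group names are made up for illustration. They are not fields taken from NSX.

```python
def diff_settings(old, new, ignore=()):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    differences = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue  # settings that are expected to differ
        if old.get(key) != new.get(key):
            differences[key] = (old.get(key), new.get(key))
    return differences


# Hypothetical settings noted down from the UI for both Edge Node VMs.
old_edge = {
    "name": "edge-medium-01",
    "form_factor": "MEDIUM",
    "nvds_name": "nvds-edge",
    "uplink_1": "EdgeVM-Uplink-1-Trunk01",
    "uplink_2": "EdgeVM-Uplink-1-Trunk02",
}
new_edge = {
    "name": "edge-large-01",
    "form_factor": "LARGE",
    "nvds_name": "nvds-edge",
    "uplink_1": "EdgeVM-Uplink-1-Trunk01",
    "uplink_2": "temporary-vlan-segment",  # leftover from the workaround
}

diff = diff_settings(old_edge, new_edge, ignore=("name", "form_factor"))
for key, (old_value, new_value) in diff.items():
    print(f"{key}: {old_value} -> {new_value}")
```

In this made-up example only uplink_2 is reported, which is exactly the kind of leftover you want to catch before replacing the cluster member.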

Put the Current Edge Node VM in Maintenance Mode

Navigate to System > Fabric > Nodes > Edge Transport Nodes

Select the Edge Node VM you want to replace with the new Large form factor Edge Node VM.

Under Actions choose “Enter NSX Maintenance Mode” and wait until the Edge Node VM is placed in NSX Maintenance Mode. This shuts down the dataplane services on the Edge Node and triggers a failover for any active SR on the Edge Node VM. This can cause a small disruption for the T1s. Please check the connections after the failover.

Replace Edge Cluster Member

Now we are going to replace the Edge Node VM we put in NSX Maintenance Mode with the freshly deployed Large Edge Node VM.

Navigate to System > Fabric > Nodes > Edge Clusters, select the Edge Cluster, click Actions and select “Replace Edge Cluster Member”.

Choose the Edge Node VMs:

Hit SAVE and within a few seconds all configuration is migrated. The new Large Edge Node VM should now replace the Medium Edge Node VM and become part of the Edge Cluster.

Check that all the DR and SR constructs of the T0 and T1 gateways are present: SSH into the new Edge Node VM and run “get logical routers”.

Check if the Edge Node is available for the T1s. Navigate to Networking > Tier-1 Gateways and select a T1 that is running on the new Edge Node VM. Select Auto-Allocated.

Now you can see whether the new Edge Node VM is running as active or standby for the selected T1. If the new Edge Node VM is standby, you can trigger a failover by putting the active node in Maintenance Mode.

Traceflow can be used to check if the connection flows are indeed running over the new Edge Node VM:

Check if all the tunnels are up and the Edge Node VM has no active alarms. If so, the replacement was successful!

NSX-V not showing virtual wires in vCenter

Today I ran into an issue with a setup of NSX-V version 6.4.1. As this is the second time I have run into this error, let’s document it for future reference.

Issue:

Viewing logical switches in the vSphere Web Client displays an error. In vCenter, on the Networking & Security tab under Logical Switches, the list was empty and the UI gave the following Internal Error:

A quick Google search pointed us to the following Knowledge Base article:
https://kb.vmware.com/s/article/54442?lang=en_US

This issue occurs due to stale entries in the NSX Manager database. The NSX database becomes inconsistent with vCenter Server when a VM is converted to a template in vCenter Server and config changes are then made to the VM template. Changes to VM templates are not synced with NSX unless the templates are converted back to VMs.

NOTE:
Please contact GSS when you run into this issue in a production environment. And of course, if you still run production on NSX-V, you need to take action and move to NSX-T!

Workaround

We need to identify the templates that are causing this issue and then convert each template back to a VM to sync the latest updates to the NSX inventory. This removes the stale entries in the DB and gets the latest sync from VC, making the NSX DB consistent with VC and thereby resolving the issue. You can then convert the VM back to a template.

Log in to the NSX Manager, go to Enable Mode, fill in the password, and enter engineering mode by typing the “st eng” command. Type y to continue and fill in the password: IAmOnThePhoneWithTechSupport

Now you are in engineering mode! Go to the database CLI by entering “psql -U secureall”.
You can do the check in two ways. The first one involves a bit more clicking and checking, but works well when you don’t have loads of templates.

Check 1:

To get the Templates which are causing the issue run the following query:

select objectid, host_id from domain_object where dtype='VimVirtualMachine' and host_id NOT IN (select objectid from domain_object where dtype='VimHostSystem');

You will get a list of objectids, which you can check in vCenter by going through the list of templates and checking the URL for the objectid. If the objectid is identical, then convert the template back to a VM to sync the latest updates to the NSX inventory and get the latest sync from VC, making the NSX DB consistent with VC and thereby resolving the issue. You can then convert the VM back to a template.

Check 2:

If you have a lot of templates you can use the following query to check both objectid and template name:

select * from domain_object where dtype='VimVirtualMachine' and host_id NOT IN (select objectid from domain_object where dtype='VimHostSystem');

and scroll through the output searching for .vmx and vm-####
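Scrolling works, but with many templates it is easier to let a script pick out the vm-#### ids and .vmx paths. The sketch below assumes you have copied the psql output into a text string. The sample rows are made up for illustration and do not show the real column layout of domain_object, so the script only looks for the two patterns and ignores everything else.

```python
import re

VM_ID = re.compile(r"\bvm-\d+\b")       # e.g. vm-1021
VMX_PATH = re.compile(r"\S+\.vmx\b")    # e.g. folder/name.vmx


def find_candidates(psql_output):
    """Return a list of (vm_id, vmx_path or None), one entry per matching row."""
    results = []
    for line in psql_output.splitlines():
        ids = VM_ID.findall(line)
        if not ids:
            continue  # header, separator or row-count line
        paths = VMX_PATH.findall(line)
        results.append((ids[0], paths[0] if paths else None))
    return results


# Hypothetical output rows, NOT the real domain_object layout.
sample = """\
 vm-1021 | VimVirtualMachine | host-77 | [datastore1] tmpl-win2019/tmpl-win2019.vmx
 vm-2040 | VimVirtualMachine | host-91 | [datastore2] tmpl-ubuntu/tmpl-ubuntu.vmx
(2 rows)
"""

candidates = find_candidates(sample)
for vm_id, vmx_path in candidates:
    print(vm_id, vmx_path)
```

The resulting ids can then be matched against the templates in vCenter in the same way as in Check 1.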

If you ran through all objectids which were in the list, you can check if the list is empty by running the check command again:

All issues are resolved! Let’s check if we can see the virtual wires again:

Hooray!!!

NSX-T Edge Datapath Mempool Usage Issue on EdgeVM

Recently we ran into an issue with our Edge VMs on NSX-T 3.1.0.0: the datapath mempool usage for pfstate3 was at a critically high level (100%) and the Edge VMs were dropping packets. Failing over to another Edge by triggering NSX Maintenance Mode for the Edge in question was just a quick fix.

This is a known issue at VMware, caused by a bug in version 3.1.0.0: a memory leak in the firewall service on the Edge.

syslog.9:2022-02-16T08:49:01.306Z edgeVM01 NSX 4451 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewalldp" level="ERROR"] Memory resource from hugepage exhausted in the firewall service size=912(0M)
syslog.9:2022-02-16T08:49:01.307387+00:00 edgeVM01 f16e2e57db2b 3179 - - 2022-02-16T08:49:01Z datapathd 4451 firewalldp [ERROR] Memory resource from hugepage exhausted in the firewall service size=912(0M)
syslog.9:2022-02-16T08:49:01.307708+00:00 edgeVM01 datapath-systemd-helper 4332 - - 2022-02-16T08:49:01Z datapathd 4451 firewalldp [ERROR] Memory resource from hugepage exhausted in the firewall service size=912(0M)

This can happen even if the gateway firewall on the Edge is not utilized.
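To see how often an Edge hits this, you can scan the collected syslog files for the error message. The sketch below is a rough helper, not a VMware tool. It uses the three log lines above as sample input and assumes that every matching line carries an ISO timestamp directly followed by the hostname, as those lines do. Because the same event is logged several times, hits are grouped per host and per second.

```python
import re
from collections import defaultdict

MESSAGE = "Memory resource from hugepage exhausted in the firewall service"
# Timestamp truncated to the second, then the hostname that follows it.
STAMP_AND_HOST = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s+(\S+)")


def exhaustion_events(lines):
    """Return {host: sorted list of seconds in which the error was logged}."""
    events = defaultdict(set)
    for line in lines:
        if MESSAGE not in line:
            continue
        match = STAMP_AND_HOST.search(line)
        if match:
            second, host = match.groups()
            events[host].add(second)
    return {host: sorted(seconds) for host, seconds in events.items()}


SAMPLE = """\
syslog.9:2022-02-16T08:49:01.306Z edgeVM01 NSX 4451 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewalldp" level="ERROR"] Memory resource from hugepage exhausted in the firewall service size=912(0M)
syslog.9:2022-02-16T08:49:01.307387+00:00 edgeVM01 f16e2e57db2b 3179 - - 2022-02-16T08:49:01Z datapathd 4451 firewalldp [ERROR] Memory resource from hugepage exhausted in the firewall service size=912(0M)
syslog.9:2022-02-16T08:49:01.307708+00:00 edgeVM01 datapath-systemd-helper 4332 - - 2022-02-16T08:49:01Z datapathd 4451 firewalldp [ERROR] Memory resource from hugepage exhausted in the firewall service size=912(0M)
"""

events = exhaustion_events(SAMPLE.splitlines())
print(events)
```

For the three sample lines this reports a single event on edgeVM01, since all three entries describe the same second.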

Increasing the size of the Edge will make the issue less frequent, but a permanent fix was released in version 3.1.3.6. After we upgraded to 3.1.3.6, however, that version was withdrawn due to a similar issue with a datapath memory leak.

https://kb.vmware.com/s/article/87806

So we started a new upgrade to 3.1.3.7...

After the upgrade to 3.1.3.7, the issue was still present. So on to resizing the edge to a bigger form-factor.

Check out how you can resize an Edge VM.