A while ago we ran into an issue after upgrading NSX-T from version 3.1.3.6 to 3.1.3.7: an alarm showed up in the Alarms section at one of the sites. I still wanted to do a post about the issue and the solution/workaround:
Time to check the connection! Log in to the Edges and grab the VRF ID of the RTEP tunnel.
Check BGP and ping between the RTEP IP addresses on both sites.
As you can see, all BGP sessions are established and the ping commands get a reply. Let's do another check from Postman:
Open Postman and fire a GET API call to the NSX Manager to grab the edge ID we need in the next API call:
https://<nsxmanager ip>/api/v1/transport-nodes/
Just select Basic Auth under the Authorization tab and fill in the Admin credentials.
Hit Send and, when you get a reply in the body, search for the edge name and the corresponding ID.
Now we use this id to get the RTEP status:
GET https://<nsxmanagerip>/api/v1/transport-nodes/<edgenodeid>/inter-site/bgp/summary
Check the output and the return status for issues; as you can see in the example above, the BGP session to one of the peers is established.
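If you prefer scripting over Postman, the same two calls can be done with a few lines of Python. A rough sketch only: the manager address, credentials and edge name are placeholders, verify=False is there purely because of the self-signed manager certificate, and the results/node_id/display_name fields are what the transport-node list returned in our 3.1.x environment.

import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "nsxmanager.lab.local"              # placeholder
AUTH = HTTPBasicAuth("admin", "SuperSecret!")     # placeholder credentials
EDGE_NAME = "edge-node-01"                        # display name of the edge to check

base = f"https://{NSX_MANAGER}/api/v1/transport-nodes"

# Step 1: list the transport nodes and look up the edge ID by its display name
nodes = requests.get(base, auth=AUTH, verify=False).json()
edge_id = next(n["node_id"] for n in nodes["results"]
               if n.get("display_name") == EDGE_NAME)

# Step 2: pull the RTEP (inter-site) BGP summary for that edge
summary = requests.get(f"{base}/{edge_id}/inter-site/bgp/summary",
                       auth=AUTH, verify=False).json()
print(summary)   # look for the neighbor status ("ESTABLISHED") here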
Solution:
So it seems this is a known issue in version 3.1.3.7 with a three-node manager cluster.
Only the manager node that generated the alarm can clear it from memory, and only when it receives the remove-alarm message from the edge node. In our case the alarm was resolved on one of the manager nodes, but the other nodes were still showing it and kept it active.
The following workaround will remove the alarm: Restart the proton service on ALL manager nodes.
SSH to the NSX-T manager nodes with the admin user and execute the following commands:
stop service proton
start service proton

UPDATE: The issue is fixed in version 3.2.1.
So it looks like the upload can’t handle files over 2GB.
Honestly, my Python skills are a bit rusty, so I asked one of the developers on our team to help me out and see if we could get this fixed.
The 2 GB+ file size is the issue. You can find multiple references to the error, usually pointing to httplib, urllib or ssl. One solution is to use a streaming upload.
This is what we did to make the upload work.
Install the requests-toolbelt package.
Edit nsxt_upgrade_upload_mub.py (NOTE: this will break the URL upload!) and add the following imports:
import requests
from requests_toolbelt.multipart import encoder
from requests.auth import HTTPBasicAuth
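Below is roughly what the upload part looked like after the change: a streaming multipart POST via requests_toolbelt's MultipartEncoder instead of building the whole request body in memory. Treat it as a sketch; the upload URL, multipart field name and credentials are placeholders for the values the original script already has.

# Streaming multipart upload: the .mub is read in chunks instead of being
# buffered into one >2 GB request body.
import requests
from requests_toolbelt.multipart import encoder
from requests.auth import HTTPBasicAuth

def upload_mub(upload_url, mub_path, username, password):
    # upload_url : the bundle-upload URL the script already builds
    # "file"     : placeholder multipart field name - keep whatever field
    #              name the original script used
    with open(mub_path, "rb") as fh:
        form = encoder.MultipartEncoder(
            fields={"file": (mub_path, fh, "application/octet-stream")}
        )
        resp = requests.post(
            upload_url,
            data=form,                                    # streamed, not buffered
            headers={"Content-Type": form.content_type},  # includes the boundary
            auth=HTTPBasicAuth(username, password),
            verify=False,                                 # self-signed manager cert
        )
    resp.raise_for_status()
    return resp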
This week I upgraded a Usage Meter appliance from 4.5.0.0 to 4.5.0.1 with the in-place upgrade method. The Usage Meter 4.5.0.1 patch release rolled out on May 23rd and addresses a major issue found in Usage Meter 4.5. For more information I refer to the following blog:
After the reboot, and after checking that the upgrade was successful, we tried to send a test update to Usage Insight. The test send of data failed with the following error:
In the notifications we can find the following messages:
After a search through the VMware Knowledge Base I came across this article:
This response also came back with HTTP status code 200, so all OK; so much for the checks…
Let’s get in touch with GSS
As this was the advice in the first error message all along…
The GSS engineer stated that there was an issue with the nginx JVM settings when using a proxy. We had to add a line to the nginx.conf in the following directory, but before we change anything, let's take a snapshot of the system in case we ruin everything.
Remark: Please contact GSS if you want support editing files, and always make a backup or in this case a snapshot before changing settings
Edit the nginx.conf and add the following line somewhere around line 58:
This line sets a dummy file for the proxy configuration; we are setting a dummy proxy configuration because we are hitting a known issue that will be fixed in a future release.
jvm_options "-Dproxy_config=/tmp/vami-file";
Go down one dir to /opt/vmware/cloudusagemetering
Stop the NGINX service with the following command:
./scripts/stop.sh GW
Now start the service again:
Start NGINX service
Now get the status of all services:
You will see a lot of errors; these can be ignored.
Now go to the VAMI UI and reset the proxy again:
Test
Now go to the Settings Page in the Usage Meter UI and Send an update to Usage Insight.
It works! You can also check the Usage Meter in Usage Insight: go to https://ums.cloud.vmware.com/ui/ and the last update should be from after the issue was fixed!
Remark: Please contact GSS if you want support editing files, and always make a backup, or in this case a snapshot, before changing settings.
Last week I was at the VMware Tech Summit in Cork, Ireland, where I attended a session about NSX-T troubleshooting. During this session a lot of issues came up that I have dealt with over the last year or so.
One of these issues was about the IP bindings of a VM on a segment. In our case a tenant manually edited the IPv4 address on the network interface of the VM, and after this the connection to the VM dropped.
We checked the DFW and the routing; all was OK. After some digging we found out that the VM had two IP addresses in the Realized Bindings section on the logical switch. This view can be found in the Manager UI of the segment port.
How can you find these Realized Bindings and fix them?
To get to the right port, go to Segments and look for the segment to which the VM is connected. Once you have found it, click on the number shown under Ports.
This opens a window listing all ports connected to the segment; copy the segment port name.
Now search in the Search Bar for this Segment Port Name, and click the one with Resource Type Logical Ports.
This takes you to the Manager UI view of this logical port. Alternatively, go to Networking -> switch to the Manager UI in the upper right corner -> select Logical Switches -> and search through the list for the right port.
Select Address Bindings. Here you can see the Auto Discovered Bindings, both with the current IP of the VM: one learned from VMware Tools and the other by ARP snooping. But if you take a close look at the Realized Bindings, you can see a different IP learned by ARP snooping. This was the original IP the VM had when it first connected.
This can cause connection problems! In our case the whole routing was messed up and the traffic went out via the wrong Uplink.
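By the way, you can also pull these bindings via the Manager API instead of clicking through the UI. A rough sketch only: manager address, credentials and port name are placeholders, and the discovered_bindings/realized_bindings field names of the logical port state call should be verified against the API guide for your version.

import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "nsxmanager.lab.local"               # placeholder
AUTH = HTTPBasicAuth("admin", "SuperSecret!")      # placeholder credentials
PORT_NAME = "tenant-vm-01-port"                    # the segment port name you copied

# Find the logical port by its display name
ports = requests.get(f"https://{NSX_MANAGER}/api/v1/logical-ports",
                     auth=AUTH, verify=False).json()["results"]
port = next(p for p in ports if p.get("display_name") == PORT_NAME)

# Pull the port state, which should contain the discovered and realized bindings
state = requests.get(f"https://{NSX_MANAGER}/api/v1/logical-ports/{port['id']}/state",
                     auth=AUTH, verify=False).json()
print("Discovered:", state.get("discovered_bindings"))
print("Realized:  ", state.get("realized_bindings"))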
We can fix this quickly by moving the entry with the old IP address to the Ignore Bindings:
It will take a few seconds to update the Realized Bindings with the new IP address learned by ARP snooping.
After this the connection came up and the tenant was happy! But this was nothing more than a quick fix: what if all tenants went mad and started manually changing their IP addresses in the OS…
So why does the IP address stay in the Realized Bindings section and keep causing carnage?
By default, the discovery methods ARP snooping and ND snooping operate in a mode called trust on first use (TOFU). In TOFU mode, when an address is discovered and added to the realized bindings list, that binding remains in the realized list forever.
Can we modify that mode? Yes we can!
In NSX-T we use several profiles; one of those is the IP Discovery profile. This profile can be found in the Policy UI under Segments -> Segment Profiles.
Create a new IP Discovery profile and disable the TOFU setting. When you do this, TOFU changes to Trust On Every Use (TOEU). In TOEU mode, the discovered IP addresses are placed in the binding list and deleted when they expire. DHCP snooping and VMware Tools always operate in TOEU mode.
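For those who prefer the API: the same profile can presumably be created through the Policy API as well. A rough sketch only; the ip-discovery-profiles endpoint and the tofu_enabled property are based on the Policy API IPDiscoveryProfile as I understand it, so verify both against the API guide for your NSX-T version before using this.

import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "nsxmanager.lab.local"            # placeholder
AUTH = HTTPBasicAuth("admin", "SuperSecret!")   # placeholder credentials
PROFILE_ID = "ip-discovery-toeu"                # our own ID for the new profile

body = {
    "display_name": "ip-discovery-toeu",
    # Assumption: tofu_enabled is the IPDiscoveryProfile property that switches
    # ARP/ND snooping from TOFU to TOEU - check the API guide for your version.
    "tofu_enabled": False,
}

resp = requests.patch(
    f"https://{NSX_MANAGER}/policy/api/v1/infra/ip-discovery-profiles/{PROFILE_ID}",
    json=body, auth=AUTH, verify=False)
resp.raise_for_status()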
Now we need to adjust the segment to use the new IP Discovery profile: go to the segment and click Edit. Under Segment Profiles select the new profile, click Save and then Close Editing.
Now when a tenant manually changes the IP of the network interface, the old IP learned the first time by ARP snooping is no longer present in the Realized Bindings section.
When we upgraded to vSAN 7, we immediately faced an annoying error message in Skyline Health complaining about the resync throttling.
This setting has been deprecated since 6.7 and can only be set via PowerCLI; below are a few lines of code to set the value to 0 on all hosts in the cluster. When a vSAN host is patched, and sometimes even after a reboot, we see the warning again; just run the script again and the warning is gone.
$vc="yourVC"
Connect-VIServer $vc
$hosts = Get-VMHost
foreach($esx in $hosts){
Write-Host "Displaying and Updating VSAN ResyncThrottle value on $esx"
Get-AdvancedSetting -Entity $esx -Name VSAN.DomCompResyncThrottle | Set-AdvancedSetting -Value 0 -Confirm:$false
Get-AdvancedSetting -Entity $esx -Name VSAN.DomCompResyncThrottle
}
Please run scripts at your own risk! If you are not comfortable please contact VMware Support.
You can also check this KB article for another solution regarding the resync Throttling
During an upgrade of NSX-T from 3.1.3.6 to 3.1.3.7 I came across an issue in the UI. When I clicked the Upgrade button, the screen was blank and did not show any data; sometimes, after a wait of half an hour or more, the screen came through and I could proceed with the upgrade.
This is of course not the way it should be, so I wanted to get rid of the issue.
Check Manager Cluster Nodes
First I wanted to check whether all the cluster nodes were stable and the services were running fine, so I ran the following command on all three cluster nodes:
get cluster status
All servers seemed to be running fine and didn't show any anomalies. Next I checked whether maybe an old upgrade plan was stuck or something like that:
get node upgrade status
% Post reboot node upgrade is not in progress
But no luck with that either. Next I tested what happens if we start the upgrade from another manager node.
For that to be possible I needed to execute the following command on the manager node we want to become the orchestrator node:
set repository-ip
But after testing all nodes, no luck at all. The UI still gave me a blank screen on the Upgrade page.
Time to get support (Cause):
We raised an SR with VMware and within a few hours we got feedback. This issue was probably caused by an inconsistent Corfu DB, possibly triggered by an action we did in the past: the redeployment of a manager node after a failure.
You can identify a possibly inconsistent Corfu DB by a high and increasing EPOCH number in /var/log/corfu/corfu-compactor-audit.log:
2022-05-27T10:53:35.446Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for fc2ada82-3ef8-335a-9fdb-c35991d3960c, entries(0), cpSize(1) bytes at snapshot Token(epoch=2888, sequence=1738972197) in 65 ms
2022-05-27T11:05:21.346Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for fc2ada82-3ef8-335a-9fdb-c35991d3960c, entries(0), cpSize(1) bytes at snapshot Token(epoch=2921, sequence=173893455) in 34 ms
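A quick way to check this is to pull the epoch values out of that log file, for example with a few lines of Python. A rough sketch; run it on the manager node (or against a copy of the log file).

import re

LOG = "/var/log/corfu/corfu-compactor-audit.log"

# Collect every epoch=<number> value in the order it appears in the log
epochs = []
with open(LOG) as fh:
    for line in fh:
        match = re.search(r"epoch=(\d+)", line)
        if match:
            epochs.append(int(match.group(1)))

if epochs:
    print("first epoch:", epochs[0], "- last epoch:", epochs[-1])
    print("epoch keeps climbing:", epochs[-1] > epochs[0])
else:
    print("no epoch entries found")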
Solution
Redeploy the manager nodes one by one…
So here we go:
First we need to retrieve the UUID of the node we want to detach from the cluster.
get cluster status
Next, from another cluster node, run the command to detach the failed node from the cluster.
detach node failed_node_uuid
The detach process might take some time. When it finishes, get the status of the cluster and check that there are indeed only two nodes present in the cluster.
get cluster status
The manager node is now detached, but the VM is still present in the vSphere inventory. Power it down and delete the VM. You can of course keep it, but we are going to deploy a new node with the exact same parameters, FQDN and IP, so in that case it is best to disconnect its network interfaces.
Now we can deploy a new manager node. We can do this in two ways.
1. From the UI
We can use this method if there is a compute manager configured on which the manager node can be deployed.
Navigate to System > Configuration > Appliances and click Add NSX Appliance.
Fill in the hostname, IP/DNS and NTP settings and choose the deployment size of the appliance; in our case this is Large. Click Next.
Next, fill in the configuration details for the new appliance and hit Next.
Then fill in the credentials, enable SSH and root access, and after that hit Install Appliance.
Now be patient while the appliance is deployed in the environment.
When the new appliance has deployed successfully, wait until all services become stable and all lights are green. Check the cluster status on the CLI of the managers with:
get cluster status
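If you would rather script this check, the cluster status can also be polled via the API. A rough sketch only: manager address and credentials are placeholders, and the mgmt_cluster_status/control_cluster_status fields of /api/v1/cluster/status should be verified against the API guide for your NSX-T version.

import time
import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "nsxmanager.lab.local"            # placeholder
AUTH = HTTPBasicAuth("admin", "SuperSecret!")   # placeholder credentials

while True:
    status = requests.get(f"https://{NSX_MANAGER}/api/v1/cluster/status",
                          auth=AUTH, verify=False).json()
    # Assumption: overall management and control cluster state live in these
    # two fields - verify against the API guide for your version.
    mgmt = status.get("mgmt_cluster_status", {}).get("status")
    ctrl = status.get("control_cluster_status", {}).get("status")
    print(f"management: {mgmt}, control: {ctrl}")
    if mgmt == "STABLE" and ctrl == "STABLE":
        break
    time.sleep(30)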
If all services are stable and running on every node, you can detach the next one in line and start over until all appliances are redeployed.
2. Deploy with OVA
If you can't deploy the new appliance from the UI, you can build it using the OVA file. Download the OVA file from the VMware website:
Review the details and go on to the configuration part:
Select the appropriate deployment size:
Select the Storage where the appliance needs to land:
Next select the management network:
And customize the template by filling in the passwords for the accounts, IP details, etc.
Hit Next and review the configuration before you deploy the appliance!
When the OVA has deployed successfully, power on the VM and wait until it has booted completely; an extra reboot can be part of this.
Log in to a node of the NSX Manager cluster and run the following command to get the cluster thumbprint. Save this thumbprint; we need it later on.
get certificate api thumbprint
And run the get cluster config command to get the cluster ID:
get cluster config
Now open an SSH session to the new node and run the join command to join the new node to the existing cluster.
When the join operation is successful, wait for all services to restart.
You can check the cluster status in the UI: select System > Appliances and check whether all services are up.
Check the cluster config on the manager nodes by running:
get cluster config
Conclusion
When you have an inconsistent Corfu DB, in some cases the redeployment of all manager nodes can be the solution. Be aware that you should detach only one node at a time and then redeploy its replacement, and so on; always keep two or more nodes in the cluster to keep it healthy.
Today I ran into an issue with a setup of NSX-V version 6.4.1. As this is the second time I have run into this error, let's document it for future reference.
Issue:
Viewing logical switches in the vSphere Web Client displays an error: in vCenter, on the Networking & Security tab under Logical Switches, the list was empty and the UI gave the following internal error:
This issue occurs due to stale entries in the NSX Manager database. The NSX database becomes inconsistent with vCenter Server when a VM is converted to a template in vCenter Server and configuration changes are then made on the VM template. Any changes to VM templates are not synced to NSX unless they are converted back to VMs.
NOTE: Please contact GSS when you run into this issue in a production environment. And of course, if you still run production on NSX-V, you need to take action and move to NSX-T!
Workaround
We need to identify the templates which are causing this issue and convert them back to VMs to sync the latest updates to the NSX inventory. This removes the stale entries in the DB and gets a fresh sync from vCenter, making the NSX DB consistent with vCenter and thereby resolving the issue. You can then convert the VMs back to templates.
Log in to the NSX Manager CLI, go to enable mode, fill in the password, and enter engineering mode by typing the "st eng" command. Type y to continue and fill in the password: IAmOnThePhoneWithTechSupport
Now you are in engineering mode! Go to the database CLI by entering "psql -U secureall". You can do the check in two ways; the first one involves a bit more clicking and checking, but works fine when you don't have loads of templates.
Check 1:
To get the templates which are causing the issue, run the following query:
select objectid, host_id from domain_object where dtype='VimVirtualMachine' and host_id NOT IN (select objectid from domain_object where dtype='VimHostSystem');
You will get a list of object IDs which you can check in vCenter by going through the list of templates and checking the URL for the object ID. If the object ID matches, convert the template back to a VM to sync the latest updates to the NSX inventory and get a fresh sync from vCenter, making the NSX DB consistent with vCenter and thereby resolving the issue. You can then convert the VM back to a template.
Check 2:
If you have a lot of templates, you can use the following query to check both the object ID and the template name:
select * from domain_object where dtype='VimVirtualMachine' and host_id NOT IN (select objectid from domain_object where dtype='VimHostSystem');
and scroll through the output searching for .vmx and vm-####
Once you have gone through all the object IDs in the list, you can check whether the list is empty by running the check query again:
All issues are resolved! Let’s check if we can see the virtual wires again: