VCF 5.0 – Retry Bring up fails on Generate input for SSH Keys

After a Failed VCF bring-up, I wanted to retry the bring-up. Luckily the error I encountered before was resolved but again I ran in to an issue during the retry.

Now the issue was with the import of the SSH Keys

Going through some internal resources i stumbled upon the solution, since this is a nested lab environment on top of VCD you have to reset the MAC address of the ESXi Host.

VCF 5.0 LAB Deployment Issue – Failed to migrate vmnics of ESXi host to Distributed vSwitch

During my latest deployment of VCF in my lab environment I ran in to the following issue.

Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d Failed to migrate vmnics of host 192.168.11.12 to DVS sfo-m01-cl01-vds01 . Reason: Failed to migrate vmknic vmk0 to DvSwitch 50 22 42 8c d5 a1 d4 8f-6d 9e 8a 1e 93 ac 5b 9d

The error is pretty clear, the migration of vmk0 from the standard vSwitch to the Distributed vSwitch failed on esx02. I checked esx01 and on this host the migration was successfull.

I tried to manually migrating the vmk0 to the distributed vSwitch also ran in to an error in vCenter.
Right-click dvSwitch -> Add and Manage Hosts -> Manage Host Networking -> Select esx02

Click Next and leave the physical adapters as is, click next again.
On the next screen click on “Assign Port Group” next to vmk0.

Click on ASSIGN next to the management portgroup

Next, Next, Finish…..Task is running and fails after a few seconds.

Checking the Task Details on the ESX host:

After some investigation and searching internally within VMware resources and also ran into this blog article: https://mhvmw.wordpress.com/2023/03/17/issue-with-nested-vcf-4-5-deployment-lab-only/

it is a MAC address conflict when the esxi takes the mac of the physical nic for vmk0. 
By deleting and recreating the vmk0 interface you generate a new MAC address for vmk0.

Steps to check, delete and recreate vmk0 interface

Login via DCUI

Enable ESXi Shell

Next, Click ALT+F1 to access ESXi console and login as root.

Type the command:
esxcli network ip interface list

Make a note of the portgroup, in this case “Management network” and then remove the vmk0 with the following command:
esxcli network ip interface remove –-interface-name=vmk0

When vmk0 is deleted, we can immediately create a new interface with the same name and portgroup. This is done by the following command:
esxcli network ip interface add -–interface-name=vmk0 -p “Management Network”

To check if vmk0 is created again type the command:
esxcli network ip interface list

Click ALT+F2 to access ESXi DCUI and login to disable the ESXi shell.
Now we can configure the IP settings again via the DCUI

Go to Configure Management Settings -> IPv4 Configuration and set the static IPv4 configuration

Hit Enter then Esc and Yes to restart the management network

Now we can try to redeploy via cloudbuilder, after this the deployment went on succesfully

Compute Manager not visible in NSX after Upgrade of the 3 Manager Appliances

After a successful upgrade of NSX, after the last step the upgrade of the management plane the compute manager disappeared, let’s see how we can fix that!

When i try to add the vCenter it says it is already registered, let’s check with the API.

First do a API GET in Postman to get the compute manager id:

Output:

Now we have the compute manager id, we can check if it is registered and up:

Output:


As you can see the compute manager is registered and up, why is it not showing up in the UI?

Solution:

Login with the admin user by SSH, and run the following command.

start search resync inventory

Wait a few seconds and refresh the UI, now the Compute Manager is back!


Test Case – Cloud Director NSX-T Role Based Access

Recently i got the question if in one Cloud Director Tenant (Organization) Granular Role Based Access Control and separation of rights can be configured between multiple teams within that Organization.

Details:

In our Test case Team-A is responsible for Org VDC A and can only manage and view the Edge GW resources (networks, Edge Gateways) within that VDC-Group. Team-B responsible for Org VDC B and can manage and view the all resources in all VDC-Groups, Except the Edge GW in Org VDC A. This Edge GW can only be managed by the Team-A.

Also because of some Tenant requirements the T0 (VRF) is also split between Internet and Customer specific. You can read more about this setup in an this post.

Requirements:

  • Separation of Rights between Org VDCs
  • Shared networks between Org VDCs
  • Team-A Can only manage the Edge GW in ORG VDC A
  • Team-B can manage all resources in both ORG VDCs except the Edge GW in ORG VDC A

Setup:

  • One Provider VDC (vCenter)
  • One Organization in Cloud Director (Tenant1)
  • Two Org VDC connected to the same Provider VDC (ORG VDC A & ORG VDC B)
  • Two Data Center Groups (VDC Group A & VDC Group B)
  • Two Edge GW (Edge A connected to VDC Groupp A & Edge B connected to VDC Group B)
  • Tenant Acces Role Team-A
  • Tenant Acces Role Team-B

Datacenter Groups

From version 10.2, VMware Cloud Director supports Data Center Group networking backed by NSX-T Data Center.

A Data Center Group acts as a Cross-VDC router that provides centralized networking administration, egress point configuration, and east-west traffic between all networks within the group. 

Using Data Center Groups, we can share organization networks across various ORG VDCs. To do so we first group the virtual data centers, then create a VDC network that is scoped to the Data Center Group.
A data center group can contain between one and 16 virtual data centers that are configured to share multiple egress points. 

We need to created 2 Data Center Groups and connect them to the participating VDC & Edges

VDC Group A -> ORG VDC A (24-2 in picture below) & Edge A
VDC Group B -> ORG VDC B (24 in picture below) & Edge B

Roles

By default, organization VDCs are shared with all users and groups that have a role which includes the Allow Access to All Organization VDCs right.

As an Organization Administrator, you can limit the access to each of the organization VDCs in your organization to specific users and groups.

Our organization has multiple organization VDCs and we want to have them managed separately, so create a custom role that would function as an organization VDC administrator and assign it to specific users or groups within your organization, providing them with access only to a specific VDC’s compute and networking resources.

For Team-B we can use the predefined role Organization Administrator. In this role the following right is allowed: Allow Access to All Organization VDCs

This permission is exactly what is tells you it does, giving you permission to ALL Organizations VDC in the Organization. So with this role we are able to manage VM’s, networks, etc in all the Organization VDC’s.

Exactly what we need for Team-B.

For Team-A we need to make a new role with more granular permissions. Create a new role and exclude the Allow Access to All Organization VDCs right. Set the rest of the permissions to view and manage the Edges and Networks.

Publish both Roles to the Tenant and create 2 users in the Tenant.

Limit Access to ORG-VDC

Now we need to limit the access to the Org VDC, on the Virtual Data Center dashboard screen, click the card of the virtual data center that you want to limit access to.

Under Settings, click Sharing, The list of users and groups within the organization that have access to the VDC appears. To change the access settings to the organization VDC, click Edit.

Select Specific Users and Groups, From the Users list, select the users that you want to provide with access to the VDC, same procedure if you are using groups.

So for ORG VDC A select Team-A, Team-B already has access to all ORG VDC because of the Allow Access to All Organization VDCs right.

To share the VDC with the selected users and groups, click Share.
At this moment Team-A can only view and manage Edges and Networks in ORG VDC A, and Team B can view and manage all resources in both ORG VDCs, also the Edge in ORG VDC A. if we need to get that sorted out we need to created roles and groups for every thinkable Resource set (Edge Admin, VM Admin, DFW Admin, etc) in every ORG VDC.

But another requirement in this Test Case is the Shared Network between ORG VDCs.
For this requirement it is needed to add the other VDC to the participating VDCs in the Data Center Group.

As soon as you configure this Team-A (which can only see the Edge-A in ORG VDC A), immediately can see the Edge-B under ORG VDC B. A no go in our case. The rights are distributed very horizontal. So as soon as you have multiple participating VDC in your Data Center group the team that was restricted to viewing and managing the resources in ORG VDC A, can now also view and manage the resources it is permitted to in ORG VDC B.

Conclusion

For this Test Case the outcome was negative as we needed shared networks between ORG VDCs.
If the sharing of networks is not needed, you set a very Granular RBAC model, but keep in mind when you set the Allow Access to All Organization VDCs right in a role, the users/groups that have this role are allowed to show all resources they are eligible to in All Organization VDCs

Cloud Director NSX-T – Tenant Design Scenarios with 2 Edge Gateways

Half way december i switched to another team at my current employer, and got my hands dirty with Cloud director, NSX-T and AVI. As this was my first real hands-on with VMware Cloud Director.
I was given the task to investigate some scenarios in which a tenant is given a second Edge Gateway, for the separation of Internet and Customer Networks traffic.

Before describing the scenarios, I’ll assume you have basic NSX-T, routing and VCD knowledge.

Current Tenant Setup

In the current setup a tenant is given:

  • Cloud Director (10.3): Organization (example: Tenant1)
  • Cloud Director: Org VDC
  • NSX-T (3.2): VRF based tier-0 gateway for both Internet VRF and Customer VRF connected to Parent T0
  • Cloud Director NSX-T: 1 Edge Gateway (T1)
  • Clloud Director NSX-T: 1 VDC Group
  • If Load balancing is used: dedicated AVI Service Engine Group

Because there is a 1:1 relation between the T0 and the Edge Gateway (dedicated T0), route advertisement of connected tenant networks is available.

The traffic for both the customer networks and Internet is flowing through the same Tenant Edge gateway and VRF based tier-0 gateway.

Scenario 1 – 2 Edge Gateways Connected to 1 Shared T0.

The first scenario I tested is based on a Shared T0 (in VCD) and 2 Edge Gateways (1 for Internet and 1 for Customer Networks).

Setup:

  • Cloud Director: Organization (example: Tenant1)
  • Cloud Director: Org VDC
  • NSX-T: VRF based tier-0 gateway for both Internet VRF and Customer VRF connected to Parent T0 (parent T0 is shared between 5 Tenants)
  • Cloud Director NSX-T: Set T0 to shared
  • Cloud Director NSX-T: two Edge Gateway (T1) connected to the same T0 (shared)
  • Cloud Director NSX-T: two VDC Groups (Data Center Groups)
  • If Load balancing is used: Shared AVI Service Engine Group (dedicated as described in current setup can also be used).

The downside of a Shared T0 is that route advertising of Tenant network connected to the Edge Gateway (T1) isn’t available to the T0.

Tenant VMs can connect to the internet or customer networks by using NAT and firewall rules.
SNAT rules need to be created for outbound traffic and DNAT rules inbound traffic.
Only use private IP spaces can be used for tenant networks connected to the Edge Gateway (T1).

Also because we use two Edge Gateways in an Org VDC we need to create 2 VDC Groups (Data Center Groups), because there is a restriction of 1 Edge Gateway per VDC Group (Data Center Group), two Datacenter Groups also mean 2 Distributed Firewalls to manage.

If you need connection between a VM in VDC Group 1 and a VM in VDC Group 2, you can create a network that will be spanning across both VDC groups.

In this scenario we still have the BGPs for both Internet and Customer configured on 1 VRF based tier-0 gateway, so this is not completely dedicated. And it is giving a lot of extra effort configuring the SNAT and DNAT.

We also worked out a scenario if the tenant is using AVI. In the normal setup the tenat is given a dedicated Service Engine Group. We also discovered the option to share the Service Engine Group between the Internet and Customer Networks. This option is a valid solution as the AVI per Default configures separates the traffic based on VRF in the Service Engines.

Scenario 2 – 2 Edge GW with dedicated T0

For the 2nd scenario we decoupled also the VRF Based T0 and creating again a 1:1 relation (dedicated T0) between the Edge Gateway and the VRF Based T0’s. Because of this 1:1 relationship route advertisements are available to the T0.

Setup:

  • Cloud Director: Organization (example: Tenant1)
  • Cloud Director: Org VDC
  • NSX-T: VRF based tier-0 gateway for Internet VRF connected to Parent T0
  • NSX-T: VRF based tier-0 gateway for Customer VRF connected to Parent T0
  • Cloud Director NSX-T: Set T0 to be dedicated
  • Cloud Director NSX-T: two Edge Gateway (T1) connected to the dedicated T0
  • Cloud Director NSX-T: two VDC Groups (Data Center Groups)
  • If Load balancing is used: Dedicated AVI Service Engine Group (because no design Change is needed for current model)

If you need connection between a VM in VDC Group 1 and a VM in VDC Group 2, you can created a network that will be spanning across both VDC groups.

The downside of the Scenario is that there are a lot of extra resources needed, and these resources will be billed to the tenant.

Conclusion

We still wanted a Design that met the customer requirements and also is easy to implement in the current setup. Which already in use by several tenants and was the default tenant setup when the platform was initially designed.

In Scenario 1 the Shared setup of the shared VRF based T0 doesn’t met the Customer requirements of separating Customer Networks and Internet traffic. Also the lack of the route advertisements was an issue for the Tenant.

In Scenario 2 where we separated all parts of the setup, all the Tenant requirements were met. The downside of this Scenario is that because all parts of the setup are decoupled, the resources a tenant needs using this setup are doubled. This means:

  1. Double Billing
  2. Double Resources, check the Config Maxs
  3. Increases operational effort

Scenario 2 is basically a copy of the current scenario, so the initial setup will cost less time, where Scenario 1 will have some design changes (shared T0, no route advertisments, SNAT/DNAT).

This design was based upon some requirements of a tenant, i tried to take in account that we also van use this for other tenants with the same requirements.

I loved investigating these scenarios, and discovering Cloud Director! Watch out for the next post about a Test Case with RBAC on Cloud Director!

Thank you for reading!

I also would like to mention the following blog article that helped me creating this post, Thanks Daniël!

https://blog.zuthof.nl/2020/06/30/cloud-director-nsx-t-inter-tenant-routing/


Exam Review CEH v12

In december 2022, i took the Certified Ethical Hacking Exam v12 from EC-Council.
This exam and training were on my todo list for a couple of years already.

Training

I took the training at Startel in the Netherlands (sept. 2021) at that time it was still version 11 of the training. Startel has a sort of bond with the trainer Dimitrios Zacharopoulos. I can tell you if you want to study for CEH, you need to get trained by this guy! Awesome dude and what a lot of knowledge from the field!

During the training you get access to the iLabs of EC-Council. In this environment you can do all the practical questions you get during the training. Real handy to get familiar with the tools and commands.

The training it self was a 5 day classroom training and those days were pretty packed.
The trainer took the chapters from the study material and gave a lot of real life samples for about every situation you can think of.

If you are lucky and get the training from Dimitrios, he will give you a lot of resources which you can use during your study after the training:

– Practice exams (created by Dimitrios himself)
– Summary of almost all available Tools
– Tools
– eBooks
– Links
– Videos

It took me ages to explore all those Gigabytes of resources, but man they were handy!

I had in mind to study for the exam about 2 or 3 months after the training and then take the exam.
Yeah Right….

I think i started studying again around May 2022, from my employer i had access to the pluralsight library and there was a course Certified Ethical Hacking prep. I watched a lot of chapters and i build my own Cybersecurity homelab with the use of another pluralsight course: Build a cybersecurity lab

Aslo if you are a vExpert you also get access to the pluralsight library.

The practical studying helped me memorizing all the different tools and for which they can be used also you have to know commandline options of several tools, and what is better than hands-on studying!

Also check the EC-Council website for the latest Exam blueprint and get familiar with all the chapters and parts mentioned in it.

About the Exam

  • Number of Questions: 125
  • Test Duration: 4 hours
  • Test Format: Multiple choice
  • Test Delivery: ECC EXAM, VUE
  • Exam Prefix: 312-50 (ECC EXAM), 312-50 (VUE)

Passing Score

The individual rating of all questions contribute to an overall “Cut Score” for each exam form. To ensure each form has equal assessment standards, cut scores are set on a “per exam form” basis. Depending on which exam form is challenged, cut scores can range from 60% to 85%.

Scheduling the Exam

After i fullfilled the classroom training I got an Exam Voucher with a validity of 1 year.
I wanted to take the exam just before the Exam Voucher would expire, but ofcourse during my holiday i did not study as much as i wanted.

I didn’t fel comfortable with taking the exam, so i contacted EC-Council and extended the voucher with 3 monts. Costs around 35 dollars.

The vouchers can be extended only 1 time, this gave me an extra 3 monts to study.
I scheduled the exam on the 9th of december 10:00 at the same location where i took the training. (This was a hard requirement for the voucher).

Exam Day

Most of the time when i book an exam it is in the morning, this time was no exception.
I arrived at 9:30 at the exam center and after login and the identity check i could immediatly start.
I think it took me roughly 2.5 hours to answer all the 125 question. so from my opinion the 4 hours was more than enough.

After clicking finish, it is always a relief to see the PASSED mark on the screen.
Man i was happy after all the work i put into it.

Exam Audit

But then…. after a few days not hearing anything from EC-Council, i got the following mail in may inbox:

So i had to ask my employer for an experience letter, when i send this the exam was cut free and it was official!

I am now a Certified Ethical Hacker, Hell yeah!

Replace NSX-T expiring localmanager and globalmanager self-signed Certificates.

Recently we saw some warnings about expiring certificates in the NSX-T Global Manager and Local Manager.

When we clicked one of the alerts we got a small description and some API calls we can fire to apply new certificates.

In the Certificates overview (System > Certificates > Certificates), we could see that the certificates Issued to the Local Manager and Global Manager were expiring. The certification id’s were also corresponding to the ones in the alert (not the ones in my screenshots).

The API calls that were mentioned in the Alert description are for the renewal of certificate to the HTTP service (UI), not the Local/Global Manager certificates.
The VMware Docs don’t explain in good detail how to change these certificates, i couln’t find it.

Try it yourself:
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-50C36862-A29D-48FA-8CE7-697E64E10E37.html

The only give away i could find was in step 6: (NSX Federation and the service type).

So before we can replace the certificates, we need to create new Self Signed Certificates for the Local Manager, and the Global Manager.

Create CSR on GM/LM:

to create a CSR (Certificate Signing Request) on the Global or Local Manager go to: System > Certificates > CSRs and click on “Generate CSR“.

For the Global Manager do this via the Global Manager Appliance and for the Local Manager use the Local Manager Appliance, or use the drop down on the top of the screen to choose between your Global Manager or Local Managers.

Fill in all the fields and hit the GENERATE button, example below is for the Global Manager. For The local Manager just change the word global to local:

Now we can see a new CSR in the list, the next step is to self-sign the Global Manager CSR, select the CSR and under actions choose “Self Sign Certificate for CSR”

Choose your number of days:

Now we have a new Self-Signed certificate for the Global Manager in the certificates list, with this certificate we can replace the Principal Identity certificate for the Global Manager.

For the local manager certificates, follow the steps mentioned above on the local manager appliance.

Apply Self-Signed Certificate on the Global Manager

Before we can apply the SS certificate to the Global Manager we need to copy the certificate id, click on the ID, and then select the whole id in the pop-up and copy/paste it for later use:

Now we can Fire up Postman to apply the certificate by API:

  1. Change the ACTION drop down to POST
  2. Paste the following url to your Global Manager:

    Step 1 to 4

    5. Change to Body, select raw and from the drop-down choose JSON
    6. put the following JSON in the Body:

    {
    “cert_id”: “<new cert id>”,
    “service_type”: “GLOBAL_MANAGER”
    }

    And hit the SEND button, if you get back Status 200 OK, the API call was succesful:

    Apply Self-Signed Certificate on the Local Manager

    This procedure is almost the same as on the Global Manager. Go to the Local Manager Appliance and copy the new certificate id.

    Go to Postman:

    1. Change the ACTION drop down to POST
    2. Paste the following url to your Global Manager:

      Check the old expiring certificate

      Now since we are on version 3.1, we need to check by API if the old certificate is released so we know it isn’t used anymore before we delete it.
      Note: In Newer version you have a “Where Used” field in the certificate overview.

      Go to Postman:

      1. Change the action to GET
      2. Paste the following url to your Global Manager: (use the id of the expiring certificate)
        https://<global-manager>/api/v1/trust-management/certificates/dee8d78b-5e04-4deb-8d36-6b86f79f058b
        Set Authorization the same as the previous API Calls
      3. Select Body and set it to none
      4. Hit Send

      If the Certificate is used by a node look for the used_by part, when there is a node_id, the certificate is still in use and can’t be deleted. If it is empty, you can delete the Certificate in the UI, you can do this check on the new certificate to see if it is used by the same node.

          "used_by": [
              {
                  "node_id": "515f0642-b1eb-ef39-9acc-82f8760ab0b9",
                  "service_types": [
                      "GLOBAL_MANAGER"
                  ]
              }
          ],

      Sometimes the Certificates won’t release itself, so let’s release the damn thing:

      Release a Certificate

      Please keep in mind that you only release the certificate from the node_id if you are absolutly sure, if not please raise a ticket to VMware Support.

      1. login with the admin user to the manager with ssh
      2. then type st e, enter the root password and you are now at the shell
      3. Use the certificate id and the node_id from the previous step:
      4. now use the following API call to release the Certificate of the node_id:
        curl -k -X POST -H “Content-Type: application/json” -H ‘X-NSX-Username:admin’ -H ‘X-NSX-Groups:superuser’ -d ‘{“service_type”:”API”,”node_id”:”<node_id>“}’  “http://localhost:7440/nsxapi/api/v1/trust-management/certificates/<certificate-id>?action=release
      5. This should do it, you can check the certificate again with the previous step and look for the used_by, this should be emtpy now.
      curl -k -X POST -H "Content-Type: application/json" -H 'X-NSX-Username:admin' -H 'X-NSX-Groups:superuser' -d '{"service_type":"GLOBAL_MANAGER","node_id":""647defab-13c5-5g62-93e4-da4a345d666"}'  "http://localhost:7440/nsxapi/api/v1/trust-management/certificates/ea5771fb-161e-49dd-97d7-c1483d2790666691?action=release"
      

      If the used_by is empty, you can now safely remove the certificate.

      All the steps are almost identical for the Global and Local Manager, just change the service types for GLOBAL_MANAGER to LOCAL_MANAGER, etc.

      And the Alerts are now Resolved!

Design Consideration – For Bare Metal Edge with 4 PNIC in a NSX-T Federated Environment

During a failover test with the Bare Metal Edges we ran into an issue when pulling the plug on 1 of the TOR switches. (TOR-LEFT). During that test all BGPs on both Bare Metal edges went down. So no North-South routing anymore 🙁

Edge Setup:

The Bare Metal Edges were configured following the design guidelines from VMware:
https://nsx.techzone.vmware.com/resource/nsx-t-reference-design-guide-3-0#_Edge_Node_and_1

Indentifying the problem:

So why this behaviour? And what happens when we pull the plug on the other TOR switch (TOR-RIGHT).
After performing the test with the TOR-RIGHT, the BGPs connected to TOR Left stayed established. So it has something to do with switch TOR-LEFT?

After checking the configuration on the TOR-LEFT switch we didn’t identified something that could cause this issue. But what could it be? Edges were configured by VMware guidelines and were identical configuration wise.

So going through the logs was the next step in the process, and i stumbled upon this part in the log file:

2022-10-17T10:37:08.578Z Update device fp-eth0 state to DOWN
2022-10-17T10:37:08.578Z Self Node 00363d34-fcdd-11ea-8e07-e4434ba66042 status changed from Up to Down (RTEP device down)

Can it have something to do with the federated setup (RTEP), is the RTEP only connecting over fp-eth0?

Cause:

Again i went through the setup but now i also checked the fp-eth0 connections to the switches. On both BareMetal Edges the fp-eth0 was connected to the TOR-LEFT. So when we pulled the plug on that Switch it triggered the RTEP going down, which led to all BGP session going down.

This is expected behavior according to VMware!

Solution:

The solution to this issue was pretty simple after we identified the cause. We switched the connection on the second Bare Metal Edge, so the pnics connected to TOR LEFT will be on TOR-RIGHT and vice versa. The opposite of the first Bare Metal Edge.

RTEP down in Global Manager NSX-T UI

A while ago we ran into an issue after we did the upgrade from NSX-T version 3.1.3.6 to 3.1.3.7. In the alarms section at one Site. Still wanted to do a post about the issue and the solution/workaround:

Time to check the connection!
Login to the Edges and grap the VRF id of the RTEP TUNNEL.

Check the BGP and ping between the RTEP ip addresses on both sites.

As you can see all BGPs are established and the ping commands give a reply.
Let’s do another check from Postman:

Open Postman and fire a GET api call to the nsx-manager to grab the edge id we need in the next api call:
API GET call:

https://<nsxmanager ip>/api/v1/transport-nodes/

Just select Basic Auth under the Authorization tab and fill in the Admin credentials.

Hit Send, when getting a reply in the Body, search for the edge name and the corresponding id.

Now we use this id to get the RTEP status:

GET https://<nsxmanagerip>/api/v1/transport-nodes/<edgenodeid>/inter-site/bgp/summary

Check the output and the Return Status for issues, as you can see in the example above the BGP to one of the peers is establised.

Solution:

So it seems like the issue is known in the 3.1.3.7 version in a 3 manager nodes setup.

The node which has generated the alarm, only that node can clear alarm from in-memory when it will receive remove alarm from the edge node. The Alarm was resolved on 1 of the manager nodes, but it was showing on other nodes and it was keeping the alarm as active.

The following workaround will remove the alarm: Restart the proton service on ALL manager nodes.

– SSH with the admin user to the NSX-T manager nodes:
– execute the following commands:

Stop service proton
Start service proton

UPDATE: The issue is fixed in version 3.2.1

Upload MUB file fails with ansible-for-nsxt

While trying to upgrade the NSX-T enviroment via ansible we stumbled upon the issue that we couldn’t upload the mub file to the NSX Manager.

Ansible Task:

- name: Upload upgrade file to NSX-T manager coordinator node from file
  vmware.ansible_for_nsxt.nsxt_upgrade_upload_mub:
    hostname: "{{ coordinator }}"
    username: "{{ username }}"
    password: "{{ password }}"
    validate_certs: False
    file: "{{ nsx_upgrade_mub_file }}"

This gave us the following error message:

fatal: [127.0.0.1]: FAILED! => {
    "changed": true,
    "invocation": {
        "module_args": {
            "file": "/root/hypervisor/upgrade_bundle/VMware-NSX-upgrade-bundle-3.2.1.1.0.20115686.mub",
            "hostname": "manager1",
            "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
            "port": 443,
            "timeout": 9000,
            "url": null,
            "username": "admin",
            "validate_certs": false
        }
    },
    "msg": "Error: string longer than 2147483647 bytes"

So it looks like the upload can’t handle files over 2GB.

Honoustly my python skills are a bit rusty, so i asked one of the developers in our team to help me out and see if we could get this fixed.

The 2GB+ filesize is the issue. You can find multiple references to the error, usually referring to the httplib, urllib or ssl..
One solution is to use streaming upload.

This is what we did to make the upload work.

Install request-toolbelt package

Edit nsxt_upgrade_upload_mub.py
NOTE: This will break the URL upload!
Add:

import requests
from requests_toolbelt.multipart import encoder
from requests.auth import HTTPBasicAuth 

Replace line 140 – 174 with:

        session = requests.Session()
        with open(file_path, 'rb') as src_file:
             body = encoder.MultipartEncoder({
                 "file": (src_file.name, src_file, "application/octet-stream")
             })
             headers = {"Prefer": "respond-async", "Content-Type": body.content_type}
             resp = session.post(mgr_url + endpoint, auth=HTTPBasicAuth(mgr_username, mgr_password), timeout=None, verify=False, data=body, headers=headers)
             bundle_id = 'latest'#resp['bundle_id']
             headers = dict(Accept="application/json")
             headers['Content-Type'] = 'application/json'
             try:
                 wait_for_operation_to_execute(mgr_url,
                     '/upgrade/bundles/%s/upload-status'% bundle_id,
                     mgr_username, mgr_password, validate_certs,
                     ['status'], ['SUCCESS'], ['FAILED'])
             except Exception as err:
                 module.fail_json(msg='Error while uploading upgrade bundle. Error [%s]' % to_native(err))
             module.exit_json(changed=True, ip_address=ip_address,
             message='The upgrade bundle %s got uploaded successfully.' % module.params[mub_type])
        session.close()

NOTE: This will break the URL upload!

The response will show:

changed: [127.0.0.1] => {
    "changed": true,
    "invocation": {
        "module_args": {
            "file": "/upgrade_bundle/VMware-NSX-upgrade-bundle-3.2.1.1.0.20115686.mub",
            "hostname": "nsxmanager",
            "password": "VALUE_SPECIFIED_IN_NO_LOG_PARAMETER",
            "port": 443,
            "timeout": 9000,
            "url": null,
            "username": "admin",
            "validate_certs": false
        }
    },
    "ip_address": "<ip>",
    "message": "The upgrade bundle /upgrade_bundle/VMware-NSX-upgrade-bundle-3.2.1.1.0.20115686.mub got uploaded successfully."
}

Solution is also added to a github BUG report:

https://github.com/vmware/ansible-for-nsxt/issues/416

All Kudos to my colleague for fixing the issue!