It's been a while since I posted a troubleshooting article, but I recently ran into an interesting issue that I would like to share with you all.
There are two sites. Let's name them Site-1 and Site-2. Both are configured in Cross-Site NSX, with a universal logical switch and universal DLRs. Site-1 has its own universal DLR-1 and Site-2 has its own universal DLR-2. Due to some issue, VXLAN stopped working in Site-1. That issue is a story for another day. VXLAN was still working fine in Site-2, so to restore connectivity we moved the LIF with gateway 192.168.74.1 (virtual wire) from DLR-1 to DLR-2. The gateway started responding to ping right away, thanks to the dynamic routing in place. But when we moved the VM (IP 192.168.74.2) to Site-2 and attached it to the LIF's virtual wire, it was unreachable. So, what went wrong?
- Dynamic routing is in place and routes are being advertised properly.
- Performed a packet capture: traffic is leaving the VM's vNIC.
- Disabled and re-enabled the LIF. That didn't help, but this is where things got weird.
- After disabling the LIF, I was still able to reach the gateway from the VM. Moreover, all gateways of Site-1 attached to DLR-1 were reachable, but none of the DLR-2 gateways were.
- Also, the DLR-1 gateways were reachable with no intermediate hops, whereas traffic going from Site-2 to Site-1 should show some hops.
- So, I did a quick check of the neighbor tables on both DLRs, and this is where things got interesting:
DLR-1

    [root@xxxx:~] net-vdr -N -l default+edge-xxxxxxx-xxxx-xxxx-xxxx-d5b4a79101c7 | grep 192.168.74
    192.168.74.1 02:50:56:56:44:52 VI permanent 0 1 4e2000000020

DLR-2

    [root@xxxx:~] net-vdr -N -l default+edge-xxxxxxxx-xxxx-xxxx-xxxx-4b8b8489011b | grep 192.168.74
    192.168.74.2 00:00:00:00:00:00 N 0 2 4e2100000028
    192.168.74.1 02:50:56:56:44:52 VI permanent 0 1 4e2100000028

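- This was pretty strange: DLR-1 still held an entry for gateway 192.168.74.1 even though the LIF had been moved to DLR-2, and DLR-2 showed an all-zero MAC for the VM (192.168.74.2). A Force Sync on the DLR looked like the way to clear this up.
- Sure enough, performing a Force Sync resolved the issue, and the incorrect gateway entry disappeared from DLR-1.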
DLR-1

    [root@xxx:~] net-vdr -N -l default+edge-xxxxxxx-xxxx-xxxx-xxxx-d5b4a79101c7 | grep 192.168.74

DLR-2

    [root@xxxx:~] net-vdr -N -l default+edge-xxxxxxxx-xxxx-xxxx-xxxx-4b8b8489011b | grep 192.168.74
    192.168.74.2 00:50:56:b0:a0:74 VL 561 33554463 4 4e2100000028
    192.168.74.1 02:50:56:56:44:52 VI permanent 0 1 4e2100000028

- So, the VM is reachable now and everything looks good.
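
The neighbor-table check above can be scripted when there are many subnets to compare. What follows is only a rough sketch, not an NSX tool: it assumes you have already saved the `net-vdr -N -l ...` output of each DLR as text, and it reads just the first three fields of every line (IP, MAC, flags), since the remaining columns vary between entries. The names `parse_neighbors`, `unresolved_ips` and `shared_ips` are made up for this example.

```python
import re
from typing import List, NamedTuple

ZERO_MAC = "00:00:00:00:00:00"

# Match "<ipv4> <mac> <flags>" at the start of a line; the rest is ignored.
LINE_RE = re.compile(
    r"^(\d{1,3}(?:\.\d{1,3}){3})\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+(\S+)"
)


class Neighbor(NamedTuple):
    ip: str
    mac: str
    flags: str


def parse_neighbors(output: str) -> List[Neighbor]:
    """Parse saved neighbor-table text, skipping prompts and other lines."""
    neighbors = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            neighbors.append(Neighbor(*match.groups()))
    return neighbors


def unresolved_ips(neighbors: List[Neighbor]) -> List[str]:
    """IPs whose MAC is all zeros, i.e. the DLR has not resolved them."""
    return [n.ip for n in neighbors if n.mac == ZERO_MAC]


def shared_ips(a: List[Neighbor], b: List[Neighbor]) -> List[str]:
    """IPs present in both tables, e.g. a gateway left behind on the old DLR."""
    return sorted({n.ip for n in a} & {n.ip for n in b})


# Sample data copied from the outputs captured before the Force Sync.
dlr1_before = parse_neighbors(
    "192.168.74.1 02:50:56:56:44:52 VI permanent 0 1 4e2000000020\n"
)
dlr2_before = parse_neighbors(
    "192.168.74.2 00:00:00:00:00:00 N 0 2 4e2100000028\n"
    "192.168.74.1 02:50:56:56:44:52 VI permanent 0 1 4e2100000028\n"
)

print("Unresolved on DLR-2:", unresolved_ips(dlr2_before))
print("Present on both DLRs:", shared_ips(dlr1_before, dlr2_before))
```

With the pre-sync data it flags 192.168.74.2 as unresolved on DLR-2 and 192.168.74.1 as present on both DLRs, which are exactly the two symptoms seen in this case.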
So, Force Sync can resolve multiple issues. If there is an issue at the primary site and failing over to the secondary doesn't work properly, a Force Sync can make sure all changes are replicated successfully to the secondary site as well.
The solution was relatively easy, but it took me a whole night to get there. Well, troubleshooting is the best way to learn. So, happy learning.
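
I ran the Force Sync from the UI, but if you want to trigger it from a script, something like the sketch below could be a starting point. It assumes the NSX for vSphere REST API exposes force sync as `POST /api/4.0/edges/{edgeId}?action=forcesync`, so please verify that against the API guide for your NSX version. The manager hostname and edge ID are placeholders, authentication is left out, and the code only builds the request without sending it.

```python
import urllib.request


def build_force_sync_request(nsx_manager: str, edge_id: str) -> urllib.request.Request:
    """Build (but do not send) a force-sync request for one edge/DLR.

    Assumption: the endpoint path below matches your NSX version's API guide.
    """
    url = f"https://{nsx_manager}/api/4.0/edges/{edge_id}?action=forcesync"
    return urllib.request.Request(url, method="POST")


# Placeholder values, not taken from the environment in this article.
request = build_force_sync_request("nsx-manager.example.com", "edge-1")
print(request.get_method(), request.full_url)
```

Sending it would additionally need NSX Manager credentials and certificate handling, which depend on your environment.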