Wednesday, September 25, 2013

The Case of the Broken ARP (In Progress)

Recently, I was tasked with a project that, as part of its preparation phase, required bringing a whole bunch of servers up to the same patch level. In total, I patched about 20 systems that were dual-homed and equipped with asymmetrical, 10/1 active/passive bonds. All but the last system went aces. The last system... was weird.

After patching the final system, my secondary network pair was no longer able to talk on the network. While diagnosing, attempts to ping any host on the local LAN segment resulted in "host unreachable" errors. Every host that I tried to ping ended up in the afflicted host's ARP table with a missing ("<incomplete>") MAC address entry.
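For anyone chasing the same symptom, here is a minimal sketch for listing the unresolved entries. It assumes a Linux host where the ARP table is readable at /proc/net/arp, and that unresolved entries carry a flags value of 0x0 there (which is how they normally appear). `incomplete_arp` is a hypothetical helper name, not something from our builds.

```shell
# List ARP entries that never completed resolution.
# Reads /proc/net/arp-format text on stdin.
# Optional $1 restricts the output to one device (e.g. bond1).
incomplete_arp() {
    awk -v dev="${1:-}" '
        NR == 1 { next }    # skip the header row
        # Column 3 is the flags word, column 6 the device name.
        $3 == "0x0" && (dev == "" || $6 == dev) { print $1, $6 }
    '
}
```

Usage would look something like `incomplete_arp bond1 < /proc/net/arp`, which prints one "IP device" pair per unresolved neighbor.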

Our builds are normally fairly locked down, which means many troubleshooting tools (such as tcpdump) are not installed. As a last resort, I opted to temporarily install tcpdump to see what, if anything, the afflicted bond (and its sub-interfaces) was seeing on the network. Interestingly, as soon as I started snooping either the parent bond or the active interface, network activity became "normal" and the previously "<incomplete>" ARP table entries were populated with the other hosts' MAC addresses. As soon as I stopped my tcpdump runs, networking reverted to its broken state and the other hosts' MAC address entries in the ARP table returned to "<incomplete>".

I still don't have a fix; this is a "work in progress" article at the moment. Since tcpdump puts an interface into promiscuous mode by default, that mode looked like the thing that made the difference. So, as a workaround, since this server is a critical infrastructure component, I ran `ifconfig bond1 promisc` and updated the bond's file under /etc/sysconfig/network-scripts to preserve the state should the system reboot before I find a more suitable fix. For right now, in order to get this one bond (of two on the system) to work, I need to leave it in promiscuous mode.
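Since the workaround hinges on promiscuous mode, it can be handy to confirm the flag is actually set (say, after a reboot). This is a minimal sketch, assuming a Linux host that exposes the interface flags word at /sys/class/net/<iface>/flags. `is_promisc` is a hypothetical helper name; 0x100 is the kernel's IFF_PROMISC bit.

```shell
# Succeeds (exit 0) if the IFF_PROMISC bit (0x100) is set in a flags word.
# $1: hex flags value, e.g. the contents of /sys/class/net/bond1/flags
is_promisc() {
    [ $(( $1 & 0x100 )) -ne 0 ]
}
```

Something like `is_promisc "$(cat /sys/class/net/bond1/flags)" && echo "bond1 is promiscuous"` would then report the current state.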

Obviously, I have our networking guys looking at the switches to see if there's a difference between how the ports for bond0 and bond1 are configured. I figure it has to be the network, since: A) one bond works but the other doesn't; and B) no promiscuous-mode changes were required for any of the other hosts that were patched.
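While the switch side gets looked at, the host side of the two bonds can be compared as well. This is a rough sketch, assuming the Linux bonding driver's status files under /proc/net/bonding/ and their usual "Bonding Mode:" and "Currently Active Slave:" lines. `bond_summary` is a hypothetical helper name.

```shell
# Print the bonding mode and the currently active slave.
# Reads /proc/net/bonding/<bond>-format text on stdin.
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/           { print "mode=" $2 }
        /^Currently Active Slave:/ { print "active=" $2 }
    '
}
```

Running `bond_summary < /proc/net/bonding/bond0` and then the same against bond1 gives two short summaries to eyeball side by side.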

At any rate, if you happen to stumble on this article before I get it beyond a "work in progress" state, please feel free to comment if you know a likely fix.
