Wednesday, April 25, 2012

When VMs Go *POOF!*

Like many organizations, the one I work for is following the pathway to the cloud. Right now, this consists of moving towards an Infrastructure As A Service model. This model heavily leverages virtualization. As we've been pushing towards this model, we've been doing the whole physical to virtual dance, wherever possible.

In many cases, this has worked well. However, like many large IT organizations, we have found some assets on our datacenter floors that have been running for years on both outdated hardware and outdated operating systems (and, in some cases, "years" means "in excess of a decade"). Worse, the people that set up these systems are often no longer employed by our organization. Worse still, the vendors of the software in use have long ago discontinued support for the version being run, or no longer maintain any version of the software at all.

One such customer was migrated last year. They weren't happy about it at the time, but they made it work. Because we weren't just going to image their existing, out-of-date OS into the pristine, new environment, we set them up with a virtualized operating environment that was as close to what they'd had as possible. Sadly, what they'd had was a 32bit RedHat Linux server that had been built 12 years previously. Our current offerings only go back as far as RedHat 5, so that's what we built them. While our default build is 64bit, we do offer 32bit builds - unfortunately, the customer never requested a 32bit environment. The customer did their level best to massage the new environment into running their old software. Much of this was done by tar'ing up directories on the old system and blowing them onto their new VM. They'd been running on this massaged system for nearly six months.

If you noticed that 32-to-64bit OS-change, you probably know where this is heading...

Unfortunately, as is inevitable, we had to take a service outage across the entire virtualized environment. We don't yet have the capability in place to live-migrate VMs from one data center to another for such windows. Even if we had, the nature of this particular outage (installation of new drivers into each VM) was such that we had to reboot the customer's VM anyway.

We figured we were good to go as we had 30+ days of nightly backups of each impacted customer VM. Sadly, this particular customer, after doing the previously-described massaging of their systems, had never bothered to reboot their system. It wasn't (yet) in our procedures to do a precautionary pre-modification reboot of the customers' VMs. The maintenance work was done and the customer's VM rebooted. Unfortunately, the system didn't come back from the reboot. Even more unfortunately, the thirty-ish backup images we had for the system were similarly unbootable and, worse, unrescuable.

Eventually, we tracked down the customer to inform them of the situation and to find out if the system was actually critical (the VM had been offline for nearly a full week by the time we located the system owner, but no angry "where the hell is my system" calls had been received or tickets opened). We were a bit surprised to find that this was, somehow, a critical system. We'd been able to access the broken VM to a sufficient degree to determine that it hadn't been rebooted in nearly six months, that its owner hadn't logged in for nearly that same amount of time, but that the on-disk application data appeared to be intact (the filesystems it lived on were mountable without errors). So, we were able to offer the customer the option of building them a new VM and helping them migrate their application data off the old VM to the new VM.

We'd figured we'd just restore data from the old VM's backups to a directory tree on the new VM. The customer, however, wanted the original disks back as mounted disks. So, we had to change plans.

The VMs we build make use of the Linux Logical Volume Manager (LVM) to manage storage allocation. Each customer system is built off of standard templates. Thus, each VM's default/root volume group shares the same group name. Trying to import another host's root volume group onto a system that already has a volume group of the same name tends to be problematic in Linux. That said, it's possible to massage (there's that word again) things to allow you to do it.
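
Just to make the collision concrete (RootVG here is a stand-in - substitute whatever your template's root volume group is actually named): once the old VM's disk is visible to the new VM, listing volume groups along with their UUIDs is the quickest way to tell two identically-named groups apart.

# Both same-named volume groups will be listed; the UUID column is
# what distinguishes them
vgs -o vg_name,vg_uuid,pv_count,lv_count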

The customer's old VM wasn't bootable to a point where we could simply FTP/SCP its /etc/lvm/backup/RootVG file off. The security settings on our virtualization environment also meant we couldn't cut and paste from the virtual console into a text file on one of our management hosts. Fortunately, our backup system does support file-level/granular restores. So, we pulled the file from the most recent successful backup.

Once you have an /etc/lvm/backup/<VGNAME> file available, you can effect a name change on a volume group relatively easily (there's a consolidated command sketch after the list below). Basically, you:

  1. Create your new VM
  2. Copy the old VM's  /etc/lvm/backup/<VGNAME> into the new VM's /etc/lvm/backup directory (with a new name)
  3. Edit that file, changing the old volume group's object names to ones that don't collide with the currently-booted root volume group (really, you only need to change the name of the root volume group object - the volumes can be left as-is)
  4. Connect the old VM's virtual disk to the new VM
  5. Perform a rescan of the VM's scsi bus so it sees the old VM's virtual disk
  6. Do a `vgcfgrestore -f /etc/lvm/backup/<NewVGname> <NewVGname>`
  7. Do a `pvscan`
  8. Do a `vgscan`
  9. Do a `vgchange -ay <NewVGname>`
  10. Mount up the renamed volume group's volumes
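
As promised, here's what steps 4 through 10 look like strung together. Treat it as a sketch rather than a transcript: OldRootVG is a made-up name for the renamed volume group, host0 may not be the right SCSI host on your VM, and <LVNAME> is whichever logical volume you're after.

# Rescan the (first) SCSI host so the newly-attached virtual disk appears
echo "- - -" > /sys/class/scsi_host/host0/scan

# Lay the edited metadata back down under the new, non-colliding name
vgcfgrestore -f /etc/lvm/backup/OldRootVG OldRootVG

# Get LVM to re-read physical volumes and volume groups
pvscan
vgscan

# Activate the renamed volume group and mount one of its volumes
vgchange -ay OldRootVG
mkdir -p /mnt/oldroot
mount /dev/OldRootVG/<LVNAME> /mnt/oldroot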

As a side note (to answer "why didn't you just do a `boot linux rescue` and change things that way"): our security rules prevent us from keeping bootable ISOs (etc.) available to our production environment. Therefore, we can't use any of the more "normal" methods for diagnosing or recovering a lost system. Security rules trump diagnostics and/or recoverability needs. Dems da breaks.

Asymmetrical NIC-bonding

Currently, the organization I am working for is making the transition from 1Gbps networking infrastructure to 10Gbps infrastructure. The initial goal had been to first migrate all of the high network-IO servers that were using trunked 1Gbps interfaces to using 10Gbps Active/Passive configurations.
Given the current high per-port cost of 10Gbps networking, it was requested that a way be found to not waste 10Gbps ports. Having a 10Gbps port sitting idle "in case" the active port became unavailable was seen as financially wasteful. As a result, we opted to pursue the use of asymmetrical A/P bonds that used our new 10Gbps links for the active/primary path and reused our 1Gbps infrastructure for the passive/failover path.

Setting up bonding on Linux can be fairly trivial. However, when you start to do asymmetrical bonding, you want to ensure that your fastest paths are also your active paths. This requires some additional configuration of the bonded pairs beyond just the basic declaration of the bond memberships.
In a basic bonding setup, you'll have three primary files in the /etc/sysconfig/network-scripts directory: ifcfg-ethX, ifcfg-ethY and ifcfg-bondZ. The ifcfg-ethX and ifcfg-ethY files are basically identical but for their DEVICE and HWADDR parameters. At their most basic, they'll each look (roughly) like:

DEVICE=ethN
HWADDR=AA:BB:CC:DD:EE:FF
ONBOOT=yes
BOOTPROTO=none
MASTER=bondZ
SLAVE=yes

And the (basic) ifcfg-bondZ file will look like:

DEVICE=bondZ
ONBOOT=yes
BOOTPROTO=static
NETMASK=XXX.XXX.XXX.XXX
IPADDR=WWW.XXX.YYY.ZZZ
MASTER=yes
BONDING_OPTS="mode=1 miimon=100"
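
One gotcha worth mentioning: depending on the RHEL release, the bonding driver may not get loaded for your bondZ device automatically. On RHEL 5-era systems, that means a one-line alias in /etc/modprobe.conf (newer releases put the equivalent under /etc/modprobe.d/):

alias bondZ bonding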

This type of configuration may produce the results you're looking for, but it's not guaranteed to. If you want to absolutely ensure that your faster NIC will be selected as the primary NIC (and that the bond will fail back to that NIC if it goes offline and then comes back online), you need to be a bit more explicit with your ifcfg-bondZ file. To do this, you'll mostly want to modify your BONDING_OPTS directive. I also tend to add some BONDING_SLAVEn directives, but that might be overkill. Your new ifcfg-bondZ file that forces the fastest path will look like:

DEVICE=bondZ
ONBOOT=yes
BOOTPROTO=static
NETMASK=XXX.XXX.XXX.XXX
IPADDR=WWW.XXX.YYY.ZZZ
MASTER=yes
BONDING_OPTS="mode=1 miimon=100 primary=ethX primary_reselect=1"



The primary= tells the bonding driver to set the ethX device as primary when the bonding-group first onlines. The primary_reselect= tells it to use a interface selection policy of "best".

Note: The default policy is "0". This policy simply says "return to the interface declared as primary whenever it comes back up". I choose to override with policy "1" as a hedge against the primary interface coming back in some kind of degraded state (while most of our 10Gbps media is 10Gbps-only, some of the newer ones are 100/1000/10000). I only want to fail back to the 10Gbps interface if it's still running at 10Gbps and hasn't, for some reason, negotiated down to some slower speed.
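
If you want to verify or tweak these knobs on a live bond (rather than re-reading config files), the bonding driver exposes them via sysfs, assuming a reasonably current driver - something like:

# Which slave is currently carrying traffic, and which is primary
cat /sys/class/net/bondZ/bonding/active_slave
cat /sys/class/net/bondZ/bonding/primary

# Current reselection policy; newer drivers also let you change it on
# the fly by echoing the policy name (or number) into the same file
cat /sys/class/net/bondZ/bonding/primary_reselect
echo better > /sys/class/net/bondZ/bonding/primary_reselect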

When using the more explicit bonding configuration, the resultant runtime state (as reported by `cat /proc/net/bonding/bondZ`) will resemble something like:

Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: ethX (primary_reselect better)
Currently Active Slave: ethX
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: ethX
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: AA:BB:CC:DD:EE:FF

Slave Interface: ethY
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: FF:EE:DD:CC:BB:AA
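
As a final sanity-check, it doesn't hurt to confirm what each slave actually negotiated - the "better" reselection decision is made on speed and duplex, so you want the 10Gbps leg to really be reporting 10000Mb/s:

# Link speed/duplex as the kernel sees them, per slave
ethtool ethX | egrep 'Speed|Duplex|Link detected'
ethtool ethY | egrep 'Speed|Duplex|Link detected'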