Wednesday, April 25, 2012

When VMs Go *POOF!*

Like many organizations, the one I work for is following the pathway to the cloud. Right now, this consists of moving towards an Infrastructure as a Service model. This model heavily leverages virtualization. As we've been pushing towards this model, we've been doing the whole physical-to-virtual dance wherever possible.

In many cases, this has worked well. However, like many large IT organizations, we have found some assets on our datacenter floors that have been running for years on both outdated hardware and outdated operating systems (and, in some cases, "years" means "in excess of a decade"). Worse, the people who set up these systems are often no longer employed by our organization. Worse still, the vendors of the software in use either discontinued support for the versions in question long ago or no longer maintain the software at all.

One such customer was migrated last year. They weren't happy about it at the time, but they made it work. Because we weren't just going to image their existing out-of-date OS into the pristine, new environment, we set them up with a virtualized operating environment that was as close to what they'd had as possible. Sadly, what they'd had was a 32-bit RedHat Linux server that had been built 12 years previously. Our current offerings only go back as far as RedHat 5, so that's what we built them. While our default build is 64-bit, we do offer 32-bit builds - unfortunately, the customer never requested a 32-bit environment. The customer did their level best to massage the new environment into running their old software. Much of this was done by tar'ing up directories on the old system and blowing them onto their new VM. They'd been running on this massaged system for nearly six months.
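
By way of illustration, the shuffle was roughly of the following form (the directory and host names here are hypothetical placeholders, not the customer's actual layout):

    # On the old 32-bit server: tar up an application directory and stream it
    # over SSH to the new VM, unpacking it in place. /opt/legacyapp and newvm
    # are made-up names for illustration only.
    tar cf - /opt/legacyapp | ssh newvm 'cd / && tar xf -'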

If you noticed that 32-to-64-bit OS change, you probably know where this is heading...

Unfortunately, as is inevitable, we had to take a service outage across the entire virtualized environment. We don't yet have the capability in place to live-migrate VMs from one data center to another for such windows. Even if we had, the nature of this particular outage (installation of new drivers into each VM) was such that we had to reboot the customer's VM anyway.

We figured we were good to go, as we had 30+ days of nightly backups of each impacted customer VM. Sadly, this particular customer, after doing the previously-described massaging, had never bothered to reboot their system. It wasn't (yet) in our procedures to do a precautionary pre-modification reboot of customers' VMs. The maintenance work was done and the customer's VM was rebooted. Unfortunately, the system didn't come back from the reboot. Even more unfortunately, the thirty-ish backup images we had for the system were similarly unbootable and, worse, unrescuable.

Eventually, we tracked down the customer to inform them of the situation and to find out whether the system was actually critical (the VM had been offline for nearly a full week by the time we located the system owner, but no angry "where the hell is my system" calls had been received and no tickets opened). We were a bit surprised to find that this was, somehow, a critical system. We'd been able to access the broken VM to a sufficient degree to determine that it hadn't been rebooted in nearly six months, that its owner hadn't logged in for nearly that same amount of time, but that the on-disk application data appeared to be intact (the filesystems it lived on were mountable without errors). So, we were able to offer the customer the option of building them a new VM and helping them migrate their application data off the old VM to the new VM.

We'd figured we'd just restore data from the old VM's backups to a directory tree on the new VM. The customer, however, wanted the original disks back as mounted disks. So, we had to change plans.

The VMs we build make use of the Linux Logical Volume Manager (LVM) to manage storage allocation. Each customer system is built off of standard templates, so each VM's default/root volume group shares the same group name. Trying to import another host's root volume group onto a system that already has a volume group of that name tends to be problematic in Linux. That said, it's possible to massage (there's that word again) things to allow you to do it.
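
To make the collision concrete (assuming a template volume group name of RootVG, which may well differ from your build standard): with the old VM's disk attached, something like the following would show two volume groups sharing one name, distinguishable only by UUID - hence the need to rename one before it can be cleanly activated:

    # Hypothetical check after attaching the old VM's disk; "RootVG" is an
    # assumed template VG name. Both the running root VG and the old disk's
    # VG appear under the same name, with only the UUID telling them apart.
    vgs -o vg_name,vg_uuid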

The customer's old VM wasn't bootable to a point where we could simply FTP/SCP its /etc/lvm/backup/RootVG file off. The security settings on our virtualization environment also meant we couldn't cut and paste from the virtual console into a text file on one of our management hosts. Fortunately, our backup system does support file-level/granular restores. So, we pulled the file from the most recent successful backup.

Once you have an /etc/lvm/backup/<VGNAME> file available, you can effect a name change on a volume group relatively easily. Basically, you do the following (a rough, command-level sketch follows the list):

  1. Create your new VM
  2. Copy the old VM's /etc/lvm/backup/<VGNAME> into the new VM's /etc/lvm/backup directory (with a new name)
  3. Edit that file, changing the old volume group's object name to one that doesn't collide with the currently-booted root volume group (really, you only need to change the name of the root volume group object - the volumes can be left as-is)
  4. Connect the old VM's virtual disk to the new VM
  5. Perform a rescan of the VM's SCSI bus so it sees the old VM's virtual disk
  6. Do a `vgcfgrestore -f /etc/lvm/backup/<NewVGname> <NewVGname>`
  7. Do a `pvscan`
  8. Do a `vgscan`
  9. Do a `vgchange -ay <NewVGname>`
  10. Mount up the renamed volume group's volumes
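
Pulled together, the whole dance looks something like the sketch below. Treat it as a rough, illustrative outline rather than a copy-and-paste recipe: RootVG, RootVG_old, host0, the restored-file path and the mount point are names I'm assuming for illustration, and your volume group, SCSI host and logical volume names will differ.

    # 2) Stage the old VM's restored LVM metadata backup under a
    #    non-colliding name (source path is a placeholder)
    cp /path/to/restored/RootVG /etc/lvm/backup/RootVG_old

    # 3) Change the volume group object's name inside the file - only the
    #    top-level "RootVG {" block name needs to change
    sed -i 's/^RootVG {/RootVG_old {/' /etc/lvm/backup/RootVG_old

    # 4) Attach the old VM's virtual disk to the new VM from the hypervisor

    # 5) Rescan the SCSI bus so the kernel notices the newly-attached disk
    #    (host0 is a placeholder - repeat per host adapter as appropriate)
    echo "- - -" > /sys/class/scsi_host/host0/scan

    # 6-9) Restore the metadata under the new volume group name, then
    #      rescan and activate it
    vgcfgrestore -f /etc/lvm/backup/RootVG_old RootVG_old
    pvscan
    vgscan
    vgchange -ay RootVG_old

    # 10) Mount the renamed volume group's volumes wherever convenient
    mkdir -p /mnt/oldvm
    mount /dev/RootVG_old/<lvname> /mnt/oldvm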

As a side note (to answer "why didn't you just do a `boot linux rescue` and change things that way"): our security rules prevent us from keeping bootable ISOs (etc.) available to our production environment. Therefore, we can't use any of the more "normal" methods for diagnosing or recovering a lost system. Security rules trump diagnostics and/or recoverability needs. Dems da breaks.
