Thursday, May 7, 2015

EXTn and the Tyranny of AWS

One of the organizations I provide consulting services to opted to start migrating from an in-house, VMware-based virtualization solution to an Amazon-hosted cloud solution. The transition has been somewhat fraught, here and there - possibly more so than the prior transition from physical (primarily Solaris-based) servers to virtualized (Linux) servers.

One of the huge "problems" is that the organization's various sub-units have habits formed over decade-or-longer lifecycles. Staff are particularly used to being able to get on console for various things (using GUI-based software installers and graphical IDEs, as well as tasks that actually require console access - like recovering a system stuck in its startup sequence).

For all that AWS offers, console access isn't among its offerings. Amazon's model for how customers should deploy and manage systems means that they don't consider console access to be strictly necessary.

In a wholly self-service model, this is probably an OK assumption. Unfortunately, the IT model the organization in question is moving to doesn't offer instance-owners true self-service. They're essentially trying to transplant an on-premises, managed multi-tenancy model into AWS. The model they're moving from didn't have self-service, so they're not interested in enabling it (at least not during phase one of the move). Not only do their tenants lack console access in the new environment, they also lack the ability to execute the AWS-style recovery methods you'd use in the absence of console access. The tenants are impatient and the group supporting them is small, so it's a tough situation.

The migration has been ongoing long enough that the default `mkfs` behavior for EXT-based filesystems is starting to rear its head. Now that we're well beyond the 180-day mark since the inception of the migration, tenants are finding that, when their Linux instances reboot, they're not coming back as quickly as they did towards the beginning of the migration ...because their builds still leave autofsck enabled.
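If you're curious whether a given filesystem is still carrying those defaults, `tune2fs` will report the relevant counters. A quick check - using the same hypothetical LVM volume that shows up in the examples below - looks like:

    # Show the mount-count and time-based autofsck settings for an EXT filesystem
    tune2fs -l /dev/RootVG/auditVol | grep -Ei 'mount count|check interval|last checked'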

If you're reading this, you may have run into similar issues.

The solution to this, while still maintaining the spirit of the "fsck every 180 days or so" best practice for EXTn-based filesystems, is fairly straightforward:

  1. Disable the autofsck settings on your instances' EXTn filesystems: use `tune2fs -i 0 -c -1 /dev/<DEVNODE>` (see the sketch just after this list)
  2. Schedule periodic fsck "simulations". This can be done either by running fsck in "dryrun" mode or by doing an fsck of a filesystem metadata image.
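For the first step, a quick loop can take care of every mounted EXTn filesystem in one go. This is a minimal sketch - it assumes your filesystems show up in `/proc/mounts` with an ext2/ext3/ext4 type:

    #!/bin/sh
    # Disable time-based (-i 0) and mount-count-based (-c -1) autofsck
    # on every currently-mounted EXT2/3/4 filesystem
    awk '$3 ~ /^ext[234]$/ { print $1 }' /proc/mounts | while read DEVNODE
    do
       tune2fs -i 0 -c -1 "${DEVNODE}"
    done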
The "dryrun" method is fairly straight forward: just run fsck with the "-N" option. I'm not super much a fan of this as it doesn't feel like it gives me the info I'm looking for to feel good about the state of my filesystems.

The "fsck of a filesystem metadata image" is pretty straight forward, automatable and provides a bit more on the "warm fuzzies" side of thing. To do it:
  1. Create a metadata image file using `e2image -fr /dev/<DEVNODE> /IMAGE/FILE/PATH` (e.g. `e2image -fr /dev/RootVG/auditVol /tmp/auditVol.img`)
  2. Use `losetup` to create an fsck'able block-device from the image file (e.g., `losetup /dev/loop0 /tmp/auditVol.img`)
  3. Execute an fsck against the loopback device (e.g., `fsck /dev/loop0`). Output will look similar to the following:
    # fsck /dev/loop0
    fsck from util-linux-ng 2.17.2
    e2fsck 1.41.12 (17-May-2010)
    /dev/loop0: recovering journal
    /dev/loop0: clean, 13/297184 files, 56066/1187840 blocks
  4. If the output indicates anything other than good health, schedule an outage to do a proper repair of your live filesystem(s)
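Rolled together, the whole check can be scripted and dropped into cron. This is a minimal sketch, not a hardened tool - the volume, image path and loop device are the same hypothetical examples used above, so adjust to taste:

    #!/bin/sh
    # Periodic health-check of an EXT filesystem via a metadata image
    DEVNODE="/dev/RootVG/auditVol"            # filesystem to check (example)
    IMGFILE="/tmp/auditVol.img"               # scratch location for the metadata image
    LOOPDEV="/dev/loop0"                      # loopback device assumed to be free
    e2image -fr "${DEVNODE}" "${IMGFILE}"     # capture the filesystem's metadata
    losetup "${LOOPDEV}" "${IMGFILE}"         # expose the image as a block device
    fsck -y "${LOOPDEV}"                      # check the image, not the live filesystem
    FSCKEXIT=$?                               # (add -f to force a check even if "clean")
    losetup -d "${LOOPDEV}"                   # clean up the loopback device
    rm -f "${IMGFILE}"                        # ...and the image file
    exit ${FSCKEXIT}

If the script exits with anything other than "0", treat that as the cue to schedule the outage described in step 4.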
Granted, if you find you need to do the full check of the real filesystem(s), you're still potentially stuck with the "no console" issue. Even that is potentially surmountable:
  1. Create a "/forcefsck" file
  2. Create a "/fsckoptions" file with the contents "-sy"
  3. Schedule your reboot
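In shell terms - and assuming a RHEL-style init that honors these flag-files - that boils down to:

    # Tell the next boot to run a forced, non-interactive fsck
    touch /forcefsck                 # rc.sysinit checks for this flag-file
    echo "-sy" > /fsckoptions        # options handed to the boot-time fsck
    # Example scheduling: reboot in ten minutes (pick your own window)
    shutdown -r +10 "Rebooting for forced filesystem check"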
When the reboot happens, depending on how long the system takes to boot, the EC2 status checks may time out: just be patient. If you can't be patient, monitor the boot logs (either in the AWS console or using the AWS CLI's equivalent option).
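On the CLI side, `aws ec2 get-console-output` is the relevant sub-command (the instance ID below is just a placeholder):

    # Pull the instance's console/boot output via the AWS CLI
    aws ec2 get-console-output --instance-id i-1234abcd --output text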