Monday, May 7, 2018

Streamed Backups to S3

Introduction/Background


Many would-be users of AWS come to it from a legacy hosting background. Often, when moving to AWS, the question "how do I back my stuff up when I no longer have access to my enterprise backup tools?" gets asked. If not, it's a question that would-be AWS users should be asking.

AWS provides a number of storage options. Each option is optimized for particular use-cases, and each comes with its own combination of performance, feature and pricing tradeoffs (see my document for a quick summary of these tradeoffs). The lowest-cost option - and therefore the most attractive for the data-retention use-cases typical of backup activities - is S3. Further, within S3 there are pricing/capability tiers appropriate to different types of backup needs (the following list is organized by price, highest to lowest):
  • If there is a need to perform frequent full or partial recoveries, the S3 Standard tier is probably the best option
  • If recovery-frequency is pretty much "never" — but needs to be quick if there actually is a need to perform recoveries — and the policies governing backups mandate up to a thirty-day recoverability window, the best option is likely the S3 Infrequent Access (IA) tier.
  • If there's generally no need for recovery beyond legal-compliance capabilities, or the recovery-time objectives (RTO) for backups will tolerate a multi-hour wait for data to become available, the S3 Glacier tier is probably the best option.
Further, if a project's backup needs span the usage profiles of the previous list, data lifecycle policies can be created that will move data from a higher-cost tier to a lower-cost tier based on time thresholds. And, to prevent being billed for data that has no further utility, the lifecycle policies can include an expiration-age at which AWS will simply delete the backed-up data and stop charging for it.
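
As an illustration, such a policy can be attached to a bucket with the AWS CLI's s3api sub-commands. The bucket name, prefix, day-thresholds and expiration below are placeholder assumptions, not recommendations:

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "TieredBackupRetention",
      "Status": "Enabled",
      "Filter": { "Prefix": "Backups/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket my-backup-bucket \
    --lifecycle-configuration file://lifecycle.json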

There are a couple of ways to get backup data into S3:
  • Copy: The easiest — and likely best-known — is to simply copy the data from a host into an S3 bucket. Every file on disk that's copied to S3 exists as an individually downloadable file in S3. Copy operations can be iterative or recursive. If the copy operation takes the form of a recursive copy, the basic location-relationship between files is preserved (though things like hard- or soft-links get converted into multiple copies of a given file). While this method is easy, it results in a loss of filesystem metadata — not just the previously-mentioned loss of link-style file-data but ownerships, permissions, MAC-tags, etc.
  • Sync: Similarly easy is the "sync" method. Like the basic copy method, every file on disk that's copied to S3 exists as an individually downloadable file in S3. The sync operation is inherently recursive. Further, if an identical copy of a file already exists within S3 at a given location, the sync operation will only overwrite the S3-hosted file if the to-be-copied file differs. This provides good support for incremental-style backups. As with the basic copy-to-S3 method, this method results in the loss of file-link and other filesystem metadata. (Minimal command examples for the copy and sync methods, and for the bucket-versioning mentioned in the note below, follow this list.)

    Note: if using this method, it is probably a good idea to turn on bucket-versioning to ensure that each version of an uploaded file is kept. This allows a recovery operation to recover a given point-in-time's version of the backed-up file.
  • Streaming copy: This method is the least well-known. However, this method can be leveraged to overcome the problem of lost filesystem metadata. If the stream-to-S3 operation includes an inlined data-encapsulation operation (e.g., piping the stream through the tar utility), filesystem metadata will be preserved.

    Note: the cost of preserving metadata via encapsulation is that the encapsulated object is opaque to S3. As such, there's no (direct) means by which to emulate an incremental backup operation.
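
To make the first two methods (and the versioning note) concrete, the following is a minimal sketch using the AWS CLI; the bucket name and paths are placeholder assumptions:

# Copy: recursively copy a directory tree into the bucket
aws s3 cp /var/backups s3://my-backup-bucket/Backups/ --recursive

# Sync: recursive by nature; only uploads files that differ from what is
# already in the bucket
aws s3 sync /var/backups s3://my-backup-bucket/Backups/

# Optional: turn on bucket-versioning so overwritten objects remain
# recoverable to a given point-in-time
aws s3api put-bucket-versioning --bucket my-backup-bucket \
    --versioning-configuration Status=Enabled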

Technical Implementation

As the title of this article suggests, the technical-implementation focus of this article is on streamed backups to S3.

Most users of S3 are aware of its static file-copy options: that is, copying a file from an EC2 instance directly to S3. Most such users, when they want to store files in S3 and need to retain filesystem metadata, either look to things like s3fs or do staged encapsulation.

The former allows you to treat S3 as though it were a local filesystem. However, for various reasons, many organizations are not comfortable using FUSE-based filesystem implementations - particularly open source ones (usually due to fears about support if something goes awry).

The latter means using an archiving tool to create a pre-packaged copy of the data, first staged to disk as a complete file, and then copying that file to S3. Common archiving tools include the Linux Tape ARchive utility (`tar`), `cpio` or even `mkisofs`/`genisoimage`. However, if the archiving tool supports reading from STDIN and/or writing to STDOUT, the tool can be used to create an archive directly within S3 using S3's streaming-copy capabilities.
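
The difference between staging and streaming comes down to whether the archive ever lands on local disk. The following is a minimal sketch using `tar` and the AWS CLI; the paths and bucket name are placeholder assumptions:

# Staged: write the archive to local disk first, then copy the file to S3
tar -czf /tmp/app-backup.tar.gz /opt/app
aws s3 cp /tmp/app-backup.tar.gz s3://my-backup-bucket/Backups/app-backup.tar.gz

# Streamed: tar writes to STDOUT and "aws s3 cp -" reads from STDIN, so no
# local staging space is required
tar -czf - /opt/app | aws s3 cp - s3://my-backup-bucket/Backups/app-backup.tar.gz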

Best practice for backups is to ensure that the target data-set is in a consistent state. Generally, this means that the data to be archived is non-changing. This can be done by quiescing a filesystem ...or by snapshotting a filesystem and backing up the snapshot. LVM snapshots will be used here to illustrate how to take a consistent backup of a live filesystem (like those used to host the operating system).

Note: this illustration assumes that the filesystem to be backed up is built on top of LVM. If the filesystem is built on a bare (EBS-provided) device, the filesystem will need to be stopped before it can be consistently streamed to S3.

The high-level procedure is as follows:
  1. Create a snapshot of the logical volume hosting the filesystem to be backed up (note that LVM issues an `fsfreeze` operation before creating the snapshot: this flushes all pending I/Os before making the snapshot, ensuring that the resultant snapshot is in a consistent state). Thin or static-sized snapshots may be selected (thin snapshots are especially useful when snapshotting multiple volumes within the same volume-group, as there is less need to worry about getting each snapshot volume's size-specification correct).
  2. Mount the snapshot
  3. Use the archiving-tool to stream the filesystem data to standard output
  4. Pipe the stream to S3's `cp` tool, specifying that it read from the stream and write to an object-name in S3
  5. Unmount the snapshot
  6. Delete the snapshot
  7. Validate the backup by using S3's `cp` tool, specifying that it write the object to a stream, then read that stream with the original archiving tool's ability to read from standard input. If the archiving tool has a "test" mode, use that; if it does not, it is likely possible to specify /dev/null as its output destination. (A command-level sketch of the whole procedure follows this list.)
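
The following is a minimal, manual sketch of the above procedure. The volume-group, volume, snapshot and bucket names are illustrative assumptions, not values taken from the tool mentioned below:

# 1. Snapshot the logical volume (LVM freezes/thaws the filesystem as part
#    of the snapshot operation)
lvcreate --snapshot --size 2G --name varSnap /dev/VolGroup00/varVol

# 2. Mount the snapshot read-only (XFS snapshots may also need "-o nouuid")
mkdir -p /mnt/snap/var
mount -o ro /dev/VolGroup00/varSnap /mnt/snap/var

# 3./4. Stream a tar of the snapshot's contents straight into S3
tar -C /mnt/snap/var -cf - . | \
   aws s3 cp - s3://my-backup-bucket/Backups/varVol.tar

# 5./6. Dismount and delete the snapshot
umount /mnt/snap/var
lvremove -f /dev/VolGroup00/varSnap

# 7. Validate: stream the object back down and have tar read the archive,
#    discarding the listing
aws s3 cp s3://my-backup-bucket/Backups/varVol.tar - | tar -tf - > /dev/null
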
For a basic, automated implementation of the above, see the linked-to tool. Note that this tool is "dumb": it assumes that all logical volumes hosting a filesystem should be backed up. The only argument it takes is the name of the S3 bucket to upload to. The script does only very basic "pre-flight" checking:
  • Ensure that the AWS CLI is found within the script's inherited PATH environment variable.
  • Ensure that either an AWS IAM instance-role is attached to the instance or that an IAM user-role is defined in the script's execution environment (${HOME}/.aws/credentials files are not currently supported). No attempt is made to ensure that the instance- or IAM user-role has sufficient permissions to write to the selected S3 bucket.
  • Ensure that a bucket-name has been passed (the name is not checked for validity).
Once the pre-flights pass, the script will attempt to snapshot all volumes hosting a filesystem; mount the snapshots under the /mnt hierarchy — recreating the original volumes' mount-locations, but rooted in /mnt; use the `tar` utility to encapsulate and stream the to-be-archived data; and use the S3 `cp` utility to write tar's streamed, encapsulated output to the named S3 bucket's "/Backups/" folder. Once the S3 `cp` utility closes the stream without errors, the script will dismount and delete the snapshots.

Alternatives

As mentioned previously, it's possible to perform similar actions for filesystems that do not reside on LVM2 logical volumes. However, doing so will either require different methods for creating a consistent state for the backup-set or mean backing up potentially inconsistent data (and possibly even wholly missing "in flight" data).

EBS has the native ability to create copy-on-write snapshots. However, the EBS volume's snapshot capability is generally decoupled from the OS's ability to "pause" a filesystem. One can use a tool — like those in the LxEBSbackups project — to coordinate the pausing of the filesystem so that the EBS snapshot captures a consistent copy of the data (and then unpause the filesystem as soon as the EBS snapshot has been started).
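
The coordination itself is simple enough to sketch by hand; the mount-point and volume-id below are placeholder assumptions (the LxEBSbackups tooling wraps the same idea with more error-handling):

# Pause writes to the filesystem, start the EBS snapshot, then unpause.
# The snapshot only needs to be *started* while the filesystem is frozen.
fsfreeze --freeze /var/data
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "Consistent snapshot of /var/data"
fsfreeze --unfreeze /var/data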

One can leave the data "as is" in the EBS snapshot, or one can mount the snapshot to an EC2 instance and execute a streamed archive operation to S3. The former has the value of being low-effort. The latter has the benefits of storing the data in lower-priced tiers (even S3 Standard is cheaper than EBS snapshot storage) and of allowing the backed-up data to be placed under S3 lifecycle policies.
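
If the second route is chosen, the snapshot can be materialized as a volume, attached, mounted and streamed up the same way as before. The IDs, device name and availability-zone below are placeholder assumptions:

# Create a volume from the snapshot (in the instance's AZ) and attach it
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf
# (the volume-id above comes from the create-volume output)

# Mount it read-only and stream its contents into S3
mkdir -p /mnt/restore
mount -o ro /dev/xvdf /mnt/restore
tar -C /mnt/restore -cf - . | aws s3 cp - s3://my-backup-bucket/Backups/restore.tar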

Wednesday, May 2, 2018

"How Big Is It," You Ask?

For one of the projects I'm supporting, they want to deploy into an isolated network. However, they want to be able to flexibly support systems in that network being able to fetch feature and patch RPMs (and other stuff) as needed. Being on an isolated network, they don't want to overwhelm the security gateways by having a bunch of systems fetching directly from the internet. Result: a desire to host a controlled mirror of the relevant upstream software repositories.

This begged the question, "what all do we need to mirror?" I stress the word "need" because this customer is used to thinking in terms of manual rather than automated efforts (working on that with them, too). As a result of this manual mind-set, they want to keep the data-sets associated with their tasks small.

Unfortunately, their idea of small is also a bit out of date. To them, 100GiB — at least for software repositories — falls into the "not small" category. To me, if I can fit something on my phone, it's a small amount of data. We'll ignore the fact that I'm biased by being able to jam a 512GiB microSD card into my phone. At any rate, they wanted to minimize the number of repositories and channels they were going to need to mirror so that they could keep the copy-jobs small. I pointed out, "you want these systems to have a similar degree of fetching-functionality to what they'd have if they weren't being blocked from downloading directly from the Internet: that means you need <insert exhaustive list of repositories and channels>". Naturally, they balked. The questions "how do you know they'll actually need all of that" and "how much space is all of that going to take" were asked (with a certain degree of overwroughtness).

I pointed out that I couldn't meaningfully answer the former question because I wasn't part of the design-groups for the systems that were to be deployed in the isolated network. I also pointed out that they'd probably be better off asking the owners of the prospective systems what they anticipate needing (knowing full well that, usually, such questions' answers are somewhere in the "I'm not sure, yet" neighborhood). As such, my exhaustive list was a hedge: better to have and not need than to need and not have. Given the stupid-cheapness of storage, and that it can actually be easier to sync all of the channels in a given repository versus syncing a subset, I didn't see a benefit in not throwing storage at the problem.

To the second question, I pointed out, "I'm sitting at this meeting, not in front of my computer. I can get you a full, channel-by-channel breakdown once I get back to my desk."

One of the nice things about yum repositories (where the feature and patch RPMs would come from) is that they're easily queryable for both item-counts and aggregate sizes. The only downside is that the OS-included tools for doing so are geared toward human-centric, ad hoc queries rather than toward something that can be jammed into an Excel spreadsheet and have =SUM formulae run against it. In other words, sizes are put into "friendly" units: if you have 100KiB of data, the number is reported in KiB; if you have 1055KiB of data, the number is reported in MiB; and so on. So, I needed to "wrap" the native tools' output to put everything into consistent units (which Excel prefers for =SUMing and other mathematical actions). Because it was a "quick" task, I did it in BASH. In retrospect, using another language likely would have been far less ugly. However, what I came up with worked for creating a CSV:

#!/bin/bash
#
# Emit one "repo-id;repo-name;package-count;size-in-KiB" record per yum
# repository (test, debuginfo, source and media repos are skipped).

for YUM in $(
   yum repolist all | awk '/abled/{ print $1 }' | \
      sed -e '{
         /-test/d
         /-debuginfo/d
         /-source/d
         /-media/d
      }' | sed 's/\/.*$//'
   )
do
   # Split the verbose repolist output on newlines so that each
   # "Repo-xxx : value" line lands in its own array element
   IFS=$'\n'
   REPOSTRUCT=($(
      yum --disablerepo='*' --enablerepo="${YUM}" repolist -v | \
      grep ^Repo- | grep -E "(id|-name|pkgs|size) *:" | \
      sed 's/ *: /:/'
   ))
   unset IFS

   # Element 3 is "Repo-size:<number> <unit>"; split it into number and unit
   REPSZ=($(echo "${REPOSTRUCT[3]}" | sed 's/^Repo-size://'))

   # Normalize the "friendly" units to KiB so Excel can =SUM them
   if [[ "${REPSZ[1]}" = M ]]
   then
      SIZE=$(echo "${REPSZ[0]} * 1024" | bc)
   elif [[ "${REPSZ[1]}" = G ]]
   then
      SIZE=$(echo "${REPSZ[0]} * 1024 * 1024" | bc)
   else
      SIZE="${REPSZ[0]}"
   fi

   # Print "id;name;pkgs;size" as a semicolon-separated record
   for CT in 0 1 2
   do
      printf "%s;" "${REPOSTRUCT[${CT}]}"
   done
   echo "${SIZE}"
done | sed 's/Repo-[a-z]*://'

Yeah... hideous and likely far from optimal ...but not worth my time (even if I had it) to revisit. It gets the job done and there's a certain joy to writing hideous code to solve a problem you didn't want to be asked to solve in the first place.

At any rate, checking against all of the repos that we'd want to mirror for the project, the initial-sync data-set would fit on my phone (without having to upgrade to one of the top-end beasties). I pointed out that only the initial sync would be "large" and that only a couple of the channels update with anything resembling regularity (the rest being essentially static): the monthly delta-syncs would be vastly smaller and trivial to babysit. So, we'll see whether that assuages their anxieties or not.