
Wednesday, August 28, 2024

Getting the Most Out of EC2 Transfers With Private S3 Endpoints

Recently, I was given a project to help a customer migrate an on-premises GitLab installation into AWS. The current GitLab was pretty large: a full export of the configuration was nearly 500GiB in size.

It turned out a good chunk of that 500GiB was due to disk-hosted artifacts and LFS objects. Since I was putting it all into AWS, I opted to make use of GitLab's ability to store BLOBs in S3. Ultimately, that turned out to be nearly 8,000 LFS objects and nearly 150,000 artifacts (plus several hundred "uploads").

The first challenge was getting the on-premises data into my EC2. Customer didn't want to give me access to their on-premises network, so I needed to have them generate the export TAR-file and upload it to S3. Once in S3, I needed to get it into an EC2.

Wanting to make sure that the S3→EC2 task was as quick as possible, I selected an instance-type rated to 12.5Gbps of network bandwidth and 10Gbps of EBS bandwidth. However, my first attempt at downloading the TAR-file from S3 took nearly an hour to run: it was barely creeping along at 120MiB/s. Abysmal.

I broke out `iostat` and found that my target EBS volume was reporting 100% utilization and a bit less than 125MiB/s of average throughput. That seemed "off" to me, so I looked at the EBS volume's settings. It was then that I noticed that the default volume-throughput was only 125MiB/s. So, I upped the setting to its maximum: 1000MiB/s. I re-ran the transfer only to find that, while the transfer-speed had improved, it had only improved to a shade under 150MiB/s. Still abysmal.

So, I started rifling through the AWS documentation to see what CLI settings I could change to improve things. First mods were:

max_concurrent_requests = 40
multipart_chunksize = 10MB
multipart_threshold = 10MB

This didn't really make much difference. `iostat` was showing really variable utilization-numbers, but mostly that my target-disk was all but idle. Similarly, `netstat` was showing only a handful of simultaneous-streams between my EC2 and S3.

Contacted AWS support. They let me know that S3 multi-part upload and download is limited to 10,000 chunks. So, I did the math (<FILE_SIZE> / <MAX_CHUNKS>) and changed the above to:

max_concurrent_requests = 40
multipart_chunksize = 55MB
multipart_threshold = 64MB
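(For the nearly-500GiB export in question, that math works out to roughly 500GiB ÷ 10,000 chunks ≈ 51.2MiB per chunk; a 55MB chunk-size keeps the chunk-count comfortably under the limit.)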

This time, the transfers were running about 220-250MiB/s. While that was a 46% throughput increase, it was still abysmal. While `netstat` was finally showing the expected number of simultaneous connections, my `iostat` was still saying that my EBS was mostly idle.

Reached back out to AWS support. They had the further suggestion of adding:

preferred_transfer_client = crt
target_bandwidth = 10GB/s

To my S3 configuration. Re-ran my test and was getting ≈990MiB/s of continuous throughput for the transfer! This knocked the transfer time down from nearly fifty-five minutes to a shade over eight minutes. In other words, I was going to be able to knock the better part of an hour off the upcoming migration-task.
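For reference, all of these settings live under the s3 subsection of the AWS CLI's config file. Assuming the default profile, the resulting ~/.aws/config looks roughly like this (values as described above):

[default]
s3 =
  preferred_transfer_client = crt
  target_bandwidth = 10GB/s
  max_concurrent_requests = 40
  multipart_chunksize = 55MB
  multipart_threshold = 64MB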

In digging back through the documentation, it seems that, when one doesn't specify a preferred_transfer_client value, the CLI selects the `classic` (`python`) client. And, depending on your Python version, the performance ranges from merely-horrible to ungodly-bad: using RHEL 9 for my EC2, it was pretty freaking bad, but it had been less bad when using Amazon Linux for my EC2's OS. Presumably a difference in the two distros' respective Python versions?

Specifying a preferred_transfer_client value of `crt` (C run-time client) unleashed the full might and fury of my EC2's and GP3's capabilities.

Interestingly, this "use 'classic'" behavior isn't a universal auto-selection. If you've selected an EC2 with any of the following instance-types:

  • p4d.24xlarge
  • p4de.24xlarge
  • p5.48xlarge
  • trn1n.32xlarge
  • trn1.32xlarge

The auto-selection gets you `crt`. Not sure why `crt` isn't the auto-selected value for all Nitro-based instance-types. But, "it is what it is".

Side note: just selecting `crt` probably wouldn't have completely roided-out the transfer. I assume the further setting of `target_bandwidth` to `10GB/s` is what fully unleashed things: there definitely wasn't much bandwidth left over for me to actually monitor the transfer. I assume that the `target_bandwidth` parameter has a default value that's less than "all the bandwidth"; however, I didn't actually bother to verify that.

Update: 

After asking support "why isn't `crt` the default for more instance-types", I got back the reply:

Thank you for your response. I see that these particular P5, P4d and Trn1 instances are purpose-built for high-performance ML training. Hence, I assume the throughput needed for these ML applications needs to be high, and CRT is auto-enabled for these instance types.

Currently, the CRT transfer client does not support all of the functionality available in the classic transfer client.

These are a few limitations for CRT configurations:

  • Region redirects - Transfers fail for requests sent to a region that does not match the region of the targeted S3 bucket.
  • max_concurrent_requests, max_queue_size, multipart_threshold, and max_bandwidth configuration values - Ignores these configuration values.
  • S3 to S3 copies - Falls back to using the classic transfer client


All of which is to say that, once I set `preferred_transfer_client = crt`, all of my other, prior settings got ignored.

Thursday, June 4, 2020

TIL: You Gotta Be Explicit

Started working on a new contract, recently. This particular customer makes use of S3FS. To be honest, in the past half-decade, I've had a number of customers express interest in S3FS, but they've pretty much universally turned their noses up at it (due to any number of reasons that I can't disagree with — trying to use S3 like a shared filesystem is kind of horrible).

At any rate, this customer also makes use of Ansible for their provisioning automation. One of their "plays" is designed to mount the S3 buckets via s3fs. However, the manner in which they implemented it seemed kind of jacked to me: basically, they set up a lineinfile-based play to add s3fs commands to the /etc/rc.d/rc.local file, and then do a reboot to get the filesystems to mount up.

It wasn't a great method to begin with but, recently, their security people made a change to the IAM objects they use to enable access to the S3 buckets. It, uh, broke things. Worse, because of how they implemented the s3fs-related play, there was no error trapping in their work-flow. Jobs that relied on /etc/rc.d/rc.local having worked started failing with no real indication as to why (when you pull a file directly from S3 rather than through an s3fs mount, it's pretty immediately obvious what's going wrong).

At any rate, I decided to try to see if there might be a better way to manage the s3fs mounts. So, I went to the documentation. I wanted to see if there was a way to make them more "managed" by the OS such that, if there was a failure in mounting, the OS would put a screaming-halt to the automation. Overall, if I think a long-running task is likely to fail, I'd rather it fail early in the process than after I've been waiting for several minutes (or longer). So I set about simulating how they were mounting S3 buckets with s3fs.

As far as I can tell, the normal use-case for mounting S3 buckets via s3fs is to do something like:

s3fs <bucket> <mount> -o <OPTIONS>

However, they have their buckets cut up into "folders" and sub-folders and wanted to mount them individually. The s3fs documentation indicated that you could both mount individual folders and that you could do it via /etc/fstab. You simply needed an /etc/fstab that looks sorta like:
s3fs-build-bukkit:/RPMs    /provisioning/repo       fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0
s3fs-build-bukkit:/EXEs    /provisioning/installer  fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0
s3fs-users-bukkit:/build   /Data/personal           fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0

However, I was finding that, even though the mount-requests weren't erroring, they also weren't mounting. So, hit up the almighty Googs and found an issue-report in the S3FS project that matched my symptoms. The issue ultimately linked to a (poorly-worded) FAQ-entry. In short, I was used to implicit "folders" (ones that exist by way of an S3 object containing a slash-delimited key), but s3fs relies on explicitly-created "folders" (e.g., null objects with key-names that end in `/` — as would be created by doing `aws s3api put-object --bucket s3fs-build-bukkit --key test-folder/`). Once I explicitly created these trailing-slash null-objects, my /etc/fstab entries started working the way the documentation indicated they ought to have been doing all along.
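In my case, that meant creating one zero-byte, slash-terminated object per "folder" referenced in the fstab entries above — something along the lines of:

aws s3api put-object --bucket s3fs-build-bukkit --key RPMs/
aws s3api put-object --bucket s3fs-build-bukkit --key EXEs/
aws s3api put-object --bucket s3fs-users-bukkit --key build/

Once those objects exist, s3fs can "see" the folders and the fstab-driven mounts behave.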


Tuesday, May 7, 2019

Crib-Notes: Offline Delta-Syncs of S3 Buckets

In the normal world, synchronizing two buckets is as simple as doing `aws s3 sync <SYNC_OPTIONS> <SOURCE_BUCKET> <DESTINATION_BUCKET>`. However, due to the information security needs of some of my customers, it's occasionally necessary to perform data-synchronizations between two S3 buckets, but using methods that amount to "offline" transfers.

To illustrate what is meant by "offline":
  1. Create a transfer-archive from a data source
  2. Copy the transfer-archive across a security boundary
  3. Unpack the transfer-archive to its final destination

Note that things are a bit more involved than the summary of the process – but this gives you the gist of the major effort-points.

The first time you do an offline bucket sync, transferring the entirety of a bucket is typically the goal. However, for a refresh-sync – particularly of a bucket with more than a trivial amount of content – this can be sub-ideal. For example, it might be necessary to do monthly syncs of a bucket that grows by a few gigabytes per month. After a year, a full sync can mean having to move tens to hundreds of gigabytes. A better way is to sync only the deltas – copying only what's changed between the current and immediately-prior sync-tasks (a few GiB rather than tens to hundreds).

The AWS CLI tools don't really have a "sync only the files that have been added/modified since <DATE>" option. That said, it's not super difficult to work around that gap. A simple shell script like the following does the trick:

# List every object in the source bucket, keep only those modified on or
# after the cut-off timestamp (columns 1 and 2 of `aws s3 ls` output), and
# emit just the object-path (column 4)
for FILE in $( aws s3 ls --recursive s3://<SOURCE_BUCKET>/ | \
   awk '$1" "$2 >= "2019-03-01 00:00:00" {print $4}' )
do
   echo "Downloading ${FILE}"
   # Stream the object to STDOUT and let `install` drop it into the staging
   # area, creating any missing intermediate directories as it goes
   install -bDm 000644 <( aws s3 cp "s3://<SOURCE_BUCKET>/${FILE}" - ) \
     "<STAGING_DIR>/${FILE}"
done

To explain the above:

  1. Create a list of files to iterate:
    1. Invoke a subprocess using the $() notation. Within that subprocess...
    2. Invoke the AWS CLI's S3 module to recursively list the source-bucket's contents (`aws s3 ls --recursive`)
    3. Pipe the output to `awk` – comparing the cutoff timestamp against s3 ls's first two output-columns (the file-modification date and time) and printing out only the fourth column (the S3 object-path)
    The output from the subprocess is captured as an iterable list-structure
  2. Use a for loop-method to iterate the previously-assembled list, assigning each S3 object-path to the ${FILE} variable
  3. Since I hate sending programs off to do things in silence (I don't trust them to not hang), my first looped-command is to say what's happening via the echo "Downloading ${FILE}" directive.
  4. The install line makes use of some niftiness within both BASH and the AWS CLI's S3 command:
    1. By specifying "-" as the "destination" for the file-copy operation, you tell the S3 command to write the fetched object-contents to STDOUT.
    2. BASH allows you to take a stream of output and assign a file-handle to it by surrounding the output-producing command with <( ).
    3. Invoking the install command with the -D flag tells the command to "create all necessary path-elements to place the source 'file' in the desired location within the filesystem, even if none of the intervening directory structure exists, yet."
    Putting it all together, the install operation takes the streamed s3 cp output, and installs it as a file (with mode 000644) at the location derived from the STAGING_DIR plus the S3 object-path ...thus preserving the SOURCE_BUCKET's content-structure within the STAGING_DIR
Obviously, this method really only works for additive/substitutive deltas. If you need to account for deletions and/or moves, this approach will be insufficient.

Thursday, January 3, 2019

Isolated Network, You Say

The vast majority of my clients, for the past decade and a half, have been very security conscious. The frequency with which other companies end up in the news for data-leaks — either due to hackers or simply leaving an S3 bucket inadequately protected — has made many of them extremely cautious as they move to the cloud.

One of my customers has been particularly wary. As a result, their move to the cloud has included significant use of very locked-down and, in some cases, isolated VPCs. It has made implementing things both challenging and frustrating.

Most recently, I had to implement a self-hosted GitLab solution within a locked-down VPC. And, when I say "locked-down VPC", I mean that even the standard AWS service-endpoints have been (effectively) replaced with custom, heavily-controlled endpoints. It's, uh, fun.

As I was deploying a new GitLab instance, I noticed that its backup jobs were failing. Yeah, I'd done what I thought was sufficient configuration via the gitlab.rb file's gitlab_rails['backup_upload_connection'] configuration-block. I'd even dug into the documentation to find the juju necessary for specifying the requisite custom-endpoint. While I'd ended up following a false lead to the documentation for fog (the Ruby module GitLab uses to interact with cloud-based storage options), I ultimately found the requisite setting is in the Digital Ocean section of the backup and restore document (simply enough, it requires setting an appropriate value for the "endpoint" parameter).
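For context, the relevant block in gitlab.rb ends up looking something like the following (the region is a stand-in and the bucket/endpoint values are placeholders, not the customer's actual ones):

gitlab_rails['backup_upload_connection'] = {
  'provider'        => 'AWS',
  'region'          => 'us-east-1',
  'use_iam_profile' => true,
  # The custom, VPC-internal S3 endpoint (placeholder value)
  'endpoint'        => 'https://<CUSTOM_S3_ENDPOINT_FQDN>'
}
gitlab_rails['backup_upload_remote_directory'] = '<BACKUP_BUCKET>'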

However, that turned out to not be enough. When I looked through GitLab's error logs, I saw that it was getting SSL errors from the Excon Ruby module. Yes, everything in the VPC was using certificates from a private certificate authority (CA), but I'd installed the root CA into the OS's trust-chain. All the OS-level tools were fine with using certificates from the private CA. All of the AWS CLIs and SDKs were similarly fine (since I'd included logic to ensure they were all pointing at the OS trust-store) - doing `aws s3 ls` (etc.) worked as one would expect. So, I ended up digging around some more. Found the in-depth configuration-guidance for SSL and the note at the beginning of its "Details on how GitLab and SSL work" section:

GitLab-Omnibus includes its own library of OpenSSL and links all compiled programs (e.g. Ruby, PostgreSQL, etc.) against this library. This library is compiled to look for certificates in /opt/gitlab/embedded/ssl/certs.

This told me I was on the right path. Indeed, reading down just a page-scroll further, I found:

Note that the OpenSSL library supports the definition of SSL_CERT_FILE and SSL_CERT_DIR environment variables. The former defines the default certificate bundle to load, while the latter defines a directory in which to search for more certificates. These variables should not be necessary if you have added certificates to the trusted-certs directory. However, if for some reason you need to set them, they can be defined as environment variables.

So, I added a:

gitlab_rails['env'] = {
        "SSL_CERT_FILE" => "/etc/pki/tls/certs/ca-bundle.crt"
}

To my gitlab.rb and did a quick `gitlab-ctl reconfigure` to make the new settings active in the running service. Afterwards, my GitLab backups to S3 worked without further issue.

Notes:

  • We currently use the Omnibus installation of GitLab. Methods for altering source-built installations will be different. See the GitLab documentation.
  • The above path for the "SSL_CERT_FILE" parameter is appropriate for RedHat/CentOS 7. If using a different distro, consult your distro's manuals for the appropriate location.

Tuesday, October 2, 2018

S3 And Impacts of Using an IAM Role Instead of an IAM User

I work for a group that manages a number of AWS accounts. To keep things somewhat simpler from a user-management perspective, we use IAM roles to access the various accounts rather than IAM users. This means that we can manage users via Active Directory such that, as new team members are added, existing team members' responsibilities change or team members leave, all we have to do is update their AD user-objects and the breadth and depth of their access-privileges are changed.

However, the use of IAM Roles isn't without its limitations. Role-based users' accesses are granted by way of ephemeral tokens. Tokens last, at most, 3600 seconds. If you're a GUI user, it's kind of annoying because it means that, across an 8+ hour work-session, you'll be prompted to refresh your tokens at least seven times (on top of the initial login). If you're working mostly via the CLI, you won't necessarily notice, as each command you run silently refreshes your token.

I use the caveat "won't necessarily notice" because there are places where using an IAM Role can bite you in the ass. Worse, it will bite you in the ass in a way that may leave you scratching your head wondering "WTF isn't this working as I expected".

The most recent adventure in "WTF isn't this working as I expected" was in the context of trying to use S3 pre-signed URLs. If you haven't used them before, pre-signed URLs allow you to provide temporary access to S3-hosted objects that you otherwise don't want to make anonymously-available. In the grand scheme of things, one can do one of the following to provide access to S3-hosted data for automated tools:
  • Public-read object hosted in a public S3 bucket: this is a great way to end up in news stories about accidental data leaks
  • Public-read object hosted in a private S3 bucket: you're still providing unfettered access to a specific S3-hosted object/object-set, but some rando can't simply explore the hosting S3 bucket for interesting data. You can still end up in the newspapers, but, the scope of the damage is likely to be much smaller
  • Private-read object hosted in a private S3 bucket: the least-likely to get you in the newspapers, but requires some additional "magic" to allow your automation access to S3-hosted files:
    • IAM User Credentials stored on the EC2 instance requiring access to the S3-hosted files: a good method, right until someone compromises that EC2 instance and steals the credentials. Then, the attacker has access to everything those credentials had access to (until such time that you discover the breach, and have deactivated or changed the credentials)
    • IAM Instance-role: a good way of providing broad-scope access to S3 buckets to an instance or group of instances sharing a role. Note that, absent some additional configuration trickery, every process running on an enabled instance has access to everything that the Instance-role provides access to. Thus, probably not a good choice for systems that allow interactive logins or that run more than one, attackable service.
    • Pre-signed URLs: a way to provide fine-grained, temporary access to S3-hosted objects. Primary down-fall is there's significant overhead in setting up access to a large collection of files or providing continuing access to said files. Thus, likely suitable for providing basic CloudFormation EC2 access to configuration files, but not if the EC2s are going to need ongoing access to said files (as they would in, say, an AutoScaling Group type of use-case)
There's likely more access methods one can play with - each with their own security, granularity and overhead tradeoffs. The above are simply the ones I've used.

The automation I write tends to include a degree of user-customizable content. Which is to say, I write my automation to take care of 90% or so of a given service's configuration, then hand the automation off to others to use as they see fit. To help prevent the need for significant code-divergence in these specific use-cases, my automation generally allows a user to specify the ingestion of configuration-specification files and/or secondary configuration scripts. These files or scripts often contain data that you wouldn't want to put in a public GitHub repository. Thus, I generally recommend to the automation's users "put that data someplace safe - here's methods for doing it relatively safely via S3".

Circling back so that the title of this post makes sense... Typically, I recommend the use of pre-signed URLs for automation-users that want to provide secure access to these once-in-an-instance-lifecycle files. A pre-signed URL's access-grant can last for as little as a couple of seconds or as much as seven days. If you don't specify a desired lifetime, the access is granted for one hour (3,600 seconds).
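For reference, generating one from the CLI is a one-liner; the bucket/key and the two-hour lifetime below are just example values:

aws s3 presign s3://<BUCKET>/<OBJECT_KEY> --expires-in 7200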

However, that degree of flexibility depends greatly on what type of IAM object is creating the ephemeral access-grant. A standard IAM user can grant access with all of the previously-mentioned time-flexibility. An IAM role, however, is constrained by how long its currently-active token is good for. Thus, if executing from the CLI using an IAM role, the grantable lifetime for a pre-signed URL is 0-3600 seconds.

If you're not sure whether the presigned URL you've created/were given came from an IAM user or an IAM Role, look for the presence of an X-Amz-Security-Token element in the URL. If you see it, the URL was generated using temporary (role) credentials and will only last up to 3600 seconds.

Wednesday, September 19, 2018

Exposed By Time and Use

Last year, a project with both high complexity and high urgency was dropped in my lap. Worse, the project was, to be charitable, not well specified. It was basically, "we need you to automate the deployment of these six services and we need to have something demo-able within thirty days".

Naturally, the six services were ones that I'd never worked with from the standpoint of installing or administering. Thus, there was a bit of a learning curve around how best to automate things, one that wasn't aided by the paucity of "here's the desired end-state" or other related details. All they really knew was:

  • Automate the deployment into AWS
  • Make sure it works on our customized/hardened build
  • Make sure that it backs itself up in case things blow up
  • GO!
I took my shot at the problem. I met the deadline. Obviously, the results were not exactly "optimal" — especially from the "turn it over to others to maintain" standpoint. Naturally, after I turned over the initial capability to the requester, they were radio silent for a couple weeks.

When they finally replied, it was to let me know that the deadline had been extended by several months. So, I opted to use that time to make the automation a bit friendlier for the uninitiated to use. That's mostly irrelevant here — just more "background".

At any rate, we're now nearly a year-and-a-half removed from that initial rush-job. And, while I've improved the ease of use for the automation (it's been turned over to others for daily care-and-feeding), much of the underlying logic hasn't been revisited.

Over that kind of span, time tends to expose a given solution's shortcomings. Recently, they were attempting to do a parallel upgrade of one of the services and found that the data-move portion of the solution was resulting in build-timeouts. Turns out the size of the dataset being backed up (and recovered from as part of the automated migration process) had exploded. I'd set up the backups to operate incrementally, so, the increase in raw transfer times had been hidden.

The incremental backups were only taking a few minutes; however, restores of the dataset were taking upwards of 35 minutes. The build-automation was set to time out at 15 minutes (early in the service-deployment, a similar operation took 3-7 minutes). So, step one was to adjust the automation's timeouts to make allowances for the new restore-time realities. Step two was to investigate why the restores were so slow.

The quick-n-dirty backup method I'd opted for was a simple `aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>`. It was a dead-simple way to "sweep" the contents of the directory /<APPLICATION_HOME_DIR>/ to S3. And, because the `sync` method is inherently incremental, the cron-managed backups were taking the same couple minutes each day that they were a year-plus ago.

Fun fact about S3 and its transfer performance: if the objects you're uploading have keys with high degrees of commonality, transfer performance will become abysmal.

You may be asking why I mention "keys" since I've not mentioned encryption. S3, being an object-based filesystem, doesn't have the hierarchical layout of legacy, host-oriented storage. If I take a file from a host-oriented storage and use the S3 CLI utility to copy that file via its fully-qualified pathname to S3, the object created in S3 will look like:
<FULLY>/<QUALIFIED>/<PATH>/<FILE>
Of the above, "<FILE>" is the final element of the object's name, while "<FULLY>/<QUALIFIED>/<PATH>" becomes the key-prefix for that object. If you have a few thousand objects with the same or sufficiently-similar "<FULLY>/<QUALIFIED>/<PATH>" values, you'll run into that "transfer performance will become abysmal" issue mentioned earlier.

We very definitely did run into that problem. HARD. The dataset in question is (currently) a skosh less than 11GiB in size. The instance being backed up has an expected sustained network throughput of about 0.45Gbps. So, we were expecting that dataset to take only a couple minutes to transfer. However, as noted above, it was taking 35+ minutes to do so.
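(As a back-of-the-envelope check: 11GiB is roughly 94Gb of data; 94Gb ÷ 0.45Gbps works out to about 210 seconds, or three-and-a-half minutes, of raw transfer time.)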

So, how to fix? One of the easier methods is to stream your backups to a single file. A quick series of benchmarking-runs showed that doing so cut that transfer from over 35 minutes to under five minutes. Similarly, were one to iterate over all the files in the dataset and individually copy them into S3 using either randomized filenames (setting the "real" fully-qualified path as an attribute/tag of each file) or simply a reversed path-name (doing something like `S3NAME=$( echo "${REAL_PATHNAME}" | perl -lne 'print join "/", reverse split /\//;' )`), performance also goes up dramatically.

I'll likely end up doing one of those three methods ...once I have enough contiguous time to allocate to re-engineering the backup and restore/rebuild methods.
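If I do end up going the reversed-path route, the loop would look something like the following sketch (the bucket/folder and application-directory names are placeholders, per the rest of this post):

# Sketch: copy each file to S3 under a reversed path-name so that the
# key-prefixes are well-distributed (placeholder bucket/paths)
BACKUP_DST="s3://<BUCKET>/<FOLDER>"

find /<APPLICATION_HOME_DIR>/ -type f | while read -r REAL_PATHNAME
do
   # Strip the leading "/" and reverse the path elements:
   #   /app/data/file.db  ->  file.db/data/app
   S3NAME=$( echo "${REAL_PATHNAME#/}" | perl -lne 'print join "/", reverse split /\//;' )

   # Store the real path as object metadata so a restore can put it back
   aws s3 cp "${REAL_PATHNAME}" "${BACKUP_DST}/${S3NAME}" \
      --metadata "real-path=${REAL_PATHNAME}"
done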

Monday, May 7, 2018

Streamed Backups to S3

Introduction/Background


Many would-be users of AWS come to AWS from a legacy hosting background. Often times, when moving to AWS, the question, "how do I back my stuff up when I no longer have access to my enterprise backup tools," is asked. If not, it's a question that would-be AWS users should be asking.

AWS provides a number of storage options. Each option has use-cases that it is optimized for. Each also has a combination of performance, feature and pricing tradeoffs (see my document for a quick summary of these tradeoffs). The lowest-cost - and therefore most attractive for data-retention use-cases typical of backups-related activities - is S3. Further, within S3, there are pricing/capability tiers that are appropriate to different types of backup needs (the following list is organized by price, highest to lowest):
  • If there is a need to perform frequent full or partial recoveries, the S3 Standard tier is probably the best option
  • If recovery-frequency is pretty much "never" — but needs to be quick if there actually is a need to perform recoveries — and the policies governing backups mandate up to a thirty-day recoverability window, the best option is likely the S3 Infrequent Access (IA) tier.
  • If there's generally no need for recovery beyond legal-compliance capabilities, or the recovery-time objectives (RTO) for backups will tolerate a multi-hour wait for data to become available, the S3 Glacier tier is probably the best option.
Further, if a project's backup needs span the usage profiles of the previous list, data lifecycle policies can be created that will move data from a higher-cost tier to a lower-cost tier based on time thresholds. And, to prevent being billed for data that has no further utility, the lifecycle policies can include an expiration-age at which AWS will simply delete and stop charging for the backed-up data.
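As a rough sketch of what such a policy can look like (the prefix and the day-thresholds below are arbitrary examples, not recommendations), a lifecycle configuration can be written as JSON and applied with the CLI:

{
  "Rules": [
    {
      "ID": "TieredBackupAgeOff",
      "Filter": { "Prefix": "Backups/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}

aws s3api put-bucket-lifecycle-configuration --bucket <BUCKET> \
    --lifecycle-configuration file://lifecycle.json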

There are a couple of ways to get backup data into S3:
  • Copy: The easiest — and likely most well known — is to simply copy the data from a host into an S3 bucket. Every file on disk that's copied to S3 exists as an individually downloadable file in S3. Copy operations can be iterative or recursive. If the copy operation takes the form of a recursive-copy, basic location relationship between files is preserved (though, things like hard- or soft-links get converted into multiple copies of a given file). While this method is easy, it includes a loss of filesystem metadata — not just the previously-mentioned loss of link-style file-data but ownerships, permissions, MAC-tags, etc.
  • Sync: Similarly easy is the "sync" method. Like the basic copy method, every file on disk that's copied to S3 exists as an individually downloadable file in S3. The sync operation is inherently recursive. Further, if a copy of a file already exists within S3 at a given location, the sync operation will only overwrite the S3-hosted file if the to-be-copied file is different. This provides good support for incremental-style backups. As with the basic copy-to-S3 method, this method results in the loss of file-link and other filesystem metadata.

    Note: if using this method, it is probably a good idea to turn on bucket-versioning to ensure that each version of an uploaded file is kept. This allows a recovery operation to recover a given point-in-time's version of the backed-up file.
  • Streaming copy: This method is the least well-known. However, it can be leveraged to overcome the problem of lost filesystem metadata. If the stream-to-S3 operation includes an inlined data-encapsulation operation (e.g., piping the stream through the tar utility), filesystem metadata will be preserved.

    Note: the cost of preserving metadata via encapsulation is that the encapsulated object is opaque to S3. As such, there's no (direct) means by which to emulate an incremental backup operation.

Technical Implementation

As the title of this article suggests, the technical-implementation focus of this article is on streamed backups to S3.

Most users of S3 are aware of its static file-copy options; that is, copying a file from an EC2 instance directly to S3. Most such users, when they want to store files in S3 and need to retain filesystem metadata, either look to things like s3fs or do staged encapsulation.

The former allows you to treat S3 as though it were a local filesystem. However, for various reasons, many organizations are not comfortable using FUSE-based filesystem-implementations - particularly opensource ones (usually due to fears about support if something goes awry).

The latter means using an archiving tool to create a pre-packaged copy of the data first staged to disk as a complete file and then copying that file to S3. Common archiving tools include the Linux Tape ARchive utility (`tar`), cpio or even `mkisofs`/`genisoimage`. However, if the archiving tool supports reading from STDIN and/or writing to STDOUT, the tool can be used to create an archive directly within S3 using S3's streaming-copy capabilities.

Best practice for backups is to ensure that the target data-set is in a consistent state. Generally, this means that the data to be archived is non-changing. This can be done by quiescing a filesystem ...or by snapshotting a filesystem and backing up the snapshot. LVM snapshots will be used to illustrate how to take a consistent backup of a live filesystem (like those used to host the operating system).

Note: this illustration assumes that the filesystem to be  backed up is built on top of LVM. If the filesystem is built on a bare (EBS-provided) device, the filesystem will need to be stopped before it can be consistently streamed to S3.

The high-level procedure is as follows:
  1. Create a snapshot of the logical volume hosting the filesystem to be backed up (note that LVM issues an `fsfreeze` operation before creating the snapshot: this flushes all pending I/Os before making the snapshot, ensuring that the resultant snapshot is in a consistent state). Thin or static-sized snapshots may be selected (thin snapshots are especially useful when snapshotting multiple volumes within the same volume-group as one has less need to worry about getting the snapshot volume's size-specification correct).
  2. Mount the snapshot
  3. Use the archiving-tool to stream the filesystem data to standard output
  4. Pipe the stream to S3's `cp` tool, specifying to read from a stream and to write to object-name in S3
  5. Unmount the snapshot
  6. Delete the snapshot
  7. Validate the backup by using S3's `cp` tool, specifying to write to a stream and then read the stream using the original archiving tool's capability to read from standard input. If the archiving tool has a "test" mode, use that; if it does not, it is likely possible to specify /dev/null as its output destination.
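To make those steps concrete, a rough, manual walk-through looks something like the following (the volume-group, volume and bucket names are hypothetical examples, not what the linked tool actually uses):

# 1) Snapshot the logical volume hosting the filesystem to back up
lvcreate --snapshot --size 2G --name rootVol_snap /dev/VolGroup00/rootVol

# 2) Mount the snapshot read-only
mkdir -p /mnt/rootVol_snap
mount -o ro /dev/VolGroup00/rootVol_snap /mnt/rootVol_snap

# 3 & 4) Stream a tar of the snapshot's contents straight into S3
tar -C /mnt/rootVol_snap -cpf - . | \
  aws s3 cp - s3://<BUCKET>/Backups/rootVol.tar

# 5 & 6) Unmount and delete the snapshot
umount /mnt/rootVol_snap
lvremove -f /dev/VolGroup00/rootVol_snap

# 7) Validate: stream the object back out and have tar read it to /dev/null
aws s3 cp s3://<BUCKET>/Backups/rootVol.tar - | tar -tf - > /dev/null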
For a basic, automated implementation of the above, see the linked-to tool. Note that this tool is "dumb": it assumes that all logical volumes hosting a filesystem should be backed up. The only argument it takes is the name of the S3 bucket to upload to. The script does only very basic "pre-flight" checking:
  • Ensure that the AWS CLI is found within the script's inherited PATH env.
  • Ensure that either an AWS IAM instance-role is attached to the instance or that an IAM user-role is defined in the script's execution environment (${HOME}/.aws/credential files not currently supported). No attempt is made to ensure the instance- or IAM user-role has sufficient permissions to write to the selected S3 bucket
  • Ensure that a bucket-name has been passed (the name is not checked for validity).
Once the pre-flights pass, the script will attempt to snapshot all volumes hosting a filesystem; mount the snapshots under the /mnt hierarchy — recreating the original volumes' mount-locations, but rooted in /mnt; use the `tar` utility to encapsulate and stream the to-be-archived data; and use the S3 `cp` utility to write tar's streamed, encapsulated output to the named S3 bucket's "/Backups/" folder. Once the S3 `cp` utility closes the stream without errors, the script will then dismount and delete the snapshots.

Alternatives

As mentioned previously, it's possible to do similar actions to the above for filesystems that do not reside on LVM2 logical volumes. However, doing so will either require different methods for creating a consistent state for the backup-set or mean backing up potentially inconsistent data (and possibly even wholly missing "in flight" data).

EBS has the native ability to create copy-on-write snapshots. However, the EBS volume's snapshot capability is generally decoupled from the OS's ability to "pause" a filesystem. One can use a tool — like those in the LxEBSbackups project — to coordinate the pausing of the filesystem so that the EBS snapshot can create a consistent copy of the data (and then unpause the filesystem as soon as the EBS snapshot has been started).

One can leave the data "as is" in the EBS snapshot or one can then mount the snapshot to the EC2 and execute a streamed archive operation to S3. The former has the value of being low effort. The latter has the benefit of storing the data to lower-priced tiers (even S3 standard is cheaper than snapshots of EBS volumes) and allowing the backed up data to be placed under S3 lifecycle policies.

Tuesday, December 5, 2017

I've Used How Much Space??

A customer of mine needed me to help them implement a full CI/CD tool-chain in AWS. As part of that implementation, they wanted to have daily backups. Small problem: the enterprise backup software that their organization normally uses isn't available in their AWS-hosted development account/environment. That environment is mostly "support it yourself".

Fortunately, AWS has a number of tools that can help with things like backup tasks. The customer didn't have strong specifications on how they wanted things backed up, retention-periods, etc. Just "we need daily backups". So I threw them together some basic "pump it into S3" type of jobs with the caveat "you'll want to keep an eye on this because, right now, there's no data lifecycle elements in place".

For the first several months things ran fine. Then, as they often do, problems began popping up. Their backup jobs started experiencing periodic errors. Wasn't able to find underlying causes. However, in my searching around, it occurred to me "wonder if these guys have been aging stuff off like I warned them they'd probably want to do."

AWS provides a nifty GUI option in the S3 console that will show you storage utilization. A quick look in their S3 backup buckets told me, "doesn't look like they have".

Not being much of a GUI-jockey, I wanted something I could run from the CLI that could be fed to an out-of-band notifier. The AWS CLI offers the `s3api` tool-set, which comes in handy for such actions. On my first dig through (and after some Googling), I sorted out "how do I get a total-utilization view for this bucket". It looks something like:

aws s3api list-objects --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --output json --query "[sum(Contents[].Size), length(Contents[])]" | \
    awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
[
1671.5 GB
    423759
]

The above agreed with the GUI and was more space than I'd assumed they'd be using at this point. So, I wanted to see "can I clean up".

aws s3api list-objects --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --prefix Backups/YYYYMMDD/ --output json \
    --query "[sum(Contents[].Size), length(Contents[])]" | \
    awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
[
198.397 GB
    50048
]
That "one day's worth of backups" was also more than expected. Last time I'd censused their backups (earlier in the summer), they had maybe 40GiB worth of data. They wanted a week's worth of backups. However, at 200GiB/day worth of backups, I could see that I really wasn't going to be able to trim the utilization. Also meant that maybe they were keeping on top of aging things off.
Note: yes, S3 has lifecycle policies that allow you to automate moving things to lower-cost tiers. Unfortunately, the auto-tiering (at least from regular S3 to S3-IA) has a minimum age of 30 days. Not helpful, here.
Saving grace: at least I snuffled up a way to get these metrics without the web GUI. As a side effect, it also meant I had a way to verify that the amount of data reaching S3 matches the amount being exported from their CI/CD applications.