Titular Discrepancy

Friday, January 18, 2019

GitLab: You're Kidding Me, Right?

Some of the organizations I do work for run their own, internal/private git servers (mostly GitLab CE or EE but the occasional GitHub EE). However, the way we try to structure our contracts, we maintain overall ownership of code we produce. As part of this, we do all of our development in our corporate GitHub.com account. When customers want the content in their git servers, we set up a replication-job to take care of the requisite heavy-lifting.

One of the side-effects of developing externally, this way, is that the internal/private git service won't really know about the email addresses associated with the externally-sourced commits. While you can add all of your external email addresses to your account within the internal/private git service, some of those external email addresses may not be verifiable (e.g., if you use GitHub's "noreply" address-hiding option).

GitLab makes having these non-verifiable addresses in your commit-history not particularly fun/easy to resolve. To "fix" the problem, you need to go into the GitLab server's administration CLI and fix things. So, to add my GitHub "noreply" email, I needed to do:

SSH to the GitLab server
Change privileges (sudo) to an account that has the ability to invoke the administration CLI
Start the GitLab administration CLI
Use a query to set a modification-handle for the target account (my contributor account)
Add a new email address (the GitHub "noreply" address)
Tell GitLab "you don't need to verify this" (mandatory: this must be said in an Obi-Wan Kenobi voice)
Hit save and exit the administration CLI

For me, this basically looked like:

-------------------------------------------------------------------------------------
 GitLab:       11.6.5 (237bddc)
 GitLab Shell: 8.4.3
 postgresql:   9.6.10
-------------------------------------------------------------------------------------
Loading production environment (Rails 5.0.7)
irb(main):002:0> user = User.find_by(email: 'my@ldap.email.address')
=> #
irb(main):003:0> user.email = 'ferricoxide@users.noreply.github.com'
=> "ferricoxide@users.noreply.github.com"
irb(main):004:0> user.skip_reconfirmation!
=> true
irb(main):005:0> user.save!
=> true
irb(main):006:0>

Once this is done, when I look at my profile page, my GitHub "noreply" address appears as verified (and all commits associated with that address show up with my Avatar).

Thursday, January 3, 2019

Isolated Network, You Say

The vast majority of my clients, for the past decade and a half, have been very security conscious. The frequency with which other companies end up in the news for data-leaks — either due to hackers or simply leaving an S3 bucket inadequately protected — has made many of them extremely cautious as they move to the cloud.

One of my customers has been particularly wary. As a result, their move to the cloud has included significant use of very locked-down and, in some cases, isolated VPCs. It has made implementing things both challenging and frustrating.

Most recently, I had to implement a self-hosted GitLab solution within a locked down VPC. And, when I say "locked down VPC", I mean that even the standard AWS service-endpoints have been (effectively) replaced with custom, heavily-controlled endpoints. It's, uh, fun.

As I was deploying a new GitLab instance, I noticed that its backup jobs were failing. Yeah, I'd done what I thought was sufficient configuration via the gitlab.rb file's gitlab_rails['backup_upload_connection'] configuration-block. I'd even dug into the documentation to find the juju necessary for specifying the requisite custom-endpoint. While I'd ended up following a false lead to the documentation for fog (the Ruby module GitLab uses to interact with cloud-based storage options), I ultimately found the requisite setting is in the Digital Ocean section of the backup and restore document (simply enough, it requires setting an appropriate value for the "endpoint" parameter).

However, that turned out to not be enough. When I looked through GitLab's error logs, I saw that it was getting SSL errors from the Excon Ruby module. Yes, everything in the VPC was using certificates from a private certificate authority (CA), but I'd installed the root CA into the OS's trust-chain. All the OS level tools were fine with using certificates from the private CA. All of the AWS CLIs and SDKs were similarly fine (since I'd included logic to ensure they were all pointing at the OS trust-store): doing `aws s3 ls` (etc.) worked as one would expect. So, I ended up digging around some more. I found the in-depth configuration-guidance for SSL and the note at the beginning of the "Details on how GitLab and SSL work" section:

GitLab-Omnibus includes its own library of OpenSSL and links all compiled programs (e.g. Ruby, PostgreSQL, etc.) against this library. This library is compiled to look for certificates in /opt/gitlab/embedded/ssl/certs.

This told me I was on the right path. Indeed, reading down just a page-scroll further, I found:

Note that the OpenSSL library supports the definition of SSL_CERT_FILE and SSL_CERT_DIR environment variables. The former defines the default certificate bundle to load, while the latter defines a directory in which to search for more certificates. These variables should not be necessary if you have added certificates to the trusted-certs directory. However, if for some reason you need to set them, they can be defined as environment variables.

So, I added a:

gitlab_rails['env'] = {
        "SSL_CERT_FILE" => "/etc/pki/tls/certs/ca-bundle.crt"
}

To my gitlab.rb and did a quick `gitlab-ctl reconfigure` to make the new settings active in the running service. Afterwards, my GitLab backups to S3 worked without further issue.
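Before pointing `SSL_CERT_FILE` at a bundle, it can be worth confirming that the file is readable and actually contains certificates. What follows is only a minimal sketch: the helper name `count_bundle_certs` is made up for illustration, and the default path is simply the EL7 location used above.

```shell
#!/bin/bash
# Hypothetical helper: count the PEM certificates in a CA bundle file.
# Prints the count; returns non-zero if the file is unreadable or empty
# of certificates.
count_bundle_certs() {
    local bundle="${1:-/etc/pki/tls/certs/ca-bundle.crt}"
    [[ -r "${bundle}" ]] || return 1
    grep -c -- '-----BEGIN CERTIFICATE-----' "${bundle}"
}
```

Running `count_bundle_certs /etc/pki/tls/certs/ca-bundle.crt` before and after installing the private root CA should show the count go up. This only checks the file itself; it does not prove that GitLab's embedded OpenSSL will trust the endpoint.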

Notes:

We currently use the Omnibus installation of GitLab. Methods for altering source-built installations will be different. See the GitLab documentation.
The above path for the "SSL_CERT_FILE" parameter is appropriate for RedHat/CentOS 7. If using a different distro, consult your distro's manuals for the appropriate location.

Monday, December 17, 2018

Crib-Notes: Tuning MTU-size on EL7-based DHCP Clients

Because:

One of the networks I deal with is broken
I have only about a 15% hit-rate of what I actually need whenever I try to re-find the info
most of my Google hits seem to think that using `ip link set...` is a fine, long-term solution and the ones that point to persistency reference either nmcli or the /etc/sysconfig/network-scripts file

Need to save this here (maybe by doing so, Google will at least return this page when I need to re-search)...

To override the default MTU (in my case turning off Jumbo frames), edit the /etc/dhcp/dhclient-eth0.conf file like so:

interface eth0 {
    supersede interface-mtu 1350;
}

Reboot and everything should be fine (actually able to vi a file, use man pages, etc., without locking up my SSH session!).
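To confirm the override took after the reboot, something like the following sketch may help. The function names are hypothetical; the sysfs path assumes a Linux host, and the 28-byte figure assumes plain IPv4 (20-byte header) plus an 8-byte ICMP header.

```shell
#!/bin/bash
# Read the kernel's current MTU for an interface from sysfs.
current_mtu() {
    local iface="${1:-eth0}"
    cat "/sys/class/net/${iface}/mtu"
}

# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
max_ping_payload() {
    local mtu="$1"
    echo $(( mtu - 28 ))
}
```

With the 1350-byte setting above, `current_mtu eth0` should report 1350, and `ping -c 3 -M do -s "$(max_ping_payload 1350)" <SOME_HOST>` sends the largest payload that should pass without fragmentation.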

Tuesday, November 6, 2018

Shutdown In a Hurry

Last year, I put together an automated deployment of a CI tool for a customer. The customer was using it to provide a CI service for different developer-groups within their company. Each developer group acts like an individual tenant of my customer's CI service.

Recently, as more of their developer-groups have started using the service, space-consumption has exploded. Initially, they tried setting up a job within their CI system to automate cleanups of some of their tenants' more-poorly architected CI jobs (ones that didn't appropriately clean up after themselves). Unfortunately, once the CI systems' filesystems fill up, jobs stop working ...including the cleanup jobs.

When I'd written the initial automation, I'd written automation to create two different end-states for their CI system: one that is basically just a cloud-hosted version of a standard, standalone deployment of the CI service; the other being a cloud-enabled version that's designed to automatically rebuild itself on a regular basis or in the event of service failure. They chose to deploy the former because it's the type of deployment they were most familiar with.

With their tenants frequently causing service-components to go offline and their cleaning jobs not being reliable, they asked me to investigate things and come up with a workaround. I pointed them to the original set of auto-rebuilding tools I'd provided them, noting that, with a small change, those tools could detect the filesystem-full state and initiate emergency actions. In this case, the proposed emergency action was a system-suicide that caused the automation to rebuild the service back to a healthy state.

Initially, I was going to patch the automation so that, upon detecting a disk-full state, it would trigger a graceful shutdown. Then it struck me, "I don't care about these systems' integrity, I care about them going away quickly," since the quicker they go away, the quicker the automation will notice and start taking steps to re-establish the service. What I wanted was, instead of an `init 6` style shutdown, to have a "yank the power cord out of the wall" style of shutdown.

In my pre-Linux days, the commercial UNIX systems I dealt with each had methods for forcing a system to immediately stop dead. Do not pass go. Do not collect $200. So, started digging around for analogues.

Prior to many Linux distributions — including the one the CI service was deployed onto — moving to systemd, you could halt the system by doing:

# kill -SEGV 1

Under systemd, however, the above will cause systemd to dump a core file but not halt the system. Instead, systemd instantly respawns:

Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Caught <segv>, dumped core as pid 23769.


Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Freezing execution.


Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Caught <segv>, dumped core as pid 23769.

Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Freezing execution.

[root@test02 ~]# who -r
         run-level 5  2018-11-06 00:16

So, I did a bit more digging around. In doing so, I found that Linux has a functional analog to hitting the SysRq key on an 80s or 90s vintage system. This is an optional functionality that is enabled in Enterprise Linux. For a list of things that used to be doable with a SysRq key, I'd point you to this "Magic SysRq Key" article.

So, I tested it out by doing:

# echo o >/proc/sysrq-trigger

Which pretty much instantly causes an SSH connection to the remote system to "hang". It's not so much that the connection has hung as doing the above causes the remote system to immediately halt. It can take a few seconds, but, eventually the SSH client will decide that the endpoint has dropped, resulting in the "hung" SSH session exiting similarly to:

# Connection reset by 10.6.142.20

This bit of knowledge verified, I had my suicide-method for my automation. Basically, I set up a cron job to run every five minutes to test the "fullness" of ${APPLICATION_HOME}:

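#!/bin/bash
#
# Halt the system immediately if the application filesystem is nearly full
##################################################
PROGNAME="$(basename "${0}")"
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
# Application directory, as recorded by the deployment automation
APPDIR="$(awk -F= '/APP_WORKDIR/{print $2}' /etc/cfn/Deployment.envs)"
# Total bytes in the filesystem (total blocks * block size)
RAWBLOCKS="$(( $( stat -f --format='%b*%S' "${APPDIR}" ) ))"
# Bytes available to non-privileged users (available blocks * block size)
USEBLOCKS="$(( $( stat -f --format='%a*%S' "${APPDIR}" ) ))"
PRCNTFREE=$( printf '%.0f' "$(awk "BEGIN { print ( ${USEBLOCKS} / ${RAWBLOCKS} ) * 100 }")" )

if [[ ${PRCNTFREE} -gt 5 ]]
then
   echo "Found enough free space" > /dev/null
else
   # Put a bullet in systemd's brain
   echo o > /proc/sysrq-trigger
fi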

Basically, the above finds the number of available blocks in the target filesystem and divides by the number of raw blocks and converts it into a percentage (specifically, it checks the number of blocks available to a non-privileged application rather than the root user). If the percentage drops to or below "5", it sends a "hard stop" signal via the SysRq interface.
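Since the real script's failure branch powers the box off, it can be handy to sanity-check the percentage arithmetic separately. The following is a rough sketch with hypothetical function names. It works on block counts rather than bytes and truncates instead of rounding, so it can differ from the script above by a point.

```shell
#!/bin/bash
# Integer percentage of blocks available to non-privileged users.
# Arguments: available-block count, total-block count.
percent_free() {
    local avail="$1" total="$2"
    echo $(( avail * 100 / total ))
}

# Report the percentage for a directory's filesystem without acting on it.
# Relies on GNU stat's -f/--format options, as the script above does.
report_free() {
    local dir="${1:-/}" avail total
    read -r avail total < <(stat -f --format='%a %b' "${dir}")
    percent_free "${avail}" "${total}"
}
```

Calling `report_free "${APPDIR}"` by hand shows how close a filesystem is to the threshold, with no risk of triggering the SysRq write.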

The external (re)build automation takes things from there. The preliminary benchmarked recovery time from exceeding the disk-full threshold to being back in business on a replacement node is approximately 15 minutes (though will be faster in future iterations by better optimizing the build-time hardening routines).

Tuesday, October 2, 2018

S3 And Impacts of Using an IAM Role Instead of an IAM User

I work for a group that manages a number of AWS accounts. To keep things somewhat simpler from a user-management perspective, we use IAM roles to access the various accounts rather than IAM users. This means that we can manage users via Active Directory such that, as new team members are added, existing team members' responsibilities change or team members leave, all we have to do is update their AD user-objects and the breadth and depth of their access-privileges are changed.

However, the use of IAM Roles isn't without its limitations. Role-based users' accesses are granted by way of ephemeral tokens. Tokens last, at most, 3600 seconds. If you're a GUI user, it's kind of annoying because it means that, across an 8+ hour work-session, you'll be prompted to refresh your tokens at least seven times (on top of the initial login). If you're working mostly via the CLI, you won't necessarily notice things as each command you run silently refreshes your token.

I use the caveat "won't necessarily notice" because there are places where using an IAM Role can bite you in the ass. Worse, it will bite you in the ass in a way that may leave you scratching your head wondering "WTF isn't this working as I expected".

The most recent adventure in "WTF isn't this working as I expected" was in the context of trying to use S3 pre-signed URLs. If you haven't used them before, pre-signed URLs allow you to provide temporary access to S3-hosted objects that you otherwise want to not make anonymously-available. In the grand scheme of things, one can do one of the following to provide access to S3-hosted data for automated tools:

Public-read object hosted in a public S3 bucket: this is a great way to end up in news stories about accidental data leaks
Public-read object hosted in a private S3 bucket: you're still providing unfettered access to a specific S3-hosted object/object-set, but some rando can't simply explore the hosting S3 bucket for interesting data. You can still end up in the newspapers, but, the scope of the damage is likely to be much smaller
Private-read object hosted in a private S3 bucket: the least-likely to get you in the newspapers, but requires some additional "magic" to allow your automation access to S3-hosted files:

IAM User Credentials stored on the EC2 instance requiring access to the S3-hosted files: a good method, right until someone compromises that EC2 instance and steals the credentials. Then, the attacker has access to everything those credentials had access to (until such time that you discover the breach, and have deactivated or changed the credentials)
IAM Instance-role: a good way of providing broad-scope access to S3 buckets to an instance or group of instances sharing a role. Note that, absent some additional configuration trickery, every process running on an enabled instance has access to everything that the Instance-role provides access to. Thus, probably not a good choice for systems that allow interactive logins or that run more than one, attackable service.
Pre-signed URLs: a way to provide fine-grained, temporary access to S3-hosted objects. Primary down-fall is there's significant overhead in setting up access to a large collection of files or providing continuing access to said files. Thus, likely suitable for providing basic CloudFormation EC2 access to configuration files, but not if the EC2s are going to need ongoing access to said files (as they would in, say, an AutoScaling Group type of use-case)

There are likely more access methods one can play with, each with their own security, granularity and overhead tradeoffs. The above are simply the ones I've used.

The automation I write tends to include a degree of user-customizable content. Which is to say, I write my automation to take care of 90% or so of a given service's configuration, then hand the automation off to others to use as they see fit. To help prevent the need for significant code-divergence in these specific use-cases, my automation generally allows a user to specify the ingestion of configuration-specification files and/or secondary configuration scripts. These files or scripts often contain data that you wouldn't want to put in a public GitHub repository. Thus, I generally recommend to the automation's users "put that data someplace safe - here's methods for doing it relatively safely via S3".

Circling back so that the title of this post makes sense... Typically, I recommend the use of pre-signed URLs for automation-users that want to provide secure access to these once-in-an-instance-lifecycle files. Pre-signed URLs' access-granting can be as little as a couple seconds to as much as seven days. Without specifying a desired time, the access is granted for one hour (3600 seconds, the `aws s3 presign` default).

However, that degree of flexibility depends greatly on what type of IAM object is creating the ephemeral access-grant. A standard IAM user can grant with all of the previously mentioned time-flexibility. An IAM role, however, is constrained to however long its currently-active token is good for. Thus, if executing from the CLI using an IAM role, the grantable lifetime for a pre-signed URL is 0-3600 seconds.

If you're confused whether the presigned URL you've created/were given was from an IAM user or Role, look for the presence of X-Amz-* elements in the URL, `X-Amz-Security-Token` in particular. If you see the security-token element, it was generated by an IAM Role and will only last up to 3600 seconds.
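The URL inspection can be scripted. The sketch below rests on my understanding that URLs signed with temporary (role-derived) credentials carry a security-token query parameter, and that SigV4-signed URLs carry an `X-Amz-Expires` lifetime in seconds. The function names and the sample URLs in the usage note are made up for illustration.

```shell
#!/bin/bash
# Succeeds if the URL carries a security-token query parameter, which
# indicates it was signed with temporary (role-derived) credentials.
signed_with_temp_creds() {
    local url="$1"
    grep -qiE '[?&]x-amz-security-token=' <<< "${url}"
}

# Print the SigV4 X-Amz-Expires lifetime (in seconds), if present.
presign_lifetime() {
    local url="$1"
    grep -oiE '[?&]x-amz-expires=[0-9]+' <<< "${url}" | cut -d= -f2
}
```

For example, `signed_with_temp_creds "https://example-bucket.s3.amazonaws.com/file.txt?X-Amz-Expires=3600&X-Amz-Security-Token=EXAMPLE&X-Amz-Signature=abc"` succeeds, and `presign_lifetime` on the same URL prints `3600`. A URL with only `AWSAccessKeyId`, `Expires` and `Signature` elements fails the first check.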

Wednesday, September 26, 2018

So You Created a Regression?

Sometimes, when you're fixing up files in git-managed projects, you'll create a regression by nuking a line (or whole blocks) of code. If you can remember something that was in that nuked chunk of code, you can write a quick script to find all prior commits that referenced that chunk of code.

In this particular case, one of my peers was working on writing some Jenkins pipeline-definitions. The pipeline needed to automagically create S3 pre-signed URLs. At some point, the routine for doing so got nuked. Because coming up with the requisite interpolation-protections had been kind of a pain in the ass, we really didn't want to have to go through the pain of reinventing that particular wheel.

So, how to make git help us find the missing snippet? `git log`, harnessed to a quick loop-iteration, can do the trick:

for FILE in $( find <PROJECT_ROOT_DIR> -name "*.groovy" )
do
   echo "${FILE}"
   git log --pretty="   %H" -Spresign -- "${FILE}"
done

In the above:

We're executing within the directory created by the original `git clone` invocation. To limit the search-scope, you can also run it from a subdirectory of the project.
Since, in this example case, we know that all the Jenkins pipeline definitions end with the `.groovy` extension, we limit our search to just those file-types.
The `-Spresign` is used to tell `git log` to look for the string `presign`.
The `--pretty=" %H"` suppresses all the other output from the `git log` command, ensuring that only the commit-ID is printed. The leading spaces in the quoted string provide a bit of indenting to make the output-groupings a bit easier to intuit.

This quick loop provides us a nice list of commit-IDs like so:

./Deployment/Jenkins/agent/agent-instance.groovy
   64e2039d593f653f75fd1776ca94bdf556166710
   619a44054a6732bacfacc51305b353f8a7e5ebf6
   1e8d5e40c7db2963671457afaf2d16e80e42951f
   bb7af6fd6ed54aeca54b084627e7e98f54025c85
./Deployment/Jenkins/master/Jenkins_master_infra.groovy
./Deployment/Jenkins/master/Jenkins_S3-MigrationHelper.groovy
./Deployment/Jenkins/master/Master-Ec2-Instance.groovy
./Deployment/Jenkins/master/Master-Elbv1.groovy
./Deployment/Jenkins/master/parent-instance.groovy

In the above, only the file `Deployment/Jenkins/agent/agent-instance.groovy` contains our searched-for string. The first-listed commit-ID (indented under the filename) contains the subtraction-diff for the targeted string. Similarly, the second commit-ID contains the code snippet we're actually after. The remaining commit-IDs contain the original "invention of the wheel".

In this particular case, we couldn't simply revert to the specific commit as there were a lot of other changes that were desired. However, it did let the developer use `git show` so that he could copy out the full snippet he wanted back in the current version(s) of his pipelines.
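As an aside, `git log` also accepts a pathspec, so the same search can be sketched without the per-file loop. This is an alternative formulation, not what was run at the time, and the function name is hypothetical.

```shell
#!/bin/bash
# List the IDs of commits whose diffs add or remove NEEDLE, limited to
# files matching PATHSPEC (relative to the current directory).
find_string_commits() {
    local needle="$1" pathspec="$2"
    git log --pretty='%H' -S"${needle}" -- "${pathspec}"
}
```

From inside the clone, `find_string_commits presign '*.groovy'` lists the matching commits, newest first. Any one of them can then be inspected with `git show <COMMIT_ID> -- <FILE>`. The trade-off against the loop is that this output is not grouped by file.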

Wednesday, September 19, 2018

Exposed By Time and Use

Last year, a project with both high complexity and high urgency was dropped in my lap. Worse, the project was, to be charitable, not well specified. It was basically, "we need you to automate the deployment of these six services and we need to have something demo-able within thirty days".

Naturally, the six services were ones that I'd never worked with from the standpoint of installing or administering. Thus, there was a bit of a learning curve around how best to automate things that wasn't aided by the paucity of "here's the desired end-state" or other related details. All they really knew was:

Automate the deployment into AWS
Make sure it works on our customized/hardened build
Make sure that it backs itself up in case things blow up
GO!

I took my shot at the problem. I met the deadline. Obviously, the results were not exactly "optimal", especially from the "turn it over to others to maintain" standpoint. Naturally, after I turned over the initial capability to the requester, they were radio silent for a couple weeks.

When they finally replied, it was to let me know that the deadline had been extended by several months. So, I opted to use that time to make the automation a bit friendlier for the uninitiated to use. That's mostly irrelevant here — just more "background".

At any rate, we're now nearly a year-and-a-half removed from that initial rush-job. And, while I've improved the ease of use for the automation (it's been turned over to others for daily care-and-feeding), much of the underlying logic hasn't been revisited.

Over that kind of span, time tends to expose a given solution's shortcomings. Recently, they were attempting to do a parallel upgrade of one of the services and found that the data-move portion of the solution was resulting in build-timeouts. Turns out the size of the dataset being backed up (and recovered from as part of the automated migration process) had exploded. I'd set up the backups to operate incrementally, so, the increase in raw transfer times had been hidden.

The incremental backups were only taking a few minutes; however, restores of the dataset were taking upwards of 35 minutes. The build-automation was set to time out at 15 minutes (early in the service-deployment, a similar operation took 3-7 minutes). So, step one was to adjust the automation's timeouts to make allowances for the new restore-time realities. Second step was to investigate why the restores were so slow.

The quick-n-dirty backup method I'd opted for was a simple `aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>`. It was a dead-simple way to "sweep" the contents of the directory /<APPLICATION_HOME_DIR>/ to S3. And, because S3's `sync` method defaults to incremental, the cron-managed backups were taking the same couple minutes each day that they were a year-plus ago.

Fun fact about S3 and its transfer performance: if the objects you're uploading have keys with high degrees of commonality, transfer performance will become abysmal.

You may be asking why I mention "keys" since I've not mentioned encryption. S3, being an object-based filesystem, doesn't have the hierarchical layout of legacy, host-oriented storage. If I take a file from a host-oriented storage and use the S3 CLI utility to copy that file via its fully-qualified pathname to S3, the object created in S3 will look like:

<FULLY>/<QUALIFIED>/<PATH>/<FILE>

Of the above, the whole "<FULLY>/<QUALIFIED>/<PATH>/<FILE>" string is the object's key in S3, with "<FULLY>/<QUALIFIED>/<PATH>" acting as that key's prefix. If you have a few thousand objects with the same or sufficiently-similar prefixes, you'll run into that "transfer performance will become abysmal" issue mentioned earlier.

We very definitely did run into that problem. HARD. The dataset in question is (currently) a skosh less than 11GiB in size. The instance being backed up has an expected sustained network throughput of about 0.45Gbps. So, we were expecting that dataset to take only a couple minutes to transfer. However, as noted above, it was taking 35+ minutes to do so.
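The back-of-the-envelope expectation can be reproduced with a little arithmetic. This sketch assumes GiB means 2^30 bytes and Gbps means 10^9 bits per second, and it ignores protocol overhead entirely. The function name is hypothetical.

```shell
#!/bin/bash
# Ideal transfer time, in whole seconds, for a dataset of SIZE_GIB gibibytes
# over a link sustaining RATE_GBPS gigabits (10^9 bits) per second.
expected_seconds() {
    local size_gib="$1" rate_gbps="$2"
    awk -v s="${size_gib}" -v r="${rate_gbps}" \
        'BEGIN { printf "%.0f\n", (s * 1073741824 * 8) / (r * 1000000000) }'
}

expected_seconds 11 0.45   # prints 210
```

That is about three and a half minutes, which is why a 35+ minute transfer pointed at something other than raw bandwidth.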

So, how to fix? One of the easier methods is to stream your backups to a single file. A quick series of benchmarking-runs showed that doing so cut that transfer from over 35 minutes to under five minutes. Similarly, were one to iterate over all the files in the dataset and individually copy the files into S3 using either randomized filenames (setting the "real" fully-qualified-path as an attribute/tag of the file) or simply the reversed path-name (doing something like `S3NAME=$( echo "${REAL_PATHNAME}" | perl -lne 'print join "/", reverse split/\//;' )`), performance would go up dramatically.

I'll likely end up doing one of those three methods ...once I have enough contiguous time to allocate to re-engineering the backup and restore/rebuild methods.
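For the path-reversal option, a perl-free sketch might look like the following. The function name is hypothetical. Unlike the perl one-liner, it drops the leading slash first, so the result carries no trailing slash.

```shell
#!/bin/bash
# Reverse the components of a slash-separated path, e.g. so that the
# most-variable component (the filename) leads the resulting S3 key.
reverse_path() {
    local path="${1#/}"   # drop a single leading slash, if any
    awk -F/ '{
        out = $NF
        for (i = NF - 1; i >= 1; i--) out = out "/" $i
        print out
    }' <<< "${path}"
}

reverse_path /opt/app/data/file.txt   # prints file.txt/data/app/opt
```

The single-file option is simpler still: something along the lines of `tar -C /<APPLICATION_HOME_DIR> -czf - . | aws s3 cp - s3://<BUCKET>/<FOLDER>/backup.tgz`, where `backup.tgz` is just an illustrative object name.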

Thursday, August 16, 2018

Systemd And Timing Issues

Unlike a lot of Linux people, I'm not a knee-jerk hater of systemd. My "salaried UNIX" background, up through 2008, was primarily with OSes like Solaris and AIX. With Solaris, in particular, I was used to systemd-type init-systems due to SMF.

That said, making the switch from RHEL and CentOS 6 to RHEL and CentOS 7 hasn't been without its issues. The change from upstart to systemd is a lot more dramatic than from SysV-init to upstart.

Much of the pain with systemd comes with COTS software originally written to work on EL6. Some vendors really only do fairly cursory testing before saying something is EL7 compatible. Many (especially earlier in the EL 7 lifecycle) didn't bother creating systemd services at all. They simply relied on the systemd-sysv-generator utility to do the dirty work for them.

While the systemd-sysv-generator utility does a fairly decent job, one of the places it can fall down is if the legacy-init script (files hosted in /etc/rc.d/init.d) is actually a symbolic link to someplace else in the filesystem. Even then, it's not much of a problem if "someplace else" is still within the "/" filesystem. However, if your SOPs include "segregate OS and application onto different filesystems", then "someplace else" can very much be a problem: specifically, when "someplace else" is on a different filesystem from "/".

Recently, I was asked to automate the installation of some COTS software with the "it works on EL6 so it ought to work on EL7" type of "compatibility". Not only did the software not come with systemd service files, its legacy-init files linked out to software installed in /opt. Our shop's SOPs are of the "applications on their own filesystems" variety. Thus, the /opt/<APPLICATION> directory is actually its own filesystem hosted on its own storage device. After doing the installation, I'd reboot the system. ...And when the system came back, even though there was a boot script in /etc/rc.d/init.d, the service wasn't starting. Poring over the logs, I eventually found:

systemd-sysv-generator[NNN]: stat() failed on /etc/rc.d/init.d/<script_name>
No such file or directory

This struck me as odd given that the link and its destination very much did exist.

Turns out, systemd invokes the systemd-sysv-generator utility very early in the system-initialization process. It invokes it so early, in fact, that the /opt/<APPLICATION> filesystem has yet to be mounted when it runs. Thus, when it's looking to do the conversion, the file the sym-link points to actually does not yet exist.

My first thought was, "screw it: I'll just write a systemd service file for the stupid application." Unfortunately, the application's starter was kind of a rats nest of suck and fail; complete and utter lossage. Trying to invoke it directly via a systemd service definition resulted in the application's packaged controller-process not knowing where to find a stupid number of its sub-components. Brittle. So, I searched for other alternatives...

Eventually, my searches led me to both the nugget about when systemd invokes the systemd-sysv-generator utility and how to overcome the "sym-link to a yet-to-be-mounted-filesystem" problem. Under systemd-enabled systems, there's a new-with-systemd mount-option you can place in /etc/fstab — x-initrd.mount. You also need to make sure that your filesystem's fs_passno is set to "0" ...and if your filesystem lives on an LVM2 volume, you need to update your GRUB2 config to ensure that the LVM gets onlined prior to systemd invoking the systemd-sysv-generator utility. Fugly.

At any rate, once I implemented this fix, the systemd-sysv-generator utility became happy with the sym-linked legacy-init script ...And my vendor's crappy application was happy to restart on reboot.

Given that I'm deploying on AWS, I was able to accommodate setting these fstab options by doing:

mounts:
  - [ "/dev/nvme1n1", "/opt/<application>", "auto", "defaults,x-initrd.mount", "0", "0" ]

Within my cloud-init declaration-block. This should work in any context that allows you to use cloud-init.
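After cloud-init has run, it may be worth checking that the resulting fstab entry really carries both settings. A minimal sketch, assuming the standard six-field fstab layout; the function name is hypothetical.

```shell
#!/bin/bash
# Succeeds if MOUNTPOINT's entry in an fstab-format file lists
# x-initrd.mount among its options and has an fs_passno (sixth field) of 0.
has_initrd_mount() {
    local mountpoint="$1" fstab="${2:-/etc/fstab}"
    awk -v mp="${mountpoint}" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "x-initrd.mount" && $6 == "0") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${fstab}"
}
```

Something like `has_initrd_mount "/opt/<application>" && echo "fstab looks right"` then makes a cheap post-deployment check.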

I wish I could say that this was the worst problem I've run into with this particular application. But, really, this application is an all around steaming pile of technology.

Wednesday, August 15, 2018

Diagnosing Init Script Issues

Recently, I've been wrestling with some COTS software that one of my customers wanted me to automate the deployment of. Automating its deployment has been... Patience-trying.

The first time I ran the installer as a boot-time "run once" script, it bombed. Over several troubleshooting iterations, I found that one of the five sub-components was at fault. When I deployed the five sub-components individually (and to different hosts), all but that one sub-component succeeded. Worse, if I ran the automated installer from an interactive user's shell, the installation would succeed. Every. Damned. Time.

So, I figured, "hmm... need to see if I can adequately simulate how to run this thing from an interactive user's shell yet fool the installer into thinking it had a similar environment to what the host's init process provides." So, I commenced to Google'ing.

Eventually, I found reference to `setsid`. Basically, this utility allows you to spawn a subshell that's detached from a TTY ...just like an init-spawned process. So, I started out with:

     setsid bash -c '/path/to/installer -- \
       -c /path/to/answer.json' < /dev/null 2>&1 \
       | tee /root/log.noshell

Unfortunately, while the above creates a TTY-less subshell for the script to run in, it still wasn't quite fully simulating an init-like context. The damned installer was still completing successfully. Not what I wanted since, when run from an init-spawned context, it was failing in a very specific way. So, back to the almighty Googs.

Eventually, it occurred to me, "init-spawned processes have very different environments set than interactive user shells do. And, as a shell forked off from my login shell, that `setsid` process would inherit all my interactive shell's environment settings." Once I came to that realization, the Googs quickly pointed me to `env -i`. Upon integrating this nugget:

     setsid bash -c 'env -i /path/to/installer -- \
       -c /path/to/answer.json' < /dev/null 2>&1 \
       | tee /root/log.noshell

My installer started failing the same way when launched from an interactive shell as it did from an init-spawned context. I had something with which I could tell the vendor, "your unattended installer is broken: here's how to simulate/reproduce the problem. FIX IT." I dunno about you, but, to me, an "unattended installer" that I have to kick off by hand from an interactive shell really isn't all that useful.
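
The effect of `env -i` is easy to demonstrate in isolation. The following is just a sketch: `DEMO_LOGIN_SETTING` is a made-up variable standing in for whatever an interactive login happens to set, and has nothing to do with the real installer:

```shell
#!/bin/bash
# Hypothetical variable standing in for anything an interactive login sets up
export DEMO_LOGIN_SETTING="from-my-login-shell"

# Locate bash up front, since the scrubbed child gets no PATH to search with
bash_bin=$( command -v bash )

# A normal child inherits the variable...
inherited=$( "${bash_bin}" -c 'echo "${DEMO_LOGIN_SETTING:-unset}"' )

# ...while a child launched through `env -i` starts with an empty environment
scrubbed=$( env -i "${bash_bin}" -c 'echo "${DEMO_LOGIN_SETTING:-unset}"' )

echo "inherited: ${inherited}"
echo "scrubbed:  ${scrubbed}"
```

If a fully empty environment turns out to be harsher than what init really provides, `env -i` also accepts `NAME=VALUE` pairs ahead of the command (for example, a minimal `PATH`), so specific variables can be handed back one at a time while narrowing down which one the installer is missing.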

Friday, August 3, 2018

JMSE Query Fu

Yesterday, I posted up an article illustrating the pain of LDAP's query-language. Today, I had to dick around with JMSE reporting (more properly, JMESPath: the query language behind the AWS CLI's `--query` option).

The group I work for manages a number of AWS accounts in a number of regions. We also provide support to a number of tenants and their AWS accounts.

This support is necessary because, in much the same way AWS strives to make it easy to adopt "cloud", a lot of people flocking aboard don't necessarily know how to do so cost-effectively. Using AWS can be a cost-effective way to do things, but failure to exercise adequate house-keeping can bite you right in the wallet. Unfortunately, part of the low bar to entry means that a lot of users coming into AWS don't really comprehend that housekeeping is necessary. And, even if they did, they don't necessarily understand how to implement automated housekeeping methods. So, stuff tends to build up over time ...leading to unnecessary cost-exposures.

Inevitably, this leads to a "help: our expenses are running much higher than expected" type of situation. What usually turns out to be a major cause is disused storage (and orphaned EC2s left running 24/7/365 ...but that's another story). Frequently, this comes out to some combination of orphaned (long-detached) EBS volumes, poorly maintained S3 buckets and elderly EBS snapshots.

ID'ing orphaned EBS volumes is pretty straightforward. Not a lot of query-fu is needed to see them.

Poorly maintained S3 buckets are a skosh harder to suss out. But, you can usually just enable bucket inventory-reporting and lifecycle policies to help automate your cost-exposure reduction.

While EBS snapshots are relatively low-cost, when enough of them build up, you start to notice the expenses. Reporting can be relatively straightforward. However, if you happen to be maintaining fleets of custom AMIs, doing a simple "what can I delete" report becomes an exercise in query-filtering. Fortunately, AMI-related snapshots are generally identifiable (excludable) with a couple of filters, leaving you a list of snapshots to further investigate. Unfortunately, writing the query-filters means dicking with JMSE filter syntax ...which is kind of horrid. Dunno if it's more or less horrid than LDAP's.

At any rate, I ended up writing a really simple reporting script to give me a basic idea of whether stale snapshots are even a problem ...and the likely scope of said problem if it existed. While few of our tenants maintain their own AMIs, we do in the account I primarily work with. So, I wrote my script to exclude AMI-related snapshots from relevant reports:

# Count non-AMI snapshots per month and region
# (<ACCOUNT> is a placeholder for the owning account's ID)
for YEAR in $( seq 2015 2018 )
do
   for MONTH in $( seq -w 1 12 )
   do
      for REGION in us-{ea,we}st-{1,2}
      do
         # Label for the count that follows (no trailing newline)
         printf "%s-%s (%s): " "${YEAR}" "${MONTH}" "${REGION}"
         aws --region "${REGION}" ec2 describe-snapshots --owner-ids <ACCOUNT> \
           --filters "Name=start-time,Values=${YEAR}-${MONTH}*" --query \
           'Snapshots[?starts_with(Description, `Created by CreateImage`) == `false`]|[?starts_with(Description, `Copied for DestinationAmi`) == `false`].[StartTime,SnapshotId,VolumeSize,Description]' \
           --output text | wc -l
      done
   done
done | sed '/ 0$/d'   # drop month/region combinations with no snapshots

The first suck part with a JMSE query is it can be really long. Worse: it really doesn't tolerate line-breaking to make scripts less "sprawly". If you're like me, you like to keep a given line of code in a script no longer than "X" characters. I usually prefer X=80. JMSE queries pretty much say, "yeah: screw all that prettiness nonsense".

At any rate, to quickly explain the easier parts of the above:

First, this is a basic BASH wrapper (should work in other shells that are aware of Bourne-ish syntax)
I'm using nested loops: first level is year, second level is month, third level is AWS region. This allows easy grouping of output.
I'm using `printf` to give my output meaning
At the end of my `aws` command, I'm using `wc` to provide a simple count of lines returned: when one uses the `--output text` argument to the `aws` command, each object's returned data is done as a single line ...allowing `wc` to provide a quick tally of lines meeting the JMSE selection-criteria
At the end of all my looping, I'm using `sed` to suppress any lines where a given region/month/year has no snapshots found

Notice I don't have a bullet answering why I'm bothering to define an output string ...and then, essentially, throwing that string away. Simply put, I have to output something for `wc` to count and I may as well output several useful items in case I want to `tee` it off and use the data in an extension to the script.
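
The wrapper logic can be tried out without touching AWS at all. In this sketch, `fake_count` is a made-up stand-in for the `aws ... | wc -l` pipeline that returns canned numbers, and the year range is shortened to keep the run quick; everything else follows the same loop/`printf`/`sed` pattern described above:

```shell
#!/bin/bash
# Hypothetical stand-in for `aws ... --output text | wc -l`: canned counts only
fake_count() {
   # $1 = YYYY-MM, $2 = region
   if [[ "$1" == "2018-08" && "$2" == "us-west-2" ]]
   then
      echo 3
   else
      echo 0
   fi
}

report=$(
   for YEAR in $( seq 2017 2018 )
   do
      for MONTH in $( seq -w 1 12 )      # -w zero-pads: 01, 02 ... 12
      do
         for REGION in us-{ea,we}st-{1,2}   # expands to the four US regions
         do
            printf "%s-%s (%s): " "${YEAR}" "${MONTH}" "${REGION}"
            fake_count "${YEAR}-${MONTH}" "${REGION}"
         done
      done
   done | sed '/ 0$/d'                   # suppress lines whose count is zero
)

echo "${report}"
```

Since the `sed` pattern includes the leading space, a count like "10" survives while a bare "0" gets dropped.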

The fun part of the above is the JMSE horror-show:

When you query an EBS snapshot object, it outputs a JSON document. That document's structure looks like:

{
    "Snapshots": [
        {
            "Description": "Snapshot Description String",
            "Tags": [
                {
                    "Value": "TAG1 Value",
                    "Key": "TAG1"
                },
                {
                    "Value": "TAG2 Value",
                    "Key": "TAG2"
                },
                {
                    "Value": "TAG3 Value",
                    "Key": "TAG3"
                }
            ],
            "Encrypted": false,
            "VolumeId": "<VOLUME_ID>",
            "State": "completed",
            "VolumeSize": <SIZE_IN_GiB>,
            "StartTime": "<YYYY>-<MM>-<DD>T<hh>:<mm>:<ss>.000Z",
            "Progress": "100%",
            "OwnerId": "<ACCOUNT>",
            "SnapshotId": "<SNAPSHOT_ID>"
        }
    ]
}

The ".[StartTime,SnapshotId,VolumeSize,Description]" portion of the "--query" string constrains the output to the "StartTime", "SnapshotId", "VolumeSize" and "Description" elements from the object's JSON document. Again, not of immediate use when doing a basic element-count census but is convertible to something useful in later tasks.
When constructing a JMSE query, you're informing the tool which elements of the JSON node-tree you want to work on. You pretty much always have to specify the top-level document node. In this case, that's "Snapshots". Thus, the initial part of the query-string is "Snapshots[]". Not directly germane to this document, but likely worth knowing:
- In JMSE, "Snapshots[]" basically means "give me everything from 'Snapshots' on down".
- An equivalent to this is "Snapshots[*]".
- If you wanted just the first object returned, you'd use "Snapshots[0]"
- If you wanted just a sub-range of objects returned, you'd use "Snapshots[X:Y]"
If you want to restrict your output based on selection criteria, JMSE provides the "[?]" construct. In this particular case, I'm using the "starts_with" function, or "[?starts_with()]". The function takes two arguments: the sub-attribute you're querying and the string to match against. Comparing its result to `true` or `false` sets whether you're positively or negatively selecting. Thus:
- To eliminate snapshots with descriptions starting "Created by CreateImage", my query looks like: "?starts_with(Description, `Created by CreateImage`) == `false`"
- To eliminate snapshots with descriptions starting "Copied for DestinationAmi", my query looks like: "?starts_with(Description, `Copied for DestinationAmi`) == `false`"
- Were I selecting for either of these strings — rather than the current against-match — I would need to change my "false" selectors to "true".
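To do a compound query, one chains filters as "[?query1]|[?query2]". Each "|" hands the previous filter's survivors to the next filter, so an object has to pass every filter to be reported. Logically, that's an AND of the conditions which, with two negative matches like mine, works out to "drop anything whose description starts with either string". Notionally, I could string together as many of these as I wanted if I needed to select for/against any arbitrary number of attribute values by adding in further "|" operators.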
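
If the chained-filter behavior seems abstract, a plain-shell analogy may help. To be clear, this is not JMSE: it's just `grep -v` stages piped together, run against some made-up description strings, to show how each stage only ever sees what the previous stage let through:

```shell
#!/bin/bash
# Hypothetical snapshot descriptions, one per line
descriptions=$( cat <<'EOF'
Created by CreateImage(i-0123456789abcdef0) for ami-11111111
Copied for DestinationAmi ami-22222222 from SourceAmi ami-33333333
Nightly backup of data volume
Pre-upgrade snapshot
EOF
)

# Two chained exclusion stages, mirroring the two "== `false`" filters
survivors=$(
   printf '%s\n' "${descriptions}" |
   grep -v '^Created by CreateImage' |
   grep -v '^Copied for DestinationAmi'
)

# Only the two non-AMI descriptions remain
printf '%s\n' "${survivors}"
```

Adding a third exclusion is just one more stage on the pipeline, which is the same shape as tacking another "|[?...]" onto the query string.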

At the end of running my script, I ended up with output that looked like:

    2017-07 (us-east-1): 61
    2017-08 (us-east-1): 746
    2017-09 (us-east-1): 196
    2017-10 (us-east-1): 4
    2017-11 (us-east-1): 113
    2017-12 (us-east-1): 600
    2018-01 (us-east-1): 149
    2018-03 (us-east-1): 6
    2018-04 (us-east-1): 3
    2018-06 (us-east-1): 302
    2018-07 (us-east-1): 1620
    2018-08 (us-east-1): 206
    2018-08 (us-west-2): 3

Obviously, there's some cleanup to be done in the queried account: it's unlikely that any snapshots older than 30 days are still needed, and stuff that's from 2017 is almost a lock to be the result of "bad housekeeping".
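
As a possible next step, the text rows that were being fed to `wc` could instead be sifted for anything older than a cutoff. This is only a sketch: the rows below are invented sample data in the same column order the query emits (StartTime, SnapshotId, VolumeSize, Description), and the cutoff is hard-coded rather than computed:

```shell
#!/bin/bash
# Hypothetical, tab-separated rows in the report's column order:
# StartTime, SnapshotId, VolumeSize, Description
rows=$( printf '%s\t%s\t%s\t%s\n' \
   '2017-08-03T04:00:12.000Z' 'snap-aaaa1111' '8'  'Nightly backup' \
   '2018-07-30T04:00:09.000Z' 'snap-bbbb2222' '20' 'Pre-upgrade snapshot' \
   '2017-12-15T04:00:31.000Z' 'snap-cccc3333' '8'  'Nightly backup' )

# Fixed cutoff for the example; with GNU date, something like
# `date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S.000Z` could compute it instead
cutoff='2018-07-16T00:00:00.000Z'

# Timestamps in this one format sort lexically, so a string comparison is enough
stale_ids=$( printf '%s\n' "${rows}" |
   awk -F'\t' -v cutoff="${cutoff}" '$1 < cutoff { print $2 }' )

printf '%s\n' "${stale_ids}"
```

The resulting IDs are candidates for review, not an automatic delete-list: whether a given snapshot is genuinely disposable still needs a human (or at least a tagging convention) to decide.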