Friday, August 16, 2019

Unsupported Things Never Stay Unsupported

A few years ago, the project-lead on the project I was working on at the time asked me, "can you write some backup scripts that our cloud pathfinder programs can use? Moving to AWS, their systems won't be covered by the enterprise's NetBackup system and they need something to help with disaster-mitigation."

At the time, the team I was on was very small. While there were nearly a dozen people – a couple of whom were also technically-oriented – on the team, it was pretty much just me and one other person who were automation-oriented (and responsible for the entire tooling workload). That other person was more Windows-focussed and our pathfinders were pretty much wholly Linux-based. I am, to make an understatement, UNIX/Linux-oriented. Thus it fell to me.

I ended up whacking together a very quick-and-dirty set of tools. Since our pathfinders were notionally technically-savvy (otherwise, they ought not have been pathfinders), I built the tools with a few assumptions: 1) that they'd read the (minimal) documentation I included with the project; 2) that, having read that, if they found it lacking, they'd examine the code (since it was hosted in a public GitHub project); 3) that they'd contact me or the team if 1 and/or 2 were to fail - preferably by filing a GitHub issue against the project; and, 4) that they might be the type of users who tried to use more-complex storage configurations (especially given that, at the time, you couldn't expand an EBS volume once created).

Since I wanted something fairly quickly-done, I wrote the tools to leverage AWS's native storage-snapshotting capability. Using storage-snapshots was also attractive because it was more cost-efficient than mirroring an online EBS to an offline EBS (especially the multiple such offline EBSes that would be necessary to create multiple days' worth of recovery-points). Lastly, because of the way the snapshots work, once you initiate the snapshot, you don't have to worry about filesystem changes while the snapshot-creation finishes.

Resultant of assumption #4, I wrote the scripts to accommodate the possibility that they'd be trying to back up filesystems that spanned – either as concats or stripes – two or more EBS volumes. This accommodation meant that I added an option to freeze the filesystem before requesting the snapshot and (because AWS had yet to add the option to snapshot a set of EBSes all at once) implemented multi-threaded snapshot-request logic. The combination of the two meant that, for the few milliseconds that it took to initiate the (effectively) parallel snapshot-requests, I/Os to the filesystem would be halted, preventing any consistency-gaps between members of the EBS-set.
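
A minimal sketch of the freeze-then-snapshot-in-parallel idea described above. This is not the original tool's code: the function name `snapshot_span` and the snapshot description text are made up for illustration. It assumes it runs as root on the instance that owns the volumes, with `fsfreeze` (from util-linux) and the AWS CLI available:

```shell
# Hypothetical helper: snapshot_span <mountpoint> <volume-id> [<volume-id>...]
snapshot_span() {
   MOUNTPOINT="$1"
   shift

   # Halt I/O to the filesystem so that every span-member is captured
   # at the same point in time
   fsfreeze --freeze "${MOUNTPOINT}" || return 1

   # Fire off one snapshot-request per member volume, all in the background
   for VOLUME in "$@"
   do
      aws ec2 create-snapshot --volume-id "${VOLUME}" \
         --description "Span-member of ${MOUNTPOINT}" &
   done

   # Block until every request has returned (successfully or not)
   wait

   # Always thaw, even if one of the snapshot-requests failed
   fsfreeze --unfreeze "${MOUNTPOINT}"
}
```

Something like `snapshot_span /data vol-0aaa vol-0bbb` (placeholder names) would keep the filesystem frozen only for as long as it takes the requests to be accepted, not for the duration of the snapshot-creation itself.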

Unfortunately, as my primary customers' adoption went from the (technically-savvy) pathfinder-projects to progressively less-technical follow-on projects, the assumptions under which the tools were authored became less valid. Because the follow-on projects tended to be peopled by staff that are very "cut-n-paste" oriented, they tended to do some silly things:
  • Trying to freeze the "/" filesystem (you can do it, but you're probably not going to be able to undo it …without rebooting)
  • Trying to freeze filesystems that didn't exist ("but it was in the example")
  • Trying to freeze filesystems that weren't built on top of spans (not inherently "wrong", per se, but also not needed from a "keep all the sub-volumes in sync" perspective)
Worse, as our team added (junior) people to help act as stewards for these projects, they didn't really understand the vagaries of storage – why certain considerations are warranted or not (and the tool-features associated thereto). So, they weren't steering the new tool-users away from bad usages. Bearing in mind that the tools were a quick-and-dirty effort meant for use by technically-savvy users, the tools had pretty much no "safeties" built in.
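
For illustration, a "safety" that would catch the first two misuses listed above might look something like the following. This is a hedged sketch rather than what the tools actually contain: the function name `safe_to_freeze` is hypothetical, and it assumes the util-linux `mountpoint` utility is present:

```shell
# Hypothetical pre-flight check: safe_to_freeze <mountpoint>
# Returns 0 only when freezing the target is unlikely to cause grief
safe_to_freeze() {
   TARGET="$1"

   # Freezing "/" generally cannot be undone without a reboot
   if [ "${TARGET}" = "/" ]
   then
      echo "Refusing to freeze the root filesystem" >&2
      return 1
   fi

   # Catch copy-and-pasted example paths that do not exist on this host
   if ! mountpoint -q "${TARGET}"
   then
      echo "${TARGET} is not a mounted filesystem" >&2
      return 1
   fi

   return 0
}
```

A caller would gate the freeze on the check, e.g. `safe_to_freeze "${FSMOUNT}" && fsfreeze --freeze "${FSMOUNT}"` (with `FSMOUNT` being a placeholder variable).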

Side-note: To me, you build safeties into tools that you're planning to support (to minimize the amount of "help me, help me: stuff blowed up" emails).

Being short on time and with a huge stack of other projects to work on when I first wrote them, smoothing off the rough-edges wasn't a priority. And, when the request for the tools was originally made, in response to my saying "I don't have the spare cycles to fully engineer a supportable tool-set," I was told, "these will be made available as references, not supported tools."

Harkening back to the additions to the team, one of the other things that wasn't being adequately communicated to them in their onboarding was that the tooling wasn't "supported". We ended up on a cycle where, about every 6-11 months, people would be screaming about "the scripts are broken and tenants are screaming". At which point I would have to remind the people who made the "they're just references" tool-request that, "1) these tools were never meant as more than a reference; 2) they're working perfectly-adequately and just like they did back in 2015 – it's just that both the program-users and our team's stewards are just botching their use."

The most recent iteration of the preceding, I actually had some time to retrofit some safeties. Having done so, the people that were whining about the tools being "broken" were chuffed and asked, "cool, when do we make the announcement of the updates?" I responded, "well, you don't really need to." At which point I was asked, "why do you want to hide these updates?" My response was, "there's a difference between 'not announcing updates' and 'hiding availability of updates'. Besides: I made no functional changes to the tools. The only thing that's changed is it's harder for people to use them in ways that will cause problems." In disbelief, I was asked, "if there's no functionality change, why did we waste time updating the tools?"


Muttering under my breath as I composed my reply, "because you guys were (erroneously) stating that the tools were broken and that they needed to be made more-friendly. The 'make more friendly' is now done. That said, announcing to people that are already successfully using them, 'the tools have been updated,' provides no value to those people: whether they continue using the pre-safeties versions or the updated versions, their jobs will continue to function without any need for changes. All that announcing will inevitably do is cause those people to harangue you about 'we updated: why aren't we seeing any functional enhancements'?"

Friday, August 9, 2019

So You Want to Access a FIPSed EL7 Host Via RDP

One of the joys of the modern, corporate security-landscape is that enterprises frequently end up locking down their internal networks to fairly extraordinary degrees. As software and operating system vendors offer new bolts to tighten, organizations will tend to tighten them - sometimes without considering the impact that tightening will have.

Several of my customers protect their networks not only with inbound firewalls, but firewalls that severely restrict outbound connectivity. Pretty much, their users' desktop systems can only access an external service if it's offered via HTTP/S. Similarly, their users' desktop systems are configured with application whitelisting enabled - preventing not only power users from installing software that requires privileged-access to install, but also ordinary users from installing things that are wholly constrained to their home directories. This kind of security-posture is suitable for the vast majority of users, but is considerably less so for developers.

The group I work for provides cloud-enablement services. This means that we are both developers and provide services to our customers' developers. Both for our own needs (when on-site) and for those of customers' developers, this has meant needing to have remote (cloud-hosted), "developer" desktops. The cloud service providers (CSPs) we and our customers use provide remote desktop solutions (e.g., AWS's "Workspaces"). However, these services are typically not usable at our customer sites due to the previously-mentioned network and desktop lockdowns: even if the local desktop has tools like RDP and SSH clients installed, those tools are only usable within the enterprises' internal networks; if the remote desktop offering is reachable via HTTP/S, it's typically through a widget that the would-be remote desktop user would install to their local workstation if application-whitelisting didn't prevent it.

To solve this problem for both our own needs (when on-site) and our customers' developers' needs, we stood up a set of remote (cloud-hosted), Windows-based desktops. To make them usable from locked-down networks, we employed Apache's Guacamole service. Guacamole makes remote Windows and Linux desktops available within a user's web browser.

Guacamole-fronted Windows desktops proved to be a decent solution for several years. Unfortunately, as the cloud wars heat up and CSPs try to find ways to bring - or force - customers into their datacenters, what was once a decent solution can become not decent - often due to pricing factors. Sadly, it appears that Microsoft may be trying to pump up Azure-adoption by increasing the price of cloud-hosted Windows services when those services are run in other CSPs' datacenters.

While we wait to see if and how this plays out, financially, we opted to see "can we find lower-cost alternatives to Windows-based (remote) developer desktops?" Most of our and our customers' developers are Linux-oriented - or at least Linux-comfortable: it was a no-brainer to see what we could do using Linux. Our Guacamole service already uses Linux-based containers to provide the HTTP/S-encapsulation for RDP and Guacamole natively supports the fronting of Linux-based graphical desktops via VNC. That said, given that the infrastructure is built around RDP, keeping communications RDP-based might ease some of the rearchitecting-process, even without Windows in the solution-stack.

Because our security guidance has previously required us to use "hardened" Red Hat and CentOS-based servers to host Linux applications, that was our starting-point for this process. This hardening almost always introduces "wrinkles" into deployment of solutions - usually because the software isn't SELinux-enabled or relies on kernel-bits that are disabled under FIPS mode. This time, the problem was FIPS mode.

While installing and using RDP on Linux has become a lot easier than it used to be (tools like XRDP now actually ship with SELinux policy-modules!), not all of the kinks are gone, yet. What I discovered, when starting on the investigation path, is that the XRDP installer for Enterprise Linux 7 isn't designed to work in FIPS mode. Specifically, when the installer goes to set up its encryption-keys, it attempts to do so using MD5-based methods. When FIPS mode is enabled on a Linux kernel, MD5 is disabled.

Fortunately, this only affects legacy RDP connections. The currently-preferred solution for RDP leverages TLS. TLS and its preferred ciphers and algorithms are all FIPS compatible. Further, even though the installer fails to set up the encryption keys, these keys are effectively optional: a file at the expected location for keys merely needs to exist, not actually be a valid key. This meant that the problem in the installer was trivially worked around by adding a `touch /etc/xrdp/rsakeys.ini` to the install process. Getting a cloud-hosted, Linux-based, graphical desktop ultimately becomes a matter of:

  1. Stand up a cloud-hosted Red Hat or CentOS 7 system
  2. Ensure that the "GNOME Desktop" and "Graphical Administration Tools" package-groups are installed (since, if your EL7 starting-point is like ours, no GUIs will be in the base system-image)
  3. Once those are installed, ensure that the system's default run-state has been set to "graphical.target". The installers for the "GNOME Desktop" package-group should have taken care of this for you. Check the run-level with `systemctl get-default`. If the installers for the "GNOME Desktop" package-group didn't properly set things, correct it by executing `systemctl set-default graphical.target`
  4. Ensure that the xrdp and tigervnc-server RPMs are installed
  5. Make sure that firewalld allows connections to the XRDP service by executing `firewall-cmd --add-port=3389/tcp --permanent`
  6. Similarly, ensure that whatever CSP-layer networking controls are present allow TCP port 3389 inbound to your XRDP-enabled Linux host.
  7. ...And if you want users of your Linux-based RDP host to be able to remotely access actual Windows-based servers, install Vinagre.
  8. Reboot to ensure everything is in place and running.
Once the above is done, you can test things out by RDPing into your new Linux host from a Windows host …and, if you've installed Vinagre, RDP from your new, XRDP-enabled Linux host to Windows host (for a nice case of RDP-inception).
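
Strung together, the host-side steps above might be scripted roughly as follows. Treat this as a sketch under assumptions rather than a tested installer: the function name `setup_xrdp_host` is made up, it assumes a repository that provides the `xrdp` RPM (e.g., EPEL) is already enabled, the `systemctl enable xrdp` and `firewall-cmd --reload` lines are my additions, and it does no error handling. The CSP-layer networking controls (step 6) and the reboot (step 8) are left out:

```shell
# Hypothetical wrapper around the steps above; run as root on an EL7 host
setup_xrdp_host() {
   # Step 2: desktop package-groups
   yum -y groupinstall "GNOME Desktop" "Graphical Administration Tools"

   # Step 3: boot to the graphical target
   systemctl set-default graphical.target

   # Step 4: the RDP and VNC bits
   yum -y install xrdp tigervnc-server

   # FIPS workaround: the key-file only needs to exist, not be valid
   touch /etc/xrdp/rsakeys.ini

   # Assumption: the service needs to be enabled explicitly
   systemctl enable xrdp

   # Step 5: open the RDP port in the host firewall
   firewall-cmd --add-port=3389/tcp --permanent
   firewall-cmd --reload
}
```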




Friday, July 19, 2019

Why I Default to The Old Ways

I work with a growing team of automation engineers. Most are purely dev types. Those that have lived in the Operations world, at all, skew heavily towards Windows or only had to very lightly deal with UNIX or Linux.

I, on the other hand, have been using UNIX flavors since 1989. My first Linux system was the result of downloading a distribution from the MIT mirrors in 1992. As a result, I have a lot of old habits (seriously: some of my habits are older than some of my teammates). And, because I've had to get deep into the weeds with all of those operating systems many, many, many times, over the years, those habits are pretty entrenched ("learned with blood" and all that rot).

A year or so ago, I'd submitted a PR that included some regex-heavy shell scripts. The person that reviewed the PR had asked "why are you using '[<space><TAB>]*' in your regexes rather than just '\s'?". At the time, I think my response was a semi-glib, "A) old habits die hard; and, B) I know that the former method always works".

That said, I am a lazy-typist. Typing "\s" is a lot fewer keystrokes than is "[<space><TAB>]*". Similarly, "\s" takes up a lot less in the way of column-width than does "[<space><TAB>]*" (and I/we generally like to code to fairly standard page-widths). So, for both laziness reasons and column-conservation reasons, I started to move more towards using "\s" and away from using "[<space><TAB>]*".  I think in the last 12 months, I've moved almost exclusively to  "\s".

Today, that move bit me in the ass. Well, yesterday, actually, because that's when I started receiving reports that the tool I'd authored on EL7 wasn't working when installed/used on EL6. Ultimately, I traced the problem to an `awk` invocation. Specifically, I had a chunk of code (filtering DNS output) that looked like:

awk '/\sIN SRV\s/{ printf("%s;%s\n",$7,$8)}'

Which worked a treat on EL7 but on EL6, "not so much." When I altered it to the older-style invocation:

awk '/[  ]*IN[  ]*SRV[  ]*/{ printf("%s;%s\n",$7,$8)}'

It worked fine on both EL7 and EL6. Turns out the ancient version of `awk` (3.1.7) on EL6 didn't know how to properly interpret the "\s" token. Oddly, my recollection (from writing other tooling) is that EL6's version of `grep` understands the "\s" token just fine.
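
For what it's worth, a middle-ground spelling exists that avoids both the GNU-ism and hard-to-see literal TABs: awk's own `\t` escape inside a bracket-expression. The following is only a sketch (the `srv_targets` function name and the sample record are made up), but `[ \t]` should be understood by old and new awk versions alike:

```shell
# Hypothetical helper: print "port;target" for each SRV record read on stdin
srv_targets() {
   # [ \t] matches a single space or TAB without relying on "\s"
   awk '/[ \t]IN[ \t]+SRV[ \t]/{ printf("%s;%s\n",$7,$8) }'
}
```

Feeding it a record such as `_ldap._tcp.example.com. 300 IN SRV 0 100 389 dc1.example.com.` yields the port and target joined by a semicolon.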

When I Slacked a link to the PR, with a "see: this is why" note, to the person I'd had the original conversation with, he replied, "oh: I never really used awk, so never ran into it".

Wednesday, July 17, 2019

Crib-Notes: Manifest Deltas

Each month, the group I work for publishes new CentOS and Red Hat AMIs (and Azure templates and Vagrant boxes). When we complete the publication-event, we post a news announcement to our user-portal so that subscribers can receive an alert of the new publication. Included in that news announcement is a "what's changed" section.

In prior months, figuring out "what changed" was left as a manual step for the team-member charged with running the automation for a given month's publication event. This month, no one generated that news article and there were several updated and new RPMs included in the new image. So, I set about figuring out "how to extract this information programmatically so as to more-easily suss-out what to include in the announcement posting." The following does so (though, presumably, in a not-particularly-optimized fashion):
git diff $(
      git log --pretty='%H' --follow -- <PATH_TO_MANIFEST_FILE> | \
      head -2 | \
      tac | \
      sed 'N;s/\n/../'
   ) -- <PATH_TO_MANIFEST_FILE> | \
grep -E '(amazon|aws|ec2)-' | \
sed 's/^./& /' | \
sort -k 2
To explain:
  1. Use `git log` to output the commit-hashes for all the commits for the target file (in this case, the project's manifest-file)
  2. Use `head -2` to grab only the two most-recent commit hashes from the output-stream
  3. Use the `tac` command to invert the order of the two lines returned from the `head` command
  4. Use the `sed` command to join the two lines, replacing the first line's line-ending newline character with ".."
  5. Use `git diff` against the output created in steps 1-4, and constrain the diff-activity to just the manifest-file.
  6. Pipe that output through `grep` to suppress all information other than the bits containing the `amazon-`, `aws-` and `ec2-` substrings.
  7. Pipe that through `sed` so that the +/- that `git diff` uses to show added and removed lines, respectively, becomes an easily-tokenized substring.
  8. Sort the remaining output-stream (with `sort`) so that the lines are grouped by manifest-element (the second key/token in the sorted output)
Taking that output and converting to a news article is still manual, but it at least makes it a lot easier to do than either hand-diffing two files or having to "just know" what's changed.
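
The least-obvious part of the pipeline is the `head`/`tac`/`sed` trio that builds the commit-range, so here it is pulled out on its own. This is just an illustrative sketch; the `commit_range` function name is made up, and it relies on GNU `tac` and `sed`:

```shell
# Hypothetical helper: read commit-hashes (newest first) on stdin and emit
# "<older>..<newer>" for the two most-recent ones
commit_range() {
   head -2 | tac | sed 'N;s/\n/../'
}
```

With that defined, the first half of the one-liner could be written as `git diff "$( git log --pretty='%H' --follow -- <PATH_TO_MANIFEST_FILE> | commit_range )" -- <PATH_TO_MANIFEST_FILE>`.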

Notes

Because Red Hat has placed EL6 in its final stage of de-support, we've stopped publishing CentOS6 and RHEL6. We did this to discourage our subscribers from doing new deployments on EL6 (since the underlying platform will go into final de-support come November of this year).

Similarly, due to the current lack of a CentOS offering for EL8, the lack of security-related build- or hardening-guidance for EL8 and the associated lack of subscriber-demand for an EL8 build, we don't yet include builds for CentOS8 or RHEL8 in our process. Thus, for the time being, we only need to provide a "what's changed" for EL7 builds. Given this, we currently only need to do change-queries against the "manifests/spel-minimal-centos-7-hvm.manifest.txt" file.

Thursday, June 20, 2019

Crib-Notes: EC2 UserData Audit

Sometimes, I find that I'll return to a customer/project and forget what's "normal" for them in how they deploy their EC2s. If I know a given customer/project tends to deploy EC2s that include UserData, but they don't keep good records of what they tend to do for said UserData, I find the following BASH scriptlet to be useful for getting myself back into the swing of things:

for INSTANCE in $( aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' | \
                   sed -e '/^\[/d' -e '/^]/d' -e 's/^ *"//' -e 's/".*//' )
do
   printf "###############\n# %s\n###############\n" "${INSTANCE}"
   aws ec2 describe-instance-attribute --instance-id "${INSTANCE}" --attribute userData | \
   jq -r .UserData.Value | base64 -d
   echo
done | tee /tmp/DiceLab-EC2-UserData.log

To explain, what the above does is:
  1. Initiates a for-loop using ${INSTANCE} as the iterated-value
  2. With each iteration, the value injected into ${INSTANCE} is derived from a line of output from the aws ec2 describe-instances command. Normally, this command outputs a JSON document containing a bunch of information about each instance in the account-region. Using the --query option, the output is constrained to only output each EC2 instance's InstanceId value. This is then piped through sed so that the extraneous characters are removed, resulting in a clean list of EC2 instance-IDs.
  3. The initial printf line creates a bit of an output-header. This will make it easier to pore through the output and keep each iterated instance's individual UserData content separate
  4. Instance UserData is considered to be an attribute of a given EC2 instance. The aws ec2 describe-instance-attribute command is what is used to actually pull this content from the target EC2. I could have used a --query filter to constrain my output. However, I instead chose to use jq as it allows me to both constrain my output as well as do output-cleanup, eliminating the need for the kind of complex sed statement I used in the loop initialization (cygwin's jq was crashing this morning when I was attempting to use it in the loop-initialization phase - in case you were wondering about the inconsistent constraint/cleanup methods). Because the UserData output is stored as a BASE64-encoded string, I have to pipe the cleaned-up output through the base64 utility to get my plain-text data back.
  5. I inject a closing blank line into my output stream (via the echo command) to make the captured output slightly easier to scan.
  6. I like to watch my scriptlet's progress, but still like to capture that output into a file for subsequent perusal, thus I pipe the entire loop's output through tee so I can capture as I view.
I could have set it up so that each instance's data was dumped to an individual output-file. This would have saved the need for the printf and echo lines. However, I like having one, big file to peruse (rather than having to hunt through scads of individual files) ...and a single file-open/close action is marginally faster than scads of open/closes.

In an account-region that had hundreds of EC2s, I'd probably have been more selective with which instance-IDs I initiated my loop. I would have used a --filter statement in my aws ec2 describe-instances command - likely filtering by VPC-ID and one or two other selectors.
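
A sketch of what that more-selective variant might look like, leaning on `--filters` plus `--query`/`--output text` instead of `sed` and `jq`. The function names are hypothetical and the VPC-ID is whatever you pass in; the "None" check assumes that is how the AWS CLI's text output renders an instance with no UserData:

```shell
# Hypothetical helper: list the instance-IDs in one VPC, one per line
list_vpc_instances() {
   aws ec2 describe-instances \
      --filters "Name=vpc-id,Values=$1" \
      --query 'Reservations[].Instances[].InstanceId' \
      --output text | tr '\t' '\n'
}

# Hypothetical helper: print the decoded UserData for one instance
show_userdata() {
   ENCODED=$( aws ec2 describe-instance-attribute --instance-id "$1" \
      --attribute userData --query 'UserData.Value' --output text )

   # Skip instances that have no UserData rather than feeding junk to base64
   if [ -n "${ENCODED}" ] && [ "${ENCODED}" != "None" ]
   then
      printf '%s' "${ENCODED}" | base64 -d
   fi
}
```

The loop then becomes `for INSTANCE in $( list_vpc_instances <VPC_ID> )`, with `show_userdata "${INSTANCE}"` standing in for the `describe-instance-attribute` pipeline.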

Tuesday, May 7, 2019

Crib-Notes: Offline Delta-Syncs of S3 Buckets

In the normal world, synchronizing two buckets is as simple as doing `aws s3 sync <SYNC_OPTIONS> <SOURCE_BUCKET> <DESTINATION_BUCKET>`. However, due to the information security needs of some of my customers, it's occasionally necessary to perform data-synchronizations between two S3 buckets, but using methods that amount to "offline" transfers.

To illustrate what is meant by "offline":
  1. Create a transfer-archive from a data source
  2. Copy the transfer-archive across a security boundary
  3. Unpack the transfer-archive to its final destination

Note that things are a bit more involved than the summary of the process – but this gives you the gist of the major effort-points.

The first time you do an offline bucket sync, transferring the entirety of a bucket is typically the goal. However, for a refresh-sync – particularly for a bucket of greater than a trivial content-size – this can be sub-ideal. For example, it might be necessary to do monthly syncs of a bucket that grows by a few gigabytes per month. After a year, a full sync can mean having to move tens to hundreds of gigabytes. A better way is to only sync the deltas – copying only what's changed between the current and immediately-prior sync-tasks (a few gigabytes rather than tens to hundreds).

The AWS CLI tools don't really have a "sync only the files that have been added/modified since <DATE>" option. That said, it's not super difficult to work around that gap. A simple shell script like the following works a trick:

for FILE in $( aws s3 ls --recursive s3://<SOURCE_BUCKET>/  | \
   awk '$1 > "2019-03-01 00:00:00" {print $4}' )
do
   echo "Downloading ${FILE}"
   install -bDm 000644 <( aws s3 cp "s3://<SOURCE_BUCKET>/${FILE}" - ) \
     "<STAGING_DIR>/${FILE}"
done

To explain the above:

  1. Create a list of files to iterate:
    1. Invoke a subprocess using the $() notation. Within that subprocess...
    2. Invoke the AWS CLI's S3 module to recursively list the source-bucket's contents (`aws s3 ls --recursive`)
    3. Pipe the output to `awk` – looking for any line whose first output-column (the file-modification date column of `s3 ls`) is newer than the given date-string – and print out only the fourth column (the S3 object-path)
    The output from the subprocess is captured as an iterable list-structure
  2. Use a for loop-method to iterate the previously-assembled list, assigning each S3 object-path to the ${FILE} variable
  3. Since I hate sending programs off to do things in silence (I don't trust them to not hang), my first looped-command is to say what's happening via the echo "Downloading ${FILE}" directive.
  4. The install line makes use of some niftiness within both BASH and the AWS CLI's S3 command:
    1. By specifying "-" as the "destination" for the file-copy operation, you tell the S3 command to write the fetched object-contents to STDOUT.
    2. BASH allows you to take a stream of output and assign a file-handle to it by surrounding the output-producing command with <( ).
    3. Invoking the install command with the -D flag tells the command to "create all necessary path-elements to place the source 'file' in the desired location within the filesystem, even if none of the intervening directory structure exists, yet."
    Putting it all together, the install operation takes the streamed s3 cp output, and installs it as a file (with mode 000644) at the location derived from the STAGING_DIR plus the S3 object-path ...thus preserving the SOURCE_BUCKET's content-structure within the STAGING_DIR
Obviously, this method really only works for additive/substitutive deltas. If you need to account for deletions and/or moves, this approach will be insufficient.
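
Two other things the simple loop doesn't cope with are object-paths that contain spaces (both the `for` loop and the `{print $4}` split on them) and objects modified on the cutoff date itself (the comparison only looks at the date column, not the time). A hedged sketch of a filter that handles both follows; the `keys_newer_than` name is made up, and it assumes the `date time size key` column layout that `aws s3 ls --recursive` prints:

```shell
# Hypothetical filter: read `aws s3 ls --recursive` output on stdin and print
# the keys modified after the cutoff passed as "YYYY-MM-DD HH:MM:SS"
keys_newer_than() {
   awk -v cutoff="$1" '($1 " " $2) > cutoff {
      # Strip the date, time and size columns; the remainder is the key
      sub(/^[^ ]+ +[^ ]+ +[0-9]+ /, "")
      print
   }'
}
```

Piping the listing through `keys_newer_than "2019-03-01 00:00:00"` and into a `while IFS= read -r FILE` loop (in place of the `for` loop) keeps each object-path intact as a single value.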

Wednesday, April 24, 2019

Crib-Notes: End-to-End SSL Within AWS

In general, when using an Elastic Load-Balancer (ELB) to do SSL-encrypted proxying for an AWS-hosted, Internet-facing application, it's typically sufficient to simply do SSL-termination at the ELB and call it a day. That said:
  • If you're paranoid, you can ensure that only the proxied EC2(s) and the ELB are able to communicate with each other via security groups.
  • If you're under compliance-requirements (e.g., PCI/DSS), you can enable end-to-end SSL such that:
    1. Connections between Internet-based client and the ELB are encrypted
    2. Connections between the ELB and the application-hosting EC2(s) are encrypted
Either/both amount to a "belt and suspenders" approach. Other than "meeting policy", security isn't meaningfully improved: given AWS's overarching security-design, even if someone else has access to your application-EC2(s) and ELB's VPC, they won't be able to sniff packets/data – encrypted or not.

Technical need aside... Implementing end-to-end SSL is trivial:
  • ACM allows easy provisioning of SSL certificates for the ELB (with the security-bonus of automatically rotating said certificates). 
  • You can use very generic, self-signed certificates on your application-hosting EC2s:
    • The certificate's Subject doesn't matter
    • The certificate's validity window doesn't matter (no need to worry about rotating certificates that have expired)
Thus setup comes down to:
  1. Create an EC2-hosted application/service:
    1. Launch EC2
    2. Install HTTPS-capable application
    3. Generate a self-signed certificate (setting the -days to as little as 1 day). Example (using the OpenSSL utility):

      openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
         -keyout peer.key -out peer.crt

      When prompted for input, just hit the <RETURN> key (this will create a cert with defaulted values ...which, as noted previously, don't really have bearing on the ELB's trust of the certificate). Similarly, one can wholly omit the -days 1 flag and value – the default certificate will be valid for 30 days (but, ELB doesn't care about the validity time-window).
    4. Configure the HTTPS-capable application to load the certificate
    5. Configure the EC2's host-based firewall to allow connections to whatever port the application listens on for SSL-protected connections
    6. Configure the EC2's security group to allow connections to whatever port the application listens on for SSL-protected connections
  2. Create an ELB:
    1. Set the ELB to listen for SSL-based connection-requests (using a certificate from ACM or IAM)
    2. Set the ELB to forward connections using the HTTPS protocol to connect to the target EC2(s) over whatever port the application listens on for SSL-protected connections
    3. Ensure the ELB's healthcheck is requesting a suitable URL to establish the health of the application 

Once the ELB's healthcheck goes green, it should be possible to connect to the EC2-hosted application via SSL. If one wants to verify the encryption-state of the connection between the ELB and EC2(s), one would need to log in to the EC2(s) and sniff the inbound packets (e.g., by using a tool like Wireshark).
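
If the certificate-generation step needs to run unattended (e.g., from a launch script), the prompts can be skipped by supplying a subject on the command line. A minimal sketch, assuming the OpenSSL CLI is installed; the `make_peer_cert` function name and the "placeholder" CN are arbitrary since, as noted above, the Subject doesn't matter:

```shell
# Hypothetical helper: make_peer_cert <directory>
# Writes peer.key and peer.crt into the directory without prompting
make_peer_cert() {
   openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
      -subj "/CN=placeholder" \
      -keyout "$1/peer.key" -out "$1/peer.crt" 2>/dev/null
}
```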

Wednesday, April 17, 2019

Crib-Notes: Validating Consistent ENA and SRIOV Support in AMIs

On one of the contracts I work on, we're responsible for producing the AMIs used for the entire enterprise. At this point, the process is heavily automated. Basically, we use a pipeline that leverages some CI tools, Packer and a suite of BASH scripts to do all the grunt-work and produce not only an AMI but artifacts like AMI configuration- and package-manifests.

When we first adopted Packer, it had some limitations on how it registered AMIs (or, maybe, we just didn't find the extra flags back when we first selected Packer – who knows, it's lost to the mists of time at this point). If you wanted the resultant AMIs to have ENA and/or SRIOV support baked in (we do), your upstream AMI needed to have it baked in as well. This necessitated creating our own "bootstrap" AMIs as you couldn't count on these features being baked in – not even within the upstream vendor's (in our case, Red Hat's and CentOS's) AMIs.

At any rate, because the overall process has been turned over from the people that originated the automation to people that basically babysit automated tasks, the people running the tools don't necessarily have a firm grasp of everything that the automation's doing. Further, the people that are tasked with babysitting the automation differ from run-to-run. While automation should see to it that this doesn't matter, sometimes it pays to be paranoid. So a quick way to assuage that paranoia is to run quick reports from the AWS CLI. The following snippet makes for an adequate, "fifty-thousand foot" consistency-check:

aws ec2 describe-images --owners <AWS_ACCOUNT_ID> --query \
      'Images[].[ImageId,Name,EnaSupport,SriovNetSupport]' \
      --filters 'Name=name,Values=<SEARCH_STRING_PATTERN>' \
      --output text | \
   awk 'BEGIN {
         printf("%-18s%-60s%-14s%-10s\n","AMI ID","AMI Name","ENA Support","SRIOV Support")
      } {
         printf("%-18s%-60s%-14s%-10s\n",$1,$2,$3,$4)
      }'
  • There are a lot of organizations and individuals publishing AMIs. Thus, we use the --owners flag to search only for AMIs we've published.
  • We produce a couple of different families of AMIs. Thus, we use the --filters statement to only show the subset of our AMIs we're interested in.
  • I really only care about four attributes of the AMIs being reported on: ImageId, Name, EnaSupport and SriovNetSupport. Thus, the use of the JMESPath --query statement to suppress all output except for that in which I'm interested.
  • Since I want the output to be pretty, I used the compound awk statement to create a formatted header and apply the same formatting to the output from the AWS CLI (using but a tiny bit of the printf routine's many capabilities).

This will produce output similar to:

   AMI ID                 AMI Name                                        ENA Support  SRIOV Support
   ami-187af850f113c24e1  spel-minimal-centos-7-hvm-2019.03.1.x86_64-gp2  True         simple
   ami-91b38c446d188643e  spel-minimal-centos-7-hvm-2019.02.1.x86_64-gp2  True         simple
   ami-22867cf08bb264ac4  spel-minimal-centos-7-hvm-2019.01.1.x86_64-gp2  True         simple
   [...elided...]
   ami-71c3822ed119c3401  spel-minimal-centos-7-hvm-2018.03.1.x86_64-gp2  None         simple
   [...elided...]
   ami-8057c2bf443dc01f5  spel-minimal-centos-7-hvm-2016.06.1.x86_64-gp2  None         None

As you can see, not all of the above AMIs are externally alike. While this could indicate a process or personnel problem, what my output actually shows is evolution in our AMIs. Originally, we weren't doing anything to support SRIOV or ENA. Then we added SRIOV support (because our AMI users were finally asking for it). Finally, we added ENA support (mostly so we could use the full range and capabilities of the fifth-generation EC2 instance-types).

At any rate, running a report like the above, we can identify whether there are unexpected differences and, if a sub-standard AMI slips out, we can alert our AMI users "don't use <AMI> if you have need of ENA and/or SRIOV".
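
A report like this can also be filtered mechanically rather than eyeballed. The following is a minimal sketch, not part of our actual tooling: it assumes the four-column, tab-separated text that the describe-images query above emits, and it assumes that "standard" means EnaSupport is True and SriovNetSupport is simple. The function name flag_substandard_amis is hypothetical and the sample rows fed to it are only there for illustration.

```shell
#!/bin/bash
# Sketch: print only the AMIs lacking ENA and/or SRIOV support.
# Assumes tab-separated input: ImageId, Name, EnaSupport, SriovNetSupport
# (the column order produced by the --query shown earlier).
flag_substandard_amis() {
   awk -F '\t' '$3 != "True" || $4 != "simple" {
      printf("%-24s%-50s%-14s%-10s\n", $1, $2, $3, $4)
   }'
}

# Illustrative rows only; in practice, pipe the output of
# "aws ec2 describe-images ... --out text" into the function instead.
printf '%s\t%s\t%s\t%s\n' \
   ami-187af850f113c24e1 spel-minimal-centos-7-hvm-2019.03.1.x86_64-gp2 True simple \
   ami-71c3822ed119c3401 spel-minimal-centos-7-hvm-2018.03.1.x86_64-gp2 None simple \
   ami-8057c2bf443dc01f5 spel-minimal-centos-7-hvm-2016.06.1.x86_64-gp2 None None |
   flag_substandard_amis
```

With the sample rows, only the 2018.03.1 and 2016.06.1 AMIs are printed. An empty result means every AMI matched by the search pattern carries both features.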

Saturday, April 6, 2019

Crib-Notes: Tracing Permissions for an Instance Role That Uses Inline Policies

The rationale for this crib-note is essentially identical to that for the previous topic on tracing IAM permissions in instance-roles that use managed-policies. So I'm just going to crib the rest of the intro...

When working with AWS EC2 instances – particularly when automating the deployments and lifecycles thereof – it's common to make use of the AWS IAM system's Instance Role Policy feature. Occasionally, you might get asked, "what permissions have been given to that instance." To answer, you might use AWS's IAM web console or the IAM CLI. In the former case, to get the information to the requestor, you're left with the options of either taking screen-shots or trying to copy and paste the text from the UI to your email/slack/etc. reply. In the latter case, you can just dump out the JSON-formatted policy-document that enumerates the permissions.

Generally, I prefer the CLI option since I don't have to worry "what does the recipient need to do with the results". Just dumping a text file, I needn't worry about being yelled at that the contents of a screen-shot can't be used to easily create another, similar policy. It's also easy to simply redirect the output to a file and attach it to whatever media I'm responding to ...or, if mail is enabled within the CLI environment, simply pipe the output directly to a CLI-based email tool and save a step in the process (laziness for the win!).

But, how to go from "there's an instance in this account: what privileges does it have" to "here's what it can do?" Basically, it's a three-step process:

  1. Get the name of the Instance Role attached to the EC2 instance. This can be done with a method similar to:

    $ aws ec2 describe-instances --instance-id <INSTANCE_ID> \
          --query 'Reservations[].Instances[].IamInstanceProfile[].Arn[]' --out text | \
      sed 's/^.*arn:.*profile\///'

  2. Get the list of inline policies attached to the role. This can be done with a method similar to:

    $ aws iam list-role-policies --role-name <OUTPUT_FROM_PREVIOUS>

  3. Get the list of permissions associated with the instance role's inline policy/policies. This can be done with a method similar to:

    $ aws iam get-role-policy --role-name <OUTPUT_FROM_STEP_1> --policy-name <OUTPUT_FROM_PREVIOUS>

The above steps will dump out the full IAM policy/permissions of the queried inline policy:

  • If the inline policy was the only policy attached to the role, the output will show all of the permissions the instance role grants any EC2s it is attached to.
  • If the inline policy was not the only inline policy attached to the role, it will be necessary to iterate over the remaining policies attached to the role to get the aggregated permission-set.
To facilitate the iteration, one can use a script similar to the following to encapsulate the second and third steps from the prior process description:


#!/bin/bash
#
# Script to dump out all permissions granted through an IAM role with multiple
# inline IAM policies attached.
###########################################################################

if [[ $# -eq 0 ]]
then
   echo "Usage: ${0} <instance-id>" >&2
   exit 1
fi

PROFILE_NAME=$( aws ec2 describe-instances --instance-id "${1}" \
   --query 'Reservations[].Instances[].IamInstanceProfile[].Arn' --out text | \
  tr '\t' '\n' | sort -u | sed 's/^.*arn:.*profile\///' )

POLICY_LIST_RAW=$( aws iam list-role-policies --role-name "${PROFILE_NAME}" )
POLICY_LIST_CLN=($( echo "${POLICY_LIST_RAW}" | jq -r '.PolicyNames[]' ))

for ITER in $( seq 0 $(( ${#POLICY_LIST_CLN[@]} - 1 )) )
do
  aws iam get-role-policy --role-name "${PROFILE_NAME}" \
    --policy-name "${POLICY_LIST_CLN[${ITER}]}"
done
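
The sed expression that trims the ARN down to the instance-profile name can be checked without touching AWS. A minimal sketch, using a made-up account number and profile name; the helper name profile_name_from_arn is hypothetical:

```shell
#!/bin/bash
# Sketch: reduce an instance-profile ARN to the profile's name, using the
# same sed expression as the script above.
profile_name_from_arn() {
   printf '%s\n' "${1}" | sed 's/^.*arn:.*profile\///'
}

# Made-up ARN, for illustration only.
profile_name_from_arn 'arn:aws:iam::123456789012:instance-profile/my-app-profile'
```

This prints my-app-profile. Note that the value is the instance profile's name, so handing it to --role-name only works when the profile and its role share a name (which is what the AWS console sets up when it creates a role for EC2). Where the names differ, the role name has to be looked up from the profile first.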

Friday, April 5, 2019

Crib-Notes: Tracing Permissions for an Instance Role That Uses Managed Policies

When working with AWS EC2 instances – particularly when automating the deployments and lifecycles thereof – it's common to make use of the AWS IAM system's Instance Role Policy feature. Occasionally, you might get asked, "what permissions have been given to that instance." To answer, you might use AWS's IAM web console or the IAM CLI. In the former case, to get the information to the requestor, you're left with the options of either taking screen-shots or trying to copy and paste the text from the UI to your email/slack/etc. reply. In the latter case, you can just dump out the JSON-formatted policy-document that enumerates the permissions.

Generally, I prefer the CLI option since I don't have to worry "what does the recipient need to do with the results". Just dumping a text file, I needn't worry about being yelled at that the contents of a screen-shot can't be used to easily create another, similar policy. It's also easy to simply redirect the output to a file and attach it to whatever media I'm responding to ...or, if mail is enabled within the CLI environment, simply pipe the output directly to a CLI-based email tool and save a step in the process (laziness for the win!).

But, how to go from "there's an instance in this account: what privileges does it have" to "here's what it can do?" Basically, it's a four-step process:

  1. Get the name of the Instance Role attached to the EC2 instance. This can be done with a method similar to:
    $ aws ec2 describe-instances --instance-id <INSTANCE_ID> \
        --query 'Reservations[].Instances[].IamInstanceProfile[].Arn[]' --out text | \
      sed 's/^.*arn:.*profile\///'
    $ aws iam get-instance-profile --instance-profile-name <OUTPUT_FROM_PREVIOUS> \
        --query 'InstanceProfile.Roles[].RoleName' --out text
     
  2. Get the list of policies attached to the role. This can be done with a method similar to:
    $ aws iam list-attached-role-policies --role-name <OUTPUT_FROM_PREVIOUS> \
        --query 'AttachedPolicies[].PolicyArn[]' --out text
     
  3. Find the current version of the attached policy/policies. This can be done with a method similar to:
    $ aws iam get-policy --policy-arn <OUTPUT_FROM_PREVIOUS> \
        --query 'Policy.DefaultVersionId' --out text
    
  4. Get the contents of the active version of the attached policy/policies. This can be done – using the outputs from steps #2 and #3 – with a method similar to:
    $ aws iam get-policy-version --policy-arn <OUTPUT_FROM_STEP_2> \
         --version-id <OUTPUT_FROM_STEP_3>
     
The above steps will dump out the full IAM policy/permissions of the queried managed-policy:
  • If the managed policy was the only policy attached to the role, the output will show all of the permissions the instance role grants any EC2s it is attached to.
  • If the managed policy was not the only managed policy attached to the role, it will be necessary to iterate over the remaining policies attached to the role to get the aggregated permission-set.
To facilitate the iteration, one can use a script similar to the following to encapsulate the steps from the prior process description:

#!/bin/bash
#
# Script to dump out all permissions granted through an IAM role with multiple
# managed IAM policies attached.
###########################################################################

if [[ $# -eq 0 ]]
then
   echo "Usage: ${0} <instance-id>" >&2
   exit 1
fi

PROFILE_NAME="$( aws ec2 describe-instances --instance-id "${1}" \
  --query 'Reservations[].Instances[].IamInstanceProfile[].Arn[]' \
  --out text | sed 's/^.*arn:.*profile\///' )"
ROLE_NAME="$( aws iam get-instance-profile \
  --instance-profile-name "${PROFILE_NAME}" \
  --query 'InstanceProfile.Roles[].RoleName' \
  --out text )"
ATTACHED_POLICIES=($( aws iam list-attached-role-policies \
  --role-name "${ROLE_NAME}" --query 'AttachedPolicies[].PolicyArn[]' | \
  jq .[] | sed 's/"//g' ))

for ITER in $( seq 0 $(( ${#ATTACHED_POLICIES[@]} - 1 )) )
do
   POLICY_VERSION=$( aws iam get-policy --policy-arn \
     "${ATTACHED_POLICIES[${ITER}]}" --query \
     'Policy.DefaultVersionId' --out text )
   aws iam get-policy-version --policy-arn "${ATTACHED_POLICIES[${ITER}]}" \
     --version-id "${POLICY_VERSION}"
done
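
The script above leans on jq and sed to turn the JSON list of policy ARNs into a bash array. As an alternative sketch (a different technique, not what the script above does), the same list can be requested with --out text, which returns the ARNs whitespace-separated on one line, and then split with bash's read -ra. The helper name split_policy_arns is hypothetical and the two sample ARNs are only there for illustration:

```shell
#!/bin/bash
# Sketch: split whitespace-separated policy ARNs (as emitted by
# "--query 'AttachedPolicies[].PolicyArn' --out text") into an array,
# then print one ARN per line.
split_policy_arns() {
   local -a arns
   read -ra arns <<< "${1}"
   printf '%s\n' "${arns[@]}"
}

# Illustrative input: two AWS-managed policy ARNs separated by a tab.
split_policy_arns "$( printf '%s\t%s' \
   arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
   arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy )"
```

Each ARN comes out on its own line, ready to be fed to a while-read loop or captured into an array, which drops the dependency on jq for this step.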