Thursday, December 5, 2019

Seriously Jenkins^H^H^H^H I Already Used That

A number of months ago, I delivered a set of CloudFormation templates and Jenkins pipelines to drive them. Recently, I was brought back onto the project to help them clean some things up. One of the questions I was asked was, "is there any way we can reduce the number of parameters the Jenkins jobs require?"

While I'd originally developed the pipelines on a Jenkins server that had the "Rebuild" plugin, the Jenkins servers they were trying to use didn't have that plugin. Thus, in order to re-run a Jenkins job, they had two choices: use the built-in "replay" option or the built-in "build with parameters" option. The former precludes the ability to change parameter values. The latter means that you have to repopulate all of the parameter values. When a Jenkins job has only a few parameters, using the "build with parameters" option is relatively painless. When you start topping five parameters, it becomes more and more painful to use when all you want to do is tweak one or two values.

Unfortunately, for the sake of portability across this customer's various Jenkins domains, my pipelines require a minimum of four parameters just to enable tailoring for a specific Jenkins domain's environmental uniqueness. Yeah, you'd think that the various domains' Jenkins services would be sufficiently identical to not require this ...but we don't live in a perfect world. Apparently, even though the same group owns three of the domains in use, each deployment is pretty much wholly unlike the others.


At any rate... I replied back, "I can probably make it so that the pipelines read the bulk of their parameters from an S3-hosted file, but it will take me some figuring out. Once I do, you should only need to specify which Jenkins stored-credentials to use and the S3 path of the parameter file". Yesterday, I set about figuring out how to do that. It was, uh, beastly.

At any rate, what I found was that I could store parameter/value-pairs in a plain-text file posted to S3. I could then stream-down that file and use a tool like awk to extract the values and assign them to variables. The only problem is, I like to segment my Jenkins pipelines ...and it's kind of painful (in much the same way that rubbing ghost peppers into an open wound is "kind of" painful) to make variables set in one job-stage available in another job-stage. Ultimately, what I came up with was code similar to the following (I'm injecting explanation within the job-skeleton to hopefully make things easier to follow):

pipeline {

    agent any

    […elided…]

    environment {
        AWS_DEFAULT_REGION = "${AwsRegion}"
        AWS_SVC_ENDPOINT = "${AwsSvcEndpoint}"
        AWS_CA_BUNDLE = '/etc/pki/tls/certs/ca-bundle.crt'
        REQUESTS_CA_BUNDLE = '/etc/pki/tls/certs/ca-bundle.crt'
    }

My customer operates in a couple of different AWS partitions. The environment{} block customizes the job's behavior so that it can work across the various partitions. Unfortunately, I can't really hard-code those values and still maintain portability. Thus, those values are populated from the following parameters{} section:
   parameters {
         string(name: 'AwsRegion', defaultValue: 'us-east-1', description: 'Amazon region to deploy resources into')
         string(name: 'AwsSvcEndpoint',  description: 'Override the AWS service-endpoint as necessary')
         string(name: 'AwsCred', description: 'Jenkins-stored AWS credential with which to execute cloud-layer commands')
         string(name: 'ParmFileS3location', description: 'S3 URL for parameter file (e.g., "s3:///")')
    }

The parameters{} section allows a pipeline-user to specify environment-appropriate values for the AwsRegion, AwsSvcEndpoint and AwsCred used to govern the behavior of the AWS CLI utilities. Yes, there are plugins available that would obviate the need for the AWS CLI but, as with other plugins, I can't rely on the more-advanced AWS-related ones being universally available. Thus, I have to rely on the AWS CLI, since that actually is available in all of their Jenkins environments. Were it not for the need to work across AWS partitions, I could have made the pipeline require only a single parameter: ParmFileS3location.

What follows is the stage that prepares the run-environment for the rest of the Jenkins job:
    stages {
        stage ('Push Vals Into Job-Environment') {
            steps {
                // Make sure work-directory is clean //
                deleteDir()

                // Fetch parm-file
                withCredentials([[
                    $class: 'AmazonWebServicesCredentialsBinding',
                    accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                    credentialsId: "${AwsCred}",
                    secretKeyVariable: 'AWS_SECRET_ACCESS_KEY'
                ]]) {
                    sh '''#!/bin/bash
                        # For compatibility with ancient AWS CLI utilities
                        if [[ -n "${AWS_SVC_ENDPOINT:-}" ]]
                        then
                           AWSCMD="aws s3 --endpoint-url s3.${AWS_SVC_ENDPOINT}"
                        else
                           AWSCMD="aws s3"
                        fi
                        ${AWSCMD} --region "${AwsRegion}" cp "${ParmFileS3location}" Pipeline.envs
                    '''
                }
                // Populate job-env from parm-file
                script {
                    def GitCred = sh script:'awk -F "=" \'/GitCred/{ print $2 }\' Pipeline.envs',
                        returnStdout: true
                    env.GitCred = GitCred.trim()

                    def GitProjUrl = sh script:'awk -F "=" \'/GitProjUrl/{ print $2 }\' Pipeline.envs',
                        returnStdout: true
                    env.GitProjUrl = GitProjUrl.trim()

                    def GitProjBranch = sh script:'awk -F "=" \'/GitProjBranch/{ print $2 }\' Pipeline.envs',
                        returnStdout: true
                    env.GitProjBranch = GitProjBranch.trim()

                    […elided…]
                }
            }
        }

The above stage-definition has three main steps:
  1. The deleteDir() statement ensures that the workspace assigned on the Jenkins agent-node doesn't contain any content left over from prior runs. Leftovers can have bad effects on subsequent runs. Bad juju.
  2. The shell invocation is wrapped in a call to the Jenkins credentials-binding plugin (and the CloudBees AWS helper-plugin). Wrapping the shell-invocation this way allows the contained call to the AWS CLI to work as desired. Worth noting:

    • The credentials-binding plugin is a default Jenkins plugin
    • The CloudBees AWS helper-plugin is not

    If the CloudBees plugin is missing, the above won't work. Fortunately, that's one of the optional plugins they do seem to have in all of the Jenkins domains they're using.
  3. The script{} section does the heavy lifting of pulling values from the downloaded parameters file and making those values available to subsequent job-stages
The really important part to explain is the script{} section, as the prior two are easily understood from either the Jenkins pipeline documentation or the innumerable Google-hits you'd get on a basic search. Basically, for each parameter that I need to extract from the parameter file and make available to subsequent job-stages, I have to do a couple things:

  1. I have to define a variable scoped to the currently-running stage
  2. I have to pull value-data from the parameter file and assign it to the stage-local variable. I use a call to a sub-shell so that I can use awk to do the extraction.
  3. I then create a global-scope environment variable from the stage-local variable. I need to do things this way so that I can invoke the .trim() method against the stage-local variable. Failing to do that leaves an unwanted <CRLF> at the end of my environment variable's value. To me, this feels like back when I was writing Perl code for CGI scripts and other utilities and had to call chomp() on everything. At any rate, absent the need to clip off the deleterious <CRLF>, I probably could have done a direct assignment. Which is to say, I might have been able to simply do:
    env.GitProjUrl = sh script:'awk -F "=" \'/GitProjUrl/{ print $2 }\' Pipeline.envs',
        returnStdout: true
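
For reference, the S3-hosted parameter file that those awk calls parse is nothing fancy – just flat "Name=Value" lines. Something like the following (the names match the pipeline's variables; the values are placeholders rather than anything from the real project):

    GitCred=deploy-key-credential-id
    GitProjUrl=git@github.example.com:SomeOrg/SomeProject.git
    GitProjBranch=master
    […elided…]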
Once the parameter file's values have all been pushed to the Jenkins job's environment, they're available for use. In this particular case, that means I can then use the Jenkins git SCM sub-module to pull the desired branch/tag from the desired git project using the Jenkins-stored SSH credential specified within the parameters file:

        stage("Print fetched Info") {
            steps {
                checkout scm: [
                        $class: 'GitSCM',
                        userRemoteConfigs: [[
                            url: "${GitProjUrl}",
                            credentialsId: "${GitCred}"
                        ]],
                        branches: [[
                            name: "${GitProjBranch}"
                        ]]
                    ],
                    poll: false
            }
        }
    }


But, yeah, sorting this out resulted in quite a few more shouts of "seriously, Jenkins?!?"

Tuesday, December 3, 2019

Seriously, Jenkins?

Have I used that title before? I feel like I ask that question a lot when writing pipelines for Jenkins.

Today's bit of credulity-testing had to do with using the pipeline DSL's git directive. Prior to today, I'd been setting up Jenkins jobs to only deal with branches. Since the project I'm working on has a bit more complexity in its code-flow, I figured I'd try using release-tags instead of branching.

It, uh, didn't go well. Normally, when I'm using a branch-based approach, I'm able to get away with DSL that looks something like:

pipeline {

    agent any

    parameters {
        string(
            name: 'GitCred',
            description: 'Jenkins-stored Git credential with which to execute git commands'
        )
        string(
            name: 'GitProjUrl',
            description: 'SSH URL from which to download the git project contents'
        )
        string(
            name: 'GitProjBranch',
            description: 'Project-branch to use from the git project'
        )
    }

    stages {
        stage ('Prep Work Environment') {
            steps {
                deleteDir()
                git branch: "${GitProjBranch}",
                credentialsId: "${GitCred}",
                url: "${GitProjUrl}"

[...elided...]

Unfortunately, when you want to use tags, the invocation is a bit more pedantic (and the documentation is maddeningly obtuse when it comes to finding this juju):

[...elided...]
    stages {
        stage ('prep Work Environment') {
            steps {
                deleteDir()
                checkout scm: [
                        $class: 'GitSCM',
                        userRemoteConfigs: [
                            [
                                url: "${GitProjUrl}",
                                credentialsId: "${GitCred}"
                            ]
                        ],
                        branches: [
                            [
                                name: "${GitProjBranch}"
                            ]
                        ]
                    ],
                    poll: false
[...elided...]

It was when, having switched to this form, my job started working that I loudly uttered, "seriously, Jenkins??". My cube-mates love me.

On the plus side, while my original invocation works only for branches, the more-pedantic invocation works for both branches and tags.

Thursday, November 7, 2019

The Punishment of NFS on Hardened EL7 Systems

All of the customers I currently serve operate under two main requirements:
  • If using Linux, it has to be Enterprise Linux (RHEL or CentOS)
  • All such systems must be hardened to meet organizational specifications
The latter means that IPv6 support has to be disabled on all ELx deployments.

On the plus side of current customer-trends, most are (finally) making the effort to migrate from EL6 to EL7. Unfortunately, recent releases of EL7 included an update to the RPC subsystem. Further unfortunately, this update can cause the RPC subsystem to break on a system that's been hardened to disable IPv6.

With later updates, the RPC subsystem will attempt to perform an IPv6 network-bind. It determines whether to attempt this based on whether the IPv6 components are available/enabled in the initramfs boot-kernel.

With typical hardening-routines, IPv6 disablement happens after the initramfs boot-kernel has loaded. This is done when the boot processes read the /etc/sysctl.conf file and files within /etc/sysctl.d. Unfortunately, if the system-owner hasn't ensured that the /etc/sysctl.conf file packaged within the initramfs looks like the one in the booted system's root filesystem and the root filesystem's /etc/sysctl.conf file disables IPv6, bad times ensue. The RPC subsystem assumes that IPv6 is available. Then, when systemd attempts to start the rpcbind.socket unit, it fails. All the other systemd units that depend on the rpcbind.socket unit then also fail. This means no RPC service and no NFS server or client services.
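
A quick way to spot that mismatch is to compare the copy of /etc/sysctl.conf baked into the initramfs against the one on the root filesystem. Something like the following should do it (assuming the file was packaged into the image at all – treat this as a sketch rather than gospel):

    # Print the sysctl.conf embedded in the running kernel's initramfs
    lsinitrd -f etc/sysctl.conf

    # ...or diff it directly against the live copy
    diff <( lsinitrd -f etc/sysctl.conf ) /etc/sysctl.conf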

In this scenario, the general fix-process is:
  1. Uninstall the dracut-config-generic RPM (`yum erase -y dracut-config-generic`)
  2. Rebuild the initramfs (`dracut -v -f`)
  3. Reboot the system
Once the system comes back from the reboot, all of the RPC components – and services that rely on them – should function as expected.

...but that's only the first hurdle. When using default NFS mount options, NFS clients will attempt to perform an NFS v4.1 mount of the NFS server's shares. If NFS hasn't been explicitly configured for GSS-protected mounts, the mount of the filesystem typically takes around two minutes to occur (while the GSS-related subsystems try to negotiate the session before ultimately timing-out and reverting to "sys" security-mode). One can either force the use of NFSv3, explicitly request the "sys" security-mode or wholly disable the rpc-gssd service-components. Explicitly requesting the "sys" security-mode (using the sec=sys mount-option) halves the amount of time needed to negotiate the initial mount-request. Requesting NFSv3 (using the vers=3 mount-option) avoids the security-related negotiations altogether, making the mount-action almost instantaneous. Similarly, disabling the rpc-gssd service-components (using systemctl's mask command for the rpc-gssd and/or nfs-secure services) avoids the GSS-related negotiation-components, making the mount-action almost instantaneous.
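
To illustrate the client-side options (server and share names below are made up):

    # Force NFSv3 and skip the GSS negotiation altogether
    mount -t nfs -o vers=3 nfs-server:/export/share /mnt/share

    # Stay on NFSv4 but explicitly request "sys" security (shortens, but doesn't
    # eliminate, the initial negotiation delay)
    mount -t nfs4 -o sec=sys nfs-server:/export/share /mnt/share

    # ...or disable the GSS service-components host-wide
    systemctl mask rpc-gssd.service nfs-secure.service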

Once those bits are out of the way, then it's usually just a matter of configuring appropriate SELinux elements to allow the sharing-out of the desired filesystems and setting up the export-definitions.

Friday, August 16, 2019

Unsupported Things Never Stay Unsupported

A few years ago, the project-lead on the project I was working on at the time asked me, "can you write some backup scripts that our cloud pathfinder programs can use? Moving to AWS, their systems won't be covered by the enterprise's NetBackup system and they need something to help with disaster-mitigation."

At the time, the team I was on was very small. While there were nearly a dozen people – a couple of whom were also technically-oriented – on the team, it was pretty much just me and one other person who were automation-oriented (and responsible for the entire tooling workload). That other person was more Windows-focussed and our pathfinders were pretty much wholly Linux-based. I am, to make an understatement, UNIX/Linux-oriented. Thus it fell on me.

I ended up whacking together a very quick-and-dirty set of tools. Since our pathfinders were notionally technically-savvy (otherwise, they ought not have been pathfinders), I built the tools with a few assumptions: 1) that they'd read the (minimal) documentation I included with the project; 2) that, having read that, if they found it lacking, they'd examine the code (since it was hosted in a public GitHub project); 3) that they'd contact me or the team if 1 and/or 2 failed - preferably by filing a GitHub issue against the project; and, 4) that they might be the type of users who tried to use more-complex storage configurations (especially given that, at the time, you couldn't expand an EBS volume once created).

Since I wanted something fairly quickly-done, I wrote the tools to leverage AWS's native storage-snapshotting capability. Using storage-snapshots was also attractive because it was more cost-efficient than mirroring an online EBS to an offline EBS (especially the multiple such offline EBSes that would be necessary to create multiple days' worth of recovery-points). Lastly, because of the way the snapshots work, once you initiate the snapshot, you don't have to worry about filesystem changes while the snapshot-creation finishes.

Resultant of assumption #4, I wrote the scripts to accommodate the possibility that they'd be trying to back up filesystems that spanned – either as concats or stripes – two or more EBS volumes. This accommodation meant that I added an option to freeze the filesystem before requesting the snapshot and (because AWS had yet to add the option to snapshot a set of EBSes all at once) implemented multi-threaded snapshot-request logic. The combination of the two meant that, in the few milliseconds that it took to initiate the (effectively) parallel snapshot-requests, I/O would be halted to the filesystem, preventing any consistency-gaps between members of the EBS-set.
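
A minimal sketch of that freeze-then-snapshot approach might look like the following (the mount-point and volume-IDs are hypothetical, and the real tools issued the snapshot-requests via multi-threaded API calls rather than backgrounded CLI invocations):

    # Halt I/O to the spanned filesystem
    fsfreeze --freeze /data

    # Fire off the snapshot-requests for each member-EBS (near-)simultaneously
    for VOLUME in vol-0aaaaaaaaaaaaaaaa vol-0bbbbbbbbbbbbbbbb
    do
       aws ec2 create-snapshot --volume-id "${VOLUME}" \
          --description "consistency-set $(date +%F)" &
    done
    wait

    # Release the filesystem
    fsfreeze --unfreeze /data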

Unfortunately, as my primary customers' adoption went from the (technically-savvy) pathfinder-projects to progressively less-technical follow-on projects, the assumptions under which the tools were authored became less valid. Because the follow-on projects tended to be peopled by staff that are very "cut-n-paste" oriented, they tended to do some silly things:
  • Trying to freeze the "/" filesystem (you can do it, but you're probably not going to be able to undo it …without rebooting)
  • Trying to freeze filesystems that didn't exist ("but it was in the example")
  • Trying to freeze filesystems that weren't built on top of spans (not inherently "wrong", per se, but also not needed from a "keep all the sub-volumes in sync" perspective)
Worse, as our team added (junior) people to help act as stewards for these projects, they didn't really understand the vagaries of storage – why certain considerations are warranted or not (and the tool-features associated thereto). So, they weren't steering the new tool-users away from bad usages. Bearing in mind that the tools were a quick-and-dirty effort meant for use by technically-savvy users, the tools had pretty much no "safeties" built in.

Side-note: To me, you build safeties into tools that you're planning to support (to minimize the amount of "help me, help me: stuff blowed up" emails).

Being short on time and with a huge stack of other projects to work on when I first wrote them, smoothing off the rough-edges wasn't a priority. And, when the request for the tools was originally made, in response to my saying "I don't have the spare cycles to fully engineer a supportable tool-set," I was told, "these will be made available as references, not supported tools."

Harkening back to the additions to the team, one of the other things that wasn't being adequately communicated to them in their onboarding was that the tooling wasn't "supported". We ended up on a cycle where, about every 6-11 months, people would be screaming about "the scripts are broken and tenants are screaming". At which point I would have to remind the people who made the "they're just references" tool-request that, "1) these tools were never meant as more than a reference; 2) they're working perfectly-adequately and just like they did back in 2015 – it's just that both the program-users and our team's stewards are botching their use."

In the most recent iteration of the preceding, I actually had some time to retrofit some safeties. Having done so, the people who had been whining about the tools being "broken" were chuffed and asked, "cool, when do we make the announcement of the updates?" I responded, "well, you don't really need to." At which point I was asked, "why do you want to hide these updates?" My response was, "there's a difference between 'not announcing updates' and 'hiding availability of updates'. Besides: I made no functional changes to the tools. The only thing that's changed is it's harder for people to use them in ways that will cause problems." In disbelief, I was asked, "if there's no functionality change, why did we waste time updating the tools?"


Muttering under my breath as I composed my reply, "because you guys were (erroneously) stating that the tools were broken and that they needed to be made more-friendly. The 'make more friendly' is now done. That said, announcing to people that are already successfully using them, 'the tools have been updated,' provides no value to those people: whether they continue using the pre-safeties versions or the updated versions, their jobs will continue to function without any need for changes. All that announcing will inevitably do is cause those people to harangue you about 'we updated: why aren't we seeing any functional enhancements'?"

Friday, August 9, 2019

So You Want to Access a FIPSed EL7 Host Via RDP

One of the joys of the modern, corporate security-landscape is that enterprises frequently end up locking down their internal networks to fairly extraordinary degrees. As software and operating system vendors offer new bolts to tighten, organizations will tend to tighten them - sometimes without considering the impact of that tightening.

Several of my customers protect their networks not only with inbound firewalls, but firewalls that severely restrict outbound connectivity. Pretty much, their users' desktop systems can only access an external service if it's offered via HTTP/S. Similarly, their users' desktop systems are configured with application whitelisting enabled - preventing users not only from installing software that requires privileged access to install, but even from installing things that are wholly constrained to their home directories. This kind of security-posture is suitable for the vast majority of users, but is considerably less so for developers.

The group I work for provides cloud-enablement services. This means that we are both developers and provide services to our customers' developers. Both for our own needs (when on-site) and for those of customers' developers, this has meant needing to have remote (cloud-hosted), "developer" desktops. The cloud service providers (CSPs) we and our customers use provide remote desktop solutions (e.g., AWS's "Workspaces"). However, these services are typically not usable at our customer sites due to the previously-mentioned network and desktop lockdowns: even if the local desktop has tools like RDP and SSH clients installed, those tools are only usable within the enterprises' internal networks; if the remote desktop offering is reachable via HTTP/S, it's typically through a widget that the would-be remote desktop user would install to their local workstation if application-whitelisting didn't prevent it.

To solve this problem for both our own needs (when on-site) and our customers' developers' needs, we stood up a set of remote (cloud-hosted), Windows-based desktops. To make them usable from locked-down networks, we employed Apache's Guacamole service. Guacamole makes remote Windows and Linux desktops available within a user's web browser.

Guacamole-fronted Windows desktops proved to be a decent solution for several years. Unfortunately, as the cloud wars heat up and CSPs try to find ways to bring - or force - customers into their datacenters, what was once a decent solution can become not decent - often due to pricing factors. Sadly, it appears that Microsoft may be trying to pump up Azure-adoption by increasing the price of cloud-hosted Windows services when those services are run in other CSPs' datacenters.

While we wait to see if and how this plays out, financially, we opted to see "can we find lower-cost alternatives to Windows-based (remote) developer desktops." Most of our and our customers' developers are Linux-oriented - or at least Linux-comfortable: it was a no-brainer to see what we could do using Linux. Our Guacamole service already uses Linux-based containers to provide the HTTP/S-encapsulation for RDP and Guacamole natively supports the fronting of Linux-based graphical desktops via VNC. That said, given that the infrastructure is built around RDP, it might ease the rearchitecting-process to keep communications RDP-based even without Windows in the solution-stack.

Because our security guidance has previously required us to use "hardened" Red Hat and CentOS-based servers to host Linux applications, that was our starting-point for this process. This hardening almost always introduces "wrinkles" into deployment of solutions - usually because the software isn't SELinux-enabled or relies on kernel-bits that are disabled under FIPS mode. This time, the problem was FIPS mode.

While installing and using RDP on Linux has become a lot easier than it used to be (tools like XRDP now actually ship with SELinux policy-modules!), not all of the kinks are gone, yet. What I discovered, when starting on the investigation path, is that the XRDP installer for Enterprise Linux 7 isn't designed to work in FIPS mode. Specifically, when the installer goes to set up its encryption-keys, it attempts to do so using MD5-based methods. When FIPS mode is enabled on a Linux kernel, MD5 is disabled.

Fortunately, this only affects legacy RDP connections. The currently-preferred solution for RDP leverages TLS. Both TLS and its preferred ciphers and algorithms are FIPS compatible. Further, even though the installer fails to set up the encryption keys, those keys are effectively optional: a file merely needs to exist at the expected location for keys, not actually be a valid key. This meant that the problem in the installer was trivially worked around by adding a `touch /etc/xrdp/rsakeys.ini` to the install process. Getting a cloud-hosted, Linux-based, graphical desktop ultimately becomes a matter of:

  1. Stand up a cloud-hosted Red Hat or CentOS 7 system
  2. Ensure that the "GNOME Desktop" and "Graphical Administration Tools" package-groups are installed (since, if your EL7 starting-point is like ours, no GUIs will be in the base system-image)
  3. Once those are installed, ensure that the system's default run-state has been set to "graphical.target". The installers for the "GNOME Desktop" package-group should have taken care of this for you. Check the run-level with `systemctl get-default`. If the installers for the "GNOME Desktop" package-group didn't properly set things, correct it by executing `systemctl set-default graphical.target`
  4. Ensure that the xrdp and tigervnc-server RPMs are installed
  5. Make sure that firewalld allows connections to the XRDP service by executing `firewall-cmd --add-port=3389/tcp --permanent`
  6. Similarly, ensure that whatever CSP-layer networking controls are present allow TCP port 3389 inbound to your XRDP-enabled Linux host.
  7. ...And if you want users of your Linux-based RDP host to be able to remotely access actual Windows-based servers, install Vinagre.
  8. Reboot to ensure everything is in place and running.
Once the above is done, you can test things out by RDPing into your new Linux host from a Windows host …and, if you've installed Vinagre, RDPing from your new, XRDP-enabled Linux host to a Windows host (for a nice case of RDP-inception).
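
Rolled together, the EL7-side setup amounts to something like the following (the `touch` is the FIPS-mode workaround described earlier; the `systemctl enable xrdp` is belt-and-suspenders in case the RPM doesn't enable the service on its own):

    yum -y groupinstall "GNOME Desktop" "Graphical Administration Tools"
    systemctl set-default graphical.target
    yum -y install xrdp tigervnc-server vinagre
    touch /etc/xrdp/rsakeys.ini
    systemctl enable xrdp
    firewall-cmd --add-port=3389/tcp --permanent
    firewall-cmd --reload
    reboot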




Friday, July 19, 2019

Why I Default to The Old Ways

I work with a growing team of automation engineers. Most are purely dev types. Those that have lived in the Operations world, at all, skew heavily towards Windows or only had to very lightly deal with UNIX or Linux.

I, on the other hand, have been using UNIX flavors since 1989. My first Linux system was the result of downloading a distribution from the MIT mirrors in 1992. As a result, I have a lot of old habits (seriously: some of my habits are older than some of my teammates). And, because I've had to get deep into the weeds with all of those operating systems many, many, many times over the years, those habits are pretty entrenched ("learned with blood" and all that rot).

A year or so ago, I'd submitted a PR that included some regex-heavy shell scripts. The person that reviewed the PR had asked "why are you using '[<space><TAB>]*' in your regexes rather than just '\s'?". At the time, I think my response was a semi-glib, "A) old habits die hard; and, B) I know that the former method always works".

That said, I am a lazy-typist. Typing "\s" is a lot fewer keystrokes than is "[<space><TAB>]*". Similarly, "\s" takes up a lot less in the way of column-width than does "[<space><TAB>]*" (and I/we generally like to code to fairly standard page-widths). So, for both laziness reasons and column-conservation reasons, I started to move more towards using "\s" and away from using "[<space><TAB>]*".  I think in the last 12 months, I've moved almost exclusively to  "\s".

Today, that move bit me in the ass. Well, yesterday, actually, because that's when I started receiving reports that the tool I'd authored on EL7 wasn't working when installed/used on EL6. Ultimately, I traced the problem to an `awk` invocation. Specifically, I had a chunk of code (filtering DNS output) that looked like:

awk '/\sIN SRV\s/{ printf("%s;%s\n",$7,$8)}'

Which worked a treat on EL7 but on EL6, "not so much." When I altered it to the older-style invocation:

awk '/[  ]*IN[  ]*SRV[  ]*/{ printf("%s;%s\n",$7,$8)}'

It worked fine on both EL7 and EL6. Turns out the ancient version of `awk` (3.1.7) on EL6 didn't know how to properly interpret the "\s" token. Oddly (per my recollection from writing other tooling), EL6's version of `grep` understands the "\s" token just fine.
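
Worth noting – and this is an assumption on my part rather than something I've re-verified against gawk 3.1.7 – POSIX bracket-expressions are another way to stay portable without embedding literal space/TAB characters:

    awk '/[[:space:]]IN SRV[[:space:]]/{ printf("%s;%s\n",$7,$8)}'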

When I Slacked the person I'd had the original conversation with a link to the PR with a "see: this is why" note, he replied, "oh: I never really used awk, so never ran into it".

Wednesday, July 17, 2019

Crib-Notes: Manifest Deltas

Each month, the group I work for publishes new CentOS and Red Hat AMIs (and Azure templates and Vagrant boxes). When we complete the publication-event, we post a news announcement to our user-portal so that subscribers can receive an alert of the new publication. Included in that news announcement is a "what's changed" section.

In prior months, figuring out "what changed" was left as a manual step for the team-member charged with running the automation for a given month's publication event. This month, no one generated that news article and there were several updated and new RPMs included in the new image. So, I set about figuring out "how to extract this information programmatically so as to more-easily suss-out what to include in the announcement posting." The following does so (though, presumably, in a not-particularly-optimized fashion):
git diff $(
      git log --pretty='%H' --follow -- <PATH_TO_MANIFEST_FILE> | \
      head -2 | \
      tac | \
      sed 'N;s/\n/../'
   ) -- <PATH_TO_MANIFEST_FILE> | \
grep -E '(amazon|aws|ec2)-' | \
sed 's/^./& /' | \
sort -k 2
To explain:
  1. Use `git log` to output the commit-hashes for all the commits for the target file (in this case, the project's manifest-file)
  2. Use `head -2` to grab only the two most-recent commit hashes from the output-stream
  3. Use the `tac` command to invert the order of the two lines returned from the `head` command
  4. Use the `sed` command to join the two lines, replacing the first line's line-ending newline character with ".."
  5. Use `git diff` against the output created in steps 1-4, and constrain the diff-activity to just the manifest-file.
  6. Pipe that output through `grep` to suppress all information other than the bits containing the `amazon-`, `aws-` and `ec2-` substrings.
  7. Pipe that through `sed` so that the +/- that `git diff` uses to show new and removed files, respectively, becomes an easily-tokenized substring.
  8. Sort the remaining output-stream (with `sort`) so that the lines are grouped by manifest-element (the second key/token in the sorted output)
Taking that output and converting to a news article is still manual, but it at least makes it a lot easier to do than either hand-diffing two files or having to "just know" what's changed.

Notes

Because Red Hat has placed EL6 in its final stage of de-support, we've stopped publishing CentOS6 and RHEL6. We did this to discourage our subscribers from doing new deployments on EL6 (since the underlying platform will go into final de-support come November of this year).

Similarly, due to the current lack of a CentOS offering for EL8, the lack of security-related build- or hardening-guidance for EL8 and the associated lack of subscriber-demand for an EL8 build, we don't yet include builds for CentOS8 or RHEL8 in our process. Thus, for the time being, we only need to provide a "what's changed" for EL7 builds. Given this, we currently only need to do change-queries against the "manifests/spel-minimal-centos-7-hvm.manifest.txt" file.

Thursday, June 20, 2019

Crib-Notes: EC2 UserData Audit

Sometimes, I find that I'll return to a customer/project and forget what's "normal" for them in how they deploy their EC2s. If I know a given customer/project tends to deploy EC2s that include UserData, but they don't keep good records of what they tend to do for said UserData, I find the following BASH scriptlet to be useful for getting myself back into the swing of things:

for INSTANCE in $( aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' | \
                   sed -e '/^\[/'d -e '/^]/d' -e 's/^ *"//' -e 's/".*//' )
do
   printf "###############\n# %s\n###############\n" "${INSTANCE}"
   aws ec2 describe-instance-attribute --instance-id "${INSTANCE}" --attribute userData | \
   jq -r .UserData.Value | base64 -d
   echo
done | tee /tmp/DiceLab-EC2-UserData.log

To explain, what the above does is:
  1. Initiates a for-loop using ${INSTANCE} as the iterated-value
  2. With each iteration, the value injected into ${INSTANCE} is derived from a line of output from the aws ec2 describe-instances command. Normally, this command outputs a JSON document containing a bunch of information about each instance in the account-region. Using the --query option, the output is constrained to only output each EC2 instance's InstanceId value. This is then piped through sed so that the extraneous characters are removed, resulting in a clean list of EC2 instance-IDs.
  3. The initial printf line creates a bit of an output-header. This will make it easier to pore through the output and keep each iterated instance's individual UserData content separate
  4. Instance UserData is considered to be an attribute of a given EC2 instance. The aws ec2 describe-instance-attribute command is what is used to actually pull this content from the target EC2. I could have used a --query filter to constrain my output. However, I instead chose to use jq as it allows me to both constrain my output as well as do output-cleanup, eliminating the need for the kind of complex sed statement I used in the loop initialization (cygwin's jq was crashing this morning when I was attempting to use it in the loop-initialization phase - in case you were wondering about the inconsistent constraint/cleanup methods). Because the UserData output is stored as a BASE64-encoded string, I have to pipe the cleaned-up output through the base64 utility to get my plain-text data back.
  5. I inject a closing blank line into my output stream (via the echo command) to make the captured output slightly easier to scan.
  6. I like to watch my scriptlet's progress, but still like to capture that output into a file for subsequent perusal, thus I pipe the entire loop's output through tee so I can capture as I view.
I could have set it up so that each instance's data was dumped to an individual output-file. This would have saved the need for the printf and echo lines. However, I like having one, big file to peruse (rather than having to hunt through scads of individual files) ...and a single file-open/close action is marginally faster than scads of open/closes.

In an account-region that had hundreds of EC2s, I'd probably have been more selective with which instance-IDs I initiated my loop. I would have used a --filter statement in my aws ec2 describe-instances command - likely filtering by VPC-ID and one or two other selectors.
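
For example, constraining the loop-initialization to a single VPC (the VPC-ID below is a made-up placeholder) would look something like:

    aws ec2 describe-instances \
       --filters 'Name=vpc-id,Values=vpc-0123456789abcdef0' \
       --query 'Reservations[].Instances[].InstanceId'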

Tuesday, May 7, 2019

Crib-Notes: Offline Delta-Syncs of S3 Buckets

In the normal world, synchronizing two buckets is as simple as doing `aws s3 sync <SYNC_OPTIONS> <SOURCE_BUCKET> <DESTINATION_BUCKET>`. However, due to the information security needs of some of my customers, it's occasionally necessary to perform data-synchronizations between two S3 buckets, but using methods that amount to "offline" transfers.

To illustrate what is meant by "offline":
  1. Create a transfer-archive from a data source
  2. Copy the transfer-archive across a security boundary
  3. Unpack the transfer-archive to its final destination

Note that things are a bit more involved than the summary of the process – but this gives you the gist of the major effort-points.

The first time you do an offline bucket sync, transferring the entirety of a bucket is typically the goal. However, for a refresh-sync – particularly for a bucket of greater than a trivial content-size, this can be sub-ideal. For example, it might be necessary to do monthly syncs of a bucket that grows by a few Gigabytes per month. After a year, a full sync can mean having to move tens to hundreds of gigabytes. A better way is to only sync the deltas – copying only what's changed between the current and immediately-prior sync-tasks (a few GiB rather than tens to hundreds).

The AWS CLI tools don't really have a "sync only the files that have been added/modified since <DATE>". That said, it's not super difficult to work around that gap. A simple shell script like the following works a trick:

for FILE in $( aws s3 ls --recursive s3://<SOURCE_BUCKET>/  | \
   awk '$1 > "2019-03-01 00:00:00" {print $4}' )
do
   echo "Downloading ${FILE}"
   install -bDm 000644 <( aws s3 cp "s3://<SOURCE_BUCKET>/${FILE}" - ) \
     "<STAGING_DIR>/${FILE}"
done

To explain the above:

  1. Create a list of files to iterate:
    1. Invoke a subprocess using the $() notation. Within that subprocess...
    2. Invoke the AWS CLI's S3 module to recursively list the source-bucket's contents (`aws s3 ls --recursive`)
    3. Pipe the output to `awk` – keeping only the lines whose first output-column (the file-modification date column) is newer than the given date-string – and print out only the fourth column (the S3 object-path)
    The output from the subprocess is captured as an iterable list-structure
  2. Use a for loop-method to iterate the previously-assembled list, assigning each S3 object-path to the ${FILE} variable
  3. Since I hate sending programs off to do things in silence (I don't trust them to not hang), my first looped-command is to say what's happening via the echo "Downloading ${FILE}" directive.
  4. The install line makes use of some niftiness within both BASH and the AWS CLI's S3 command:
    1. By specifying "-" as the "destination" for the file-copy operation, you tell the S3 command to write the fetched object-contents to STDOUT.
    2. BASH allows you to take a stream of output and assign a file-handle to it by surrounding the output-producing command with <( ).
    3. Invoking the install command with the -D flag tells the command to "create all necessary path-elements to place the source 'file' in the desired location within the filesystem, even if none of the intervening directory structure exists, yet."
    Putting it all together, the install operation takes the streamed s3 cp output, and installs it as a file (with mode 000644) at the location derived from the STAGING_DIR plus the S3 object-path ...thus preserving the SOURCE_BUCKET's content-structure within the STAGING_DIR
Obviously, this method really only works for additive/substitutive deltas. If you need to account for deletions and/or moves, this approach will be insufficient.

Wednesday, April 24, 2019

Crib-Notes: End-to-End SSL Within AWS

In general, when using an Elastic Load-Balancer (ELB) to do SSL-encrypted proxying for an AWS-hosted, Internet-facing application, it's typically sufficient to simply do SSL-termination at the ELB and call it a day. That said:
  • If you're paranoid, you can ensure that only the proxied EC2(s) and the ELB are able to communicate with each other via security groups.
  • If you're under compliance-requirements (e.g., PCI/DSS), you can enable end-to-end SSL such that:
    1. Connections between Internet-based client and the ELB are encrypted
    2. Connections between the ELB and the application-hosting EC2(s) is encrypted
Either/both amount to a "belt and suspenders"  approach. Other than "meeting policy", security isn't meaningfully improved: given AWS's overarching security-design, even if someone else has access to your application-EC2(s) and ELB's VPC, they won't be able to sniff packets/data – encrypted or not.

Technical need aside... Implementing end-to-end SSL is trivial:
  • ACM allows easy provisioning of SSL certificates for the ELB (with the security-bonus of automatically rotating said certificates). 
  • You can use very generic, self-signed certificates on your application-hosting EC2s:
    • The certificate's Subject doesn't matter
    • The certificate's validity window doesn't matter (no need to worry about rotating certificates that have expired)
Thus setup comes down to:
  1. Create an EC2-hosted application/service:
    1. Launch EC2
    2. Install HTTPS-capable application
    3. Generate a self-signed certificate (setting the -days to as little as 1 day). Example (using the OpenSSL utility):

      openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
         -keyout peer.key -out peer.crt

      When prompted for input, just hit the <RETURN> key (this will create a cert with defaulted values ...which, as noted previously, don't really have bearing on the ELB's trust of the certificate). Similarly, one can wholly omit the -days 1 flag and value – the default certificate will be valid for 30 days (but, the ELB doesn't care about the validity time-window).
    4. Configure the HTTPS-capable application to load the certificate
    5. Configure the EC2's host-based firewall to allow connections to whatever port the application listens on for SSL-protected connections
    6. Configure the EC2's security group to allow connections to whatever port the application listens on for SSL-protected connections
  2. Create an ELB:
    1. Set the ELB to listen for SSL-based connection-requests (using a certificate from ACM or IAM)
    2. Set the ELB to forward connections using the HTTPS protocol to connect to the target EC2(s) over whatever port the application listens on for SSL-protected connections
    3. Ensure the ELB's healthcheck is requesting a suitable URL to establish the health of the application 

Once the ELB's healthcheck goes green, it should be possible to connect to the EC2-hosted application via SSL. If one wants to verify the encryption-state of the connection between the ELB and EC2(s), one would need to login to the EC2(s) and sniff the inbound packets (e.g., by using a tool like WireShark).
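
If WireShark isn't handy on the target EC2, even something as simple as the following (interface-name and listener-port are assumptions) is usually enough to confirm that the proxied traffic isn't traversing in plaintext:

    tcpdump -A -i eth0 port 443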

Wednesday, April 17, 2019

Crib-Notes: Validating Consistent ENA and SRIOV Support in AMIs

One of the contracts I work on, we're responsible for producing the AMIs used for the entire enterprise. At this point, the process is heavily automated. Basically, we use a pipeline that leverages some CI tools, Packer and a suite of BASH scripts to do all the grunt-work and produce not only an AMI but artifacts like AMI configuration- and package-manifests.

When we first adopted Packer, it had some limitations on how it registered AMIs (or, maybe, we just didn't find the extra flags back when we first selected Packer – who knows, it's lost to the mists of time at this point). If you wanted the resultant AMIs to have ENA and/or SRIOV support baked in (we do), your upstream AMI needed to have it baked in as well. This necessitated creating our own "bootstrap" AMIs as you couldn't count on these features being baked in – not even within the upstream vendor's (in our case, Red Hat's and CentOS's) AMIs.

At any rate, because the overall process has been turned over from the people that originated the automation to people that basically babysit automated tasks, the people running the tools don't necessarily have a firm grasp of everything that the automation's doing. Further, the people that are tasked with babysitting the automation differ from run-to-run. While automation should see to it that this doesn't matter, sometimes it pays to be paranoid. So a quick way to assuage that paranoia is to run quick reports from the AWS CLI. The following snippet makes for an adequate, "fifty-thousand foot" consistency-check:

aws ec2 describe-images --owner <AWS_ACCOUNT_ID> --query \
      'Images[].[ImageId,Name,EnaSupport,SriovNetSupport]' \
      --filters 'Name=name,Values=<SEARCH_STRING_PATTERN>' \
      --out text | \
   awk 'BEGIN {
         printf("%-18s%-60s%-14s%-10s\n","AMI ID","AMI Name","ENA Support","SRIOV Support")
      } {
         printf("%-18s%-60s%-14s%-10s\n",$1,$2,$3,$4)
      }'
  • There's a lot of organizations and individuals publishing AMIs. Thus, we use the --owner flag to search only for AMIs we've published.
  • We produce a couple of different families of AMIs. Thus, we use the --filter statement to only show the subset of our AMIs we're interested in.
  • I really only care about four attributes of the AMIs being reported on: ImageId, Name, EnaSupport and SriovNetSupport. Thus, the use of the JMESPath --query statement to suppress all output except for that in which I'm interested.
  • Since I want the output to be pretty, I used the compound awk statement to create a formatted header and apply the same formatting to the output from the AWS CLI (using but a tiny bit of the printf routine's many capabilities).

This will produce output similar to:

   AMI ID                 AMI Name                                        ENA Support  SRIOV Support
   ami-187af850f113c24e1  spel-minimal-centos-7-hvm-2019.03.1.x86_64-gp2  True         simple
   ami-91b38c446d188643e  spel-minimal-centos-7-hvm-2019.02.1.x86_64-gp2  True         simple
   ami-22867cf08bb264ac4  spel-minimal-centos-7-hvm-2019.01.1.x86_64-gp2  True         simple
   [...elided...]
   ami-71c3822ed119c3401  spel-minimal-centos-7-hvm-2018.03.1.x86_64-gp2  None         simple
   [...elided...]
   ami-8057c2bf443dc01f5  spel-minimal-centos-7-hvm-2016.06.1.x86_64-gp2  None         None

As you can see, not all of the above AMIs are externally alike. While this could indicate a process or personnel problem, what my output actually shows is evolution in our AMIs. Originally, we weren't doing anything to support SRIOV or ENA. Then we added SRIOV support (because our AMI users were finally asking for it). Finally, we added ENA support (mostly so we could use the full range and capabilities of the fifth-generation EC2 instance-types).

At any rate, running a report like the above, we can identify if there's unexpected differences and, if a sub-standard AMI slips out, we can alert our AMI users "don't use <AMI> if you have need of ENA and/or SRIOV".

Saturday, April 6, 2019

Crib-Notes: Tracing Permissions for an Instance Role That Uses Inline Policies

The rationale for this crib-note is essentially identical to that for the previous topic on tracing IAM permissions in instance-roles that use managed-policies. So I'm just going to crib the rest of the intro...

When working with AWS EC2 instances – particularly when automating the deployments and lifecycles thereof – it's common to make use of the AWS IAM system's Instance Role Policy feature. Occasionally, you might get asked, "what permissions have been given to that instance." To answer, you might use AWS's IAM web console or the IAM CLI. In the former case, to get the information to the requestor, you're left with the options of either taking screen-shots or trying to copy and paste the text from the UI to your email/slack/etc. reply. In the latter case, you can just dump out the JSON-formatted policy-document that enumerates the permissions.

Generally, I prefer the CLI option since I don't have to worry "what does the recipient need to do with the results". Just dumping a text file, I needn't worry about being yelled at that the contents of a screen-shot can't be used to easily create another, similar policy. It's also easy to simply redirect the output to a file and attach it to whatever media I'm responding to ...or, if mail is enabled within the CLI environment, simply pipe the output directly to a CLI-based email tool and save a step in the process (laziness for the win!).

But, how to go from "there's an instance in this account: what privileges does it have" to "here's what it can do?". Basically, it's a three-step process:

  1. Get the name of the Instance Role attached to the EC2 instance. This can be done with a method similar to:

    $ aws ec2 describe-instances --instance-id <INSTANCE_ID> \
          --query 'Reservations[].Instances[].IamInstanceProfile[].Arn[]' --out text | \
      sed 's/^.*arn:.*profile\///'

  2. Get the list of inline policies attached to the role. This can be done with a method similar to:

    $ aws iam list-role-policies --role-name <OUTPUT_FROM_PREVIOUS>

  3. Get the list of permissions associated with the instance role's inline policy/policies. This can be done with a method similar to:

    $ aws iam get-role-policy --role-name <OUTPUT_FROM_STEP_1> --policy-name <OUTPUT_FROM_PREVIOUS>

The above steps will dump out the full IAM policy/permissions of the queried inline policy:

  • If the inline policy was the only policy attached to the role, the output will show all of the permissions the instance role grants any EC2s it is attached to.
  • If the inline policy was not the only inline policy attached to the role, it will be necessary to iterate over the remaining policies attached to the role to get the aggregated permission-set.
To facilitate the iteration, one can use a script similar to the following to encapsulate the second and third steps from prior process description:


#!/bin/bash
#
# Script to dump out all permissions granted through an IAM role with multiple
# inline IAM policies attached.
###########################################################################

if [[ $# -eq 0 ]]
then
   echo "Usage: ${0} <instance-id>" >&2
   exit 1
fi

PROFILE_NAME=$( aws ec2 describe-instances --instance-id "${1}" \
   --query 'Reservations[].Instances[].IamInstanceProfile[].Arn' --out text | \
  tr '\t' '\n' | sort -u | sed 's/^.*arn:.*profile\///' )

POLICY_LIST_RAW=$( aws iam list-role-policies --role-name ${PROFILE_NAME} )
POLICY_LIST_CLN=($( echo ${POLICY_LIST_RAW} | jq .PolicyNames[] | sed 's/"//g' ))

for ITER in $( seq 0 $(( ${#POLICY_LIST_CLN[@]} - 1 )) )
do
  aws iam get-role-policy --role-name "${PROFILE_NAME}" \
    --policy-name "${POLICY_LIST_CLN[${ITER}]}"
done

Friday, April 5, 2019

Crib-Notes: Tracing Permissions for an Instance Role That Uses Managed Policies

When working with AWS EC2 instances – particularly when automating the deployments and lifecycles thereof – it's common to make use of the AWS IAM system's Instance Role Policy feature. Occasionally, you might get asked, "what permissions have been given to that instance." To answer, you might use AWS's IAM web console or the IAM CLI. In the former case, to get the information to the requestor, you're left with the options of either taking screen-shots or trying to copy and paste the text from the UI to your email/slack/etc. reply. In the latter case, you can just dump out the JSON-formatted policy-document that enumerates the permissions.

Generally, I prefer the CLI option since I don't have to worry "what does the recipient need to do with the results". Just dumping a text file, I needn't worry about being yelled at that the contents of a screen-shot can't be used to easily create another, similar policy. It's also easy to simply redirect the output to a file and attach it to whatever media I'm responding to ...or, if mail is enabled within the CLI environment, simply pipe the output directly to a CLI-based email tool and save a step in the process (laziness for the win!).

But, how to go from "there's an instance in this account: what privileges does it have" to "here's what it can do?" Basically, it's a four-step process:

  1. Get the name of the Instance Role attached to the EC2 instance. This can be done with a method similar to:
    $ aws ec2 describe-instances --instance-id <INSTANCE_ID> \
        --query 'Reservations[].Instances[].IamInstanceProfile[].Arn[]' --out text | \
      sed 's/^.*arn:.*profile\///'
    $ aws iam get-instance-profile --instance-profile-name <OUTPUT_FROM_PREVIOUS> \
        --query 'InstanceProfile.Roles[].RoleName' --out text
     
  2. Get list of policies attached to role. This can be done with a method similar to:
    aws iam list-attached-role-policies --role-name <OUTPUT_FROM_PREVIOUS> \
        --query 'AttachedPolicies[].PolicyArn[]' --out text
     
  3. Find the current version of the attached policy/policies. This can be done with a method similar to:
    aws iam get-policy --policy-arn <OUTPUT_FROM_PREVIOUS> \
        --query 'Policy.DefaultVersionId' --out text
    
  4. Get contents of attached policy/policies active version. This can be done – using the outputs from steps #2 and #3 – with a method similar to:
    aws iam get-policy-version --policy-arn <OUTPUT_FROM_STEP_2> \
         --version-id <OUTPUT_FROM_STEP_3>
     
The above steps will dump out the full IAM policy/permissions of the queried managed-policy:
  • If the queried managed policy was the only policy attached to the role, the output will show all of the permissions the instance role grants any EC2s it is attached to.
  • If it was not the only policy attached to the role, it will be necessary to iterate over the remaining policies attached to the role to get the aggregated permission-set.
To facilitate the iteration, one can use a script similar to the following to encapsulate the steps from prior process description:

#!/bin/bash
#
# Script to dump out all permissions granted through an IAM role with multiple
# managed IAM policies attached.
###########################################################################

if [[ $# -eq 0 ]]
then
   echo "Usage: ${0} <instance-id>" >&2
   exit 1
fi

PROFILE_NAME="$( aws ec2 describe-instances --instance-id "${1}" \
  --query 'Reservations[].Instances[].IamInstanceProfile[].Arn[]' \
  --out text | sed 's/^.*arn:.*profile\///' )"
ROLE_NAME="$( aws iam get-instance-profile \
  --instance-profile-name "${PROFILE_NAME}" \
  --query 'InstanceProfile.Roles[].RoleName' \
  --out text | sed 's/^.*arn:.*profile\///' )"
ATTACHED_POLICIES=($( aws iam list-attached-role-policies \
  --role-name "${ROLE_NAME}" --query 'AttachedPolicies[].PolicyArn[]' | \
  jq .[] | sed 's/"//g' ))

for ITER in $( seq 0 $(( ${#ATTACHED_POLICIES[@]} - 1 )) )
do
   POLICY_VERSION=$( aws iam get-policy --policy-arn \
     ${ATTACHED_POLICIES[${ITER}]} --query \
     'Policy.DefaultVersionId' --out text )
   aws iam get-policy-version --policy-arn ${ATTACHED_POLICIES[${ITER}]} \
     --version-id "${POLICY_VERSION}"
done

Wednesday, March 27, 2019

Travis, EL7 and Docker-Based Testing

As noted in a prior post, a lot of my customer-oriented activities support deployment within networks that are either partially- or wholly-isolated from the public Internet. Yesterday, as part of supporting one such customer, I stood up a new project to help automate the creation of yum repository configuration RPMs for private networks. I've had to hand-jam such files twice, now, and there are unwanted deltas between the two jam-sessions (in defense, they were separated from each other by nearly a three-year time-span). So, I figured it was time to standardize and automate things.

Usually, when I stand up a project, I like to include tests of the content that I wish to deliver. Since most of my projects are done in public GitHub repositories, I typically use TravisCI to automate my testing. Prior to this project, however, I wasn't trying to automate the validity-testing of RPM recipes via Travis. Typically, when automating creation of RPMs I wish to retain or deliver, I set up a Jenkins job that takes the resultant RPMs and stores them in Artifactory – both privately-hosted services. Most of my prior Travis jobs were simple, syntax-checkers (using tools like shellcheck, JSON validators, CFn validators, etc.) rather than functionality-checkers.

This time, however, I was trying to deliver functionality (RPM spec files that would be used to generate source files from templates and package the results). So, I needed to be able to test that a set of spec files and source-templates could be reliably used to generate RPMs. This meant I needed my TravisCI job to generate "throwaway" RPMs from the project-files.

The TravisCI system's test-hosts are Ubuntu-based rather than RHEL- or CentOS-based. While there are some tools that will allow you to generate RPMs on Ubuntu, there've been some historical caveats about their reliability and/or representativeness. So, my preference was to be able to use a RHEL- or CentOS-based context for my test-packagings. Fortunately, TravisCI does offer the ability to use Docker on their test-hosts.
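
Using that capability mostly amounts to pulling a CentOS image, starting a long-running container from it, then `docker exec`-ing the build-steps into that container. A rough sketch of the shape that takes (illustrative only, not the project's actual .travis.yml logic; the centos-${OS_VERSION} container-name matches the snippets further down, while the /rpmbuild mount-path is simply an assumption of mine):

    # Pull the target EL-version's image and start a long-running, throwaway
    # container with the project's checkout bind-mounted into it
    sudo docker pull "centos:${OS_VERSION}"
    sudo docker run --detach --name "centos-${OS_VERSION}" \
      --volume "${PWD}:/rpmbuild" "centos:${OS_VERSION}" sleep infinity

    # Install the RPM-packaging toolchain inside the container...
    sudo docker exec "centos-${OS_VERSION}" yum install -y rpm-build

    # ...then `docker exec` the individual rpmbuild invocations the same way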

In general, setting up a Docker-oriented job is relatively straightforward. Where things get "fun" is that the version of `rpmbuild` that comes with Enterprise Linux 7 gets kind of annoyed if it's not able to resolve the UIDs and GIDs of the files it's trying to build from (never mind that the build-user inside the running Docker-container is "root" ...and has unlimited access within that container). If it can't resolve them, the rpmbuild tasks fail with a multitude of not terribly helpful "error: Bad owner/group: /path/to/repository/file" messages.

After googling about, I ultimately found that I needed to ensure that the UIDs and GIDs of the project-content existed within the Docker-container's /etc/passwd and /etc/group files, respectively. Note: most of the "top" search results Google returned to me indicated that the project files needed to be `chown`ed. However, simply being mappable proved to be sufficient.

Rounding the home stretch...

To resolve the problem, I needed to determine what UIDs and GIDs the project-content had inside my Docker-container. That meant pushing a Travis job that included a (temporary) diagnostic-block to stat the relevant files and return me their UIDs and GIDs. Once the UIDs and GIDs were determined, I needed to update my Travis job to add relevant groupadd and useradd statements to my container-preparation steps. What I ended up with was:

    sudo docker exec centos-${OS_VERSION} groupadd -g 2000 rpmbuilder
    sudo docker exec centos-${OS_VERSION} adduser -g 2000 -u 2000 rpmbuilder

It was late in the day, by this point, so I simply assumed that the IDs were stable. I ran about a dozen iterations of my test, and they stayed stable, but that may have just been "luck". If I run into future "owner/group" errors, I'll update my Travis job-definition to scan the repository-contents for their current UIDs and GIDs and then set them based on those. But, for now, my test harness works: I'm able to know that updates to existing specs/templates or additional specs/templates will create working RPMs when they're taken to where they need to be used.
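
If that day comes, the logic will probably look something like the following (a sketch only: it assumes the GNU stat that ships on the Travis hosts and that the entire checkout shares a single owner; the rpmbuilder name matches the snippet above):

    # Discover the UID/GID that own the checked-out project-content
    BUILD_UID="$( stat -c '%u' . )"
    BUILD_GID="$( stat -c '%g' . )"

    # Make those IDs resolvable inside the container so that rpmbuild stops
    # complaining about "Bad owner/group"
    sudo docker exec centos-${OS_VERSION} groupadd -g "${BUILD_GID}" rpmbuilder
    sudo docker exec centos-${OS_VERSION} adduser -g "${BUILD_GID}" -u "${BUILD_UID}" rpmbuilder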

Friday, January 18, 2019

GitLab: You're Kidding Me, Right?

Some of the organizations I do work for run their own, internal/private git servers (mostly GitLab CE or EE but the occasional GitHub EE). However, the way we try to structure our contracts, we maintain overall ownership of code we produce. As part of this, we do all of our development in our corporate GitHub.Com account. When customers want the content in their git servers, we set up a replication-job to take care of the requisite heavy-lifting.

One of the side-effects of developing externally, this way, is that the internal/private git service won't really know about the email addresses associated with the externally-sourced commits. While you can add all of your external email addresses to your account within the internal/private git service, some of those external email addresses may not be verifiable (e.g., if you use GitHub's "noreply" address-hiding option).

GitLab doesn't make having these non-verifiable addresses in your commit-history particularly fun or easy to resolve. To "fix" the problem, you need to go into the GitLab server's administration CLI. So, to add my GitHub "noreply" email, I needed to do:

  1. SSH to the GitLab server
  2. Change privileges (sudo) to an account that has the ability to invoke the administration CLI
  3. Start the GitLab administration CLI
  4. Use a query to set a modification-handle for the target account (my contributor account)
  5. Add a new email address (the GitHub "noreply" address)
  6. Tell GitLab "you don't need to verify this" (mandatory: this must be said in an Obi-Wan Kenobi voice)
  7. Hit save and exit the administration CLI
For me, this basically looked like:
-------------------------------------------------------------------------------------
 GitLab:       11.6.5 (237bddc)
 GitLab Shell: 8.4.3
 postgresql:   9.6.10
-------------------------------------------------------------------------------------
Loading production environment (Rails 5.0.7)
irb(main):002:0> user = User.find_by(email: 'my@ldap.email.address')
=> #
irb(main):003:0> user.email = 'ferricoxide@users.noreply.github.com'
=> "ferricoxide@users.noreply.github.com"
irb(main):004:0> user.skip_reconfirmation!
=> true
irb(main):005:0> user.save!
=> true
irb(main):006:0>
Once this is done, when I look at my profile page, my GitHub "noreply" address appears as verified (and all commits associated with that address show up with my Avatar).

Thursday, January 3, 2019

Isolated Network, You Say

The vast majority of my clients, for the past decade and a half, have been very security conscious. The frequency with which other companies end up in the news for data-leaks — either due to hackers or simply leaving an S3 bucket inadequately protected — has made many of them extremely cautious as they move to the cloud.

One of my customers has been particularly wary. As a result, their move to the cloud has included significant use of very locked-down and, in some cases, isolated VPCs. It has made implementing things both challenging and frustrating.

Most recently, I had to implement a self-hosted GitLab solution within a locked-down VPC. And, when I say "locked-down VPC", I mean that even the standard AWS service-endpoints have been (effectively) replaced with custom, heavily-controlled endpoints. It's, uh, fun.

As I was deploying a new GitLab instance, I noticed that its backup jobs were failing. Yeah, I'd done what I thought was sufficient configuration via the gitlab.rb file's gitlab_rails['backup_upload_connection'] configuration-block. I'd even dug into the documentation to find the juju necessary for specifying the requisite custom-endpoint. While I'd ended up following a false lead to the documentation for fog (the Ruby module GitLab uses to interact with cloud-based storage options), I ultimately found the requisite setting is in the Digital Ocean section of the backup and restore document (simply enough, it requires setting an appropriate value for the "endpoint" parameter).
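
For reference, the resulting gitlab.rb block ended up looking something like the following (the region, endpoint and bucket values shown are stand-ins rather than the customer's real ones):

gitlab_rails['backup_upload_connection'] = {
        'provider'        => 'AWS',
        'region'          => '<AWS_REGION>',
        'use_iam_profile' => true,
        'endpoint'        => 'https://<CUSTOM_S3_ENDPOINT>'
}
gitlab_rails['backup_upload_remote_directory'] = '<BACKUP_BUCKET>'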

However, that turned out to not be enough. When I looked through GitLab's error logs, I saw that it was getting SSL errors from the Excon Ruby module. Yes, everything in the VPC was using certificates from a private certificate authority (CA), but I'd installed the root CA into the OS's trust-chain. All the OS-level tools were fine with using certificates from the private CA. All of the AWS CLIs and SDKs were similarly fine (since I'd included logic to ensure they were all pointing at the OS trust-store) - doing `aws s3 ls` (etc.) worked as one would expect. So, I ended up digging around some more and found the in-depth configuration-guidance for SSL, including the note at the beginning of the Details on how GitLab and SSL work section:

GitLab-Omnibus includes its own library of OpenSSL and links all compiled programs (e.g. Ruby, PostgreSQL, etc.) against this library. This library is compiled to look for certificates in /opt/gitlab/embedded/ssl/certs.

This told me I was on the right path. Indeed, reading down just a page-scroll further, I found:

Note that the OpenSSL library supports the definition of SSL_CERT_FILE and SSL_CERT_DIR environment variables. The former defines the default certificate bundle to load, while the latter defines a directory in which to search for more certificates. These variables should not be necessary if you have added certificates to the trusted-certs directory. However, if for some reason you need to set them, they can be defined as environment variables.

So, I added the following block to my gitlab.rb:

gitlab_rails['env'] = {
        "SSL_CERT_FILE" => "/etc/pki/tls/certs/ca-bundle.crt"
}

Then I did a quick `gitlab-ctl reconfigure` to make the new settings active in the running service. Afterwards, my GitLab backups to S3 worked without further issue.

Notes:

  • We currently use the Omnibus installation of GitLab. Methods for altering source-built installations will be different. See the GitLab documentation.
  • The above path for the "SSL_CERT_FILE" parameter is appropriate for RedHat/CentOS 7. If using a different distro, consult your distro's manuals for the appropriate location.