Wednesday, November 4, 2020

Increasing Verbosity of Ansible Jobs

Sometimes, Ansible doesn't really have native methods for installing and/or configuring some types of content. As a result, you may find yourself resorting to Ansible's shell: or command: modules – basically shell-out ("escape") methods – to execute such tasks. While these methods are ok for smaller tasks, for larger tasks they can expose you to a number of problems:
  • The shell-out can take quite a long time to return
  • The shell-out can leave you guessing what it's actually doing – leaving you wondering:
    • "is it actually still working"
    • "is it hung"
    • "is this going to leave me waiting forever or will it ultimately time out"
    • etc.
  • The shell-out can be uninformative.
  • The shell-out can return an incorrect status:
    • It may report a change where none actually occurred
    • It may report a success where there was partial- or even total-failure
    • It may report a failure where there was partial-success
While there's often not a lot that can be done about execution time – things take as long as they take – the other problems are addressable.

Trying to fix the "what's it actually doing" problem from within the shell-out, itself, often isn't meaningful: Ansible gathers up all the shell-output and only returns it once the shell exits. That said, improving your shell-out's output isn't a wholly-wasted effort: making your shell-out more verbose can help you if it does error-out; it can also provide greater assurance if you want to pore through the change: or ok: (success) output. This can help you with some of the "can be uninformative" problem, even if only after-the-fact.

Similarly, trying to fix the "return an incorrect status" problem strictly from within the shell-out doesn't necessarily provide a full solution. It can improve the overall reliability of the shell-out. However, it doesn't necessarily fix the status that Ansible uses when it tries to decide, "should I abort this run or should I continue" or any other contingent- or branching-logic you might want to implement.

Recently, I ran into these kinds of problems at one of my customer sites. They're a shop that has a significant percentage of their user-base that are data-science oriented. As such, they use Ansible to install the R language binaries along with a few hundred modules that their various users like to use. While they're an Ansible shop, they'd implemented the installation as a call to an external shell script. Ansible first pushes the script out and then executes it on the targets.

The script, itself, wasn't especially robustly-written: next to no error-handling or -reporting. It's basically a launch-and-pray kind of tool. On the plus side, they had thought ahead well enough to provide a mechanism for determining where in the shell-managed installation-process things had died. Basically, as the script runs, it creates a simple file containing a list of yet-to-be-installed modules from which you could infer that one or more of them had failed. That said, to see that file, you have to manually SSH to the managed-system and go view it.

Unfortunately, because the script is doing so many module-installs, it takes a really long time to execute (a few hours!). Because Ansible only reports script output when an invoked-script exits, Ansible pauses for a loooong time with no nerve-soothing output. And, as previously mentioned, if it does fail, you're stuck having to visit managed-systems to try to figure out why.

I'm not real tolerant of waiting for things to run with no output to tell me, "at least I know that it's doing something." So, I set out to refactor. While I'd hoped there was a native Ansible module for the task, my Google-fu wasn't able to turn anything up (my particular customer's environment doesn't lend itself well to using extensions to Ansible functionality such as one might find on Galaxy or GitHub). So, I too resorted to the shell-escape method.

That said, upon looking at the external shell script they wrote, I realized it was an extremely simple script. As such, I opted to replace it with an equivalent shell: | block in the associated Ansible plays.

---
- name: Iteratively install R-modules
  args:
    executable: /bin/bash
  changed_when: "'Added' in modInstall_result.stdout"
  environment:
    http_proxy: 'http://{{proxy_user}}:{{proxy_password}}@{{proxy_host}}:80/'
    https_proxy: 'http://{{proxy_user}}:{{proxy_password}}@{{proxy_host}}:80/'
  failed_when: "modInstall_result.rc != 0 or 'had non-zero exit status' in modInstall_result.stderr"
  register: modInstall_result
  shell: |
    if [[ {{ item }} =~ ":" ]]
    then
       PACKAGE="$( echo {{ item }} | cut -d : -f 1 )"
       VERSION="$( echo {{ item }} | cut -d : -f 2 )"
       VERSTRING="version = '${VERSION}',"
    else
       PACKAGE="{{ item }}"
       VERSTRING=""
    fi

    Rscript --slave --no-save --no-restore-history -e "
    if (! ('${PACKAGE}' %in% installed.packages()[,'Package'])) {
      require(devtools);
      install_version(package = '${PACKAGE}', ${VERSTRING} upgrade = 'never', repos=c('http://cran.us.r-project.org'));
      print('Added');
    } else {
      print('Already installed');
    }"
  with_items: "{{ Rmodules.split('\n') }}"
...
The value of the refactored approach is that, instead of waiting hours for output, there's output associated with each installation-attempt. Further, the output is all captured on the host running the Ansible play: no having to visit managed systems to find something resembling a logfile.

Explaining the Play... 

When I construct plays, I like to order the YAML alphabetically (with the exception of the "name" parameter – that always goes first). Which is to say, anything at a given directive-level will be ordered from A-Z. Some people prefer to put things in something more-resembling a functional order. I choose alphabetical because it makes it easier for me to cross-reference with the documentation.
  • "name" is fairly self-explanatory. It just provides a human-friendly indication of what Ansible is doing. In this case, iteratively installing R-modules (duh!).
  • "args":  This can have a number of sub-parameters (I have yet to dig through the documentation or source to find all of them). I've only ever had use for the "executable" sub-parameter.  
  • "changed_when": This parameter allows you to tell Ansible how to know that the shell-escape changed something. In this instance, I'm having it evaluate data contained in a variable named "modInstall_result" (set later via the "register" action).
  • "executable": this is actually a sub-parameter of the "args" parameter, above. It allows you to explicitly tell Ansible which interpreter to use. I'm pedantic, so, I like to tell it, "use /bin/bash".
  • "environment": this allows you to set execution-specific environment-variables for the shell to use. In this case, I'm setting the "http_proxy" and "https_proxy" environmentals. This is necessary because the build environment is isolated and I'm trying to let R's in-built URL-fetcher pull content from public, internet-hosted repositories (see this vendor documentation for explanation). The customer doesn't have a full CRAN mirror, so, leveraging this installation method minimizes having to account for dependencies.
  • "failed_when": This parameter allows you to tell Ansible how to know that the shell-escape failed. In this instance, I'm having it evaluate data contained in a variable named "modInstall_result" (set later via the "register" action).
  • "register": Used in this manner, it collects all of the inputs to and outputs produced by the shell – along with its exit code – and stores them in a JSON-formatted variable. In this case, the variable is named "modInstall_result". Data can be extracted via regular JSON-extraction methods.
  • "shell": This is the actual code that Ansible will execute (via the previously-requested /bin/bash interpreter). I'll explain the block's contents, shortly.
  • "with_items": This is one of the ways that Ansible allows you to run a given play iteratively. External to this play, I had defined the "Rmodules" variable to read in a text file – via Ansible's lookup() function – that contained one R module-name per line (and, optionally, an associated version-number). The "with_items" parameter-value is in the form of a list. The lookup() function originally created the "Rmodules" value as a single string with embedded newlines. Using the .split() function converts that string into a list. As Ansible iterates this list, each list-element is popped off and assigned to the temporary-variable "item".
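The definition of the "Rmodules" variable isn't shown above; a minimal sketch – assuming the module-list lives in a file named "files/R_modules.txt" (a hypothetical name), one module per line – might look like:

```yaml
Rmodules: "{{ lookup('file', 'files/R_modules.txt') }}"
```

Ansible's lookup() returns the file's contents as a single string, which is why the with_items value then needs the .split('\n') to turn it into a list.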

Explaining the script-stanza:
The "shell:" module can be invoked as either a single line or as a block of text. Using "|" as the value for the module tells it that the following, indented block of text is to be treated as a single input-block. For anything but very short script-content, I prefer the block form for its readability.

The BASH content, itself, is basically two parts.
  • The first part takes the value of the iterated "item" temporary-variable and parses it. If the string contains a colon, the string is split into PACKAGE and VERSION BASH-variables (with the VERSION BASH-variable being further expanded into a version-string statement suitable for use with the Rscript command). If the string does not contain a colon, the PACKAGE BASH-variable is set to the R module-name and the version-string BASH-variable is set to a null/empty value.
  • The second part is the Rscript logic. Using R's installed.packages() function, the existing R installation is checked for the presence of the R module-name contained in the PACKAGE BASH-variable. If not present, R's install_version() function is used, along with the PACKAGE BASH-variable and the version-string variable to install (an optionally versioned) R-module. The if check helps with idempotency – preventing attempts to reinstall the module and the not-inconsiderable time that reinstallation can take.
Note that, in order for the Rscript logic to work, the devtools (link goes to a specific, older version; other versions should work) R module must have been previously installed. In my case, this installation has been taken care of in a prior Ansible play (not presented here: the primary focus of this article was to illustrate how to use iteration to make for more-verbose configuration-management) 
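Pulled out of the play, the name:version parsing can be exercised stand-alone (a sketch, with a hard-coded sample value in place of Ansible's loop-variable):

```shell
# Stand-alone version of the play's {{ item }} parsing; "ggplot2:3.3.2"
# is a made-up sample entry standing in for a line of the module-list
item="ggplot2:3.3.2"
if [[ ${item} =~ ":" ]]
then
   PACKAGE="$( echo "${item}" | cut -d : -f 1 )"
   VERSION="$( echo "${item}" | cut -d : -f 2 )"
   VERSTRING="version = '${VERSION}',"
else
   PACKAGE="${item}"
   VERSTRING=""
fi
printf '%s %s\n' "${PACKAGE}" "${VERSTRING}"
```

Running it prints the package-name plus the version-string fragment that gets spliced into the install_version() call.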

Tuesday, October 27, 2020

Smashing Walls of Text

On a few projects I work on, we make use of Ansible to automate system-configuration tasks. For better or worse, the automation relies on some upstream content that is outside the control of the customer. Translation: every few months, automation that has worked for months will suddenly no longer work.

By itself, this is more an inconvenience than a real problem. However, because the automation is handed off to other, more-junior people to execute, when errors are encountered, those others are frequently at a loss as to what to do.

Frequently, they don't bother to include logs of any sort ...or, if they do, they include log snippets (or, worse: screen-caps of text-based log snippets!) that frequently don't contain the information critical to solving the problem. So, the first response in the support request is a "send me all the logs" type of reply.

Now, depending on the size of the Ansible job, the associated log-file might be HUGE. Generally, I'm only interested in where Ansible has failed. Parsing through an entire Ansible log-file can be like trying to find a specific brick in the Great Wall of China. So, to help preserve my sanity, I cooked up a quick BASH script to both help "cut to the chase" and provide more-easily readable log-output. That script looks like:

#!/bin/bash
#
# Script to filter the output of Ansible log files to something more-readable
#
###########################################################################
LOGFILE="${1:-/tmp/Ansible.log}"

mapfile -t ERROUT < <(
   grep ^failed: "${LOGFILE}"
)

# Bail out if no failure-messages were found
if [[ ${#ERROUT[@]} -lt 1 ]]
then
   echo "No failure-messages found in ${LOGFILE}"
   exit 0
fi

# Iterate over error-string
ITER=0
while [[ ${ITER} -lt ${#ERROUT[@]} ]]
do
   printf '##########\n## %s\n##########\n' "$(
      echo "${ERROUT[${ITER}]}" | sed 's/^failed:.* => //' | 
      python3 -c "import sys, json; print(json.load(sys.stdin)['item'])"
   )"
   
   echo "${ERROUT[${ITER}]}" | sed 's/^failed:.* => //' | \
     python3 -c "import sys, json; print(json.load(sys.stdin)['stderr'])" | sed 's/^/    /'
     
   printf '####################\n\n\n'
   ITER=$(( ITER + 1 ))
done
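The two python3 one-liners in the script do the heavy lifting. The same extraction, sketched as a single snippet (the log-line here is a fabricated example of the shape Ansible emits for loop-item failures):

```python
import json

# Fabricated example of an Ansible "failed:" log-line
line = 'failed: [host01] (item=foo) => {"item": "foo", "stderr": "boom"}'

# Strip everything up to the JSON payload, then parse it
payload = json.loads(line.split(" => ", 1)[1])
print(payload["item"])    # the loop-item that failed
print(payload["stderr"])  # the captured error output
```

Everything after the " => " token is plain JSON, which is what makes the log-file filterable in the first place.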

Wednesday, August 5, 2020

Implementing (Pseudo) Profiles in Git

I'm an automation consultant for an IT contracting company. Using git is a daily part of my work-life. Early on, things were easy: set up my (one) identity in GitHub, GitLab, BitBucket, etc. depending on the requirements of the customer. I always just used my personal email address for each. Then things started shifting, a bit. Some customers wanted me to use my corporate email address as my ID. Annoying, but not an especially big deal, by itself. Then, some wanted me to use their privately-hosted repositories and wanted me to use identities issued by them.

Personally, I'm a fan of git-over-ssh. I also like to use signed-commits (whether my customers require them or not). Unfortunately, when you move to customer-prescribed or -issued(!) SSH and GPG keys, things get a bit more annoying.

While git has mechanisms for using per-project keys, it means abandoning the stuff you've set via `git config --global …`. If your customer only has one project they want you to contribute to, `git config --local …` is annoying but not especially cumbersome or prone to accidentally committing with the wrong signature (fortunately, you can't commit with the wrong SSH key, since you'll get a big, fat permission denied type of error if you try). However, most of my customers have multiple projects they want me to make commits to. 

Having to remember to `git config --local …` for multiple projects introduces more opportunities to forget to do so. Which means it also introduces more opportunities to apply an incorrect signature to a commit. For many tools, I'd set up a new profile, virtual-environment, etc. Unfortunately, git doesn't (directly) have that capability. That said, you can fake that capability by using the git client's `includeIf` configuration directive (and, if you're like me, your ${HOME}/.ssh/config).

SSH Setup:

  1. Generate – or retrieve(!) – a project-specific SSH key
  2. Open your ${HOME}/.ssh/config file for editing
  3. Add a stanza similar to the following:
    Host gitlab.<private.domain>
      IdentityFile /home/<USERID>/.ssh/<PRIVATE_KEY_NAME>
      AddKeysToAgent yes
Note: The `AddKeysToAgent` config-statement is iffy. While it injects the key into your keychain and relieves you of having to provide your key's password (you did set a password on the key, right?) every time you do an operation against the remote, the order in which the keychain presents keys may not be adequately predictable. If you have a bunch of SSH keys (say, more than three) in your keychain, you may find that you start getting "too many authentication failures" errors from the remote's SSH authentication process.

Git Setup:

  1. Create a project-group git config file (say, "/home/<USERID>/.gitconfig-<PROJECT_GROUP>")
  2. Create a directory specific to the project-group (e.g., "/home/<USERID>/GIT/<PROJECT_GROUP>")
  3. Add the configuration-contents you'd normally place in "/home/<USERID>/.gitconfig" into this file: any competing directives in the include file will override the "/home/<USERID>/.gitconfig" file's contents.
  4. Tell git the conditions under which to use the project-group config file
    $ git config --global includeIf."gitdir:/home/<USERID>/GIT/<PROJECT_GROUP>".path \
        /home/<USERID>/.gitconfig-<PROJECT_GROUP>
    The above is both ugly and not especially clearly-documented (it literally took a half-dozen iterations before finding the right command: hand-editing the file might be easier).
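For reference, here's roughly what the resulting stanza in "/home/<USERID>/.gitconfig" looks like, plus a sketch of what a project-group config file might contain (all values below are placeholders):

```ini
# ~/.gitconfig – the conditional include added by the command above
# (a trailing "/" on the gitdir pattern makes it match all subdirectories)
[includeIf "gitdir:/home/<USERID>/GIT/<PROJECT_GROUP>/"]
    path = /home/<USERID>/.gitconfig-<PROJECT_GROUP>

# ~/.gitconfig-<PROJECT_GROUP> – example per-customer overrides
[user]
    name = Your Name
    email = you@customer.example
    signingkey = 0123456789ABCDEF
[commit]
    gpgsign = true
```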

Profit:

Once the above is done, any git operations performed within the "/home/<USERID>/GIT/<PROJECT_GROUP>" directory should be governed by the directives in the "/home/<USERID>/.gitconfig-<PROJECT_GROUP>" file:
  • If you're using project-specific SSH keys, a `git clone` will succeed
  • If you're using commit-signing, the signed commit will have the project-specific signature applied (use `git log --show-signature` to verify)
If neither of the preceding is true, something's not set up correctly in your pseudo-profile.

Wednesday, July 15, 2020

Sometimes The (Workable) Answer Is Too Simple to See

One of the tasks I was asked to tackle was helping the team I'm working with move their (Python and Ansible-based) automation-development efforts from a Windows environment to a couple of Linux servers.

Were the problem to only require addressing how I tend to work, the solution would have been very straight-forward. Since I work across a bunch of contracts, I tend to select tooling that is reliably available: pretty much just git + vi ...and maybe some linting-containers.

However, on this new contract, much of my team doesn't work that way. While they've used git before, it was mostly within the context of VSCode. So, this required me to solve two problems – one relevant to the team I'm enabling and one possibly relevant only to me: making it so the team could use VSCode without Windows; and making it so that I wouldn't be endlessly prompted for passwords.

The "VSCode without Windows" problem was actually the easier of the two. It boiled down to:

  1. Install the VSCode server/agent on the would-be (Linux-based) dev servers
  2. Update the Linux-based dev servers' /etc/ssh/sshd_config file's AllowTcpForwarding setting (this one probably shouldn't have been necessary: the VSCode documentation indicates that one should be able to use UNIX domain-sockets on the remote host; however, the setting for doing so didn't appear to be available in my VSCode client)
  3. Point my laptop's VSCode to the Linux-based dev servers
Because I'm lazy, I hate having to enter passwords over and over. This means that, to the greatest degree possible, I make use of things like keyrings. In most of my prior environments, things were PuTTY-based. So, my routine, upon logging in to my workstation each morning, included "fire up Pageant and load my keys": whether then using the PuTTY or MobaXterm ssh-client, this meant no need to enter passwords (the rest of the day) beyond having entered them as part of Pageant's key-loading.

According to the documentation, VSCode isn't compatible with PuTTY – and, by extension, not compatible with Pageant. So, I dutifully Googled around for how to solve the problem. Most of the hits I found seem to rely on having a greater level of access to our customer-issued laptops than what we're afforded: I don't have enough permissions to even check if our PowerShell installations have the OpenSSH-related bits.

Ultimately, I turned to my employer's internal Slack channel for our automation team and posed the question. I was initially met with links to the same pages my Google searches had turned up. Since our customer-issued laptops do come with Git-BASH installed, someone suggested setting up its keyring and then firing-up VSCode from within that application. Being so used to accessing Windows apps via clicky-clicky, it totally hadn't occurred to me to try that. It actually worked (surprising both me and the person who suggested it). 

That said, it means I have an all-but-unused Git-BASH session taunting me from the task-bar. Fortunately, I have the taskbar set to auto-hide. But still: "not tidy".

Also: because everybody uses VSCode on this project, nobody really uses Git-BASH. So, any solution I propose that uses it will require further change-accommodation by the incumbent staff.

Fortunately, most of the incumbent staff already uses MobaXterm when they need CLI-based access to remote systems. Since MobaXterm integrates with Pageant, it's a small skip-and-a-jump to have VSCode use MobaXterm's keyring service ...which pulls from Pageant. Biggest change will be telling them "Once you've opened Moba, invoke VSCode from within it rather than going clicky-clicky on the pretty desktop icon".

I'm sure there's other paths to a solution. Mostly comes down to: A) time available to research and validate them; and, B) how much expertise is needed to use them, since I'll have to write any setup-documentation appropriate to the audience it's meant to serve

Monday, July 6, 2020

Taming the CUDA (Pt. II)

So, today, finally had a chance to implement in Ansible what I'd learned in Taming the CUDA.

Given that it takes a significant time to run the uninstall/new-install/reboot operation, I didn't want to just blindly execute the logic. So, I wanted to implement logic that checked to see what version, if any, of the CUDA drivers were already installed on the Ansible target. First step to this was as follows:
- name: Gather the rpm package facts
  package_facts:
    manager: auto
This tells Ansible to check the managed-host and gather package-information for all installed RPMs, stuffing the results into the `ansible_facts.packages` fact. That fact is a JSON structure that's then referencable by subsequent Ansible actions. Since I'm only interested in the installed version of the base cuda RPM, I'm able to grab that information via `ansible_facts.packages['cuda'][0].version` and use it in a `when` conditional.

Because I had multiple actions that I wanted to make conditional on a common condition, I didn't want to have a bunch of configuration-blocks with the same conditional statement. Did some quick Googling and found that, yes, Ansible does support executing multiple steps within a shared-condition block. You just have to use (wait for it...)  the `block` statement in concert with the shared condition-statement. When you use that statement, you then nest actions that you might otherwise have put in their own, individual action-blocks. In my case, the block ended up looking like:
- name: Update CUDA drivers as necessary
  block:
    - name: Copy CUDA RPM-repository definition
      copy:
        src: files/cuda-rhel7-11-0-local.repo-DSW
        dest: /etc/yum.repos.d/cuda-rhel7-11-0-local.repo
        group: 'root'
        mode: '000644'
        owner: 'root'
        selevel: 's0'
        serole: 'object_r'
        setype: 'etc_t'
        seuser: 'system_u'
    - name: Uninstall previous CUDA packages
      shell: |
          UNDOID=$( yum history info cuda | sed -n '/Transaction ID/p' | \
                    cut -d: -f 2 | sed 's/^[     ]*//g' | sed -n 1p )
          yum -y history undo "${UNDOID}"
    - name: Install new CUDA packages (main)
      yum:
        name:
          - cuda
          - nvidia-driver-latest-dkms
        state: latest
    - name: Install new CUDA packages (drivers)
      yum:
        name: cuda-drivers
        state: latest
  when:
    ansible_facts.packages['cuda'][0].version.split('.')[0]|int < 11
I'd considered doing the shell-out a bit more tersely – something like:
yum -y history undo $( yum history info cuda | \
sed -n '/Transaction ID/p' | cut -d: -f 2 | sed -n 1p)
But figured what I ended up using was marginally more readable for the very junior staff that will have to own this code after I'm gone.

Any way you slice it, though, I'm not super chuffed that I had to resort to a shell-out for the targeted/limited removal of packages. So, if you know a more Ansible-y way of doing this, please let me know.

I'd have also finished-out with one yum install-statement rather than the two, but the nVidia documentation for EL7 explicitly states to install the two groups separately. 🤷

Oh... And because I didn't want my `when` statement to be tied to the full X.Y.Z versioning of the drivers, I added the `split()` method so I could match against just the major number. Might have to revisit this if they ever reach a point where they care about the major and minor or the major, minor and release number. But, for now, the above suffices and is easy enough to extend via a compound `when` statement. Similarly, because Ansible defaults to string-output, I needed to forcibly cast the string-output to an integer so that numeric comparison would work properly.
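The same major-version check, sketched in plain Python (outside of Jinja2) to show why both the split() and the int-cast matter – "9.2.88" here is just an example X.Y.Z driver-version string:

```python
version = "9.2.88"  # example X.Y.Z driver-version string

major = version.split(".")[0]    # "9" – still a string
print(major < "11")              # False: lexicographic string comparison
print(int(major) < 11)           # True: the numeric comparison the `when` needs
```

Without the |int cast, "9" sorts after "11" lexicographically and the play's condition would silently skip the update.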

Final note: I ended up line-breaking where I did because yamllint had popped "too wide" alerts when I ran my playbook through it.

Thursday, July 2, 2020

Taming the CUDA

Recently, I was placed on a new contract supporting a data science project. I'm not doing any real data-science work, simply improving the architecture and automation of the processes used to manage and deploy their data-science tooling.

Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well-disposed to running data science types of tasks. And, while RHEL is generically suitable to running on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.

Unfortunately, as third-party drivers, there's some gotchas with using them. The one they'd been most-plagued by was updating drivers as the GPU-vendor made further updates available. While doing a simple `yum upgrade` works for most packagings, it can be problematic when using third-party drivers. When you try to do `yum upgrade` (after having ensured the new driver-RPMs are available via `yum`), you'll ultimately get a bunch of dependency errors due to the driver DSOs being in use.

Ultimately, what I had to move to was a workflow that looked like:

  1. Uninstall the current GPU-driver RPMs
  2. Install the new GPU-driver RPMs
  3. Reboot
Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that." And, while I could have done something like `yum remove <DRIVER_STUB-NAME>`, doing so would result in more packages being removed than I intended.

Fortunately, RHEL (7+) includes a nice option with the `yum` package-management utility: `yum history undo <INSTALL_ID>`.

Due to the data science users individual EC2s being of varying vintage (and launched from different AMIs), the value of <INSTALL_ID> is not stable across their entire environment.

The automation gods giveth; the automation gods taketh away.

That said, there's a quick method to make the <INSTALL_ID> instability pretty much a non-problem:

yum history undo $( yum history info <rpm_name>| \
   sed -n '/Transaction ID/p' | \
   cut -d: -f 2 )
Which is to say "Undo the yum transaction-ID returned when querying the yum history for <rpm_name>". Works like a champ and made the overall update process go very smoothly.

Now to wrap it up within the automation framework they're leveraging (Ansible). I don't think it natively understands the above logic, so, I'll probably have to shell-escape to get step #1 done.

Wednesday, June 10, 2020

TIL: Podman Cleanup

Recently, I started working on a gig that uses Ansible for their build-automation tasks. While I have experience with other types of build-automation frameworks, Ansible was new to me.

Unfortunately, my customer is very early in their DevOps journey. While my customer has some privately-hosted toolchain services, they're not really fully fleshed out: their GitLab has no runners; their Jenkins is not general access; etc. In short, not a lot of ability to develop in their environment — at least not in a way that allows me to set up automated validation of my work.

Ultimately, I opted to move my initial efforts to my laptop with the goal of exporting the results. Because my customer is a RHEL environment, I set up RHEL and CentOS 7 and 8 VMs on my laptop via Hyper•V. 

Side-note: While on prior laptops I used other virtualization solutions, I'm using Hyper•V because it came with Windows 10, not because I prefer it over other options. Hypervisor selection aside…

As easy as VMs are to rebuild, I've yet to actually take the time out to automate my VMs' builds to make it less painful if I do something that renders one of them utterly FUBAR. Needless to say, I don't particularly want to crap-up my VMs, right now. So, how to provide a degree of blast-isolation within those VMs to hopefully better-avoid not-yet-automated rebuilds?

Containers can be a great approach. And, for something as simple as experimenting with Ansible and writing actual playbooks, it's more than sufficient. That said, since my VMs are all Enterprise Linux 7.8 or higher, Podman seemed the easier path than Docker ...and definitely easier than either full Kubernetes or K3S. After all, Podman is just a `yum install` away from being able to start cranking containers. Podman also means I can run containers in user-space (without needing to set up Kubernetes or K3S), which further limits how hard I can bone myself.

At any rate, I've been playing around with Ansible, teaching myself how to author flexible playbooks and even starting to write some content that will eventually go into production for my customer. However, after creating and destroying dozens of containers over the past couple weeks, I happened to notice that the partition my ${HOME} is on was nearly full. I'd made the silly assumption that when I killed and removed my running containers that the associated storage was released. Instead, I found that my ${HOME}/.local/share/containers was chewing up nearly 4GiB of space. Worse, when I ran find (ahead of doing any rms), I was getting all sorts of permission denied errors. This kind of surprised me since I thought that, by running in user-space, any files that would be created would be owned by me.

So, I hit up the almighty Googs. I ended up finding Dan Walsh's blog-entry on the topic. Turns out that, because of how Podman uses name-spaces, it creates files that my non-privileged user can't actually directly access. Per the blog-entry, instead of being able to just do find ${HOME}/.local/share/containers -mtime +3 | xargs rm, I had to invoke buildah unshare and do my cleanup using that context.

So, "today I learned" ...and now I have over 3GiB of the nearly 4GiB of space back.

Friday, June 5, 2020

Ansible Journey: Adding /etc/fstab Entries

As noted in yesterday's post, I'm working on a new customer-project. One of the automation-tools this customer uses is Ansible. This is a new-to-me automation-technology. Previously — and aside from just writing bare BASH and Python code — I've used frameworks like Puppet, SaltStack and a couple others. So, picking up a new automation-technology — especially one that uses a DSL not terribly unlike one I was already familiar with, hasn't been super much of a stretch.

After sorting out yesterday's problem and how I wanted my /etc/fstab to look, I set about implementing it via Ansible. Ultimately, I ended up settling on a list-of-maps variable to drive a lineinfile role-task. I chose a list-of-maps variable mostly because the YAML that Ansible relies on doesn't really do tuples. My var ended up looking like:

s3fs_fstab_nest:
  - mountpoint: /provisioning/repo
    bucket: s3fs-build-bukkit
    folder: RPMs
  - mountpoint: /provisioning/installers
    bucket: s3fs-build-bukkit
    folder: EXEs
  - mountpoint: /Data/personal
    bucket: s3fs-users-bukkit
    folder: build

And my play ended up looking like:

---
- name:  "Add mount to /etc/fstab"
  lineinfile:
    path: '/etc/fstab'
    line: "s3fs#{{ item.bucket }}:/{{ item.folder }}\t{{ item.mountpoint }}\tfuse\t_netdev,allow_other,umask=0000,nonempty 0 0"
  loop: "{{ s3fs_fstab_nest }}"
...

Was actually a lot simpler than I was expecting it to be.
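As a sanity-check, the rendered /etc/fstab lines can be previewed outside of Ansible. Plain-Python string formatting stands in for the Jinja2 templating here (same tab-separated fstab fields):

```python
# Same shape as the s3fs_fstab_nest variable (first two entries)
entries = [
    {"mountpoint": "/provisioning/repo", "bucket": "s3fs-build-bukkit", "folder": "RPMs"},
    {"mountpoint": "/provisioning/installers", "bucket": "s3fs-build-bukkit", "folder": "EXEs"},
]

# Mirrors the lineinfile "line" template: device, mountpoint, fstype, options
template = "s3fs#{bucket}:/{folder}\t{mountpoint}\tfuse\t_netdev,allow_other,umask=0000,nonempty 0 0"

for entry in entries:
    print(template.format(**entry))
```

Each printed line is exactly what lineinfile would append to /etc/fstab for that list-element.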

Thursday, June 4, 2020

TIL: You Gotta Be Explicit

Started working on a new contract, recently. This particular customer makes use of S3FS. To be honest, in the past half-decade, I've had a number of customers express interest in S3FS, but they've pretty much universally turned their noses up at it (due to any number of reasons that I can't disagree with — trying to use S3 like a shared filesystem is kind of horrible).

At any rate, this customer also makes use of Ansible for their provisioning automation. One of their "plays" is designed to mount the S3 buckets via s3fs. However, the manner in which they implemented it seemed kind of jacked to me: basically, they set up a lineinfile-based play to add s3fs commands to the /etc/rc.d/rc.local file, and then do a reboot to get the filesystems to mount up.

It wasn't a great method, to begin with, but, recently, their security people made a change to the IAM objects they use to enable access to the S3 buckets. It, uh, broke things. Worse, because of how they implemented the s3fs-related play, there was no error trapping in their work-flow. Jobs that relied on /etc/rc.d/rc.local having worked started failing with no real indication as to why (when you pull a file directly from S3 rather than via an s3fs mount, it's pretty immediately obvious what's going wrong).

At any rate, I decided to try to see if there might be a better way to manage the s3fs mounts. So, I went to the documentation. I wanted to see if there was a way to make them more "managed" by the OS such that, if there was a failure in mounting, the OS would put a screaming-halt to the automation. Overall, if I think a long-running task is likely to fail, I'd rather it fail early in the process than after I've been waiting for several minutes (or longer). So I set about simulating how they were mounting S3 buckets with s3fs.

As far as I can tell, the normal use-case for mounting S3 buckets via s3fs is to do something like:

s3fs <bucket> <mount> -o <OPTIONS>

However, they have their buckets cut up into "folders" and sub-folders and wanted to mount them individually. The s3fs documentation indicated that you could both mount individual folders and that you could do it via /etc/fstab. You simply needed an /etc/fstab that looks sorta like:
s3fs-build-bukkit:/RPMs    /provisioning/repo       fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0
s3fs-build-bukkit:/EXEs    /provisioning/installer  fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0
s3fs-users-bukkit:/build   /Data/personal           fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0

However, I was finding that, even though the mount-requests weren't erroring, they also weren't mounting. So, I hit up the almighty Googs and found an issue-report in the S3FS project that matched my symptoms. The issue ultimately linked to a (poorly-worded) FAQ-entry. In short, I was used to implicit "folders" (ones that exist by way of an S3 object containing a slash-delimited key), but s3fs relies on explicitly-created "folders" (e.g., null objects with key-names that end in `/` — as would be created by doing `aws s3api put-object --bucket s3fs-build-bukkit --key test-folder/`). Once I explicitly created these trailing-slash null-objects, my /etc/fstab entries started working the way the documentation indicated they should have all along.
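If you have several fstab entries to support, scripting the folder-creation is straightforward. A sketch using the bucket and key names from the examples above; the aws calls are echo'd here for illustration, so drop the `echo` (and have valid AWS credentials) to actually run them:

```shell
# Emit the aws CLI calls that create the explicit trailing-slash
# "folder" null-objects s3fs needs, one per fstab entry. The commands
# are echo'd for safety; remove the "echo" to actually execute them.
make_folder() {
  echo aws s3api put-object --bucket "$1" --key "$2/"
}
make_folder s3fs-build-bukkit RPMs
make_folder s3fs-build-bukkit EXEs
make_folder s3fs-users-bukkit build
# → aws s3api put-object --bucket s3fs-build-bukkit --key RPMs/ (etc.)
```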


Friday, May 22, 2020

But We Want a Pre-Authentication Consent Banner!

My primary client-base is security- and compliance-focussed. As part of this, they follow security guidelines that include things like "system must display consent-to-monitor banners".

Up until recently, I only ever worked on the server end of things — almost exclusively UNIX- or Linux-based (though, the past decade has been almost exclusively Linux). This meant that all I really had to worry about was ensuring a suitable /etc/issue file was in place and that any interactive login-services (mostly SSH; occasionally FTP) were configured to reference that file.

Any GUI-dependent developers that my customers might have had did all of their development work on Windows boxes or specially-prepared Linux desktops. In either case, they were using on-premises desktops that I didn't have to worry about.

With COVID19 and the mass adoption of telework, those carefully-prepared development desktops are sitting disused at my customers' offices. Developers have been forced to work remote. As such, we've been asked, as part of our "enablement" mandate, to help these newly-remote developers set up cloud-hosted development-systems. A significant percentage of these developers rely on graphical tools. While most — if not all — of these graphical tools would be usable as individually-launched applications with their X11 tunneled through SSH back to the workstations they're using, that hasn't been enough for some of them:
  • Many don't have X11 servers installed on their laptops
    • I usually point BYOD Windows users to Cygwin/X, MobaXterm, Xming, or VcXsrv, and tell corporate-issued Windows users to ask their employers to add one of the commercial Xserver offerings to their laptops
    • Though, given the number of Macintosh users in their ranks, many of them already have Xservers; it just seems like they don't quite understand that fact.
  • In either case, most don't seem to understand that the "easy button" solution is to add:
    Host *
      ForwardX11 yes
    to their ${HOME}/.ssh/config files. Doing so allows them to SSH to their development-host and, by executing <GRAPHICAL_UTILITY> on the remote server, have the utility magically appear on their laptop.
  • Even once you show them the magic of X11 forwarding, many will whine "I don't want to have to have a terminal open just to execute programs, I want the entire remote desktop so I can just use the launchers."
Result: I've been having to help Linux neophytes install the full Gnome desktop on top of our standard AMIs and figure out how to access them.

Our standard AMIs are RHEL- and CentOS-based and have little more on them than the @Core RPM-group. The AMIs also have boot-EBSes that are as small as we could make them and still meet the server security-requirements that require the AMIs to be carved up into partitions. These initially led to two common questions:

  1. "Can you create new AMIs that have enough room for us to install the Gnome desktop onto"; and
  2. "Can you create an AMI with the Gnome-desktop preinstalled". 
The answer to both has always been, "no, but if you follow the FAQ on how to expand the disk at launch-time, you'll have plenty of space for installing the Gnome desktop yourself".

Unfortunately, with the sudden proliferation of people building themselves graphical desktops, our various customers' security assessors have taken notice. That means that our (headless) server-oriented automated-hardening tools are not accounting for the now Gnome-enabled desktops. By itself, not a problem, but different assessors have different demands around banners.
For some assessors, simply painting the contents of /etc/issue on the Gnome login screen's root window is sufficient. Fortunately, the Gnome project provides instructions that are also somewhere in the neighborhood of "dead-easy to do". Once done, you have a login screen that looks something like:



Unfortunately, some assessors insist that your consent banner must be a pop-up that is displayed prior to the would-be-user being allowed to enter their username or password. Meeting this requirement is a skosh more-involved. Instead of doing the above Gnome mods, one needs to do:
  1. Move the existing /etc/gdm/Init/Default file contents aside (e.g., `mv /etc/gdm/Init/Default{,-DIST}`)
  2. Install a new /etc/gdm/Init/Default file with contents similar to:
    /usr/bin/zenity --text-info --width=700 --height=300 \
    --title="Security Message" --filename=/etc/issue
    Which uses Zenity to create a dialogue-box from the /etc/issue file.
  3. Restart the GDM service (e.g., `systemctl restart gdm`)
Doing this results in production of a pre-authentication pop-up similar to the following:



If you want something marginally fancier, you can install `yad` (from EPEL). If using `yad`, changing your /etc/gdm/Init/Default contents to something like:
/usr/bin/yad \
  --center \
  --buttons-layout=center \
  --button=OK:0 \
  --no-escape \
  --fontname="Monospace Regular 10" \
  --fore tomato4 \
  --text-info \
  --width=700 \
  --height=300 \
  --wrap \
  --title="Security Message" \
  --filename=/etc/issue

will produce a screen like:



The primary differences being the ability to set the text-color and to get rid of the extraneous "Cancel" button (and center the "OK" button).

Thursday, May 7, 2020

Punishing Network Performance

Recently, I took delivery of a new laptop. My old laptop was rolling up on five years old, was still running Windows 7 and had less than 1GiB of free disk space. So, "it was time".

At any rate, the new laptop runs Windows 10 Pro. One of the nice things about this is that, because Microsoft makes Hyper-V available for free for my OS, I no longer have to deal with VirtualBox (or any add-on virtualization solution, for that matter). It means I'm free of the endless parade of "you must reboot to install new VirtualBox drivers" that pretty much caused me to stop with local virtualization on my prior laptop.

One of the nice things with virtualization is that it lets me run Linux, locally. While WSL/WSL2 show promise, they aren't quite there yet. Similarly, I'm not super chuffed that Docker Desktop for Windows requires me to run Docker as root. So, my Linux VM is an EL8 host that I am able to run Podman on (and able to run my containers as an unprivileged user within that VM).

That said, most of the "how to run Linux on Hyper-V on Windows 10" guides I found resulted in my networking getting completely and totally jacked up. When not jacked up, my Internet connection looks like:


However, after initially booting my CentOS VM, I noticed that my yum downloads were running stoopid-slow. I assumed the problem was constrained to the VM. However, being detail-oriented, I checked the host OS's network speed. It was, uh, "not good". I was getting less than 4Mbps down and about 0.2Mbps up. I also found that, if I rebooted my system, my download speeds would come back, but my uploads still sucked. Lastly, I found that if I wholly de-configured Hyper-V, my networking returned to normal.

Sleuthing time.

I hit the Googs and everyone is saying "this is a known issue: you can fix it by changing your NIC settings." Unfortunately, the mentioned settings weren't available in my NIC. Dunno if it's because I'm using WiFi or what. All I know is that it was looking like my choices were going to be "have good Internet speeds or have Linux on Hyper-V".

I'm not real keen on such choices. So, I kept digging – eventually getting to the point where I went beyond the first page of results. Ultimately, I found an answer on one of the StackExchange forums. Interestingly, the answer I found wasn't even marked as the "best answer". Had I not read through all the responses on one thread, I wouldn't even have found it.

At any rate, the fix was to not use my NIC in bridged mode at all. Instead, I needed to ignore all the top-match guides' instructions to use an "External" interface (which puts your NIC into bridged mode) and, instead, use an "Internal" interface ...and then set up connection-sharing, allowing my private vSwitch to share (rather than bridge) through my WiFi adapter. As soon as I made that change, my speeds came back.

Tuesday, April 7, 2020

Crib Notes: Finding Missing Single- Or Double-Quote Pairs

So, today, I was writing a BASH utility. As per normal, I have a commit-time check that runs it through ShellCheck. That test came up green. However, when I ran the script, it finished by complaining:


XXXXXX.sh: line 290: unexpected EOF while looking for matching `"'
Naturally, my reaction was something along the lines of:




I mean, ShellCheck's usually pretty damned good about finding trivial flubs like that. So, I Googled about to see if there was an easy way to double-check things. My search was fruitful – I found this cute, little snippet:
| tr -cd '"\n' | awk 'length%2==1 {print NR, $0}'
I catted my script to that pipe and, for better or worse, it agreed with ShellCheck that my double-quotes were properly paired (swap the `"` in the tr expression for `'` to audit single-quotes the same way).
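For the record, a tiny self-test of the snippet: feed it two lines, one with balanced double-quotes and one without, and it should flag only the unbalanced line. The tr step strips everything except double-quotes and newlines, so awk just counts quote-characters per line:

```shell
# Line 1 has balanced quotes; line 2 is missing its closing quote.
# Expect the snippet to print line-number 2 (plus the surviving quote chars).
printf 'ok="good"\nbad="oops\n' | tr -cd '"\n' | awk 'length%2==1 {print NR, $0}'
# → 2 "
```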

Wednesday, February 26, 2020

Hold On A Second There, Partner

One of the devs on a project many of us work on was excited to announce that we'd topped 2000 commits on the project. I'm a pedant. I know that we configure our projects with a lot of "administrativa" bots, so I wanted to look harder at the numbers. What I found was:

The raw number of commits was:
$ git shortlog -s -n | awk '{ sum += $1 } END { print sum }'
2008

Total minus bot-commits:
$ git shortlog -s -n | grep -v bot | awk '{ sum += $1 } END { print sum }'
1476

Total minus bot- and merge-commits:
$ git shortlog -s -n --no-merges | grep -v bot | awk '{ sum += $1 } END { print sum }'
752
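The column-summation trick in those pipelines works on any `count name` style listing, not just git's. A standalone demo, fed a fake shortlog-style listing (the names here are made up):

```shell
# Sum column 1 of a shortlog-style listing, excluding bot accounts
# (same grep/awk pattern as the git pipelines above):
printf '  5\talice\n  3\tbob-bot\n  2\tcarol\n' |
  grep -v bot |
  awk '{ sum += $1 } END { print sum }'
# → 7
```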

I wasn't trying to rain on his parade, but it is kind of cool that you can break things out that way and see just how much of your project's activity is "administrativa" type of actions.

Also, removing the filtering/counting pipelines gives you a nice output of who's contributed and how much (though "how much" is only in commits ...which, in our case, are generally squashed when merging back from an individual developer's fork).

Tuesday, February 11, 2020

"Simple" Doesn't Always Mean It's Actually "Simpler"

Let it be known, if you're part of the group of people that have been foisting "simplified" markup tools on the community at large, I probably want to chop you in the adam's apple. HTML just ain't that hard to learn, especially the basics you'd need to do project documentation. And, if you find that your "simplified" documentation-language isn't sufficient for documentation tasks, the solution isn't to continue down the path of making your "simplified" markup language more complex. That's simply a sign that you screwed up and should probably set fire to what you've done to date.

We've been through this before with the whole "SGML was too hard, let's create HTML" debacle. I don't want to be back here again in 10-15 years having to deal with a plethora of new "simplified" markup languages just because today's "simplified" markup languages have become too complex.


  • A dozen plus flavors of things all claiming to be "markdown" isn't an improvement over knowing basic HTML and CSS
  • Having to differentiate the subtleties between each of the flavors isn't an improvement over knowing basic HTML and CSS.
  • Relying on bridge markup tools like reStructuredText isn't an improvement over knowing basic HTML and CSS (especially if I have to pollute my markdown with it). And, frankly, its syntax is clunkier and more gibberish-looking than either HTML or even troff/nroff.

Knock off the sprawling simplifications. You're not improving things, you're making things even more of a shit show (and, by extension, further discouraging people to write documentation at all).