Wednesday, July 15, 2020

Sometimes The (Workable) Answer Is Too Simple to See

One of the tasks I was asked to tackle was helping the team I'm working with move their (Python and Ansible-based) automation-development efforts from a Windows environment to a couple of Linux servers.

Were the problem to only require addressing how I tend to work, the solution would have been very straightforward. Since I work across a bunch of contracts, I tend to select tooling that is reliably available: pretty much just git + vi ...and maybe some linting-containers.

However, on this new contract, much of my team doesn't work that way. While they've used git before, it was mostly within the context of VSCode. So, this required me to solve two problems – one relevant to the team I'm enabling and one possibly relevant only to me: making it so the team could use VSCode without Windows; and making it so that I wouldn't be endlessly prompted for passwords.

The "VSCode without Windows" problem was actually the easier of the two. It boiled down to:

  1. Install the VSCode server/agent on the would-be (Linux-based) dev servers
  2. Update the Linux-based dev servers' /etc/ssh/sshd_config file's AllowTcpForwarding setting (this one probably shouldn't have been necessary: the VSCode documentation indicates that one should be able to use UNIX domain-sockets on the remote host; however, the setting for doing so didn't appear to be available in my VSCode client)
  3. Point my laptop's VSCode to the Linux-based dev servers
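
For step 2, it can help to see what a dev server's sshd_config currently says before changing it. The following is a minimal sketch, not part of the original procedure: the function name is hypothetical, it only looks at top-level (pre-`Match`) settings, and it assumes OpenSSH's documented behavior of using the first value found for a keyword and defaulting `AllowTcpForwarding` to `yes`.

```shell
# Hypothetical helper: report the top-level AllowTcpForwarding value found in
# an sshd_config file. Commented-out lines are ignored because their first
# field starts with "#". If the keyword is never set, print the default (yes).
tcp_forwarding_setting() {
  sshd_cfg="${1:-/etc/ssh/sshd_config}"
  awk '
    tolower($1) == "match" { exit }                  # ignore Match blocks
    tolower($1) == "allowtcpforwarding" {
      print tolower($2); found = 1; exit             # first value wins
    }
    END { if (!found) print "yes" }
  ' "${sshd_cfg}"
}
```

Calling `tcp_forwarding_setting /etc/ssh/sshd_config` prints `yes`, `no`, `local` or `remote`, depending on what the file sets. After editing the file, sshd still has to be reloaded for the change to take effect.
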
Because I'm lazy, I hate having to enter passwords over and over. This means that, to the greatest degree possible, I make use of things like keyrings. In most of my prior environments, things were PuTTY-based. So, my routine, upon logging in to my workstation each morning, included "fire up Pageant and load my keys": whether then using the PuTTY or MobaXterm ssh-client, this meant no need to enter passwords (the rest of the day) beyond having entered them as part of Pageant's key-loading.

According to the documentation, VSCode isn't compatible with PuTTY – and, by extension, not compatible with Pageant. So, I dutifully Googled around for how to solve the problem. Most of the hits I found seem to rely on having a greater level of access to our customer-issued laptops than what we're afforded: I don't have enough permissions to even check if our PowerShell installations have the OpenSSH-related bits.

Ultimately, I turned to my employer's internal Slack channel for our automation team and posed the question. I was initially met with links to the same pages my Google searches had turned up. Since our customer-issued laptops do come with Git-BASH installed, someone suggested setting up its keyring and then firing-up VSCode from within that application. Being so used to accessing Windows apps via clicky-clicky, it totally hadn't occurred to me to try that. It actually worked (surprising both me and the person who suggested it). 

That said, it means I have an all-but-unused Git-BASH session taunting me from the task-bar. Fortunately, I have the taskbar set to auto-hide. But still: "not tidy".

Also: because everybody uses VSCode on this project, nobody really uses Git-BASH. So, any solution I propose that uses it will require further change-accommodation by the incumbent staff.

Fortunately, most of the incumbent staff already uses MobaXterm when they need CLI-based access to remote systems. Since MobaXterm integrates with Pageant, it's a small skip-and-a-jump to have VSCode use MobaXterm's keyring service ...which pulls from Pageant. Biggest change will be telling them "Once you've opened Moba, invoke VSCode from within it rather than going clicky-clicky on the pretty desktop icon".

I'm sure there are other paths to a solution. It mostly comes down to: A) time available to research and validate them; and, B) how much expertise is needed to use them, since I'll have to write any setup-documentation appropriate to the audience it's meant to serve.

Monday, July 6, 2020

Taming the CUDA (Pt. II)

So, today, finally had a chance to implement in Ansible what I'd learned in Taming the CUDA.

Given that it takes a significant amount of time to run the uninstall/new-install/reboot operation, I didn't want to just blindly execute the logic. So, I wanted to implement logic that checked to see what version, if any, of the CUDA drivers was already installed on the Ansible target. First step to this was as follows:
- name: Gather the rpm package facts
  package_facts:
    manager: auto
This tells Ansible to check the managed-host, gather information about its installed packages and stuff the results into `ansible_facts.packages`. This variable is a JSON structure that's then referenceable by subsequent Ansible actions. Since I'm only interested in the installed version of the base cuda RPM, I'm able to grab that information by pulling the `ansible_facts.packages['cuda'][0].version` value from the JSON structure and using it in a `when` conditional.

Because I had multiple actions that I wanted to make conditional on a common condition, I didn't want to have a bunch of configuration-blocks with the same conditional statement. Did some quick Googling and found that, yes, Ansible does support executing multiple steps within a shared-condition block. You just have to use (wait for it...) the `block` statement in concert with the shared condition-statement. When you use that statement, you then nest the actions that you might otherwise have put in their own, individual action-blocks. In my case, the block ended up looking like:
- name: Update CUDA drivers as necessary
  block:
    - name: Copy CUDA RPM-repository definition
      copy:
        src: files/cuda-rhel7-11-0-local.repo-DSW
        dest: /etc/yum.repos.d/cuda-rhel7-11-0-local.repo
        group: 'root'
        mode: '000644'
        owner: 'root'
        selevel: 's0'
        serole: 'object_r'
        setype: 'etc_t'
        seuser: 'system_u'
    - name: Uninstall previous CUDA packages
      shell: |
          UNDOID=$( yum history info cuda | sed -n '/Transaction ID/p' | \
                    cut -d: -f 2 | sed 's/^[     ]*//g' | sed -n 1p )
          yum -y history undo "${UNDOID}"
    - name: Install new CUDA packages (main)
      yum:
        name:
          - cuda
          - nvidia-driver-latest-dkms
        state: latest
    - name: Install new CUDA packages (drivers)
      yum:
        name: cuda-drivers
        state: latest
  when:
    ansible_facts.packages['cuda'][0].version.split('.')[0]|int < 11
I'd considered doing the shell-out a bit more tersely – something like:
yum -y history undo $( yum history info cuda | \
sed -n '/Transaction ID/p' | cut -d: -f 2 | sed -n 1p)
But figured what I ended up using was marginally more readable for the very junior staff that will have to own this code after I'm gone.

Any way you slice it, though, I'm not super chuffed that I had to resort to a shell-out for the targeted/limited removal of packages. So, if you know a more Ansible-y way of doing this, please let me know.

I'd have also finished-out with one yum install-statement rather than the two, but the nVidia documentation for EL7 explicitly states to install the two groups separately. 🤷

Oh... And because I didn't want my `when` statement to be tied to the full X.Y.Z versioning of the drivers, I added the `split()` method so I could match against just the major number. Might have to revisit this if they ever reach a point where they care about the major and minor or the major, minor and release number. But, for now, the above suffices and is easy enough to extend via a compound `when` statement. Similarly, because Ansible defaults to string-output, I needed to forcibly cast the string-output to an integer so that numeric comparison would work properly.
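
To see the split-then-cast idea outside of Jinja2, here's a minimal shell sketch of the same comparison. It is only an analogue of the `when` expression, not what Ansible runs; the function name and the sample version strings are made up for illustration.

```shell
# Hypothetical helper: succeed when the major component of a dotted version
# string is lower than the given floor.
major_below() {
  version="$1"
  floor="$2"
  major="${version%%.*}"          # keep text before the first dot: 10.2.89 -> 10
  [ "${major}" -lt "${floor}" ]   # -lt compares as integers, not as strings
}
```

So `major_below 10.2.89 11` succeeds (an update would be due) while `major_below 11.0.2 11` does not. The integer comparison matters for the same reason the `|int` filter does: compared as strings, "9" sorts after "11".
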

Final note: I ended up line-breaking where I did because yamllint had popped "too wide" alerts when I ran my playbook through it.

Thursday, July 2, 2020

Taming the CUDA

Recently, I was placed on a new contract supporting a data science project. I'm not doing any real data-science work, simply improving the architecture and automation of the processes used to manage and deploy their data-science tooling.

Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well-disposed to running data science types of tasks. And, while RHEL is generically suitable to running on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.

Unfortunately, as third-party drivers, there are some gotchas with using them. The one they'd been most-plagued by was updating drivers as the GPU-vendor made further updates available. While doing a simple `yum upgrade` works for most packagings, it can be problematic when using third-party drivers. When you try to do `yum upgrade` (after having ensured the new driver-RPMs are available via `yum`), you'll ultimately get a bunch of dependency errors due to the driver DSOs being in use.

Ultimately, what I had to move to was a workflow that looked like:

  1. Uninstall the current GPU-driver RPMs
  2. Install the new GPU-driver RPMs
  3. Reboot
Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that". And, while I could have done something like `yum remove <DRIVER_STUB-NAME>`, doing so would have resulted in more packages being removed than I intended.

Fortunately, RHEL (7+) includes a nice option with the `yum` package-management utility: `yum history undo <INSTALL_ID>`.

Due to the data science users' individual EC2s being of varying vintage (and launched from different AMIs), the value of <INSTALL_ID> is not stable across their entire environment.

The automation gods giveth; the automation gods taketh away.

That said, there's a quick method to make the <INSTALL_ID> instability pretty much a non-problem:

yum history undo $( yum history info <rpm_name>| \
   sed -n '/Transaction ID/p' | \
   cut -d: -f 2 )
Which is to say "Undo the yum transaction-ID returned when querying the yum history for <rpm_name>". Works like a champ and made the overall update process go very smoothly.
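
The ID-extraction half of that pipeline can be tried on its own, without touching yum. The sketch below is hedged: the function name is hypothetical, and it assumes the `Transaction ID : <number>` line layout that `yum history info` printed on the hosts in question. It also trims the leading blank and keeps only the first ID, in case the query matches more than one transaction.

```shell
# Hypothetical helper: read `yum history info <rpm_name>` output on stdin and
# print the first transaction ID it mentions.
first_transaction_id() {
  sed -n '/Transaction ID/p' |      # keep only the "Transaction ID" lines
    cut -d: -f 2 |                  # take the text after the colon
    sed 's/^[[:space:]]*//' |       # strip leading whitespace
    sed -n 1p                       # keep the first match only
}
```

Usage would look something like `yum -y history undo "$( yum history info <rpm_name> | first_transaction_id )"`.
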

Now to wrap it up within the automation framework they're leveraging (Ansible). I don't think it natively understands the above logic, so, I'll probably have to shell-escape to get step #1 done.

Wednesday, June 10, 2020

TIL: Podman Cleanup

Recently, I started working on a gig that uses Ansible for their build-automation tasks. While I have experience with other types of build-automation frameworks, Ansible was new to me.

Unfortunately, my customer is very early in their DevOps journey. While my customer has some privately-hosted toolchain services, they're not really fully fleshed out: their GitLab has no runners; their Jenkins is not general access; etc. In short, not a lot of ability to develop in their environment — at least not in a way that allows me to set up automated validation of my work.

Ultimately, I opted to move my initial efforts to my laptop with the goal of exporting the results. Because my customer is a RHEL environment, I set up RHEL and CentOS 7 and 8 VMs on my laptop via Hyper-V.

Side-note: While on prior laptops I used other virtualization solutions, I'm using Hyper-V because it came with Windows 10, not because I prefer it over other options. Hypervisor selection aside…

As easy as VMs are to rebuild, I've yet to actually take the time out to automate my VMs' builds to make it less painful if I do something that renders one of them utterly FUBAR. Needless to say, I don't particularly want to crap-up my VMs, right now. So, how to provide a degree of blast-isolation within those VMs to hopefully better-avoid not-yet-automated rebuilds?

Containers can be a great approach. And, for something as simple as experimenting with Ansible and writing actual playbooks, it's more than sufficient. That said, since my VMs are all Enterprise Linux 7.8 or higher, Podman seemed the easier path than Docker ...and definitely easier than either full Kubernetes or K3S. After all, Podman is just a `yum install` away from being able to start cranking containers. Podman also means I can run containers in user-space (without needing to set up Kubernetes or K3S), which further limits how hard I can bone myself.

At any rate, I've been playing around with Ansible, teaching myself how to author flexible playbooks and even starting to write some content that will eventually go into production for my customer. However, after creating and destroying dozens of containers over the past couple weeks, I happened to notice that the partition my ${HOME} is on was nearly full. I'd made the silly assumption that when I killed and removed my running containers that the associated storage was released. Instead, I found that my ${HOME}/.local/share/containers was chewing up nearly 4GiB of space. Worse, when I ran find (ahead of doing any rms), I was getting all sorts of permission denied errors. This kind of surprised me since I thought that, by running in user-space, any files that would be created would be owned by me.

So, I hit up the almighty Googs. I ended up finding Dan Walsh's blog-entry on the topic. Turns out that, because of how Podman uses name-spaces, it creates files that my non-privileged user can't actually directly access. Per the blog-entry, instead of being able to just do find ${HOME}/.local/share/containers -mtime +3 | xargs rm, I had to invoke buildah unshare and do my cleanup using that context.
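
As a rough sketch of what that looks like in practice (the function name, the file-only `-type f` restriction and the `RUNNER` override are all assumptions added for illustration; only `buildah unshare` and the find-by-age approach come from the text above):

```shell
# Hypothetical helper: delete files older than N days from rootless container
# storage, running find inside the user namespace via `buildah unshare` so the
# namespace-owned files are reachable.
# RUNNER exists only for illustration: RUNNER=echo prints the command instead
# of executing it.
prune_container_store() {
  store="${1:-${HOME}/.local/share/containers}"
  days="${2:-3}"
  ${RUNNER:-buildah unshare} find "${store}" -type f -mtime "+${days}" -delete
}
```

Running `RUNNER=echo prune_container_store` first shows the exact command before anything is deleted. Dropping the `RUNNER=echo` prefix then performs the cleanup for real.
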

So, "today I learned" ...and now I have over 3GiB of the nearly 4GiB of space back.

Friday, June 5, 2020

Ansible Journey: Adding /etc/fstab Entries

As noted in yesterday's post, I'm working on a new customer-project. One of the automation-tools this customer uses is Ansible. This is a new-to-me automation-technology. Previously (and aside from just writing bare BASH and Python code), I've used frameworks like Puppet, SaltStack and a couple others. So, picking up a new automation-technology, especially one that uses a DSL not terribly unlike one I was already familiar with, hasn't been much of a stretch.

After sorting out yesterday's problem and how I wanted my /etc/fstab to look, I set about implementing it via Ansible. Ultimately, I ended up settling on a list-of-maps variable to drive a lineinfile role-task. I chose a list-of-maps variable mostly because the YAML that Ansible relies on doesn't really do tuples. My var ended up looking like:

s3fs_fstab_nest:
  - mountpoint: /provisioning/repo
    bucket: s3fs-build-bukkit
    folder: RPMs
  - mountpoint: /provisioning/installers
    bucket: s3fs-build-bukkit
    folder: EXEs
  - mountpoint: /Data/personal
    bucket: s3fs-users-bukkit
    folder: build

And my play ended up looking like:

---
- name:  "Add mount to /etc/fstab"
  lineinfile:
    path: '/etc/fstab'
    line: "s3fs#{{ item.bucket }}:/{{ item.folder }}\t{{ item.mountpoint }}\tfuse\t_netdev,allow_other,umask=0000,nonempty 0 0"
  loop: "{{ s3fs_fstab_nest }}"
...

Was actually a lot simpler than I was expecting it to be.
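
For anyone wondering what the `lineinfile` loop boils down to, here's a hedged shell approximation: build the tab-separated line from a bucket/folder/mountpoint triple, then append it only if that exact line isn't already in the file. The function name is hypothetical, and this is a simplification of the module's behavior rather than a replacement for it. Point it at a scratch file, not the live /etc/fstab, when experimenting.

```shell
# Hypothetical helper: idempotently add one s3fs entry to an fstab-style file.
ensure_fstab_line() {
  fstab="$1"; bucket="$2"; folder="$3"; mountpoint="$4"
  fmt='s3fs#%s:/%s\t%s\tfuse\t_netdev,allow_other,umask=0000,nonempty 0 0'
  line="$(printf "${fmt}" "${bucket}" "${folder}" "${mountpoint}")"
  # -x: whole-line match, -F: fixed string, so re-runs don't add duplicates
  grep -qxF -- "${line}" "${fstab}" 2>/dev/null ||
    printf '%s\n' "${line}" >> "${fstab}"
}
```

Calling `ensure_fstab_line ./fstab.test s3fs-build-bukkit RPMs /provisioning/repo` twice leaves a single entry in `./fstab.test`, which is the same "safe to re-run" property the Ansible task provides.
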

Thursday, June 4, 2020

TIL: You Gotta Be Explicit

Started working on a new contract, recently. This particular customer makes use of S3FS. To be honest, in the past half-decade, I've had a number of customers express interest in S3FS, but they've pretty much universally turned their noses up at it (due to any number of reasons that I can't disagree with: trying to use S3 like a shared filesystem is kind of horrible).

At any rate, this customer also makes use of Ansible for their provisioning automation. One of their "plays" is designed to mount the S3 buckets via s3fs. However, the manner in which they implemented it seemed kind of jacked to me: basically, they set up a lineinfile-based play to add s3fs commands to the /etc/rc.d/rc.local file, and then do a reboot to get the filesystems to mount up.

It wasn't a great method, to begin with, but, recently, their security people made a change to the IAM objects they use to enable access to the S3 buckets. It, uh, broke things. Worse, because of how they implemented the s3fs-related play, there was no error trapping in their work-flow. Jobs that relied on /etc/rc.d/rc.local having worked started failing with no real indication as to why (when you pull a file directly from S3 rather than an s3fs mount, it's pretty immediately obvious what's going wrong).

At any rate, I decided to try to see if there might be a better way to manage the s3fs mounts. So, I went to the documentation. I wanted to see if there was a way to make them more "managed" by the OS such that, if there was a failure in mounting, the OS would put a screaming-halt to the automation. Overall, if I think a long-running task is likely to fail, I'd rather it fail early in the process than after I've been waiting for several minutes (or longer). So I set about simulating how they were mounting S3 buckets with s3fs.

As far as I can tell, the normal use-case for mounting S3 buckets via s3fs is to do something like:

s3fs <bucket> <mount> -o <OPTIONS>

However, they have their buckets cut up into "folders" and sub-folders and wanted to mount them individually. The s3fs documentation indicated that you could both mount individual folders and that you could do it via /etc/fstab. You simply needed an /etc/fstab that looks sorta like:
s3fs-build-bukkit:/RPMs    /provisioning/repo       fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0
s3fs-build-bukkit:/EXEs    /provisioning/installer  fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0
s3fs-users-bukkit:/build   /Data/personal           fuse.s3fs    _netdev,allow_other,umask=0000,nonempty 0 0

However, I was finding that, even though the mount-requests weren't erroring, they also weren't mounting. So, hit up the almighty Googs and found an issue-report in the S3FS project that matched my symptoms. The issue ultimately linked to a (poorly-worded) FAQ-entry. In short, I was used to implicit "folders" (ones that exist by way of an S3 object containing a slash-delimited key), but s3fs relies on explicitly-created "folders" (e.g., null objects with key-names that end in `/` — as would be created by doing `aws s3api put-object --bucket s3fs-build-bukkit --key test-folder/`). Once I explicitly created these trailing-slash null-objects, my /etc/fstab entries started working the way the documentation indicated they ought to have been doing all along.


Friday, May 22, 2020

But We Want a Pre-Authentication Consent Banner!

My primary client-base is security- and compliance-focussed. As part of this, they follow security guidelines that include things like "system must display consent-to-monitor banners".

Up until recently, I only ever worked on the server end of things: almost exclusively UNIX- or Linux-based (though, the past decade has been almost exclusively Linux). This meant that all I really had to worry about was ensuring a suitable /etc/issue file was in place and that any interactive login services (mostly SSH; occasionally FTP) were configured to reference that file.

Any GUI-dependent developers that my customers might have had did all of their development work on Windows boxes or specially-prepared Linux desktops. In either case, they were using on-premises desktops that I didn't have to worry about.

With COVID19 and the mass adoption of telework, those carefully-prepared development desktops are sitting disused at my customers' offices. Developers have been forced to work remotely. As such, we've been asked, as part of our "enablement" mandate, to help these newly-remote developers set up cloud-hosted development-systems. A significant percentage of these developers rely on graphical tools. While most, if not all, of these graphical tools would be usable as individually-launched applications with their X11 tunneled through SSH back to the workstations they're using, that hasn't been enough for some of them:
  • Many don't have X11 servers installed on their laptops
    • I usually like to point BYOD Windows users to Cygwin/X, MobaXterm, Xming or VcXsrv, and to tell the corporate-issued Windows users to contact their employers and ask that they add any one of the commercial Xserver offerings to their laptops
    • Though, given the number of Macintosh users in their ranks, many already have Xservers; it just seems like they don't quite understand that fact
  • In either case, most don't seem to understand that the "easy button" solution is to add:
    Host *
      ForwardX11 yes
    to their ${HOME}/.ssh/config files, and that doing so allows them to SSH to their development-host and, by executing <GRAPHICAL_UTILITY> on the remote server, have the utility magically appear on their laptop.
  • Even once you show them the magic of X11 forwarding, many will whine "I don't want to have to have a terminal open just to execute programs, I want the entire remote desktop so I can just use the launchers."
As a result, I've been having to help Linux neophytes install the full Gnome desktop on top of our standard AMIs and figure out how to access them.

Our standard AMIs are RHEL- and CentOS-based and have little more on them than the @Core RPM-group. The AMIs also have boot-EBSes that are as small as we could make them while still meeting the server security-requirements that require the AMIs to be carved up into partitions. These initially led to several common questions:

  1. "Can you create new AMIs that have enough room for us to install the Gnome desktop onto"; and
  2. "Can you create an AMI with the Gnome-desktop preinstalled". 
The answer to both has always been, "no, but if you follow the FAQ on how to expand the disk at launch-time, you'll have plenty of space for installing the Gnome desktop yourself".

Unfortunately, with the sudden proliferation of people building themselves graphical desktops, our various customers' security assessors have taken notice. That means that our (headless) server-oriented automated-hardening tools are not accounting for the now Gnome-enabled desktops. By itself, not a problem, but different assessors have different demands around banners.
For some assessors, simply painting the contents of /etc/issue on the Gnome login screen's root window is sufficient. Fortunately, the Gnome project provides instructions that are also somewhere in the neighborhood of "dead-easy to do". Once done, you have a login screen that looks something like:



Unfortunately, some assessors insist that your consent banner must be a pop-up that is displayed prior to the would-be-user being allowed to enter their username or password. Meeting this requirement is a skosh more-involved. Instead of doing the above Gnome mods, one needs to do:
  1. Move the existing /etc/gdm/Init/Default file contents aside (e.g., `mv /etc/gdm/Init/Default{,-DIST}`)
  2. Install a new /etc/gdm/Init/Default file with contents similar to:
    /usr/bin/zenity --text-info --width=700 --height=300 \
    --title="Security Message" --filename=/etc/issue
    Which uses Zenity to create a dialogue-box from the /etc/issue file.
  3. Restart the GDM service (e.g., `systemctl restart gdm`)
Doing this results in production of a pre-authentication pop-up similar to the following:



If you want something marginally fancier, you can install `yad` (from EPEL). If using `yad`, changing your /etc/gdm/Init/Default contents to something like:
/usr/bin/yad \
  --center \
  --buttons-layout=center \
  --button=OK:0 \
  --no-escape \
  --fontname="Monospace Regular 10" \
  --fore tomato4 \
  --text-info \
  --width=700 \
  --height=300 \
  --wrap \
  --title="Security Message" \
  --filename=/etc/issue

Will produce a screen like:



Primary differences being ability to set the text-color and get rid of the extraneous "Cancel" button (and center the "Ok" button).

Thursday, May 7, 2020

Punishing Network Performance

Recently, I took delivery of a new laptop. My old laptop was rolling up on five years old, was still running Windows 7 and had less than 1GiB of free disk space. So, "it was time".

At any rate, the new laptop runs Windows 10 Pro. One of the nice things about this is that, because Microsoft makes Hyper-V available for free for my OS, I no longer have to deal with VirtualBox (or any add-on virtualization solution, for that matter). It means I'm free of the endless parade of "you must reboot to install new VirtualBox drivers" that pretty much caused me to stop with local virtualization on my prior laptop.

One of the nice things with virtualization is that it lets me run Linux, locally. While WSL/WSL2 show promise, they aren't quite there yet. Similarly, I'm not super chuffed that Docker Desktop for Windows requires me to run Docker as root. So, my Linux VM is an EL8 host that I am able to run Podman on (and able to run my containers as an unprivileged user within that VM).

That said, most of the "how to run Linux on Hyper-V on Windows 10" guides I found resulted in my networking getting completely and totally jacked up. When not jacked up, my Internet connection looks like:


However, after initially booting my CentOS VM, I noticed that my yum downloads were running stoopid-slow. I assumed the problem was constrained to the VM. However, being detail-oriented, I checked my OS's network speed. It was, uh, "not good". I was getting less than 4Mbps down and about 0.2Mbps up. I also found that, if I rebooted my system, my download speeds would come back, but my uploads still sucked. Lastly, I found that if I wholly de-configured Hyper-V, my networking returned to normal.

Sleuthing time.

I hit the Googs and everyone is saying "this is a known issue: you can fix it by changing your NIC settings." Unfortunately, the mentioned settings weren't available in my NIC. Dunno if it's because I'm using WiFi or what. All I know is that it was looking like my choices were going to be "have good Internet speeds or have Linux on Hyper-V".

I'm not real keen on such choices. So, I kept digging – eventually getting to the point where I went beyond the first page of results. Ultimately, I found an answer on one of the StackExchange forums. Interestingly, the answer I found wasn't even marked as the "best answer". Had I not read through all the responses on one thread, I wouldn't even have found it.

At any rate, the fix was to not use my NIC in bridged mode at all. Instead, I needed to ignore all the top-match guides' instructions to use an "External" interface (which puts your NIC into bridged mode) and, instead, use an "Internal" interface ...and then set up connection-sharing, allowing my private vSwitch to share (rather than bridge) through my WiFi adapter. As soon as I made that change, my speeds came back.

Tuesday, April 7, 2020

Crib Notes: Finding Missing Single- Or Double-Quote Pairs

So, today, was writing a BASH utility. As per normal, I have a commit-time check that runs it through shellchecker. That test came up green. However, when I ran the script it finished by complaining:


XXXXXX.sh: line 290: unexpected EOF while looking for matching `"'
Naturally, my reaction was something along the lines of:




I mean, shellchecker's usually pretty damned good about finding trivial flubs like that. So, Googled about to see if there was an easy way to double-check things. My search was fruitful – I found this cute, little snippet:
| tr -cd '"\n' | awk 'length%2==1 {print NR, $0}'
I catted my script to that pipe and, for better or worse, it agreed with shellchecker that all my single- and double-quotes were properly paired.
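
Note that the snippet as shown only counts double-quotes. A small wrapper makes it easy to run the same check for either quote character; the sketch below is my own packaging of the snippet (the function name and arguments are hypothetical). It only flags candidates: a string that legitimately spans lines will also show up as an odd-count line.

```shell
# Hypothetical helper: for each line of a script holding an odd number of the
# given quote character, print the line number and the quotes found on it.
odd_quote_lines() {
  script="$1"
  quote="$2"
  [ -n "${quote}" ] || quote='"'      # default to checking double-quotes
  # tr deletes everything except the quote character and newlines, so each
  # remaining line's length is that line's quote count.
  tr -cd "${quote}\n" < "${script}" | awk 'length % 2 == 1 { print NR, $0 }'
}
```

Something like `odd_quote_lines XXXXXX.sh '"'` followed by `odd_quote_lines XXXXXX.sh "'"` covers both cases. No output means every line has an even count.
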

Wednesday, February 26, 2020

Hold On A Second There, Partner

One of the devs on a project many of us work on was excited to announce that we'd topped 2000 commits on the project. I'm a pedant. I know that we configure our projects with a lot of "administrativa" bots, so I wanted to look harder at the numbers. What I found was:

The raw number of commits was:
$ git shortlog -s -n | awk '{ sum += $1 } END { print sum }'
2008

Total minus bot-commits:
$ git shortlog -s -n | grep -v bot | awk '{ sum += $1 } END { print sum }'
1476

Total minus bot- and merge-commits:
$ git shortlog -s -n --no-merges | grep -v bot | awk '{ sum += $1 } END { print sum }'
752

I wasn't trying to rain on his parade, but it is kind of cool that you can break things out that way and see just how much of your project's activity is "administrativa" type of actions.

Also, removing the filtering/counting pipelines gives you a nice output of who's contributed and how much (though "how much" is only in commits ...which, in our case, are generally squashed when merging back from an individual developer's fork).