Wednesday, July 15, 2020

Sometimes The (Workable) Answer Is Too Simple to See

One of the tasks I was asked to tackle was helping the team I'm working with move their (Python and Ansible-based) automation-development efforts from a Windows environment to a couple of Linux servers.

Were the problem only a matter of how I tend to work, the solution would have been very straightforward. Since I work across a bunch of contracts, I tend to select tooling that is reliably available: pretty much just git + vi ...and maybe some linting-containers.

However, on this new contract, much of my team doesn't work that way. While they've used git before, it was mostly within the context of VSCode. So, this required me to solve two problems – one relevant to the team I'm enabling and one possibly relevant only to me: making it so the team could use VSCode without Windows; and making it so that we wouldn't be endlessly prompted for passwords.

The "VSCode without Windows" problem was actually the easier of the two. It boiled down to:

  1. Install the VSCode server/agent on the would-be (Linux-based) dev servers
  2. Update the Linux-based dev servers' /etc/ssh/sshd_config file's AllowTcpForwarding setting (this one probably shouldn't have been necessary: the VSCode documentation indicates that one should be able to use UNIX domain-sockets on the remote host; however, the setting for doing so didn't appear to be available in my VSCode client)
  3. Point my laptop's VSCode to the Linux-based dev servers
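The sshd tweak in step 2 can be sketched like so (the sample config contents and the restart step are illustrative – test against a copy of the file before touching the real one):

```shell
# Work against a sample copy of sshd_config rather than the live file
cat > /tmp/sshd_config.sample <<'EOF'
PermitRootLogin no
AllowTcpForwarding no
EOF

# Flip AllowTcpForwarding to "yes" so VSCode can tunnel over SSH
sed -i 's/^AllowTcpForwarding no$/AllowTcpForwarding yes/' /tmp/sshd_config.sample
grep '^AllowTcpForwarding' /tmp/sshd_config.sample

# On the real server: make the same edit to /etc/ssh/sshd_config, then
# restart sshd (e.g., `systemctl restart sshd`) to pick up the change
```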
Because I'm lazy, I hate having to enter passwords over and over. This means that, to the greatest degree possible, I make use of things like keyrings. In most of my prior environments, things were PuTTY-based. So, my routine, upon logging in to my workstation each morning, included "fire up Pageant and load my keys": whether I then used the PuTTY or MobaXterm ssh-client, this meant no need to enter passwords for the rest of the day beyond having entered them as part of Pageant's key-loading.

According to the documentation, VSCode isn't compatible with PuTTY – and, by extension, not compatible with Pageant. So, I dutifully Googled around for how to solve the problem. Most of the hits I found seemed to rely on having a greater level of access to our customer-issued laptops than we're afforded: I don't have enough permissions to even check whether our PowerShell installations have the OpenSSH-related bits.

Ultimately, I turned to my employer's internal Slack channel for our automation team and posed the question. I was initially met with links to the same pages my Google searches had turned up. Since our customer-issued laptops do come with Git-BASH installed, someone suggested setting up its keyring and then firing-up VSCode from within that application. Being so used to accessing Windows apps via clicky-clicky, it totally hadn't occurred to me to try that. It actually worked (surprising both me and the person who suggested it). 
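For anyone wanting to replicate the Git-BASH trick, the session looks roughly like this (the key filename is illustrative, and `code` has to be on Git-BASH's PATH):

```shell
# Start an ssh-agent for this Git-BASH session; child processes of this
# shell inherit SSH_AUTH_SOCK, which is how VSCode finds the agent
eval "$(ssh-agent -s)" >/dev/null
echo "$SSH_AUTH_SOCK"

# Load your key (you enter the passphrase once, here)
# ssh-add ~/.ssh/id_rsa

# Launch VSCode from this same shell so it inherits the agent
# code .
```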

That said, it means I have an all-but-unused Git-BASH session taunting me from the task-bar. Fortunately, I have the taskbar set to auto-hide. But still: "not tidy".

Also: because everybody uses VSCode on this project, nobody really uses Git-BASH. So, any solution I propose that uses it will require further change-accommodation by the incumbent staff.

Fortunately, most of the incumbent staff already uses MobaXterm when they need CLI-based access to remote systems. Since MobaXterm integrates with Pageant, it's a small skip-and-a-jump to have VSCode use MobaXterm's keyring service ...which pulls from Pageant. The biggest change will be telling them "Once you've opened Moba, invoke VSCode from within it rather than going clicky-clicky on the pretty desktop icon".

I'm sure there are other paths to a solution. It mostly comes down to: A) time available to research and validate them; and, B) how much expertise is needed to use them, since I'll have to write any setup-documentation appropriate to the audience it's meant to serve.

Monday, July 6, 2020

Taming the CUDA (Pt. II)

So, today, I finally had a chance to implement in Ansible what I'd learned in Taming the CUDA.

Given that the uninstall/new-install/reboot operation takes significant time to run, I didn't want to just blindly execute the logic. Instead, I wanted logic that checked what version, if any, of the CUDA drivers was already installed on the Ansible target. The first step was as follows:
- name: Gather the rpm package facts
  package_facts:
    manager: auto
This tells Ansible to check the managed host and gather information on every installed package, storing the results in the `ansible_facts.packages` fact. That fact is a JSON structure that's referenceable by subsequent Ansible actions. Since I'm only interested in the installed version, I'm able to grab the `ansible_facts.packages['cuda'][0].version` value from that structure and use it in a `when` conditional.
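As a quick sanity-check, a follow-on task (of my own devising – not part of the playbook) can dump what the fact contains:

```yaml
- name: Show the installed CUDA version (illustrative)
  debug:
    msg: "cuda version is {{ ansible_facts.packages['cuda'][0].version }}"
  when: "'cuda' in ansible_facts.packages"
```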

Because I had multiple actions that I wanted to make conditional on a common condition, I didn't want a bunch of configuration-blocks each repeating the same conditional statement. Did some quick Googling and found that, yes, Ansible does support executing multiple steps within a shared-condition block. You just have to use (wait for it...) the `block` statement in concert with the shared condition-statement. When you use that statement, you then nest the actions that you might otherwise have put in their own, individual action-blocks. In my case, the block ended up looking like:
- name: Update CUDA drivers as necessary
  block:
    - name: Copy CUDA RPM-repository definition
      copy:
        src: files/cuda-rhel7-11-0-local.repo-DSW
        dest: /etc/yum.repos.d/cuda-rhel7-11-0-local.repo
        group: 'root'
        mode: '0644'
        owner: 'root'
        selevel: 's0'
        serole: 'object_r'
        setype: 'etc_t'
        seuser: 'system_u'
    - name: Uninstall previous CUDA packages
      shell: |
          UNDOID=$( yum history info cuda | sed -n '/Transaction ID/p' | \
                    cut -d: -f 2 | sed 's/^[     ]*//g' | sed -n 1p )
          yum -y history undo "${UNDOID}"
    - name: Install new CUDA packages (main)
      yum:
        name:
          - cuda
          - nvidia-driver-latest-dkms
        state: latest
    - name: Install new CUDA packages (drivers)
      yum:
        name: cuda-drivers
        state: latest
  when:
    ansible_facts.packages['cuda'][0].version.split('.')[0]|int < 11
I'd considered doing the shell-out a bit more tersely – something like:
yum -y history undo $( yum history info cuda | \
sed -n '/Transaction ID/p' | cut -d: -f 2 | sed -n 1p)
But I figured what I ended up using was marginally more readable for the very junior staff who will have to own this code after I'm gone.
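Either way, the extraction pipeline can be demoed against canned `yum history info` output – the shell function below just stands in for a real host's yum history:

```shell
# Stand-in for `yum history info cuda` on a real host (output abbreviated)
yum_history_info() {
  cat <<'EOF'
Loaded plugins: product-id
Transaction ID : 42
Begin time     : Mon Jul  6 09:00:00 2020
EOF
}

# Same pipeline as the playbook's shell-out, minus the actual undo;
# [[:space:]] is used here in place of the literal space/tab bracket
UNDOID=$( yum_history_info | sed -n '/Transaction ID/p' | \
          cut -d: -f 2 | sed 's/^[[:space:]]*//' | sed -n 1p )
echo "${UNDOID}"   # → 42: the transaction ID yum would be told to undo
```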

Any way you slice it, though, I'm not super chuffed that I had to resort to a shell-out for the targeted/limited removal of packages. So, if you know a more Ansible-y way of doing this, please let me know.

I'd also have finished with one yum install-statement rather than two, but the nVidia documentation for EL7 explicitly states to install the two groups separately. 🤷

Oh... And because I didn't want my `when` statement to be tied to the full X.Y.Z versioning of the drivers, I added the `split()` method so I could match against just the major number. I might have to revisit this if they ever reach a point where they care about the major and minor – or the major, minor and release – numbers. But, for now, the above suffices and is easy enough to extend via a compound `when` statement. Similarly, because Ansible treats the version value as a string, I needed to forcibly cast it to an integer so that numeric comparison would work properly.
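For instance, if the minor number ever starts mattering, the condition extends naturally (the version thresholds below are made up for illustration):

```yaml
  when:
    - ansible_facts.packages['cuda'][0].version.split('.')[0]|int <= 11
    - ansible_facts.packages['cuda'][0].version.split('.')[1]|int < 2
```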

Final note: I ended up line-breaking where I did because yamllint had popped "too wide" alerts when I ran my playbook through it.

Thursday, July 2, 2020

Taming the CUDA

Recently, I was placed on a new contract supporting a data science project. I'm not doing any real data-science work, simply improving the architecture and automation of the processes used to manage and deploy their data-science tooling.

Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well-suited to data-science workloads. And, while RHEL runs fine on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.

Unfortunately, as third-party drivers, there are some gotchas to using them. The one they'd been most plagued by was updating drivers as the GPU-vendor made further updates available. While a simple `yum upgrade` works for most packages, it can be problematic with third-party drivers. When you try to do a `yum upgrade` (after having ensured the new driver-RPMs are available via `yum`), you'll ultimately get a bunch of dependency errors due to the driver DSOs being in use.

Ultimately, what I had to move to was a workflow that looked like:

  1. Uninstall the current GPU-driver RPMs
  2. Install the new GPU-driver RPMs
  3. Reboot
Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that". And, while I could have done something like `yum remove <DRIVER_STUB-NAME>`, doing so would have resulted in more packages being removed than I intended.

Fortunately, RHEL (7+) includes a nice option with the `yum` package-management utility: `yum history undo <INSTALL_ID>`.

Due to the data-science users' individual EC2s being of varying vintage (and launched from different AMIs), the value of <INSTALL_ID> is not stable across their entire environment.

The automation gods giveth; the automation gods taketh away.

That said, there's a quick method to make the <INSTALL_ID> instability pretty much a non-problem:

yum history undo $( yum history info <rpm_name> | \
   sed -n '/Transaction ID/p' | \
   cut -d: -f 2 )
Which is to say "Undo the yum transaction-ID returned when querying the yum history for <rpm_name>". Works like a champ and made the overall update process go very smoothly.

Now to wrap it up within the automation framework they're leveraging (Ansible). I don't think Ansible natively understands the above logic, so I'll probably have to shell out to get step #1 done.