Thursday, July 2, 2020

Taming the CUDA

Recently, I was placed on a new contract supporting a data science project. I'm not doing any real data-science work, simply improving the architecture and automation of the processes used to manage and deploy their data-science tooling.

Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well-disposed to running data science types of tasks. And, while RHEL is generically suitable to running on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.

Unfortunately, as third-party drivers, there's some gotchas with using them. The one they'd been most-plagued by was updating drivers as the GPU-vendor made further updates available. While doing a simple `yum upgrade` works for most packagings, it can be problematic when using third-party drivers. When you try to do `yum upgrade` (after having ensured the new driver-RPMs are available via `yum`), you'll ultimately get a bunch of dependency errors due to the driver DSOs being in use.

Ultimately, what I had to move to was a workflow that looked like:

  1. Uninstall the current GPU-driver RPMs
  2. Install the new GPU-driver RPMs
  3. Reboot
Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that. And, while I could have done something like `yum uninstall <DRIVER_STUB-NAME>`, doing so would result in more packages being removed than I intended.

Fortunately, RHEL (7+) include a nice option with the `yum` package-management utility: `yum history undo <INSTALL_ID>`.  

Due to the data science users individual EC2s being of varying vintage (and launched from different AMIs), the value of <INSTALL_ID> is not stable across their entire environment.

The automation gods giveth; the automation gods taketh away.

That said, there's a quick method to make the <INSTALL_ID> instability pretty much a non-problem:

yum history undo $( yum history info <rpm_name>| \
   sed -n '/Transaction ID/p' | \
   cut -d: -f 2 )
Which is to say "Undo the yum transaction-ID returned when querying the yum history for <rpm_name>". Works like a champ and made the overall update process go very smoothly.

Now to wrap it up within the automation framework they're leveraging (Ansible). I don't think it natively understands the above logic, so, I'll probably have to shell-escape to get step #1 done.

No comments:

Post a Comment