Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well-suited to running data science workloads. And, while RHEL is generically suitable to running on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.
Unfortunately, as third-party drivers, there are some gotchas with using them. The one they'd been most-plagued by was updating drivers as the GPU-vendor made further updates available. While doing a simple `yum upgrade` works for most packages, it can be problematic when using third-party drivers. When you try to do `yum upgrade` (after having ensured the new driver-RPMs are available via `yum`), you'll ultimately get a bunch of dependency errors due to the driver DSOs being in use.
Ultimately, what I had to move to was a workflow that looked like:
- Uninstall the current GPU-driver RPMs
- Install the new GPU-driver RPMs
- Reboot
Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that". And, while I could have done something like `yum remove <DRIVER_STUB-NAME>`, doing so would result in more packages being removed than I intended.
Fortunately, RHEL (7+) includes a nice option with the `yum` package-management utility: `yum history undo <INSTALL_ID>`.
Due to the data science users' individual EC2 instances being of varying vintage (and launched from different AMIs), the value of `<INSTALL_ID>` is not stable across their entire environment.
The automation gods giveth; the automation gods taketh away.
That said, there's a quick method to make the <INSTALL_ID> instability pretty much a non-problem:
`yum history undo $(yum history info <rpm_name> | sed -n '/Transaction ID/p' | cut -d: -f 2)`

Which is to say, "Undo the yum transaction-ID returned when querying the yum history for `<rpm_name>`". Works like a champ and made the overall update process go very smoothly.
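For anyone who wants something a bit more defensive than the one-liner, here's a rough sketch. It assumes `yum history info` prints one `Transaction ID : <n>` line per matching transaction, and it refuses to act if it finds zero or several of them rather than guessing. The function names are mine (hypothetical), and I've added `-y` so it doesn't stop to prompt:

```shell
# Print every transaction ID found in `yum history info` output on stdin.
# Assumes lines of the form "Transaction ID : 42".
extract_transaction_ids() {
    awk -F: '/^Transaction ID/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Undo the single yum transaction associated with the given package.
# Bails out instead of guessing when the history is empty or ambiguous.
undo_install_of() {
    local pkg ids count
    pkg="$1"
    ids="$(yum history info "$pkg" | extract_transaction_ids)"
    # Count non-empty lines; grep exits non-zero on zero matches, hence "|| true".
    count="$(printf '%s\n' "$ids" | grep -c . || true)"
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one transaction for ${pkg}, found ${count}" >&2
        return 1
    fi
    yum -y history undo "$ids"
}
```

Usage would be along the lines of `undo_install_of <rpm_name>`. If a host has more than one matching transaction, you'd need to pick the right ID by hand.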
Now to wrap it up within the automation framework they're leveraging (Ansible). I don't think it natively understands the above logic, so I'll probably have to shell out to get step #1 done.
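As a starting point for that shell-out, the three-step workflow could be wrapped in something like the following. This is only a sketch: the function name, the `DRY_RUN` switch and the idea that re-installing the same package name pulls in the new driver set are all my assumptions, not anything specific to the customer's environment. With `DRY_RUN` set, it only prints what it would do (the history lookup still runs, since it is read-only):

```shell
# Run a command, or just print it when DRY_RUN is set (hypothetical switch).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: undo the old driver install, install the new, reboot.
upgrade_gpu_driver() {
    local pkg txid
    pkg="$1"
    # Step 1: look up the transaction that installed the current driver RPMs.
    txid="$(yum history info "$pkg" | sed -n '/Transaction ID/p' | cut -d: -f 2 | tr -d ' ')"
    # Anything other than one plain number (none, several, a range) is an error.
    case "$txid" in
        ''|*[!0-9]*)
            echo "could not find a single install transaction for ${pkg}" >&2
            return 1
            ;;
    esac
    run yum -y history undo "$txid" || return 1
    # Step 2: install the new driver RPMs from the already-configured repo.
    run yum -y install "$pkg" || return 1
    # Step 3: reboot so the new driver is the one that gets loaded.
    run shutdown -r now
}
```

Running `DRY_RUN=1 upgrade_gpu_driver <rpm_name>` first is a cheap way to check which transaction would be undone. When driving this from Ansible, it may be cleaner to drop step 3 from the script and let the playbook handle the reboot, since rebooting mid-task drops the connection.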