Titular Discrepancy: drivers

Monday, July 6, 2020

Taming the CUDA (Pt. II)

So, today, finally had a chance to implement in Ansible what I'd learned in Taming the CUDA.

Given that it takes a significant amount of time to run the uninstall/new-install/reboot operation, I didn't want to just blindly execute the logic. So, I wanted to implement logic that checked to see what version, if any, of the CUDA drivers was already installed on the Ansible target. First step to this was as follows:

- name: Gather the rpm package facts
  package_facts:
    manager: auto

This tells Ansible to check the managed-host and gather package-information for everything installed there, stuffing the result into the `ansible_facts.packages` fact. This fact is a dictionary, keyed by package name, that's then referencable by subsequent Ansible actions. Since I'm only interested in the installed version of the base cuda RPM, I'm able to grab that information by grabbing the `ansible_facts.packages['cuda'][0].version` value from the structure and using it in a `when` conditional.
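
For spot-checking what that version value should be, the same string can be pulled straight from the RPM database on the target. A minimal sketch, assuming a shell on the managed host; the `installed_version` helper is a hypothetical name and not part of the playbook:

```shell
# Hypothetical helper (not part of the playbook): print the installed
# version of an RPM, or return non-zero if the package is absent.
installed_version() {
    # rpm exits non-zero when the package is not installed
    ver=$(rpm -q --queryformat '%{VERSION}\n' "$1" 2>/dev/null) || return 1
    # keep only the first line in case several builds are installed
    printf '%s\n' "$ver" | sed -n 1p
}

# Example use on a target host:
#   installed_version cuda || echo "cuda is not installed"
```

The non-zero return covers the "if any" case: a host that never had the base cuda RPM installed has no version to compare against.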

Because I had multiple actions that I wanted to make conditional on a common condition, I didn't want to have a bunch of configuration-blocks with the same conditional statement. Did some quick Googling and found that, yes, Ansible does support executing multiple steps within a shared-condition block. You just have to use (wait for it...) the `block` statement in concert with the shared condition-statement. When you use that statement, you then nest actions that you might otherwise have put in their own, individual action-blocks. In my case, the block ended up looking like:

- name: Update CUDA drivers as necessary
  block:
    - name: Copy CUDA RPM-repository definition
      copy:
        src: files/cuda-rhel7-11-0-local.repo-DSW
        dest: /etc/yum.repos.d/cuda-rhel7-11-0-local.repo
        group: 'root'
        mode: '000644'
        owner: 'root'
        selevel: 's0'
        serole: 'object_r'
        setype: 'etc_t'
        seuser: 'system_u'
    - name: Uninstall previous CUDA packages
      shell: |
          UNDOID=$( yum history info cuda | sed -n '/Transaction ID/p' | \
                    cut -d: -f 2 | sed 's/^[     ]*//g' | sed -n 1p )
          yum -y history undo "${UNDOID}"
    - name: Install new CUDA packages (main)
      yum:
        name:
          - cuda
          - nvidia-driver-latest-dkms
        state: latest
    - name: Install new CUDA packages (drivers)
      yum:
        name: cuda-drivers
        state: latest
  when:
    ansible_facts.packages['cuda'][0].version.split('.')[0]|int < 11

I'd considered doing the shell-out a bit more tersely – something like:

yum -y history undo $( yum history info cuda | \
sed -n '/Transaction ID/p' | cut -d: -f 2 | sed -n 1p)

But figured what I ended up using was marginally more readable for the very junior staff that will have to own this code after I'm gone.
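
One thing either form leaves open: if the history lookup comes back empty (say, cuda was never installed through `yum` on that host), the `undo` call gets handed an empty argument. A hedged sketch of a guard for that case; the `is_txn_id` helper name is an assumption of mine, and the commented usage swaps the whitespace-trimming `sed` for `tr`:

```shell
# Hypothetical guard: succeed only if the argument is a non-empty
# string of digits, i.e. something that looks like a yum transaction ID.
is_txn_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

# Sketch of the guarded shell-out:
#   UNDOID=$( yum history info cuda | sed -n '/Transaction ID/p' | \
#             sed -n 1p | cut -d: -f 2 | tr -d '[:space:]' )
#   if is_txn_id "${UNDOID}"; then
#       yum -y history undo "${UNDOID}"
#   else
#       echo "No yum transaction found for cuda; nothing to undo" >&2
#   fi
```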

Any way you slice it, though, I'm not super chuffed that I had to resort to a shell-out for the targeted/limited removal of packages. So, if you know a more Ansible-y way of doing this, please let me know.

I'd have also finished-out with one yum install-statement rather than the two, but the nVidia documentation for EL7 explicitly states to install the two groups separately. 🤷

Oh... And because I didn't want my `when` statement to be tied to the full X.Y.Z versioning of the drivers, I added the `split()` method so I could match against just the major number. Might have to revisit this if they ever reach a point where they care about the major and minor or the major, minor and release number. But, for now, the above suffices and is easy enough to extend via a compound `when` statement. Similarly, because Ansible defaults to string-output, I needed to forcibly cast the string-output to an integer so that numeric comparison would work properly.
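
The same major-number logic, sketched in shell for anyone who wants to poke at it outside of Ansible. The `major_below` helper is a hypothetical name and the version strings are just sample values. The integer comparison is the point: compared as plain strings, "9" would rank above "11".

```shell
# Hypothetical helper: succeed when the major part of an X.Y.Z
# version string is numerically lower than the given threshold.
major_below() {
    major=${1%%.*}            # strip everything from the first dot onward
    [ "$major" -lt "$2" ]     # -lt compares as integers, not strings
}

major_below "10.2.89" 11 && echo "upgrade needed"
major_below "11.0.2" 11  || echo "already current"
```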

Final note: I ended up line-breaking where I did because yamllint had popped "too wide" alerts when I ran my playbook through it.

Thursday, July 2, 2020

Taming the CUDA

Recently, I was placed on a new contract supporting a data science project. I'm not doing any real data-science work, simply improving the architecture and automation of the processes used to manage and deploy their data-science tooling.

Like most of my customers, the current customer is an Enterprise Linux shop and an AWS shop. Amazon makes available several GPU-enabled instance-types that are well-disposed to running data science types of tasks. And, while RHEL is generically suitable to running on GPU-enabled instance types, to get the best performance out of them, you need to run the GPU drivers published by the GPU-vendor rather than the ones bundled with RHEL.

Unfortunately, as third-party drivers, there are some gotchas with using them. The one they'd been most-plagued by was updating drivers as the GPU-vendor made further updates available. While doing a simple `yum upgrade` works for most packages, it can be problematic when using third-party drivers. When you try to do `yum upgrade` (after having ensured the new driver-RPMs are available via `yum`), you'll ultimately get a bunch of dependency errors due to the driver DSOs being in use.

Ultimately, what I had to move to was a workflow that looked like:

1. Uninstall the current GPU-driver RPMs
2. Install the new GPU-driver RPMs
3. Reboot

Unfortunately, "uninstall the current GPU-driver RPMs" actually means "uninstall just the 60+ RPMs that were previously installed ...and nothing beyond that." And, while I could have done something like `yum remove <DRIVER_STUB-NAME>`, doing so would result in more packages being removed than I intended.

Fortunately, RHEL (7+) includes a nice option with the `yum` package-management utility: `yum history undo <INSTALL_ID>`.

Due to the data science users' individual EC2s being of varying vintage (and launched from different AMIs), the value of <INSTALL_ID> is not stable across their entire environment.

The automation gods giveth; the automation gods taketh away.

That said, there's a quick method to make the <INSTALL_ID> instability pretty much a non-problem:

yum history undo $( yum history info <rpm_name>| \
   sed -n '/Transaction ID/p' | \
   cut -d: -f 2 )

Which is to say "Undo the yum transaction-ID returned when querying the yum history for <rpm_name>". Works like a champ and made the overall update process go very smoothly.
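
To see what the `sed`/`cut` pipeline is doing, it can be run against canned text instead of a live `yum`. A minimal sketch: the `first_txn_id` helper name and the sample text are made up for illustration (real `yum history info` output has many more fields), and the trailing `sed -n 1p` is there on the assumption that more than one transaction might match:

```shell
# Hypothetical helper: read `yum history info` output on stdin and
# print the first transaction ID found, with padding stripped.
first_txn_id() {
    sed -n '/Transaction ID/p' | cut -d: -f 2 | tr -d ' ' | sed -n 1p
}

# Made-up sample, trimmed to the fields that matter here
sample='Transaction ID : 12
Begin time     : Thu Jul  2 09:15:04 2020
Command Line   : install cuda
Transaction ID : 7
Begin time     : Mon Mar  9 11:02:41 2020
Command Line   : install cuda'

printf '%s\n' "$sample" | first_txn_id    # prints 12
```

Filtering on "Transaction ID" before the `cut` matters: the "Begin time" lines also contain colons and would otherwise pollute the result.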

Now to wrap it up within the automation framework they're leveraging (Ansible). I don't think it natively understands the above logic, so, I'll probably have to shell-escape to get step #1 done.

Friday, October 7, 2016

Using DKMS to maintain driver modules

In my prior post, I noted that maintaining custom drivers for the kernel in RHEL and CentOS hosts can be a bit painful (and prone to leaving you with an unreachable or even unbootable system). One way to take some of the pain out of owning a system with custom drivers is to leverage DKMS. In general, DKMS is the recommended way to ensure that, as kernels are updated, required kernel modules are also (automatically) updated.

Unfortunately, use of the DKMS method will require that developer tools (i.e., the GNU C-compiler) be present on the system - either in perpetuity or just any time kernel updates are applied. It is very likely that your security team will object to - or even prohibit - this. If the objection/prohibition cannot be overridden, use of the DKMS method will not be possible.

Steps

Set an appropriate version string into the shell-environment:
```
export VERSION=3.2.2
```
Make sure that appropriate header files for the running-kernel are installed:
```
yum install -y kernel-devel-$(uname -r)
```
Ensure that the dkms utilities are installed:
```
yum --enablerepo=epel install dkms
```

Download the driver sources and unarchive into the /usr/src directory:

wget https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/${VERSION}/ixgbevf-${VERSION}.tar.gz/download \
    -O /tmp/ixgbevf-${VERSION}.tar.gz && \
   ( cd /usr/src && \
      tar zxf /tmp/ixgbevf-${VERSION}.tar.gz )

Create an appropriate DKMS configuration file for the driver:

cat > /usr/src/ixgbevf-${VERSION}/dkms.conf << EOF
PACKAGE_NAME="ixgbevf"
PACKAGE_VERSION="${VERSION}"
CLEAN="cd src/; make clean"
MAKE="cd src/; make BUILD_KERNEL=\${kernelver}"
BUILT_MODULE_LOCATION[0]="src/"
BUILT_MODULE_NAME[0]="ixgbevf"
DEST_MODULE_LOCATION[0]="/updates"
DEST_MODULE_NAME[0]="ixgbevf"
AUTOINSTALL="yes"
EOF

Register the module to the DKMS-managed kernel tree:
```
dkms add -m ixgbevf -v ${VERSION}
```
Build the module against the currently-running kernel:
```
dkms build ixgbevf/${VERSION}
```
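
The steps above stop at `dkms build`. A built module normally still has to be placed into the running kernel's module tree before it can be loaded, which is what `dkms install` does; `dkms status` then reports what DKMS knows about. A sketch of that follow-on, assuming the same `VERSION` as above; the `run` wrapper, the `finish_dkms` function and the `DRY_RUN` variable are hypothetical conveniences for previewing the commands:

```shell
# Hypothetical wrapper: echo the command when DRY_RUN=1, else run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed follow-on to the steps above; VERSION as exported earlier.
finish_dkms() {
    run dkms install "ixgbevf/${VERSION}"   # place the built module for the running kernel
    run dkms status                         # list modules DKMS knows about and their state
}

# Preview without touching the system:
#   DRY_RUN=1 VERSION=3.2.2 finish_dkms
```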

Verification

The easiest way to verify the correct functioning of DKMS is to:

Perform a `yum update -y`
Check that the new drivers were created by executing `find /lib/modules -name ixgbevf.ko`. Output should be similar to the following:
```
find /lib/modules -name ixgbevf.ko | grep extra
/lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
/lib/modules/2.6.32-642.6.1.el6.x86_64/extra/ixgbevf.ko
```
There should be at least two output-lines: one for the currently-running kernel and one for the kernel update. If more kernels are installed, there may be more than just two output-lines.
Reboot the system, then check what version is active:
```
modinfo ixgbevf | grep extra
filename:       /lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
```
If the output is null, DKMS didn't build the new module.
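
The `find` check above can be turned around to list only the kernels that are missing the module, which is easier to eyeball on hosts with several kernels installed. A minimal sketch, assuming (as in the output above) that the DKMS-built module lands under each kernel's `extra` directory; the `kernels_missing_module` helper is a hypothetical name:

```shell
# Hypothetical helper: list kernel trees under a modules root whose
# extra/ directory has no copy of the named module. Prints nothing
# when every kernel is covered.
kernels_missing_module() {
    root=$1
    module=$2
    for kdir in "$root"/*/; do
        [ -d "$kdir" ] || continue
        # a missing extra/ directory counts as "module not present"
        if [ -z "$(find "$kdir/extra" -name "${module}.ko" 2>/dev/null)" ]; then
            basename "$kdir"
        fi
    done
}

# Example: kernels_missing_module /lib/modules ixgbevf
```

Any kernel version this prints is one DKMS has not (yet) built the module for.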