Tuesday, November 6, 2018

Shutdown In a Hurry

Last year, I put together an automated deployment of a CI tool for a customer. The customer uses it to provide a CI service to the various developer groups within their company, with each developer group acting as an individual tenant of that service.

Recently, as more of their developer groups have started using the service, space consumption has exploded. Initially, they tried setting up a job within their CI system to automate cleanup of their tenants' more poorly architected CI jobs, the ones that didn't appropriately clean up after themselves. Unfortunately, once the CI system's filesystems fill up, jobs stop working ...including the cleanup jobs.

When I'd written the initial automation, I'd provided for two different end-states for their CI system: one that is basically just a cloud-hosted version of a standard, standalone deployment of the CI service; the other, a cloud-enabled version that's designed to automatically rebuild itself on a regular basis or in the event of a service failure. They chose to deploy the former because it was the type of deployment they were most familiar with.

With their tenants frequently causing service components to go offline and their cleanup jobs proving unreliable, they asked me to investigate and come up with a workaround. I pointed them to the original set of auto-rebuilding tools I'd provided, noting that, with a small change, those tools could detect the filesystem-full state and initiate emergency actions. In this case, the proposed emergency action was a system-suicide that would cause the automation to rebuild the service back to a healthy state.

Initially, I was going to patch the automation so that, upon detecting a disk-full state, it would trigger a graceful shutdown. Then it struck me, "I don't care about these systems' integrity, I care about them going away quickly," since the quicker they go away, the quicker the automation will notice and start taking steps to re-establish the service. What I wanted was, instead of an `init 6` style shutdown, to have a "yank the power cord out of the wall" style of shutdown.
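For reference, the graceful route I'd initially planned to script would have been something along the lines of either of the following on a systemd-based distribution:
# systemctl poweroff
# shutdown -h now
Each of those spends time walking the service stack down in an orderly fashion, which is precisely the time I didn't want to spend on hosts the automation was about to replace anyway.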

In my pre-Linux days, the commercial UNIX systems I dealt with each had methods for forcing a system to immediately stop dead. Do not pass go. Do not collect $200. So, I started digging around for analogues.

Before many Linux distributions (including the one the CI service was deployed onto) moved to systemd, you could halt a system by doing:
# kill -SEGV 1
Under systemd, however, the above causes systemd to dump a core file but not halt the system: systemd catches the signal and simply freezes execution while the rest of the system carries on:
Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Caught <segv>, dumped core as pid 23769.


Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Freezing execution.


Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Caught <segv>, dumped core as pid 23769.

Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Freezing execution.

[root@test02 ~]# who -r
         run-level 5  2018-11-06 00:16
So, I did a bit more digging around. In doing so, I found that Linux has a functional analog to hitting the SysRq key on an '80s- or '90s-vintage system. This is optional functionality, and it's enabled in Enterprise Linux. For a list of things that used to be doable with a SysRq key, I'd point you to this "Magic SysRq Key" article.
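If you want to confirm that your particular kernel supports it before relying on it, a couple of quick checks (shown here against the usual Enterprise Linux file locations) will tell you:
# grep CONFIG_MAGIC_SYSRQ= /boot/config-$(uname -r)
# sysctl kernel.sysrq
The first shows whether the feature was compiled into the kernel; the second shows the bitmask that controls which SysRq functions are enabled for the keyboard key-combination.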

So, I tested it out by doing:
# echo o >/proc/sysrq-trigger
Doing so pretty much instantly causes any SSH connection to the remote system to "hang". It's not that the connection has actually hung: the command causes the remote system to halt, immediately. It can take a few seconds, but eventually the SSH client decides that the endpoint has dropped, and the "hung" SSH session exits with something similar to:
# Connection reset by 10.6.142.20
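As an aside, "o" (power off) was the right letter for my use-case, since the replacement node gets built fresh anyway, but it's not the only action the trigger interface accepts. A reboot-flavored equivalent looks like:
# echo b >/proc/sysrq-trigger
The "b" action reboots the system just as abruptly (no sync, no orderly unmounting), so it carries the same "only do this to a disposable system" caveat.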
This bit of knowledge verified, I had my suicide-method for my automation. Basically, I set up a cron job to run every five minutes to test the "fullness" of ${APPLICATION_HOME}:
#!/bin/bash
#
# Watchdog: hard power-off the node if the application filesystem gets
# too close to full. Invoked from cron every five minutes.
##################################################
PROGNAME="$(basename ${0})"
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin

# The application's working directory is recorded in the deployment's
# environment file
APPDIR=$(awk -F= '/APP_WORKDIR/{print $2}' /etc/cfn/Deployment.envs)

# Total size of the filesystem (blocks x block-size, so bytes) and the
# space still available to non-privileged processes (also bytes)
RAWBLOCKS="$(( $( stat -f --format='%b*%S' ${APPDIR} ) ))"
USEBLOCKS="$(( $( stat -f --format='%a*%S' ${APPDIR} ) ))"

# Available space as a rounded percentage of the whole
PRCNTFREE=$( printf '%.0f' $(awk "BEGIN { print ( ${USEBLOCKS} / ${RAWBLOCKS} ) * 100 }") )

if [[ ${PRCNTFREE} -gt 5 ]]
then
   echo "Found enough free space" > /dev/null
else
   # Put a bullet in systemd's brain
   echo o > /proc/sysrq-trigger
fi
Basically, the above takes the space available in the target filesystem, divides it by the filesystem's total size and converts the result to a percentage. Specifically, it checks the space available to a non-privileged application rather than to the root user. If that percentage drops to or below 5, it sends a "hard stop" via the SysRq interface.
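Scheduling it is just a plain cron entry. The script path and name below are stand-ins (the real ones get dropped in place by the deployment automation), but an /etc/cron.d snippet along these lines is all it takes:
*/5 * * * * root /usr/local/sbin/appdir-space-watchdog.sh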

The external (re)build automation takes things from there. The preliminary, benchmarked recovery time from exceeding the disk-full threshold to being back in business on a replacement node is approximately 15 minutes (and should get faster in future iterations as the build-time hardening routines are further optimized).