Friday, February 19, 2021

Working Around Errors Caused By Poorly-Built AMIs (Networking Edition)

Over the past several years, the team I work on created a set of provisioning-automation tools that we've used with/for a NUMBER of customers. The automation is pretty well designed to run "anywhere".

Cue current customer/project. They're an AWS-using customer. They maintain their own AMIs. Unfortunately, our automation would break during the hardening phase of the deployment automation. After a waste of more than a man-day, discovered the root cause of the problem: when they build their EL7 AMIs, they don't do an adequate cleanup job.

Discovered that there were spurious ifcfg-* files in the resultant EC2s' /etc/sysconfig/network-scripts directory. Customer's AMI-users had never really noticed this oversight. All they really knew was that "networking appears to work", so had never noticed that the network.service systemd unit was actually in a fault state. Whipped out journalctl to find that the systemd unit was attempting to online interfaces that didn't exist on their EC2s ...because, while there were ifcfg-* files present, corresponding interface-directories in /sys/class/net didn't actually exist.

Because our hardening tools, as part of ensuring that network-related hardenings all get applied, does (the equivalent of) systemctl restart network.service. Unfortunately, due to the aforementioned problem, this action resulted in a non-zero exit. Consequently, our tools were aborting.

So, how to pre-clean the system so that the standard provisioning automation would work? Fortunately, AWS lets you inject boot-time logic via cloud-init scripts. I whipped up a quick script to eliminate the superfluous ifcfg-* files:  

for IFFILE in $( echo /etc/sysconfig/network-scripts/ifcfg-* )
do
   [[ -e /sys/class/net/${IFFILE//*ifcfg-/} ]] || (
      printf "Device %s not found. Nuking... " "${IFFILE//*ifcfg-/}" &&
      rm "${IFFILE}" || ( echo FAILED ; exit 1 )
      echo "Success!"
   )
done

Launched a new EC2 with the userData addition. When the "real" provisioning automation ran, no more errors. Dandy.

Ugh... Hate having to kludge to work around error-conditions that simply should not occur.

No comments:

Post a Comment