Titular Discrepancy: systemd

Showing posts with label systemd. Show all posts

Friday, February 19, 2021

Working Around Errors Caused By Poorly-Built AMIs (Networking Edition)

Over the past several years, the team I work on created a set of provisioning-automation tools that we've used with/for a NUMBER of customers. The automation is pretty well designed to run "anywhere".

Cue current customer/project. They're an AWS-using customer. They maintain their own AMIs. Unfortunately, our automation would break during the hardening phase of the deployment automation. After a waste of more than a man-day, discovered the root cause of the problem: when they build their EL7 AMIs, they don't do an adequate cleanup job.

Discovered that there were spurious ifcfg-* files in the resultant EC2s' /etc/sysconfig/network-scripts directory. Customer's AMI-users had never really noticed this oversight. All they really knew was that "networking appears to work", so had never noticed that the network.service systemd unit was actually in a fault state. Whipped out journalctl to find that the systemd unit was attempting to online interfaces that didn't exist on their EC2s ...because, while there were ifcfg-* files present, corresponding interface-directories in /sys/class/net didn't actually exist.

Because our hardening tools, as part of ensuring that network-related hardenings all get applied, does (the equivalent of) systemctl restart network.service. Unfortunately, due to the aforementioned problem, this action resulted in a non-zero exit. Consequently, our tools were aborting.

So, how to pre-clean the system so that the standard provisioning automation would work? Fortunately, AWS lets you inject boot-time logic via cloud-init scripts. I whipped up a quick script to eliminate the superfluous ifcfg-* files:

for IFFILE in $( echo /etc/sysconfig/network-scripts/ifcfg-* )
do
   [[ -e /sys/class/net/${IFFILE//*ifcfg-/} ]] || (
      printf "Device %s not found. Nuking... " "${IFFILE//*ifcfg-/}" &&
      rm "${IFFILE}" || ( echo FAILED ; exit 1 )
      echo "Success!"
   )
done

Launched a new EC2 with the userData addition. When the "real" provisioning automation ran, no more errors. Dandy.

Ugh... Hate having to kludge to work around error-conditions that simply should not occur.

Tuesday, November 6, 2018

Shutdown In a Hurry

Last year, I put together an automated deployment of a CI tool for a customer. The customer was using it to provide a CI service for different developer-groups within their company. Each developer group acts like an individual tenant of my customer's CI service.

Recently, as more of their developer-groups have started using the service, space-consumption has exploded. Initially, they tried setting up job within their CI system to automate cleanups of some of their tenants more-poorly architected CI jobs — ones that didn't appropriately clean up after themselves. Unfortunately, once the CI systems' filesystems fill up, jobs stop working ...including the cleanup jobs.

When I'd written the initial automation, I'd written automation to create two different end-states for their CI system: one that is basically just a cloud-hosted version of a standard, standalone deployment of the CI service; the other being a cloud-enabled version that's designed to automatically rebuild itself on a regular basis or in the event of service failure. They chose to deploy the former because it's the type of deployment they were most familiar with.

With their tenants frequently causing service-components to go offline and their cleaning jobs not being reliable, they asked me to investigate things and come up with a work around. I pointed them to the original set of auto-rebuilding tools I'd provided them, noting that, with a small change, those tools could detect the filesystem-full state and initiate emergency actions. In this case, the proposed emergency action being a system-suicide that caused the automation to rebuild the service back to a healthy state.

Initially, I was going to patch the automation so that, upon detecting a disk-full state, it would trigger a graceful shutdown. Then it struck me, "I don't care about these systems' integrity, I care about them going away quickly," since the quicker they go away, the quicker the automation will notice and start taking steps to re-establish the service. What I wanted was, instead of an `init 6` style shutdown, to have a "yank the power cord out of the wall" style of shutdown.

In my pre-Linux days, the commercial UNIX systems I dealt with each had methods for forcing a system to immediately stop dead. Do not pass go. Do not collect $200. So, started digging around for analogues.

Prior to many Linux distributions — including the one the CI service was deployed onto — moving to systemd, you could halt the system by doing:

# kill -SEGV 1

Under systemd, however, the above will cause systemd to dump a core file but not halt the system. Instead, systemd instantly respawns:

Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Caught <segv>, dumped core as pid 23769.


Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Freezing execution.


Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Caught <segv>, dumped core as pid 23769.

Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Freezing execution.

[root@test02 ~]# who -r
         run-level 5  2018-11-06 00:16

So, I did a bit more digging around. In doing so, I found that Linux has a functional analog to hitting the SysRq key on a 80s or 90s vintage system. This is an optional functionality that is enabled in Enterprise Linux. For a list of things that used to be doable with a SysRq key, I'd point you to this "Magic SysRq Key" article.

So, I tested it out by doing:

# echo o >/proc/sysrq-trigger

Which pretty much instantly causes an SSH connection to the remote system to "hang". It's not so much that the connection has hung as doing the above causes the remote system to immediately halt. It can take a few seconds, but, eventually the SSH client will decide that the endpoint has dropped, resulting in the "hung" SSH session exiting similarly to:

# Connection reset by 10.6.142.20

This bit of knowledge verified, I had my suicide-method for my automation. Basically, I set up a cron job to run every five minutes to test the "fullness" of ${APPLICATION_HOME}:

#!/bin/bash
#
##################################################
PROGNAME="$(basename ${0})"
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
APPDIR=$(awk -F= '/APP_WORKDIR/{print $2}' /etc/cfn/Deployment.envs)
RAWBLOCKS="$(( $( stat -f --format='%b*%S' ${APPDIR} ) ))"
USEBLOCKS="$(( $( stat -f --format='%a*%S' ${APPDIR} ) ))"
PRCNTFREE=$( printf '%.0f' $(awk "BEGIN { print ( ${USEBLOCKS} / ${RAWBLOCKS} ) * 100 }") )

if [[ ${PRCNTFREE} -gt 5 ]]
then
   echo "Found enough free space" > /dev/null
else
   # Put a bullet in systemd's brain
   echo o > /proc/sysrq-trigger
fi

Basically, the above finds the number of available blocks in the target filesystem and divides by the number of raw blocks and converts it into a percentage (specifically, it checks the number of blocks available to a non-privileged application rather than the root user). If the percentage drops to or below "5", it sends a "hard stop" signal via the SysRq interface.

The external (re)build automation takes things from there. The preliminary benchmarked recovery time from exceeding the disk-full threshold to being back in business on a replacement node is approximately 15 minutes (though will be faster in future iterations by better optimizing the build-time hardening routines).

Thursday, August 16, 2018

Systemd And Timing Issues

Unlike a lot of Linux people, I'm not a knee-jerk hater of systemd. My "salaried UNIX" background, up through 2008, was primarily with OSes like Solaris and AIX. With Solaris, in particular, I was used to sytemd-type init-systems due to SMF.

That said, making the switch from RHEL and CentOS 6 to RHEL and CentOS 7 hasn't been without its issues. The change from upstart to systemd is a lot more dramatic than from SysV-init to upstart.

Much of the pain with systemd comes with COTS software originally written to work on EL6. Some vendors really only due fairly cursory testing before saying something is EL7 compatible. Many — especially earlier in the EL 7 lifecycle — didn't bother creating systemd services at all. They simply relied on systemd-sysv-generator utility to do the dirty work for them.

While the systemd-sysv-generator utility does a fairly decent job, one of the places it can fall down is if the legacy-init script (files hosted in /etc/rc.d/init.d) is actually a symbolic link to someplace else in the filesystem. Even then, it's not super much a problem if "someplace else" is still within the "/" filesystem. However, if your SOPs include "segregate OS and application onto different filesystems", then "someplace else can" very much be a problem — when "someplace else" is on a different filesystem from "/".

Recently, I was asked to automate the installation of some COTS software with the "it works on EL6 so it ought to work on EL7" type of "compatibility". Not only did the software not come with systemd service files, its legacy-init files linked out to software installed in /opt. Our shop's SOPs are of the "applications on their own filesystems" variety. Thus, the /opt/<APPLICATION> directory is actually its own filesystem hosted on its own storage device. After doing the installation, I'd reboot the system. ...And when the system came back, even though there was a boot script in /etc/rc.d/init.d, the service wasn't starting. Poring over the logs, I eventually found:

systemd-sysv-generator[NNN]: stat() failed on /etc/rc.d/init.d/<script_name>
No such file or directory

This struck me odd given that the link and its destination very much did exist.

Turns out, systemd invokes the systemd-sysv-generator utility very early in the system-initialization proces. It invokes it so early, in fact, that the /opt/<APPLICATION> filesystem has yet to be mounted when it runs. Thus, when it's looking to do the conversion the file the sym-link points to actually does not yet exist.

My first thought was, "screw it: I'll just write a systemd service file for the stupid application." Unfortunately, the application's starter was kind of a rats nest of suck and fail; complete and utter lossage. Trying to invoke it from directly via a systemd service definition resulted in the application's packaged controller-process not knowing where to find a stupid number of its sub-components. Brittle. So, I searched for other alternatives...

Eventually, my searches led me to both the nugget about when systemd invokes the systemd-sysv-generator utility and how to overcome the "sym-link to a yet-to-be-mounted-filesystem" problem. Under systemd-enabled systems, there's a new-with-systemd mount-option you can place in /etc/fstab — x-initrd.mount. You also need to make sure that your filesystem's fs_passno is set to "0" ...and if your filesystem lives on an LVM2 volume, you need to update your GRUB2 config to ensure that the LVM gets onlined prior to systemd invoking the systemd-sysv-generator utility. Fugly.

At any rate, once I implemented this fix, the systemd-sysv-generator utility became happy with the sym-linked legacy-init script ...And my vendor's crappy application was happy to restart on reboot.

Given that I'm deploying on AWS, I was able to accommodate setting these fstab options by doing:

mounts:
  - [ "/dev/nvme1n1", "/opt/<application> , "auto", "defaults,x-initrd.mount", "0", "0" ]

Within my cloud-init declaration-block. This should work in any context that allows you to use cloud-init.

I wish I could say that this was the worst problem I've run into with this particular application. But, really, this application is an all around steaming pile of technology.

Friday, March 23, 2018

So You Wanna Run Sonarqube on (Hardened) RHEL/CentOS 7

As developers become more concerned with injecting security-awareness into their processes, tools like Sonarqube become more popular. As that popularity increases, the desire to run that software on more than just the Linux distro(s) it was originally built to run on increases.

Most of my customers only allow the use of Red Hat (and, increasingly, CentOS) on their production networks. As these customers hire developers that cut their teeth not on Red Hat, the pressure for deploying their preferred-tools onto Enterprise Linux increases.

Sonarqube has two relevant installation-methods listed on their installation documentation. One is to explode a ZIP archive and install the files wherever desired and set up requisite automated startup as appropriate. The other is to install a "semi-official" RPM-packaging. Initially, I chose the former, but subsequently changed to the latter. While the former is fine if you're installing very infrequently (and installing by hand), it can fall down if you're doing automated deployments by way of service-frameworks like Amazon's CloudFormation (CFn) and/or AutoScaling.

My preferred installation method is to use CFn to create AutoScaling Groups (ASGs). I don't run Sonarqube clustered: the community edition that my developers lack of funding allows me to deploy doesnt support clustering. Using ASGs means that I can increase the service's availability by having the Sonarqube service simply rebuild itself if the service ever becomes unexpectedly offline. Rebuilds only take a couple minutes and generally don't require any administrator intervention.

This method is/was compatible with the ZIP-based installation method for the first month or so of running the service. Then, one night, it stopped working. When I investigated to figure out why it stop, I found that the URL the deployment-automation was pointing to no longer existed. I'm not really sure what happened because the Sonarqube maintaiers seem to keep back versions of the software around indefinitely.

/shrug

At any rate, it caused me to return to the installation documentation. This time, I noticed that there was an RPM option. So, I went down that path. For the most part, it made things dead easy vice using the ZIP packaging. There were only a few gotchas with the RPM:

The current packaging, in addition to being a "noarch" packaging, is neither EL version-specific nor is it cross-version. Effectively, it's EL6-oriented: it includes a legacy init script but not a systemd service definition. The RPM maintainer likely needs to either make an EL6 and an EL7 packaging or needs to update the packaging to include files for both init-types and then use a platform-selector to install and activate the appropriate one

The current packaging creates the service account 'sonar' and installs the packaged files as user:group 'sonar' ...but doesn't include start-logic to ensure that the Sonarqube runs as 'sonar'. This will manifest in Sonarqube's es.log file similarly to:

[2018-02-21T15:45:51,714][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.RuntimeException: can not run elasticsearch as root
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:125) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.cli.SettingCommand.execute(SettingCommand.java:54) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:96) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.cli.Command.main(Command.java:62) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:89) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:82) ~[elasticsearch-5.1.1.jar:5.1.1]
Caused by: java.lang.RuntimeException: can not run elasticsearch as root
        at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:100) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:176) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:306) ~[elasticsearch-5.1.1.jar:5.1.1]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-5.1.1.jar:5.1.1]
        ... 6 more

The embedded ElastiCache service requires that the file-descriptors ulimit be at a specific minimum value: if the deployment is on a hardened system, it is reasonably likely the default ulimits are too low. If too low, this will manifest in Sonarqube's es.log file similarly to:

[2018-02-21T15:47:23,787][ERROR][bootstrap                ] Exception
java.lang.RuntimeException: max file descriptors [65535] for elasticsearch process likely too low, increase to at least [65536]
    at org.elasticsearch.bootstrap.BootstrapCheck.check(BootstrapCheck.java:79)
    at org.elasticsearch.bootstrap.BootstrapCheck.check(BootstrapCheck.java:60)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:188)
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:264)
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:111)
    at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:106)
    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:88)
    at org.elasticsearch.cli.Command.main(Command.java:53)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:74)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)

The embedded ElastiCache service requires that the kernel's vm.max_map_count runtime parameter be tweaked. If not adequately tweaked, this will manifest in Sonarqube's es.log file similarly to:
```
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
```

The embedded ElastiCache service's JVM won't start if the /tmp directory's had the noexec mount-option applied as part of the host's hardening. This will manifest in Sonarqube's es.log file similarly to:

WARN  es[][o.e.b.Natives] cannot check if running as root because JNA is not available 
WARN  es[][o.e.b.Natives] cannot install system call filter because JNA is not available 
WARN  es[][o.e.b.Natives] cannot register console handler because JNA is not available 
WARN  es[][o.e.b.Natives] cannot getrlimit RLIMIT_NPROC because JNA is not available
WARN  es[][o.e.b.Natives] cannot getrlimit RLIMIT_AS because JNA is not available
WARN  es[][o.e.b.Natives] cannot getrlimit RLIMIT_FSIZE because JNA is not available

External users won't be able to communicate with the service if the host-based firewall hasn't been appropriately configured.

The first three items are all addressable via a properly-written systemd service definition. The following is what I put together for my deployments:

[Unit]
Description=SonarQube
After=network.target network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/sonar/bin/linux-x86-64/sonar.sh start
ExecStop=/opt/sonar/bin/linux-x86-64/sonar.sh stop
ExecReload=/opt/sonar/bin/linux-x86-64/sonar.sh restart
Group=sonar
LimitNOFILE=65536
PIDFile=/opt/sonar/bin/linux-x86-64/SonarQube.pid
Restart=always
RestartSec=30
User=sonar
Type=forking

[Install]
WantedBy=multi-user.target

I put this into GitHub project I set up as part of automating my deployments for AWS. The 'User=sonar' option causes systemd to start the process as the sonar user. The 'Group=sonar' directive ensures that the process runs under the sonar. The 'LimitNOFILE=65536' ensures that the service's file-descriptor's ulimit is set sufficiently high.

The fourth item is addressed by creating an /etc/sysctl.d/sonarqube file containing:

vm.max_map_count = 262144

I take care of this with my automation but the RPM maintainer could obviate the need to do so by including an /etc/sysctl.d/sonarqube file as part of the RPM.

The fifth item is addressable through the sonar.properties file. Adding a line like:

sonar.search.javaAdditionalOpts=-Djava.io.tmpdir=/var/tmp/elasticsearch

Will cause Sonarqube to point ElasticSearch's temp-directory to /var/tmp/elasticsearch. Note that this is an example directory: pick a location in which the sonar user can create a directory and that is on a filesystem that does not have the noexec mount-option set. Without making this change, ElasticSearch will fail to start because JNI cannot be started.

The final item is adressable by adding a port-exception to the firewalld service. This can be done as simply as executing `firewall-cmd --permanent --service=sonarqube --add-port=9000/tcp` or adding a /etc/firewalld/services/sonarqube.xml file containing:

<?xml version="1.0" encoding="utf-8"?>
<service>
  <short>Sonarqube service ports</short>
  <description>Firewalld options supporting Sonarqube deployments</description>
  <port protocol="tcp" port="9000" />
  <port protocol="tcp" port="9001" />
</service>

Either is suitable for inclusion in an RPM: the former by way of a %post script; the latter as a config-file packaged in the RPM.

Once all of the bulleted-items are accounted for, the RPM-included service should start right up and function correctly on a hardened EL7 system. Surprisingly enough, no other tweaks are typically needed to make Sonarqube run on a hardened system. Sonarqube will typically function just fine with both FIPS-mode and SELinux active.

It's likely that once Sonarqube is deployed, functionality beyond the defaults will be desired. There are quite a number of functionality-extending plugins available. Download the ones you want and install them to <SONAR_INSTALL_ROOT>/extensions/plugins and restart the service. Most plugins behavior is configurable via the Sonarqube administration GUI.

Note: while much of the 'fixes' following the gotcha list are scattered throughout the Sonarqube documentation, on an unhardened EL 7 system, the system defaults are loose enough that accounting for them was not necessary for either the ZIP- or RPM-based installation-methods.

Extra: If you want the above taken care of for you and you're trying to deploy Sonarqube onto AWS, I put together a set of template-driven CloudFormation templates and helper scripts. They can be found on GitHub.

Monday, January 8, 2018

The Joys of SCL (or, "So You Want to Host RedMine on Red Hat/CentOS")

Recently, I needed to go back and try to re-engineer my company's RedMine deployment. Previously, I had rushed through a quick-standup of the service using the generic CentOS AMI furnished by CentOS.Org. All of the supporting automation was very slap-dash and had more things hard-coded than I really would have liked. However, the original time-scale the automation needed to be delivered under meant that "stupid but works ain't stupid" ruled the day.

That ad hoc implementation was over two years ago, at this point. Since the initial implementation, I had never had the chance to start to revisit it. The lack of a contiguous chunk of time to re-implement changed over the holidays (hooray for everyone being gone over the holiday weeks!). At any rate, there were a couple things I wanted to accomplish with the re-build:

Change from using the generic CentOS AMI to using our customized AMI
- Our AMI has a specific partitioning-scheme for our OS disks that we prefer to be used across all of our services. Wanted to make our RedMine deployment no longer be an "outlier".
- Our AMIs are updated monthly, whereas the CentOS.Org ones are typically updated when there's a new 7.X. Using our AMIs means not having to wait nearly as long for launch-time patch-up to complete before things can become "live".
I wanted to make use of our standard, provisioning-time OS-hardening framework — both so that the instance is better configured to be Internet-facing and so it is more in accordance with our other service-hosts
I wanted to be able to use a vendor-packaged Ruby rather than the self-packaged/self-managed Ruby. I had previously gone the custom Ruby route both because the vendor's default Ruby is too old for RedMine to use and because SCL was not quite as mature an offering back in 2015. In updating my automation to make use of SCL, I was figuring that a vendor-packaged RPM is more likely to be kept up to date without potentially introducing breakage that self-packaged RPMs can be prone to.
I wanted to switch from using UserData to using CloudFormation (CFn) for launch automation:
- Relatively speaking, CFn automation includes a lot more ability to catch errors/deployment-breakage with lower-effort. Typically, it's easier to avoid the "I thought it deployed correctly" happenstances.
- Using CFn automation exposes less information via instance-metadata either via the management UIs/CLIs or by reading metadata from the instance.
- Using CFn automation makes it easier to coordinate the provisioning of non-EC2 solution components (e.g., security groups, IAM roles, S3 buckets, EFS shares, RDS databases, load-balancers etc.)
I wanted to clean up the overall automation-scripts so I could push more stuff into (public) GitHub (projects) and have to pull less information from private repositories (like S3 buckets).

Ultimately, the biggest challenge turned out to be dealing with SCL. I can probably account for the problems with SCL being their focus. Specifically, they seem to be designed more for use by developers in interactive environments than they are to underpin services. If you're a developer, all you really need to do to have a "more up-to-date than vendor-standard" language packages is:

Enable the SCL repos
Install the SCL-hosted language package you're looking for
Update either your user's shell-inits or the system-wide shell inits
Log-out and back in
Go to town with your new SCL-furnished capabilities.

If, on the other hand, you want to use an SCL-managed language to back a service that can't run with the standard vendor-packaged languages (etc)., things are a little more involved.

A little bit of background:

The SCL-managed utilities are typically rooted wholly in "/opt/rh". This means binaries, libraries, documentation. Because of this, your shell's PATH, MANPATH and linker-paths will need to be updated. Fortunately, each SCL-managed package ships with an `enable` utility. You can do something as simple as `scl_enabled <SCL_PACKAGE_NAME> bash`, do the equivalent in your user's "${HOME}/.profile" and/or "${HOME}/.bash_profile" files, or you can do something by way of a file in "/etc/profile.d/". Since I assume "everyone on the system wants to use it", I enable by way of a file in "/etc/profile.d/". Typically, I create a file named "enable_<PACKAGE>.sh" and create contents similar to:

#!/bin/bash
source /opt/rh/<RPM_NAME>/enable
export X_SCLS="$(scl enable <RPM_NAME> 'echo $X_SCLS')"

Within it. Each time an interactive-user logs in, they get the SCL-managed package configured for their shell. If multiple SCL-managed packages are installed, they all "stack" nicely (failure to do the above can result in one SCL-managed package interfering with another).

Caveats:

Unfortunately, while this works well for interactive shell users, it's not generally sufficient for the non-interactive environments. This includes actual running services or stuff that's provisioned at launch-time via scripts kicked off by way of `systemd` (or `init`). Such processes do not use the contents of the "/etc/profile.d" directory (or other user-level initialization scripts). This creates a "what to do" scenario.

Linux affords a couple methods for handling this:

You can modify the system-wide PATH (etc.). To make the run-time linker find libraries in non-standard paths, you can create files in "/etc/ld.conf.d" that enumerate the desired linker search paths and then run `ldconfig` (the vendor-packaging of MariaDB and a few others do this). However, this is traditionally been considered not to be "good form" and frowned upon.
Use under systemd:
- You can modify the scripts invoked via systemd/init; however, if those scripts are RPM-managed, such modifications can be lost when the RPM gets updated via a system patch-up
- Under systemd, you can create customized service-definitions that contain the necessary environmentals. Typically, this can be quickly done by copying the service-definition found in the "/usr/lib/systemd/system" directory to the "/etc/systemd/system" directory and modifying as necessary.
Cron jobs: If you need to automate the running of tasks via cron, you can either:
- Write action-wrappers that include the requisite environmental settings
- Make use of the ability to set environmentals in jobs defined in /etc/cron.d
Some applications (such as Apache) allow you to extend library and other search directives via run-time configuration directives. In the case of RedMine, Apache calls the Passenger framework which, in turn, runs the Ruby-based RedMine code. To make Apache happy, you create an "/etc/httpd/conf.d/Passenger.conf" that looks something like:
```
LoadModule passenger_module /opt/rh/rh-ruby24/root/usr/local/share/gems/gems/passenger-5.1.12/buildout/apache2/mod_passenger.so

<IfModule mod_passenger.c>
        PassengerRoot /opt/rh/rh-ruby24/root/usr/local/share/gems/gems/passenger-5.1.12
        PassengerDefaultRuby /opt/rh/rh-ruby24/root/usr/bin/ruby
        SetEnv LD_LIBRARY_PATH /opt/rh/rh-ruby24/root/usr/lib64
</IfModule>
```
The directives:
- "LoadModule" directive binds the module-name to the fully-qualified path to the Passenger Apache-plugin (shared-library). This file is created by the Passenger installer.
- "PassengerRoot" tells Apache where to look for further, related gems and libraries.
- "PassengerDefaultRuby" tells Apache which Ruby to use (useful if there's more than one Ruby version installed or, as in the case of SCL-managed packages, Ruby is not found in the standard system search paths).
- "SetEnv LD_LIBRARY_PATH" tells Apachels run-time linker where else to look for shared library files.
Without these, Apache won't find RedMine or all of the associate (SCL-managed) Ruby dependencies.

The above list should be viewed in "worst-to-first" ascending-order of preference. The second systemd and cron methods and the application-configuration method (notionally) have the narrowest scope (area of effect, blast-radius, etc.) — only effecting where the desired executions look for non-standard components — and are the generally preferred methods.

Note On Provisioning Automation Tools:

Because automation tools like CloudFormation (and cloud-init) are not interactive processes, they too will not process the contents of the "/etc/profile.d/" directory. One can use one of the above enumerated methods for modifying the invocation of these tools or include explicit sourcing of the relevant "/etc/profile.d/" files within the scripts executed by these frameworks.

Wednesday, September 6, 2017

Order Matters

For one of the projects I'm working on, I needed to automate the deployment of JNLP-based Jenkins build-agents onto cloud-deployed (AWS/EC2) EL7-based hosts.

When we'd first started the project, we were using the Jenkins EC2 plugins to do demand-based deployment of EC2 spot-instances to host the Jenkins agent-software. It worked decently and kept deployment costs low. For better or worse, our client preferred using fixed/reserved instances and JNLP-based agent setup vice master-orchestrated SSH-based setup. The requested method likely makes sense for if/when they decide there's a need for Windows-based build agents — as all will be configured equivalently (via JNLP). It also eliminates the need to worry about SSH key-management.

At any rate, we made the conversion. It was initially trivial ...until we implemented our cost-savings routines. Our customer's developers don't work 24/7/365 (and neither do we). It didn't make sense to have agents just sitting around idle racking up EC2 and EBS time. So, we placed everything under a home-brewed tool to manage scheduled shutdowns and startups.

The next business-day, the problems started. The Jenkins master came online just fine. All of the agents also started back up ...however, the Jenkins master was declaring them dead. As part of debugging process, we would log into each of the agent-hosts but would find that there was a JNLP process in the PID-list. Initially we assumed the declared-dead problem was just a simple timing issue. Our first step was to try rescheduling the agents' start to happen a half hour before the master's. When that failed, we then tried setting the master's start a half hour before the agents'. No soup.

Interestingly, doing a quick `systemctl restart jenkins-agent` on the agent hosts would make them pretty much immediately go online in the Master's node-manager. So, we started to suspect something between the master and agent-nodes causing issues.

Because our agents talk to the master through an ELB, the issues lead us to suspect the ELB was proving problematic. After all, even though the Jenkins master starts up just fine, the ELB doesn't start passing traffic for a minute or so after the master service starts accepting direct connections. We suspected that perhaps there was enough lag in the ELB going fully active that the master was declaring the agents dead — as each agent-host was showing a ping-timeout error in the master console:

Upon logging back into the agents, we noticed that, while the java process was running, it wasn't actually logging. For background, when using default logging, the Jenkins agent create ${JENKINS_WORK_DIR}/logs/remoting.log.N files. The actively logged-to file is the ".0" file - which you can verify with tools other tools ...or just notice that there's a remoting.log.0.lck file. Interestingly, the remoting.log.0 file was blank, whereas all the other, closed files showed connection-actions between the agent and master.

So, started looking at the systemd file we'd originally authored:

[Unit]
Description=Jenkins Agent Daemon

[Service]
ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret &lt;secret&gt; -workDir &lt;workdir&gt;
User=jenkins
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target

Initially, I tried playing with the RestartSec parameter. Made no difference in behavior ...probably because the java process never really exits to be restarted. So, did some further digging about - particularly with an eye towards having failed to track missing systemd dependencies.

When you're making the transition from EL6 to EL7, one of the things that's easy to forget is that systemd is not a sequential init-system. Turned out that, while I'd told the service, "don't start till multi-user.target is going", I hadn't told it "...but make damned sure that the NICs are actually talking to the network." That was the key, and was accomplished by adding:

Wants=network-online.target
Requires=network.target

to the service definition files' [Unit] sections. Some documents indicated to set the Wants= to network.target. Others indicated that there were some rare cases where network.target may be satisfied before the NIC is actually fully ready to go — that the network-online.target was therefore the better dependency to set. In practice, the time between which network.target and network-online.target onlines is typically in the millisecond range - so, "take your pick". I was tired of fighting things, so, opted for network-online.target just to be conservative/"done with things"

After making the above change to one of my three nodes, that node immediately began to reliably online after full service-downings, agent-reboots or even total re-deployments of the agent-host. The change has since been propagated to the other agent nodes and the "run only during work-hours" service-profile has been tested "good".

Wednesday, November 26, 2014

Converting to EL7: Solving The "Your Favorite Service Isn't Systemd-Enabled" Problem

Having finally gotten off my butt to knock out getting my RHCE (for EL6) before the 2014-12-19 drop-dead date, I'm finally ready to start focusing on migrating my personal systems to EL7-based distros.

My personal VPS is currently running CentOS 6.6. I use my VPS to host a couple of personal websites and email for family and a few friends. Yes, I realize that it would probably be easier to offload all of this to providers like Google. However, while Google is very good at SPAM-stomping, and provides me a very generous amount of space for archiving emails, one area that they do lack for is email aliases: whenever I have to register to a new web-site, I use a custom email address to do so. At my last pruning, I still had 300+, per-site, aliases. So, for me, number of available aliases ("unlimited" is best) and ease of creating them trumps all other considerations.

Since I don't have Google handling my mail for me, I have to run my own A/V and anti-spam engines. Being a good Internet Citizen, I also like to make use of Sender Policy Framework (via OpenSPF) and DomainKeys (currently via DKIMproxy).

I'm only just into the process of sorting out what I need to do to make the transition as quick and as painless (more for my family and friends than me) a process as possible. I hate outages. And, with a week off for the Thanskgiving holidays, I've got time to do things in a fairly orderly fashion.

At any rate, one of the things I discovered is that my current DomainKeys solution hasn't been updated to "just work" within the systemd framework used within EL7. This isn't terribly surprising, as it appears that the DKIMproxy SourceForge project may have gone dormant, in 2013 (so, I'll have to see if there's alternatives that have the appearance of still being a going concern - in the mean time...) Fortunately, the DKIMproxy source code does come with a `chkconfig` compatible SysV-init script. Even more fortunately, converting from SysV-init to a systemd-compatible service control is a bit more straight forward than when I was dealing with moving from Solaris 9's legacy init to Solaris 10's SMF.

If you've already got a `chkconfig` style init script, moving to systemd-managed is fairly trivial. Your `chkconfig` script can be copied, pretty much "as is" into "/usr/lib/systemd". My (current) preference is to create a "scripts" subdirectory and put it in there. Haven't read deeply enough into systemd to see if this is the "Best Practices" method, however. Also, where I work has no established conventions ...because they only started migrating to EL6 in fall of 2013 - so, I can't exactly crib anything EL7-related from how we do it at work.

Once you have your SysV-init style script placed where it's going to live (e.g., "/usr/lib/systemd/scripts"), you need to create associated service definition files. In my particular case, I had to create two as the DKIMproxy software actually has an inbound and an outbound funtion. Launched from normal SysV-init, it all gets handled as one piece. However, one of the nice things about systemd is it's not only a launcher framework, it's a service monitor framework, as well. To take full advantage, I wanted one monitor for the inbound service and one for the outbound service. The legacy init script that DKIMproxy ships with makes this easy enough as, in addition to the normal "[start|stop|restart|status]" arguments, it had per-direction subcommand (e.g., "start-in" and "stop-out"). The service-definition for my "dkim-in.service" looks like:

[Unit]
     Description=Manage the inbound DKIM service
     After=postfix.service


     [Service]
     Type=forking
     PIDFile=/usr/local/dkimproxy/var/run/dkimproxy_in.pid
     ExecStart=/usr/lib/systemd/scripts/dkim start-in
     ExecStop=/usr/lib/systemd/scripts/dkim stop-in


     [Install]
     WantedBy=multi-user.target

To break down the above:

The "Unit" stanza tells systemd a bit about your new service:

The "Description" line is just ASCII text that allows you to provide a short, meaningful of what the service does. You can see your service's description field by typing `systemctl -p Description show <SERVICENAME>`
The "After" parameter is a space-separated list of other services that you want to have successfully started before systemd attempts to start your new service. In my case, since DKIMproxy is an extension to Postfix, it doesn't make sense to try to have DKIMproxy running until/unless Postfix is running.

The "Service" stanza is where you really define how your service should be managed. This is where you tell systemd how to start, stop, or reload your service and what PID it should look for so it knows that the service is still notionally running. The following parameters are the minimum ones you'd need to get your service working. Other parameters are available to provide additional functionality:

The "Type" parameter tells systemd what type of service it's managing. Valid types are: simple, forking, oneshot, dbus, notify or idle. The systemd.service man page more-fully defines what each option is best used for. However, for a traditional daemonized service, you're most likely to want "forking".
The "PIDFile" parameter tells systemd where to find a file containing the parent PID for your service. It will then use this to do a basic check to monitor whether your service is still running (note that this only checks for presence, not actual functionality).
The "ExecStart" parameter tells systemd how to start your service. In the case of a SysV-init script, you point it to the fully-qualified path you installed your script to and then any arguments necessary to make that script act as a service-starter. If you don't have a single, chkconfig-style script that handles both stop and start functions, you'd simply give the path to whatever starts your service. Notice that there are no quotations surrounding the parameter's value-section. If you put quotes - in the mistaken belief that the starter-command and it's argument need to be grouped, you'll get a path error when you go to start your service the first time.
The "ExecStop" parameter tells systemd how to stop your service. As with the "ExecStart" parameter, if you're leveraging a fully-featured SysV-init script, you point it to the fully-qualified path you installed your script to and then any arguments necessary to make that script act as a service-stopper. Also, the same rules about white-space and quotation-marks apply to the "ExecStop" parameter as do the "ExecStart" parameter.

The "Install" stanza is where you tell systemd the main part of the service dependency-tree to put your service. You have two main dependency-specifiers to choose: "WantedBy" and "RequiredBy". The former is a soft-dependency while the latter is a hard-dependency. If you use the "RequiredBy" parameter, then the service unit-group (e.g., "mult-user.target") enumerated with the "RequiredBy" parameter will only be considered to have successfully onlined if the defined service has successfully launched and stayed running. If you use the "WantedBy" parameter, then the service unit-group (e.g., "mult-user.target") enumerated with the "WantedBy" parameter will still be considered to have successfully onlined whether the defined service has successfully launched or stayed running. It's most likely you'll want to use "WantedBy" rather than "RequiredBy" as you typically won't want systemd to back off the entire unit-group just because your service failed to start or stay running (e.g., you don't want to stop all of the multi-user mode related processes just because one network service has failed.)