
Friday, January 3, 2025

SSH Problems When Using `sysadm_u` SELinux Confinement

Decided to be proactive on the security-setup for my one project. Opted to confine my default-user to sysadm_u. However, as soon as I did that, I stopped being able to ssh into the resulting EC2 as the default-user. Turns out there's a bit more required in order to use that confinement with a user that also needs to be able to SSH into the host.

For those reading who don't have a Red Hat login, if I want to confine a user to sysadm_u, I also need to ensure that my system-configuration automation includes:

setsebool ssh_sysadm_login on
setsebool -P ssh_sysadm_login on
Without the above, doing an `ssh -v` to the target host as the confined user will show a spurious:
Authenticated to 0.0.0.0 ([127.0.0.1]:22) using "publickey".
debug1: pkcs11_del_provider: called, provider_id = (null)
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: filesystem full
client_loop: send disconnect: Broken pipe

The `pledge: filesystem full` kinda threw me, at first, since I knew that neither my local nor my remote filesystem was full. So, I assumed that it was just a misleading error message (as seems to so often be the case when SELinux is involved). I then searched for ssh login problems associated with the selected SELinux-confinement, which led me to the previously-linked Red Hat article.

I guess that's why the hardening guidelines show `staff_u` as the recommended confinement for administrator users?

Ultimately, I opted to use `staff_u` instead, with a cloud-config block like:

user: {
  gecos: "GitLab Provisioning-Account (LOCAL)",
  name: "${PROVISIONING_USER}",
  selinux_user: 'staff_u',
  sudo: ['ALL=(root) TYPE=sysadm_t ROLE=sysadm_r NOPASSWD:ALL']
}
Ensuring that ROLE and TYPE SELinux transition-mappings are defined for my default-user eliminates the confusion that can result from confining a user to staff_u without supplying a mapping. Without the mapping, if a confined admin-user executes `sudo -i`, they get all sorts of unexpected `permission denied` errors.
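A quick, illustrative sanity-check of the transition (the MLS range shown will vary with the local policy) is to compare SELinux contexts before and after the `sudo -i`:

$ id -Z
staff_u:staff_r:staff_t:s0-s0:c0.c1023
$ sudo -i
# id -Z
staff_u:sysadm_r:sysadm_t:s0-s0:c0.c1023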

Wednesday, July 15, 2020

Sometimes The (Workable) Answer Is Too Simple to See

One of the tasks I was asked to tackle was helping the team I'm working with move their (Python and Ansible-based) automation-development efforts from a Windows environment to a couple of Linux servers.

Were the problem to only require addressing how I tend to work, the solution would have been very straight-forward. Since I work across a bunch of contracts, I tend to select tooling that is reliably available: pretty much just git + vi ...and maybe some linting-containers.

However, on this new contract, much of my team doesn't work that way. While they've used git before, it was mostly within the context of VSCode. So, this required me to solve two problems – one relevant to the team I'm enabling and one possibly relevant only to me: making it so the team could use VSCode without Windows, and making it so that I wouldn't be endlessly prompted for passwords.

The "VSCode without Windows" problem was actually the easier of the two. It boiled down to:

  1. Install the VSCode server/agent on the would-be (Linux-based) dev servers
  2. Update the Linux-based dev servers' /etc/ssh/sshd_config file's AllowTcpForwarding setting (this one probably shouldn't have been necessary: the VSCode documentation indicates that one should be able to use UNIX domain-sockets on the remote host; however, the setting for doing so didn't appear to be available in my VSCode client)
  3. Point my laptop's VSCode to the Linux-based dev servers
Because I'm lazy, I hate having to enter passwords over and over. This means that, to the greatest degree possible, I make use of things like keyrings. In most of my prior environments, things were PuTTY-based. So, my routine, upon logging in to my workstation each morning, included "fire up Pageant and load my keys": whether then using the PuTTY or MobaXterm ssh-client, this meant no need to enter passwords (the rest of the day) beyond having entered them as part of Pageant's key-loading.

According to the documentation, VSCode isn't compatible with PuTTY – and, by extension, not compatible with Pageant. So, I dutifully Googled around for how to solve the problem. Most of the hits I found seem to rely on having a greater level of access to our customer-issued laptops than what we're afforded: I don't have enough permissions to even check if our PowerShell installations have the OpenSSH-related bits.

Ultimately, I turned to my employer's internal Slack channel for our automation team and posed the question. I was initially met with links to the same pages my Google searches had turned up. Since our customer-issued laptops do come with Git-BASH installed, someone suggested setting up its keyring and then firing-up VSCode from within that application. Being so used to accessing Windows apps via clicky-clicky, it totally hadn't occurred to me to try that. It actually worked (surprising both me and the person who suggested it). 
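For what it's worth, the Git-BASH incantation boils down to something like the following sketch (the key filename, and the assumption that the `code` launcher is already on the PATH, are install-specific):

# Start an ssh-agent in the Git-BASH session and load the key(s)
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

# Launch VSCode from this shell so it inherits SSH_AUTH_SOCK
code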

That said, it means I have an all-but-unused Git-BASH session taunting me from the task-bar. Fortunately, I have the taskbar set to auto-hide. But still: "not tidy".

Also: because everybody uses VSCode on this project, nobody really uses Git-BASH. So, any solution I propose that uses it will require further change-accommodation by the incumbent staff.

Fortunately, most of the incumbent staff already uses MobaXterm when they need CLI-based access to remote systems. Since MobaXterm integrates with Pageant, it's a small skip-and-a-jump to have VSCode use MobaXterm's keyring service ...which pulls from Pageant. Biggest change will be telling them "Once you've opened Moba, invoke VSCode from within it rather than going clicky-clicky on the pretty desktop icon".

I'm sure there are other paths to a solution. Mostly comes down to: A) time available to research and validate them; and, B) how much expertise is needed to use them, since I'll have to write any setup-documentation appropriate to the audience it's meant to serve.

Friday, July 19, 2019

Why I Default to The Old Ways

I work with a growing team of automation engineers. Most are purely dev types. Those that have lived in the Operations world, at all, skew heavily towards Windows or only had to very lightly deal with UNIX or Linux.

I, on the other hand, have been using UNIX flavors since 1989. My first Linux system was the result of downloading a distribution from the MIT mirrors in 1992. Result, I have a lot of old habits (seriously: some of my habits are older than some of my teammates). And, because I've had to get deep into the weeds with all of those operating systems many, many, many times, over the years, those habits are pretty entrenched ("learned with blood" and all that rot).

A year or so ago, I'd submitted a PR that included some regex-heavy shell scripts. The person that reviewed the PR had asked "why are you using '[<space><TAB>]*' in your regexes rather than just '\s'?". At the time, I think my response was a semi-glib, "A) old habits die hard; and, B) I know that the former method always works".

That said, I am a lazy-typist. Typing "\s" is a lot fewer keystrokes than is "[<space><TAB>]*". Similarly, "\s" takes up a lot less in the way of column-width than does "[<space><TAB>]*" (and I/we generally like to code to fairly standard page-widths). So, for both laziness reasons and column-conservation reasons, I started to move more towards using "\s" and away from using "[<space><TAB>]*".  I think in the last 12 months, I've moved almost exclusively to  "\s".

Today, that move bit me in the ass. Well, yesterday, actually, because that's when I started receiving reports that the tool I'd authored on EL7 wasn't working when installed/used on EL6. Ultimately, I traced the problem to an `awk` invocation. Specifically, I had a chunk of code (filtering DNS output) that looked like:

awk '/\sIN SRV\s/{ printf("%s;%s\n",$7,$8)}'

Which worked a treat on EL7 but on EL6, "not so much." When I altered it to the older-style invocation:

awk '/[  ]*IN[  ]*SRV[  ]*/{ printf("%s;%s\n",$7,$8)}'

It worked fine on both EL7 and EL6. Turns out the ancient version of `awk` (3.1.7) on EL6 didn't know how to properly interpret the "\s" token. Oddly, my recollection from writing other tooling is that EL6's version of `grep` understands the "\s" token just fine.
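A quick way to check whether a given host's `awk` understands the "\s" token is something like the following (an illustrative test: a gawk 4.x-era build should print the message; the EL6-era 3.1.7 build should print nothing):

printf 'rec  IN SRV  0 0 389 dc1.example.com.\n' | \
  awk '/\sIN SRV\s/{print "this awk understands \\s"}'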

When I Slacked the person I'd had the original conversation with a link to the PR with a "see: this is why" note, he replied, "oh: I never really used awk, so never ran into it".

Tuesday, November 6, 2018

Shutdown In a Hurry

Last year, I put together an automated deployment of a CI tool for a customer. The customer was using it to provide a CI service for different developer-groups within their company. Each developer group acts like an individual tenant of my customer's CI service.

Recently, as more of their developer-groups have started using the service, space-consumption has exploded. Initially, they tried setting up a job within their CI system to automate cleanups of some of their tenants' more-poorly architected CI jobs — ones that didn't appropriately clean up after themselves. Unfortunately, once the CI systems' filesystems fill up, jobs stop working ...including the cleanup jobs.

When I'd written the initial automation, I'd written automation to create two different end-states for their CI system: one that is basically just a cloud-hosted version of a standard, standalone deployment of the CI service; the other being a cloud-enabled version that's designed to automatically rebuild itself on a regular basis or in the event of service failure. They chose to deploy the former because it's the type of deployment they were most familiar with.

With their tenants frequently causing service-components to go offline and their cleaning jobs not being reliable, they asked me to investigate things and come up with a workaround. I pointed them to the original set of auto-rebuilding tools I'd provided, noting that, with a small change, those tools could detect the filesystem-full state and initiate emergency actions. In this case, the proposed emergency action was a system-suicide that would cause the automation to rebuild the service back to a healthy state.

Initially, I was going to patch the automation so that, upon detecting a disk-full state, it would trigger a graceful shutdown. Then it struck me, "I don't care about these systems' integrity, I care about them going away quickly," since the quicker they go away, the quicker the automation will notice and start taking steps to re-establish the service. What I wanted was, instead of an `init 6` style shutdown, to have a "yank the power cord out of the wall" style of shutdown.

In my pre-Linux days, the commercial UNIX systems I dealt with each had methods for forcing a system to immediately stop dead. Do not pass go. Do not collect $200. So, started digging around for analogues.

Prior to many Linux distributions — including the one the CI service was deployed onto — moving to systemd, you could halt the system by doing:
# kill -SEGV 1
Under systemd, however, the above will cause systemd to dump a core file but not halt the system. Instead, systemd instantly respawns:
Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Caught <segv>, dumped core as pid 23769.


Broadcast message from systemd-journald@test02.lab (Tue 2018-11-06 18:35:52 UTC):

systemd[1]: Freezing execution.


Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Caught <segv>, dumped core as pid 23769.

Message from syslogd@test02 at Nov  6 18:35:52 ...
 systemd:Freezing execution.

[root@test02 ~]# who -r
         run-level 5  2018-11-06 00:16
So, I did a bit more digging around. In doing so, I found that Linux has a functional analog to hitting the SysRq key on an 80s- or 90s-vintage system. This is an optional functionality that is enabled in Enterprise Linux. For a list of things that used to be doable with a SysRq key, I'd point you to this "Magic SysRq Key" article.
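Whether the SysRq interface is enabled at all is governed by the kernel.sysrq sysctl (a bitmask, where "1" means all functions are allowed). A quick check:

sysctl kernel.sysrq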

So, I tested it out by doing:
# echo o >/proc/sysrq-trigger
Which pretty much instantly causes an SSH connection to the remote system to "hang". It's not so much that the connection has hung as doing the above causes the remote system to immediately halt. It can take a few seconds, but, eventually the SSH client will decide that the endpoint has dropped, resulting in the "hung" SSH session exiting similarly to:
# Connection reset by 10.6.142.20
This bit of knowledge verified, I had my suicide-method for my automation. Basically, I set up a cron job to run every five minutes to test the "fullness" of ${APPLICATION_HOME}:
#!/bin/bash
#
##################################################
PROGNAME="$(basename ${0})"
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
APPDIR=$(awk -F= '/APP_WORKDIR/{print $2}' /etc/cfn/Deployment.envs)   # filesystem to watch
RAWBLOCKS="$(( $( stat -f --format='%b*%S' ${APPDIR} ) ))"             # total filesystem bytes
USEBLOCKS="$(( $( stat -f --format='%a*%S' ${APPDIR} ) ))"             # bytes still available to unprivileged users
PRCNTFREE=$( printf '%.0f' $(awk "BEGIN { print ( ${USEBLOCKS} / ${RAWBLOCKS} ) * 100 }") )   # percent free

if [[ ${PRCNTFREE} -gt 5 ]]
then
   echo "Found enough free space" > /dev/null
else
   # Put a bullet in systemd's brain
   echo o > /proc/sysrq-trigger
fi
Basically, the above finds the number of available blocks in the target filesystem, divides it by the total number of blocks and converts the result into a percentage (specifically, it checks the number of blocks available to a non-privileged application rather than to the root user). If the percentage drops to or below "5", it sends a "hard stop" signal via the SysRq interface.
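The cron side of it is just a standard five-minute schedule; a hypothetical /etc/cron.d entry (the script path and name are placeholders, not the ones from the real deployment):

# /etc/cron.d/appfs-fullness-check
*/5 * * * * root /usr/local/sbin/appfs-fullness-check.sh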

The external (re)build automation takes things from there. The preliminary benchmarked recovery time from exceeding the disk-full threshold to being back in business on a replacement node is approximately 15 minutes (though it will be faster in future iterations as the build-time hardening routines are better optimized).

Thursday, August 16, 2018

Systemd And Timing Issues

Unlike a lot of Linux people, I'm not a knee-jerk hater of systemd. My "salaried UNIX" background, up through 2008, was primarily with OSes like Solaris and AIX. With Solaris, in particular, I was used to systemd-type init-systems due to SMF.

That said, making the switch from RHEL and CentOS 6 to RHEL and CentOS 7 hasn't been without its issues. The change from upstart to systemd is a lot more dramatic than from SysV-init to upstart.

Much of the pain with systemd comes with COTS software originally written to work on EL6. Some vendors really only do fairly cursory testing before saying something is EL7-compatible. Many — especially earlier in the EL7 lifecycle — didn't bother creating systemd services at all. They simply relied on the systemd-sysv-generator utility to do the dirty work for them.

While the systemd-sysv-generator utility does a fairly decent job, one of the places it can fall down is if the legacy-init script (files hosted in /etc/rc.d/init.d) is actually a symbolic link to someplace else in the filesystem. Even then, it's not much of a problem if "someplace else" is still within the "/" filesystem. However, if your SOPs include "segregate OS and application onto different filesystems", then "someplace else" can very much be a problem — when "someplace else" is on a different filesystem from "/".

Recently, I was asked to automate the installation of some COTS software with the "it works on EL6 so it ought to work on EL7" type of "compatibility". Not only did the software not come with systemd service files, its legacy-init files linked out to software installed in /opt. Our shop's SOPs are of the "applications on their own filesystems" variety. Thus, the /opt/<APPLICATION> directory is actually its own filesystem hosted on its own storage device. After doing the installation, I'd reboot the system. ...And when the system came back, even though there was a boot script in /etc/rc.d/init.d, the service wasn't starting. Poring over the logs, I eventually found:
systemd-sysv-generator[NNN]: stat() failed on /etc/rc.d/init.d/<script_name>
No such file or directory
This struck me odd given that the link and its destination very much did exist.

Turns out, systemd invokes the systemd-sysv-generator utility very early in the system-initialization process. It invokes it so early, in fact, that the /opt/<APPLICATION> filesystem has yet to be mounted when it runs. Thus, when it's looking to do the conversion, the file the sym-link points to does not yet exist.

My first thought was, "screw it: I'll just write a systemd service file for the stupid application." Unfortunately, the application's starter was kind of a rat's nest of suck and fail; complete and utter lossage. Trying to invoke it directly via a systemd service definition resulted in the application's packaged controller-process not knowing where to find a stupid number of its sub-components. Brittle. So, I searched for other alternatives...

Eventually, my searches led me to both the nugget about when systemd invokes the systemd-sysv-generator utility and how to overcome the "sym-link to a yet-to-be-mounted-filesystem" problem. Under systemd-enabled systems, there's a new-with-systemd mount-option you can place in /etc/fstab — x-initrd.mount. You also need to make sure that your filesystem's fs_passno is set to "0" ...and if your filesystem lives on an LVM2 volume, you need to update your GRUB2 config to ensure that the LVM gets onlined prior to systemd invoking the systemd-sysv-generator utility. Fugly.
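For reference, the resulting /etc/fstab entry would look something like the following (the device-path, mount-point and filesystem-type are illustrative):

/dev/AppVG/optVol    /opt/<APPLICATION>    xfs    defaults,x-initrd.mount    0 0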

At any rate, once I implemented this fix, the systemd-sysv-generator utility became happy with the sym-linked legacy-init script ...And my vendor's crappy application was happy to restart on reboot.

Given that I'm deploying on AWS, I was able to accommodate setting these fstab options by doing:
mounts:
  - [ "/dev/nvme1n1", "/opt/<application>", "auto", "defaults,x-initrd.mount", "0", "0" ]
Within my cloud-init declaration-block. This should work in any context that allows you to use cloud-init.

I wish I could say that this was the worst problem I've run into with this particular application. But, really, this application is an all around steaming pile of technology.

Wednesday, August 15, 2018

Diagnosing Init Script Issues

Recently, I've been wrestling with some COTS software that one of my customers wanted me to automate the deployment of. Automating its deployment has been... Patience-trying.

The first time I ran the installer as a boot-time "run once" script, it bombed. Over several troubleshooting iterations, I found that one of the five sub-components was at fault. When I deployed the five sub-components individually (and to different hosts), all but that one sub-component installed successfully. Worse, if I ran the automated installer from an interactive user's shell, the installation would succeed. Every. Damned. Time.

So, I figured, "hmm... need to see if I can adequately simulate how to run this thing from an interactive user's shell yet fool the installer into thinking it had a similar environment to what the host's init process provides." So, I commenced to Google'ing.

Commence to Googlin'

Eventually, I found reference to `setsid`. Basically, this utility allows you to spawn a subshell that's detached from a TTY ...just like an init-spawned process. So, I started out with:

     setsid bash -c '/path/to/installer -- \
       -c /path/to/answer.json' < /dev/null 2>&1 \
       | tee /root/log.noshell

Unfortunately, while the above creates a TTY-less subshell for the script to run in, it still wasn't quite fully simulating an init-like context. The damned installer was still completing successfully. Not what I wanted since, when run from an init-spawned context, it was failing in a very specific way. So, back to the almighty Googs.

Eventually, it occurred to me, "init-spawned processes have very different environments set than interactive user shells do. And, as a shell forked off from my login shell, that `setsid` process would inherit all my interactive shell's environment settings." Once I came to that realization, the Googs quickly pointed me to `env -i`. Upon integrating this nugget:

     setsid bash -c 'env -i /path/to/installer -- \
       -c /path/to/answer.json' < /dev/null 2>&1 \
       | tee /root/log.noshell

My installer started failing the same when launched from an interactive shell as from an init-spawned context. I had something with which I could tell the vendor, "your unattended installer is broken: here's how to simulate/reproduce the problem. FIX IT." I dunno about you, but, to me, an "unattended installer" that I have to kick off by hand from an interactive shell really isn't all that useful.

Monday, May 7, 2018

Streamed Backups to S3

Introduction/Background


Many would-be users of AWS come to AWS from a legacy hosting background. Oftentimes, when moving to AWS, the question, "how do I back my stuff up when I no longer have access to my enterprise backup tools," is asked. If not, it's a question that would-be AWS users should be asking.

AWS provides a number of storage options. Each option has use-cases that it is optimized for. Each also has a combination of performance, feature and pricing tradeoffs (see my document for a quick summary of these tradeoffs). The lowest-cost - and therefore most attractive for data-retention use-cases typical of backups-related activities - is S3. Further, within S3, there are pricing/capability tiers that are appropriate to different types of backup needs (the following list is organized by price, highest to lowest):
  • If there is a need to perform frequent full or partial recoveries, the S3 Standard tier is probably the best option
  • If recovery-frequency is pretty much "never" — but needs to be quick if there actually is a need to perform recoveries — and the policies governing backups mandates up to a thirty-day recoverability window, the best option is likely the S3 Infrequent Access (IA) tier.
  • If there's generally no need for recovery beyond legal-compliance capabilities, or the recovery-time objectives (RTO) for backups will tolerate a multi-hour wait for data to become available, the S3 Glacier tier is probably the best option.
Further, if a project's backup needs span the usage profiles of the previous list, data lifecycle policies can be created that will move data from a higher-cost tier to a lower-cost tier based on time thresholds. And, to prevent being billed for data that has no further utility, the lifecycle policies can include an expiration-age at which AWS will simply delete and stop charging for the backed-up data.

There are a couple of ways to get backup data into S3:
  • Copy: The easiest — and likely most well known — is to simply copy the data from a host into an S3 bucket. Every file on disk that's copied to S3 exists as an individually downloadable file in S3. Copy operations can be iterative or recursive. If the copy operation takes the form of a recursive-copy, basic location relationship between files is preserved (though, things like hard- or soft-links get converted into multiple copies of a given file). While this method is easy, it includes a loss of filesystem metadata — not just the previously-mentioned loss of link-style file-data but ownerships, permissions, MAC-tags, etc.
  • Sync: Similarly easy is the "sync" method. Like the basic copy method, every file on disk that's copied to S3 exists as an individually downloadable file in S3. The sync operation is inherently recursive. Further, if an identical copy of a file exists within S3 at a given location, the sync operation will only overwrite the S3-hosted file if the to-be-copied file is different. This provides good support for incremental style backups. As with the basic copy-to-S3 method, this method results in the loss of file-link and other filesystem metadata.

    Note: if using this method, it is probably a good idea to turn on bucket-versioning to ensure that each version of an uploaded file is kept. This allows a recovery operation to recover a given point-in-time's version of the backed-up file.
  • Streaming copy: This method is the least well-known. However, this method can be leveraged to overcome the problem of the loss of filesystem metadata. If the stream-to-S3 operation includes an inlined data-encapsulation operation (e.g., piping the stream through the tar utility), filesystem metadata will be preserved.

    Note: the cost of preserving metadata via encapsulation is that the encapsulated object is opaque to S3. As such, there's no (direct) means by which to emulate an incremental backup operation.

Technical Implementation

As the title of this article suggests, the technical-implementation focus of this article is on streamed backups to S3.

Most users of S3 are aware of its static file-copy options: that is, copying a file from an EC2 instance directly to S3. Most such users, when they want to store files in S3 and need to retain filesystem metadata, either look to things like s3fs or do staged encapsulation.

The former allows you to treat S3 as though it were a local filesystem. However, for various reasons, many organizations are not comfortable using FUSE-based filesystem-implementations - particularly open-source ones (usually due to fears about support if something goes awry).

The latter means using an archiving tool to create a pre-packaged copy of the data first staged to disk as a complete file and then copying that file to S3. Common archiving tools include the Linux Tape ARchive utility (`tar`), cpio or even `mkisofs`/`genisoimage`. However, if the archiving tool supports reading from STDIN and/or writing to STDOUT, the tool can be used to create an archive directly within S3 using S3's streaming-copy capabilities.

Best practice for backups is to ensure that the target data-set is in a consistent state. Generally, this means that the data to be archived is non-changing. This can be done by quiescing a filesystem ...or snapshotting a filesystem and backing up the snapshot. LVM snapshots will be used to illustrate how to take a consistent backup of a live filesystem (like those used to host the operating system).

Note: this illustration assumes that the filesystem to be backed up is built on top of LVM. If the filesystem is built on a bare (EBS-provided) device, the filesystem will need to be unmounted (or its I/O otherwise quiesced) before it can be consistently streamed to S3.

The high-level procedure is as follows:
  1. Create a snapshot of the logical volume hosting the filesystem to be backed up (note that LVM issues an `fsfreeze` operation before creating the snapshot: this flushes all pending I/Os before making the snapshot, ensuring that the resultant snapshot is in a consistent state). Thin or static-sized snapshots may be selected (thin snapshots are especially useful when snapshotting multiple volumes within the same volume-group as one has less need to worry about getting the snapshot volume's size-specification correct).
  2. Mount the snapshot
  3. Use the archiving-tool to stream the filesystem data to standard output
  4. Pipe the stream to S3's `cp` tool, specifying to read from a stream and to write to object-name in S3
  5. Unmount the snapshot
  6. Delete the snapshot
  7. Validate the backup by using S3's `cp` tool, specifying to write to a stream and then read the stream using the original archiving tool's capability to read from standard input. If the archiving tool has a "test" mode, use that; if it does not, it is likely possible to specify /dev/null as its output destination.
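Pulled together as shell commands, the above looks roughly like the following sketch (the volume names, mount-points and bucket name are placeholders, not values from the linked-to tool):

# 1. Snapshot the logical volume hosting the filesystem to be backed up
lvcreate --snapshot --size 2G --name homeVol-snap /dev/VolGroup00/homeVol

# 2. Mount the snapshot read-only (XFS snapshots may also need "-o nouuid")
mkdir -p /mnt/homeVol-snap
mount -o ro /dev/VolGroup00/homeVol-snap /mnt/homeVol-snap

# 3 & 4. Stream a tar-encapsulated copy of the snapshot straight into S3
tar -cf - -C /mnt/homeVol-snap . | \
  aws s3 cp - s3://my-backup-bucket/Backups/homeVol.tar

# 5 & 6. Unmount and delete the snapshot
umount /mnt/homeVol-snap
lvremove -f /dev/VolGroup00/homeVol-snap

# 7. Validate by streaming the archive back through tar's listing mode
aws s3 cp s3://my-backup-bucket/Backups/homeVol.tar - | tar -tf - > /dev/null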
For a basic, automated implementation of the above, see the linked-to tool. Note that this tool is "dumb": it assumes that all logical volumes hosting a filesystem should be backed up. The only argument it takes is the name of the S3 bucket to upload to. The script does only very basic "pre-flight" checking:
  • Ensure that the AWS CLI is found within the script's inherited PATH env.
  • Ensure that either an AWS IAM instance-role is attached to the instance or that an IAM user-role is defined in the script's execution environment (${HOME}/.aws/credential files not currently supported). No attempt is made to ensure the instance- or IAM user-role has sufficient permissions to write to the selected S3 bucket.
  • Ensure that a bucket-name has been passed (though it is not checked for validity).
Once the pre-flights pass: the script will attempt to snapshot all volumes hosting a filesystem; mount the snapshots under the /mnt hierarchy — recreating the original volumes' mount-locations, but rooted in /mnt; use the `tar` utility to encapsulate and stream the to-be-archived data to the s3 utility; use the S3 cp utility to write tar's streamed, encapsulated output to the named S3 bucket's "/Backups/" folder. Once the S3 cp utility closes the stream without errors, the script will then dismount and delete the snapshots.

Alternatives

As mentioned previously, it's possible to do similar actions to the above for filesystems that do not reside on LVM2 logical volumes. However, doing so will either require different methods for creating a consistent state for the backup-set or backing up potentially inconsistent data (and possibly even wholly missing "in flight" data).

EBS has the native ability to create copy-on-write snapshots. However, the EBS volume's snapshot capability is generally decoupled from the OS'es ability to "pause" a filesystem. One can use a tool — like those in the LxEBSbackups project — to coordinate the pausing of the filesystem so that the EBS snapshot can create a consistent copy of the data (and then unpause the filesystem as soon as the EBS snapshot has been started).

One can leave the data "as is" in the EBS snapshot or one can then mount the snapshot to the EC2 and execute a streamed archive operation to S3. The former has the value of being low effort. The latter has the benefit of storing the data to lower-priced tiers (even S3 standard is cheaper than snapshots of EBS volumes) and allowing the backed up data to be placed under S3 lifecycle policies.

Tuesday, February 27, 2018

Why I Love The UNIX/Linux CLI

There's many reasons, really. Probably too many to list in any one sitting.

That said, some days you pull an old trick out of your bag and you're reminded all over why you love something. Today was such a day.

All day, we've been having sporadic performance issues with AWS's CloudFormation functionality. Some jobs will run in (a few tens of) minutes, others will take an hour or more. Now, this would be understandable if they were markedly different tasks. But that's not the case. It's literally "run this template 'A' and wait" followed by "re-run template 'A' and wait" where the first run goes in about the time you'd expect and the second run of the same automation-template takes 10x as long.

Fun thing with CloudFormation is that, when you launch from the command line, you can pass all of your parameters via a file. Some templates, however, don't like it when there are two active copies that contain the same parameter values. The way around this is to generalize your parameters file and then "process" that parameter file for each run. The UNIX/Linux CLI means that you can do all of this inline, like so:

aws --region us-east-1 cloudformation create-stack --stack-name RedMine07 \
  --disable-rollback --capabilities CAPABILITY_NAMED_IAM \
  --template-url https://s3.amazonaws.com/my-automation-bucket/Templates/RedMine.tmplt.json \
  --parameters file://<(sed 's/__NUM__/07/g' RedMine.parms.json)

The nifty part here is in the last line of that command. When you wrap a shell command in "<()", it runs the command within the parentheses and encapsulates it into a file-handle. Thus, if your main command requires an input-file be specified, the output of your encapsulated command gets treated just like it was being read from an on-disk file rather than as an inline-command.
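The construct isn't CloudFormation-specific, either; a classic illustration is diffing the sorted contents of two files without creating any temporary files:

diff <(sort file1.txt) <(sort file2.txt)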

Slick.

In my case, I'd generalized my parameters file to include a substitutable token, "__NUM__". Each time I need to run the command, all I have to do is change that "07" to a new value and bingo, new stack with unique values where needed.

Another fun thing is the shell's editable history function. If I want to run that exact same command, again, changing my values — both those passed as stack-parameters and the name of the resultant stack — I can do:

!!:gs/07/08

Which causes the prior stanza to be re-run as so:

aws --region us-east-1 cloudformation create-stack --stack-name RedMine08 \
  --disable-rollback --capabilities CAPABILITY_NAMED_IAM \
  --template-url https://s3.amazonaws.com/my-automation-bucket/Templates/RedMine.tmplt.json \
  --parameters file://<(sed 's/__NUM__/08/g' RedMine.parms.json)

Similarly, if the command I wanted to change was the 66th in my history rather than the one I just ran, I could do:

!66:gs/07/08

And achieve the same results.

Thursday, November 3, 2016

Automation in a Password-only UNIX Environment

Occasionally, my organization has need to run ad hoc queries against a large number of Linux systems. Usually, this is a great use-case for an enterprise CM tool like HPSA, Satellite, etc. Unfortunately, the organization I consult to is between solutions (their legacy tool burned down and their replacement has yet to reach a working state). The upshot is that, one needs to do things a bit more on-the-fly.

My preferred method for accessing systems is using some kind of token-based authentication framework. When hosts are AD-integrated, you can often use Kerberos for this. Failing that, you can sub in key-based logins (if all of your targets have your key as an authorized key). While my customer's systems are AD-integrated, their security controls preclude the use of both AD/Kerberos's single-signon capabilities and the use of SSH key-based logins (and, to be honest, almost none of the several hundred targets I needed to query had my key configured as an authorized key).

Because (tunneled) password-based logins are forced, I was initially looking at the prospect of having to write an expect script to avoid having to type in my password several hundred times. Fortunately, there's an alternative to this in the tool "sshpass".

The sshpass tool lets you supply your password via a number of methods: a command-line argument, a password-containing file, an environment variable value or even a read from STDIN. I'm not a fan of text files containing passwords (they've a bad tendency to be forgotten and left on a system - bad juju). I'm not particularly a fan of command-line arguments, either - especially on a multi-user system where others might see your password if they `ps` at the wrong time (which increases in probability as the length of time your job runs goes up). The STDIN method is marginally less awful than the command arg method (for similar reasons). At least with an environment variable, the value really only sits in memory (especially if you've set your HISTFILE location to someplace non-persistent).

The particular audit I was doing was an attempt to determine the provenance of a few hundred VMs. Over time, the organization has used templates authored by several groups - and different personnel within one of the groups. I needed to scan all of the systems to see which template they might have been using since the template information had been deleted from the hosting environment. Thus, I needed to run an SSH-encapsulated command to find the hallmarks of each template. Ultimately, what I ended up doing was:

  1. Pushed my query-account's password into the environment variable used by sshpass, "SSHPASS"
  2. Generated a file containing the IPs of all the VMs in question.
  3. Set up a for loop to iterate that list
  4. Looped `sshpass -e ssh -o PubkeyAuthentication=no -o StrictHostKeyChecking=no <USER>@${HOST} <AUDIT_COMMAND_SEQUENCE> 2>&1`
  5. Jam STDOUT through a sed filter to strip off the crap I wasn't interested in and put CSV-appropriate delimiters into each queried host's string.
  6. Capture the lot to a text file
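Stitched together, the loop looked something like the following sketch (the account name, audit command, sed expression and file names are placeholders, not the ones actually used):

export SSHPASS='<QUERY_ACCOUNT_PASSWORD>'        # step 1: the env-var sshpass reads
for HOST in $(cat vm-ip-list.txt)                # steps 2 and 3: iterate the IP list
do
   sshpass -e ssh -o PubkeyAuthentication=no -o StrictHostKeyChecking=no \
     audituser@${HOST} '<AUDIT_COMMAND_SEQUENCE>' 2>&1 | \
     sed -e "s/^/${HOST};/"                      # step 5: CSV-ify each host's output
done > audit-results.csv                         # step 6: capture the lot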
The "PubkeyAuthentication=no" option was required because I pretty much always have SSH-agent (or agent-forwarding) enabled. This causes my key to be passed. With the targets' security settings, this causes the connection to be aborted unless I explicitly suppress the passing of my agent-key.

The "StrictHostKeyChecking=no" option was required because I'd never logged into these hosts before. Our standard SSH client config is to require confirmation before accepting the remote key (and shoving it into ${HOME}/.ssh/known_hosts). Without this option, I'd be required to confirm acceptance of each key ...which is just about as automation-breaking as having to retype your password hundreds of times is.

Once the above was done, I had a nice CSV that could be read into Excel and a spreadsheet turned over to the person asking "who built/owns these systems". 

This method also meant that for the hosts that refused the audit credentials, it was easy enough to report "...and this lot aren't properly configured to work with AD".

Friday, October 7, 2016

Using DKMS to maintain driver modules

In my prior post, I noted that maintaining custom drivers for the kernel in RHEL and CentOS hosts can be a bit painful (and prone to leaving you with an unreachable or even unbootable system). One way to take some of the pain out of owning a system with custom drivers is to leverage DKMS. In general, DKMS is the recommended way to ensure that, as kernels are updated, required kernel modules are also (automatically) updated.

Unfortunately, use of the DKMS method will require that developer tools (i.e., the GNU C-compiler) be present on the system - either in perpetuity or just any time kernel updates are applied. It is very likely that your security team will object to - or even prohibit - this. If the objection/prohibition cannot be overridden, use of the DKMS method will not be possible.

Steps

  1. Set an appropriate version string into the shell-environment:
    export VERSION=3.2.2
  2. Make sure that appropriate header files for the running-kernel are installed
    yum install -y kernel-devel-$(uname -r)
  3. Ensure that the dkms utilities are installed:
    yum --enablerepo=epel install dkms
  4. Download the driver sources and unarchive into the /usr/src directory:
    wget https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/${VERSION}/ixgbevf-${VERSION}.tar.gz/download \
        -O /tmp/ixgbevf-${VERSION}.tar.gz && \
       ( cd /usr/src && \
          tar zxf /tmp/ixgbevf-${VERSION}.tar.gz )
  5. Create an appropriate DKMS configuration file for the driver:
    cat > /usr/src/ixgbevf-${VERSION}/dkms.conf << EOF
    PACKAGE_NAME="ixgbevf"
    PACKAGE_VERSION="${VERSION}"
    CLEAN="cd src/; make clean"
    MAKE="cd src/; make BUILD_KERNEL=\${kernelver}"
    BUILT_MODULE_LOCATION[0]="src/"
    BUILT_MODULE_NAME[0]="ixgbevf"
    DEST_MODULE_LOCATION[0]="/updates"
    DEST_MODULE_NAME[0]="ixgbevf"
    AUTOINSTALL="yes"
    EOF
  6. Register the module to the DKMS-managed kernel tree:
    dkms add -m ixgbevf -v ${VERSION}
  7. Build the module against the currently-running kernel:
    dkms build ixgbevf/${VERSION}
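If the module is also wanted for the currently-running kernel right away (rather than only being auto-built as future kernel updates arrive), it may additionally be necessary to run:

    dkms install ixgbevf/${VERSION}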

Verification

The easiest way to verify the correct functioning of DKMS is to:
  1. Perform a `yum update -y`
  2. Check that the new drivers were created by executing `find /lib/modules -name ixgbevf.ko`. Output should be similar to the following:
    find /lib/modules -name ixgbevf.ko | grep extra
    /lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
    /lib/modules/2.6.32-642.6.1.el6.x86_64/extra/ixgbevf.ko
    There should be at least two output-lines: one for the currently-running kernel and one for the kernel update. If more kernels are installed, there may be more than just two output-lines
     
  3. Reboot the system, then check what version is active:
    modinfo ixgbevf | grep extra
    filename:       /lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
    If the output is null, DKMS didn't build the new module.

Thursday, August 25, 2016

Use the Force, LUKS

Not like there aren't a bunch of LUKS guides out there already ...mostly posting this one for myself.

Today, I was working on turning the (atrocious - other than a long-past deadline, DISA, do you even care what you're publishing?) RHEL 7 V0R2 STIG specifications into configuration management elements for our enterprise CM system. Got to the STIG item for "ensure that data-at-rest is encrypted as appropriate". This particular element is only semi-automatable ...since it's one of those "context" rules that has an "if local policy requires it" back-biting element to it. At any rate, this particular STIG-item prescribes the use of LUKS.

As I set about to write the code for this security-element, it occurred to me, "we so typically use array-based storage encryption - or things like KMS in cloud deployments - that I can't remember how to configure LUKS ...least of all configure it so it doesn't require human intervention to mount volumes." So, like any good Linux-tech, I petitioned the gods of Google. Lo, there were many results — most falling into either the "here's how you encrypt a device" or the "here's how you take an encrypted device and make the OS automatically remount it at boot" camps. I was looking to do both so that my test-rig could be rebooted and just have the volume there. I was worried about testing whether devices were encrypted, not whether leaving keys on a system was adequately secure.

At any rate, at least for testing purposes (and in case I need to remember these later), here's what I synthesized from my Google searches.

  1. Create a directory for storing encryption key-files. Ensure that directory is readable only by the root user:
    install -d -m 0700 -o root -g root /etc/crypt.d
  2. Create a 4KB key from randomized data (stronger encryption key than typical, password-based unlock mechanisms):
    # dd if=/dev/urandom of=/etc/crypt.d/cryptFS.key bs=1024 count=4
    ...writing the key to the previously-created, protected directory. Up the key-length by increasing the value of the count parameter.
     
  3. Use the key to create an encrypted raw device:
    # cryptsetup --key-file /etc/crypt.d/cryptFS.key \
    --cipher aes-cbc-essiv:sha256 luksFormat /dev/CryptVG/CryptVol
  4. Activate/open the encrypted device for writing:
    # cryptsetup luksOpen --key-file /etc/crypt.d/cryptFS.key \
    /dev/CryptVG/CryptVol CryptVol_crypt
    Pass the location of the encryption-key using the --key-file parameter.
     
  5. Add a mapping to the crypttab file:
    # ( printf "CryptVol_crypt\t/dev/CryptVG/CryptVol\t" ;
       printf "/etc/crypt.d/cryptFS.key\tluks\n" ) >> /etc/crypttab
    The OS will use this mapping-file at boot-time to open the encrypted device and ready it for mounting. The four column-values to the map are:
    1. Device-mapper Node: this is the name of the writable block-device used for creating filesystem structures and for mounting. The value is relative. When the device is activated, it will be assigned the device name /dev/mapper/<key_value>
    2. Hosting-Device: The physical device that hosts the encrypted psuedo-device. This can be a basic hard disk, a partition on a disk or an LVM volume.
    3. Key Location: Where the device's decryption-key is stored.
    4. Encryption Type: What encryption-method was used to encrypt the device (typically "luks")
     
  6. Create a filesystem on the opened encrypted device:
    # mkfs -t ext4 /dev/mapper/CryptVol_crypt
  7. Add the encrypted device's mount-information to the host's /etc/fstab file:
    # ( printf "/dev/mapper/CryptVol_crypt\t/cryptfs\text4\t" ;
       printf "defaults\t0 0\n" ) >> /etc/fstab
  8. Verify that everything works by hand-mounting the device (`mount -a`), creating the /cryptfs mount-point directory first if it doesn't already exist
  9. Reboot the system (`init 6`) to verify that the encrypted device(s) automatically mount at boot-time
Keys and mappings in place, the system will reboot with the LUKSed devices opened and mounted. The above method's also good if you wanted to give each LUKS-protected device its own, device-specific key-file.

Note: You will really want to back up these key files. If you somehow lose the host OS but not the encrypted devices, the only way you'll be able to re-open those devices is if you're able to restore the key-files to the new system. Absent those keys, you'd better have good backups of the unencrypted data - because you're starting from scratch.

Saturday, July 16, 2016

Retrospective Automatic Image Replication in NetBackup

In version 7.x of NetBackup, VERITAS added the Automatic Image Replication functionality. This technology is more commonly referred to as "AIR". Its primary use case is to enable a NetBackup administrator to easily configure data replication between two different — typically geographically-dispersed — NetBackup domains.

Like many tools that are designed for a given use-case, AIR can be used for things it wasn't specifically designed for. The primary down-side to these not-designed-for use-cases is that the documentation and tool-sets for such usage are generally pretty thin.

A customer I was assisting wanted to upgrade their appliance-based NetBackup system but didn't want to have to give up their old data. Because NetBackup appliances use Media Server Deduplication Pools (MSDP), it meant that I had a couple choices in how to handle their upgrade. I opted to try to use AIR to help me quickly and easily migrate data from their old appliance's MSDP to their new appliance's.

Sadly, as is typical of not-designed-for use-cases, documentation for doing it was kind of thin on the ground. Worse, because Symantec had recently spun VERITAS back off as its own entity, many of the forums that survived the transition had reference- and discussion-links that pointed to nowhere. Fortunately, I had access to a set of laboratory systems (AWS/Azure/Google Cloud/etc. is great for this - both from the standpoint of setup speed and "ready to go" OS templates). I was also able to borrow some of my customer's NetBackup 7.7 keys to use for the testing.

I typically prefer to work with UNIX/Linux-based systems to host NetBackup. However, my customer is a Windows-based shop. My customer's planned migration was also going to have the new NetBackup domain hosted on a different VLAN from their legacy NetBackup domain. This guided my lab design: I created a cloud-based "lab" configuration using two subnets and two Windows Server 2012 instance-templates. I set up each of my instances with enough storage to host the NetBackup software on one disk and the MSDPs on another disk ...and provisioned each of my test master servers with four CPUs and 16GiB of RAM. This is considerably smaller than both their old and new appliances, but I also wasn't trying to simulate an enterprise outpost's worth of backup traffic. I also set up a mix of about twenty Windows and Linux instances to act as testing clients (customer is beginning to add Linux systems as virtualization and Linux-based "appliances" have started to creep into their enterprise-stacks).

I set up two very generic NetBackup domains. Into each, I built an MSDP. I also set up a couple of very generic backup policies on the one NetBackup Master Server to backup all of the testing clients to the MSDP. I configured the policies for daily fulls and hourly incrementals, and set up each of the clients to continuously regenerate random data-sets in their filesystems. I let this run for forty-eight hours so that I could get a nice amount of seed-data into the source NBU domain's MSDP.

Note: If you're not familiar with MSDP setup, the SETTLERSOMAN website has a good, generic walkthrough.

After populating the source site's MSDP, I converted from using the MSDP by way of a direct STorage Unit definition (STU) to using it by way of a two-stage Storage Lifecycle Policy (SLP). I configured the SLP to use the source-site MSDP as the stage-one destination in the lifecycle and added the second NBU domain's MSDP as the stage-two destination in the lifecycle. I then seeded the second NBU domain's MSDP with data by executing a full backup of all clients against the SLP.

Note: For a discussion on setting up an AIR-based replication SLP, again, the SETTLERSOMAN website has a good, generic walkthrough.

All of the above is fairly straight-forward and well documented (both within the NBU documentation and sites like SETTLERSOMAN). However, it only addresses the issue of how you get newly-generated data from one NBU domain's MSDP to another's. Getting older data from an existing MSDP to a new MSDP is a bit more involved ...and not for the command-line phobic (or, in my case, PowerShell-phobic.)

At a high level, what you do is:
  1. Use the `bpimmedia` tool to enumerate all of the backup images stored on the source-site's MSDP
  2. Grab only the media-IDs of the enumerated backup images
  3. Feed that list of media-IDs to the `nbreplicate` tool so that it can copy that old data to the new MSDP
Note: The vendor documentation for the `bpimmedia` and  `nbreplicate` tools can be found at the VERITAS website.

When using the `bpimmedia` tool to automate image-ID enumeration, using the `-l` flag puts the output into a script-parsable format. The desired capture-item is the fourth field in all lines that begin 'IMAGE':
  • In UNIX/Linux shell, use an invocation similar to: `bpimmedia -l | awk '/^IMAGE/{print $4}'`
  • In PowerShell, use an invocation similar to:`bpimmedia -l | select-string -pattern "IMAGE *" | ForEach-Object { $data = $_ -split " " ; "{0}" -f $data[3] }`
The above output can then be either captured to a file — so that one the `nbreplicate` job can be launched to handle all of the images — or each individual image-ID can be passed to an individual `nbreplicate` job (typically via a command-pipeline in a foreach script). I ended up doing the latter because, even though the documentation indicates that the tool supports specifying an image-file, when executed under PowerShell, `nbreplicate` did not seem to know what to do with said file.

The `nbreplicate` command has several key flags we're interested in for this exercise:
  • -backupid: The backup-identifier captured via the `bpimmedia` tool
  • -cn: The copy-number to replicate — in most circumstances, this should be "1"
  • -rcn: The copy-number to assign to the replicated backup-image — in most circumstances, this should be "1"
  • -slp_name: the name of the SLP hosted on the destination NetBackup domain
  • -target_sts: the FQDN of the destination storage-server (use `nbemmcmd -listhosts` to verify names - or the replication jobs will fail with a status 191, sub-status 174)
  • -target_user: the username of a user that has administrative rights to the destination storage-server
  • -target_pwd: the password of the -target_user username
 If you don't care about minimizing the number of replication operations, this can all be put together similar to the following:
  • For Unix:
    for ID in $(bpimmedia -l | awk '/^IMAGE/{print $4}')
    do
       nbreplicate -backupid ${ID} -cn 1 -slp_name <REMOTE_SLP_NAME> \
         -target_sts <REMOTE_STORAGE_SERVER> -target_user <REMOTE_USER> \
         -target_pwd <REMOTE_USER_PASSWORD>
    done
    
  • For Windows:
    @(bpimmedia -l | select-string -pattern "IMAGE *" |
       ForEach-Object { $data = $_ -split " " ; "{0}" -f $data[3] }) |
       ForEach-Object { nbreplicate -backupid $_ -cn 1 `
         -slp_name <REMOTE_SLP_NAME> -target_sts <REMOTE_STORAGE_SERVER> `
         -target_user <REMOTE_USER> -target_pwd <REMOTE_USER_PASSWORD> }
    


Monday, June 13, 2016

Seriously, CentOS?

One of my (many) duties in our shop is doing cross-platform maintenance of RPMs. Previously, when I was maintaining them for EL5 and EL6, things were fairly straight-forward. You got or made a SPEC file to package your SOURCES into RPMs and you were pretty much good to go. Imagine my surprise when I went to start porting things to EL7 and all my freaking packages had el7.centos in their damned names. WTF?? These RPMs are for use on all of our EL7 systems, not just CentOS: why the hell are you dropping the implementation into my %{dist}-string?? So, now I have to make sure that my ${HOME}/.rpmmacros file has a line in it that looks like:
%dist .el7
If I don't want my %{dist}-string to get crapped-up.
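A quick sanity-check that the override took effect is to have rpm expand the macro:

rpm --eval '%{dist}'    # should now print ".el7" with no "centos" in it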

Screw you, CentOS. Stuff was just fine on CentOS 5 and CentOS 6. This change is not an improvement.

Tuesday, April 19, 2016

But I Don't Like That Username

One of the clients I do work for is in the process of adopting commercial cloud solutions. Early in the process, they had a private network on which they were executing virtualization and private-cloud efforts. The maintainers of those two environments had created standard builds for use within those environments. For better or worse, my client has a number of developer teams they've contracted out to whose primary efforts are conducted either in-house (to the contractor) or within AWS, Azure or Google Compute Engine.

The group I work for has been tasked with standardizing and automating the deployment of enterprise components across all of the various environments and providing stewardship of the other contracted-out development efforts. When we were first given this task, the customer's build engineers would not provide any methods to replicate the production build - not even sufficient documentation that would allow us to accurately mimic it well enough to enable the other developers with a seamless dev -> test -> prod workflow.

Early in our involvement, I ended up creating my own build. Eventually, in the previously-described vacuum, others decided to adopt my build. Now, there's enough different groups using that build that it's pressuring the maintainers of the internal build to abandon theirs and adopt mine.

Fun part of DevOps is that it tends to be consensus-driven. If there's a slow or unresponsive link in the chain, the old "top down" approach frequently becomes a casualty once critical mass is achieved bottom-up.

At any rate, we were recently in discussions with the enterprise build maintainers to show them how to adopt the consensus build. One of their pushbacks was "but that build doesn't include the 'break-glass' account that the legacy builds do." They wanted to know if we could modify the build to be compliant with that user-account.

This struck me as odd since our group's (and other dev-groups') approach to such issues is "modify it in code". This isn't an approach familiar to the enterprise team. They're very "golden image" oriented. So, I provided them a quick set of instructions on how to take the incoming standard build and make it compatible with their tools' expectations:
#cloud-config
system_info:
  default_user:
    name: ent-adm
For enterprise components that will be migrated from private virtualization and cloud solutions to commercial cloud offerings, the above allows them to take not just my build, but any build that's enabled for automated provisioning and inject their account. Instead of launching a system with whatever the default-user is that's baked in, the above allows them to reset any system's initial username to be whatever they want (the above's 'ent-adm' is just an example - I dunno what their preferred account name is - I've simply used the above whenever I'm deploying instance-templates and don't want to remember "instance X uses userid A; instance Y uses userid B; and instance Z uses userid C").

Even more fun: if they don't like any of the other attributes associated with the default user's account, they can override them with any user-parameter available in cloud-init.
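For illustration, a more fully-specified override might look like the following (every value here is a made-up example; the keys are the standard cloud-init default_user parameters):

#cloud-config
system_info:
  default_user:
    name: ent-adm
    gecos: Enterprise Break-Glass Account
    groups: [wheel]
    sudo: ["ALL=(ALL) NOPASSWD:ALL"]
    shell: /bin/bash
    lock_passwd: true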

Tuesday, February 16, 2016

Per-Sender SMTP SASL Authentication

One of the customers I do work for runs a multi-tenant environment. One of the issues this customer has been having is "how do we notify tenants that their systems want patching". For their Linux systems, the immediate answer was to use the `yum-cron` facility to handle it.

Unfortunately, the system that would receive these types of notification emails is different than the ones that handle generic SMTP relaying. Instead of being able to set up each Linux system's Postfix service to use a single Smart-relay, we needed to be able to have just the account that's used to send the ready-to-patch notifications relay through the notification gateway, while all other emails get directed through a broader-scope relay.

It took several search iterations to finally uncover the "trick" for allowing this (credit to zmwangx for his GitHub Gist-posting). Overall, it's pretty straight-forward. Only thing that was not immediately obvious was which tokens mapped-through (what was the common key). In summary, to create the solution one needs to do three things:
  1. Modify the Postfix configuration:
    1. Do the standard tasks around defining a "smart" relay
    2. Do the standard tasks around enabling Postfix to act as a SASL-authenticated client
    3. Enable per-sender authentication
    4. Define a per-sender relay-map
  2. Create a file to map a local sender-address to a SASL-credential
  3. Create a file to map a local sender-address to a specific relay-host
Once that's done, it's simply a matter of verifying that things are working the way you think they should be.

Modify Postfix: Define Default Smart Relay
As with much of my experimentation the past year or two, I did my testing using Amazon Web Services resources. In this case, I used AWS's Simple Email Service (SES). Configuring Postfix to relay is trivial. Since my testing was designed to ensure that everything except my desired sender-address would be denied relaying, I configured my test system to point to its local SES relay without actually configuring an SES account to enable that relaying. For Postfix, this was simply a matter of doing:
postconf -e "relayhost = [email-smtp.us-west-2.amazonaws.com]:587"
This appends a line to your /etc/postfix/main.cf that looks like:
relayhost = [email-smtp.us-west-2.amazonaws.com]:587
Note: The AWS SES relays are currently only available within a few regions (as of this writing, NoVA/us-east-1, Oregon/us-west-2 and Ireland/eu-west-1). Each relay requires a SASL credential be created to allow relaying. So, no big deal publicizing the relay's name if spammers don't have such credentials.

At this point, any mail sent via Postfix will attempt to relay through the listed relay-host. Further, because SASL client-credentials are not yet set up, those relay attempts will fail.

Modify Postfix: Define Default SASL-Client Credentials
Add a block similar to the following to begin to turn Postfix into an SMTP SASL-client:
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_sasl_mechanism_filter = plain
smtp_tls_CAfile = /etc/pki/tls/certs/ca-bundle.crt
smtp_use_tls = yes
smtp_tls_security_level = encrypt
Two critical items above are the "smtp_sasl_password_maps" and "smtp_tls_CAfile" parameters:

  1. The former instructs Postfix where to look for SASL credentials.
  2. The latter tells Postfix where to look for root Certificate Authorities so it can verify that the relay host's SSL certificates are valid. The path listed above is the default for Red Hat-derived distributions; alter it to fit your distribution's locations.
The rest of the parameters instruct Postfix to use SASL-authentication over TLS-encrypted channels (required when connecting to an SMTP-relay via port 587) and to use the "plain" mechanism for sending credentials to the relay-host.

Modify Postfix: Enable Per-Sender Authentication
Use the command `postconf -e "smtp_sender_dependent_authentication = yes"` to enable Postfix's per-sender authentication modules. This will add a line to the /etc/postfix/main.cf that looks like:
smtp_sender_dependent_authentication = yes

Modify Postfix: Define SASL-Sender Map
Once per-sender authentication is enabled, Postfix needs to be instructed where to find the per-sender mappings. Use the command `postconf -e "sender_dependent_relayhost_maps = hash:/etc/postfix/sender_relay"` to define the sender-to-relay map (with per-sender authentication enabled, the matching credential lookups go through the smtp_sasl_password_maps file defined earlier). This will add a line to the /etc/postfix/main.cf that looks like:
sender_dependent_relayhost_maps = hash:/etc/postfix/sender_relay

Create a Sender-to-Credential Map
Edit/create the sender-to-credential mapping file referenced by smtp_sasl_password_maps (/etc/postfix/sasl_passwd in the configuration above). Its contents should be similar to the following:
# Sender Address                                 <userid>:<password>
patch-alert@ses-test.cloudlab.xanthia.com        AKIAICAOGQX5UA0ACDSVJ:pDHM1n4uYLGN4BQOnzGcTSeQXSRDjcKCy6VkmQk+CoBV
Postfix will use the "patch-alert@ses-test.cloudlab.xanthia.com" sender-address as a common key with the value in the relay-map (following).


Create a Sender-to-Relay Map
Edit/create the sender-to-relay mapping file /etc/postfix/sender_relay. Its contents should be similar to the following:
# Sender Address                            [relay-host]:port
patch-alert@ses-test.cloudlab.xanthia.com        [email-smtp.us-west-2.amazonaws.com]:587
Postfix will use the "patch-alert@ses-test.cloudlab.xanthia.com" sender-address as a common key with the value in the credential-map (preceding).
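Both lookup tables are declared as `hash:` maps, so Postfix reads the compiled .db files rather than the flat files themselves. A minimal sketch of that housekeeping (the credential-file path should match whatever your smtp_sasl_password_maps setting points at):
# Keep the credential file out of casual view
chmod 600 /etc/postfix/sasl_passwd

# Compile the flat files into the .db files Postfix actually consults
postmap hash:/etc/postfix/sasl_passwd
postmap hash:/etc/postfix/sender_relay

# Pick up the main.cf changes and the freshly-built maps
postfix reload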

Verification
A quick verification test is to send an email from a mapped user-address and a non-mapped user address:
# sendmail -f patch-alert@ses-test.cloudlab.xanthia.com -t <<EOF
To: fubar@cloudlab.xanthia.com
Subject: Per-User SASL Test
Content-type: text/html

If this arrived, things are probably set up correctly
EOF
# sendmail -f unmapped-user@ses-test.cloudlab.xanthia.com -t <<EOF
To: fubar@cloudlab.xanthia.com
Subject: Per-User SASL Test
Content-type: text/html

If this bounced, things are probably set up correctly
EOF
This should result in an SMTP log-snippet that resembles the following:
Feb 16 18:08:09 ses-test maintuser: MARK == MARK == MARK
Feb 16 18:08:22 ses-test postfix/pickup[5484]: 2B7D244AB: uid=0 from=<patch-alert@ses-test.cloudlab.xanthia.com>
Feb 16 18:08:22 ses-test postfix/cleanup[5583]: 2B7D244AB: message-id=<20160216180822.2B7D244AB@ses-test.cloudlab.xanthia.com>
Feb 16 18:08:22 ses-test postfix/qmgr[5485]: 2B7D244AB: from=<patch-alert@ses-test.cloudlab.xanthia.com>, size=403, nrcpt=1 (queue active)
Feb 16 18:08:22 ses-test postfix/smtp[5585]: 2B7D244AB: to=<thjones2@gmail.com>, relay=email-smtp.us-west-2.amazonaws.com[54.187.123.10]:587, delay=0.37, delays=0.02/0.03/0.19/0.13, dsn=2.0.0, status=sent (250 Ok 00000152eb44d396-408494a9-93f0-4f21-8985-460c057537bf-000000)
Feb 16 18:08:22 ses-test postfix/qmgr[5485]: 2B7D244AB: removed
Feb 16 18:08:32 ses-test postfix/pickup[5484]: A339E44AB: uid=0 from=<bad-sender@ses-test.cloudlab.xanthia.com>
Feb 16 18:08:32 ses-test postfix/cleanup[5583]: A339E44AB: message-id=<20160216180832.A339E44AB@ses-test.cloudlab.xanthia.com>
Feb 16 18:08:32 ses-test postfix/qmgr[5485]: A339E44AB: from=<bad-sender@ses-test.cloudlab.xanthia.com>, size=408, nrcpt=1 (queue active)
Feb 16 18:08:32 ses-test postfix/smtp[5585]: A339E44AB: to=<thjones2@gmail.com>, relay=email-smtp.us-west-2.amazonaws.com[54.69.81.169]:587, delay=0.09, delays=0.01/0/0.08/0, dsn=5.0.0, status=bounced (host email-smtp.us-west-2.amazonaws.com[54.69.81.169] said: 530 Authentication required (in reply to MAIL FROM command))
As can be seen in the snippet, the first message (from the mapped sender) was relayed while the second message (from the unmapped sender) was rejected.

Thursday, September 24, 2015

Simple Guacamole

If the enterprise you work for is like mine, access through the corporate firewall is tightly-controlled. You may find that only web-related protocols are let through (mostly) unfettered. When you're trying to work with a service like AWS, this can make management of Linux- and/or Windows-based resources problematic.

A decent solution to such a situation is the use of HTML-based remote-connection gateway services. If all you're looking to do is SSH, the GateOne SSH-over-HTTP gateway is a quick and easy-to-set-up solution. If you need to manage instances via graphical desktops - most typically Windows, but some people like it for Linux as well - a better solution is Guacamole.

Guacamole is an extensible, HTTP-based solution. It runs as a Java servlet under a servlet container like Tomcat. If you're like me, you may also prefer to encapsulate/broker the Tomcat service through a generic HTTP service like Apache or Nginx. My preference has been Apache - but mostly because I've been using Apache since not long after it was formally forked off of the NCSA project. I also tend to favor Apache because it's historically been part of the core repositories of my Linux of choice, Red Hat/CentOS.

Guacamole gives you HTTP-tunneling options for SSH, Telnet, RDP and VNC. This walk-through is designed to get you quickly running Guacamole as a web-based SSH front-end. Once you've got the SSH component running, adding other management protocols is easy. This procedure is also designed to be doable even if you don't yet have the ability to SSH to an AWS-hosted instance.
  1. Start the AWS web console's "launch instance" wizard.
  2. Select an appropriate EL6-based AMI.
  3. Select an appropriate instance type (the free tier instances are suitable for a basic SSH proxy)
  4. On the "Configure Instance" page, expand the "Advanced Details" section.
  5. In the now-available text box, paste in the contents of this script. Note that this script is flexible enough that, if the version of Guacamole hosted via the EPEL project is updated, the script should continue to work. With a slight bit of massaging, the script could also be made to work with EL 7 and associated Tomcat and EPEL-hosted RPMs.
  6. If the AMI you've picked does not provide the option of password-based logins for the default SSH user, add steps (in the "Advanced Details" text box) for creating an interactive SSH user with a password. Ensure that the user also has the ability to use `sudo` to get root privileges.
  7. Finish up the rest of the process for deploying an instance.
Once the instance finishes deploying, you should be able to set your browser to the public hostname shown for the instance in the AWS console. Add "/guacamole/" after the hostname. Assuming all went well, you will be presented with a Guacamole login prompt. Enter the credentials defined in the pasted-in script.
Note that these credentials can be changed by editing the:
printf "\t<authorize username=\"admin\" password=\"PASSWORD\">\n"
Line of the pasted-in script. Once you've authenticated to Guacamole, you'll be able to log in to the hosting-instance via SSH using the instance's non-privileged user's credentials. Once logged in, you can escalate privileges and then configure additional authentication mechanisms, connection destinations and protocols.
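Speaking of that password: if you want to rotate it after the instance is already running, something like the following works. The /etc/guacamole/user-mapping.xml path is an assumption on my part - adjust it to wherever the pasted-in script actually writes its printf output - and "tomcat6" is the EL6 service name:
# Swap the placeholder password in Guacamole's basic-auth file (path assumed)
sed -i 's/password="PASSWORD"/password="SomethingStronger"/' /etc/guacamole/user-mapping.xml

# Restarting Tomcat is the sure-fire way to make Guacamole re-read the file
service tomcat6 restart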

Note: Guacamole doesn't currently support key-based login mechanisms. If key-based logins are a must, make use of GateOne instead.

Thursday, May 7, 2015

EXTn and the Tyranny of AWS

One of the organizations I provide consulting services to opted to start migrating from an in-house, VMware-based virtualization solution to an Amazon-hosted cloud solution. The transition has been somewhat fraught, here and there - possibly more so than the prior transition from physical (primarily Solaris-based) servers to virtualized (Linux) servers.

One of the huge "problems" is that the organization's various sub-units have habits formed over decade-or-longer lifecycles. Staff are particularly used to being able to get on console for various things (using GUI-based software-installers and graphical IDEs, as well as tasks that actually require console access - like recovering a system stuck in its startup sequence).

For all that AWS offers, console access isn't part of the package. Amazon's model for how customers should deploy and manage systems means they don't consider console access strictly necessary.

In a wholly self-service model, this is probably an ok assumption. Unfortunately, the IT model that the organization in question is moving to doesn't offer instance-owners true self-service. They're essentially trying to transport an on-premises, managed multi-tenancy model into AWS. The model they're moving from didn't have self-service, so they're not interested in enabling self-service (at least not during phase one of the move). Their tenants not only don't have console access in the new environment, they don't have the ability to execute the AWS-style recovery-methods you'd use in the absence of console access. The tenants are impatient and the group supporting them is small, so it's a tough situation.

The migration has been ongoing long enough that the default `mkfs` behavior for EXT-based filesystems is starting to rear its head. Well beyond the 180-day mark since the inception of the migration, tenants are finding that, when their Linux instances reboot, they're not coming back as quickly as they did towards the beginning of the migration ...because their builds still leave auto-fsck enabled.

If you're reading this, you may have run into similar issues.

The solution to this, while still maintaining the spirit of the "fsck every 180 days or so" best practice for EXTn-based filesystems, is fairly straight-forward:

  1. Disable the autofsck settings on your instances' EXTn filesystems: use `tune2fs -i 0 -c -1 /dev/<DEVNODE>`
  2. Schedule periodic fsck "simulations". This can be done either by running fsck in "dryrun" mode or by doing an fsck of a filesystem metadata image.
The "dryrun" method is fairly straight-forward: just run fsck with the "-N" option. I'm not much of a fan of this, as it doesn't feel like it gives me the info I'm looking for to feel good about the state of my filesystems.

The "fsck of a filesystem metadata image" method is pretty straight-forward, automatable (see the sketch after the list) and provides a bit more on the "warm fuzzies" side of things. To do it:
  1. Create a metadata image file using `e2image -fr /dev/<DEVNODE> /IMAGE/FILE/PATH` (e.g. `e2image -fr /dev/RootVG/auditVol /tmp/auditVol.img`)
  2. Use `losetup` to create an fsck'able block-device from the image file (e.g., `losetup /dev/loop0 /tmp/auditVol.img`)
  3. Execute an fsck against the loopback device (e.g., `fsck /dev/loop0`). Output will look similar to the following:
    # fsck /dev/loop0
    fsck from util-linux-ng 2.17.2
    e2fsck 1.41.12 (17-May-2010)
    /dev/loop0: recovering journal
    /dev/loop0: clean, 13/297184 files, 56066/1187840 blocks
  4. If the output indicates anything other than good health, schedule an outage to do a proper repair of your live filesystem(s)
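Strung together, the above is easy enough to drop into cron. A rough sketch, assuming the same /dev/RootVG/auditVol volume used in the examples and a free /dev/loop0 (both are illustrative - substitute the device-node and loop-device appropriate to your instance):
#!/bin/sh
# Rough sketch: check a metadata image of a mounted EXTn filesystem
FSDEV=/dev/RootVG/auditVol           # filesystem to check (illustrative)
IMG=/tmp/$(basename ${FSDEV}).img    # scratch location for the metadata image

e2image -fr ${FSDEV} ${IMG}          # dump the filesystem's metadata to an image
losetup /dev/loop0 ${IMG}            # expose the image as a block device
fsck -y /dev/loop0                   # check it (auto-answer "yes" - the image is disposable)
RET=$?

losetup -d /dev/loop0                # tear down the loop device...
rm -f ${IMG}                         # ...and remove the scratch image

exit ${RET}                          # non-zero exit: schedule a real, offline fsck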
Granted, if you find you need to do the full check of the real filesystem(s), you're still potentially stuck with the "no console" issue. Even that is potentially surmountable:
  1. Create a "/forcefsck" file
  2. Create a "/fsckoptions" file with the contents "-sy"
  3. Schedule your reboot
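In shell terms, those three steps boil down to something like the following (the /forcefsck and /fsckoptions paths are what EL6's rc.sysinit looks for; the five-minute delay is arbitrary):
touch /forcefsck                     # request a full fsck on the next boot
printf '%s\n' "-sy" > /fsckoptions   # serialize the checks and auto-answer "yes"
shutdown -r +5 "Rebooting for scheduled filesystem check"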
When the reboot happens, depending on how long the system takes to boot, the EC2 launch monitors may time out: just be patient. If you can't be patient, monitor the boot logs (either in the AWS console or using the AWS CLI's equivalent option).

Wednesday, March 25, 2015

So You Don't Want to BYOL

The Amazon Web Services MarketPlace is pretty awesome. There are oodles of pre-made machine templates to choose from. Even in the face of all that choice, though, it's not unusual to find that none quite fit your needs. That's the scenario I found myself in.

Right now, I'm supporting a customer that's a heavy user of Linux for their business support systems. They're in the process of migrating from our legacy hosting environment to hosting things on AWS. During their development phase, use of CentOS was sufficient for their needs. As they move to production, however, they want "real" Red Hat Enterprise Linux.

Go up on the MarketPlace and there are plenty of options to choose from. However, my customer doesn't want to deal with buying a stand-alone entitlement to patch-support for their AWS-hosted systems. This requirement considerably cuts down on the useful choices in the MarketPlace. There are still "license included" Red Hat options to choose from.

Unfortunately, my customer also has fairly specific partitioning requirements that are not met by the "license included" AMIs. When using CentOS, this wasn't a problem - CentOS's patch repos are open-access. Creating an AMI with suitable partitioning and access to those public repos is about a 20-minute process. While some of that process is leverageable for creating a Red Hat AMI, making the resultant AMI "license included" is a bit more challenging.

When I tried to simply re-use my CentOS process, supplemented by the Amazon repo RPMs, I ended up with a system whose yum-queries got me 401 errors. I was missing something.

Google searches weren't terribly helpful in solving my problem. I found a lot of "how do I do this" posts, but damned few that actually included the answer. Ultimately, it turns out that if you generate your AMI from an EBS snapshot, instances launched from that AMI don't have an entitlement key to access the Amazon yum repos. You can see this by looking at your launched instance's metadata:
# curl http://169.254.169.254/latest/dynamic/instance-identity/document
{
  "accountId" : "717243568699",
  "architecture" : "x86_64",
  "availabilityZone" : "us-west-2b",
  "billingProducts" : null,
  "devpayProductCodes" : null,
  "imageId" : "ami-9df0ec7a",
  "instanceId" : "i-51825ba7",
  "instanceType" : "t1.micro",
  "kernelId" : "aki-fc8f11cc",
  "pendingTime" : "2015-03-25T19:04:51Z",
  "privateIp" : "172.31.19.148",
  "ramdiskId" : null,
  "region" : "us-east-1",
  "version" : "2010-08-31"
}

Specifically, what you want to look at is the value for "billingProducts". If it's "null", your yum isn't going to be able to access the Amazon RPM repositories. Where I came up close to empty on my Google searches was "how to make this attribute persist across images".

I found a small note in a community forum post indicating that AMIs generated from an EBS snapshot will always have "billingProducts" set to "null". This is due to a limitation in the tool used to register an image from a snapshot.

To get around this limitation, one has to create an AMI from an instance of an entitled AMI. Basically, after you've created the EBS volume you've readied to make a custom AMI, you do a disk-swap with a properly-entitled instance. You then use the "create image" option from that instance (a rough AWS CLI sketch of the swap follows the metadata example below). Once you launch an AMI created via the EBS-swap, your instance's metadata will look something like:
# curl http://169.254.169.254/latest/dynamic/instance-identity/document
{
  "accountId" : "717243568699",
  "architecture" : "x86_64",
  "availabilityZone" : "us-west-2b",
  "billingProducts" : [ "bp-6fa54006" ],
  "devpayProductCodes" : null,
  "imageId" : "ami-9df0ec7a",
  "instanceId" : "i-51825ba7",
  "instanceType" : "t1.micro",
  "kernelId" : "aki-fc8f11cc",
  "pendingTime" : "2015-03-25T19:04:51Z",
  "privateIp" : "172.31.19.148",
  "ramdiskId" : null,
  "region" : "us-east-1",
  "version" : "2010-08-31"
}

Once that "billingProducts" is set, the cloud-init related first-boot scripts will take it and use it to register the system with the Amazon yum repos. Voilà: you now have a fully custom AMI that uses Amazon-billed access to Red Hat updates.
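For what it's worth, the disk-swap described above can be scripted with the AWS CLI. A rough sketch - every ID below is a placeholder, and the /dev/sda1 device-name needs to match whatever the entitled "donor" instance's root device actually is (see the compatibility note below):
# Stop the properly-entitled "donor" instance (placeholder IDs throughout)
aws ec2 stop-instances --instance-ids i-0donor

# Swap the donor's root volume for the custom-built one
aws ec2 detach-volume --volume-id vol-0donorroot
aws ec2 attach-volume --volume-id vol-0custom \
    --instance-id i-0donor --device /dev/sda1

# Register the new AMI from the still-entitled instance
aws ec2 create-image --instance-id i-0donor \
    --name "rhel-custom-partitioning" \
    --description "Custom-partitioned RHEL with license-included billing"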

Note on Compatibility: the Red Hat provided PVM AMIs do not yield well to this method. The Red Hat provided PVM AMIs are all designed with their boot/root device set to /dev/sda1. To date, attempts to leverage the above techniques for PVM AMIs that require their boot/root device set to /dev/sda (used when using a single, partitioned EBS to host a bare /boot partition and LVM-managed root partitions) have not met with success.

Thursday, February 5, 2015

Attack of the Clones

One of the clients I do work for has a fairly significant Linux footprint. However, in these times of greater fiscal responsibility/austerity, my client is looking at even cheaper alternatives. This means that, for systems whose applications don't require Red Hat for warranty-support, CentOS is being subbed into their environment. This is particularly true for their testing environments. There've even been arguments for doing it in production, applications' vendor-support be damned, because "CentOS is the same as Red Hat".

I've previously argued, "they're very, very similar, but they're not truly identical". In particular, Red Hat handles CVEs and errata somewhat differently than CentOS does (Red Hat backports many fixes to prior EL releases, CentOS's stance is generally "upgrade it").

Today, I got bit by one place where CentOS hews far too closely to "the same as Red Hat Enterprise Linux". Specifically, I was using the `oscap` security tool to do a security audit of a test system. I should say, "I was struggling to use the `oscap` security tool...". In later versions of EL6, Red Hat (and, as a derivative, CentOS) implements the CPE system for Linux.

This is all fine and good, except where the tools you use rely on the correctness of CPE-related definitions. By the standard of CPE, Red Hat and CentOS are very much not "the same". Because the security-auditing tool I was using (`oscap`) leverages CPEs, and because the CentOS maintainers simply repackage the Red Hat-furnished security profiles without updating the CPE call-outs, the security tool fails horribly: every test comes back as "notapplicable".
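A quick way to see why everything comes back "notapplicable" is to count the Red Hat CPE call-outs baked into the CentOS-repackaged content (path as used throughout this post); a non-zero count on a CentOS box is the mismatch in a nutshell:
# Count the Red Hat CPE references in the CentOS-shipped SCAP content
grep -c 'cpe:/o:redhat:enterprise_linux:6' \
    /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml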

To fix this situation, a bit of `sed`-fu is required:
mv /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml \
   /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml-DIST && \
cp /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml-DIST \
   /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml && \
sed -i '{
   s#Red Hat Enterprise Linux 6#CentOS 6#g
   s#cpe:/o:redhat:enterprise_linux:6#cpe:/o:centos:centos:6#g
}' /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml


mv /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml \
   /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml-DIST && \
cp /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml-DIST \
   /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml && \
sed -i \
   's#cpe:/o:redhat:enterprise_linux#cpe:/o:centos:centos#g' \
/usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml

Once the above is done, running `oscap` actually produces useful results.

NOTE: Ironically, doing the above edits will cause the various SCAP profiles to flag an error when running the tests that verify that RPMs have been unaltered. I've submitted a bug to the CentOS group so these fixes are included in future versions of the CentOS OpenSCAP RPMs but, until then, you just need to be aware that the `oscap` tool will flag the above two files.
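If you want to see (or document) that flagging for yourself, rpm's verify mode against the packages that own the edited files will show the changed digests and timestamps:
# Verify the packages owning the edited content; expect size/digest/mtime flags
rpm -Vf /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml
rpm -Vf /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml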

...And if you found this page because you're trying to figure out how to run `oscap` to get results, here's a sample invocation that should act as a starting-point:

oscap xccdf eval --profile common --report \
   /var/tmp/oscap-report_`date "+%Y%m%d%H%M"`.html \
   --results /var/tmp/oscap-results_`date "+%Y%m%d%H%M"`.xml \
   --cpe /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-dictionary.xml \
   /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml

Wednesday, November 26, 2014

Converting to EL7: Solving The "Your Favorite Service Isn't Systemd-Enabled" Problem

Having finally gotten off my butt to knock out my RHCE (for EL6) before the 2014-12-19 drop-dead date, I'm now ready to start focusing on migrating my personal systems to EL7-based distros.

My personal VPS is currently running CentOS 6.6. I use my VPS to host a couple of personal websites and email for family and a few friends. Yes, I realize that it would probably be easier to offload all of this to providers like Google. However, while Google is very good at SPAM-stomping, and provides me a very generous amount of space for archiving emails, one area where they do fall short is email aliases: whenever I have to register with a new web-site, I use a custom email address to do so. At my last pruning, I still had 300+ per-site aliases. So, for me, the number of available aliases ("unlimited" is best) and the ease of creating them trumps all other considerations.

Since I don't have Google handling my mail for me, I have to run my own A/V and anti-spam engines. Being a good Internet Citizen, I also like to make use of Sender Policy Framework (via OpenSPF) and DomainKeys (currently via DKIMproxy).

I'm only just into the process of sorting out what I need to do to make the transition as quick and as painless (more for my family and friends than me) a process as possible. I hate outages. And, with a week off for the Thanksgiving holidays, I've got time to do things in a fairly orderly fashion.

At any rate, one of the things I discovered is that my current DomainKeys solution hasn't been updated to "just work" within the systemd framework used within EL7. This isn't terribly surprising, as it appears that the DKIMproxy SourceForge project may have gone dormant in 2013 (so, I'll have to see if there are alternatives that have the appearance of still being a going concern - in the meantime...). Fortunately, the DKIMproxy source code does come with a `chkconfig` compatible SysV-init script. Even more fortunately, converting from SysV-init to systemd-compatible service control is a bit more straight-forward than when I was dealing with moving from Solaris 9's legacy init to Solaris 10's SMF.

If you've already got a `chkconfig` style init script, moving to systemd-managed is fairly trivial. Your `chkconfig` script can be copied, pretty much "as is" into "/usr/lib/systemd". My (current) preference is to create a "scripts" subdirectory and put it in there. Haven't read deeply enough into systemd to see if this is the "Best Practices" method, however. Also, where I work has no established conventions ...because they only started migrating to EL6 in fall of 2013 - so, I can't exactly crib anything EL7-related from how we do it at work.

Once you have your SysV-init style script placed where it's going to live (e.g., "/usr/lib/systemd/scripts"), you need to create associated service definition files. In my particular case, I had to create two as the DKIMproxy software actually has an inbound and an outbound funtion. Launched from normal SysV-init, it all gets handled as one piece. However, one of the nice things about systemd is it's not only a launcher framework, it's a service monitor framework, as well. To take full advantage, I wanted one monitor for the inbound service and one for the outbound service. The legacy init script that DKIMproxy ships with makes this easy enough as, in addition to the normal "[start|stop|restart|status]" arguments, it had per-direction subcommand (e.g., "start-in" and "stop-out"). The service-definition for my "dkim-in.service" looks like:
[Unit]
Description=Manage the inbound DKIM service
After=postfix.service

[Service]
Type=forking
PIDFile=/usr/local/dkimproxy/var/run/dkimproxy_in.pid
ExecStart=/usr/lib/systemd/scripts/dkim start-in
ExecStop=/usr/lib/systemd/scripts/dkim stop-in

[Install]
WantedBy=multi-user.target

To break down the above:

  • The "Unit" stanza tells systemd a bit about your new service:
    • The "Description" line is just ASCII text that allows you to provide a short, meaningful description of what the service does. You can see your service's description field by typing `systemctl -p Description show <SERVICENAME>`
    • The "After" parameter is a space-separated list of other services that you want to have successfully started before systemd attempts to start your new service. In my case, since DKIMproxy is an extension to Postfix, it doesn't make sense to try to have DKIMproxy running until/unless Postfix is running.
  • The "Service" stanza is where you really define how your service should be managed. This is where you tell systemd how to start, stop, or reload your service and what PID it should look for so it knows that the service is still notionally running. The following parameters are the minimum ones you'd need to get your service working. Other parameters are available to provide additional functionality:
    • The "Type" parameter tells systemd what type of service it's managing. Valid types are: simple, forking, oneshot, dbus, notify or idle. The systemd.service man page more-fully defines what each option is best used for. However, for a traditional daemonized service, you're most likely to want "forking".
    • The "PIDFile" parameter tells systemd where to find a file containing the parent PID for your service. It will then use this to do a basic check to monitor whether your service is still running (note that this only checks for presence, not actual functionality).
    • The "ExecStart" parameter tells systemd how to start your service. In the case of a SysV-init script, you point it to the fully-qualified path you installed your script to and then any arguments necessary to make that script act as a service-starter. If you don't have a single, chkconfig-style script that handles both stop and start functions, you'd simply give the path to whatever starts your service. Notice that there are no quotation marks surrounding the parameter's value-section. If you put quotes - in the mistaken belief that the starter-command and its arguments need to be grouped - you'll get a path error when you go to start your service the first time.
    • The "ExecStop" parameter tells systemd how to stop your service. As with the "ExecStart" parameter, if you're leveraging a fully-featured SysV-init script, you point it to the fully-qualified path you installed your script to and then any arguments necessary to make that script act as a service-stopper. The same rules about white-space and quotation marks apply to the "ExecStop" parameter as to the "ExecStart" parameter.
  • The "Install" stanza is where you tell systemd the main part of the service dependency-tree in which to put your service. You have two main dependency-specifiers to choose from: "WantedBy" and "RequiredBy". The former is a soft dependency while the latter is a hard dependency. If you use the "RequiredBy" parameter, then the service unit-group (e.g., "multi-user.target") enumerated with the "RequiredBy" parameter will only be considered to have successfully onlined if the defined service has successfully launched and stayed running. If you use the "WantedBy" parameter, then the service unit-group (e.g., "multi-user.target") enumerated with the "WantedBy" parameter will still be considered to have successfully onlined whether or not the defined service has successfully launched or stayed running. It's most likely you'll want to use "WantedBy" rather than "RequiredBy", as you typically won't want systemd to back off the entire unit-group just because your service failed to start or stay running (e.g., you don't want to stop all of the multi-user mode related processes just because one network service has failed).
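With the legacy script sitting in /usr/lib/systemd/scripts and the unit file written, wiring it all in is just the usual systemd housekeeping. A rough sketch, assuming the unit above was saved as /etc/systemd/system/dkim-in.service (the outbound dkim-out.service gets built the same way, just pointing at the "start-out"/"stop-out" subcommands):
# Make sure the relocated init script is executable
chmod 755 /usr/lib/systemd/scripts/dkim

# Have systemd re-read its unit files, then enable and start the new service
systemctl daemon-reload
systemctl enable dkim-in.service
systemctl start dkim-in.service

# Confirm systemd found the PID-file and considers the service healthy
systemctl status dkim-in.service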