
Wednesday, March 27, 2019

Travis, EL7 and Docker-Based Testing

As noted in a prior post, a lot of my customer-oriented activities support deployment within networks that are either partially or wholly isolated from the public Internet. Yesterday, as part of supporting one such customer, I stood up a new project to help automate the creation of yum repository configuration RPMs for private networks. I've had to hand-jam such files twice now, and there are unwanted deltas between the two jam-sessions (in my defense, they were separated by nearly a three-year time-span). So, I figured it was time to standardize and automate things.

Usually, when I stand up a project, I like to include tests of the content that I wish to deliver. Since most of my projects are done in public GitHub repositories, I typically use TravisCI to automate my testing. Prior to this project, however, I hadn't tried to automate the validity-testing of RPM recipes via Travis. Typically, when automating creation of RPMs I wish to retain or deliver, I set up a Jenkins job that takes the resultant RPMs and stores them in Artifactory – both privately-hosted services. Most of my prior Travis jobs were simple syntax-checkers (using tools like shellcheck, JSON validators, CFn validators, etc.) rather than functionality-checkers.

This time, however, I was trying to deliver a functionality (RPM spec files that would be used to generate source files from templates and package the results). So, I needed to be able to test that a set of spec files and source-templates could be reliably used to generate RPMs. This meant I needed my TravisCI job to generate "throwaway" RPMs from the project-files.

The TravisCI system's test-hosts are Ubuntu-based rather than RHEL- or CentOS-based. While there are some tools that will allow you to generate RPMs on Ubuntu, there've been some historical caveats about their reliability and/or representativeness. So, my preference was to be able to use a RHEL- or CentOS-based context for my test-packagings. Fortunately, TravisCI does offer the ability to use Docker on their test-hosts.

In general, setting up a Docker-oriented job is relatively straightforward. Where things get "fun" is that the version of `rpmbuild` that comes with Enterprise Linux 7 gets kind of annoyed if it's not able to resolve the UIDs and GIDs of the files it's trying to build from (never mind that the build-user inside the running Docker-container is "root" ...and has unlimited access within that container). If it can't resolve them, the rpmbuild tasks fail with a multitude of not-terribly-helpful "error: Bad owner/group: /path/to/repository/file" messages.

After googling about, I ultimately found that I needed to ensure that the UIDs and GIDs of the project-content existed within the Docker-container's /etc/passwd and /etc/group files, respectively. Note: most of the "top" search results Google returned to me indicated that the project files needed to be `chown`ed. However, simply being mappable proved to be sufficient.

Rounding the home stretch...

To resolve the problem, I needed to determine what UIDs and GIDs the project-content had inside my Docker-container. That meant pushing a Travis job that included a (temporary) diagnostic-block to stat the relevant files and return their UIDs and GIDs. Once the UIDs and GIDs were determined, I needed to update my Travis job to add the relevant groupadd and useradd statements to my container-preparation steps. What I ended up with was:

    sudo docker exec centos-${OS_VERSION} groupadd -g 2000 rpmbuilder
    sudo docker exec centos-${OS_VERSION} adduser -g 2000 -u 2000 rpmbuilder

It was late in the day, by this point, so I simply assumed that the IDs were stable. I ran about a dozen iterations of my test, and they stayed stable, but that may have just been "luck". If I run into future "owner/group" errors, I'll update my Travis job-definition to scan the repository-contents for their current UIDs and GIDs and then set them based on those. But, for now, my test harness works: I'm able to know that updates to existing specs/templates or additional specs/templates will create working RPMs when they're taken to where they need to be used.
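
Should those errors reappear, the container-prep steps could derive the IDs from the checked-out files instead of hard-coding them. A minimal sketch, assuming the Travis-provided ${TRAVIS_BUILD_DIR} points at the checkout and reusing the container- and user-names from above:

    # Derive the UID/GID that the project files actually have on the build host
    REPO_UID="$(stat -c '%u' "${TRAVIS_BUILD_DIR}")"
    REPO_GID="$(stat -c '%g' "${TRAVIS_BUILD_DIR}")"
    # Create a matching group and user inside the container so rpmbuild can resolve ownership
    sudo docker exec centos-${OS_VERSION} groupadd -g "${REPO_GID}" rpmbuilder
    sudo docker exec centos-${OS_VERSION} adduser -g "${REPO_GID}" -u "${REPO_UID}" rpmbuilder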

Thursday, August 16, 2018

Systemd And Timing Issues

Unlike a lot of Linux people, I'm not a knee-jerk hater of systemd. My "salaried UNIX" background, up through 2008, was primarily with OSes like Solaris and AIX. With Solaris, in particular, I was used to systemd-type init-systems due to SMF.

That said, making the switch from RHEL and CentOS 6 to RHEL and CentOS 7 hasn't been without its issues. The change from upstart to systemd is a lot more dramatic than from SysV-init to upstart.

Much of the pain with systemd comes with COTS software originally written to work on EL6. Some vendors really only do fairly cursory testing before saying something is EL7-compatible. Many — especially earlier in the EL 7 lifecycle — didn't bother creating systemd services at all. They simply relied on the systemd-sysv-generator utility to do the dirty work for them.

While the systemd-sysv-generator utility does a fairly decent job, one of the places it can fall down is if the legacy-init script (files hosted in /etc/rc.d/init.d) is actually a symbolic link to someplace else in the filesystem. Even then, it's not much of a problem if "someplace else" is still within the "/" filesystem. However, if your SOPs include "segregate OS and application onto different filesystems", then "someplace else" can very much be a problem — when "someplace else" is on a different filesystem from "/".

Recently, I was asked to automate the installation of some COTS software with the "it works on EL6 so it ought to work on EL7" type of "compatibility". Not only did the software not come with systemd service files, its legacy-init files linked out to software installed in /opt. Our shop's SOPs are of the "applications on their own filesystems" variety. Thus, the /opt/<APPLICATION> directory is actually its own filesystem hosted on its own storage device. After doing the installation, I'd reboot the system. ...And when the system came back, even though there was a boot script in /etc/rc.d/init.d, the service wasn't starting. Poring over the logs, I eventually found:
systemd-sysv-generator[NNN]: stat() failed on /etc/rc.d/init.d/<script_name>: No such file or directory
This struck me as odd, given that the link and its destination very much did exist.

Turns out, systemd invokes the systemd-sysv-generator utility very early in the system-initialization process. It invokes it so early, in fact, that the /opt/<APPLICATION> filesystem has yet to be mounted when it runs. Thus, when it looks to do the conversion, the file the sym-link points to does not yet exist.

My first thought was, "screw it: I'll just write a systemd service file for the stupid application." Unfortunately, the application's starter was kind of a rat's nest of suck and fail; complete and utter lossage. Trying to invoke it directly via a systemd service definition resulted in the application's packaged controller-process not knowing where to find a stupid number of its sub-components. Brittle. So, I searched for other alternatives...

Eventually, my searches led me to both the nugget about when systemd invokes the systemd-sysv-generator utility and how to overcome the "sym-link to a yet-to-be-mounted-filesystem" problem. Under systemd-enabled systems, there's a new-with-systemd mount-option you can place in /etc/fstab — x-initrd.mount. You also need to make sure that your filesystem's fs_passno is set to "0" ...and if your filesystem lives on an LVM2 volume, you need to update your GRUB2 config to ensure that the LVM gets onlined prior to systemd invoking the systemd-sysv-generator utility. Fugly.
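
For reference, a minimal sketch of what such an fstab entry might look like (the device path and mount point are placeholders):

    # Note the x-initrd.mount option and the trailing fs_passno of "0"; if the
    # device is an LVM2 volume, the GRUB2 kernel arguments also need something
    # along the lines of rd.lvm.lv=appVG/appVol so the VG is activated in the initramfs
    /dev/mapper/appVG-appVol    /opt/<APPLICATION>    ext4    defaults,x-initrd.mount    0 0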

At any rate, once I implemented this fix, the systemd-sysv-generator utility became happy with the sym-linked legacy-init script ...And my vendor's crappy application was happy to restart on reboot.

Given that I'm deploying on AWS, I was able to accommodate setting these fstab options by doing:
mounts:
  - [ "/dev/nvme1n1", "/opt/<application>", "auto", "defaults,x-initrd.mount", "0", "0" ]
within my cloud-init declaration-block. This should work in any context that allows you to use cloud-init.

I wish I could say that this was the worst problem I've run into with this particular application. But, really, this application is an all around steaming pile of technology.

Friday, March 23, 2018

So You Wanna Run Sonarqube on (Hardened) RHEL/CentOS 7

As developers become more concerned with injecting security-awareness into their processes, tools like Sonarqube become more popular. As that popularity increases, the desire to run that software on more than just the Linux distro(s) it was originally built to run on increases.

Most of my customers only allow the use of Red Hat (and, increasingly, CentOS) on their production networks. As these customers hire developers who didn't cut their teeth on Red Hat, the pressure to deploy those developers' preferred tools onto Enterprise Linux increases.

Sonarqube has two relevant installation-methods listed in its installation documentation. One is to explode a ZIP archive, install the files wherever desired, and set up the requisite automated startup as appropriate. The other is to install a "semi-official" RPM-packaging. Initially, I chose the former, but subsequently changed to the latter. While the former is fine if you're installing very infrequently (and installing by hand), it can fall down if you're doing automated deployments by way of service-frameworks like Amazon's CloudFormation (CFn) and/or AutoScaling.

My preferred installation method is to use CFn to create AutoScaling Groups (ASGs). I don't run Sonarqube clustered: the community edition that my developers' lack of funding allows me to deploy doesn't support clustering. Using ASGs means that I can increase the service's availability by having the Sonarqube service simply rebuild itself if it ever unexpectedly goes offline. Rebuilds only take a couple minutes and generally don't require any administrator intervention.

This method was compatible with the ZIP-based installation method for the first month or so of running the service. Then, one night, it stopped working. When I investigated to figure out why it stopped, I found that the URL the deployment-automation was pointing to no longer existed. I'm not really sure what happened, because the Sonarqube maintainers seem to keep back-versions of the software around indefinitely.

/shrug

At any rate, it caused me to return to the installation documentation. This time, I noticed that there was an RPM option. So, I went down that path. For the most part, it made things dead easy vice using the ZIP packaging. There were only a few gotchas with the RPM:
  • The current packaging, in addition to being a "noarch" packaging, is neither EL version-specific nor truly cross-version. Effectively, it's EL6-oriented: it includes a legacy init script but not a systemd service definition. The RPM maintainer likely needs to either make separate EL6 and EL7 packagings or update the packaging to include files for both init-types and use a platform-selector to install and activate the appropriate one
  • The current packaging creates the service account 'sonar' and installs the packaged files as user:group 'sonar' ...but doesn't include start-logic to ensure that Sonarqube runs as 'sonar'. This will manifest in Sonarqube's es.log file similarly to:
    [2018-02-21T15:45:51,714][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
    org.elasticsearch.bootstrap.StartupException: java.lang.RuntimeException: can not run elasticsearch as root
            at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:125) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cli.SettingCommand.execute(SettingCommand.java:54) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:96) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cli.Command.main(Command.java:62) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:89) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:82) ~[elasticsearch-5.1.1.jar:5.1.1]
    Caused by: java.lang.RuntimeException: can not run elasticsearch as root
            at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:100) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:176) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:306) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-5.1.1.jar:5.1.1]
            ... 6 more
  • The embedded Elasticsearch service requires that the file-descriptors ulimit be at a specific minimum value: if the deployment is on a hardened system, it is reasonably likely the default ulimits are too low. If too low, this will manifest in Sonarqube's es.log file similarly to:
    [2018-02-21T15:47:23,787][ERROR][bootstrap                ] Exception
    java.lang.RuntimeException: max file descriptors [65535] for elasticsearch process likely too low, increase to at least [65536]
        at org.elasticsearch.bootstrap.BootstrapCheck.check(BootstrapCheck.java:79)
        at org.elasticsearch.bootstrap.BootstrapCheck.check(BootstrapCheck.java:60)
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:188)
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:264)
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:111)
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:106)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:88)
        at org.elasticsearch.cli.Command.main(Command.java:53)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:74)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)
  • The embedded Elasticsearch service requires that the kernel's vm.max_map_count runtime parameter be tweaked. If not adequately tweaked, this will manifest in Sonarqube's es.log file similarly to:
    max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
  • The embedded Elasticsearch service's JVM won't start if the /tmp directory has had the noexec mount-option applied as part of the host's hardening. This will manifest in Sonarqube's es.log file similarly to:
    WARN  es[][o.e.b.Natives] cannot check if running as root because JNA is not available 
    WARN  es[][o.e.b.Natives] cannot install system call filter because JNA is not available 
    WARN  es[][o.e.b.Natives] cannot register console handler because JNA is not available 
    WARN  es[][o.e.b.Natives] cannot getrlimit RLIMIT_NPROC because JNA is not available
    WARN  es[][o.e.b.Natives] cannot getrlimit RLIMIT_AS because JNA is not available
    WARN  es[][o.e.b.Natives] cannot getrlimit RLIMIT_FSIZE because JNA is not available
  • External users won't be able to communicate with the service if the host-based firewall hasn't been appropriately configured.
The first three items are all addressable via a properly-written systemd service definition. The following is what I put together for my deployments:
[Unit]
Description=SonarQube
After=network.target network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/sonar/bin/linux-x86-64/sonar.sh start
ExecStop=/opt/sonar/bin/linux-x86-64/sonar.sh stop
ExecReload=/opt/sonar/bin/linux-x86-64/sonar.sh restart
Group=sonar
LimitNOFILE=65536
PIDFile=/opt/sonar/bin/linux-x86-64/SonarQube.pid
Restart=always
RestartSec=30
User=sonar
Type=forking

[Install]
WantedBy=multi-user.target
I put this into a GitHub project I set up as part of automating my deployments for AWS. The 'User=sonar' option causes systemd to start the process as the sonar user. The 'Group=sonar' directive ensures that the process runs under the sonar group. The 'LimitNOFILE=65536' setting ensures that the service's file-descriptor ulimit is set sufficiently high.

The fourth item is addressed by creating an /etc/sysctl.d/sonarqube file containing:
vm.max_map_count = 262144
I take care of this with my automation but the RPM maintainer could obviate the need to do so by including an /etc/sysctl.d/sonarqube file as part of the RPM.
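To apply the new setting without waiting for a reboot, something like the following should suffice:

    # Load just the new fragment (or, equivalently: sysctl -w vm.max_map_count=262144)
    sysctl -p /etc/sysctl.d/sonarqube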

The fifth item is addressable through the sonar.properties file. Adding a line like:
sonar.search.javaAdditionalOpts=-Djava.io.tmpdir=/var/tmp/elasticsearch
will cause Sonarqube to point ElasticSearch's temp-directory to /var/tmp/elasticsearch. Note that this is an example directory: pick a location in which the sonar user can create a directory and that is on a filesystem that does not have the noexec mount-option set. Without making this change, ElasticSearch will fail to start because JNA cannot be loaded.
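A minimal way to pre-create such a directory (the path mirrors the example above):

    # Owned by the sonar service-account; must live on a filesystem mounted without noexec
    install -d -m 0750 -o sonar -g sonar /var/tmp/elasticsearch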

The final item is addressable by adding a port-exception to the firewalld service. This can be done as simply as executing `firewall-cmd --permanent --service=sonarqube --add-port=9000/tcp` or adding a /etc/firewalld/services/sonarqube.xml file containing:
<?xml version="1.0" encoding="utf-8"?>
<service>
  <short>Sonarqube service ports</short>
  <description>Firewalld options supporting Sonarqube deployments</description>
  <port protocol="tcp" port="9000" />
  <port protocol="tcp" port="9001" />
</service>
Either is suitable for inclusion in an RPM: the former by way of a %post script; the latter as a config-file packaged in the RPM.
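If going the `firewall-cmd` route end-to-end, a sketch of the full sequence might look like the following (the zone name is an assumption; substitute whatever zone the host actually uses):

    # Create the service definition, add its ports, expose it in the zone, then reload
    firewall-cmd --permanent --new-service=sonarqube
    firewall-cmd --permanent --service=sonarqube --add-port=9000/tcp
    firewall-cmd --permanent --service=sonarqube --add-port=9001/tcp
    firewall-cmd --permanent --zone=public --add-service=sonarqube
    firewall-cmd --reload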

Once all of the bulleted-items are accounted for, the RPM-included service should start right up and function correctly on a hardened EL7 system. Surprisingly enough, no other tweaks are typically needed to make Sonarqube run on a hardened system. Sonarqube will typically function just fine with both FIPS-mode and SELinux active.

It's likely that once Sonarqube is deployed, functionality beyond the defaults will be desired. There are quite a number of functionality-extending plugins available. Download the ones you want, install them to <SONAR_INSTALL_ROOT>/extensions/plugins, and restart the service. Most plugins' behavior is configurable via the Sonarqube administration GUI.

Note: while many of the 'fixes' following the gotcha list are scattered throughout the Sonarqube documentation, on an unhardened EL 7 system, the system defaults are loose enough that accounting for them was not necessary for either the ZIP- or RPM-based installation-methods.

Extra: If you want the above taken care of for you and you're trying to deploy Sonarqube onto AWS, I put together a set of CloudFormation templates and helper scripts. They can be found on GitHub.

Monday, January 8, 2018

The Joys of SCL (or, "So You Want to Host RedMine on Red Hat/CentOS")

Recently, I needed to go back and try to re-engineer my company's RedMine deployment. Previously, I had rushed through a quick-standup of the service using the generic CentOS AMI furnished by CentOS.Org. All of the supporting automation was very slap-dash and had more things hard-coded than I really would have liked. However, the original time-scale the automation needed to be delivered under meant that "stupid but works ain't stupid" ruled the day.

That ad hoc implementation was over two years ago, at this point. Since the initial implementation, I had never had the chance to start to revisit it. The lack of a contiguous chunk of time to re-implement changed over the holidays (hooray for everyone being gone over the holiday weeks!). At any rate, there were a couple things I wanted to accomplish with the re-build:

  1. Change from using the generic CentOS AMI to using our customized AMI
    • Our AMI has a specific partitioning-scheme for our OS disks that we prefer to be used across all of our services. We wanted our RedMine deployment to no longer be an "outlier".
    • Our AMIs are updated monthly, whereas the CentOS.Org ones are typically updated when there's a new 7.X. Using our AMIs means not having to wait nearly as long for launch-time patch-up to complete before things can become "live".
  2. I wanted to make use of our standard, provisioning-time OS-hardening framework — both so that the instance is better configured to be Internet-facing and so it is more in accordance with our other service-hosts
  3. I wanted to be able to use a vendor-packaged Ruby rather than the self-packaged/self-managed Ruby. I had previously gone the custom Ruby route both because the vendor's default Ruby is too old for RedMine to use and because SCL was not quite as mature an offering back in 2015. In updating my automation to make use of SCL, I was figuring that a vendor-packaged RPM is more likely to be kept up to date without potentially introducing breakage that self-packaged RPMs can be prone to.
  4. I wanted to switch from using UserData to using CloudFormation (CFn) for launch automation: 
    • Relatively speaking, CFn automation includes a lot more ability to catch errors/deployment-breakage with lower effort. Typically, it's easier to avoid the "I thought it deployed correctly" happenstances.
    • Using CFn automation exposes less information via instance-metadata either via the management UIs/CLIs or by reading metadata from the instance.
    • Using CFn automation makes it easier to coordinate the provisioning of non-EC2 solution components (e.g., security groups, IAM roles, S3 buckets, EFS shares, RDS databases, load-balancers etc.)
  5. I wanted to clean up the overall automation-scripts so I could push more stuff into (public) GitHub (projects) and have to pull less information from private repositories (like S3 buckets).
Ultimately, the biggest challenge turned out to be dealing with SCL. I can probably chalk the problems up to SCL's focus. Specifically, the SCL packages seem to be designed more for use by developers in interactive environments than to underpin services. If you're a developer, all you really need to do to have a "more up-to-date than vendor-standard" language package is:
  • Enable the SCL repos
  • Install the SCL-hosted language package you're looking for
  • Update either your user's shell-inits or the system-wide shell inits
  • Log-out and back in
  • Go to town with your new SCL-furnished capabilities.
If, on the other hand, you want to use an SCL-managed language to back a service that can't run with the standard vendor-packaged languages (etc.), things are a little more involved.

A little bit of background:

The SCL-managed utilities are typically rooted wholly in "/opt/rh": binaries, libraries and documentation alike. Because of this, your shell's PATH, MANPATH and linker-paths will need to be updated. Fortunately, each SCL-managed package ships with an `enable` utility. You can do something as simple as `scl enable <SCL_PACKAGE_NAME> bash`, do the equivalent in your user's "${HOME}/.profile" and/or "${HOME}/.bash_profile" files, or you can do something by way of a file in "/etc/profile.d/". Since I assume "everyone on the system wants to use it", I enable by way of a file in "/etc/profile.d/". Typically, I create a file named "enable_<PACKAGE>.sh" with contents similar to:
#!/bin/bash
source /opt/rh/<RPM_NAME>/enable
export X_SCLS="$(scl enable <RPM_NAME> 'echo $X_SCLS')"
within it. Each time an interactive user logs in, they get the SCL-managed package configured for their shell. If multiple SCL-managed packages are installed, they all "stack" nicely (failure to do the above can result in one SCL-managed package interfering with another).

Caveats:

Unfortunately, while this works well for interactive shell users, it's not generally sufficient for the non-interactive environments. This includes actual running services or stuff that's provisioned at launch-time via scripts kicked off by way of `systemd` (or `init`). Such processes do not use the contents of the "/etc/profile.d" directory (or other user-level initialization scripts). This creates a "what to do" scenario.

Linux affords a couple methods for handling this:
  • You can modify the system-wide PATH (etc.). To make the run-time linker find libraries in non-standard paths, you can create files in "/etc/ld.so.conf.d" that enumerate the desired linker search-paths and then run `ldconfig` (the vendor-packaging of MariaDB and a few others does this). However, this has traditionally been considered not to be "good form" and is frowned upon.
  • Use under systemd:
    • You can modify the scripts invoked via systemd/init; however, if those scripts are RPM-managed, such modifications can be lost when the RPM gets updated via a system patch-up
    • Under systemd, you can create customized service-definitions that contain the necessary environmentals. Typically, this can be quickly done by copying the service-definition found in the "/usr/lib/systemd/system" directory to the "/etc/systemd/system" directory and modifying as necessary (see the sketch after this list).
  • Cron jobs: If you need to automate the running of tasks via cron, you can either:
    • Write action-wrappers that include the requisite environmental settings
    • Make use of the ability to set environmentals in jobs defined in /etc/cron.d
  • Some applications (such as Apache) allow you to extend library and other search directives via run-time configuration directives. In the case of RedMine, Apache calls the Passenger framework which, in turn, runs the Ruby-based RedMine code. To make Apache happy, you create an "/etc/httpd/conf.d/Passenger.conf" that looks something like:
    LoadModule passenger_module /opt/rh/rh-ruby24/root/usr/local/share/gems/gems/passenger-5.1.12/buildout/apache2/mod_passenger.so
    
    <IfModule mod_passenger.c>
            PassengerRoot /opt/rh/rh-ruby24/root/usr/local/share/gems/gems/passenger-5.1.12
            PassengerDefaultRuby /opt/rh/rh-ruby24/root/usr/bin/ruby
            SetEnv LD_LIBRARY_PATH /opt/rh/rh-ruby24/root/usr/lib64
    </IfModule>
    
    The directives:
    • "LoadModule" directive binds the module-name to the fully-qualified path to the Passenger Apache-plugin (shared-library). This file is created by the Passenger installer.
    • "PassengerRoot" tells Apache where to look for further, related gems and libraries.
    • "PassengerDefaultRuby" tells Apache which Ruby to use (useful if there's more than one Ruby version installed or, as in the case of SCL-managed packages, Ruby is not found in the standard system search paths).
    • "SetEnv LD_LIBRARY_PATH" tells Apachels run-time linker where else to look for shared library files.
    Without these, Apache won't find RedMine or all of the associated (SCL-managed) Ruby dependencies.
The above list should be viewed in "worst-to-first" ascending-order of preference. The second of the systemd and cron methods and the application-configuration method (notionally) have the narrowest scope (area of effect, blast-radius, etc.) — only affecting where the desired executions look for non-standard components — and are the generally preferred methods.
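
As an illustration of the copied-and-modified service-definition approach, the additions usually boil down to a handful of Environment directives. The following is a minimal sketch reusing the rh-ruby24 paths referenced above (the exact paths and variables a given service needs will vary):

    [Service]
    # Put the SCL-managed Ruby ahead of the system Ruby for this service only
    Environment=PATH=/opt/rh/rh-ruby24/root/usr/bin:/usr/local/bin:/usr/bin:/bin
    # Let the run-time linker find the SCL-managed shared libraries
    Environment=LD_LIBRARY_PATH=/opt/rh/rh-ruby24/root/usr/lib64
    Environment=X_SCLS=rh-ruby24
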
    Note On Provisioning Automation Tools:

    Because automation tools like CloudFormation (and cloud-init) are not interactive processes, they too will not process the contents of the "/etc/profile.d/" directory. One can use one of the above enumerated methods for modifying the invocation of these tools or include explicit sourcing of the relevant "/etc/profile.d/" files within the scripts executed by these frameworks.

    Wednesday, September 6, 2017

    Order Matters

    For one of the projects I'm working on, I needed to automate the deployment of JNLP-based Jenkins build-agents onto cloud-deployed (AWS/EC2) EL7-based hosts.

    When we'd first started the project, we were using the Jenkins EC2 plugin to do demand-based deployment of EC2 spot-instances to host the Jenkins agent-software. It worked decently and kept deployment costs low. For better or worse, our client preferred using fixed/reserved instances and JNLP-based agent setup vice master-orchestrated, SSH-based setup. The requested method likely makes sense if/when they decide there's a need for Windows-based build agents — as all agents will be configured equivalently (via JNLP). It also eliminates the need to worry about SSH key-management.

    At any rate, we made the conversion. It was initially trivial ...until we implemented our cost-savings routines. Our customer's developers don't work 24/7/365 (and neither do we). It didn't make sense to have agents just sitting around idle racking up EC2 and EBS time. So, we placed everything under a home-brewed tool to manage scheduled shutdowns and startups.

    The next business-day, the problems started. The Jenkins master came online just fine. All of the agents also started back up ...however, the Jenkins master was declaring them dead. As part of the debugging process, we would log into each of the agent-hosts, only to find that there was a JNLP process in the PID-list. Initially, we assumed the declared-dead problem was just a simple timing issue. Our first step was to try rescheduling the agents' start to happen a half hour before the master's. When that failed, we then tried setting the master's start a half hour before the agents'. No soup.

    Interestingly, doing a quick `systemctl restart jenkins-agent` on the agent hosts would make them pretty much immediately go online in the Master's node-manager. So, we started to suspect something between the master and agent-nodes causing issues.

    Because our agents talk to the master through an ELB, the issues led us to suspect the ELB was proving problematic. After all, even though the Jenkins master starts up just fine, the ELB doesn't start passing traffic for a minute or so after the master service starts accepting direct connections. We suspected that perhaps there was enough lag in the ELB going fully active that the master was declaring the agents dead — each agent-host was showing a ping-timeout error in the master console.

    Upon logging back into the agents, we noticed that, while the java process was running, it wasn't actually logging. For background, when using default logging, the Jenkins agent creates ${JENKINS_WORK_DIR}/logs/remoting.log.N files. The actively logged-to file is the ".0" file - which you can verify with other tools ...or just notice that there's a remoting.log.0.lck file. Interestingly, the remoting.log.0 file was blank, whereas all the other, closed files showed connection-actions between the agent and master.

    So, started looking at the systemd file we'd originally authored:

    [Unit]
    Description=Jenkins Agent Daemon
    
    [Service]
    ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret <secret> -workDir <workdir>
    User=jenkins
    Restart=always
    RestartSec=15
    
    [Install]
    WantedBy=multi-user.target

    Initially, I tried playing with the RestartSec parameter. It made no difference in behavior ...probably because the java process never really exits to be restarted. So, I did some further digging about - particularly with an eye toward missing systemd dependencies I might have failed to track.

    When you're making the transition from EL6 to EL7, one of the things that's easy to forget is that systemd is not a sequential init-system. Turned out that, while I'd told the service, "don't start till multi-user.target is going", I hadn't told it "...but make damned sure that the NICs are actually talking to the network." That was the key, and was accomplished by adding:

    Wants=network-online.target
    Requires=network.target

    to the service definition files' [Unit] sections. Some documents indicated setting the Wants= to network.target. Others indicated that there were some rare cases where network.target may be satisfied before the NIC is actually fully ready to go — and that network-online.target was therefore the better dependency to set. In practice, the time between network.target and network-online.target coming online is typically in the millisecond range - so, "take your pick". I was tired of fighting things, so I opted for network-online.target just to be conservative/"done with things".

    After making the above change to one of my three nodes, that node immediately began to reliably online after full service-downings, agent-reboots or even total re-deployments of the agent-host. The change has since been propagated to the other agent nodes and the "run only during work-hours" service-profile has been tested "good".

    Thursday, November 3, 2016

    Automation in a Password-only UNIX Environment

    Occasionally, my organization has need to run ad hoc queries against a large number of Linux systems. Usually, this is a great use-case for an enterprise CM tool like HPSA, Satellite, etc. Unfortunately, the organization I consult to is between solutions (their legacy tool burned down and their replacement has yet to reach a working state). The upshot is that one needs to do things a bit more on-the-fly.

    My preferred method for accessing systems is using some kind of token-based authentication framework. When hosts are AD-integrated, you can often use Kerberos for this. Failing that, you can sub in key-based logins (if all of your targets have your key as an authorized key). While my customer's systems are AD-integrated, their security controls preclude the use of both AD/Kerberos's single-signon capabilities and the use of SSH key-based logins (and, to be honest, almost none of the several hundred targets I needed to query had my key configured as an authorized key).

    Because (tunneled) password-based logins are forced, I was initially looking at the prospect of having to write an expect script to avoid having to type in my password several hundred times. Fortunately, there's an alternative to this in the tool "sshpass".

    sshpass lets you supply your password via a number of methods: a command-line argument, a password-containing file, an environment-variable value or even a read from STDIN. I'm not a fan of text files containing passwords (they've a bad tendency to be forgotten and left on a system - bad juju). I'm not particularly a fan of command-line arguments, either - especially on a multi-user system where others might see your password if they `ps` at the wrong time (which increases in probability as the length of time your job runs goes up). The STDIN method is marginally less awful than the command-arg method (for similar reasons). At least with an environment variable, the value really only sits in memory (especially if you've set your HISTFILE location to someplace non-persistent).

    The particular audit I was doing was an attempt to determine the provenance of a few hundred VMs. Over time, the organization has used templates authored by several groups - and different personnel within one of the groups. I needed to scan all of the systems to see which template they might have been using since the template information had been deleted from the hosting environment. Thus, I needed to run an SSH-encapsulated command to find the hallmarks of each template. Ultimately, what I ended up doing was:

    1. Pushed my query-account's password into the environment variable used by sshpass, "SSHPASS"
    2. Generated a file containing the IPs of all the VMs in question.
    3. Set up a for loop to iterate that list
    4. Looped `sshpass -e ssh -o PubkeyAuthentication=no -o StrictHostKeyChecking=no <USER>@${HOST} <AUDIT_COMMAND_SEQUENCE> 2>&1` (see the sketch after this list)
    5. Jammed STDOUT through a sed filter to strip off the crap I wasn't interested in and put CSV-appropriate delimiters into each queried host's string.
    6. Captured the lot to a text file
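    Pulled together, the loop looked roughly like the following (a minimal sketch: the user, host-list file, audit command and output file are all placeholders):
      # Stash the password in the environment variable sshpass looks for
      read -s -p "Audit-account password: " SSHPASS && export SSHPASS ; echo
      while read -r HOST
      do
         sshpass -e ssh -o PubkeyAuthentication=no -o StrictHostKeyChecking=no \
            "<USER>@${HOST}" "<AUDIT_COMMAND_SEQUENCE>" 2>&1 | \
            sed -e "s/^/${HOST},/"
      done < hostlist.txt > audit-results.csv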
    The "PubkeyAuthentication=no" option was required because I pretty much always have SSH-agent (or agent-forwarding) enabled. This causes my key to be passed. With the targets' security settings, this causes the connection to be aborted unless I explicitly suppress the passing of my agent-key.

    The "StrictHostKeyChecking=no" option was required because I'd never logged into these hosts before. Our standard SSH client config is to require confirmation before accepting the remote key (and shoving it into ${HOME}/.ssh/known_hosts). Without this option, I'd be required to confirm acceptance of each key ...which is just about as automation-breaking as having to retype your password hundreds of times is.

    Once the above was done, I had a nice CSV that could be read into Excel and a spreadsheet turned over to the person asking "who built/owns these systems". 

    This method also meant that for the hosts that refused the audit credentials, it was easy enough to report "...and this lot aren't properly configured to work with AD".

    Monday, October 17, 2016

    Update to EL6 and the Pain of 10Gbps Networking in AWS

    In a previous post, EL6 and the Pain of 10Gbps Networking in AWS, I discussed how to enable using third-party ixgbevf drivers to support 10Gbps networking in AWS-hosted RHEL 6 and CentOS 6 instances. Under further testing, it turns out that this is overkill. The native drivers may be used, instead, with minimal hassle.

    It turns out that, in my quest to ensure that my CentOS and RHEL AMIs contained as many of the AWS utilities present in the Amazon Linux AMIs as possible, I included two RPMs - ec2-net and ec2-net-util - that were preventing use of the native drivers. Skipping these two RPMs (and possibly sacrificing ENI hot-plug capabilities) allows a much more low-effort support of 10Gbps networking in AWS-hosted EL6 instances.

    Absent those RPMs, 10Gbps support becomes a simple matter of the following (an illustrative command sequence follows the list):
    1. Add add_drivers+="ixgbe ixgbevf" to the AMI's /etc/dracut.conf file
    2. Use dracut to regenerate the AMI's initramfs.
    3. Ensure that there are no persistent network device mapping entries in /etc/udev/rules.d.
    4. Ensure that there are no ixgbe or ixgbevf config directives in /etc/modprobe.d/* files.
    5. Enable sr-iov support in the instance-to-be-registered
    6. Register the AMI 
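    A rough sketch of the command-level pieces of steps 1, 2 and 5 (the region and instance-id values are placeholders, and the instance must be stopped before the attribute change):
      # Steps 1 & 2: reference the drivers in dracut.conf and rebuild the running kernel's initramfs
      echo 'add_drivers+="ixgbe ixgbevf"' >> /etc/dracut.conf
      dracut -f -v
      # Step 5: flag the (stopped) instance for SR-IOV "enhanced networking" support
      aws ec2 modify-instance-attribute --region <region> --instance-id <instance-id> --sriov-net-support simple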
    The resulting AMIs (and instances spawned from them) should support 10Gbps networking and be compatible with M4-generation instance-types.

    Friday, October 7, 2016

    Using DKMS to maintain driver modules

    In my prior post, I noted that maintaining custom drivers for the kernel in RHEL and CentOS hosts can be a bit painful (and prone to leaving you with an unreachable or even unbootable system). One way to take some of the pain out of owning a system with custom drivers is to leverage DKMS. In general, DKMS is the recommended way to ensure that, as kernels are updated, required kernel modules are also (automatically) updated.

    Unfortunately, use of the DKMS method will require that developer tools (i.e., the GNU C-compiler) be present on the system - either in perpetuity or just any time kernel updates are applied. It is very likely that your security team will object to - or even prohibit - this. If the objection/prohibition cannot be overridden, use of the DKMS method will not be possible.

    Steps

    1. Set an appropriate version string into the shell-environment:
      export VERSION=3.2.2
    2. Make sure that appropriate header files for the running-kernel are installed
      yum install -y kernel-devel-$(uname -r)
    3. Ensure that the dkms utilities are installed:
      yum --enablerepo=epel install dkms
    4. Download the driver sources and unarchive into the /usr/src directory:
      wget https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/${VERSION}/ixgbevf-${VERSION}.tar.gz/download \
          -O /tmp/ixgbevf-${VERSION}.tar.gz && \
         ( cd /usr/src && \
            tar zxf /tmp/ixgbevf-${VERSION}.tar.gz )
    5. Create an appropriate DKMS configuration file for the driver:
      cat > /usr/src/ixgbevf-${VERSION}/dkms.conf << EOF
      PACKAGE_NAME="ixgbevf"
      PACKAGE_VERSION="${VERSION}"
      CLEAN="cd src/; make clean"
      MAKE="cd src/; make BUILD_KERNEL=\${kernelver}"
      BUILT_MODULE_LOCATION[0]="src/"
      BUILT_MODULE_NAME[0]="ixgbevf"
      DEST_MODULE_LOCATION[0]="/updates"
      DEST_MODULE_NAME[0]="ixgbevf"
      AUTOINSTALL="yes"
      EOF
    6. Register the module to the DKMS-managed kernel tree:
      dkms add -m ixgbevf -v ${VERSION}
    7. Build the module against the currently-running kernel:
      dkms build ixgbevf/${VERSION}

    Verification

    The easiest way to verify the correct functioning of DKMS is to:
    1. Perform a `yum update -y`
    2. Check that the new drivers were created by executing `find /lib/modules -name ixgbevf.ko`. Output should be similar to the following:
      find /lib/modules -name ixgbevf.ko | grep extra
      /lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
      /lib/modules/2.6.32-642.6.1.el6.x86_64/extra/ixgbevf.ko
      There should be at least two output-lines: one for the currently-running kernel and one for the kernel update. If more kernels are installed, there may be more than just two output-lines
       
    3. Reboot the system, then check what version is active:
      modinfo ixgbevf | grep extra
      filename:       /lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
      If the output is null, DKMS didn't build the new module.

    Wednesday, October 5, 2016

    EL6 and the Pain of 10Gbps Networking in AWS

    AWS-hosted instances with optimized networking-support enabled see the 10Gbps interface as an Intel 10Gbps ethernet adapter (`lspci | grep Ethernet` will display a string similar to Intel Corporation 82599 Virtual Function). This interface makes use of the ixgbevf network-driver. Enterprise Linux 6 and 7 bundle version 2.12.x of the driver into the kernel RPM. Per the Enabling Enhanced Networking with the Intel 82599 VF Interface on Linux Instances in a VPC document, AWS recommends version 2.14.2 or higher of the ixgbevf network-driver for enhanced networking. To meet AWS's recommendations for 10Gbps support within an EL6 instance, it will be necessary to update the ixgbevf driver to at least version 2.14.2.
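
    To see which driver-version an instance currently has available, a quick check is something like:
      # Reports the ixgbevf module version known to the running kernel
      modinfo ixgbevf | grep '^version'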

    The ixgbevf network-driver source-code can be found on SourceForge. It should be noted that not every AWS-compatible version will successfully compile on EL 6. Version 3.2.2 is known to successfully compile, without intervention, on EL6.


    Notes:


    >>>CRITICAL ITEM<<<

    It is necessary to recompile the ixgbevf driver and inject it into the kernel each time the kernel version changes. This needs to be done between changing the kernel version and rebooting into the new kernel version. Failure to update the driver each time the kernel changes will result in the instance failing to return to the network after a reboot event.

    Step #10 from the implementation procedure:
    rpm -qa kernel | sed 's/^kernel-//' | xargs -I {} dracut -v -f /boot/initramfs-{}.img {}
    
    is the easiest way to ensure any available kernels are properly linked against the ixgbevf driver.

    >>>CRITICAL ITEM<<<

    It is possible that the above process can be avoided by installing DKMS and letting it coordinate the insertion of the ixgbevf driver modules into updated kernels.


    Procedure:


    The following assumes the instance-owner has privileged access to the instance OS and can make AWS-level configuration changes to the instance-configuration:
    1. Login to the instance
    2. Escalate privileges to root
    3. Install the up-to-date ixgbevf driver. This can be installed either by compiling from source or using pre-compiled binaries.
    4. Delete any `*persistent-net*.rules` files found in the /etc/udev/rules.d directory (one or both of 70-persistent-net.rules and 75-persistent-net-generator.rules may be present)
    5. Ensure that an `/etc/modprobe.d` file with the following minimum contents exists:
      alias eth0 ixgbevf
      
      Recommend creating/placing in /etc/modprobe.d/ifaliases.conf
    6. Unload any ixgbevf drivers that may be in the running kernel:
      modprobe -rv ixgbevf
      
    7. Load the updated ixgbevf driver into the running kernel:
      modprobe -v ixgbevf
      
    8. Ensure that the /etc/modprobe.d/ixgbevf.conf file exists. Its contents should resemble:
      options ixgbevf InterruptThrottleRate=1
      
    9. Update the /etc/dracut.conf. Ensure that the add_drivers+="" directive is uncommented and contains reference to the ixgbevf modules (i.e., `add_drivers+="ixgbevf"`)
    10. Recompile all installed kernels:
      rpm -qa kernel | sed 's/^kernel-//'  | xargs -I {} dracut -v -f /boot/initramfs-{}.img {}
      
    11. Shut down the instance
    12. When the instance has stopped, use the AWS CLI tool to enable optimized networking support:
      aws ec2 modify-instance-attribute --region <region> --instance-id <instance-id> --sriov-net-support simple
      
    13. Power the instance back on
    14. Verify that 10Gbps capability is available:
      1. Check that the ixgbevf module is loaded
        $ sudo lsmod
        Module                  Size  Used by
        ipv6                  336282  46
        ixgbevf                63414  0
        i2c_piix4              11232  0
        i2c_core               29132  1 i2c_piix4
        ext4                  379559  6
        jbd2                   93252  1 ext4
        mbcache                 8193  1 ext4
        xen_blkfront           21998  3
        pata_acpi               3701  0
        ata_generic             3837  0
        ata_piix               24409  0
        dm_mirror              14864  0
        dm_region_hash         12085  1 dm_mirror
        dm_log                  9930  2 dm_mirror,dm_region_hash
        dm_mod                102467  20 dm_mirror,dm_log
        
      2. Check that `ethtool` is showing that the default interface (typically "`eth0`") is using the ixgbevf driver:
        $ sudo ethtool -i eth0
        driver: ixgbevf
        version: 3.2.2
        firmware-version: N/A
        bus-info: 0000:00:03.0
        supports-statistics: yes
        supports-test: yes
        supports-eeprom-access: no
        supports-register-dump: yes
        supports-priv-flags: no
        
      3. Verify that the interface is listed as supporting a link mode of `10000baseT/Full` and a speed of `10000Mb/s`:
        $ sudo ethtool eth0
        Settings for eth0:
                Supported ports: [ ]
                Supported link modes:   10000baseT/Full
                Supported pause frame use: No
                Supports auto-negotiation: No
                Advertised link modes:  Not reported
                Advertised pause frame use: No
                Advertised auto-negotiation: No
                Speed: 10000Mb/s
                Duplex: Full
                Port: Other
                PHYAD: 0
                Transceiver: Unknown!
                Auto-negotiation: off
                Current message level: 0x00000007 (7)
                                       drv probe link
                Link detected: yes
        

    Thursday, August 25, 2016

    Use the Force, LUKS

    Not like there aren't a bunch of LUKS guides out there already ...mostly posting this one for myself.

    Today, I was working on turning the (atrocious - other than a long-past deadline, DISA, do you even care what you're publishing?) RHEL 7 V0R2 STIG specifications into configuration-management elements for our enterprise CM system. I got to the STIG item for "ensure that data-at-rest is encrypted as appropriate". This particular element is only semi-automatable ...since it's one of those "context" rules that has an "if local policy requires it" back-biting element to it. At any rate, this particular STIG-item prescribes the use of LUKS.

    As I set about to write the code for this security-element, it occurred to me that we typically use array-based storage encryption - or things like KMS in cloud deployments - and that I couldn't remember how to configure LUKS ...least of all configure it so it doesn't require human intervention to mount volumes. So, like any good Linux-tech, I petitioned the gods of Google. Lo, there were many results — most falling into either the "here's how you encrypt a device" or the "here's how you take an encrypted device and make the OS automatically remount it at boot" camps. I was looking to do both so that my test-rig could be rebooted and just have the volume there. I was worried about testing whether devices were encrypted, not whether leaving keys on a system was adequately secure.

    At any rate, at least for testing purposes (and in case I need to remember these later), here's what I synthesized from my Google searches.

    1. Create a directory for storing encryption key-files. Ensure that directory is readable only by the root user:
      install -d -m 0700 -o root -g root /etc/crypt.d
    2. Create a 4KB key from randomized data (stronger encryption key than typical, password-based unlock mechanisms):
      # dd if=/dev/urandom of=/etc/crypt.d/cryptFS.key bs=1024 count=4
      ...writing the key to the previously-created, protected directory. Up the key-length by increasing the value of the count parameter.
       
    3. Use the key to create an encrypted raw device:
      # cryptsetup --key-file /etc/crypt.d/cryptFS.key \
      --cipher aes-cbc-essiv:sha256 luksFormat /dev/CryptVG/CryptVol
    4. Activate/open the encrypted device for writing:
      # cryptsetup luksOpen --key-file /etc/crypt.d/cryptFS.key \
      /dev/CryptVG/CryptVol CryptVol_crypt
      Pass the location of the encryption-key using the --key-file parameter.
       
    5. Add a mapping to the crypttab file:
      # ( printf "CryptVol_crypt\t/dev/CryptVG/CryptVol\t" ;
         printf "/etc/crypt.d/cryptFS.key\tluks\n" ) >> /etc/crypttab
      The OS will use this mapping-file at boot-time to open the encrypted device and ready it for mounting. The four column-values to the map are:
      1. Device-mapper Node: this is the name of the writable block-device used for creating filesystem structures and for mounting. The value is relative. When the device is activated, it will be assigned the device name /dev/mapper/<key_value>
      2. Hosting-Device: The physical device that hosts the encrypted pseudo-device. This can be a basic hard disk, a partition on a disk or an LVM volume.
      3. Key Location: Where the device's decryption-key is stored.
      4. Encryption Type: What encryption-method was used to encrypt the device (typically "luks")
       
    6. Create a filesystem on the opened encrypted device:
      # mkfs -t ext4 /dev/mapper/CryptVol_crypt
    7. Add the encrypted device's mount-information to the host's /etc/fstab file:
      # ( printf "/dev/mapper/CryptVol_crypt\t/cryptfs\text4\t" ;
         printf "defaults\t0 0\n" ) >> /etc/fstab
    8. Verify that everything works by hand-mounting the device (`mount -a`)
    9. Reboot the system (`init 6`) to verify that the encrypted device(s) automatically mount at boot-time
    Keys and mappings in place, the system will reboot with the LUKSed devices opened and mounted. The above method's also good if you want to give each LUKS-protected device its own, device-specific key-file.

    Note: You will really want to back up these key files. If you somehow lose the host OS but not the encrypted devices, the only way you'll be able to re-open those devices is if you're able to restore the key-files to the new system. Absent those keys, you'd better have good backups of the unencrypted data - because you're starting from scratch.

    Monday, June 13, 2016

    Seriously, CentOS?

    One of my (many) duties in our shop is doing cross-platform maintenance of RPMs. Previously, when I was maintaining them for EL5 and EL6, things were fairly straight-forward. You got or made a SPEC file to package your SOURCES into RPMs and you were pretty much good to go. Imagine my surprise when I went to start porting things to EL7 and all my freaking packages had el7.centos in their damned names. WTF?? These RPMs are for use on all of our EL7 systems, not just CentOS: why the hell are you dropping the implementation into my %{dist}-string?? So, now I have to make sure that my ${HOME}/.rpmmacros file has a line in it that looks like:
    %dist .el7
    if I don't want my %{dist}-string to get crapped-up.

    Screw you, CentOS. Stuff was just fine on CentOS 5 and CentOS 6. This change is not an improvement.

    Friday, January 8, 2016

    Solving Root VG Collisions in LVM-Enabled Virtualization Templates

    The nature of template-based Linux OS deployments means that, if a template uses LVM for its root filesystems, any system built from that template will have non-unique volume group (VG) names. In most situations, non-unique VG names are not a problem. However, if you encounter a situation where you need to recover a broken instance by correcting a problem within the instance's root filesystems, non-unique VG names can make that task more difficult.

    To avoid this eventuality, the template user can easily modify each system launched from the template by executing steps similar to the following:
    #!/bin/sh
    
    # Determine the primary (default-route) interface, then convert its IPv4
    # address into a hexadecimal uniqueness-string (e.g., 10.0.1.5 -> 0A000105)
    DEFIF=$(ip route show | awk '/^default/{print $5}')
    BASEIP=$(printf '%02X' \
             $(ip addr show ${DEFIF} | \
               awk '/inet /{print $2}' | \
               sed -e 's#/.*$##' -e 's/\./ /g' \
              ))
    
    # Rename the root VG and update all references to it
    vgrename -v VolGroup00 VolGroup00_${BASEIP}
    sed -i 's/VolGroup00/&_'${BASEIP}'/' /etc/fstab
    sed -i 's/VolGroup00/&_'${BASEIP}'/g' /boot/grub/grub.conf
    
    # Rebuild the initramfs for every installed kernel so it knows the new VG name
    for KRNL in $(awk '/initrd/{print $2}' /boot/grub/grub.conf | \
                  sed -e 's/^.*initramfs-//' -e 's/\.img$//')
    do
       mkinitrd -f -v /boot/initramfs-${KRNL}.img ${KRNL}
    done
    
    # Reboot to pick up the renamed VG
    init 6
    Note that the above script assumes that the current root VG name is "VolGroup00". If your current root VG name is different, change the value in the script above as appropriate.

    This script may be executed either at instance launch-time or anywhere in the life-cycle of an instance. The above script takes the existing root VG name and tacks on a uniqueness component. In this case, the uniqueness is achieved by taking the IP address of the instance's primary interface and converting it to a hexadecimal string. So long as a group of systems does not contain any repeated primary IP addresses, this should provide a sufficient level of uniqueness for a group of deployed systems.

    Note: renaming the root VG will _not_ solve the problems caused by PV UUID non-uniqueness. Currently, there is no known-good solution to this issue. The general recommendation is to avoid that problem by using a different template to build your recovery-host than was used to build your broken host.

    Thursday, September 24, 2015

    Simple Guacamole

    If the enterprise you work for is like mine, access through the corporate firewall is tightly-controlled. You may find that only web-related protocols are let through (mostly) unfettered. When you're trying to work with a service like AWS, this can make management of Linux- and/or Windows-based resources problematic.

    A decent solution to such a situation is the use of HTML-based remote-connection gateway services. If all you're looking to do is SSH, the GateOne SSH-over-HTTP gateway is a quick and easy-to-set-up solution. If you need to manage instances via graphical desktops - most typically Windows, but some people like it for Linux as well - a better solution is Guacamole.

    Guacamole is an extensible, HTTP-based solution. It runs as a Java servlet under a Unix hosted service like Tomcat. If you're like me, you may also prefer to encapsulate/broker the Tomcat service through a generic HTTP service like Apache or Nginx. My preference has been Apache - but mostly because I've been using Apache since not long after it was formally forked off of the NCSA project. I also tend to favor Apache because it's historically been part of the core repositories of my Linux of choice, Red Hat/CentOS.

    Guacamole gives you HTTP-tunneling options for SSH, Telnet, RDP and VNC. This walk-through is designed to get you quickly running Guacamole as a web-based SSH front-end. Once you've got the SSH component running, adding other management protocols is easy. This procedure is also designed to be doable even if you don't yet actually have the ability to SSH to an AWS-hosted instance.
    1. Start the AWS web console's "launch instance" wizard.
    2. Select an appropriate EL6-based AMI.
    3. Select an appropriate instance type (the free tier instances are suitable for a basic SSH proxy)
    4. On the "Configure Instance" page, expand the "Advanced Details" section.
    5. In the now-available text box, paste in the contents of this script. Note that this script is flexible enough that, if the version of Guacamole hosted via the EPEL project is updated, the script should continue to work. With a slight bit of massaging, the script could also be made to work with EL 7 and associated Tomcat and EPEL-hosted RPMs.
    6. If the AMI you've picked does not provide the option of password-based logins for the default SSH user, add steps (in the "Advanced Details" text box) for creating an interactive SSH user with a password. Ensure that the user also has the ability to use `sudo` to get root privileges.
    7. Finish up the rest of the process for deploying an instance.
    Once the instance finishes deploying, you should be able to point your browser at the public hostname shown for the instance in the AWS console, with "/guacamole/" appended after the hostname. Assuming all went well, you will be presented with a Guacamole login prompt. Enter the credentials set by the pasted-in script (by default, the username "admin" with the password "PASSWORD").
    Note that these credentials can be changed by editing the:
    printf "\t<authorize username=\"admin\" password=\"PASSWORD\">\n"
    line of the pasted-in script. Once you've authenticated to Guacamole, you'll be able to log in to the hosting-instance via SSH using the instance's non-privileged user's credentials. Once logged in, you can escalate privileges and then configure additional authentication mechanisms, connection destinations and protocols.
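    If the instance is already running, the same credential-change can be made in place. A minimal sketch - the file path is an assumption (the EPEL packaging conventionally keeps its basic-auth file at "/etc/guacamole/user-mapping.xml"; verify where the pasted-in script actually wrote it):
    # Swap the default password for something better, then bounce Tomcat
    sed -i 's/password="PASSWORD"/password="N3w-S3cr3t"/' \
       /etc/guacamole/user-mapping.xml
    service tomcat6 restart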

    Note: Guacamole doesn't currently support key-based login mechanisms. If key-based logins are a must, make use of GateOne instead.

    Thursday, February 5, 2015

    Attack of the Clones

    One of the clients I do work for has a fairly significant Linux footprint. However, in these times of greater fiscal responsibility/austerity, my client is looking at even cheaper alternatives. This means that, for systems whose applications don't require Red Hat for warranty-support, CentOS is being subbed into their environment. This is particularly true for their testing environments. There've even been arguments for doing it in production, applications' vendor-support be damned, because "CentOS is the same as Red Hat."

    I've previously argued, "they're very, very similar, but they're not truly identical". In particular, Red Hat handles CVEs and errata somewhat differently than CentOS does (Red Hat backports many fixes to prior EL releases; CentOS's stance is generally "upgrade it").

    Today, I got bit by one place where CentOS hews far too closely to "the same as Red Hat Enterprise Linux". Specifically, I was using the `oscap` security tool to do a security audit of a test system. I should say, "I was struggling to use the `oscap` security tool...". With later versions of EL6, Red Hat - and, as a derivative, CentOS - implements the CPE system for Linux.

    This is all fine and good, except where the tools you use rely on the correctness of CPE-related definitions. By the standard of CPE, Red Hat and CentOS are very much not "the same". Because the security-auditing tool I was using (`oscap`) leverages CPEs, and because the CentOS maintainers simply repackage the Red Hat-furnished security profiles without first updating the CPE call-outs, the security tool fails horribly: every test comes back as "notapplicable".

    To fix this situation, a bit of `sed`-fu is required:
    mv /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml \
       /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml-DIST && \
    cp /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml-DIST \
       /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml && \
    sed -i '{
       s#Red Hat Enterprise Linux 6#CentOS 6#g
       s#cpe:/o:redhat:enterprise_linux:6#cpe:/o:centos:centos:6#g
    }' /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-oval.xml
    
    
    mv /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml \
       /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml-DIST && \
    cp /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml-DIST \
       /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml && \
    sed -i \
       's#cpe:/o:redhat:enterprise_linux#cpe:/o:centos:centos#g' \
    /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml
    

    Once the above is done, running `oscap` actually produces useful results.

    NOTE: Ironically, doing the above edits will cause the various SCAP profiles to flag an error when running the tests that verify that RPMs have been unaltered. I've submitted a bug to the CentOS group so that these fixes get included in future versions of the CentOS OpenSCAP RPMs but, until then, just be aware that the `oscap` tool will flag the above two files.
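    If you want to confirm which package owns the edited content - and preview what those verification tests will complain about - something like the following should work. The package name is an assumption (the SSG content normally comes from EPEL's scap-security-guide package); the first command will tell you the real owner on your system:
    rpm -qf /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml
    # A "5" in the output indicates a digest mismatch on the edited files
    rpm -V scap-security-guide | grep ssg-rhel6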

    ...And if you found this page because you're trying to figure out how to run `oscap` to get results, here's a sample invocation that should act as a starting-point:

    oscap xccdf eval --profile common --report \
       /var/tmp/oscap-report_`date "+%Y%m%d%H%M"`.html \
       --results /var/tmp/oscap-results_`date "+%Y%m%d%H%M"`.xml\
       --cpe /usr/share/xml/scap/ssg/content/ssg-rhel6-cpe-dictionary.xml \
       /usr/share/xml/scap/ssg/content/ssg-rhel6-xccdf.xml
    

    Monday, December 29, 2014

    Custom EL6 AMIs With a Root LVM

    One of the customers I work for is a security-focused organization. As such, they try to follow the security guidelines laid out within the SCAP guidance for the operating systems they deploy. This particular customer is also engaged in a couple of "cloud" initiatives - a couple of privately-hosted options and one publicly-hosted option. For the publicly-hosted cloud initiative, they make use of Amazon Web Services' EC2 service.

    The current SCAP guidelines for Red Hat Enterprise Linux (RHEL) 6 draw the bulk of their content straight from the DISA STIGs for RHEL 6. There are a few differences, here and there, but the commonality between the SCAP and STIG guidance - at least as of SCAP XCCDF 1.1.4 and STIG Version 1, Release 5, respectively - is probably just shy of 100% when measured on the recommended tests and fixes. In turn, automating the guidance in these specifications allows you to quickly crank out predictably-secure Red Hat, CentOS, Scientific Linux or Amazon Linux systems.

    For the privately-hosted cloud initiatives, supporting this guidance was a straight-forward matter. The solutions my customer uses all support the capability to network-boot and provision a virtual machine (VM) from which to create a template. Amazon didn't provide my customer similar functionality, somewhat limiting what could be done to create a fully-customized instance or resulting template (Amazon Machine Image - or "AMI" - in EC2 terminology).

    For the most part, this wasn't a problem for my customer. Perhaps the biggest sticking-point was that it meant that, at least initially, the partitioning schemes used on the privately-hosted VMs couldn't be easily replicated on the EC2 instances.

    Section 2.1.1 of the SCAP guidance calls for "/tmp", "/var", "/var/log", "/var/log/audit", and "/home" to each be on their own, dedicated partitions, separate from the "/" partition. On the privately-hosted cloud solutions, a common, network-based KickStart was used to carve the boot-disk into a "/boot" partition and an LVM volume-group (VG). That VG was then carved up to create the SCAP-mandated partitions.

    The lack of network-booting/provisioning support meant we couldn't extend our KickStart methodologies to the EC2 environment. Further, at least initially, Amazon didn't support the use of LVM on boot disks. The combination of the two limitations meant my customer couldn't easily meet the SCAP partitioning requirements. Lack of LVM meant that the boot disk had to be carved up using bare /dev/sdX devices. Lack of console access defeated the ability to repartition an already-built system to create the requisite partitions on the boot disk. Initially, this meant that the AMIs we could field were limited to "/boot" and "/" partitions. This meant config-drift between the hosting environments and meant we had to get security-waivers for the Amazon-hosted environment.

    Not being one who well-tolerates these kinds of arbitrary-feeling deviances, I got to cracking with my Google searches. Most of what I found were older documents focused on how to create LVM-enabled, S3-backed AMIs. These weren't at all what I wanted - they were a pain in the ass to create, were stoopidly time-consuming to transfer into EC2, and the resultant AMIs hamstrung me on the instance-types I could spawn from them. So, I kept scouring around. In the comments section of one of the references for S3-backed AMIs, I saw a mention of doing a chroot() build. So, I used that as my next branch of Googling about.

    Didn't find a lot for RHEL-based distros - mostly Ubuntu and some others. That said, it gave me the starting point that I needed to find my ultimate solution. Basically, that solution comes down to:

    1. Pick an EL-based AMI from the Amazon Marketplace (I chose a CentOS one - I figured that using an EL-based starting point would ease creating my EL-based AMI since I'd already have all the tools I needed and in package names/formats I was already familiar with)
    2. Launch the smallest instance-size possible from the Marketplace AMI (8GB when I was researching the problem)
    3. Attach an EBS volume to the running instance - I set mine to the minimum size possible (8GB) figuring I could either grow the resultant volumes or, once I got my methods down/automated, use a larger EBS for my custom AMI.
    4. Carve the attached EBS up into two (primary) partitions. I like using `parted` for this, since I can specify the desired, multi-partition layout (and all the offsets, partition types/labels, etc.) in one long command-string (a condensed sketch of these disk-layout and chroot-install steps appears after this list).
      • I kept "/boot" in the 200-400MB range. It could probably be kept smaller, since the plan wasn't so much to patch instantiations as to periodically use automated build tools to launch instances from updated AMIs and re-deploy the applications onto the new/updated instances.
      • I gave the rest of the disk to the partition that would host my root VG.
    5. I `vgcreate`d my root volume group, then carved it up into the SCAP-mandated partitions (minus "/tmp" - we do that as a tmpfs filesystem since the A/V tools that SCAP wants you to have tend to kill system performance if "/tmp" is on disk - probably not relevant in EC2, but consistency across environments was a goal of the exercise)
    6. Create ext4 filesystems on each of my LVs and my "/boot" partition.
    7. Mount all of the filesystems under "/mnt" to support a chroot-able install (i.e., "/mnt/root", "/mnt/root/var", etc.)
    8. Create base device-nodes within my chroot-able install-tree (you'll want/need "/dev/console", "/dev/null", "/dev/zero", "/dev/random", "/dev/urandom", "/dev/tty" and "/dev/ptmx" - modes, ownerships and major/minor numbers should match what's in your live OS's)
    9. Set up bind mounts for "/proc", "/sys", "/dev/pts" and "/dev/shm" within the install-tree.
    10. Create "/etc/fstab" and "/etc/mtab" files within my chroot-able install-tree (should resemble the mount-scheme you want in your final AMI - dropping the "/mnt/root" from the paths)
    11. Use `yum` to install the same package-sets to the chroot that our normal KickStart processes would install.
    12. The `yum` install should have created all of your "/boot" files with the exception of your "grub.conf" type files. 
      • Create a "/mnt/root/boot/grub.conf" file with vmlinuz/initramfs references matching the ones installed by `yum`.
      • Create links to your "grub.conf" file:
        • You should have an "/mnt/root/etc/grub.conf" file that's a sym-link to your "/mnt/root/boot/grub.conf" file (be careful how you create this sym-link so you don't create an invalid link)
        • Similarly, you'll want a "/mnt/root/boot/grub/grub.conf" linked up to "/mnt/root/boot/grub.conf" (not always necessary, but it's a belt-and-suspenders solution to some issues related to creating PVM AMIs)
    13. Create a basic eth0 config file at "/mnt/root/etc/sysconfig/network-scripts/ifcfg-eth0". EC2 instances require the use of DHCP for networking to work properly. A minimal network config file should look something like:
      DEVICE=eth0
      BOOTPROTO=dhcp
      ONBOOT=yes
      IPV6INIT=no
      
    14. Create a basic network-config file at "/mnt/root/etc/sysconfig/network". A minimal network config file should look something like:
      NETWORKING=yes
      NETWORKING_IPV6=no
      HOSTNAME=localhost.localdomain
      
    15. Append "UseDNS no" and "PermitRootLogin without-password" to the end of your "/mnt/root/etc/ssh/sshd_config" file. The former fixes connect-speed problems related to EC2's use of private IPs on their hosted instances. The latter allows you to SSH in as root for the initial login - but only with a valid SSH key (don't want to make newly-launched instances instantly ownable!)
    16. Assuming you want instances started from your AMI to use SELinux:
      • Do a `touch /mnt/root/.autorelabel`
      • Make sure that the "SELINUX" value in "/mnt/root/etc/selinux/config" is set to either "permissive" or "enforcing"
    17. Create an unprivileged login user within the chroot-able install-tree. Make sure a password is set and that the user is able to use `sudo` to access root (since I recommend setting root's password to a random value).
    18. Create a boot init script that will download your AWS public key into the root and/or maintenance user's ${HOME}/.ssh/authorized_keys file. At its most basic, this should be a run-once script that looks like:
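      # Note: ${KEYDESTDIR} is not defined by this snippet - point it at the
      # target user's SSH directory (e.g., KEYDESTDIR=/root/.ssh) before running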
      curl -f http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key > /tmp/pubkey
      install --mode 0700 -d ${KEYDESTDIR}
      install --mode 0600 /tmp/pubkey ${KEYDESTDIR}/authorized_keys
      Because "/tmp" is an ephemeral filesystem, the next time the instance is booted, the "/tmp/pubkey" will self-clean. Note that an appropriate destination-directory will need to exist
    19. Clean up the chroot-able install-tree:
      yum --installroot=/mnt/root/ -y clean packages
      rm -rf /mnt/root/var/cache/yum
      rm -rf /mnt/root/var/lib/yum
      cat /dev/null > /mnt/root/root/.bash_history
      
    20. Unmount all of the chroot-able install-tree's filesystems.
    21. Use `vgchange` to deactivate the root VG
    22. Using the AWS console, create a snapshot of the attached EBS.
    23. Once the snapshot completes, you can then use the AWS console to create an AMI from the EBS-snapshot using the "Create Image" option. It is key that you set the "Root Device Name", "Virtualization Type" and "Kernel ID" parameters to appropriate values.
      • The "Root Device Name" value will auto-populate as "/dev/sda1" - change this to "/dev/sda"
      • The "Virtualization Type" should be set as "Paravirtual".
      • The appropriate value for the "Kernel ID" parameter will vary from AWS availability-region to AWS availability-region (for example, the value for "US-East (N. Virginia)" will be different from the value for "EU (Ireland)"). In the drop-down, look for a description field that contains "pv-grub-hd00". There will be several. Look for the highest-numbered option that matches your AMIs architecture (for example, I would select the kernel with the description "pv-grub-hd00_1.04-x86_64.gz" for my x86_64-based EL 6.x custom AMI).
      The other Parameters can be tweaked, but I usually leave them as is.
    24. Click the "Create" button, then wait for the AMI-creation to finish.
    25. Once the "Create" finishes, the AMI should be listed in your "AMIs" section of the AWS console.
    26. Test the new AMI by launching an instance. If the instance successfully completes its launch checks and you are able to SSH into it, you've successfully created a custom, PVM AMI (HVM AMIs are fairly easily created, as well, but require some slight deviations that I'll cover in another document).
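    As promised in step 4, below is a condensed sketch of the disk-layout and chroot-install portion of the above (roughly steps 4 through 7, plus the `yum` install from step 11). The device name ("/dev/xvdf"), the LV names and sizes, and the package-group are all illustrative assumptions - substitute values that match your own build standards:
      # Partition the attached EBS: a small /boot partition plus a partition
      # to hold the root volume-group
      parted -s /dev/xvdf -- mklabel msdos \
         mkpart primary ext4 1MiB 500MiB set 1 boot on \
         mkpart primary 500MiB 100%
      
      # Create the root VG and the SCAP-mandated logical volumes
      pvcreate /dev/xvdf2
      vgcreate VolGroup00 /dev/xvdf2
      lvcreate -n rootVol  -L 3G   VolGroup00
      lvcreate -n varVol   -L 1G   VolGroup00
      lvcreate -n logVol   -L 1G   VolGroup00
      lvcreate -n auditVol -L 512M VolGroup00
      lvcreate -n homeVol  -L 512M VolGroup00
      
      # Make filesystems, then mount everything under /mnt/root for the chroot install
      mkfs.ext4 -q /dev/xvdf1
      for LV in rootVol varVol logVol auditVol homeVol
      do
         mkfs.ext4 -q /dev/VolGroup00/${LV}
      done
      
      mkdir -p /mnt/root
      mount /dev/VolGroup00/rootVol /mnt/root
      mkdir -p /mnt/root/boot /mnt/root/home /mnt/root/var
      mount /dev/xvdf1 /mnt/root/boot
      mount /dev/VolGroup00/homeVol /mnt/root/home
      mount /dev/VolGroup00/varVol /mnt/root/var
      mkdir -p /mnt/root/var/log
      mount /dev/VolGroup00/logVol /mnt/root/var/log
      mkdir -p /mnt/root/var/log/audit
      mount /dev/VolGroup00/auditVol /mnt/root/var/log/audit
      
      # Install the same package-set the KickStart processes would have laid down
      yum --installroot=/mnt/root -y groupinstall core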
    I've automated much of the above using some simple shell scripts and the Amazon EC2 tools. Use of the EC2 tools is well documented by Amazon. Their use allows me to automate everything within the instance launched from the Marketplace AMI (I keep all my scripts in Git, so prepping a Marketplace AMI for building custom AMIs takes maybe two minutes on top of launching the generic Marketplace AMI). When automated this way, you can go from launching your Marketplace AMI to having a launchable custom AMI in as little as twenty minutes.

    Properly automated, generating updated AMIs as security fixes or other patch bundles come out is as simple as kicking off a script, hitting the vending machine for a fresh Mountain Dew, then coming back to launch new, custom AMIs.

    Wednesday, November 26, 2014

    Converting to EL7: Solving The "Your Favorite Service Isn't Systemd-Enabled" Problem

    Having finally gotten off my butt to knock out getting my RHCE (for EL6) before the 2014-12-19 drop-dead date, I'm finally ready to start focusing on migrating my personal systems to EL7-based distros.

    My personal VPS is currently running CentOS 6.6. I use my VPS to host a couple of personal websites and email for family and a few friends. Yes, I realize that it would probably be easier to offload all of this to providers like Google. However, while Google is very good at SPAM-stomping and provides me a very generous amount of space for archiving emails, one area where they do fall short is email aliases: whenever I have to register with a new web-site, I use a custom email address to do so. At my last pruning, I still had 300+ per-site aliases. So, for me, the number of available aliases ("unlimited" is best) and the ease of creating them trump all other considerations.

    Since I don't have Google handling my mail for me, I have to run my own A/V and anti-spam engines. Being a good Internet Citizen, I also like to make use of Sender Policy Framework (via OpenSPF) and DomainKeys (currently via DKIMproxy).

    I'm only just into the process of sorting out what I need to do to make the transition as quick and as painless (more for my family and friends than for me) a process as possible. I hate outages. And, with a week off for the Thanksgiving holidays, I've got time to do things in a fairly orderly fashion.

    At any rate, one of the things I discovered is that my current DomainKeys solution hasn't been updated to "just work" within the systemd framework used by EL7. This isn't terribly surprising, as it appears that the DKIMproxy SourceForge project may have gone dormant in 2013 (so, I'll have to see if there are alternatives that have the appearance of still being a going concern - in the meantime...). Fortunately, the DKIMproxy source code does come with a `chkconfig`-compatible SysV-init script. Even more fortunately, converting from SysV-init to systemd-compatible service control is a bit more straight-forward than when I was dealing with moving from Solaris 9's legacy init to Solaris 10's SMF.

    If you've already got a `chkconfig`-style init script, moving to systemd-managed services is fairly trivial. Your `chkconfig` script can be copied, pretty much "as is", into "/usr/lib/systemd". My (current) preference is to create a "scripts" subdirectory and put it in there. I haven't read deeply enough into systemd to know whether this is the "Best Practices" method, however. Also, where I work has no established conventions - they only started migrating to EL6 in the fall of 2013 - so I can't exactly crib anything EL7-related from how we do it at work.
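    For example, getting the DKIMproxy init script into place might look like the following. The source path and filename are placeholders - use wherever the SysV-init script shipped with the DKIMproxy source actually landed:
    # ${DKIMPROXY_SRC}/dkimproxy.init is a placeholder, not the real path
    install -d -m 0755 /usr/lib/systemd/scripts
    install -m 0755 ${DKIMPROXY_SRC}/dkimproxy.init /usr/lib/systemd/scripts/dkim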

    Once you have your SysV-init style script placed where it's going to live (e.g., "/usr/lib/systemd/scripts"), you need to create associated service-definition files. In my particular case, I had to create two, as the DKIMproxy software actually has an inbound and an outbound function. Launched from normal SysV-init, it all gets handled as one piece. However, one of the nice things about systemd is that it's not only a launcher framework, it's a service-monitoring framework as well. To take full advantage, I wanted one monitor for the inbound service and one for the outbound service. The legacy init script that DKIMproxy ships with makes this easy enough as, in addition to the normal "[start|stop|restart|status]" arguments, it has per-direction subcommands (e.g., "start-in" and "stop-out"). The service-definition for my "dkim-in.service" looks like:
    [Unit]
    Description=Manage the inbound DKIM service
    After=postfix.service
    
    [Service]
    Type=forking
    PIDFile=/usr/local/dkimproxy/var/run/dkimproxy_in.pid
    ExecStart=/usr/lib/systemd/scripts/dkim start-in
    ExecStop=/usr/lib/systemd/scripts/dkim stop-in
    
    [Install]
    WantedBy=multi-user.target
    
    
    To break down the above:

    • The "Unit" stanza tells systemd a bit about your new service:
      • The "Description" line is just ASCII text that allows you to provide a short, meaningful of what the service does. You can see your service's description field by typing `systemctl -p Description show <SERVICENAME>`
      • The "After" parameter is a space-separated list of other services that you want to have successfully started before systemd attempts to start your new service. In my case, since DKIMproxy is an extension to Postfix, it doesn't make sense to try to have DKIMproxy running until/unless Postfix is running.
    • The "Service" stanza is where you really define how your service should be managed. This is where you tell systemd how to start, stop, or reload your service and what PID it should look for so it knows that the service is still notionally running. The following parameters are the minimum ones you'd need to get your service working. Other parameters are available to provide additional functionality:
      • The "Type" parameter tells systemd what type of service it's managing. Valid types are: simpleforking, oneshot, dbus, notify or idle. The systemd.service man page more-fully defines what each option is best used for. However, for a traditional daemonized service, you're most likely to want "forking".
      • The "PIDFile" parameter tells systemd where to find a file containing the parent PID for your service. It will then use this to do a basic check to monitor whether your service is still running (note that this only checks for presence, not actual functionality).
      • The "ExecStart" parameter tells systemd how to start your service. In the case of a SysV-init script, you point it to the fully-qualified path you installed your script to and then any arguments necessary to make that script act as a service-starter. If you don't have a single, chkconfig-style script that handles both stop and start functions, you'd simply give the path to whatever starts your service. Notice that there are no quotations surrounding the parameter's value-section. If you put quotes - in the mistaken belief that the starter-command and it's argument need to be grouped, you'll get a path error when you go to start your service the first time.
      • The "ExecStop" parameter tells systemd how to stop your service. As with the "ExecStart" parameter, if you're leveraging a fully-featured SysV-init script, you point it to the fully-qualified path you installed your script to and then any arguments necessary to make that script act as a service-stopper. Also, the same rules about white-space and quotation-marks apply to the "ExecStop" parameter as do the "ExecStart" parameter.
    • The "Install" stanza is where you tell systemd the main part of the service dependency-tree to put your service. You have two main dependency-specifiers to choose: "WantedBy" and "RequiredBy". The former is a soft-dependency while the latter is a hard-dependency. If you use the "RequiredBy" parameter, then the service unit-group (e.g., "mult-user.target") enumerated with the "RequiredBy" parameter will only be considered to have successfully onlined if the defined service has successfully launched and stayed running.  If you use the "WantedBy" parameter, then the service unit-group (e.g., "mult-user.target") enumerated with the "WantedBy" parameter will still be considered to have successfully onlined whether the defined service has successfully launched or stayed running. It's most likely you'll want to use "WantedBy" rather than "RequiredBy" as you typically won't want systemd to back off the entire unit-group just because your service failed to start or stay running (e.g., you don't want to stop all of the multi-user mode related processes just because one network service has failed.)