Monday, January 8, 2018

The Joys of SCL (or, "So You Want to Host RedMine on Red Hat/CentOS")

Recently, I had to go back and try to re-engineer my company's RedMine deployment. Previously, I'd rushed through a quick stand-up of the service using the generic CentOS AMI furnished by CentOS.Org. All of the supporting automation was very slap-dash and had more things hard-coded in than I really would have liked. However, the original time-scale it had to be delivered under meant that "stupid but works ain't stupid" ruled the day.

That ad hoc implementation was over two years ago, at this point. Never did get a chance to revisit it till recently (hooray for everyone being gone over the holiday weeks!). At any rate, there were a couple things I wanted to accomplish with the re-build:

  1. Change from using the generic CentOS AMI to using our customized AMI
    • Our AMI has a specific partitioning-scheme for our OS disks that we prefer to be used across all of our services. Wanted to make our RedMine deployment no longer be an "outlier".
    • Our AMIs are updated monthly, whereas the CentOS.Org ones are typically updated when there's a new 7.X. Using our AMIs means not having to wait nearly as long for launch-time patch-up to complete before things can become "live".
  2. I wanted to make use of our standard, provisioning-time OS-hardening framework - both so that the instance is better configured to be Internet-facing and so it is more in accordance with our other service-hosts
  3. I wanted to be able to use a vendor-packaged Ruby rather than the self-packaged/self-managed Ruby. The vendor's default Ruby is too old for RedMine to use and SCL wasn't really that well established a "thing" back in 2015. Figure that a vendor-packaged RPM is more likely to be kept up to date without potentially introducing breakage that self-packaged might.
  4. Wanted to switch from using UserData for launch automation to using CloudFormation (CFn). 
    • Relatively speaking, CFn automation includes a lot more ability to catch errors/deployment-breakage with lower-effort: less "I thought it deployed correctly" happenstances.
    • Using CFn automation exposes less information via instance-metadata - whether viewed through the management UIs/CLIs or read from within the instance itself
    • Using CFn automation makes it slightly easier to coordinate the provisioning of non-EC2 solution components (e.g., security groups, IAM roles, S3 buckets, EFS shares, RDS databases, load-balancers etc.)
  5. I wanted to clean up the overall automation-scripts so I could push more stuff into (public) GitHub (projects) and have to pull less crap from S3 buckets.
Ultimately, the biggest challenge turned out to be dealing with SCL. The problems can probably be chalked up to SCL's focus: the collections seem to be designed more for use by developers in interactive environments than to underpin services. If you're a developer, all you really need to do to have a "more up-to-date than vendor-standard" language package is:
  • Enable the SCL repos
  • Install the SCL-hosted language package you're looking for
  • Update either your user's shell-inits or the system-wide shell inits
  • Log-out and back in
  • Go to town with your new SCL-furnished capabilities.
If, on the other hand, you want to use an SCL-managed language to back a service that can't run with the standard vendor-packaged languages (etc.), things are a little more involved.

A little bit of background:

The SCL-managed utilities are typically rooted wholly in "/opt/rh". This means binaries, libraries, documentation. Because of this, your shell's PATH, MANPATH and linker-paths will need to be updated. Fortunately, each SCL-managed package ships with an `enable` utility. You can do something as simple as `scl enable <SCL_PACKAGE_NAME> bash`, do the equivalent in your user's "${HOME}/.profile" and/or "${HOME}/.bash_profile" files, or you can do something by way of a file in "/etc/profile.d/". Since I assume "everyone on the system wants to use it", I enable by way of a file in "/etc/profile.d/". Typically, I create a file named "enable_<PACKAGE>.sh" with contents similar to:
source /opt/rh/<RPM_NAME>/enable
export X_SCLS="$(scl enable <RPM_NAME> 'echo $X_SCLS')"
Within it. Each time an interactive-user logs in, they get the SCL-managed package configured for their shell. If multiple SCL-managed packages are installed, they all "stack" nicely (failure to do the above can result in one SCL-managed package interfering with another).


Unfortunately, while this works great for interactive shell users, it's not at all sufficient for non-interactive shells. This includes actual running services or stuff that's provisioned at launch-time via scripts kicked off by way of `systemd` (or `init`). Such processes do not use the contents of the "/etc/profile.d" directory (or other user-level initialization scripts). So: "what to do".

Linux affords a couple methods for handling this:

  • You can modify the system-wide PATH (etc.). To make the run-time linker find libraries in non-standard paths, you can create files in "/etc/ld.so.conf.d" that enumerate the desired linker search-paths and then run `ldconfig` (the vendor-packaging of MariaDB and a few others does this). However, this is generally considered not to be "good form" and is frowned upon.
  • You can modify the scripts invoked via systemd/init; however, if those scripts are RPM-managed, such modifications can be lost when the RPM gets updated via a system patch-up
  • Under systemd, you can create customized service-definitions that contain the necessary environmentals. Typically, this can be quickly done by copying the service-definition found in the "/usr/lib/systemd/system" directory to the "/etc/systemd/system" directory and modifying as necessary.
  • Some applications (such as Apache) allow you to extend library and other search directives via run-time configuration directives. In the case of RedMine, Apache calls the Passenger framework which, in turn, runs the Ruby-based RedMine code. To make Apache happy, you create an "/etc/httpd/conf.d/Passenger.conf" that looks something like:
    LoadModule passenger_module /opt/rh/rh-ruby24/root/usr/local/share/gems/gems/passenger-5.1.12/buildout/apache2/mod_passenger.so
    <IfModule mod_passenger.c>
            PassengerRoot /opt/rh/rh-ruby24/root/usr/local/share/gems/gems/passenger-5.1.12
            PassengerDefaultRuby /opt/rh/rh-ruby24/root/usr/bin/ruby
            SetEnv LD_LIBRARY_PATH /opt/rh/rh-ruby24/root/usr/lib64
    </IfModule>
    The directives:
    • "LoadModule" directive binds the module-name to the fully-qualified path to the Passenger Apache-plugin (shared-library). This file is created by the Passenger installer.
    • "PassengerRoot" tells Apache where to look for further, related gems and libraries.
    • "PassengerDefaultRuby" tells Apache which Ruby to use (useful if there's more than one Ruby version installed or, as in the case of SCL-managed packages, Ruby is not found in the standard system search paths).
    • "SetEnv LD_LIBRARY_PATH" tells Apache's run-time linker where else to look for shared library files.
    Without these, Apache won't find RedMine or all of the associated (SCL-managed) Ruby dependencies.
Because this last method (notionally) has the narrowest scope (area of effect, blast-radius, etc.), affecting only where the desired application looks for non-standard components, it is the generally preferred method. The above list runs in ascending order of preference ("worst-to-first").
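As an aside, for the systemd route mentioned above, the copied unit's [Service] section would grow environment directives along these lines (the service name, ExecStart and package paths here are hypothetical, purely for illustration):

```ini
# /etc/systemd/system/myservice.service -- copied from /usr/lib/systemd/system
# (hypothetical service backed by the SCL-managed rh-ruby24)
[Service]
Environment=PATH=/opt/rh/rh-ruby24/root/usr/bin:/usr/local/bin:/usr/bin:/bin
Environment=LD_LIBRARY_PATH=/opt/rh/rh-ruby24/root/usr/lib64
ExecStart=/opt/rh/rh-ruby24/root/usr/bin/ruby /path/to/service.rb
```

After creating or editing such a file, run `systemctl daemon-reload` so systemd re-reads the definition.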
Note On Provisioning Automation Tools:

Because automation tools like CloudFormation (and cloud-init) are not interactive processes, they too will not process the contents of the "/etc/profile.d/" directory. One can use one of the above enumerated methods for modifying the invocation of these tools or include explicit sourcing of the relevant "/etc/profile.d/" files within the scripts executed by these frameworks.
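A sketch of that explicit-sourcing approach (the rh-ruby24 paths are illustrative, and a stand-in enable-file is generated in /tmp so the example is self-contained):

```shell
# Non-interactive shells (cloud-init, cfn-init scripts) skip /etc/profile.d,
# so the script must pull in the SCL environment itself. Simulate the
# enable-file a package like rh-ruby24 would provide:
cat > /tmp/enable_rh-ruby24.sh <<'EOF'
export PATH="/opt/rh/rh-ruby24/root/usr/bin:${PATH}"
export LD_LIBRARY_PATH="/opt/rh/rh-ruby24/root/usr/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
EOF

# Explicitly source it (on a real host: . /etc/profile.d/enable_rh-ruby24.sh)
. /tmp/enable_rh-ruby24.sh

# The SCL-managed binaries now lead the search path
echo "${PATH}" | grep -q '/opt/rh/rh-ruby24/root/usr/bin' && echo "SCL environment active"
```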

Tuesday, December 5, 2017

I've Used How Much Space??

A customer of mine needed me to help them implement a full CI/CD tool-chain in AWS. As part of that implementation, they wanted to have daily backups. Small problem: the enterprise backup software that their organization normally uses isn't available in their AWS-hosted development account/environment. That environment is mostly "support it yourself".

Fortunately, AWS has a number of tools that can help with things like backup tasks. The customer didn't have strong specifications on how they wanted things backed up, retention-periods, etc. Just "we need daily backups". So I threw them together some basic "pump it into S3" type of jobs with the caveat "you'll want to keep an eye on this because, right now, there's no data lifecycle elements in place".

For the first several months things ran fine. Then, as they often do, problems began popping up. Their backup jobs started experiencing periodic errors. Wasn't able to find underlying causes. However, in my searching around, it occurred to me "wonder if these guys have been aging stuff off like I warned them they'd probably want to do."

AWS provides a nifty GUI option in the S3 console that will show you storage utilization. A quick look in their S3 backup buckets told me, "doesn't look like they have".

Not being overly much of a GUI-jockey, I wanted something I could run from the CLI that could be fed to an out-of-band notifier. The AWS CLI offers the `s3api` tool-set that comes in handy for such actions. On my first dig-through (and with some Googling), I sorted out how to get a total-utilization view for a bucket. It looks something like:

aws s3api list-objects --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --output json --query "[sum(Contents[].Size), length(Contents[])]" | \
    awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
1671.5 GB

The above agreed with the GUI and was more space than I'd assumed they'd be using at this point. So, I wanted to see "can I clean up".

aws s3api list-objects --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --prefix Backups/YYYYMMDD/ --output json \
    --query "[sum(Contents[].Size), length(Contents[])]" | \
    awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
198.397 GB
That "one day's worth of backups" was also more than expected. Last time I'd censused their backups (earlier in the summer), they had maybe 40GiB worth of data. They wanted a week's worth of backups. However, at 200GiB/day worth of backups, I could see that I really wasn't going to be able to trim the utilization. Also meant that maybe they were keeping on top of aging things off.
Note: yes, S3 has lifecycle policies that allow you to automate moving things to lower-cost tiers. Unfortunately, the auto-tiering (at least from regular S3 to S3-IA) has a minimum age of 30 days. Not helpful, here.
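For what it's worth, the sort of lifecycle rule alluded to is just a small JSON document fed to `aws s3api put-bucket-lifecycle-configuration`. A sketch (the prefix and day-counts below are invented for illustration):

```json
{
  "Rules": [
    {
      "ID": "TierThenExpireBackups",
      "Filter": { "Prefix": "Backups/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 60 }
    }
  ]
}
```

Note the 30-day transition floor mentioned above: a rule with a smaller Days value for STANDARD_IA won't do what you'd hope.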
Saving grace: at least I snuffled up a way to get metrics without the web GUI. As a side effect, it also meant I had a way to see that the amount that reaches S3 matches the amount being exported from their CI/CD applications.
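As an aside, the awk stanza in those pipelines simply converts the second line of the JSON response (the byte-sum) into GB. It can be sanity-checked locally by feeding it a fabricated response (2GiB and a made-up object-count, below):

```shell
# Fake s3api-style response: line 2 is the byte-sum (2GiB), line 3 the count
printf '[\n2147483648,\n1083143\n]\n' | \
  awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
```

The second output line comes back as "2 GB", confirming the unit-math.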

Wednesday, September 6, 2017

Order Matters

For one of the projects I'm working on, I needed to automate the deployment of JNLP-based Jenkins build-agents onto cloud-deployed (AWS/EC2) EL7-based hosts.

When we'd first started the project, we were using the Jenkins EC2 plugins to do demand-based deployment of EC2 spot-instances to host the Jenkins agent-software. It worked decently and kept deployment costs low. For better or worse, our client preferred using fixed/reserved instances and JNLP-based agent setup vice master-orchestrated SSH-based setup. The requested method likely makes sense for if/when they decide there's a need for Windows-based build agents — as all will be configured equivalently (via JNLP). It also eliminates the need to worry about SSH key-management.

At any rate, we made the conversion. It was initially trivial ...until we implemented our cost-savings routines. Our customer's developers don't work 24/7/365 (and neither do we). It didn't make sense to have agents just sitting around idle racking up EC2 and EBS time. So, we placed everything under a home-brewed tool to manage scheduled shutdowns and startups.

The next business-day, the problems started. The Jenkins master came online just fine. All of the agents also started back up ...however, the Jenkins master was declaring them dead. As part of the debugging process, we would log into each of the agent-hosts but would find that there was a JNLP process in the PID-list. Initially we assumed the declared-dead problem was just a simple timing issue. Our first step was to try rescheduling the agents' start to happen a half hour before the master's. When that failed, we then tried setting the master's start a half hour before the agents'. No soup.

Interestingly, doing a quick `systemctl restart jenkins-agent` on the agent hosts would make them pretty much immediately go online in the Master's node-manager. So, we started to suspect that something between the master and agent-nodes was causing issues.

Because our agents talk to the master through an ELB, the issues led us to suspect the ELB was proving problematic. After all, even though the Jenkins master starts up just fine, the ELB doesn't start passing traffic for a minute or so after the master service starts accepting direct connections. We suspected that perhaps there was enough lag in the ELB going fully active that the master was declaring the agents dead - as each agent-host was showing a ping-timeout error in the master console.

Upon logging back into the agents, we noticed that, while the java process was running, it wasn't actually logging. For background, when using default logging, the Jenkins agent creates ${JENKINS_WORK_DIR}/logs/remoting.log.N files. The actively logged-to file is the ".0" file - which you can verify with other tools ...or just notice that there's a remoting.log.0.lck file. Interestingly, the remoting.log.0 file was blank, whereas all the other, closed files showed connection-actions between the agent and master.

So, started looking at the systemd file we'd originally authored:

[Unit]
Description=Jenkins Agent Daemon

[Service]
ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret <secret> -workDir <workdir>


Initially, I tried playing with the RestartSec parameter. Made no difference in behavior ...probably because the java process never really exits to be restarted. So, did some further digging about - particularly with an eye towards systemd dependencies I might have failed to track.

When you're making the transition from EL6 to EL7, one of the things that's easy to forget is that systemd is not a sequential init-system. Turned out that, while I'd told the service, "don't start till the network is going", I hadn't told it "...but make damned sure that the NICs are actually talking to the network." That was the key, and was accomplished by adding:

Wants=network-online.target
After=network-online.target

to the service definition files' [Unit] sections. Some documents indicated to set the Wants= to network.target. Others indicated that there were some rare cases where network.target may be satisfied before the NIC is actually fully ready to go - and that network-online.target was therefore the better dependency to set. In practice, the time between when network.target and network-online.target come online is typically in the millisecond range - so, "take your pick". I was tired of fighting things, so, opted for network-online.target just to be conservative/"done with things".

After making the above change to one of my three nodes, that node immediately began to reliably online after full service-downings, agent-reboots or even total re-deployments of the agent-host. The change has since been propagated to the other agent nodes and the "run only during work-hours" service-profile has been tested "good".
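Putting the whole thing together, the corrected unit looked something like this (secret and workdir values elided; the Restart lines and [Install] stanza are illustrative additions, not verbatim from the deployment):

```ini
[Unit]
Description=Jenkins Agent Daemon
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret <secret> -workDir <workdir>
Restart=on-failure
RestartSec=15

[Install]
WantedBy=multi-user.target
```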

Wednesday, May 24, 2017

Barely There

So, this morning I get an IM from one of my teammates asking, "you don't happen to have your own copy of <GitHub_Hosted_Project>, do you? I had kind of an 'oops' moment a few minutes ago." Unfortunately, because my A/V had had its own "oops" moment two days prior, all of my local project copies had gone *poof!*, as well.

Fortunately, I'd been making a habit of configuring our RedMine to do bare-syncs of all of our projects. So, I replied back, "hold on, I'll see what I can recover from RedMine." I update the security rules for our RedMine VM and then SSH in. I escalate my privileges and navigate to where RedMine's configured to create repository copies. I look at the repository copy and remember, "shit, this only does a bare copy. None of the actual files I need is here."

So, I consult the Googles to see if there's any articles on "how to recreate a full git repository from a bare copy" (and permutations thereof). Pretty much all the results are around "how to create a bare repository" ...not really helpful.

I go to my RedMine project's repository page and notice the "download" link when I click on one of the constituent files. I click on it to see whether I actually get the file's contents or if the download link is just a (now broken) referral to the file on GitHub. Lo and behold, the file downloads. It's a small project, so, absolute worst case, I can download all of the individual files and manually recreate the GitHub project and only lose my project-history.

That said, the fact that I'm able to download the files tells me, "RedMine has a copy of those files somewhere." So I think to myself, "mebbe I need to try another search: clearly RedMine is allowing me to check files out of this 'bare' repository, so perhaps there's something similar I can do more directly." I return to Google and punch in "git bare repository checkout". Voilà. A couple useful-looking hits. I scan through a couple of them and find that I can create a full clone from a bare repo. All I have to do is go into my (RedMine's) filesystem, copy my bare repository to a safe location (just in case) and then clone from it:

# find <BARE_REPOSITORY> -print | cpio -vpmd /tmp
# cd /tmp
# git clone <BARE_REPOSITORY> <PROJECT_DIR>
# find <PROJECT_DIR> -print

That final find shows me that I now have a full repository (there's now a fully populated .git subdirectory in it). I chown the directory to my SSH user-account, then exit my sudo session (I'd ssh'ed in with key-forwarding enabled).
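For anyone wanting to convince themselves of the behavior, the pattern can be exercised end-to-end with a throwaway repository (all scratch paths below, nothing from the RedMine host):

```shell
set -e
WORK="$(mktemp -d)"

# Stand-in for the original project (one empty commit so HEAD is valid)
git init -q "${WORK}/original"
git -C "${WORK}/original" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'

# Stand-in for the bare copy RedMine maintains
git clone -q --bare "${WORK}/original" "${WORK}/backup.git"

# A plain clone of the bare repo yields a full working repository
git clone -q "${WORK}/backup.git" "${WORK}/restored"
test -d "${WORK}/restored/.git" && echo "full repository recovered"
```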

I go to GitHub and (re)create the nuked project, then, configure the on-disk copy of my files and git metadata to be able to push everything back to GitHub. I execute my git push and all looks good from the ssh session. I hop back to GitHub and there is all my code and all of my commit-history and other metadata. Huzzah!

I finish out by going back and setting up branch-protection and my push-rules and CI-test integrations. Life is good again.

Wednesday, May 17, 2017

The Savings Are In The POST

Been playing around with Artifactory for a client looking to implement a full CI/CD toolchain. My customer has an interesting budgeting method: they're better able to afford sunk costs than recurring costs. So, they splurged for the Enterprise Edition pricing but asked me to try to deploy it in a "cost-aware" fashion.

Two nice things that the Enterprise Edition of Artifactory gives you: the ability to store artifacts directly to lower-cost "cloud" storage tiers and the ability to cluster. Initially, I wasn't going to bother with the latter: while the customer is concerned about reliability, the "cost-aware" method for running things means that design-resiliency is more critical than absolute availability/uptime. Having two or more nodes running didn't initially make sense, so I set the clustering component aside, and explored other avenues for resiliency.

The first phase of resiliency was externalizing stuff that was easily persisted.

Artifactory keeps much of its configuration and tracking information in a database. We're deploying the toolchain into AWS, so, offloading the management overhead of an external database to RDS was pretty much a no-brainer.

When you have the Enterprise Edition entitlements, Artifactory lets you externalize the storage of artifacts to cloud-storage. For a cost-aware deployment, storing gigabytes of data in S3 is much more economical than storing in an EBS volume. Storing it in S3 also means that the data has a high-degree of availability and protection right out of the box. Artifactory also makes it fairly trivial to set up storage tiering. This meant I was able to configure the application to stage recently-written or fetched data in either an SSD-backed EBS volume or leverage instance storage (fast, ephemeral storage). I could then let the tiering move data to S3 either as the local filesystem became full or the recently-written or fetched data aged.

With some of the other stuff I automated, once you had configuration and object data externalized, resiliency was fairly easy to accommodate. You could kill the running instance (or let a faulty one die), spin up a new instance, automate the install of the binaries and the sucking-down of any stub-data needed to get the replacement host knitted back to the external data-services. I assumed that Artifactory was going to be the same case.

For better or worse, Artifactory's relationship with its database is a bit more complicated than some of the other CI/CD components'. Specifically, just giving Artifactory the host:port and credentials for the database doesn't get your new instance talking to the database. No. It wants all that and it wants some authentication tokens and certificates to more-fully secure the Artifactory/database communications.

While backing these tokens and certificates up and accounting for them in the provisioning-time stub-data pull-down is a potential approach, when you have Enterprise Edition, you're essentially reinventing the wheel. You can, instead, take advantage of EE's clustering capabilities ...and run the Artifactory service as a single-node cluster. Doing so requires generating a configuration bundle-file via an API call and then putting that in your stub-data set. The first time you fire up a new instantiation of Artifactory, if that bundle-file is found in the application's configuration-root, it will "magic" the rest (automatically extract the bundle contents and update the factory config files to enable communication with the externalized services).

As of the writing of this post, the documentation for the HA setup is a bit wanting. They tell you "use this RESTful API call to generate the bundle file". Unfortunately, they don't go into great depth on how to convince Artifactory to generate the bundle. Yeah, there's other links in the documents for how to use the API calls to store data in Artifactory, but no "for this use-case, execute this". Ultimately, what it ended up being was:
# curl -u <ADMIN_USER> -X POST http://localhost:8081/artifactory/api/system/bootstrap_bundle
If you got an error or no response, you need to look in the Artifactory access logs for clues.

If your call was successful, you'd get a (JSON) return similar to:
Enter host password for user '<ADMIN_USER>':
{
    "file" : "/var/opt/jfrog/artifactory/etc/bootstrap.bundle.tar.gz"
}
The reason I gave the article the title I did was, because, previous to this exercise, I'd not had to muck with explicitly setting the method for passing the API calls to the command-endpoint. It's that "-X POST" bit that's critical. I wouldn't have known to use the documentation's search-function for the call-method had I not looked at the access logs and seen that Artifactory didn't like me using curl's default, GET-based method.

At any rate, once you have that bootstrap-bundle generated and stored with your other stub-data, all that's left is automating the instantiation-time placement of the file. After a generic install, when those files are present, the newly-launched Artifactory instance joins itself back to the single-node cluster and all your externalized data becomes available. With all that taken care of, running "cost-aware" means that:

  • When you reach the close of service hours, you terminate your Artifactory host.
  • Just before your service hours start, you instantiate a new Artifactory host.
If you offer your service Mon-Fri from 06:00 to 20:00, you've saved yourself 98 hours of EC2 charges per calendar-week (the instance runs only 70 of the week's 168 hours). If you're really aggressive about running cost-aware, you can apply similar "business hours" logic to your RDS (though, given the size of the RDS, the costs to create the automation may be more than you'll save in a year's running of the RDS instance).

Wednesday, November 23, 2016

Manually Mirroring Git Repos

If you're like me, you've had occasions where you need to replicate Git-managed projects from one Git service to another. If you're not like me, you use paid services that make such mundane tasks a matter of clicking a few buttons in a GUI. If, however, you need to copy projects from one repository-service to another and no one has paid to make a GUI button/config-page available to you, then you need to find other methods to get things done.

The following assumes that you have a git project hosted in one repository service (e.g., GitHub) that you wish to mirror to another repository service (e.g., BitBucket, AWS CodeCommit, etc.). The basic workflow looks like the following:

Procedure Outline:

  1. Login to a git-enabled host
  2. Create a copy of your "source-of-truth" repository, depositing its contents to a staging-directory:
    git clone --mirror <SOURCE_REPOSITORY_URL> stage
  3. Navigate into the staging-directory:
    cd stage
  4. Set the push-destination to the copy-repository:
    git remote set-url --push origin <DESTINATION_REPOSITORY_URL>
  5. Ensure the staging-directory's data is still up to date:
    git fetch -p origin
  6. Push the copied source-repository's data to the copy-repository:
    git push --mirror

Procedure Example:

Using an example configuration (the AMIgen6 project):
$ git clone --mirror <SOURCE_REPOSITORY_URL> stage && \
  cd stage && \
  git remote set-url --push origin <DESTINATION_REPOSITORY_URL> && \
  git fetch -p origin && \
  git push --mirror
Cloning into bare repository 'stage'...
remote: Counting objects: 789, done.
remote: Total 789 (delta 0), reused 0 (delta 0), pack-reused 789
Receiving objects: 100% (789/789), 83.72 MiB | 979.00 KiB/s, done.
Resolving deltas: 100% (409/409), done.
Checking connectivity... done.
Counting objects: 789, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (369/369), done.
Writing objects: 100% (789/789), 83.72 MiB | 693.00 KiB/s, done.
Total 789 (delta 409), reused 789 (delta 409)
 * [new branch]      ExtraRPMs -> ExtraRPMs
 * [new branch]      SELuser-fix -> SELuser-fix
 * [new branch]      master -> master
 * [new branch]      refs/pull/38/head -> refs/pull/38/head
 * [new branch]      refs/pull/39/head -> refs/pull/39/head
 * [new branch]      refs/pull/40/head -> refs/pull/40/head
 * [new branch]      refs/pull/41/head -> refs/pull/41/head
 * [new branch]      refs/pull/42/head -> refs/pull/42/head
 * [new branch]      refs/pull/43/head -> refs/pull/43/head
 * [new branch]      refs/pull/44/head -> refs/pull/44/head
 * [new branch]      refs/pull/52/head -> refs/pull/52/head
 * [new branch]      refs/pull/53/head -> refs/pull/53/head
 * [new branch]      refs/pull/54/head -> refs/pull/54/head
 * [new branch]      refs/pull/55/head -> refs/pull/55/head
 * [new branch]      refs/pull/56/head -> refs/pull/56/head
 * [new branch]      refs/pull/57/head -> refs/pull/57/head
 * [new branch]      refs/pull/62/head -> refs/pull/62/head
 * [new branch]      refs/pull/64/head -> refs/pull/64/head
 * [new branch]      refs/pull/65/head -> refs/pull/65/head
 * [new branch]      refs/pull/66/head -> refs/pull/66/head
 * [new branch]      refs/pull/68/head -> refs/pull/68/head
 * [new branch]      refs/pull/71/head -> refs/pull/71/head
 * [new branch]      refs/pull/73/head -> refs/pull/73/head
 * [new branch]      refs/pull/76/head -> refs/pull/76/head
 * [new branch]      refs/pull/77/head -> refs/pull/77/head

Updating (and Automating)

To keep your copy-repository's project in sync with your source-repository's project, periodically do:
cd stage && \
  git fetch -p origin && \
  git push --mirror
This can be accomplished by logging into a host and executing the steps manually or placing them into a cron job.
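Dropped into a crontab, that might look like the following (the staging path and four-hour schedule are placeholders; pick whatever cadence suits you):

```
# Re-sync the mirror every four hours
0 */4 * * * cd /var/mirrors/stage && git fetch -p origin && git push --mirror
```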

Thursday, November 3, 2016

Automation in a Password-only UNIX Environment

Occasionally, my organization has need to run ad hoc queries against a large number of Linux systems. Usually, this is a great use-case for an enterprise CM tool like HPSA, Satellite, etc. Unfortunately, the organization I consult to is between solutions (their legacy tool burned down and their replacement has yet to reach a working state). The upshot is that, one needs to do things a bit more on-the-fly.

My preferred method for accessing systems is using some kind of token-based authentication framework. When hosts are AD-integrated, you can often use Kerberos for this. Failing that, you can sub in key-based logins (if all of your targets have your key as an authorized key). While my customer's systems are AD-integrated, their security controls preclude the use of both AD/Kerberos's single-signon capabilities and the use of SSH key-based logins (and, to be honest, almost none of the several hundred targets I needed to query had my key configured as an authorized key).

Because (tunneled) password-based logins are forced, I was initially looking at the prospect of having to write an expect script to avoid having to type in my password several hundred times. Fortunately, there's an alternative to this in the tool "sshpass".

sshpass lets you supply your password with a number of methods: command-line argument, a password-containing file, an environment variable value or even a read from STDIN. I'm not a fan of text files containing passwords (they've a bad tendency to be forgotten and left on a system - bad juju). I'm not particularly a fan of command-line arguments, either - especially on a multi-user system where others might see your password if they `ps` at the wrong time (which increases in probability as the length of time your job runs goes up). The STDIN method is marginally less awful than the command-arg method (for similar reasons). At least with an environment variable, the value really only sits in memory (especially if you've set your HISTFILE location to someplace non-persistent).

The particular audit I was doing was an attempt to determine the provenance of a few hundred VMs. Over time, the organization has used templates authored by several groups - and different personnel within one of the groups. I needed to scan all of the systems to see which template they might have been using since the template information had been deleted from the hosting environment. Thus, I needed to run an SSH-encapsulated command to find the hallmarks of each template. Ultimately, what I ended up doing was:

  1. Pushed my query-account's password into the environment variable used by sshpass, "SSHPASS"
  2. Generated a file containing the IPs of all the VMs in question.
  3. Set up a for loop to iterate that list
  4. Looped `sshpass -e ssh -o PubkeyAuthentication=no -o StrictHostKeyChecking=no <USER>@${HOST} <AUDIT_COMMAND_SEQUENCE> 2>&1`
  5. Jam STDOUT through a sed filter to strip off the crap I wasn't interested in and put CSV-appropriate delimiters into each queried host's string.
  6. Capture the lot to a text file
The "PubkeyAuthentication=no" option was required because I pretty much always have SSH-agent (or agent-forwarding) enabled. This causes my key to be passed. With the targets' security settings, this causes the connection to be aborted unless I explicitly suppress the passing of my agent-key.

The "StrictHostKeyChecking=no" option was required because I'd never logged into these hosts before. Our standard SSH client config is to require confirmation before accepting the remote key (and shoving it into ${HOME}/.ssh/known_hosts). Without this option, I'd be required to confirm acceptance of each key ...which is just about as automation-breaking as having to retype your password hundreds of times is.
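Step 5's sed filtering boils down to stripping label-cruft and prepending a delimiter-friendly host identifier. A runnable sketch with invented sample data (the label text and IP are made up, not from the actual audit):

```shell
# Simulate one host's raw audit output and massage it into a CSV row:
# strip the label, then prepend the queried host's IP and a comma
echo "template-id: rhel6-gold-2014" | \
  sed -e 's/^template-id: //' -e 's/^/192.0.2.10,/'
```

This prints `192.0.2.10,rhel6-gold-2014` - one ready-for-Excel row per host.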

Once the above was done, I had a nice CSV that could be read into Excel and a spreadsheet turned over to the person asking "who built/owns these systems". 

This method also meant that for the hosts that refused the audit credentials, it was easy enough to report "...and this lot aren't properly configured to work with AD".

Monday, October 17, 2016

Update to EL6 and the Pain of 10Gbps Networking in AWS

In a previous post, EL6 and the Pain of 10Gbps Networking in AWS, I discussed how to enable the use of third-party ixgbevf drivers to support 10Gbps networking in AWS-hosted RHEL 6 and CentOS 6 instances. Upon further testing, it turns out that this is overkill. The native drivers may be used, instead, with minimal hassle.

It turns out that, in my quest to ensure that my CentOS and RHEL AMIs contained as many of the AWS utilities present in the Amazon Linux AMIs as possible, I included two RPMs - ec2-net and ec2-net-util - that were preventing use of the native drivers. Skipping these two RPMs (and possibly sacrificing ENI hot-plug capabilities) allows much lower-effort support of 10Gbps networking in AWS-hosted EL6 instances.

Absent those RPMs, 10Gbps support becomes a simple matter of:
  1. Add `add_drivers+="ixgbe ixgbevf"` to the AMI's /etc/dracut.conf file
  2. Use dracut to regenerate the AMI's initramfs
  3. Ensure that there are no persistent network device mapping entries in /etc/udev/rules.d
  4. Ensure that there are no ixgbe or ixgbevf config directives in /etc/modprobe.d/* files
  5. Enable SR-IOV support on the instance-to-be-registered
  6. Register the AMI
The resulting AMIs (and instances spawned from them) should support 10Gbps networking and be compatible with M4-generation instance-types.
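Steps 1 and 2 above boil down to something like the following. The `/tmp/amiroot` staging path is purely illustrative - on a live build you would operate on `/etc/dracut.conf` directly:

```shell
# Stage the dracut directive (illustrative chroot-style path)
AMIROOT=/tmp/amiroot
mkdir -p "${AMIROOT}/etc"
printf 'add_drivers+="ixgbe ixgbevf"\n' >> "${AMIROOT}/etc/dracut.conf"
grep 'add_drivers' "${AMIROOT}/etc/dracut.conf"

# Then regenerate the initramfs for the running kernel:
#   dracut -v -f "/boot/initramfs-$(uname -r).img" "$(uname -r)"
```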

Friday, October 7, 2016

Using DKMS to maintain driver modules

In my prior post, I noted that maintaining custom drivers for the kernel in RHEL and CentOS hosts can be a bit painful (and prone to leaving you with an unreachable or even unbootable system). One way to take some of the pain out of owning a system with custom drivers is to leverage DKMS. In general, DKMS is the recommended way to ensure that, as kernels are updated, required kernel modules are also (automatically) updated.

Unfortunately, use of the DKMS method requires that developer tools (i.e., the GNU C-compiler) be present on the system - either in perpetuity or whenever kernel updates are applied. It is very likely that your security team will object to - or even prohibit - this. If the objection/prohibition cannot be overridden, use of the DKMS method will not be possible.

Assuming DKMS is permissible, setting up the ixgbevf driver under DKMS management looks like the following:
  1. Set an appropriate version string into the shell-environment:
    export VERSION=3.2.2
  2. Make sure that appropriate header files for the running-kernel are installed
    yum install -y kernel-devel-$(uname -r)
  3. Ensure that the dkms utilities are installed:
    yum --enablerepo=epel install dkms
  4. Download the driver sources and unarchive into the /usr/src directory:
    wget "https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/${VERSION}/ixgbevf-${VERSION}.tar.gz/download" \
        -O /tmp/ixgbevf-${VERSION}.tar.gz && \
       ( cd /usr/src && \
          tar zxf /tmp/ixgbevf-${VERSION}.tar.gz )
  5. Create an appropriate DKMS configuration file for the driver:
    cat > /usr/src/ixgbevf-${VERSION}/dkms.conf << EOF
    PACKAGE_NAME="ixgbevf"
    PACKAGE_VERSION="${VERSION}"
    CLEAN="cd src/; make clean"
    MAKE="cd src/; make BUILD_KERNEL=\${kernelver}"
    BUILT_MODULE_LOCATION[0]="src/"
    BUILT_MODULE_NAME[0]="ixgbevf"
    DEST_MODULE_LOCATION[0]="/extra"
    AUTOINSTALL="yes"
    EOF
  6. Register the module to the DKMS-managed kernel tree:
    dkms add -m ixgbevf -v ${VERSION}
  7. Build the module against the currently-running kernel:
    dkms build ixgbevf/${VERSION}
  8. Install the built module into the currently-running kernel's module-tree:
    dkms install ixgbevf/${VERSION}


The easiest way to verify the correct functioning of DKMS is to:
  1. Perform a `yum update -y`
  2. Check that new driver-modules were created for the updated kernel by executing:
    find /lib/modules -name ixgbevf.ko | grep extra
    There should be at least two output-lines: one for the currently-running kernel and one for the kernel update. If more kernels are installed, there may be more than two output-lines.
  3. Reboot the system, then check what version is active:
    modinfo ixgbevf | grep extra
    filename:       /lib/modules/2.6.32-642.1.1.el6.x86_64/extra/ixgbevf.ko
    If the output is null, DKMS didn't build the new module.

Wednesday, October 5, 2016

EL6 and the Pain of 10Gbps Networking in AWS

AWS-hosted instances with optimized networking-support enabled see the 10Gbps interface as an Intel 10Gbps Ethernet adapter (`lspci | grep Ethernet` will display a string similar to "Intel Corporation 82599 Virtual Function"). This interface makes use of the ixgbevf network-driver. Enterprise Linux 6 and 7 bundle version 2.12.x of the driver into the kernel RPM. Per the Enabling Enhanced Networking with the Intel 82599 VF Interface on Linux Instances in a VPC document, AWS recommends version 2.14.2 or higher of the ixgbevf network-driver for enhanced networking. To meet AWS's recommendations for 10Gbps support within an EL6 instance, it is necessary to update the ixgbevf driver to at least version 2.14.2.
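Before bothering with an update, it's worth confirming what the instance is actually running. Something like the following pulls just the version field out of modinfo's output:

```shell
# Extract the bare version string from modinfo's key/value output
modinfo ixgbevf | awk '$1 == "version:" { print $2 }'
```

A value below 2.14.2 means the bundled driver is in use and an update is warranted.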

The ixgbevf network-driver source-code can be found on SourceForge. It should be noted that not every AWS-compatible version of the driver will successfully compile on EL6. Version 3.2.2 is known to successfully compile, without intervention, on EL6.



It is necessary to recompile the ixgbevf driver and inject it into the kernel each time the kernel version changes. This needs to be done between changing the kernel version and rebooting into the new kernel version. Failure to update the driver each time the kernel changes will result in the instance failing to return to the network after a reboot event.
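A quick, hedged helper for spotting kernel module-trees that are missing a rebuilt driver before you reboot (the `MODROOT` override exists purely so the check can be exercised outside a live system; the `extra/` subdirectory is where an out-of-tree ixgbevf typically lands):

```shell
# Flag installed kernels lacking an out-of-tree ixgbevf.ko under extra/
# MODROOT defaults to the real module tree; overridable for dry-runs/testing
MODROOT="${MODROOT:-/lib/modules}"
for DIR in "${MODROOT}"/*
do
   [ -f "${DIR}/extra/ixgbevf.ko" ] || echo "needs driver rebuild: ${DIR##*/}"
done
```

Any kernel this flags would come back up on the bundled 2.12.x driver (or off the network entirely) after a reboot.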

Step #10 from the implementation procedure:
rpm -qa kernel | sed 's/^kernel-//' | xargs -I {} dracut -v -f /boot/initramfs-{}.img {}
is the easiest way to ensure that all installed kernels are properly linked against the ixgbevf driver.
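The sed step in that pipeline simply strips the package-name prefix so that each kernel's version-string can be fed to dracut as both the initramfs filename and the kernel version, e.g.:

```shell
# One `rpm -qa kernel` line in, one dracut-ready version string out
echo "kernel-2.6.32-642.1.1.el6.x86_64" | sed 's/^kernel-//'
# → 2.6.32-642.1.1.el6.x86_64
```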


It is possible that the above process can be avoided by installing DKMS and letting it coordinate the insertion of the ixgbevf driver modules into updated kernels.


The following assumes the instance-owner has privileged access to the instance OS and can make AWS-level changes to the instance's configuration:
  1. Login to the instance
  2. Escalate privileges to root
  3. Install the up-to-date ixgbevf driver. This can be installed either by compiling from source or using pre-compiled binaries.
  4. Delete any `*persistent-net*.rules` files found in the /etc/udev/rules.d directory (one or both of 70-persistent-net.rules and 75-persistent-net-generator.rules may be present)
  5. Ensure that an `/etc/modprobe.d` file with the following minimum contents exists:
    alias eth0 ixgbevf
    Recommend creating/placing in /etc/modprobe.d/ifaliases.conf
  6. Unload any ixgbevf drivers that may be in the running kernel:
    modprobe -rv ixgbevf
  7. Load the updated ixgbevf driver into the running kernel:
    modprobe -v ixgbevf
  8. Ensure that the /etc/modprobe.d/ixgbevf.conf file exists. Its contents should resemble:
    options ixgbevf InterruptThrottleRate=1
  9. Update the /etc/dracut.conf. Ensure that the add_drivers+="" directive is uncommented and contains reference to the ixgbevf modules (i.e., `add_drivers+="ixgbevf"`)
  10. Recompile all installed kernels:
    rpm -qa kernel | sed 's/^kernel-//'  | xargs -I {} dracut -v -f /boot/initramfs-{}.img {}
  11. Shut down the instance
  12. When the instance has stopped, use the AWS CLI tool to enable optimized networking support:
    aws ec2 modify-instance-attribute --region <REGION> --instance-id <INSTANCE_ID> --sriov-net-support simple
  13. Power the instance back on
  14. Verify that 10Gbps capability is available:
    1. Check that the ixgbevf module is loaded
      $ sudo lsmod
      Module                  Size  Used by
      ipv6                  336282  46
      ixgbevf                63414  0
      i2c_piix4              11232  0
      i2c_core               29132  1 i2c_piix4
      ext4                  379559  6
      jbd2                   93252  1 ext4
      mbcache                 8193  1 ext4
      xen_blkfront           21998  3
      pata_acpi               3701  0
      ata_generic             3837  0
      ata_piix               24409  0
      dm_mirror              14864  0
      dm_region_hash         12085  1 dm_mirror
      dm_log                  9930  2 dm_mirror,dm_region_hash
      dm_mod                102467  20 dm_mirror,dm_log
    2. Check that `ethtool` shows that the default interface (typically "`eth0`") is using the ixgbevf driver:
      $ sudo ethtool -i eth0
      driver: ixgbevf
      version: 3.2.2
      firmware-version: N/A
      bus-info: 0000:00:03.0
      supports-statistics: yes
      supports-test: yes
      supports-eeprom-access: no
      supports-register-dump: yes
      supports-priv-flags: no
    3. Verify that the interface is listed as supporting a link mode of `10000baseT/Full` and a speed of `10000Mb/s`:
      $ sudo ethtool eth0
      Settings for eth0:
              Supported ports: [ ]
              Supported link modes:   10000baseT/Full
              Supported pause frame use: No
              Supports auto-negotiation: No
              Advertised link modes:  Not reported
              Advertised pause frame use: No
              Advertised auto-negotiation: No
              Speed: 10000Mb/s
              Duplex: Full
              Port: Other
              PHYAD: 0
              Transceiver: Unknown!
              Auto-negotiation: off
              Current message level: 0x00000007 (7)
                                     drv probe link
              Link detected: yes