Titular Discrepancy: CI

Wednesday, March 27, 2019

Travis, EL7 and Docker-Based Testing

As noted in a prior post, a lot of my customer-oriented activities support deployment within networks that are either partially or wholly isolated from the public Internet. Yesterday, as part of supporting one such customer, I stood up a new project to help automate the creation of yum repository configuration RPMs for private networks. I've had to hand-jam such files twice, now, and there are unwanted deltas between the two jam-sessions (in my defense, they were separated from each other by nearly three years). So, I figured it was time to standardize and automate things.

Usually, when I stand up a project, I like to include tests of the content that I wish to deliver. Since most of my projects are done in public GitHub repositories, I typically use TravisCI to automate my testing. Prior to this project, however, I hadn't tried to automate the validity-testing of RPM recipes via Travis. Typically, when automating creation of RPMs I wish to retain or deliver, I set up a Jenkins job that takes the resultant RPMs and stores them in Artifactory – both privately-hosted services. Most of my prior Travis jobs were simple syntax-checkers (using tools like shellcheck, JSON validators, CFn validators, etc.) rather than functionality-checkers.

This time, however, I was trying to deliver functionality (RPM spec files that would be used to generate source files from templates and package the results). So, I needed to be able to test that a set of spec files and source-templates could be reliably used to generate RPMs. This meant I needed my TravisCI job to generate "throwaway" RPMs from the project-files.

The TravisCI system's test-hosts are Ubuntu-based rather than RHEL- or CentOS-based. While there are some tools that will allow you to generate RPMs on Ubuntu, there've been some historical caveats on their reliability and/or representativeness. So, my preference was to be able to use a RHEL- or CentOS-based context for my test-packagings. Fortunately, TravisCI does offer the ability to use Docker on their test-hosts.

In general, setting up a Docker-oriented job is relatively straightforward. Where things get "fun" is that the version of `rpmbuild` that comes with Enterprise Linux 7 gets kind of annoyed if it's not able to resolve the UIDs and GIDs of the files it's trying to build from (never mind that the build-user inside the running Docker-container is "root" ...and has unlimited access within that container). If it can't resolve them, the rpmbuild tasks fail with a multitude of not terribly helpful "error: Bad owner/group: /path/to/repository/file" messages.

After googling about, I ultimately found that the UIDs and GIDs of the project-content need to exist within the Docker-container's /etc/passwd and /etc/group files, respectively. Note: most of the "top" search results Google returned to me indicated that the project files needed to be `chown`ed. However, simply being mappable proved to be sufficient.
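A quick way to see the problem for yourself is to check whether a file's numeric owner and group map to names. The following is a minimal sketch, not the diagnostic from my actual job: `describe_owner` is a hypothetical helper, and it assumes GNU `stat` and `getent` are available wherever it runs (for this problem, that means inside the container).

```shell
#!/usr/bin/env bash
# Sketch: report whether a file's numeric owner and group resolve to names.
# describe_owner is a hypothetical helper, not part of the original Travis job.

describe_owner() {
    local file="$1" uid gid user group

    uid=$(stat -c '%u' "$file")   # numeric owner (GNU stat syntax)
    gid=$(stat -c '%g' "$file")   # numeric group

    # getent prints the matching passwd/group entry, or nothing when the
    # ID is unknown; cut keeps only the name field.
    user=$(getent passwd "$uid" | cut -d: -f1) || true
    group=$(getent group "$gid" | cut -d: -f1) || true

    echo "${file}: uid=${uid} (${user:-UNRESOLVED}) gid=${gid} (${group:-UNRESOLVED})"
}
```

Any file that reports UNRESOLVED for either ID is a candidate for the "Bad owner/group" error.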

Rounding the home stretch...

To resolve the problem, I needed to determine what UIDs and GIDs the project-content had inside my Docker-container. That meant pushing a Travis job that included a (temporary) diagnostic-block to stat the relevant files and return their UIDs and GIDs to me. Once the UIDs and GIDs were determined, I needed to update my Travis job to add relevant groupadd and useradd statements to my container-preparation steps. What I ended up with was:

    sudo docker exec centos-${OS_VERSION} groupadd -g 2000 rpmbuilder
    sudo docker exec centos-${OS_VERSION} adduser -g 2000 -u 2000 rpmbuilder

It was late in the day, by this point, so I simply assumed that the IDs were stable. I ran about a dozen iterations of my test, and they stayed stable, but that may have just been "luck". If I run into future "owner/group" errors, I'll update my Travis job-definition to scan the repository-contents for their current UIDs and GIDs and then set them based on those. But, for now, my test harness works: I'm able to know that updates to existing specs/templates or additional specs/templates will create working RPMs when they're taken to where they need to be used.
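The scan-and-set approach described above might look something like the sketch below. It hasn't been run against Travis itself: `emit_id_commands` and the `buildgrp<GID>`/`builder<UID>` names are hypothetical, and it assumes GNU `find` on the Travis host and `getent` inside the container. It only prints the commands, so they can be reviewed before anything is changed.

```shell
#!/usr/bin/env bash
# Sketch: print the groupadd/useradd commands needed so that every UID and
# GID owning a file under a directory resolves inside the container.
# The buildgrp<GID> and builder<UID> names are placeholders.

emit_id_commands() {
    local dir="${1:-.}"

    # Unique numeric GIDs first, so groups exist before users reference them.
    find "$dir" -printf '%G\n' | sort -un | while read -r gid; do
        echo "getent group ${gid} >/dev/null || groupadd -g ${gid} buildgrp${gid}"
    done

    # One line per unique UID, paired with a GID seen on one of its files.
    find "$dir" -printf '%U %G\n' | sort -u -k1,1n | while read -r uid gid; do
        echo "getent passwd ${uid} >/dev/null || useradd -u ${uid} -g ${gid} builder${uid}"
    done
}
```

Something like `emit_id_commands "${TRAVIS_BUILD_DIR}" | sudo docker exec -i centos-${OS_VERSION} bash` would then apply them, assuming the project is mounted into the container with its numeric ownership unchanged. The `getent` guards skip IDs (such as root's) that the container already knows about.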

Wednesday, May 17, 2017

The Savings Are In The POST

Been playing around with Artifactory for a client looking to implement a full CI/CD toolchain. My customer has an interesting budgeting method: they're better able to afford sunk costs than recurring costs. So, they splurged for the Enterprise Edition pricing but asked me to try to deploy it in a "cost-aware" fashion.

Two nice things that the Enterprise Edition of Artifactory gives you: the ability to store artifacts directly to lower-cost "cloud" storage tiers and the ability to cluster. Initially, I wasn't going to bother with the latter: while the customer is concerned about reliability, the "cost-aware" method for running things means that design-resiliency is more critical than absolute availability/uptime. Having two or more nodes running didn't initially make sense, so I set the clustering component aside, and explored other avenues for resiliency.

The first phase of resiliency was externalizing stuff that was easily persisted.

Artifactory keeps much of its configuration and tracking information in a database. We're deploying the toolchain into AWS, so, offloading the management overhead of an external database to RDS was pretty much a no-brainer.

When you have the Enterprise Edition entitlements, Artifactory lets you externalize the storage of artifacts to cloud-storage. For a cost-aware deployment, storing gigabytes of data in S3 is much more economical than storing it in an EBS volume. Storing it in S3 also means that the data has a high degree of availability and protection right out of the box. Artifactory also makes it fairly trivial to set up storage tiering. This meant I was able to configure the application to stage recently-written or fetched data in either an SSD-backed EBS volume or instance storage (fast, ephemeral storage). I could then let the tiering move data to S3 either as the local filesystem became full or as the recently-written or fetched data aged.

With some of the other stuff I automated, once you had configuration and object data externalized, resiliency was fairly easy to accommodate. You could kill the running instance (or let a faulty one die), spin up a new instance, automate the install of the binaries and the sucking-down of any stub-data needed to get the replacement host knitted back to the external data-services. I assumed that Artifactory was going to be the same case.

For better or worse, Artifactory's relationship with its database is a bit more complicated than some of the other CI/CD components. Specifically, just giving Artifactory the host:port and credentials for the database doesn't get your new instance talking to the database. No. It wants all that and it wants some authentication tokens and certificates to more-fully secure the Artifactory/database communications.

While backing these tokens and certificates up and accounting for them in the provisioning-time stub-data pull-down is a potential approach, when you have Enterprise Edition, you're essentially reinventing the wheel. You can, instead, take advantage of EE's clustering capabilities ...and run the Artifactory service as a single-node cluster. Doing so requires generating a configuration bundle-file via an API call and then putting that in your stub-data set. The first time you fire up a new instantiation of Artifactory, if that bundle-file is found in the application's configuration-root, it will "magic" the rest (automatically extract the bundle contents and update the factory config files to enable communication with the externalized services).

As of the writing of this post, the documentation for the HA setup is a bit wanting. They tell you "use this RESTful API call to generate the bundle file". Unfortunately, they don't go into great depth on how to convince Artifactory to generate the bundle. Yeah, there are other links in the documents for how to use the API calls to store data in Artifactory, but no "for this use-case, execute this". Ultimately, what it ended up being was:

    # curl -u <ADMIN_USER> -X POST http://localhost:8081/artifactory/api/system/bootstrap_bundle

If you got an error or no response, you need to look in the Artifactory access logs for clues.

If your call was successful, you'd get a (JSON) return similar to:

    Enter host password for user '<ADMIN_USER>':
    {
        "file" : "/var/opt/jfrog/artifactory/etc/bootstrap.bundle.tar.gz"
    }

I gave the article the title I did because, prior to this exercise, I'd not had to muck with explicitly setting the HTTP method used to pass API calls to the command-endpoint. It's that "-X POST" bit that's critical. I wouldn't have known to use the documentation's search-function for the call-method had I not looked at the access logs and seen that Artifactory didn't like me using curl's default, GET-based method.
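If you're scripting this, you'll want the bundle's path out of that JSON reply so the file can be copied to wherever your stub-data lives. A minimal sketch, assuming the reply keeps the single-key shape shown above; `bundle_path_from_response` and `ADMIN_PASS` are hypothetical names, and `sed` is used only to avoid assuming a JSON tool is installed.

```shell
#!/usr/bin/env bash
# Sketch: extract the bundle path from the bootstrap_bundle JSON reply.
# bundle_path_from_response is a hypothetical helper name.

bundle_path_from_response() {
    # Read the JSON reply on stdin and print the value of its "file" key.
    sed -n 's/.*"file"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Possible usage (ADMIN_USER and ADMIN_PASS are assumed to be set by you):
#   curl -s -u "${ADMIN_USER}:${ADMIN_PASS}" -X POST \
#       http://localhost:8081/artifactory/api/system/bootstrap_bundle \
#       | bundle_path_from_response
```

An empty result means the call didn't return the expected JSON, which is the cue to go check the access logs.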

At any rate, once you have that bootstrap-bundle generated and stored with your other stub-data, all that's left is automating the instantiation-time creation of the ha-node.properties file. After a generic install, when those files are present, the newly-launched Artifactory instance joins itself back to the single-node cluster and all your externalized-data becomes available. All that taken care of, running "cost-aware" means that:

When you reach the close of service hours, you terminate your Artifactory host.
Just before your service hours start, you instantiate a new Artifactory host.

If you offer your service Mon-Fri from 06:00 to 20:00, you're running 70 hours (14 hours × 5 days) out of each 168-hour calendar-week, which saves you 98 hours of EC2 charges per week. If you're really aggressive about running cost-aware, you can apply similar "business hours" logic to your RDS (though, given the size of the RDS, the costs to create the automation may be more than you'll save in a year's running of the RDS instance).
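For the instantiation-time creation of the ha-node.properties file mentioned earlier, something along these lines could go into the launch automation. Treat it as a sketch under assumptions: `write_ha_node_properties` is a hypothetical helper, the key names and the 10001 membership port are what I'd expect from the Artifactory 5.x HA documentation and should be checked against the docs for your version, and the 8081 port simply mirrors the URL used in the curl call above.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal ha-node.properties for a single-node cluster.
# Key names and the membership port are assumptions; verify them against
# the HA documentation for the Artifactory version in use.

write_ha_node_properties() {
    local etc_dir="$1"   # Artifactory configuration root
    local node_ip="$2"   # address of the freshly launched instance

    cat > "${etc_dir}/ha-node.properties" <<EOF
node.id=${HOSTNAME:-node1}
context.url=http://${node_ip}:8081/artifactory
membership.port=10001
primary=true
EOF
}
```

Calling `write_ha_node_properties /var/opt/jfrog/artifactory/etc 10.0.0.10` (the address is a placeholder) before the first service start would put the file next to the extracted bootstrap bundle.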