Tuesday, December 5, 2017

I've Used How Much Space??

A customer of mine needed me to help them implement a full CI/CD tool-chain in AWS. As part of that implementation, they wanted to have daily backups. Small problem: the enterprise backup software that their organization normally uses isn't available in their AWS-hosted development account/environment. That environment is mostly "support it yourself".

Fortunately, AWS has a number of tools that can help with things like backup tasks. The customer didn't have strong specifications on how they wanted things backed up, retention-periods, etc. Just "we need daily backups". So I threw together some basic "pump it into S3" type of jobs with the caveat "you'll want to keep an eye on this because, right now, there are no data-lifecycle elements in place".
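
I no longer have the original job definitions in front of me, but they were roughly this shape; the bucket, application path, and file names below are placeholders rather than the customer's actual values:

# Nightly backup job (sketch): tar up the application data and stream it
# to a dated prefix in the backup bucket
STAMP=$(date +%Y%m%d)
tar czf - /var/lib/someapp | \
    aws s3 cp - "s3://<BACKUP_BUCKET>/Backups/${STAMP}/someapp.tar.gz"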

For the first several months things ran fine. Then, as they often do, problems began popping up. Their backup jobs started experiencing periodic errors. I wasn't able to find the underlying causes. However, in my searching around, it occurred to me, "wonder if these guys have been aging stuff off like I warned them they'd probably want to do."

AWS provides a nifty GUI option in the S3 console that will show you storage utilization. A quick look in their S3 backup buckets told me, "doesn't look like they have".

Not being much of a GUI-jockey, I wanted something I could run from the CLI that could be fed to an out-of-band notifier. The AWS CLI offers the `s3api` tool-set, which comes in handy for such actions. On my first dig through (and with some Googling), I sorted out how to get a total-utilization view for the bucket. It looks something like:

aws s3api list-objects --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --output json --query "[sum(Contents[].Size), length(Contents[])]" | \
    awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
[
1671.5 GB
    423759
]

The above agreed with the GUI and was more space than I'd assumed they'd be using at this point. So, I wanted to see "can I clean up".

aws s3api list-objects --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --prefix Backups/YYYYMMDD/ --output json \
    --query "[sum(Contents[].Size), length(Contents[])]" | \
    awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'
[
198.397 GB
    50048
]
That "one day's worth of backups" was also more than expected. Last time I'd censused their backups (earlier in the summer), they had maybe 40GiB worth of data. They wanted a week's worth of backups. However, at 200GiB/day worth of backups, I could see that I really wasn't going to be able to trim the utilization. Also meant that maybe they were keeping on top of aging things off.

Note: yes, S3 has lifecycle policies that allow you to automate moving things to lower-cost tiers. Unfortunately, the auto-tiering (at least from regular S3 to S3-IA) has a minimum age of 30 days. Not helpful, here.
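
For reference, had the 30-day minimum not been a blocker, wiring up that tiering is just a small JSON rule-document plus one API call. A sketch (bucket and prefix reused from above purely for illustration; save the rule as lifecycle.json first):

{
  "Rules": [
    {
      "ID": "TierBackupsToIA",
      "Filter": { "Prefix": "Backups/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "STANDARD_IA" } ]
    }
  ]
}

aws s3api put-bucket-lifecycle-configuration \
    --bucket toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k \
    --lifecycle-configuration file://lifecycle.json

The same mechanism can also expire objects outright after a given number of days, which is closer to what a "keep a week of backups" retention policy actually wants.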

Saving grace: at least I snuffled up a way to get these metrics without the web GUI. As a side effect, it also meant I had a way to verify that the amount of data reaching S3 matches the amount being exported from their CI/CD applications.
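
For what it's worth, the plain `aws s3` front-end can produce a similar summary without any JMESPath, which makes for a handy cross-check (same bucket as above):

aws s3 ls s3://toolbox-s3res-12wjd9bihhuuu-backups-q5l4kntxp35k --recursive \
    --summarize --human-readable | tail -2

The last two lines of that listing are the "Total Objects:" and "Total Size:" summaries.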

Wednesday, September 6, 2017

Order Matters

For one of the projects I'm working on, I needed to automate the deployment of JNLP-based Jenkins build-agents onto cloud-deployed (AWS/EC2) EL7-based hosts.

When we'd first started the project, we were using the Jenkins EC2 plugins to do demand-based deployment of EC2 spot-instances to host the Jenkins agent-software. It worked decently and kept deployment costs low. For better or worse, our client preferred using fixed/reserved instances and JNLP-based agent setup vice master-orchestrated, SSH-based setup. The requested method likely makes sense if/when they decide there's a need for Windows-based build agents, since all agents will be configured equivalently (via JNLP). It also eliminates the need to worry about SSH key-management.

At any rate, we made the conversion. It was initially trivial ...until we implemented our cost-savings routines. Our customer's developers don't work 24/7/365 (and neither do we). It didn't make sense to have agents just sitting around idle racking up EC2 and EBS time. So, we placed everything under a home-brewed tool to manage scheduled shutdowns and startups.

The next business-day, the problems started. The Jenkins master came online just fine. All of the agents also started back up ...however, the Jenkins master was declaring them dead. As part of the debugging process, we would log into each of the agent-hosts, only to find that there was a JNLP process in the PID-list. Initially, we assumed the declared-dead problem was just a simple timing issue. Our first step was to try rescheduling the agents' start to happen a half hour before the master's. When that failed, we tried setting the master's start a half hour before the agents'. No soup.

Interestingly, doing a quick `systemctl restart jenkins-agent` on the agent hosts would make them pretty much immediately go online in the master's node-manager. So, we started to suspect that something between the master and the agent-nodes was causing the issues.

Because our agents talk to the master through an ELB, we began to suspect that the ELB was the problem. After all, even though the Jenkins master starts up just fine, the ELB doesn't start passing traffic until a minute or so after the master service starts accepting direct connections. We suspected that there was enough lag in the ELB going fully active that the master was declaring the agents dead, since each agent-host was showing a ping-timeout error in the master console.

Upon logging back into the agents, we noticed that, while the java process was running, it wasn't actually logging. For background, when using default logging, the Jenkins agent creates ${JENKINS_WORK_DIR}/logs/remoting.log.N files. The actively logged-to file is the ".0" file - which you can verify with other tools ...or just notice that there's a remoting.log.0.lck file. Interestingly, the remoting.log.0 file was blank, whereas all the other, closed files showed connection-actions between the agent and master.
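
A quick way to check which remoting log the agent actually has open (a sketch; `slave.jar` matches the jar named in the unit file below, and JENKINS_WORK_DIR is whatever the -workDir argument points at). The .lck file marks the log currently in use, and lsof confirms which file the agent process is holding open:

# ls -l ${JENKINS_WORK_DIR}/logs/remoting.log*
# lsof -p "$(pgrep -d, -f slave.jar)" | grep remoting.log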

So, I started looking at the systemd unit file we'd originally authored:

[Unit]
Description=Jenkins Agent Daemon

[Service]
ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret <secret> -workDir <workdir>
User=jenkins
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target

Initially, I tried playing with the RestartSec parameter. Made no difference in behavior ...probably because the java process never really exits, so there's nothing for systemd to restart. So, I did some further digging about - particularly with an eye toward systemd dependencies I might have failed to declare.

When you're making the transition from EL6 to EL7, one of the things that's easy to forget is that systemd is not a sequential init-system. Turned out that, while I'd hooked the service into multi-user.target, I hadn't told it "...but make damned sure that the NICs are actually talking to the network." That was the key, and it was accomplished by adding:

Wants=network-online.target
Requires=network.target

to the service definition files' [Unit] sections. Some documents indicated setting Wants= to network.target. Others noted that there are rare cases where network.target may be satisfied before the NIC is actually fully ready to go, and that network-online.target was therefore the better dependency to set. In practice, the gap between network.target and network-online.target coming online is typically in the millisecond range - so, "take your pick". I was tired of fighting things, so I opted for network-online.target just to be conservative and done with it.
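
For clarity, here's roughly how the revised unit ended up looking. The After= line wasn't part of the two-line fix above; the systemd documentation recommends pairing Wants=network-online.target with an explicit After= ordering, so I've included it here for completeness:

[Unit]
Description=Jenkins Agent Daemon
Wants=network-online.target
Requires=network.target
# Recommended by the systemd docs when depending on network-online.target:
After=network-online.target

[Service]
ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret <secret> -workDir <workdir>
User=jenkins
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target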

After making the above change to one of my three nodes, that node immediately began to reliably online after full service-downings, agent-reboots or even total re-deployments of the agent-host. The change has since been propagated to the other agent nodes and the "run only during work-hours" service-profile has been tested "good".

Wednesday, May 24, 2017

Barely There

So, this morning I get an IM from one of my teammates asking, "you don't happen to have your own copy of <GitHub_Hosted_Project>, do you? I had kind of an 'oops' moment a few minutes ago." Unfortunately, because my A/V had had its own "oops" moment two days prior, all of my local project copies had gone *poof!*, as well.

Fortunately, I'd been making a habit of configuring our RedMine to do bare-syncs of all of our projects. So, I replied back, "hold on, I'll see what I can recover from RedMine." I update the security rules for our RedMine VM and then SSH in. I escalate my privileges and navigate to where RedMine's configured to create repository copies. I look at the repository copy and remember, "shit, this only does a bare copy. None of the actual files I need is here."

So, I consult the Googles to see if there are any articles on "how to recreate a full git repository from a bare copy" (and permutations thereof). Pretty much all the results are around "how to create a bare repository" ...not really helpful.

I go to my RedMine project's repository page and notice the "download" link when I click on one of the constituent files. I click on it to see whether I actually get the file's contents or whether the download link is just a (now-broken) referral to the file on GitHub. Lo and behold, the file downloads. It's a small project, so, absolute worst case, I can download all of the individual files and manually recreate the GitHub project, losing only my project-history.

That said, the fact that I'm able to download the files tells me, "RedMine has a copy of those files somewhere." So I think to myself, "mebbe I need to try another search: clearly RedMine is allowing me to check files out of this 'bare' repository, so perhaps there's something similar I can do more directly." I return to Google and punch in "git bare repository checkout". Voilà. A couple useful-looking hits. I scan through a couple of them and find that I can create a full clone from a bare repo. All I have to do is go into my (RedMine's) filesystem, copy my bare repository to a safe location (just in case) and then clone from it:

# find <BARE_REPOSITORY> -print | cpio -vpmd /tmp
# cd /tmp
# git clone <BARE_REPOSITORY_COPY> <FULL_REPOSITORY_TARGET>
# find <FULL_REPOSITORY_TARGET>

That final find shows me that I now have a full repository (there's now a fully populated .git subdirectory in it). I chown the directory to my SSH user-account, then exit my sudo session (I'd ssh'ed in with key-forwarding enabled).

I go to GitHub and (re)create the nuked project, then, configure the on-disk copy of my files and git metadata to be able to push everything back to GitHub. I execute my git push and all looks good from the ssh session. I hop back to GitHub and there is all my code and all of my commit-history and other metadata. Huzzah!
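
For reference, the push-back steps looked roughly like the following; the repository URL is a placeholder, and the fresh clone's origin still points at the local bare copy at this point, so it has to be re-pointed first:

$ cd <FULL_REPOSITORY_TARGET>
$ git remote set-url origin git@github.com:<ORG>/<GitHub_Hosted_Project>.git
$ git push --all origin
$ git push --tags origin

The --all and --tags pushes are what carry the full branch and tag history back up to GitHub.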

I finish out by going back and setting up branch-protection and my push-rules and CI-test integrations. Life is good again.

Wednesday, May 17, 2017

The Savings Are In The POST

Been playing around with Artifactory for a client looking to implement a full CI/CD toolchain. My customer has an interesting budgeting method: they're better able to afford sunk costs than recurring costs. So, they splurged for the Enterprise Edition pricing but asked me to try to deploy it in a "cost-aware" fashion.

Two nice things that the Enterprise Edition of Artifactory gives you: the ability to store artifacts directly in lower-cost "cloud" storage tiers and the ability to cluster. Initially, I wasn't going to bother with the latter: while the customer is concerned about reliability, the "cost-aware" method for running things means that design-resiliency is more critical than absolute availability/uptime. Having two or more nodes running didn't initially make sense, so I set the clustering component aside and explored other avenues for resiliency.

The first phase of resiliency was externalizing stuff that was easily persisted.

Artifactory keeps much of its configuration and tracking information in a database. We're deploying the toolchain into AWS, so, offloading the management overhead of an external database to RDS was pretty much a no-brainer.
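
Pointing Artifactory at the RDS instance is mostly a matter of dropping the right JDBC driver in place and filling out its db.properties. A sketch, assuming a PostgreSQL-flavored RDS endpoint (the endpoint, names, and password are placeholders, and the exact keys can vary between Artifactory versions, so check the database-configuration docs for yours):

# ${ARTIFACTORY_HOME}/etc/db.properties (sketch)
type=postgresql
driver=org.postgresql.Driver
url=jdbc:postgresql://<RDS_ENDPOINT>:5432/artifactory
username=artifactory
password=<DB_PASSWORD>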

When you have the Enterprise Edition entitlements, Artifactory lets you externalize the storage of artifacts to cloud-storage. For a cost-aware deployment, storing gigabytes of data in S3 is much more economical than storing it in an EBS volume. Storing it in S3 also means that the data has a high degree of availability and protection right out of the box. Artifactory also makes it fairly trivial to set up storage tiering. This meant I was able to configure the application to stage recently-written or recently-fetched data on either an SSD-backed EBS volume or (fast, ephemeral) instance storage, and then let the tiering move data down to S3 either as the local filesystem became full or as that data aged.
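
The tiering itself is declared in Artifactory's binarystore.xml. The element names below are reconstructed from memory of the filestore documentation, so treat the whole thing as approximate; the shape that matters is a cache-fs provider (the fast, local staging tier) chained in front of an s3 provider:

<config version="1">
    <chain template="s3"/>
    <provider id="cache-fs" type="cache-fs">
        <!-- fast, local staging area: SSD-backed EBS or instance storage -->
        <cacheProviderDir>/var/opt/jfrog/artifactory/cache</cacheProviderDir>
        <maxCacheSize>100000000000</maxCacheSize>
    </provider>
    <provider id="s3" type="s3">
        <endpoint>s3.amazonaws.com</endpoint>
        <bucketName>my-artifactory-artifacts</bucketName>
        <identity>ACCESS_KEY_GOES_HERE</identity>
        <credential>SECRET_KEY_GOES_HERE</credential>
    </provider>
</config>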

As with some of the other stuff I'd automated, once you had configuration and object data externalized, resiliency was fairly easy to accommodate. You could kill the running instance (or let a faulty one die), spin up a new instance, automate the install of the binaries and the sucking-down of any stub-data needed to get the replacement host knitted back to the external data-services. I assumed that Artifactory was going to be the same case.
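
In user-data (or config-management) terms, that pattern boils down to something like the following sketch; the package name, bucket, and paths are placeholders for whatever your actual delivery mechanism is:

# Replacement-node bootstrap (sketch)
yum install -y jfrog-artifactory-pro
# pull down the stub-data needed to re-knit the node to its external services
aws s3 cp --recursive s3://<STUB_BUCKET>/artifactory-stubs/ /var/opt/jfrog/artifactory/etc/
systemctl enable artifactory
systemctl start artifactory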

For better or worse, Artifactory's relationship with its database is a bit more complicated than some of the other CI/CD components'. Specifically, just giving Artifactory the host:port and credentials for the database doesn't get your new instance talking to the database. No. It wants all that, plus some authentication tokens and certificates to more fully secure the Artifactory/database communications.

While backing these tokens and certificates up and accounting for them in the provisioning-time stub-data pull-down is a potential approach, when you have Enterprise Edition, that's essentially reinventing the wheel. You can, instead, take advantage of EE's clustering capabilities ...and run the Artifactory service as a single-node cluster. Doing so requires generating a configuration bundle-file via an API call and then putting that in your stub-data set. The first time you fire up a new instantiation of Artifactory, if that bundle-file is found in the application's configuration-root, it will "magic" the rest (automatically extract the bundle contents and update the factory config files to enable communication with the externalized services).

As of the writing of this post, the documentation for the HA setup is a bit wanting. They tell you "use this RESTful API call to generate the bundle file". Unfortunately, they don't go into great depth on how to convince Artifactory to generate the bundle. Yeah, there are other links in the documents for how to use the API calls to store data in Artifactory, but no "for this use-case, execute this". Ultimately, what it ended up being was:
# curl -u <ADMIN_USER> -X POST http://localhost:8081/artifactory/api/system/bootstrap_bundle

If you get an error or no response, you need to look in the Artifactory access logs for clues.

If your call was successful, you'd get a (JSON) return similar to:
Enter host password for user '<ADMIN_USER>':
{
    "file" : "/var/opt/jfrog/artifactory/etc/bootstrap.bundle.tar.gz"
}

The reason I gave this article the title I did is that, previous to this exercise, I'd not had to muck with explicitly setting the method used to pass API calls to a command-endpoint. It's that "-X POST" bit that's critical. I wouldn't have known to use the documentation's search-function for the call-method had I not looked at the access logs and seen that Artifactory didn't like me using curl's default, GET-based method.

At any rate, once you have that bootstrap-bundle generated and stored with your other stub-data, all that's left is automating the instantiation-time creation of the ha-node.properties file (a sketch of one appears at the end of this post). After a generic install, when those files are present, the newly-launched Artifactory instance joins itself back to the single-node cluster and all of your externalized data becomes available. All that taken care of, running "cost-aware" means that:

  • When you reach the close of service hours, you terminate your Artifactory host.
  • Just before your service hours start, you instantiate a new Artifactory host.

If you offer your service Mon-Fri from 06:00 to 20:00, that's 70 hours of uptime out of a 168-hour week, so you've saved yourself 98 hours of EC2 charges per calendar-week. If you're really aggressive about running cost-aware, you can apply similar "business hours" logic to your RDS (though, given the size of the RDS, the costs to create the automation may be more than you'll save in a year's running of the RDS instance).
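
As promised above, here's roughly what the ha-node.properties for a single-node "cluster" looks like. The property names are from my recollection of the 5.x HA installation docs, so verify them against the guide for your version; the node-ID, address, and port are placeholders:

# ${ARTIFACTORY_HOME}/etc/ha-node.properties (sketch)
node.id=artifactory-node-01
context.url=http://10.0.0.10:8081/artifactory
membership.port=10001
primary=true
hazelcast.interface=10.0.0.10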