Wednesday, September 6, 2017

Order Matters

For one of the projects I'm working on, I needed to automate the deployment of JNLP-based Jenkins build-agents onto cloud-deployed (AWS/EC2) EL7-based hosts.

When we'd first started the project, we were using the Jenkins EC2 plugin to do demand-based deployment of EC2 spot-instances to host the Jenkins agent-software. It worked decently and kept deployment costs low. For better or worse, our client preferred using fixed/reserved instances and JNLP-based agent setup vice master-orchestrated, SSH-based setup. The requested method likely makes sense if/when they decide there's a need for Windows-based build agents, as all agents will be configured equivalently (via JNLP). It also eliminates the need to worry about SSH key-management.

At any rate, we made the conversion. It was initially trivial ...until we implemented our cost-savings routines. Our customer's developers don't work 24/7/365 (and neither do we). It didn't make sense to have agents just sitting around idle racking up EC2 and EBS time. So, we placed everything under a home-brewed tool to manage scheduled shutdowns and startups.
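To give a flavor of what that tool does, its effect boils down to cron-driven stop/start calls along these lines (the schedule and instance-IDs below are illustrative placeholders, not our actual configuration):

# Crontab sketch: stop the build-agents at close-of-business, start them
# back up before the workday (Monday-Friday; all values are placeholders):
0 19 * * 1-5  aws ec2 stop-instances  --instance-ids i-0123456789abcdef0
0 6  * * 1-5  aws ec2 start-instances --instance-ids i-0123456789abcdef0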

The next business day, the problems started. The Jenkins master came online just fine. All of the agents also started back up ...however, the Jenkins master was declaring them dead. As part of the debugging process, we would log into each of the agent-hosts, only to find that there was a JNLP process in the PID-list. Initially, we assumed the declared-dead problem was just a simple timing issue. Our first step was to try rescheduling the agents' start to happen a half hour before the master's. When that failed, we tried setting the master's start a half hour before the agents'. No soup.
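The "JNLP process in the PID-list" check was nothing fancier than grepping the process-table on each agent host (slave.jar matching the agent invocation shown further down):

# The agent's JVM was present even while the master called the node dead:
pgrep -af slave.jar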

Interestingly, doing a quick `systemctl restart jenkins-agent` on the agent hosts would make them pretty much immediately go online in the master's node-manager. So, we started to suspect that something between the master and the agent-nodes was causing issues.
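What we were doing by hand amounted to the following bounce-and-verify sequence (the node-URL just follows Jenkins' standard /computer/<NAME>/api/json pattern, using the names from our agent definition below):

# Restart the agent service, then ask the master's REST API whether it
# still considers the node offline:
systemctl restart jenkins-agent
curl -sk https://jenkins.lab/computer/Jenkins00-Agent0/api/json | grep -o '"offline":[a-z]*'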

Because our agents talk to the master through an ELB, the issues led us to suspect the ELB was proving problematic. After all, even though the Jenkins master starts up just fine, the ELB doesn't start passing traffic until a minute or so after the master service starts accepting direct connections. We suspected that perhaps there was enough lag in the ELB going fully active that the master was declaring the agents dead, as each agent-host was showing a ping-timeout error in the master console.
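A crude way to watch for that kind of lag is to compare when the master answers directly versus through the ELB (the jenkins-master-direct hostname is a stand-in for whatever path bypasses your ELB):

# A widening gap between these two is the ELB-lag we were worried about:
curl -sko /dev/null -w 'direct: %{http_code}\n' https://jenkins-master-direct/login
curl -sko /dev/null -w 'ELB:    %{http_code}\n' https://jenkins.lab/login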

Upon logging back into the agents, we noticed that, while the java process was running, it wasn't actually logging. For background: when using default logging, the Jenkins agent creates ${JENKINS_WORK_DIR}/logs/remoting.log.N files. The actively logged-to file is the ".0" file, which you can verify with other tools ...or just by noticing that there's a remoting.log.0.lck file. Interestingly, the remoting.log.0 file was blank, whereas all the other, closed files showed connection-actions between the agent and master.
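If you'd rather not trust the .lck breadcrumb, a couple of one-liners will confirm which file the agent's JVM actually has open (paths assume the default layout described above):

# List the rotation-set; the .lck file marks the live log:
ls -l ${JENKINS_WORK_DIR}/logs/remoting.log*
# Confirm the agent JVM is holding remoting.log.0 open:
lsof ${JENKINS_WORK_DIR}/logs/remoting.log.0
# Watch for new connection-actions (ours stayed stubbornly blank):
tail -f ${JENKINS_WORK_DIR}/logs/remoting.log.0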

So, we started looking at the systemd unit file we'd originally authored:

[Unit]
Description=Jenkins Agent Daemon

[Service]
ExecStart=/bin/java -jar /var/jenkins/slave.jar -jnlpUrl https://jenkins.lab/computer/Jenkins00-Agent0/slave-agent.jnlp -secret <secret> -workDir <workdir>
User=jenkins
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target

Initially, I tried playing with the RestartSec parameter. Made no difference in behavior ...probably because the java process never really exits, so there's never anything for systemd to restart. So, did some further digging about, particularly with an eye toward missing systemd dependencies I'd failed to track.
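As an aside, that kind of fiddling is best done via a drop-in rather than by hand-editing the installed unit file. The value below is illustrative:

# "systemctl edit" creates and loads
# /etc/systemd/system/jenkins-agent.service.d/override.conf for you:
systemctl edit jenkins-agent
# ...contents tried, to no effect:
# [Service]
# RestartSec=120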

When you're making the transition from EL6 to EL7, one of the things that's easy to forget is that systemd is not a sequential init-system. Turned out that, while I'd told the service, "don't start till multi-user.target is going", I hadn't told it "...but make damned sure that the NICs are actually talking to the network." That was the key, and was accomplished by adding:

Wants=network-online.target
After=network-online.target
Requires=network.target

to the service definition files' [Unit] sections. (The After= line is what actually enforces the start-ordering; Wants= and Requires= only pull the targets into the startup transaction.) Some documents indicated setting the Wants= to network.target. Others noted that there are some rare cases where network.target may be satisfied before the NIC is actually fully ready to go, and that network-online.target was therefore the better dependency to set. In practice, the gap between network.target and network-online.target coming online is typically in the millisecond range, so, "take your pick". I was tired of fighting things, so opted for network-online.target just to be conservative/"done with things".
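One related gotcha worth knowing about: network-online.target only delays anything if a "wait-online" service is enabled to back it. On NetworkManager-based EL7 hosts, that typically means ensuring the following is enabled:

# network-online.target is only as strong as its wait-online backer;
# make sure the waiter is actually enabled:
systemctl enable NetworkManager-wait-online.service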

After making the above change to one of my three nodes, that node immediately began to reliably come online after full service-downings, agent-reboots or even total re-deployments of the agent-host. The change has since been propagated to the other agent nodes, and the "run only during work-hours" service-profile has been tested "good".
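If you want to convince yourself that the ordering actually took, systemd will happily show its work (run these on an agent host after a reboot):

# Show the time-critical chain of units leading up to the agent:
systemd-analyze critical-chain jenkins-agent.service
# List everything the agent is now ordered after:
systemctl list-dependencies --after jenkins-agent.service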