Wednesday, September 26, 2018

So You Created a Regression?

Sometimes, when you're fixing up files in a git-managed project, you'll create a regression by nuking a line (or a whole block) of code. If you can remember something that was in that nuked chunk of code, you can write a quick script to find all prior commits that referenced it.

In this particular case, one of my peers was working on writing some Jenkins pipeline-definitions. The pipeline needed to automagically create S3 pre-signed URLs. At some point, the routine for doing so got nuked. Because coming up with the requisite interpolation-protections had been kind of a pain in the ass, we really didn't want to have to go through the pain of reinventing that particular wheel.
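For the unfamiliar, a pre-signed URL can be generated straight from the AWS CLI; the following is purely illustrative (placeholder bucket/key values and a one-hour expiry), not the pipeline's actual Groovy logic:
aws s3 presign s3://<BUCKET>/<FOLDER>/<FILE> --expires-in 3600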

So, how to make git help us find the missing snippet? `git log`, harnessed to a quick loop-iteration, can do the trick:
# Search every pipeline-definition for commits that touched the string "presign"
for FILE in $( find <PROJECT_ROOT_DIR> -name "*.groovy" )
do
   echo "${FILE}"
   git log --pretty="   %H" -Spresign "${FILE}"
done
In the above:

  • We're executing within the directory created by the original `git clone` invocation. To limit the search-scope, you can also run it from a subdirectory of the project.
  • Since, in this example case, we know that all the Jenkins pipeline definitions end with the `.groovy` extension, we limit our search to just those file-types.
  • The `-Spresign` is used to tell `git log` to look for the string `presign` (a single-invocation variant of this search is sketched just below this list).
  • The `--pretty=" %H"` suppresses all the other output from the `git log` command's output - ensuring that only the commit-ID is print. The leading spaces in the quoted string provide a bit of indenting to make the output-groupings a bit easier to intuit.

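As an aside, `git log` will also accept a pathspec directly, so a single invocation can cover much the same ground as the loop. The following is just a sketch (standard `git log` options), not what we actually ran:
git log --oneline --name-only -Spresign -- '*.groovy'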
This quick loop provides us a nice list of commit-IDs like so:
./Deployment/Jenkins/agent/agent-instance.groovy
   64e2039d593f653f75fd1776ca94bdf556166710
   619a44054a6732bacfacc51305b353f8a7e5ebf6
   1e8d5e40c7db2963671457afaf2d16e80e42951f
   bb7af6fd6ed54aeca54b084627e7e98f54025c85
./Deployment/Jenkins/master/Jenkins_master_infra.groovy
./Deployment/Jenkins/master/Jenkins_S3-MigrationHelper.groovy
./Deployment/Jenkins/master/Master-Ec2-Instance.groovy
./Deployment/Jenkins/master/Master-Elbv1.groovy
./Deployment/Jenkins/master/parent-instance.groovy
In the above, only the file `Deployment/Jenkins/agent/agent-instance.groovy` contains our searched-for string. The first-listed commit-ID (indented under the filename) contains the subtraction-diff for the targeted string. Similarly, the second commit-ID contains the code snippet we're actually after. The remaining commit-IDs contain the original "invention of the wheel".

In this particular case, we couldn't simply revert to the specific commit, as a lot of other, desired changes had landed since then. However, it did let the developer use `git show` to copy the full snippet he wanted back into the current version(s) of his pipelines.
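For reference, pulling the old version of the file out of that second commit is a one-liner. Something like the following (using the commit-ID and path from the listing above, and assuming the file lived at that same path in the older commit) dumps the file as it existed back then so the snippet can be copied out:
git show 619a44054a6732bacfacc51305b353f8a7e5ebf6:Deployment/Jenkins/agent/agent-instance.groovy | less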

Wednesday, September 19, 2018

Exposed By Time and Use

Last year, a project with both high complexity and high urgency was dropped in my lap. Worse, the project was, to be charitable, not well specified. It was basically, "we need you to automate the deployment of these six services and we need to have something demo-able within thirty days".

Naturally, the six services were ones that I'd never worked with from the standpoint of installing or administering them. Thus, there was a bit of a learning curve around how best to automate things, one not aided by the paucity of "here's the desired end-state" or other related details. All they really knew was:

  • Automate the deployment into AWS
  • Make sure it works on our customized/hardened build
  • Make sure that it backs itself up in case things blow up
  • GO!
I took my shot at the problem. I met the deadline. Obviously, the results were not exactly "optimal" — especially from the "turn it over to others to maintain" standpoint. Naturally, after I turned over the initial capability to the requester, they were radio silent for a couple weeks.

When they finally replied, it was to let me know that the deadline had been extended by several months. So, I opted to use that time to make the automation a bit friendlier for the uninitiated to use. That's mostly irrelevant here — just more "background".

At any rate, we're now nearly a year-and-a-half removed from that initial rush-job. And, while I've improved the ease of use for the automation (it's been turned over to others for daily care-and-feeding), much of the underlying logic hasn't been revisited.

Over that kind of span, time tends to expose a given solution's shortcomings. Recently, they were attempting to do a parallel upgrade of one of the services and found that the data-move portion of the solution was resulting in build-timeouts. Turns out the size of the dataset being backed up (and recovered from as part of the automated migration process) had exploded. I'd set up the backups to operate incrementally, so the increase in raw transfer times had been hidden.

The incremental backups were only taking a few minutes; however, restores of the dataset were taking upwards of 35 minutes. The build-automation was set to time out at 15 minutes (early in the service-deployment, a similar operation took 3-7 minutes). So, step one was to adjust the automation's timeouts to make allowances for the new restore-time realities. Second step was to investigate why the restores were so slow.

The quick-n-dirty backup method I'd opted for was a simple `aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>`. It was a dead-simple way to "sweep" the contents of the directory /<APPLICATION_HOME_DIR>/ to S3. And, because `aws s3 sync` is incremental by default (it only transfers new or changed files), the cron-managed backups were taking the same couple minutes each day that they were a year-plus ago.
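For the curious, the cron side of that backup amounts to little more than a one-line /etc/cron.d entry along these lines (the schedule and log-path shown here are illustrative, not the production values):
0 1 * * * root aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>/ >> /var/log/s3-backup.log 2>&1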

Fun fact about S3 and its transfer performance: if the objects you're uploading have keys with high degrees of commonality, transfer performance will become abysmal.

You may be asking why I mention "keys" since I've not mentioned encryption. S3, being object-based storage, doesn't have the hierarchical layout of legacy, host-oriented filesystems. If I take a file from host-oriented storage and use the AWS CLI's S3 utility to copy that file via its fully-qualified pathname to S3, the object created in S3 will look like:
<FULLY>/<QUALIFIED>/<PATH>/<FILE>
Of the above, the full string is the object's key: "<FILE>" is the final, name-like element, while "<FULLY>/<QUALIFIED>/<PATH>/" is the key's prefix. If you have a few thousand objects sharing the same or sufficiently-similar prefixes, you'll run into that "transfer performance will become abysmal" issue mentioned earlier.
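To make that concrete, a copy like the following (hypothetical source-path, placeholder bucket) creates an object whose name-like element is `datafile.dat` and whose key-prefix is `var/lib/myapp/data/`:
aws s3 cp /var/lib/myapp/data/datafile.dat s3://<BUCKET>/var/lib/myapp/data/datafile.dat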

We very definitely did run into that problem. HARD. The dataset in question is (currently) a skosh less than 11GiB in size. The instance being backed up has an expected sustained network throughput of about 0.45Gbps. So, we were expecting that dataset to take only a few minutes to transfer. However, as noted above, it was taking 35+ minutes to do so.
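The back-of-the-envelope math behind that expectation (converting GiB to gigabits, then dividing by the expected throughput and by 60 to get minutes):
echo 'scale=1; (11 * 8 * 1.074) / 0.45 / 60' | bc   # => roughly 3.5 (minutes)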

So, how to fix? One of the easier methods is to stream your backups to a single file: a quick series of benchmarking-runs showed that doing so cut the transfer from over 35 minutes to under five. Similarly, were one to iterate over all the files in the dataset and copy each into S3 individually, using either randomized filenames (with the "real" fully-qualified path set as an attribute/tag on the object) or simply a reversed path-name (something like `S3NAME=$( echo "${REAL_PATHNAME}" | perl -lne 'print join "/", reverse split /\//;' )`) as the object name, performance goes up dramatically.
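A minimal sketch of that first, "single file" option, assuming `tar` and the AWS CLI are available on the instance (placeholder paths/bucket as before); the archive is streamed straight to S3 rather than staged on local disk:
tar -czf - /<APPLICATION_HOME_DIR>/ | aws s3 cp - s3://<BUCKET>/<FOLDER>/backup-$( date '+%Y%m%d' ).tgz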

I'll likely end up doing one of those three methods ...once I have enough contiguous time to allocate to re-engineering the backup and restore/rebuild methods.