Friday, August 16, 2019

Unsupported Things Never Stay Unsupported

A few years ago, the project-lead on the project I was working, at the time, asked me, "can you write some backup scripts that our cloud pathfinder programs can use? Moving to AWS, their systems won't be covered by the enterprise's NetBackup system and they need something to help with disaster-mitigation."

At the time, the team I was one was very small. While there were nearly a dozen people – a couple of whom were also technically-oriented – on the team, it was pretty much just me and one other person that were automation-oriented (and responsible for the entire tooling workload). That other person was more Windows-focussed and our pathfinders were pretty much wholly Linux-based. I am, to make an understatement, UNIX/Linux-oriented. Thus it fell on me.

I ended up whacking together a very quick-and-dirty set of tools. Since our pathfinders were notionally technically-savvy (otherwise, they ought not have been pathfinders), I built the tools with a few assumptions: 1) that they'd read the (minimal) documentation I included with the project; 2) that, having read that, if they found it lacking, they'd examine the code (since it was hosted in a public GitHub project); 3) that they'd contact me or the team if 1 and/or 2 to fail - preferably by filing a GitHub issue against the project; and, 4) that they might be the type of users who tried to use more-complex storage configurations (especially given that, at the time, you couldn't expand an EBS volume once created). 

Since I wanted something fairly quickly-done, I wrote the tools to leverage AWS's native storage-snapshotting capability. Using storage-snapshots was also attractive because it was more cost-efficient than mirroring and online EBS to an offline EBS (especially the multiple such offline EBSes that would be necessary to create multiple days worth of recovery-points). Lastly, because of the way the snapshots work, once you initiate the snapshot, you don't have to worry about filesystem changes while the snapshot-creation finishes.

Resultant of assumption #4, I wrote the scripts to accommodate the possibility that they'd be trying to back up filesystems that spanned – either as concats or stripes – two or more EBS volumes. This accommodation meant that I added an option to freeze the filesystem before requesting snapshot and (because AWS had yet to add the option to snapshot a set of EBSes all at once) implemented multi-threaded snapshot-request logic. The combination of the two meant that, in the few milliseconds that it took to initiate the (effectively) parallel snapshot-requests, I/Os would be halted to the filesystem and prevent there from being any consistency-gaps between members of the EBS-set.

Unfortunately, as my primary customers' adoption went from the (technically-savvy) pathfinder-projects to progressively less-technical follow-on projects, the assumptions under which the tools were authored became less valid. Because the follow-on projects tended to be peopled by staff that are very "cut-n-paste" oriented, they tended to do some silly things:
  • Trying to freeze the "/" filesystem (you can do it, but you're probably not going to be able to undo it …without rebooting)
  • Trying to freeze filesystems that didn't exist ("but it was in the example")
  • Trying to freeze filesystems that weren't built on top of spans (not inherently "wrong", per se, but also not needed from a "keep all the sub-volumes in sync" perspective)
Worse, as our team added (junior) people to help act as stewards for these projects, they didn't really understand the vagaries of storage – why certain considerations are warranted or not (and the tool-features associated thereto). So, they weren't steering the new tool-users away from bad usages. Bearing in mind that the tools were a quick-and-dirty effort meant for use by technically-savvy users, the tools had pretty much no "safeties" built in.

Side-note: To me, you build safeties into tools that you're planning to support (to minimize the amount of "help me, help me: stuff blowed up" emails).

Being short on time and with a huge stack of other projects to work on when I first wrote them, smoothing off the rough-edges wasn't a priority. And, when the request for the tools was originally made, in response to my saying "I, don't have the spare cycles to fully engineer a supportable tool-set," I was told, "these will made available as references, not supported tools."

Harkening back to the additions to the team, one of the other things that wasn't being adequately communicated to them in their onboarding was that the tooling wasn't "supported". We ended up on a cycle where, about every 6-11 months, people would be screaming about "the scripts are broken and tenants are screaming". At which point would have to remind the people made the "they're just references" tool-request that, "1) these tools were never meant as more than a reference; 2) they're working perfectly-adequately and just like they did back in 2015 – it's just that both the program-users and our team's stewards are just botching their use."

The most recent iteration of the preceding, I actually had some time to retrofit some safeties. Having done so, the people that were whining about the tools being "broken" were chuffed and asked, "cool, when do we make the announcement of the upates." I responded, "well, you don't really need to." At which point I was asked, "why do you want to hide these updates." My response was, "there's a difference between 'not announcing updates' and 'hiding availability of updates'. Besides: I made no functional changes to the tools. The only thing that's changed is it's harder for people to use them in ways that will cause problems." In disbelief, I was asked, "if there's no functionality change, why did we waste time updating the tools."


Muttering under my breath as I composed my reply, "because you guys were (erroneously) stating that the tools were broken and that they needed to be made more-friendly. The 'make more friendly' is now done. That said, announcing to people that are already successfully using them, 'the tools have been updated,' provides no value to those people: whether they continue using the pre-safeties versions or the updated versions, their jobs will continue to function without any need for changes. All that announcing will inevitably do is cause those people to harangue you about 'we updated: why aren't we seeing any functional enhancements'?"

No comments:

Post a Comment