One of the joys of life as an automation engineer is getting to "spiff up" the automation put together by others. In general, I consider myself a crap programmer. This assessment stands in spite of seeing the garbage written by people who actually consider themselves to be good programmers. It's a matter of what you compare yourself to, I guess: some people compare themselves to where they started; I compare myself to some idea of where I perceive "expert" to be.
At any rate, my most recent project has me "improving" on the automation that another group slapped together and never had time to improve. I look at the code they left me to "improve" and can't help but suspect that, given all the time in the universe, they wouldn't have meaningfully improved it. Oh well: it's giving me an excuse to learn and use a new tool-set.
My first assigned task was improving the deployment automation for their SonarQube software. I'm still a very long way from where I'd like the improvements to be, but I finally figured out why the hell their RDS deployments from snapshots were taking so mind-bendingly long. You see, when I write RDS deployment-automation, I specify everything. I'm a control freak, so that's just how I am. Sadly, I sort of assume a similar approach will be taken by others. Bad assumption to make.
In trying to debug why it was taking the best part of ninety minutes to deploy a small (100GiB) database from an RDS snapshot, I discovered that RDS was taking forever and a day to convert an instance launched with "standard" ("magnetic") storage into one running on GP2. Apparently, when the original template was written, they hadn't specified the RDS storage-type. This failure to specify means that RDS uses its default storage-type ...which is "standard".
Subsequent to their having deployed from that template, they'd (apparently) manually updated the RDS instance's storage-type to GP2. I'm assuming this to be the case because the automation around the RDS had never been modified - no "stack update" had been done at any point in time. Indeed, there was nothing in their stack-definition that even allowed the subsequent re-selection of the instance's storage-type.
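For the curious, the fix amounts to something like the following minimal template sketch - an AWS::RDS::DBInstance restored from a snapshot with the storage-type pinned explicitly (and parameterized, so it can be re-selected on a stack update). The resource and parameter names here are my own inventions for illustration, not theirs:

    AWSTemplateFormatVersion: '2010-09-09'
    Description: Minimal sketch - restore an RDS instance from a snapshot
    Parameters:
      SnapshotId:
        Type: String
        Description: Name or ARN of the source RDS snapshot
      DbStorageType:
        Type: String
        Default: gp2
        AllowedValues:
          - standard
          - gp2
          - io1
        Description: Explicit storage-type, so RDS never falls back to its default
    Resources:
      SonarDb:
        Type: AWS::RDS::DBInstance
        Properties:
          # Restoring from a snapshot means engine/credentials come from the snap-source
          DBSnapshotIdentifier: !Ref SnapshotId
          DBInstanceClass: db.t3.medium
          # The whole point of this post: never leave this one to the default
          StorageType: !Ref DbStorageType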
Interestingly enough, when one launches a new RDS from a snapshot taken of another RDS, AWS attempts to ensure that the new RDS is of the same storage-type. If the snapshot was taken of a GP2-enabled instance, it wants to make the new RDS use GP2. This is all well and good: it prevents you from having unexpectedly different performance-characteristics between the snap-source and the new RDS instantiation.
Unfortunately, in a case where the templates don't override defaults, you get into a HORRIBLE situation when you deploy from that snapshot. Specifically, CloudFormation will create an RDS instance with the default, "magnetic" storage. Then, once the instance is created and the data recovered, RDS attempts to convert the new instance to the storage-type that matches the snap-source. You sit there, staring at the RDS output, wondering "what the hell are you 'modifying' and why is it taking so freaking long???" Stare unseeingly at the screen long enough and you might notice that the pending modification task is "storage". If that registers in your wait-numbed brain, you'll have an "oh shit" moment and hand-crank a new RDS from the same snapshot, being careful to match the storage-type of the new instance to that of the snapshot. Then, you'll watch with some degree of incredulity while the hand-cranked RDS reaches a ready state long before the automated deployment reaches its.
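If you'd rather not stare unseeingly, the AWS CLI will show you the in-flight conversion, and will also do the hand-cranked restore with the storage-type matched up front. Something along these lines (the instance and snapshot identifiers are placeholders, not the real ones from this project):

    # See what RDS thinks it's still "modifying" on the stack-built instance
    aws rds describe-db-instances \
        --db-instance-identifier stack-built-sonarqube \
        --query 'DBInstances[0].PendingModifiedValues'

    # Hand-crank a replacement, explicitly matching the snapshot's storage-type
    aws rds restore-db-instance-from-db-snapshot \
        --db-instance-identifier hand-cranked-sonarqube \
        --db-snapshot-identifier sonarqube-snapshot \
        --storage-type gp2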
Oh well, at least I now know why it was so balls-slow and have "improved" the templates accordingly. Looks like I'll be waiting maybe ten minutes for a snap-sourced RDS rather than the 90+ that it was taking.