Wednesday, March 8, 2023

Why Prompt Uncomfortable Questions

I do automation work for a number of enterprise customers. Each of these customers deploys Red Hat and Red Hat derivatives in their hosting environments (be those environments physical, on-premises virtualized or "in the cloud"). Basically, they use "for real" Red Hat for their production systems, then use one of the free clones for their development efforts. Until the CentOS 8 Core → CentOS 8 Stream debacle, that clone was CentOS. My focus with such customers is almost always constrained to their cloud-hosted Enterprise Linux hosts.

However, due to that CentOS 8 Stream debacle, one of my customers fell for Oracle's "use OL8: it's free" come-on. Now, if you've been in IT for even five minutes, you realize that anything "free" offered by Oracle is only so offered because Oracle sees it as a foot in the door. That said, with OL8, specifically, they try to add features to it to make it "a more compelling offering" (or something). Unfortunately, to date, the deltas between OL8 and RHEL8 are significantly greater than between RHEL8 and Rocky 8 or Alma 8 ...and not for the better. Automation that my team has written that "just works" for creating RHEL 8, Rocky 8 and Alma 8 AMIs and VM-templates doesn't "just work" for creating OL8 AMIs.

For example, while Rocky 8, Alma 8 and CentOS 8 Stream (and CentOS 8 Core, before it) all know how to do weak dependency resolution, OL8 doesn't (or, at least, didn't until maybe recently: a couple months after I reported the issue to Oracle via their bugzilla, they finally sent back a notification saying the problem's fixed; I just haven't had the opportunity to verify).

At any rate, each time I've run into various brokenness within OL8, the customer's Oracle sales team keeps trying to sell support contracts.

Similarly, when trying to tout the value of OL8, they tried to flog OL8's ksplice feature. When I asked "why would I use that when RHEL 8 and all of its non-Oracle derivatives have kpatch", the Oracle rep responded back with a list of things that ksplice is able to do but kpatch isn't. That said, the wording of his response elicited a "but you also seem to be saying that feature isn't in the free edition" response from me. Eventually, the representative replied back saying that my assessment was accurate – the flogged feature wasn't in the free edition.

He also tried to salvage the "Oracle support" thread by pointing out that Oracle would also provide support for my customer's Red Hat systems under an uber-contract. Now, my customer uses pay as you go (PAYG) EC2s in AWS but wanted free EL8 alternatives – thus the consideration of OL8 – for their non-production workloads. As such, if they're doing PAYG RHEL instances and wanting free OL8 for their non-production workloads, why would suggesting my customer buy Oracle's support for all of them make any sense? I mean, if my customer were not doing PAYG RHEL instances, they'd presumably have bought instance-licenses (and support along with them) from Red Hat, so, again, "why would they want to buy Oracle's support for them?"

…similarly, if they're already doing static licensing for RHEL instances, then they're probably also managing their instance-licenses through Satellite (etc.). As such, they'd then be able to take advantage of the "free for developers" licenses for their non-production EC2s …at which point the question would be "why would they even bother with OL8" let alone have to ask the "why would we buy Oracle's support for them" question?

Yeah, I get that Oracle sees OL8 more as a foot-in-the-door than an actual product, but still: send me information that doesn't prompt questions that are going to be uncomfortable for you to answer.

Rate-limit Testing

Recently, I was working on a project where I needed to enable customers to join their Linux EC2s to their on-premises Windows AD domain. I noticed that I was occasionally getting errors like:

adcli: couldn't connect to dev.lab domain: Couldn't authenticate as: svc_dev_joiner@DEV.LAB: Client's credentials have been revoked

Initially, I'd thought I was triggering the lockout by trying to rejoin the same host to the domain in too-quick succession. But then I suspected that I might actually be running into a broader-scope rate-limiting problem with the joiner-account. So, I set up a userData file that contained a block like:

# Node-name is "ip-" plus 11 random lowercase letters (width overridable via $1)
hostnamectl set-hostname "ip-$(
  tr -dc '[:alpha:]' < /dev/urandom | \
  tr '[:upper:]' '[:lower:]' | \
  fold -w "${1:-11}" | \
  head -n 1
).dev.lab"

Then I updated my `aws ec2 run-instances …` command to include a `--count 12` option. The above code-snippet ensures that I get a randomized FQDN where the node-name consists of the string "ip-" followed by 11 (relatively) random characters. This creates a 14-character node-name, kept within the 15-character NetBIOS limit because the domain-controller refuses domain-joins by clients that want longer node-names (the DC is running in 2003 compatibility-mode). I had previously tried using:

hostnamectl set-hostname "$( 
    printf '%02X' $(
      hostname -I | sed 's/\./ /g'
    )
  ).dev.lab"

However, with my testing subnet being small, I realized that I might be generating hostnames that were already in the AD domain, which might cause its own problems. Thus, the desire for greater uniqueness in my node-names.
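To make the two naming schemes easier to compare, here's a minimal sketch that wraps each one in a function and adds a length check against the 15-character NetBIOS limit. The function names (`random_node_name`, `ip_node_name`, `fits_netbios`) and the `NETBIOS_MAX` variable are hypothetical, made up for illustration; they weren't in my actual userData. It also generates lowercase letters directly rather than lowercasing afterwards, and takes the IP as an argument instead of calling `hostname -I`, so treat it as an approximation of the snippets above rather than a drop-in replacement.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only (not the original userData).

NETBIOS_MAX=15   # NetBIOS node-name limit the DC enforces

# Random scheme: "ip-" followed by N random lowercase letters (default 11).
random_node_name() {
  # LC_ALL=C keeps tr treating /dev/urandom as plain bytes
  printf 'ip-%s\n' "$(LC_ALL=C tr -dc 'a-z' < /dev/urandom | head -c "${1:-11}")"
}

# IP-derived scheme: each IPv4 octet rendered as two uppercase hex digits.
# Assumes a single dotted-quad address with no leading zeros in the octets.
ip_node_name() {
  old_ifs=$IFS
  IFS=.
  # shellcheck disable=SC2086  # word-splitting on dots is intentional
  set -- $1
  IFS=$old_ifs
  printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

# Succeeds only if the node-name (not the FQDN) fits the NetBIOS limit.
fits_netbios() {
  [ "${#1}" -le "${NETBIOS_MAX}" ]
}

# Example: build a candidate FQDN, bailing out if the node-name is too long
node="$(random_node_name)"
if fits_netbios "${node}"; then
  echo "${node}.dev.lab"
else
  echo "node-name too long for NetBIOS: ${node}" >&2
fi
```

The IP-derived form is deterministic (the same address always yields the same eight hex digits), which is exactly why it risks collisions on a small subnet where addresses get reused; the random form trades that predictability for uniqueness.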

In either case, the domain-owners are going to be pissed that I'm dicking up their domain database with a crapton of "nonconformant" hostnames.

Monday, March 6, 2023

Stop Tramping On My Toes

A week or so ago, I was asked by my customer to write some (AWS) account-import scripts to help their Ops team bring account-resources under Terragrunt control. No big deal.

I inform the customer, "I need an account that I'm free to spin up and tear down so I can test the scripts as I write them." Customer responds back, supplying me with an account in which to do my work.

Now, there's a lot of stuff that needs to be accounted for in the import automation. Things were going decently well, until late last week. Then things got squirrelly. Upon investigating, I see the telltale signs that I haven't been the only person using the target account and that the other person's activities have undermined all of the assumptions that were in my scripts.

I ask for help razing the account and point out that it looks like someone else had been using the account while it was supposed to be for my sole use. Another "engineer" on the team – one who has been regularly tramping on everyone else's toes by just accessing accounts and working in them – chimes in on the thread saying, "yeah: I ran <CODE> in that account, last week?"

Color me utterly non-shocked. Also, color me wanting to choke the shit out of the asshole for breaking shit, again, by conducting activities with no coordination.

So, now I have to wait until someone with more permissions than I have on the account can go in and flatten the account, so I can build it back to the state I was expecting to be working from.

I mean, I suppose I should be used to it, by now. Further, I suppose that I shouldn't be surprised that an "engineer" who barely knows how to use git would operate like a freaking wrecking ball. But it's no less frustrating each time he blows me or someone else up with his carelessness.