Friday, August 16, 2019

Unsupported Things Never Stay Unsupported

A few years ago, the project-lead on the project I was working on at the time asked me, "Can you write some backup scripts that our cloud pathfinder programs can use? Moving to AWS, their systems won't be covered by the enterprise's NetBackup system and they need something to help with disaster-mitigation."

At the time, the team I was on was very small. While there were nearly a dozen people – a couple of whom were also technically-oriented – on the team, it was pretty much just me and one other person who were automation-oriented (and responsible for the entire tooling workload). That other person was more Windows-focused and our pathfinders were pretty much wholly Linux-based. I am, to make an understatement, UNIX/Linux-oriented. Thus, it fell on me.

I ended up whacking together a very quick-and-dirty set of tools. Since our pathfinders were notionally technically-savvy (otherwise, they ought not have been pathfinders), I built the tools with a few assumptions: 1) that they'd read the (minimal) documentation I included with the project; 2) that, having read that, if they found it lacking, they'd examine the code (since it was hosted in a public GitHub project); 3) that they'd contact me or the team if 1 and/or 2 failed - preferably by filing a GitHub issue against the project; and, 4) that they might be the type of users who tried to use more-complex storage configurations (especially given that, at the time, you couldn't expand an EBS volume once created).

Since I wanted something done fairly quickly, I wrote the tools to leverage AWS's native storage-snapshotting capability. Using storage-snapshots was also attractive because it was more cost-efficient than mirroring an online EBS to an offline EBS (especially the multiple such offline EBSes that would be necessary to create multiple days' worth of recovery-points). Lastly, because of the way the snapshots work, once you initiate a snapshot, you don't have to worry about filesystem changes while the snapshot-creation finishes.

Resultant of assumption #4, I wrote the scripts to accommodate the possibility that they'd be trying to back up filesystems that spanned – either as concats or stripes – two or more EBS volumes. This accommodation meant that I added an option to freeze the filesystem before requesting snapshots and (because AWS had yet to add the option to snapshot a set of EBSes all at once) implemented multi-threaded snapshot-request logic. The combination of the two meant that, in the few milliseconds it took to initiate the (effectively) parallel snapshot-requests, I/O to the filesystem would be halted, preventing any consistency-gaps between members of the EBS-set.
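
For illustration, the core of that freeze-then-snapshot approach looks something like the following (a minimal sketch, not the actual tools' code – the mount-point and volume IDs are hypothetical placeholders):

```bash
#!/bin/bash
# Sketch of freeze-then-snapshot for a filesystem spanning multiple EBSes.
FSPATH="/data"                                # hypothetical spanned filesystem
VOLUMES=( "vol-0aaaaaaaa" "vol-0bbbbbbbb" )   # hypothetical member-volume IDs

# Halt I/O so that all members of the EBS-set stay consistent
fsfreeze --freeze "${FSPATH}"

# Request snapshots of all member-volumes in (effectively) parallel
for VOLUME in "${VOLUMES[@]}"
do
   aws ec2 create-snapshot --volume-id "${VOLUME}" \
     --description "Span-member of ${FSPATH}" &
done
wait   # block only until every snapshot-request has been initiated

# Resume I/O: snapshot contents are already point-in-time fixed
fsfreeze --unfreeze "${FSPATH}"
```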

Unfortunately, as my primary customers' adoption went from the (technically-savvy) pathfinder-projects to progressively less-technical follow-on projects, the assumptions under which the tools were authored became less valid. Because the follow-on projects tended to be peopled by staff that were very "cut-n-paste" oriented, they tended to do some silly things:
  • Trying to freeze the "/" filesystem (you can do it, but you're probably not going to be able to undo it …without rebooting)
  • Trying to freeze filesystems that didn't exist ("but it was in the example")
  • Trying to freeze filesystems that weren't built on top of spans (not inherently "wrong", per se, but also not needed from a "keep all the sub-volumes in sync" perspective)
Worse, as our team added (junior) people to help act as stewards for these projects, they didn't really understand the vagaries of storage – why certain considerations are warranted or not (and the tool-features associated thereto). So, they weren't steering the new tool-users away from bad usages. Bearing in mind that the tools were a quick-and-dirty effort meant for use by technically-savvy users, the tools had pretty much no "safeties" built in.

Side-note: To me, you build safeties into tools that you're planning to support (to minimize the amount of "help me, help me: stuff blowed up" emails).
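
As a concrete example, the kind of "safety" in question can be as simple as a couple of guard-clauses (an illustrative sketch – not the tools' actual checks):

```bash
# Hypothetical guards against the misuses in the bullet-list above
TARGET="${1:-UNDEF}"

# Refuse to freeze "/": you can do it, but undoing it means rebooting
if [[ ${TARGET} == "/" ]]
then
   echo "Refusing to freeze '/': undoing that requires a reboot" >&2
   exit 1
fi

# Make sure the requested filesystem actually exists before freezing it
if ! mountpoint -q "${TARGET}"
then
   echo "No filesystem mounted at '${TARGET}': nothing to freeze" >&2
   exit 1
fi
```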

Being short on time and having a huge stack of other projects to work on when I first wrote them, smoothing off the rough-edges wasn't a priority. And, when the request for the tools was originally made, in response to my saying, "I don't have the spare cycles to fully engineer a supportable tool-set," I was told, "these will be made available as references, not supported tools."

Harkening back to the additions to the team, one of the other things that wasn't adequately communicated to them in their onboarding was that the tooling wasn't "supported". We ended up in a cycle where, about every 6-11 months, people would be screaming that "the scripts are broken and tenants are screaming". At which point, I would have to remind the people who made the "they're just references" tool-request that, "1) these tools were never meant as more than a reference; 2) they're working perfectly-adequately and just like they did back in 2015 – it's just that both the program-users and our team's stewards are botching their use."

In the most recent iteration of the preceding, I actually had some time to retrofit some safeties. Having done so, the people who had been whining about the tools being "broken" were chuffed and asked, "cool, when do we make the announcement of the updates?" I responded, "well, you don't really need to." At which point I was asked, "why do you want to hide these updates?" My response was, "there's a difference between 'not announcing updates' and 'hiding availability of updates'. Besides: I made no functional changes to the tools. The only thing that's changed is that it's harder for people to use them in ways that will cause problems." In disbelief, I was asked, "if there's no functionality change, why did we waste time updating the tools?"

Muttering under my breath as I composed my reply: "Because you guys were (erroneously) stating that the tools were broken and that they needed to be made more-friendly. The 'make more friendly' is now done. That said, announcing to people who are already successfully using them that 'the tools have been updated' provides no value to those people: whether they continue using the pre-safeties versions or the updated versions, their jobs will continue to function without any need for changes. All that announcing will inevitably do is cause those people to harangue you with 'we updated: why aren't we seeing any functional enhancements?'"

Friday, August 9, 2019

So You Want to Access a FIPSed EL7 Host Via RDP

One of the joys of the modern, corporate security-landscape is that enterprises frequently end up locking down their internal networks to fairly extraordinary degrees. As software and operating-system vendors offer new bolts to tighten, organizations will tend to tighten them - sometimes without considering the impact of that tightening.

Several of my customers protect their networks not only with inbound firewalls, but with firewalls that severely restrict outbound connectivity. Pretty much, their users' desktop systems can only access an external service if it's offered via HTTP/S. Similarly, their users' desktop systems are configured with application-whitelisting enabled - preventing not only power users from installing software that requires privileged access to install, but preventing all users from installing things that are wholly constrained to their home directories. This kind of security-posture is suitable for the vast majority of users, but is considerably less so for developers.

The group I work for provides cloud-enablement services. This means that we are both developers ourselves and provide services to our customers' developers. Both for our own needs (when on-site) and for those of our customers' developers, this has meant needing to have remote (cloud-hosted) "developer" desktops. The cloud service providers (CSPs) we and our customers use offer remote-desktop solutions (e.g., AWS's "Workspaces"). However, these services are typically not usable at our customer sites due to the previously-mentioned network and desktop lockdowns: even if the local desktop has tools like RDP and SSH clients installed, those tools are only usable within the enterprise's internal networks; and, if the remote-desktop offering is reachable via HTTP/S, it's typically through a widget that the would-be remote-desktop user would install to their local workstation …if application-whitelisting didn't prevent it.

To solve this problem for both our own needs (when on-site) and our customers' developers' needs, we stood up a set of remote (cloud-hosted), Windows-based desktops. To make them usable from locked-down networks, we employed Apache's Guacamole service. Guacamole makes remote Windows and Linux desktops available within a user's web browser.

Guacamole-fronted Windows desktops proved to be a decent solution for several years. Unfortunately, as the cloud wars heat up and CSPs try to find ways to bring - or force - customers into their datacenters, what was once a decent solution can become considerably less so - often due to pricing factors. Sadly, it appears that Microsoft may be trying to pump up Azure adoption by increasing the price of cloud-hosted Windows services when those services are run in other CSPs' datacenters.

While we wait to see if and how this plays out, financially, we opted to see, "can we find lower-cost alternatives to Windows-based (remote) developer desktops?" Most of our and our customers' developers are Linux-oriented - or at least Linux-comfortable: it was a no-brainer to see what we could do using Linux. Our Guacamole service already uses Linux-based containers to provide the HTTP/S-encapsulation for RDP, and Guacamole natively supports the fronting of Linux-based graphical desktops via VNC. That said, given that the existing infrastructure is built around RDP, keeping communications RDP-based - even without Windows in the solution-stack - might ease the rearchitecting process.

Because our security guidance has previously required us to use "hardened" Red Hat and CentOS-based servers to host Linux applications, that was our starting-point for this process. This hardening almost always introduces "wrinkles" into deployment of solutions - usually because the software isn't SELinux-enabled or relies on kernel-bits that are disabled under FIPS mode. This time, the problem was FIPS mode.

While installing and using RDP on Linux has become a lot easier than it used to be (tools like XRDP now actually ship with SELinux policy-modules!), not all of the kinks are gone, yet. What I discovered, when starting on the investigation path, is that the XRDP installer for Enterprise Linux 7 isn't designed to work in FIPS mode. Specifically, when the installer goes to set up its encryption-keys, it attempts to do so using MD5-based methods. When FIPS mode is enabled on a Linux kernel, MD5 is disabled.

Fortunately, this only affects legacy RDP connections. The currently-preferred solution for RDP leverages TLS. Both TLS and its preferred ciphers and algorithms are FIPS-compatible. Further, even though the installer fails to set up the encryption keys, those keys are effectively optional: a file merely needs to exist at the expected location, not actually be a valid key. This meant that the problem in the installer was trivially worked around by adding a `touch /etc/xrdp/rsakeys.ini` to the install process. Getting a cloud-hosted, Linux-based, graphical desktop ultimately becomes a matter of:

  1. Stand up a cloud-hosted Red Hat or CentOS 7 system
  2. Ensure that the "GNOME Desktop" and "Graphical Administration Tools" package-groups are installed (since, if your EL7 starting-point is like ours, no GUIs will be in the base system-image)
  3. Once those are installed, ensure that the system's default run-state has been set to "graphical.target". The installers for the "GNOME Desktop" package-group should have taken care of this for you: check with `systemctl get-default`. If they didn't properly set things, correct it by executing `systemctl set-default graphical.target`
  4. Ensure that the xrdp and tigervnc-server RPMs are installed
  5. Make sure that firewalld allows connections to the XRDP service by executing `firewall-cmd --add-port=3389/tcp --permanent`
  6. Similarly, ensure that whatever CSP-layer networking controls are present allow TCP port 3389 inbound to your XRDP-enabled Linux host.
  7. ...And if you want users of your Linux-based RDP host to be able to remotely access actual Windows-based servers, install Vinagre.
  8. Reboot to ensure everything is in place and running.
Once the above is done, you can test things out by RDPing into your new Linux host from a Windows host …and, if you've installed Vinagre, RDPing from your new, XRDP-enabled Linux host to a Windows host (for a nice case of RDP-inception). A consolidated sketch of the whole procedure follows below.
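
Pulled together, the procedure looks something like the following (a rough sketch, assuming a yum-based EL7 host and the package names above – adjust to your environment):

```bash
#!/bin/bash
# Rough consolidation of steps 1-8 above - a sketch, not a turnkey installer

# Steps 2, 4 & 7: GUI package-groups plus the RDP/VNC bits
yum -y groupinstall "GNOME Desktop" "Graphical Administration Tools"
yum -y install xrdp tigervnc-server vinagre

# Step 3: make sure the system boots to a graphical target
if [[ $( systemctl get-default ) != "graphical.target" ]]
then
   systemctl set-default graphical.target
fi

# FIPS-mode workaround: the key-file merely needs to exist - it
# needn't contain a valid (MD5-based) key
touch /etc/xrdp/rsakeys.ini
systemctl enable xrdp

# Step 5: open the local firewall for RDP (step 6 - the CSP-layer
# network controls - has to be handled via the CSP's own tooling)
firewall-cmd --permanent --add-port=3389/tcp

# Step 8: reboot to ensure everything is in place and running
systemctl reboot
```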
