Wednesday, August 28, 2024

Getting the Most Out of EC2 Transfers With Private S3 Endpoints

Recently, I was given a project to help a customer migrate an on-premises GitLab installation into AWS. The current GitLab was pretty large: a full export of the configuration was nearly 500GiB in size.

It turned out a good chunk of that 500GiB was due to disk-hosted artifacts and LFS objects. Since I was putting it all into AWS, I opted to make use of GitLab's ability to store BLOBs in S3. Ultimately, that turned out to be nearly 8,000 LFS objects and nearly 150,000 artifacts (plus several hundred "uploads").

The first challenge was getting the on-premises data into my EC2. The customer didn't want to give me access to their on-premises network, so I needed to have them generate the export TAR-file and upload it to S3. Once in S3, I needed to get it into an EC2.

Wanting to make sure that the S3→EC2 task was as quick as possible, I selected an instance-type rated to 12.5Gbps of network bandwidth and 10Gbps of EBS bandwidth. However, my first attempt at downloading the TAR-file from S3 took nearly an hour to run: it was barely creeping along at 120MiB/s. Abysmal.

I broke out `iostat` and found that my target EBS was reporting 100% utilization and a bit less than 125MiB/s of average throughput. That seemed "off" to me, so I looked at the EBS volume's settings. It was then that I noticed that the default volume-throughput was only 125MiB/s. So, I upped the setting to its maximum: 1000MiB/s. I re-ran the transfer only to find that, while the transfer-speed had improved, it had only improved to a shade under 150MiB/s. Still abysmal.
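
For reference, bumping a gp3 volume's provisioned throughput can be done in-place; a minimal sketch (the volume-ID is a placeholder):

aws ec2 modify-volume \
  --volume-id <EBS_VOLUME_ID> \
  --throughput 1000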

So, I started rifling through the AWS documentation to see what CLI settings I could change to improve things. First mods were:

max_concurrent_requests = 40
multipart_chunksize = 10MB
multipart_threshold = 10MB
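
These settings live under the `s3` key of the AWS CLI configuration file; a sketch of how that looked in ~/.aws/config (assuming they're applied to the default profile):

[default]
s3 =
  max_concurrent_requests = 40
  multipart_chunksize = 10MB
  multipart_threshold = 10MB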

This didn't really make much difference. `iostat` was showing really variable utilization-numbers, but mostly that my target-disk was all but idle. Similarly, `netstat` was showing only a handful of simultaneous-streams between my EC2 and S3.

Contacted AWS support. They let me know that S3 multi-part upload and download is limited to 10,000 chunks. So, I did the math (<FILE_SIZE> / <MAX_CHUNKS>; for a roughly 500GiB file, that works out to a bit over 51MiB per chunk) and changed the above to:

max_concurrent_requests = 40
multipart_chunksize = 55MB
multipart_threshold = 64MB

This time, the transfers were running about 220-250MiB/s. While that was a 46% throughput increase, it was still abysmal. While `netstat` was finally showing the expected number of simultaneous connections, my `iostat` was still saying that my EBS was mostly idle.

Reached back out to AWS support. They had the further suggestion of adding:

preferred_transfer_client = crt
target_bandwidth = 10GB/s

To my S3 configuration. Re-ran my test and was getting ≈990MiB/s of continuous throughput for the transfer! That knocked the transfer time down from fifty-five minutes to a shade over eight minutes. In other words, I was going to be able to knock nearly an hour off the upcoming migration-task.
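
For reference, the relevant stanza of my ~/.aws/config ended up looking roughly like the following (a sketch; as the update below notes, the CRT client ignores several of the earlier settings anyway):

[default]
s3 =
  preferred_transfer_client = crt
  target_bandwidth = 10GB/s
  multipart_chunksize = 55MB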

In digging back through the documentation, it seems that, when one doesn't specify a preferred_transfer_client value, the CLI will select the `classic` (`python`) client. And, depending on your Python version, the performance ranges from merely-horrible to ungodly-bad: using RHEL 9 for my EC2, it was pretty freaking bad, but had been less-bad when using Amazon Linux for my EC2's OS. Presumably a difference in the two distros' respective Python versions?

Specifying a preferred_transfer_client value of `crt` (C run-time client) unleashed the full might and fury of my EC2's and GP3's capabilities.

Interestingly, this "use 'classic'" behavior isn't a universal auto-selection. If you've selected an EC2 with any of the instance-types:

  • p4d.24xlarge
  • p4de.24xlarge
  • p5.48xlarge
  • trn1n.32xlarge
  • trn1.32xlarge

The auto-selection gets you `crt`. Not sure why `crt` isn't the auto-selected value for Nitro-based instance-types. But, "it is what it is".

Side note: just selecting `crt` probably wouldn't have completely roided-out the transfer. I assume the further setting of `target_bandwidth` to `10GB/s` is what fully unleashed things. There definitely wasn't much bandwidth left over for me to actually monitor the transfer. I assume that the `target_bandwidth` parameter has a default value that's less than "all the bandwidth". However, I didn't actually bother to verify that.

Update: 

After asking support "why isn't `crt` the default for more instance-types", I got back the reply:

Thank you for your response. I see that these particular P5, P4d and Trn1 instances are purpose built for high-performance ML training. Hence I assume the throughput needed for these ML applications needs to be high and CRT is auto enabled for these instance types.

Currently, the CRT transfer client does not support all of the functionality available in the classic transfer client.

These are a few limitations for CRT configurations:

  • Region redirects - Transfers fail for requests sent to a region that does not match the region of the targeted S3 bucket.
  • max_concurrent_requests, max_queue_size, multipart_threshold, and max_bandwidth configuration values - Ignores these configuration values.
  • S3 to S3 copies - Falls back to using the classic transfer client


All of which is to say that, once I set `preferred_transfer_client = crt`, all of my other, prior settings got ignored.

Wednesday, July 3, 2024

Implementing (Pseudo) Profiles in Git (Part 2!)

As noted in my first Implementing (Pseudo) Profiles in Git post:


I'm an automation consultant for an IT contracting company. Using git is a daily part of my work-life. … Then things started shifting, a bit. Some customers wanted me to use my corporate email address as my ID. Annoying, but not an especially big deal, by itself. Then, some wanted me to use their privately-hosted repositories and wanted me to use identities issued by them.

This led me down a path of setting up multiple git "profiles" that I captured into my first article on this topic. To better support such per-project identities, it's also a good habit to use per-project authentication methods. I generally prefer to do git-over-SSH – rather than git-over-http(s) – when interfacing with remote Git repositories. Because I don't like having to keep re-entering my password, I use an SSH agent to manage my keys. When one only has one or two projects they regularly interface with, that agent only needs to hold a couple of authentication-keys.

Unfortunately, if you have more than one key in your SSH agent, when you attempt to connect to a remote SSH service, the agent will iteratively present keys until the remote accepts one of them. If you've got three or more keys in your agent, the agent could present 3+ keys to the remote SSH server. By itself, this isn't a problem: the remote logs each earlier-presented key as an authentication failure, but otherwise lets you go about your business. However, if the remote SSH server is hardened, it very likely will be configured to lock your account after the third authentication-failure. As such, if you've got four or more keys in your agent and the remote requires a key that your agent doesn't present in the first three authentication-attempts, you'll find your account for that remote SSH service getting locked out.

What to do? Launch multiple ssh-agent instantiations.

Unfortunately, without modifying the default behavior, when you invoke the ssh-agent service, it will create a (semi) randomly-named UNIX domain-socket to listen for requests on. If you've only got a single ssh-agent instance running, this is a non-problem. If you've got multiple – particularly if you're using a tool like direnv – setting up your SSH_AUTH_SOCK in your .envrc files is problematic if you don't have predictably-named socket-paths.

How to solve this conundrum? Well, I finally got tired of having to run `eval $( ssh-agent )` in per-project Xterms every time I rebooted my dev-console. So, I started googling and ultimately just dug through the man page for ssh-agent. In doing the latter, I found:

DESCRIPTION
     ssh-agent is a program to hold private keys used for public key authentication.  Through use of environment variables the
     agent can be located and automatically used for authentication when logging in to other machines using ssh(1).

     The options are as follows:

     -a bind_address
              Bind the agent to the UNIX-domain socket bind_address.  The default is $TMPDIR/ssh-XXXXXXXXXX/agent.<ppid>.

So, now I can add appropriate command-aliases to my bash profile(s) (which I've already moved to ~/.bash_profile.d/<PROJECT>) that can be referenced based on where in my dev-console's filesystem hierarchy I am, and can set up my .envrcs, too. Result: if I'm in <CUSTOMER_1>/<PROJECT>/<TREE>, I get attached to an ssh-agent set up for that customer's project(s); if I'm in <CUSTOMER_2>/<PROJECT>/<TREE>, I get attached to an ssh-agent set up for that customer's project(s); etc. For example:

$ cd ~/GIT/work/Customer_1/
direnv: loading ~/GIT/work/Customer_1/.envrc
direnv: export +AWS_CONFIG_FILE +AWS_DEFAULT_PROFILE +AWS_DEFAULT_REGION ~SSH_AUTH_SOCK
ferric@fountain:~/GIT/work/Customer_1 $ ssh-add -l
3072 SHA256:oMI+47EiStyAGIPnMfTRNliWrftKBIxMfzwYuxspy2E SH512-signed RSAv2 key for Customer 1's Project (RSA)
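
A minimal sketch of how the pieces fit together – the socket-path, alias-name and directory layout below are hypothetical:

# ~/.bash_profile.d/customer_1: launch an ssh-agent bound to a predictable, per-customer socket
alias agent-customer1='ssh-agent -a "${HOME}/.ssh/agent-customer1.sock"'

# ~/GIT/work/Customer_1/.envrc: point this project-tree at that agent's socket
export SSH_AUTH_SOCK="${HOME}/.ssh/agent-customer1.sock"

Because the socket-path is fixed, direnv can export SSH_AUTH_SOCK without needing to capture ssh-agent's output.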

Thursday, June 20, 2024

Keeping It Clean: EKS and `kubectl` Configuration

Previously, I was worried about "how do I make it so that kubectl can talk to my EKS clusters". However, after several days of standing up and tearing down EKS clusters across several accounts, I discovered that my ~/.kube/config file had absolutely exploded in size and its manageability had been reduced to all but zero. And, while aws eks update-kubeconfig --name <CLUSTER_NAME> is great, its lack of a `--delete` suboption is kind of horrible when you want or need to clean out long-since-deleted clusters from your environment. So, on to the "next best thing", I guess…

Ultimately, that "next best thing" was setting a KUBECONFIG environment-variable as part of my configuration/setup tasks (e.g., something like `export KUBECONFIG=${HOME}/.kube/config.d/MyAccount.conf`). While not as good as I'd like to think a `aws eks update-kubeconfig --name <CLUSTER_NAME> --delete would be, it at least means that:

  1. Each AWS account's EKS configuration-stanzas are kept wholly separate from each other
  2. Cleanup is reduced to simply overwriting – or straight up nuking – the per-account ${HOME}/.kube/config.d/MyAccount.conf files

…I tend to like to keep my stuff "tidy". This kind of configuration-separation facilitates scratching that (OCDish) itch. 
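
A minimal sketch of the per-account workflow (account- and cluster-names are placeholders):

# Keep each account's EKS contexts in their own kubeconfig
export KUBECONFIG="${HOME}/.kube/config.d/MyAccount.conf"
aws eks update-kubeconfig --name <CLUSTER_NAME>

# Cleaning up that account is now just:
rm -f "${HOME}/.kube/config.d/MyAccount.conf"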

The above is derived, in part, from the Organizing Cluster Access Using kubeconfig Files document.

Monday, June 17, 2024

Crib Notes: Accessing EKS Cluster with `kubectl`

While AWS does provide a CLI tool – eksctl – for talking to EKS resources, it's not suitable for all Kubernetes actions one might wish to engage in. Instead, one must use the more-generic access provided through the more-broadly used tool, kubectl. Both tools will generally be needed, however.

If, like me, your AWS resources are only reachable through IAM roles – rather than IAM user credentials – it will be necessary to use the AWS CLI tool's eks update-kubeconfig subcommand. The general setup workflow will look like:

  1. Set up your profile definition(s)
  2. Use the AWS CLI's sso login to authenticate your CLI into AWS (e.g., `aws sso login --no-browser`)
  3. Verify that you've successfully logged in to your target IAM role (e.g., `aws sts get-caller-identity` …or any AWS command, really)
  4. Use the AWS CLI to update your ~/.kube/config file with the `eks update-kubeconfig` subcommand (e.g., `aws eks update-kubeconfig --name thjones2-test-01`)
  5. Validate that you're able to execute kubectl commands and get back the kind of data that you expect to get (e.g., `kubectl get pods --all-namespaces` to get a list of all running pods in all namespaces within the target EKS cluster)
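
Strung together, the whole flow looks something like the following (using the cluster-name from the examples above and assuming the profile definitions are already in place):

aws sso login --no-browser
aws sts get-caller-identity
aws eks update-kubeconfig --name thjones2-test-01
kubectl get pods --all-namespaces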

Thursday, May 16, 2024

So You Work in Private VPCs and Want CLI Access to Your Linux EC2s?

Most of the AWS projects I work on, both currently and historically, have deployed most, if not all, of their EC2s into private VPC subnets. This means that, if one wants to be able to directly log in to their Linux EC2s' interactive shells, they're out of luck. Historically, to get something akin to direct access, one had to set up bastion-hosts in a public VPC subnet, and then jump through to the EC2s one actually wanted to log in to. How well one secured those bastion-hosts could make-or-break how well-isolated their private VPC subnets – and associated resources – were.

If you were the sole administrator or part of a small team, or were part of an arbitrary-sized administration-group that all worked from a common network (i.e., from behind a corporate firewall or through a corporate VPN), keeping a bastion-host secure was fairly easy. All you had to do was set up a security-group that allowed only SSH connections and only allowed them from one or a few source IP addresses (e.g., your corporate firewall's outbound NAT IP address). For a bit of extra security, one could even prohibit password-based logins on the Linux bastions (instead using SSH key-based logins, SmartCards, etc. for authentication). However, if you were a member of a team of non-trivial size and your team members were geographically-distributed, maintaining whitelists to protect bastion-hosts could become painful. That painfulness would be magnified if that distributed team's members were either frequently changing-up their work locations or were coming from locations where their outbound IP address would change with any degree of frequency (e.g., work-from-home staff whose ISPs would frequently change their routers' outbound IPs).

A few years ago, AWS introduced SSM and the ability to tunnel SSH connections through SSM (see the re:Post article for more). With appropriate account-level security-controls, the need for dedicated bastion-hosts and maintenance of whitelists effectively vanished. Instead, all one had to do was:

  • Register an SSH key to the target EC2s' account
  • Set up their local SSH client to allow SSH-over-SSM
  • Then SSH "directly" to their target EC2s

SSM would, effectively, "take care of the rest" …including logging of connections. If one were feeling really enterprising, one could enable key-logging for those SSM-tunneled SSH connections (a good search-engine query should turn up configuration guides; one such guide is toptal's). This would, undoubtedly, make your organization's IA team really happy (and may even be required, depending on security-requirements your organization is legally required to adhere to) – especially if they don't yet have an enterprise session-logging tool purchased.
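
The client-side setup from the second bullet is usually just a ProxyCommand stanza in ~/.ssh/config; a sketch, following the pattern AWS documents (tune the Host matching to taste):

Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"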

But what if your EC2s are hosting applications that require GUI-based access to set up and/or administer? Generally, you have two choices:

  • X11 display-redirection
  • SSH port-forwarding

Unfortunately, SSM is a fairly low-throughput solution. So, while doing X11 display-redirection from an EC2 in a public VPC subnet may be more than adequately performant, the same cannot be said when done through an SSH-over-SSM tunnel. Doing X11 display-redirection of a remote browser session – or, worse, an entire graphical desktop session (e.g., KDE or Gnome desktops) – is paaaaaainfully slow. For my own tastes, it's uselessly slow. 

Alternately, one can use SSH port-forwarding as part of that SSH-over-SSM session. Then, instead of trying to send rendered graphics over the tunnel, one only sends the pre-rendered data. It's a much lighter traffic load with the result being a much quicker/livelier response. It's also pretty easy to set up. Something like:

ssh -L localhost:8080:$(
  aws ec2 describe-instances \
    --query 'Reservations[].Instances[].PrivateIpAddress' \
    --output text \
    --instance-ids <EC2_INSTANCE_ID>
):80 <USERID>@<EC2_INSTANCE_ID>

Is all you need. In the above, the argument to the -L flag is saying, "set up a tcp/8080 listener on my local machine and have it forward connections to the remote machine's tcp/80". The local and remote ports can be varied for your specific needs. You can even set up dynamic-forwarding by creating a SOCKS proxy (but this document is meant to be a starting point, not dive into the weeds).
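
For completeness, the dynamic-forwarding variant is a one-liner (the local port is arbitrary; point your browser's SOCKS proxy settings at it):

ssh -D localhost:1080 <USERID>@<EC2_INSTANCE_ID>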

Note that, while the above is using a subshell (via the $( … ) shell-syntax) to snarf the remote EC2's private IP address, one should be able to simply substitute "localhost". I simply prefer to try to speak to the remote's ethernet, rather than loopback, interface, since doing so can help identify firewall-type issues that might interfere with others' use of the target service.

Thursday, March 28, 2024

Mixed Data-Types and Keeping Things Clean

This year, one of the projects I've been assigned to has me assisting a customer in implementing a cloud-monitoring solution for their multi-cloud deployment. The tool uses the various CSPs' APIs to monitor the creation/modification/deletion of resources and how those resources are configured.

The tool, itself, is primarily oriented for use and configuration via web UI. However, one can configure it via Terraform. This makes it easier to functionally-clone the monitoring tool's configuration, as well as reconstitute it if someone blows it up.

That said, the tool uses Nunjucks and GraphQL to implement some of its rule-elements. Further, most of the data it handles comes in the form of JSON streams. The Nunjucks content, in particular, can be used to parse those JSON streams, and static JSON content can be stored within the monitoring-application. Because Terraform is used for CLI-based configuration, the Terraform resources can consist of pure Terraform code as well as a mix of encapsulated Nunjucks, GraphQL and JSON.

Most of the vendor's demonstration configuration-code has the Nunjucks, GraphQL and JSON contents wholly encapsulated in Terraform resource-definitions. If one wants to lint their configuration-code prior to pushing it into the application, the vendor-offered method for formatting the code can work counter to that. That said, with careful coding, one can separate the content-types from each other and use reference-directives to allow Terraform to do the work of merging it all together. While this may seem more complex, separating the content-types means that each chunk of content is more-easily validated and checked for errors. Rather than blindly hitting "terraform apply" and hoping for the best, you can lint your JSON, Nunjucks and GraphQL separately. This means that, once you've authored all of your initial code and wish to turn it over to someone else to do lifecycle tasks, you can hand it all off to a CI workflow that ensures that any human who has edited a given file hasn't introduced content-type violations that can lead to ugly surprises.
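
A minimal sketch of that separation – the resource-type and attribute-names below are purely illustrative, since the vendor's actual Terraform provider will differ:

# main.tf: Terraform stitches the separately-lintable files together at plan/apply time
resource "vendor_monitoring_rule" "example" {
  name       = "example-rule"
  graphql    = file("${path.module}/queries/example.graphql")
  nunjucks   = file("${path.module}/templates/example.njk")
  static_map = jsondecode(file("${path.module}/data/example.json"))
}

Each of the referenced files can then be linted on its own (JSON, Nunjucks and GraphQL validators, respectively) in CI before anyone runs "terraform apply".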

Honestly, I have more confidence that the people I turn things over to will know how to massage single content-type files than mixed content-type files. This means I feel like I'm less likely to get confused help requests after I'm moved to another assignment.

Wednesday, October 18, 2023

Crib Notes: Has My EC2's AMI-Publisher Updated Their AMIs

Discovered, late yesterday afternoon, that the automation I'd used to deploy my development-EC2 had had some updates and that these updates (rotating some baked-in query-credentials) were breaking my development-EC2's ability to use some resources (since it could no longer access those resources). So, today, I'll need to re-deploy my EC2. I figured that, since it had been September 14th when I last launched my EC2, perhaps a new AMI was available (our team publishes new AMIs each month). So, what's a quick way to tell if a new AMI is available? Some nested AWS CLI commands:

aws ec2 describe-images \
  --output text \
  --owners $(
    aws ec2 describe-images \
      --output text \
      --image-ids $(
        aws ec2 describe-instances \
          --output text \
          --query 'Reservations[].Instances[].ImageId' \
          --instance-ids <CURRENT_EC2_ID> ) \
      --query 'Images[].OwnerId'
    ) \
  --query 'Images[].{ImageId:ImageId,CreationDate:CreationDate,Name:Name}' | \
sort -nk 1

Sadly, the answer, today, is "no". Apparently, this month's release-date is going to be later today or tomorrow. So, I'll just be re-deploying from September's AMI.

Friday, October 6, 2023

ACTUALLY Deleting Emails in gSuite/gMail

Each month, I archive all the contents of my main gSuite account to a third-party repository. I do this via an IMAP-based transfer.

Unfortunately, when you use an IMAP-based transfer to move files, Google doesn't actually delete the emails from your gMail/gSuite account. No, it simply removes all labels from them. This means that instead of getting space back – space that Google charges for in gSuite – the messages simply become not-easily-visible within gMail. None of your space is freed up and, thus, space-charges for those unlabeled emails continue to accrue.

Discovered this annoyance a couple years ago when my mail-client was telling me I was getting near the end of my quota. When I first got the quota-warning, I was like, "how??? I've offloaded all my old emails. There's only a month's worth of email in my Inbox, Sent folder and my per-project folders!" That prompted me to dig around to discover the de-labeled/not-deleted fuckery. So, I dug around further to find a method for viewing those de-labeled/not-deleted emails. Turns out, putting:

-has:userlabels -in:sent -in:chat -in:draft -in:inbox

In your webmail search-bar will show you them …and allow you to delete them.

My gSuite account was a couple years old when I discovered all this. So, when I selected all the unlabeled emails for deletion, it took a while for Google to actually delete them. However, once the deletion completed, I recovered nearly 2GiB worth of space in my gSuite account.

Tuesday, September 12, 2023

Tailoring `oscap` Profiles for Dummies

Several of the projects I am or have been matrixed to leverage the oscap utility to perform hardening based on common security-benchmarks. However, some of the profile-defaults are either too strict or too lax for a given application-deployment. While one can wholly ignore the common security-benchmarks' selected hardenings and create one's own custom hardening-profile(s), that's a bit too much like reinventing the wheel.

Checking Which Security-Profiles Are Available

The oscap utility can be used to quickly show what profile-names are available for use. This is done by executing:

$ oscap info /PATH/TO/OS/<XCCDF>.xml

On Red Hat systems (and derivatives) with the scap-security-guide RPM installed, the XCCDF files will be installed in the /usr/share/xml/scap/ssg/content directory. To see which profiles are available for Red Hat 8 distros, one would execute:

$ oscap info /usr/share/xml/scap/ssg/content/ssg-rhel8-xccdf.xml

Which would give an output like:

Document type: XCCDF Checklist
Checklist version: 1.2
Imported: 2023-02-13T11:49:00
Status: draft
Generated: 2023-02-13
Resolved: true
Profiles:
        Title: ANSSI-BP-028 (enhanced)
                Id: xccdf_org.ssgproject.content_profile_anssi_bp28_enhanced
        Title: ANSSI-BP-028 (high)
                Id: xccdf_org.ssgproject.content_profile_anssi_bp28_high
        Title: ANSSI-BP-028 (intermediary)
                Id: xccdf_org.ssgproject.content_profile_anssi_bp28_intermediary
        Title: ANSSI-BP-028 (minimal)
                Id: xccdf_org.ssgproject.content_profile_anssi_bp28_minimal
        Title: CIS Red Hat Enterprise Linux 8 Benchmark for Level 2 - Server
                Id: xccdf_org.ssgproject.content_profile_cis
        Title: CIS Red Hat Enterprise Linux 8 Benchmark for Level 1 - Server
                Id: xccdf_org.ssgproject.content_profile_cis_server_l1
        Title: CIS Red Hat Enterprise Linux 8 Benchmark for Level 1 - Workstation
                Id: xccdf_org.ssgproject.content_profile_cis_workstation_l1
        Title: CIS Red Hat Enterprise Linux 8 Benchmark for Level 2 - Workstation
                Id: xccdf_org.ssgproject.content_profile_cis_workstation_l2
        Title: Unclassified Information in Non-federal Information Systems and Organizations (NIST 800-171)
                Id: xccdf_org.ssgproject.content_profile_cui
        Title: Australian Cyber Security Centre (ACSC) Essential Eight
                Id: xccdf_org.ssgproject.content_profile_e8
        Title: Health Insurance Portability and Accountability Act (HIPAA)
                Id: xccdf_org.ssgproject.content_profile_hipaa
        Title: Australian Cyber Security Centre (ACSC) ISM Official
                Id: xccdf_org.ssgproject.content_profile_ism_o
        Title: Protection Profile for General Purpose Operating Systems
                Id: xccdf_org.ssgproject.content_profile_ospp
        Title: PCI-DSS v3.2.1 Control Baseline for Red Hat Enterprise Linux 8
                Id: xccdf_org.ssgproject.content_profile_pci-dss
        Title: DISA STIG for Red Hat Enterprise Linux 8
                Id: xccdf_org.ssgproject.content_profile_stig
        Title: DISA STIG with GUI for Red Hat Enterprise Linux 8
                Id: xccdf_org.ssgproject.content_profile_stig_gui
Referenced check files:
        ssg-rhel8-oval.xml
                system: http://oval.mitre.org/XMLSchema/oval-definitions-5
        ssg-rhel8-ocil.xml
                system: http://scap.nist.gov/schema/ocil/2
        https://access.redhat.com/security/data/oval/com.redhat.rhsa-RHEL8.xml.bz2
                system: http://oval.mitre.org/XMLSchema/oval-definitions-5

The critical items, here, are the lines that begin with "Title" and the lines that begin with "Id".

  • The lines that begin with "Title" are what will appear in graphical tools like the SCAP WorkBench GUI.
  • The lines that begin with "Id" are used with the `oscap` utility. These identifiers one given as arguments to the utility's --profile flag (when using the `oscap` utility to scan and/or remediate a system).
    Note: When using the values from the "Id" lines, either the fully-qualified ID-string may be given or just the parts after the "…profile_" substring. As such, one could specify either "xccdf_org.ssgproject.content_profile_stig" or just "stig".
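
For example (assuming the RHEL 8 content-path shown earlier), the following two scan-invocations are equivalent:

$ oscap xccdf eval \
    --profile xccdf_org.ssgproject.content_profile_stig \
    /usr/share/xml/scap/ssg/content/ssg-rhel8-xccdf.xml

$ oscap xccdf eval \
    --profile stig \
    /usr/share/xml/scap/ssg/content/ssg-rhel8-xccdf.xml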

Creating Tailored Security-Profiles:

The easiest method for tailoring security-profiles is to use the SCAP Workbench to generate the appropriately-formatted XML. However, if one already has an appropriately-formatted tailoring XML file, a plain text-editor (such as vim) is a very quick way to add or remove content.

It's worth noting that the SCAP Workbench is a GUI. As such, it will be necessary to either have access to the graphical console of a Linux or OSX host or the ability to display a remote Linux host's GUI-programs locally. Remote display to a local system can be the entire remote desktop (via things like Xrdp, VNC, XNest or other) or just the SCAP Workbench client, itself (personally, I leverage X11-over-SSH tunnels).

On a Red Hat (or derivative) system, you'll want to install the scap-workbench and the scap-security-guide RPMs. The former provides the SCAP Workbench GUI while the latter provides the content you'll typically want to use. As an alternative to the scap-security-guide RPM, you can install SCAP content from the Compliance As Code project (the upstream source for the scap-security-guide RPM's contents).

To generate a "null" tailoring-profile – one that doesn't change the behavior of a given profile's execution – use the following generic procedure:

  1. Establish GUI access to the system that will run the scap-workbench binary
  2. Execute `scap-workbench`. This will bring up a banner that looks like:


    The above is shown with the "content to load" list expanded. This demonstrates the content that's loaded to a Red Hat 8 system by way of the scap-security-guide RPM.
  3. Select the appropriate content from the dropdown: if using vendor-content, one of the RHELn menu items; if opening content from another source (e.g. the Compliance as Code project), select the "Other SCAP Content" option
  4. Click the "Load Content" button. if the "Other SCAP Content" option was selected, this will open up a dialog for navigating to the desired content. Otherwise, the vendor-content for the selected RHEL version will be opened.
  5. Once the selected content has been read, the GUI will display a page with the Title of the opened-content, a "Customization" dropdown-menu and a "Profile" drop-down menu.
  6. Select the appropriate hardening-profile from the "Profile" drop-down menu (e.g., "DISA STIG for Red Hat Enterprise Linux 8")
  7. Click on the "Customize" button next to the selected "Profile":
  8. This will bring up a window like:
    Accept the default value for the "New Profile ID" field and click on the "Ok" button
  9. This will open a new window:
    • To save a "null" tailoring-file, immediately hit the "Ok" button
    • Alternately, pick through the list of available settings, selecting or unselecting boxes as seems appropriate for the target use-case, then hit the "Ok" button 
  10. This will close the customization-window and change the main window's "Customization" drop-down to include the string "(unsaved changes)"
  11. Click on the "File" dropdown-menu at the top of the window and select the "Save Customization Only" menu-item. Select a file-name that makes sense (I typically choose something like "tailoring-<OS_VERSION>-<PROFILE_NAME>.xml"; e.g., "tailoring-el8-stig.xml", "tailoring-el8-cis-server-l1.xml", etc.)
  12. Exit the SCAP workbench.

The resultant file will contain a line similar to:

<xccdf:Profile id="xccdf_org.ssgproject.content_profile_stig_customized" extends="xccdf_org.ssgproject.content_profile_stig">

The actual contents of the line will vary, but the critical components are the "id" and "extends" tokens:

  • id: the name of the profile to use when invoking the oscap utility
  • extends: the name of the profile that will get modified by the tailoring-file's contents

The contents of the tailoring-file are generally pretty basic – something like:

<xccdf:Tailoring xmlns:xccdf="http://checklists.nist.gov/xccdf/1.2" id="xccdf_scap-workbench_tailoring_default">
  <xccdf:benchmark href="/tmp/scap-workbench-WOghwr/ssg-rhel8-xccdf.xml"/>
  <xccdf:version time="2023-09-11T17:10:35">1</xccdf:version>
  <xccdf:Profile extends="xccdf_org.ssgproject.content_profile_stig" id="xccdf_org.ssgproject.content_profile_stig_customized">
    <xccdf:title override="true" xml:lang="en-US" xmlns:xhtml="http://www.w3.org/1999/xhtml">DISA STIG for Red Hat Enterprise Linux 8 [CUSTOMIZED]</xccdf:title>
    <xccdf:description override="true" xml:lang="en-US" xmlns:xhtml="http://www.w3.org/1999/xhtml">This profile contains configuration checks that align to the
DISA STIG for Red Hat Enterprise Linux 8 V1R9.</xccdf:description>
    <xccdf:select idref="xccdf_org.ssgproject.content_rule_rpm_verify_hashes" selected="false"/>
  </xccdf:Profile>
</xccdf:Tailoring>

Rules that have been added to the execution list will look something like (note the "true" condition/key):

<xccdf:select idref="xccdf_org.ssgproject.content_rule_rpm_verify_hashes" selected="true"/>

While rules that have been deselected for execution will look something like (note the "false" condition/key):

<xccdf:select idref="xccdf_org.ssgproject.content_rule_rpm_verify_hashes" selected="false"/>

Whether adding extra rules or deselecting rules from the execution-profile, the rules will be placed after the "</xccdf:description>" token and before the "</xccdf:Profile>" token.

Note that the action/rule is effectively null if the condition/key for a rule in the tailoring-file has the same value as the action/rule value in the profile referenced by the "extends" token.

Using Tailored Security-Profiles:

Once generated, the tailoring-file is used by calling the oscap utility in the normal way but for:

  • Adding a "--tailoring" flag (with the path of the tailoring-file as its argument)
  • Ensuring the value of the "--profile" matches the profile "id" token's value in the tailoring-file (and that the "extends" token's value in the tailoring-file matches the "id" token's value in the referenced XCCDF file)

For example, if executing a scan using a tailored execution of the STIG profile, one would execute something like:

oscap xccdf eval \
  --profile stig_customized \
  --tailoring-file /root/tailoring-el8-stig.xml \
  /usr/share/scap-content/openscap/ssg-rhel8-xccdf.xml

The above tells the oscap utility to use the tailoring-file located at "/root/tailoring-el8-stig.xml" to modify the behavior of the "stig" profile as defined in the "/usr/share/scap-content/openscap/ssg-rhel8-xccdf.xml" file.

Tuesday, April 11, 2023

TIL: I Am Probably Going To Come To Hate `fapolicyd`

One of the things I do in my role is write security automation. Part of that requires testing systems' hardening-compliance each time one of the security-benchmarks my customers use is updated.

In a recent update, the benchmarks for Red Hat Enterprise Linux 8 (and derivatives) added the requirement to enable and run the application-whitelisting service, fapolicyd. I didn't immediately notice this change …until I went to offload the security-scans from my test EC2 to an S3 bucket. The AWS CLI was suddenly broken.

Worse, it was broken in an absolutely inscrutable way: if one executed an AWS CLI command, even something as simple and basic as `aws help`, it would immediately return, having neither performed the requested action nor emitted an error. As an initial debug attempt, I did:

echo $( aws help )$?

Which got me the oh-so-useful `255` for my troubles. But, it did allow me to start Googling. My first hit was this guy. It had the oh-so-helpful guidance:

  • 255 -- Command failed. There were errors from either the CLI or the service the request was made to.

Like I said: "oh-so-helpful guidance".

So, I opened a support request with Amazon. Initially, they were at least as in the dark as I was. 

Fortunately, another member of the team I work on noticed the support case when it came into his Inbox via our development account's auto-Cc configuration. Unlike me, he, apparently, hasn't deadmailed everything sent to that distro (which, given how much is sent to that distro, deadmailing anything that arrived through the distro was the only way to preserve my sanity). He'd indicated that he had previously had similar issues and that he got around them by disabling the fapolicyd service. I quickly tested stopping the service and the AWS CLI happily resumed functioning as it had before the hardening tasks had executed.

I knew that wholly disabling the service was not going to be acceptable to our cyber-defense team. But, knowing where the problem originated meant I had a useful investigatory path.

The first (seeming) solution I found was to execute something like:

fapolicyd-cli --file add /usr/local/bin/aws
fapolicyd-cli --update

This allowed both the AWS CLI to function and for the fapolicyd service to continue to be run.

For better or worse, though, I'm a curious type. I wanted to see what the rule looked like that the fapolicyd-cli utility had created. So, I dug around the documentation to find where I might be able to eyeball the exception. What I found looked like:

# AUTOGENERATED FILE VERSION 2
# This file contains a list of trusted files
#
#  FULL PATH        SIZE                             SHA256
# /home/user/my-ls 157984 61a9960bf7d255a85811f4afcac51067b8f2e4c75e21cf4f2af95319d4ed1b87
/usr/local/aws-cli/v2/2.11.6/dist/aws 6658376 c48f667b861182c2785b5988c5041086e323cf2e29225da22bcd0f18e411e922

Which immediately rang alarm bells in my skull (strong "danger Will Robinson" vibes). By making the exception conditional not only on the binary's (real) path, but also on its size and, particularly, its SHA256 signature, I knew that, if anyone ever updated the installed binary, the exception-description would no longer match. This, in turn, would mean that the utility would stop working. Not wanting to deal with tickets that I could easily prevent, I continued my investigation.

Knowing that what I actually wanted was to give a blanket exemption to everything under the "/usr/local/aws-cli/v2" directory, I started investigating how to do that. Ultimately, I came up with an exception-set that looked like:

allow perm=any all : dir=/usr/local/aws-cli/v2/ type=application/x-sharedlib trust 1
allow perm=any all : dir=/usr/local/aws-cli/v2/ type=application/x-executable trust 1

I saved the contents as `/etc/fapolicyd/rules.d/80-aws.rules` and reloaded the `fapolicyd` configuration. However, I was sadly disappointed to discover that the AWS CLI was still broken. On the plus side, it was broken differently. Now, instead of immediately and silently exiting (with a 255 error-code), it was giving me:

[2030] Error loading Python lib '/usr/local/aws-cli/v2/2.11.6/dist/libpython3.11.so.1.0':
dlopen: /usr/local/aws-cli/v2/2.11.6/dist/libpython3.11.so.1.0: cannot open shared object
file: Operation not permitted

Much more meat to chew on. Further searching later, I had two additional commands to help me in my digging:

ausearch --start today -m fanotify --raw | aureport --file --summary

Which gave me:

File Summary Report
===========================
total  file
===========================
16  /usr/local/aws-cli/v2/2.11.6/dist/libz.so.1
16  /usr/local/aws-cli/v2/2.11.6/dist/libpython3.11.so.1.0
2  /usr/local/aws-cli/v2/2.11.6/dist/aws

And:

fapolicyd-cli --list

Which gave me:

-> %languages=application/x-bytecode.ocaml,application/x-bytecode.python,
application/java-archive,text/x-java,application/x-java-applet,application/javascript,
text/javascript,text/x-awk,text/x-gawk,text/x-lisp,application/x-elc,text/x-lua,
text/x-m4,text/x-nftables,text/x-perl,text/x-php,text/x-python,text/x-R,text/x-ruby,
text/x-script.guile,text/x-tcl,text/x-luatex,text/x-systemtap
 1. allow perm=any uid=0 : dir=/var/tmp/
 2. allow perm=any uid=0 trust=1 : all
 3. allow perm=open exe=/usr/bin/rpm : all
 4. allow perm=open exe=/usr/libexec/platform-python3.6 comm=dnf : all
 5. deny_audit perm=any pattern=ld_so : all
 6. deny_audit perm=any all : ftype=application/x-bad-elf
 7. allow perm=open all : ftype=application/x-sharedlib trust=1
 8. deny_audit perm=open all : ftype=application/x-sharedlib
 9. allow perm=execute all : trust=1
10. allow perm=open all : ftype=%languages trust=1
11. deny_audit perm=any all : ftype=%languages
12. allow perm=any all : ftype=text/x-shellscript
13. allow perm=any all : dir=/usr/local/aws-cli/v2/ type=application/x-sharedlib trust 1
14. allow perm=any all : dir=/usr/local/aws-cli/v2/ type=application/x-executable trust 1
15. deny_audit perm=execute all : all
16. allow perm=open all : all

Which told me why the binary was at least trying to work, but was unable to load its shared-library. Since I'd named the file `/etc/fapolicyd/rules.d/80-aws.rules`, it was loading later than a rule that was preventing access to shared libraries not in standard trust-paths. In the above, it was the file that created rule #8.

I grepped through the `/etc/fapolicyd/rules.d/` directory looking for the file that created rule #8. Having found it, I moved my rule-file upwards with a quick `mv /etc/fapolicyd/rules.d/{8,3}0-aws.rules` and reloaded my rules. This time, my rules-list came up like:

-> %languages=application/x-bytecode.ocaml,application/x-bytecode.python,
application/java-archive,text/x-java,application/x-java-applet,application/javascript,
text/javascript,text/x-awk,text/x-gawk,text/x-lisp,application/x-elc,text/x-lua,
text/x-m4,text/x-nftables,text/x-perl,text/x-php,text/x-python,text/x-R,text/x-ruby,
text/x-script.guile,text/x-tcl,text/x-luatex,text/x-systemtap
 1. allow perm=any uid=0 : dir=/var/tmp/
 2. allow perm=any uid=0 trust=1 : all
 3. allow perm=open exe=/usr/bin/rpm : all
 4. allow perm=open exe=/usr/libexec/platform-python3.6 comm=dnf : all
 5. allow perm=any all : dir=/usr/local/aws-cli/v2/ type=application/x-sharedlib trust 1
 6. allow perm=any all : dir=/usr/local/aws-cli/v2/ type=application/x-executable trust 1
 7. deny_audit perm=any pattern=ld_so : all
 8. deny_audit perm=any all : ftype=application/x-bad-elf
 9. allow perm=open all : ftype=application/x-sharedlib trust=1
10. deny_audit perm=open all : ftype=application/x-sharedlib
11. allow perm=execute all : trust=1
12. allow perm=open all : ftype=%languages trust=1
13. deny_audit perm=any all : ftype=%languages
14. allow perm=any all : ftype=text/x-shellscript
15. deny_audit perm=execute all : all
16. allow perm=open all : all

With my sharedlib allow-rule now ahead of the default-deny sharedlib rule, I tested out the AWS CLI command again. Success!

Unfortunately, while I solved the problem I set out to solve, my `ausearch` output was telling me that a few other standard tools were also likely having similar whitelisting issues. Ironically, those "other standard tools" are all security-related.

Fun fact: a number of security-vendors write their products for Windows, first and foremost. Their Linux tooling is almost an afterthought. As such, it's often not well-delivered: if they deliver their software in RPMs at all, the RPMs are often poorly-constructed. I almost never see signed RPMs from security vendors. When I do actually get signed RPMs, they're very rarely signed in a way that's compatible with a Red Hat system that's configured to run in FIPS-mode. So, I guess I shouldn't be super surprised that these same security-tools aren't aware of the need for, or how, to work with fapolicyd. Oh well, that's someone else's problem (realistically, probably "future me's").