
Wednesday, August 28, 2024

Getting the Most Out of EC2 Transfers With Private S3 Endpoints

Recently, I was given a project to help a customer migrate an on-premises GitLab installation into AWS. The current GitLab was pretty large: a full export of the configuration was nearly 500GiB in size.

It turned out a good chunk of that 500GiB was due to disk-hosted artifacts and LFS objects. Since I was putting it all into AWS, I opted to make use of GitLab's ability to store BLOBs in S3. Ultimately, that turned out to be nearly 8,000 LFS objects and nearly 150,000 artifacts (plus several hundred "uploads").

The first challenge was getting the on-premises data onto my EC2 instance. The customer didn't want to give me access to their on-premises network, so I needed to have them generate the export TAR-file and upload it to S3. Once it was in S3, I needed to get it onto the EC2.

Wanting to make sure that the S3→EC2 task was as quick as possible, I selected an instance-type rated to 12.5Gbps of network bandwidth and 10Gbps of EBS bandwidth. However, my first attempt at downloading the TAR-file from S3 took nearly an hour to run: it was barely creeping along at 120MiB/s. Abysmal.

I broke out `iostat` and found that my target EBS volume was reporting 100% utilization and a bit less than 125MiB/s of average throughput. That seemed "off" to me, so I looked at the EBS volume's settings. It was then that I noticed that the default volume-throughput was only 125MiB/s. So, I upped the setting to its maximum: 1000MiB/s. I re-ran the transfer only to find that, while the transfer-speed had improved, it had only improved to a shade under 150MiB/s. Still abysmal.
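
For reference, bumping a gp3 volume's provisioned throughput is an online operation; a minimal sketch with the AWS CLI (the volume ID below is a placeholder):

# Raise the volume's provisioned throughput to the gp3 maximum of 1,000MiB/s
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --throughput 1000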

So, I started rifling through the AWS documentation to see what CLI settings I could change to improve things. First mods were:

max_concurrent_requests = 40
multipart_chunksize = 10MB
multipart_threshold = 10MB
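
For anyone following along at home, those values live under the `s3` section of the CLI's configuration file; one way to set them from the shell (assuming the default profile) is:

aws configure set default.s3.max_concurrent_requests 40
aws configure set default.s3.multipart_chunksize 10MB
aws configure set default.s3.multipart_threshold 10MB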

This didn't really make much difference. `iostat` was showing really variable utilization-numbers, but mostly that my target-disk was all but idle. Similarly, `netstat` was showing only a handful of simultaneous-streams between my EC2 and S3.

Contacted AWS support. They let me know that S3 multi-part uploads and downloads are limited to 10,000 parts. So, I did the math (<FILE_SIZE> / <MAX_CHUNKS>) and changed the above to:

max_concurrent_requests = 40
multipart_chunksize = 55MB
multipart_threshold = 64MB
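
For the curious, the arithmetic behind that chunk-size, using the roughly-500GiB figure from earlier:

# ~500GiB export divided by S3's 10,000-part multipart limit
echo "$(( (500 * 1024) / 10000 ))MiB minimum chunk-size"   # => 51MiB, so 55MB leaves headroom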

This time, the transfers were running about 220-250MiB/s. While that was a 46% throughput increase, it was still abysmal. While `netstat` was finally showing the expected number of simultaneous connections, my `iostat` was still saying that my EBS was mostly idle.

Reached back out to AWS support. They had the further suggestion of adding the following to my S3 configuration:

preferred_transfer_client = crt
target_bandwidth = 10GB/s

Re-ran my test and was getting ≈990MiB/s of continuous throughput for the transfer! This knocked the transfer time down from fifty-five minutes to a shade over eight minutes. In other words, I was going to be able to knock the better part of an hour off the upcoming migration-task.
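
For completeness, those two additions can also be dropped into place from the shell (a sketch, again assuming the default profile):

aws configure set default.s3.preferred_transfer_client crt
aws configure set default.s3.target_bandwidth 10GB/s

As the update below notes, once `crt` is in play, several of the earlier tuning knobs get ignored anyway.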

In digging back through the documentation, it seems that, when one doesn't specify a `preferred_transfer_client` value, the CLI selects the `classic` (`python`) client. And, depending on your Python version, the performance ranges from merely-horrible to ungodly-bad: using RHEL 9 for my EC2, it was pretty freaking bad, but it had been less-bad when using Amazon Linux for my EC2's OS. Presumably a difference in the two distros' respective Python versions?

Specifying a `preferred_transfer_client` value of `crt` (the AWS Common Runtime client) unleashed the full might and fury of my EC2's and gp3's capabilities.

Interestingly, this "use `classic`" behavior isn't a universal auto-selection. If you've selected an EC2 with any of these instance-types:

  • p4d.24xlarge
  • p4de.24xlarge
  • p5.48xlarge
  • trn1n.32xlarge
  • trn1.32xlarge

The auto-selection gets you `crt`. Not sure why `crt` isn't the auto-selected value for all Nitro-based instance-types. But, "it is what it is".

Side note: just selecting `crt` probably wouldn't have completely roided-out the transfer on its own. I assume the further setting of `target_bandwidth` to `10GB/s` is what fully unleashed things: there definitely wasn't much bandwidth left over for me to actually monitor the transfer. I assume the `target_bandwidth` parameter has a default value that's less than "all the bandwidth"; however, I didn't actually bother to verify that.

Update: 

After asking support "why isn't `crt` the default for more instance-types", I got back the reply:

Thank you for your response. I see that these particular P5, P4d and Trn1 instances are purpose-built for high-performance ML training. Hence I assume the throughput needed for these ML applications needs to be high and CRT is auto-enabled for these instance types.

Currently, the CRT transfer client does not support all of the functionality available in the classic transfer client.

These are a few limitations for CRT configurations:

  • Region redirects - Transfers fail for requests sent to a region that does not match the region of the targeted S3 bucket.
  • max_concurrent_requests, max_queue_size, multipart_threshold, and max_bandwidth configuration values - Ignores these configuration values.
  • S3 to S3 copies - Falls back to using the classic transfer client.


All of which is to say that, once I set `preferred_transfer_client = crt`, most of my other, prior settings got ignored.

Thursday, May 7, 2020

Punishing Network Performance

Recently, I took delivery of a new laptop. My old laptop was rolling up on five years old, was still running Windows 7 and had less than 1GiB of free disk space. So, "it was time".

At any rate, the new laptop runs Windows 10 Pro. One of the nice things about this is that, because Microsoft makes Hyper-V available for free on my OS, I no longer have to deal with VirtualBox (or any add-on virtualization solution, for that matter). It means I'm free of the endless parade of "you must reboot to install new VirtualBox drivers" prompts that pretty much caused me to give up on local virtualization on my prior laptop.

One of the nice things with virtualization is that it lets me run Linux, locally. While WSL/WSL2 show promise, they aren't quite there yet. Similarly, I'm not super chuffed that Docker Desktop for Windows requires me to run Docker as root. So, my Linux VM is an EL8 host on which I can run Podman (and run my containers as an unprivileged user within that VM).

That said, most of the "how to run Linux on Hyper-V on Windows 10" guides I found resulted in my networking getting completely and totally jacked up. When not jacked up, my Internet connection looks like:

[speed-test screenshot of the normal connection speeds]

However, after initially booting my CentOS VM, I noticed that my yum downloads were running stoopid-slow. I assumed the problem was constrained to the VM. But, being detail-oriented, I checked the host OS's network speed. It was, uh, "not good": less than 4Mbps down and about 0.2Mbps up. I also found that, if I rebooted my system, my download speeds would come back, but my uploads still sucked. Lastly, I found that if I wholly de-configured Hyper-V, my networking returned to normal.

Sleuthing time.

I hit the Googs and everyone was saying "this is a known issue: you can fix it by changing your NIC settings." Unfortunately, the settings in question weren't available for my NIC. Dunno if it's because I'm using WiFi or what. All I knew was that it was looking like my choices were going to be "have good Internet speeds" or "have Linux on Hyper-V".

I'm not real keen on such choices. So, I kept digging – eventually getting to the point where I went beyond the first page of results. Ultimately, I found an answer on one of the StackExchange forums. Interestingly, the answer I found wasn't even marked as the "best answer". Had I not read through all the responses on one thread, I wouldn't even have found it.

At any rate, the fix was to not use my NIC in bridged mode at all. Instead of following the top-match guides' instructions to use an "External" vSwitch (which puts your NIC into bridged mode), I needed to use an "Internal" vSwitch ...and then set up connection-sharing, allowing my private vSwitch to share (rather than bridge) through my WiFi adapter. As soon as I made that change, my speeds came back.

Wednesday, September 19, 2018

Exposed By Time and Use

Last year, a project with both high complexity and high urgency was dropped in my lap. Worse, the project was, to be charitable, not well specified. It was basically, "we need you to automate the deployment of these six services and we need to have something demo-able within thirty days".

Naturally, the six services were ones that I'd never worked with from the standpoint of installing or administering them. Thus, there was a bit of a learning curve around how best to automate things, one that wasn't aided by the paucity of "here's the desired end-state" or other related details. All they really knew was:

  • Automate the deployment into AWS
  • Make sure it works on our customized/hardened build
  • Make sure that it backs itself up in case things blow up
  • GO!

I took my shot at the problem. I met the deadline. Obviously, the results were not exactly "optimal" — especially from the "turn it over to others to maintain" standpoint. Naturally, after I turned over the initial capability to the requester, they went radio-silent for a couple of weeks.

When they finally replied, it was to let me know that the deadline had been extended by several months. So, I opted to use that time to make the automation a bit friendlier for the uninitiated to use. That's mostly irrelevant here — just more "background".

At any rate, we're now nearly a year-and-a-half removed from that initial rush-job. And, while I've improved the ease of use for the automation (it's been turned over to others for daily care-and-feeding), much of the underlying logic hasn't been revisited.

Over that kind of span, time tends to expose a given solution's shortcomings. Recently, they were attempting to do a parallel upgrade of one of the services and found that the data-move portion of the solution was resulting in build-timeouts. Turns out the size of the dataset being backed up (and recovered from as part of the automated migration process) had exploded. Because I'd set up the backups to operate incrementally, the increase in raw transfer times had been hidden.

The incremental backups were only taking a few minutes; however, restores of the dataset were taking upwards of 35 minutes. The build-automation was set to time out at 15 minutes (early in the service-deployment, a similar operation took 3-7 minutes). So, step one was to adjust the automation's timeouts to make allowances for the new restore-time realities. Step two was to investigate why the restores were so slow.

The quick-n-dirty backup method I'd opted for was a simple `aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>`. It was a dead-simple way to "sweep" the contents of the /<APPLICATION_HOME_DIR>/ directory to S3. And, because the `sync` subcommand only copies new or changed files, the cron-managed backups were taking the same couple of minutes each day that they were a year-plus ago.
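
The cron side of that sweep was nothing fancy; something along these lines (a sketch in /etc/cron.d format, with an illustrative schedule, user and log path):

# Nightly incremental sweep of the application's home directory to S3
30 1 * * * root aws s3 sync --delete /<APPLICATION_HOME_DIR>/ s3://<BUCKET>/<FOLDER>/ >> /var/log/app-backup.log 2>&1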

Fun fact about S3 and its transfer performance: if the objects you're uploading have keys with high degrees of commonality, transfer performance will become abysmal.

You may be asking why I mention "keys" since I've not mentioned encryption. S3, being an object store, doesn't have the hierarchical layout of legacy, host-oriented storage. If I take a file from host-oriented storage and use the S3 CLI utility to copy that file to S3 via its fully-qualified pathname, the object created in S3 will look like:
<FULLY>/<QUALIFIED>/<PATH>/<FILE>
Of the above, the full string "<FULLY>/<QUALIFIED>/<PATH>/<FILE>" is the object's key, and "<FULLY>/<QUALIFIED>/<PATH>/" is its prefix. Because S3 partitions its request-handling capacity by prefix, if you have a few thousand objects with the same or sufficiently-similar prefixes, you'll run into that "transfer performance will become abysmal" issue mentioned earlier.

We very definitely did run into that problem. HARD. The dataset in question is (currently) a skosh less than 11GiB in size. The instance being backed up is rated for about 0.45Gbps of sustained network throughput, so we were expecting that dataset to take only a few minutes to transfer. However, as noted above, it was taking 35+ minutes to do so.
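
Back-of-the-envelope, using the numbers above and ignoring protocol overhead, the expectation was roughly:

# ~11GiB at ~0.45Gbps, expressed in minutes
awk 'BEGIN { printf "%.1f minutes\n", (11 * 2^30 * 8) / (0.45 * 10^9) / 60 }'   # => 3.5 minutes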

So, how to fix? One of the easier methods is to stream your backups to a single file: a quick series of benchmarking-runs showed that doing so cut the transfer from over 35 minutes to under five. Alternatively, one can iterate over all the files in the dataset and copy them into S3 individually, using either randomized filenames (setting the "real" fully-qualified path as an attribute/tag on each object) or simply reversed path-names (doing something like `S3NAME=$( echo "${REAL_PATHNAME}" | perl -lne 'print join "/", reverse split /\//;' )` and storing under that name); either way, performance goes up dramatically.
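
A minimal sketch of that single-file approach (placeholders as above; `aws s3 cp` can read a stream from stdin, though very large streams may also want `--expected-size` so the multipart math works out):

# Stream the whole application directory to S3 as a single object
tar -cf - /<APPLICATION_HOME_DIR>/ | \
  aws s3 cp - s3://<BUCKET>/<FOLDER>/backup-$(date +%F).tar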

I'll likely end up doing one of those three methods ...once I have enough contiguous time to allocate to re-engineering the backup and restore/rebuild methods.