
Tuesday, May 7, 2019

Crib-Notes: Offline Delta-Syncs of S3 Buckets

In the normal world, synchronizing two buckets is as simple as doing `aws s3 sync <SYNC_OPTIONS> <SOURCE_BUCKET> <DESTINATION_BUCKET>`. However, due to the information security needs of some of my customers, it's occasionally necessary to synchronize data between two S3 buckets using methods that amount to "offline" transfers.
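
For reference, a concrete "online" invocation might look like the following (the bucket names are hypothetical; the optional --delete flag removes destination-objects that no longer exist in the source):

aws s3 sync --delete s3://example-source-bucket/ s3://example-destination-bucket/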

To illustrate what is meant by "offline":
  1. Create a transfer-archive from a data source
  2. Copy the transfer-archive across a security boundary
  3. Unpack the transfer-archive to its final destination

Note that the actual process is a bit more involved than this summary, but it gives you the gist of the major effort-points.
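
As a rough sketch (the paths below are placeholders, and the boundary-crossing step varies wildly by environment), the full workflow looks something like this:

# 1. Create a transfer-archive from a local copy of the data source
aws s3 sync s3://<SOURCE_BUCKET>/ /tmp/staging/
tar -czf /tmp/bucket-transfer.tar.gz -C /tmp/staging .

# 2. Copy the transfer-archive across the security boundary
#    (removable media, data-diode, cross-domain guard, etc.)

# 3. Unpack the transfer-archive and push it to its final destination
mkdir -p /tmp/staging && tar -xzf /tmp/bucket-transfer.tar.gz -C /tmp/staging
aws s3 sync /tmp/staging/ s3://<DESTINATION_BUCKET>/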

The first time you do an offline bucket sync, transferring the entirety of a bucket is typically the goal. However, for a refresh-sync, particularly of a bucket with more than a trivial amount of content, this can be far from ideal. For example, it might be necessary to do monthly syncs of a bucket that grows by a few GiB per month. After a year, a full sync can mean having to move tens to hundreds of GiB. A better way is to sync only the deltas, copying only what's changed between the current and immediately-prior sync-tasks (a few GiB rather than tens to hundreds).

The AWS CLI tools don't really have a "sync only the files that have been added/modified since <DATE>" option. That said, it's not super difficult to work around that gap. A simple shell script like the following does the trick:

# Iterate only the objects whose modification-date (column 1 of the
# `aws s3 ls` output) is newer than the date of the previous sync
for FILE in $( aws s3 ls --recursive s3://<SOURCE_BUCKET>/ | \
   awk '$1 > "2019-03-01 00:00:00" {print $4}' )
do
   echo "Downloading ${FILE}"
   # Stream the object to STDOUT and let install write it (mode 000644) to
   # the staging-area, creating any missing parent-directories as it goes
   install -bDm 000644 <( aws s3 cp "s3://<SOURCE_BUCKET>/${FILE}" - ) \
     "<STAGING_DIR>/${FILE}"
done

To explain the above:

  1. Create a list of files to iterate:
    1. Invoke a subprocess using the $() notation. Within that subprocess...
    2. Invoke the AWS CLI's S3 module to recursively list the source-bucket's contents (`aws s3 ls --recursive`)
    3. Pipe the output to `awk`, keeping only the lines whose first output-column (the file-modification date) is newer than the cutoff date-string, and printing only the fourth column (the S3 object-path)
    The output from the subprocess is captured as an iterable list-structure
  2. Use a for loop to iterate the previously-assembled list, assigning each S3 object-path to the ${FILE} variable
  3. Since I hate sending programs off to do things in silence (I don't trust them to not hang), my first looped-command is to say what's happening via the echo "Downloading ${FILE}" directive.
  4. The install line makes use of some niftiness within both BASH and the AWS CLI's S3 command:
    1. By specifying "-" as the "destination" for the file-copy operation, you tell the S3 command to write the fetched object-contents to STDOUT.
    2. BASH allows you to take a stream of output and assign a file-handle to it by surrounding the output-producing command with <( ) (process substitution).
    3. Invoking the install command with the -D flag tells the command to create all necessary path-elements to place the source "file" in the desired location within the filesystem, even if none of the intervening directory-structure exists yet (the -b flag backs up any pre-existing file at the destination, and -m sets the new file's permission-mode).
    Putting it all together, the install operation takes the streamed s3 cp output and installs it as a file (with mode 000644) at the location derived from the STAGING_DIR plus the S3 object-path, thus preserving the SOURCE_BUCKET's content-structure within the STAGING_DIR.
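
If the file-handle trick seems opaque, here's a toy illustration of the same pattern outside of S3 (the paths are purely for demonstration):

install -Dm 000644 <( date ) /tmp/demo/some/deep/path/timestamp.txt
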
Obviously, this method really only works for additive/substitutive deltas. If you need to account for deletions and/or moves, this approach will be insufficient.

Wednesday, November 23, 2016

Manually Mirroring Git Repos

If you're like me, you've had occasions where you need to replicate Git-managed projects from one Git service to another. If you're not like me, you use paid services that make such mundane tasks a matter of clicking a few buttons in a GUI. If, however, you need to copy projects from one repository-service to another and no one has paid to make a GUI button/config-page available to you, then you need to find other methods to get things done.

The following assumes that you have a git project hosted in one repository service (e.g., GitHub) that you wish to mirror to another repository service (e.g., BitBucket, AWS CodeCommit, etc.). The basic workflow looks like the following:

Procedure Outline:

  1. Login to a git-enabled host
  2. Create a copy of your "source-of-truth" repository, depositing its contents into a staging-directory:
    git clone --mirror \
       <REPOSITORY_USER>@<REPOSITORY1.DNS.NAME>:<PROJECT_USER_OR_GROUP>/<PROJECT_NAME>.git \
       stage
  3. Navigate into the staging-directory:
    cd stage
  4. Set the push-destination to the copy-repository:
    git remote set-url --push origin \
       <REPOSITORY_USER>@<REPOSITORY2.DNS.NAME>:<PROJECT_USER_OR_GROUP>/<PROJECT_NAME>.git
  5. Ensure the staging-directory's data is still up to date:
    git fetch -p origin
  6. Push the copied source-repository's data to the copy-repository:
    git push --mirror

Using an example configuration (the AMIgen6 project):
$ git clone --mirror git@github.com:ferricoxide/AMIgen6.git stage && \
  cd stage && \
  git remote set-url --push origin git@bitbucket.org:ferricoxide/amigen6-copy.git && \
  git fetch -p origin && \
  git push --mirror
Cloning into bare repository 'stage'...
remote: Counting objects: 789, done.
remote: Total 789 (delta 0), reused 0 (delta 0), pack-reused 789
Receiving objects: 100% (789/789), 83.72 MiB | 979.00 KiB/s, done.
Resolving deltas: 100% (409/409), done.
Checking connectivity... done.
Counting objects: 789, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (369/369), done.
Writing objects: 100% (789/789), 83.72 MiB | 693.00 KiB/s, done.
Total 789 (delta 409), reused 789 (delta 409)
To git@bitbucket.org:ferricoxide/amigen6-copy.git
 * [new branch]      ExtraRPMs -> ExtraRPMs
 * [new branch]      SELuser-fix -> SELuser-fix
 * [new branch]      master -> master
 * [new branch]      refs/pull/38/head -> refs/pull/38/head
 * [new branch]      refs/pull/39/head -> refs/pull/39/head
 * [new branch]      refs/pull/40/head -> refs/pull/40/head
 * [new branch]      refs/pull/41/head -> refs/pull/41/head
 * [new branch]      refs/pull/42/head -> refs/pull/42/head
 * [new branch]      refs/pull/43/head -> refs/pull/43/head
 * [new branch]      refs/pull/44/head -> refs/pull/44/head
 * [new branch]      refs/pull/52/head -> refs/pull/52/head
 * [new branch]      refs/pull/53/head -> refs/pull/53/head
 * [new branch]      refs/pull/54/head -> refs/pull/54/head
 * [new branch]      refs/pull/55/head -> refs/pull/55/head
 * [new branch]      refs/pull/56/head -> refs/pull/56/head
 * [new branch]      refs/pull/57/head -> refs/pull/57/head
 * [new branch]      refs/pull/62/head -> refs/pull/62/head
 * [new branch]      refs/pull/64/head -> refs/pull/64/head
 * [new branch]      refs/pull/65/head -> refs/pull/65/head
 * [new branch]      refs/pull/66/head -> refs/pull/66/head
 * [new branch]      refs/pull/68/head -> refs/pull/68/head
 * [new branch]      refs/pull/71/head -> refs/pull/71/head
 * [new branch]      refs/pull/73/head -> refs/pull/73/head
 * [new branch]      refs/pull/76/head -> refs/pull/76/head
 * [new branch]      refs/pull/77/head -> refs/pull/77/head

Updating (and Automating)

To keep your copy-repository's project in sync with your source-repository's project, periodically do:
cd stage && \
  git fetch -p origin && \
  git push --mirror
This can be accomplished by logging into a host and executing the steps manually, or by placing them into a cron job.
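
For example, a (hypothetical) crontab entry that re-syncs the mirror nightly at 02:30 might look like:
30 2 * * * cd /path/to/stage && git fetch -p origin && git push --mirror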