Monday, October 21, 2013

Visualizing Deduplication

The enterprise I work for makes heavy use of deduplicating storage arrays for storing backup data. If you're frequently backing up the same data over and over again (e.g., multiple backup clients all running the same OS, weekly full backups of low change-rate application data-sets, etc.), deduplication can save you a ton of space.

In general, most people in the organization don't really bother to think how things work. It's just magic on the backend that allows their data to be backed up. They don't really grok that the hundreds of TiB of blocks they've pushed at the array actually only took up a few tens of TiB. Unfortunately, like any storage device, deduplicating storage arrays can run out of space. When they do, people's lack of understanding of what goes on under the covers really comes to the fore. You'll hear plaintive cries of "but I deleted 200TiB - what do you mean there's only 800GiB free, now??"

Given our data storage tendencies, if we realize a 10% return on the amount deleted at the backup server, it's a good day. Anything more than that is gravy. Trying to verbally explain this to folks who don't understand what's going on at the arrays is mostly a vain effort.

To try to make things a bit more understandable, I came up with three simplified illustrations of array utilization patterns with respect to backup images and the nature of the application's data change patterns. Each of the illustrations should be interpreted to represent data sets that are backed up. Each box represents a same-sized chunk of data (e.g., 200GiB). Each letter represents a unique segment of data within a larger data-set. Each row represents a backup job - with time being represented oldest to most recent ordered top to bottom. Each column represents a copy of a unique chunk of data.

All of what follows is massive over-simplification and ignores the effects of file-compression and other factors that can contribute to space-used on or reclaimed from an array.

100% Data Change Rate:
100% Change-rate (full backup to full backup)
The above diagram attempts to illustrate a data set that changes 100% - with no growth in the data set-size - between full backups. A given full backup is represented by five lettered-boxes on a single line. Incremental backups are represented by single lettered-boxes on a single line.

The application's full data-set is represented by the five boxes labeled "A" through "E". Changes to a given segment are signified by using the prime symbol. That is, "A'" indicates that data-segment "A" has experienced a change to every block of its contents - presumably sufficient change to prevent the storage array from being able to deduplicate that data either against the original data-segment "A" or any of the other lettered data-segments.

This diagram illustrates that, between the time the first and the last full backup has occurred, all of the data-segments have changed. Further, the changes happened such that one (and only one) full data-segment changed before a given incremental backup job was run.

Given these assumptions about the characteristics of the backup data:

  • If one were to erase only the original full backup from the backup server, up to 100% of the blocks it consumed at the array might be returned to the pool of available blocks.
  • If one were to erase the original full backup and any (up to all) of the incrementals, the only blocks that would be returned to the pool of available blocks would be the ones associated with the initial full backup; none of the blocks associated with the incremental backups would be returned. The greater the number of incrementals deleted, the lower the overall yield of blocks returned to the available-blocks pool. Thus, if all of the backup images - except for the most recent full backup - were deleted, the maximum return-yield might be as high as 50% of the blocks erased at the backup host.

This would be a scenario where you might hear the "but I deleted xGiB of data and you're telling me only half that got freed up??"

100% Growth Rate:
The above diagram attempts to illustrate a data set that grows by 100% between full backups - growing by 20% of the original set-size between each backup window. A given full backup is represented by five or more lettered-boxes on a single line. Incremental backups are represented by single lettered-boxes on a single line.

The above is kind of ugly. Assuming no deduplicability between data segments, each backup job will consume additional space on the storage array. Assuming 200GiB segments, the first full backup would take up 1TiB of space on the array. Each incremental would consume a further 200GiB. By the time the last incremental has run, 2TiB of space will be consumed in the array.

That ugliness aside, when the second full backup is run, almost no additional disk space will be consumed: the full backup's segments would be duplicates of data already stored by all of the prior jobs.

However, because the second full backup duplicates all prior data, erasing the first full backup and/or any (or all) of the incrementals would result in a 0% yield of the deleted blocks being returned to the available blocks pool. In this scenario, the person deleting data on their backup server will be incredulous when you tell them "that 2TiB of image data you deleted freed up no additional space on the array". Unfortunately, they won't really care that the 4TiB of blocks that they'd sent to the array only actually consumed 2TiB of space on the array.

20% Data Change Rate:
The above diagram attempts to illustrate a data set in which 20% of the data changes continuously - with no growth in the data set-size - between full backups. Further, only one chunk of data is changing - 80% of the original data-set remains completely unchanged between backups. A given full backup is represented by five lettered-boxes on a single line. Incremental backups are represented by single lettered-boxes on a single line.

The application's full data-set is represented by the five boxes labeled "A" through "E". Changes to a given segment are signified by using the prime symbol. That is, "A'" indicates that data-segment "A" has experienced a change to every block of its contents - presumably sufficient change to prevent the storage array from being able to deduplicate that data either against the original data-segment "A" or any of the other lettered data-segments.

Assuming 200GiB data-segments, the first full backup plus the incrementals would have resulted in 2TiB having been sent to the array. The total space consumed on the array would be similar.

If the first full backup is erased, of the 1TiB erased, up to 200GiB would be returned to the free blocks pool. This is because the first full backup and the second full backup overlap by 800GiB worth of deduplicated (non-unique) data.

In the above image, each of the first four incrementals has unique data not found in the second full backup. The fifth incremental's data is incorporated into the last full backup. Thus, deleting each of the first four incrementals may return up to 200GiB of blocks to the free blocks pool. Deletion of the last incremental will result in no blocks being returned to the free blocks pool. By extension, if the first full backup plus all of the incrementals are deleted, though 2TiB of data would be deleted at the backup server, only 1TiB would be returned to the free block pool.
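
To make the arithmetic concrete, here's a quick back-of-the-envelope sketch of that 20% change-rate scenario, using the same hypothetical 200GiB segment-size (the segment counts come straight from the scenario described above):

SEGMENT_GIB=200
DELETED=$(( (5 + 5) * SEGMENT_GIB ))    # first full backup (5 segments) plus 5 incrementals
RETURNED=$(( (1 + 4) * SEGMENT_GIB ))   # unique "A" from the first full plus 4 unique incrementals
echo "Deleted at the backup server: ${DELETED}GiB"
echo "Returned to the free pool:    ${RETURNED}GiB ($(( RETURNED * 100 / DELETED ))%)"

Pasted into a shell, that prints roughly 2TiB deleted against 1TiB (50%) returned - exactly the mismatch that generates the trouble tickets.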


Each of the above illustrations aside, unless you're undertaking the very expensive practice of allocating one deduplication pool per application/data-set, the deletion-yields tend to become even more appalling. The compression and deduplication algorithms in your array may have found storage optimizations across data sets. The end result is that your deduplication ratios will go down, but the actual blocks consumed on the array won't shrink nearly as quickly as the folks deleting data might hope.

Similarly poor deletion-yields will occur when your backup system is keeping more than just two full backups. This is because the aggregate overlap across backup sets grows as the number of retained full backups increases.

Wednesday, September 25, 2013

The Case of the Broken ARP (In Progress)

Recently, I was tasked with a project that, as part of the preparation-phase, required patching up a whole bunch of servers to the same patch-level. In total, I patched about 20 systems that were dual-homed and equipped with asymmetrical, 10/1 active/passive bonds. All but the last system went aces. The last system ...was weird.

After patching the final system, its secondary network-pair was no longer able to talk on the network. While diagnosing, I found that attempts to ping any hosts on the local LAN segment resulted in "host unreachable" errors. Any host I tried to ping ended up in the afflicted host's ARP table with a missing ("<incomplete>") MAC address entry.

Our builds are normally fairly locked down. That means, many troubleshooting tools (such as tcpdump) are not loaded. At my last straw, I opted to temporarily load tcpdump to see what, if anything, the afflicted bond (and sub-interfaces) were seeing on the network. Interestingly, as soon as I started snooping either the parent bond or the active interface, networking activity became "normal" and the previously "<incomplete>" ARP table entries populated with the other hosts' MAC addresses. As soon as I stopped my tcpdump runs, networking reverted to its broken state and the other hosts' MAC address entries in the ARP table returned to "<incomplete>".
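
Should anyone want to reproduce the symptoms, the checks described above boil down to something like the following (a rough sketch - "some-lan-host" is a stand-in for any neighbor on the affected segment):

arp -an | grep -i incomplete     # neighbors stuck without a resolved MAC address
ping -c 3 some-lan-host          # fails with "host unreachable" while the bond is broken
tcpdump -ni bond1 arp            # while this runs, ARP resolution (and pings) spring back to life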

Still don't have a fix - this is a "work in progress" article, at the moment. What I ended up doing as a workaround - since this server is a critical infrastructure component - was an `ifconfig bond1 promisc`, then updating its config file under /etc/sysconfig/network-scripts to preserve that state should the system reboot before I find a more suitable fix. So, for right now, in order to get this one bond (of two on the system) to work, I need to leave it in promiscuous mode.

Obviously, I have our networking guys looking at the switches to see if there's a difference between how the ports for bond0 and bond1 are configured. I figure, it has to be the network, since: A) one bond works but the other doesn't; and, B) no promiscuous-mode changes were required for any of the other hosts that were patched.

At any rate, if you happen to stumble on this article before I get it beyond a "work in progress" state, please feel free to comment if you know a likely fix.

Tuesday, September 3, 2013

Password Encryption Methods

As a systems administrator, there are times where you have to find programmatic ways to update passwords for accounts. On some operating systems, your user account modification tools don't allow you to easily set passwords in a programmatic fashion. Solaris used to be kind of a pain, in this regard, when it came time to do an operations-wide password reset. Fortunately, Linux is a bit nicer about this.

The Linux `usermod` utility allows you to (relatively easily) specify a password in a programmatic fashion. The one "gotcha" of the utility is the requirement to use hashed password-strings rather than cleartext. The question becomes, "how best to generate those hashes?"

The answer will likely depend on your security requirements. If MD5 hashes are acceptable, then you can use OpenSSL or the `grub-md5-crypt` utilities to generate them. If, however, your security requirements call for SHA256- or even SHA512-based hashes, neither of those utilities will work for you.
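
If MD5 is good enough, the two utilities mentioned above can be driven something like this (the salt "Ay4p" is just the fixed salt re-used in the later SHA512 examples; `grub-md5-crypt` prompts for the password rather than taking it as an argument):

# MD5-crypt hash via OpenSSL, with an explicitly-chosen salt
openssl passwd -1 -salt Ay4p 'Sm4<kT*TheFace'
# MD5-crypt hash via the GRUB helper (interactive password prompt)
grub-md5-crypt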

Newer Linux distributions (e.g. RHEL 6+) essentially replace the `grub-md5-crypt` utility with the `grub-crypt` utility. This utility supports not only the older MD5 that its predecessor supported, but also SHA256 and SHA512.
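
On those newer distributions, generating a stronger hash is a one-liner - something like the following (a sketch; like its predecessor, `grub-crypt` prompts for the cleartext password rather than taking it as an argument):

grub-crypt --sha-256
grub-crypt --sha-512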

However, what do you do when `grub-crypt` is missing (e.g., you're running RedHat 5.x) or you just want one method that will work across different Linux versions (e.g., your operations environment consists of a mix of RHEL 5 and RHEL 6 systems)? While you can use a tool like `openssl` to do the dirty work, if your security requirements dictate an SHA-based hashing algorithm, it's not currently up to the task. If you want the SHAs in a cross-distribution manner, you have to leverage more generic tools like Perl or Python.

The following examples will show you how to create a hashed password-string from the cleartext password "Sm4<kT*TheFace". Some tools (e.g., OpenSSL's "passwd" command) allow you to choose between a fixed salt and a random salt. From the standpoint of being able to tell "did I generate this hash?", using a fixed salt can be useful; however, using a random salt may be marginally more secure. The Perl and Python methods pretty much demand the specification of a salt. In the examples below, the salt I'm using is "Ay4p":
  • Perl (SHA512) Method: `perl -e 'print crypt("Sm4<kT*TheFace", "\$6\$Ay4p\$");'`
  • Python (SHA512) Method: `python -c 'import crypt; print crypt.crypt("Sm4<kT*TheFace", "$6$Ay4p$")'`
Note that you select the encryption-type by specifying a numerical identifier in the salt's "$N$" prefix. The standard encryption types for Linux operating systems (from the crypt() manpage):
  • 1  = MD5
  • 2a = BlowFish (not present in all Linux distributions)
  • 5  = SHA256 (Linux with GlibC 2.7+)
  • 6  = SHA512 (Linux with GlibC 2.7+) 
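
Tying it back to `usermod`: once you have the hash string, it can be applied directly to an account. A minimal sketch, re-using the Python one-liner from above against a hypothetical "someuser" account:

# Generate the SHA512 hash and hand it to usermod in one shot.
# Note: the cleartext (and the hash) will land in your shell history this way.
HASH="$(python -c 'import crypt; print crypt.crypt("Sm4<kT*TheFace", "$6$Ay4p$")')"
usermod -p "${HASH}" someuser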

    Tuesday, May 28, 2013

    Fixing CIFS ACLs On a DataDomain

    On my current project, our customer makes use of DataDomain storage to provide nearline backups of system data. Backups to these devices are primarily done through tools like Veeam (for ESX-hosted virtual machines and associated application data) and NetBackup.

    However, a small percentage of the tenants my customer hosts are "self backup" tenants. In this case, "self backup" means that, rather than leveraging the enterprise backup frameworks, the tenants are given an NFS or CIFS share directly off of the DataDomain that is: A) closest to their system(s); and, B) has the most free space to accommodate their data.

    "Self backup" tenants that use NFS shares are minimally problematic. Most of the backup problems come from the fact that DataDomains weren't really designed for multi-tenancy. Things like quota controls are fairly lacking. So, it's possible for a tennants of a shared DataDomain to screw each other over by either soaking up all of the device's bandwidth or soaking up all the space.

    Still, those problems aside, providing NFS service to tenants is fairly straight-forward. You create a directory, you go into the NFS share-export interface, create the share and access controls and call it a day. CIFS shares, on the other hand...

    While we'd assumed that providing CIFS service would be on a par with providing NFS service, it's proven to be otherwise. While the DataDomains provide an NTFS-style ACL capability in their filesystem, it hasn't proven to work quite as one might expect.

    The interface for creating shares allows you to set share-level access controls based on both calling-host as well as assigned users and/or groups. One would reasonably assume that this would mean that the correct way to set up a share is to export it with appropriate client-allow lists and user/group-allow lists and that the shares would be set with appropriate filesystem permissions automagically. This isn't exactly how it's turned out to work.

    What we've discovered is that you pretty much have to set the shares up as being universally accessible from all CIFS clients and grant global "full control" access to the top-level share-folder. Normally, this would be a nightmare, but, once created, you can lock the shares down. You just have to manage the NTFS attributes from a Windows-based host. Basically, you create the share, present it to a Windows-based administrative host, then use the Windows folder security tools to modify the permissions on the share (e.g., remove all the "Everyone" rights, then manually assign appropriate ownerships and POSIX groups to the folder and set up the correct DACLs).

    From an engineering perspective, it means that you have to document the hell out of things and try your best to train the ops folks on how to do things The Right Way™. Then, with frequent turnovers in Operations and other "shit happens" kind of things, you have to go back and periodically audit configurations for correctness and repair the brokenness that has crept in.

    Unfortunately, one of the biggest sources of brokenness that creeps in is broken permissions structures. When doing the initial folder-setup, it's absolutely critical that the person setting up the folder remembers to click the "Replace all child object permissions with inheritable permissions from this object" checkbox (accessed by clicking on the "Change Permissions" button within the "Advanced Security Settings" section for the folder). Failure to do so means that each folder, subfolder and file created (by tenants) in the share has its own, tenant-created permissions structure. What this results in is a share whose permissions are not easily maintainable by the array-operators. Ultimately, it results in trouble tickets opened by tenants whose applications and/or operational folks eventually break access for themselves.

    Once those tickets come in, there's not much that can be easily done if the person who "owns" the share has left the organization. If you find yourself needing to fix such a situation, you need to either involve DataDomain's support staff (assuming your environment is reachable via a WebEx-type of support session) or get someone to slip you instructions on how to access the array's "Engineering Mode".

    There are actually two engineering modes: the regular SE shell and the BASH shell. The SE shell is basically a super-set of the regular system administration CLI. The BASH shell is basically a Linux BASH shell with DataDomain-specific management commands enabled. For the most part, the two modes are interchangeable. However, if you need the ability to do mass modifications or script on your array, you'll need to access the DataDomain's BASH shell mode to do it. See my prior article on accessing the DataDomain's BASH shell mode.

    Once you've gotten the engineering BASH shell, you have pretty much unfettered access to the guts of the DataDomain. The BASH shell is pretty much the same as you'd encounter on a stock Linux system. Most of the GNU utilities you're used to using will be there and will work the same way they do on Linux. You won't have man pages, so, if you forget flags to a given shell command, look them up on a Linux host that has the man pages installed. In addition to the standard Linux commands will be some DataDomain-specific commands. For the purposes of fixing your NTFS ACL mess, you'll be wanting to use the "dd_xcacls" command:

    • Use "dd_xcacls -O '[DomainObject]' [Object]" to set the Ownership of an object. For example, to set the ownership attribute to your AD domain account, issue the command "dd_xcacls -O 'MDOMAIN\MYUSER' /data/col1/backup/ShareName".
    • Use "dd_xcacls -G '[DomainObject]' [Object]" to set the POSIX group of an object.  For example, to set the POSIX group attribute to your AD domain group, issue the command "dd_xcacls -O 'MDOMAIN\MYUSER' /data/col1/backup/ShareName".
    • Use "dd_xcacls -D '[ActiveDirectorySID]:[Setting]/[ScopeMask]/[RightsList]' [OBJECT]" to set the POSIX group of an object. For example, to give "Full Control" rights to your domain account, issue the command "dd_xcacls -D 'MDOMAIN\MYUSER:ALLOW/4/FullControl' /data/col1/backup/ShareName".

    A couple of notes apply to the immediately preceding:

    1. While the "dd_xcacls" command can notionally set rights-inheritance, I've discovered that this isn't 100% reliable in the DDOS 5.1.x family. Once you've placed the desired DACLs on the filesystem objects, it will likely be necessary to use a Windows system to set/force inheritance onto objects lower in the filesystem hierarchy.
    2. When you set a DACL with "dd_xcacls -D", it replaces whatever DACLs are in place. Any permissions previously on the filesystem object will be removed. If you want more than one user/group DACL applied to the filesystem-object, you need to apply them all at once, using the ";" token to separate DACLs within the quoted text-argument to the "-D" flag (see the example below).
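
    As an illustration of that second note, stacking a user DACL and a group DACL in a single invocation would look something like the following (the domain, user, group and share names are just the placeholder values from the earlier examples):

    dd_xcacls -D 'MDOMAIN\MYUSER:ALLOW/4/FullControl;MDOMAIN\MYGROUP:ALLOW/4/FullControl' /data/col1/backup/ShareName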

    Because you'll need to fix all of your permissions, one at a time, from this mode, you'll want to use the Linux `find` command to power your use of the "dd_xcacls" command. On a normal Linux system, when dealing with filesystems that have spaces in directory or file object-names, you'd do something like `find [DIRECTORY] -print0 | xargs -0 [ACTION]` to more efficiently handle this. However, that doesn't work quite the same way it would on a generic Linux system, at least not on the DDOS 5.x systems I've used. Instead, you'll need to use a `find [Directory] -exec [dd_xcacls command-string] {} \;`. This is very slow and resource intensive. On a directory structure with thousands of files, this can take hours to run. Further, because of how resource-intensive using this method is, you won't be able to run more than one such job at a time. Attempting to do so will result in SEGFAULTs - and the more you attempt to run concurrently, the more frequent the SEGFAULTs will be. These SEGFAULTs will cause individual "dd_xcacls" iterations to fail, potentially leaving random filesystem objects' permissions unmodified.
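
    Putting that together, a single-pass "fix everything under this share" sweep might look something like the following (again, the domain\user and share path are the placeholder values from the earlier examples, and expect it to run for a long time):

    find /data/col1/backup/ShareName -exec dd_xcacls -D 'MDOMAIN\MYUSER:ALLOW/4/FullControl' {} \;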

    Friday, May 24, 2013

    DataDomain Bash Shell (or: So You Want to Wreck Your DataDomain)

    When dealing with the care and feeding of DataDomain arrays, there are occasions where it helps to know how to access the array's "Engineering Mode". In actuality, there are two levels of engineering mode for DataDomains:

    • SE Shell: the SE (system engineer) shell mode is a superset of the normal system administration shell. It includes all of the management commands of the normal administration shell plus some powerful utilities for doing lower-level maintenance tasks on your DataDomain. These include things like fixing ACLs on your CIFS shares, changing networking settings (e.g., timeouts related to OST sessions) and other knobs that are nice to be able to twizzle
    • BASH Shell: While the SE shell mode gives you more utilities for managing the array, they're still wrapped in the overall DDOS command-shell construct. The BASH shell mode is pretty much just like a normal root shell on a Linux system: you're able to script tasks in it, use tools like `find`, etc. Take all the damage you can do in the SE mode and add on the capability of doing those tasks on a massive, automated scale.
    While enabling SE mode can be likened to enabling you to shoot your foot off with a .22, the BASH mode could be likened to enabling you to shoot your foot off with a howitzer. Where SE mode is merely dangerous, I can't really begin to characterize the level of risk you expose yourself to when you start taking full advantage of the DataDomain's BASH shell.

    Since accessing either of these modes isn't well-documented (though a decent number of Google searches will turn up the basic "SE" mode) and I use this site as a personal reminder on how to do things, I'm going to put the procedures here.

    Please note: use of engineering mode allows you to do major amounts of damage to your data with a frightening degree of ease and rapidity. Don't try to access engineering mode unless you're fully prepared to have to re-install your DataDomain - inclusive of destroying what's left of the data on it.

    Accessing SE Mode:
    1. SSH to the DataDomain.
    2. Login with an account that has system administrator privileges (this may be one of the default accounts your array was installed with, a local account you've set up for the purpose or an Active Directory-managed account that has been placed into an Active Directory security-group that has been granted the system administrator role on the DataDomain).
    3. Get the array's serial number. The easiest way to do this is to type `system show serialno` at the default command prompt.
    4. Access SE mode by typing `priv set se`. You will be prompted for a password - the password is the serial number from the prior step.
    At this point, your command prompt will change to "SE@<ARRAYNAME>" where "<ARRAYNAME>" will be the nodename of your DataDomain. While in this mode, an additional command-set will be enabled. These commands are accessed by typing "se". You can get a further listing of the "se" sub-commands in much the same way you can get help at the normal system administration shell (in this particular case: by typing "se ?").
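
    Strung together, the sequence looks roughly like the following (the "dd01" nodename and the exact prompt formatting here are illustrative only, not verbatim output):

    sysadmin@dd01# system show serialno
    sysadmin@dd01# priv set se
    (enter the serial number from the prior step at the password prompt)
    SE@dd01# se ?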


    Accessing the SE BASH Shell:
    Once you're in SE mode, the following command-sequence will allow you to access the engineering mode's BASH shell:

    1. Type "fi st"
    2. Type "df"
    3. Type <CTRL>-C three times
    4. Type "shell-escape"
    At this point, a warning banner will come up to remind you of the jeopardy you've put your configuration in. The prompt will also change to include a warning. This is DataDomain's way of reminding you, at every step, of the danger of the access-level you've entered.

    Once you've gotten the engineering BASH shell, you have pretty much unfettered access to the guts of the DataDomain. The BASH shell is pretty much the same as you'd encounter on a stock Linux system. Most of the GNU utilities you're used to using will be there and will work the same way they do on Linux. You won't have man pages, so, if you forget flags to a given shell command, look them up on a Linux host that has the man pages installed.

    In addition to the standard Linux commands will be some DataDomain-specific commands. These are the commands that are accessible from the "se" command and its subcommands. The primary use-case for exercising these commands in BASH mode is that the BASH mode is pretty much as fully-scriptable as a root prompt on a normal Linux host. In other words, take all the danger and power of SE mode and wrap it in the sweaty-dynamite of an automated script (you can do a lot of modifications/damage by horsing the se sub-commands to a BASH `find` command or script).

    Friday, February 22, 2013

    Posterous Alternatives?

    So, Posterous is shutting down at the end of the month. It was a great meta-poster. Anyone know any suitable alternatives? I really liked having it for replicating my technical ramblings across multiple sites (really helpful when you work at multiple job-sites, each with their own web-filter rule-sets).

    Looking for any recommendations - maybe even paid-for services.

    Friday, February 8, 2013

    The "Which LUN Is That" Game

    One of the fun things about SAN storage has to do with its tendency to be multipathed. Different operating systems handle this differently - but most of the ones I've worked with tend to do what I refer to as "ghosting".

    "Ghosting" is, by no means, a technical term. It's mostly meant to convey that, when an OS sees a device down multiple paths, it sees the device multiple times. While you may only have one actual chunk of storage presented from your SAN to your host, you might see that chunk 2, 3, 4 or more times in your storage listings. I call these "extra" presences in the output "ghosts".

    At any rate, one of the joys associated with ghosting is determining, "am I seeing all of the LUNs I expect to see, and am I seeing each of them an appropriate number of times?" If you're only presenting one chunk of storage down four paths and you see four copies of the LUN show up, it's easy enough to say, "yep: things look right". It's even easy(ish) when you have multiple different-sized chunks of storage to determine whether things are right: 1) you have PATHS x LUNS storage chunks visible; 2) each set of chunks is an identifiable size. However, if you're presenting multiple chunks that are the same size or multiple chunks with differing levels of pathing-redundancy, things get a bit trickier.

    One of the things I'd sorta liked about Solaris's STMS drivers was that, once you'd activated the software for a given LUN, the ghosts disappeared into a single meta-device. For better or worse, not everyone used STMS - particularly not on older Solaris releases or configurations that used things like EMC's PowerPath or VERITAS's DMP software. PowerPath actually kinda made the problem worse, as, in addition to the normal ghosts, you added a PowerPath meta-device for each group of LUN-paths. This made the output from the `format` command even longer.

    All of that aside, how do you easily identify which disks are ghosts of each other and which ones are completely separate LUNs? The most reliable method I've found is looking at the LUNs' serial numbers. If you have eight storage chunks visible, four of which have one serial number and four of which have a different serial number, you know that you've got two LUNs presented to your host and that each is visible down four paths. But how do you get the serial numbers?

    Disks' and LUNs' serial numbers are generally found in their SCSI inquiry responses in what's referred to as "code page 83". How you get to that information is highly OS dependent.

    • On Solaris - at least prior to Solaris 10 - you didn't generally have good utilities for pulling serial numbers from LUNs. If you wanted to pull that info, you'd have to fire up the `format` utility in "expert" mode and issue the command-sequence "scsi → inquire". By default, this dumps out code page 83 as part of the response. It dumps this info in two parts: a big, multi-line block of hex-codes and a smaller multi-line block of ASCII text. Your disk/LUN serial number is found by ignoring the big, multi-line block of hex values and looking at the third line of the smaller ASCII block.
    • On Linux, they provide you a nice little tool that allows you to directly dump out the target SCSI inquiry code-page. Its default behavior is pretty much to dump just the serial number (actually, the serial number is embedded in a longer string, but if you send that string over to your SAN guys, they'll generally recognize the relevant substring and match it up to what they've presented to your host). The way you dump out that string is to use the command `scsi_id -ugs /block/sdX` (where "sdX" is something like "sda", "sdh", etc.).
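
    To turn that into a quick "who's a ghost of whom" report, a loop along the following lines (a sketch that assumes the RHEL 5-era `scsi_id` syntax shown above) will print each sd device next to its serial string; ghosts jump out as soon as you sort on the second column:

    for DEV in /sys/block/sd*; do
      DISK="$(basename ${DEV})"
      printf "%s\t%s\n" "${DISK}" "$(scsi_id -ugs /block/${DISK})"
    done | sort -k2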

    At any rate, multi-pathing software and associated utilities aside, once you've determined which serial numbers correspond to which disk device-nodes, it becomes a trivial exercise to determine "am I seeing all of the LUNs I expect to see" and "are my LUNs presented down the expected number of SAN-fabric paths".

    Note: if you're running Solaris 10 or an earlier Solaris release with appropriate storage device management packages installed, you may have access to tools like `prtpicl`, `luxadm` and `fcinfo` with which to pull similarly-useful pathing information.

    Friday, January 11, 2013

    LVM Online Relayout

    Prior to coming to the Linux world, most of my complex, software-based storage taskings were performed under the Veritas Storage Foundation framework. In recent years, working primarily in virtualized environments, most storage tasks are done "behind the scenes" - either at the storage array level or within the context of VMware. Up until today, I had no cause to worry about converting filesystems from using one underlying RAID-type to another.

    Today, someone wanted to know, "how do I convert from a three-disk RAID-0 set to a six-disk RAID-10 set". Under Storage Foundation, this is just an online relayout operation - converting from a simple volume to a layered volume. Until I dug into it, I wasn't aware that LVM was capable of layered volumes, let alone online conversion from one volume-type to another.

    At first, I thought I was going to have to tell the person (since Storage Foundation wasn't an option for them), "create your RAID-0 sets with `mdadm` and then layer RAID-1 on top of those MD-sets with LVM". Turns out, you can do it in LVM (I spun up a VM in our lab and worked through it).

    Basically the procedure assumes that you'd previously:
    1. Attached your first set of disks/LUNs to your host
    2. Used the usual LVM tools to create your volumegroup and LVM objects (in my testing scenario, I set up a three-disk RAID-0 with a 64KB stripe-width)
    3. Created and mounted your filesystem.
    4. Gone about your business.
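
    A sketch of what steps 2 and 3 might have looked like in a test scenario like mine (device names, filesystem type and mount point are placeholders, not prescriptions):

    pvcreate /dev/sdb /dev/sdc /dev/sdd
    vgcreate AppVG /dev/sdb /dev/sdc /dev/sdd
    lvcreate -n AppVol -i 3 -I 64 -L 408M AppVG
    mkfs.ext3 /dev/AppVG/AppVol
    mount /dev/AppVG/AppVol /App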
    Having done the above, your underlying LVM configuration will look something like:
    # vgdisplay AppVG
      --- Volume group ---
      VG Name               AppVG
      System ID             
      Format                lvm2
      Metadata Areas        3
      Metadata Sequence No  2
      VG Access             read/write
      VG Status             resizable
      MAX LV                0
      Cur LV                1
      Open LV               0
      Max PV                0
      Cur PV                3
      Act PV                3
      VG Size               444.00 MB
      PE Size               4.00 MB
      Total PE              111
      Alloc PE / Size       102 / 408.00 MB
      Free  PE / Size       9 / 36.00 MB
      VG UUID               raOK8i-b0r5-zlcG-TEqE-uCcl-VM3L-RelQgX
    # lvdisplay /dev/AppVG/AppVol 
      --- Logical volume ---
      LV Name                /dev/AppVG/AppVol
      VG Name                AppVG
      LV UUID                6QuQSv-rklG-pPv6-Tq6I-TuI0-N50T-UdQ4lu
      LV Write Access        read/write
      LV Status              available
      # open                 1
      LV Size                408.00 MB
      Current LE             102
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     768
      Block device           253:7
    Take special note that there are free PEs available in the volumegroup. In order for the eventual relayout to work, you have to leave space in the volume group for LVM to do its reorganizing magic. I've found that a 10% set-aside has been safe in testing scenarios - possibly even overly generous. In a large, production configuration, that set-aside may not be enough.

    When you're ready to do the conversion from RAID-0 to RAID-10, add a second set of identically-sized disks to the system. Format the new devices and use `vgextend` to add the new disks to the volumegroup.
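
    A sketch of that grow-the-volumegroup step, re-using the placeholder device naming from the earlier sketch:

    pvcreate /dev/sde /dev/sdf /dev/sdg
    vgextend AppVG /dev/sde /dev/sdf /dev/sdg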

    Note: Realistically, so long as you increase the number of available blocks in the volumegroup by at least 100%, it likely doesn't matter whether you add the same number/composition of disks to the volumegroup. Differences in mirror compositions will mostly be a performance rather than an allowed-configuration issue.

    Once the volumegroup has been sufficiently-grown, use the command `lvconvert -m 1 /dev/<VolGroupName>/<VolName>` to change the RAID-0 set to a RAID-10 set. The `lvconvert` works with the filesystem mounted and in operation - technically, there's no requirement to take an outage window to do the operation. As the `lvconvert` runs, it will generate progress information similar to the following:
    AppVG/AppVol: Converted: 0.0%
    AppVG/AppVol: Converted: 55.9%
    AppVG/AppVol: Converted: 100.0%
    Larger volumes will take a longer period of time to convert (activity on the volume will also increase the time required for conversion). Output is generated at regular intervals. The longer the operation takes, the more lines of status output that will be generated.

    Once the conversion has completed, you can verify that your RAID-0 set is now a RAID-10 set with the `lvdisplay` tool:
    # lvdisplay /dev/AppVG/AppVol 
      --- Logical volume ---
      LV Name                /dev/AppVG/AppVol
      VG Name                AppVG
      LV UUID                6QuQSv-rklG-pPv6-Tq6I-TuI0-N50T-UdQ4lu
      LV Write Access        read/write
      LV Status              available
      # open                 1
      LV Size                408.00 MB
      Current LE             102
      Mirrored volumes       2
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     768
      Block device           253:7
    The addition of the "Mirrored Volumes" line indicates that the logical volume is now a mirrored RAID-set.