Tuesday, November 30, 2010

Linux (Networking): Oh How I Hate You

Working with most commercial UNIX systems (Solaris, AIX, etc.) and even Windows, you take certain things for granted. Easy networking setup is one of those things, particularly on systems designed to work in modern, redundant networks. Setting up things like multi-homed hosts is relatively trivial. I dunno. It may just be that I'm so used to how commercial OSes do things that, when I have to deal with The Linux Way™, it seems hopelessly archaic and picayune.

If I take a Solaris host that's got more than one network address on it, routing's pretty simple. I can declare one default route or I can declare a default route per interface/network. At the end of the day, Solaris's internal routing mechanisms just get it right. The only time there's really any mucking about is if I want/need to set up multiple static routes (or use dynamic routing protocols).

Linux... Well, perhaps it's just the configuration I had to make work. Someone wanted me to get a system with multiple bonded interfaces set up with VLAN tagging to route properly. Having the commercial UNIX mindset, I figured "just declare a GATEWAY in each bonded interface's /etc/sysconfig/network-scripts file" and that would be the end of it.

Nope. It seems like Linux has a "last default route declared is the default route" design. Ok. I can deal with that. I mean, I used to have to deal with that with commercial UNIXes. So, I figured, "alright, only declare a default route in one interface's scriptfile". And, that sorta worked. I always got that one default route. Unfortunately, Linux's network routing philosophy didn't allow that to fully work the way experience with other OSes might lead one to expect.

On the system I was asked to configure, one of the interfaces happened to be on the same network as the host I was administering it from. It should be noted that this interface is a secondary interface. The host's canonical name points to an IP on a different LAN segment. Prior to configuring the secondary interface on this host, I was able to log into that primary interface with no problems. Unfortunately, adding that secondary interface that was on the same LAN segment as my administration host caused problems. The Linux routing saw to it that I could only connect to the secondary interface. I was knocked out of trying to get into the primary interface.

This seemed odd. So, I started to dig around on the Linux host to figure out what the heck was going on. First up, a gander at the routing tables:

# netstat -rnv
Kernel IP routing table
Destination     Gateway          Genmask         Flags   MSS Window  irtt Iface
192.168.2.0     0.0.0.0          255.255.255.0   U         0 0          0 bond1.1002
192.168.33.0    0.0.0.0          255.255.255.0   U         0 0          0 bond0.1033
169.254.0.0     0.0.0.0          255.255.0.0     U         0 0          0 bond1.1002
0.0.0.0         192.168.33.254   0.0.0.0         UG        0 0          0 bond0.1033

Hmm... Not quite what I'm used to seeing. On a Solaris system, I'd expect something more along the lines of:

IRE Table: IPv4
  Destination             Mask           Gateway          Device   Mxfrg Rtt   Ref Flg  Out  In/Fwd
-------------------- ---------------  -------------------- ------  ----- ----- --- --- ----- ------
default              0.0.0.0          192.168.8.254                1500*     0   1 UG    1836      0
192.168.8.0          255.255.255.0    192.168.8.77         ce1     1500*     0   1 U      620      0
192.168.11.0         255.255.255.0    192.168.11.222       ce0     1500*     0   1 U        0      0
224.0.0.0            240.0.0.0        192.168.8.77         ce1     1500*     0   1 U        0      0
127.0.0.1            255.255.255.255  127.0.0.1            lo0     8232*     0   1 UH   13292      0

Yeah yeah, not identically formatted output, but similar enough that things on the Linux host don't look right if what you're used to seeing is the Solaris system's way of setting up routing. On a Solaris host, network destinations (i.e., "192.168.2.0", "192.168.33.0", "192.168.8.0" and "192.168.11.0" in the above examples) get routed through an IP address on a NIC. On Linux, however, it seems like all of the network routes were configured to go through whatever the default route was.

Now, what `netstat -rnv` is showing for interface network routes may not be strictly representative of what Linux is actually doing, but both what Linux is doing and how it's presented are wrong - particularly if there are firewalls between you and the multi-homed Linux hosts. The above output is kind of a sloppy representation of Linux's symmetrical routing philosophy. Unfortunately, because of the way Linux routes, if I have a configuration where the multi-homed Linux host has two IP addresses - 192.168.2.123 and 192.168.33.123 - and I'm connecting from a host with an address of 192.168.2.76 but am trying to connect to the Linux host's 192.168.33.123 address, my connection attempt times out. While Linux may, in fact, receive my connection request at the 192.168.33.123 address, its default routing behavior seems to be to send the reply back out through its 192.168.2.123 address - ostensibly because the host I'm connecting from is on the same segment as the Linux host's 192.168.2.123 address.

Given my background, my first thought was "make the Linux routing table look like the routing tables you're more used to." Linux is nice enough to let me issue what seem to be the right `route add` statements. However, it doesn't allow me to nuke the seemingly bogus network routes pointing at 0.0.0.0.

Ok, apparently I'm in for some kind of fight. I've gotta sort out the differences in routing philosophy between my experience and that of the writers of the Linux networking stack. Fortunately, I have a good friend (named "Google") who's generally pretty good at getting my back. It's with Google's help that I discover that this kind of routing problem is handled through Linux's "advanced routing" functionality. I don't really quibble about what's so "advanced" about sending a response packet back out the same interface that the request packet came in on. I just kinda shrug and chalk it up to differences in implementation philosophy. It does, however, leave me with the question of, "how do I solve this difference of philosophy?"

Apparently, I have to muck about with files that I don't have to on either single-homed Linux systems or multi-homed commercial UNIX systems. I have to configure additional routing tables so that I can set up policy routing. Ok, so, I'm starting to scratch my head here. By itself, this isn't inherently horrible. However, it's not one of those topics that seems to come up a lot. It's neither well-documented in Linux nor does Google return many relevant hits. Thus, I'm left to take what hits I do find and start experimenting. Ultimately, I found that I had to set up five files (in addition to the normal /etc/sysconfig/network-scripts/ifcfg-* files) to get things working as I think they ought:

/etc/iproute2/rt_tables
/etc/sysconfig/network-scripts/route-bond0.1033
/etc/sysconfig/network-scripts/route-bond1.1002
/etc/sysconfig/network-scripts/rule-bond0.1033
/etc/sysconfig/network-scripts/rule-bond1.1002

Using the "/etc/iproute2/rt_tables" file is kind of optional. Mostly, it lets me assign logical/friendly names to the extra routing tables I need to set up. I like friendly names. They're easier to remember and can be more self-documenting than plain-Jane numeric IDs. So, I edit the "/etc/iproute2/rt_tables" file and add two lines:

2         net1002
33        net1033

I should probably note that what I actually wanted to add was:

1002      net1002
1033      net1033

I wanted these aliases as they would be reflective of what my 802.1q VLAN IDs were. Unfortunately, Linux seems to limit the range of numbers you can allocate table IDs out of. Worse, there are reserved and semi-reserved IDs in that limited range, further limiting your pool of table ID candidates. So, creating an enterprise-deployable standard config file might not be practical on a network with lots of VLANs, subnets, etc. Friendly names set up, I then had to set up the per "NIC" routing and rules files.

I set up the "/etc/sysconfig/network-scripts/route-${IF}.${VLANID}" files with two routing statements each:

table ${TABLENAME} to ${NETBASE}/${CIDR} dev ${DEV}.${VLAN}
table ${TABLENAME} to default via ${GATEWAY} dev ${DEV}.${VLAN}

This might actually be overkill. I might only need one of those lines. However, it was late in the day and I was tired of experimenting. They worked, so, that was good enough. Sometimes, a sledgehammer's a fine substitute for a scalpel.
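
To make that less abstract, here's roughly what the route-bond1.1002 file ended up looking like, using the 192.168.2.0/24 network from the routing table above (the 192.168.2.254 gateway here is an illustrative stand-in, not my actual gateway):

table net1002 to 192.168.2.0/24 dev bond1.1002
table net1002 to default via 192.168.2.254 dev bond1.1002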

Lastly, I set up the "/etc/sysconfig/network-scripts/rule-${IF}.${VLANID}" files with a single rule, each:

from ${NETBASE}/${CIDR} table ${TABLENAME} priority ${NUM}

Again, the priority values I picked may be suboptimal (I set the net1002 priority to "2" and the net1033 priority to "33"). But, since they worked, I left them at those values.
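
For completeness, that means my rule-bond1.1002 file contained a single line along these lines (re-using the illustrative 192.168.2.0/24 subnet from above):

from 192.168.2.0/24 table net1002 priority 2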

I did a `service network restart` and was able to access my multi-homed Linux host by either IP address from my workstation. Just to be super-safe, I bounced the host (to shake out anything that might have been left in place by my experiments). When the box came back from the reboot, I was still able to access it through either IP address from my workstation.
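
If you want to eyeball the configuration the system actually ended up with, the iproute2 tools will show both the policy rules and the per-table routes. Something like the following will dump what the rule and route files produced (table names per the rt_tables entries above):

# ip rule list
# ip route show table net1002
# ip route show table net1033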

Friday, November 26, 2010

Linux Storage Multipathing in a Mixed-Vendor (NetApp/EMC) Environment

So, recently, I've been tasked with coming up with documentation for the junior staff that has to work with RedHat Enterprise Linux 5.x servers in a multi-vendor storage environment. In our particular environment, the two most likely candidates to be seen on a Linux server are NetApp and/or CLARiiON storage arrays.

Previously, I've covered how to set up an RHEL 5.x system to use the Linux multipath service with NetApp Filer-based fibrechannel storage. Below, I'll expand on that, a bit, by explaining how to deal with a multi-vendor storage environment where separate storage subsystems will use separate storage multipathing solutions. In this particular case, the NetApp Filer-based fibrechannel storage will continue to be managed with the native Linux storage multipathing solution (multipathd) and the EMC CLARiiON-based storage will use the EMC-provided storage multipathing solution, PowerPath. I'm not saying such a configuration will be normal in your environment or mine, it's just an "edge-case scenario" I explored in my testing environment just in case someone asked for it. It's almost a given that if you haven't tested or documented the edge-cases, someone will invariably want to know how to do it (and, conversely, if you test and document it, no one ever bothers you about how to do it in real life).

Prior to presentation of EMC CLARiiON-based storage to your mixed-storage system, you will want to ensure that:
  • CLARiiON-based LUNs are excluded from your multipathd setup
  • PowerPath software has been installed
To explicitly exclude CLARiiON-based storage from multipathd's management, it will be necessary to modify your system's /etc/multipath.conf file. You will need to modify this file's blacklist stanza to resemble the following:

blacklist {
        wwid DevId
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"

        #############################################################################
        # Comment out the next four lines if management of CLARiiON LUNs is *WANTED*
        #############################################################################
        device {
                vendor "DGC"
                product "*"
        }

  }

The lines we're most interested in are the four comprising the device { } stanza. These lines are what the blacklist interpreter uses to tell itself "ignore any device whose SCSI inquiry returns a Vendor ID of 'DGC'".

I should note that I'd driven myself slightly nuts working out the above. I'd tried simply placing the 'device' entry directly in the 'blacklist' block. However, I found that, if I didn't contextualize it into a 'device' sub-block, the multipathd service would pretty much just flip me the bird and ignore the directive (without spitting out errors to tell me that it was doing so or why). Thus, it would continue to grab my CLARiiON LUNs until I nested my directives properly. The 'product' definition is, also, probably overkill, but, it works.
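
On RHEL 5, the sequence I use to get multipathd to reread the file and release any CLARiiON maps it had already built is roughly the following (a sketch; it assumes nothing is mounted from the maps being flushed):

# service multipathd restart
# multipath -F
# multipath -v2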

Once the blacklist is in place and the daemon has reread its configuration, request storage and do the usual PowerPath tasks to bring the CLARiiON devices under PowerPath's control. Properly set up, this will result in a configuration similar to the following:
# multipath -l
360a98000486e2f34576f2f51715a714d dm-7 NETAPP,LUN
[size=25G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0   [active][undef]
 \_ 0:0:1:1 sdb 8:16  [active][undef]
 \_ 1:0:0:1 sde 8:64  [active][undef]
 \_ 1:0:1:1 sdf 8:80  [active][undef]

# powermt display dev=all
Pseudo name=emcpowera
CLARiiON ID=APM00034000388 [stclnt0001u]
Logical device ID=600601F8440E0000DBE5E2FEF1E1DF11 [LUN 13]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0;
Owner: default=SP A, current=SP B       Array failover mode: 1
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
   0 qla2xxx                  sdc       SP A1     unlic   alive       0      0
   0 qla2xxx                  sdd       SP B1     unlic   alive       0      0
   1 qla2xxx                  sdg       SP A0     active  alive       0      0
   1 qla2xxx                  sdh       SP B0     active  alive       0      0

As can be seen from the above, the NetApp LUN(s) are showing up under multipathd's control and the CLARiiON LUNs are showing up under PowerPath's control. Neither multipathing solution is seeing the other's devices.

You'll also note that the Array failover mode is set to "1". In my test environment, the only CLARiiON I have access to is in dire need of a firmware upgrade. Its firmware doesn't support mode "4" (ALUA). Since I'm using this test configuration to test both PowerPath and native multipathing, I had to set the LUN to a mode that both the array and multipathd supported to get my logs to stop getting polluted with "invalid mode" settings. Oh well, hopefully a hardware refresh is coming to my lab.

Lastly, you'll also likely note that I'm running PowerPath in unlicensed mode. Again, this is a lab scenario where I'm tearing stuff down and rebuilding, frequently. Were it a production system, the licensing would be in place to enable all of the PowerPath functionality.

Wednesday, November 17, 2010

Linux Multipath Path-Failure Simulation

Previously, I've discussed how to set up the RedHat Linux storage multipathing software to manage fibrechannel-based storage devices. I didn't, however, cover how one tests that such a configuration is working as intended.
When it comes to testing resilient configurations, one typically has a number of testing options. In a fibrechannel fabric situation, one can do any of:
  • Offline a LUN within the array
  • Down an array's storage processors and/or HBAs
  • Pull the fibrechannel connection between the array and the fibrechannel switching infrastructure
  • Shut off a switch (or switches) in a fabric
  • Pull connections between switches in a fabric
  • Pull the fibrechannel connection between the fibrechannel switching infrastructure and the storage-consuming host system
  • Disable paths/ports within a fibrechannel switch
  • Disable HBAs on the storage-consuming host systems
  • Disable particular storage targets within the storage-consuming host systems
Generally, I favor approaches that limit the impact of the tested scenario as much as possible and that limit the likelihood of introducing actual/lasting breakage into the tested configuration.
I also tend to favor approaches where I have as much control of the testing scenario as possible. I'm an impatient person and having to coordinate outages and administrative events with other infrastructure groups and various "stakeholders" can be a tiresome, protracted chore. Some would say that indicates I'm not a team-player: I like to think that I just prefer to get things done efficiently and as quickly as possible. Tomayto/tomahto.
Through most of my IT career, I've worked primarily on the server side of the house (Solaris, AIX, Linux ...and even - *ech* - Windows) - whether as a systems administrator or as an integrator. So, my testing approaches tend to be oriented from the storage-consumer's view of the world. If I don't want to have to coordinate outside of a server's ownership/management team, I'm pretty much limited to the last three items on the above list: yanking cables from the server's HBAs, disabling the server's HBAs and disabling storage targets within the server.
Going back to the avoidance of "introducing actual/lasting breakage", I tend to like to avoid yanking cables. At the end of the day, you never know if the person doing the monkey-work of pulling the cable is going to do it like a surgeon or like a gorilla. I've, unfortunately, been burned by run-ins with more than a few gorillas. So, if I don't have to have cables physically disconnected, I avoid it.
Being able to logically disable an HBA is a nice test scenario. It effects the kind of path-failure scenario that you're hoping to test. Unfortunately, not all HBA manufacturers seem to include the ability to logically disable the HBA from within their management utilities. Within commercial UNIX variants - like Solaris or AIX - this hasn't often proven to be a problem. Under Linux, however, the ability to logically disable HBAs from within their management utilities seems to be a bit "spotty".
Luckily, where the HBA manufacturers sometimes leave me in the lurch, RedHat Linux leaves me some alternatives. In the spirit of Linux DIYism, those alternatives aren't really all that fun to deal with ...until you write tools, for yourself, that remove some of the pain. I wrote two tools to help myself in this area: one is a tool which offlines designated storage paths and one is a tool which attempts to restore those downed storage paths.
Linux makes it possible to change the system-perceived state of a given device path by writing the term "offline" to the file location, /sys/block/${DEV}/device/state. Thus, were one to want to make the OS think that the path to /dev/sdg was down, one would execute the command, `echo "offline" > /sys/block/sdg/device/state`. All that my path-downing script does is make it so you can down a given /dev/sdX device by executing `pathdown.sh <DEVICE>` (e.g., `pathdown sdg`). There's minimal logic built in to verify that the named /dev/sdX device is a real, downable device, and it provides a post-action status of that device, but, other than that, it's a pretty simple script.
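I'm not reproducing my exact script here, but a minimal sketch of what pathdown.sh might look like (a reconstruction of the idea, in the same Korn shell I use for everything else) is:
#!/bin/ksh
# pathdown.sh - offline a /dev/sdX path by marking it down in sysfs
DEV=${1:?"Usage: pathdown.sh <device>  (e.g., pathdown.sh sdg)"}
STATEFILE=/sys/block/${DEV}/device/state
# Minimal sanity check: make sure the named device actually exists
if [ ! -f ${STATEFILE} ]
then
   echo "${DEV}: no such SCSI block device" >&2
   exit 1
fi
# Mark the path offline and report the post-action state of the device
echo "offline" > ${STATEFILE}
echo "${DEV} is now: `cat ${STATEFILE}`"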
To decide which path one wants to down, it's expected that the tester will look at the multipather's view of its managed devices using `multipath -l <DEVICE>` (e.g., `multipath -l mpath0`). This command will produce output similar to the following:
mpath0 (360a9800043346d364a4a2f41592d5849) dm-7 NETAPP,LUN
[size=20G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0   [active][undef]
 \_ 0:0:1:1 sdb 8:16  [active][undef]
 \_ 1:0:0:1 sdc 8:32  [active][undef]
 \_ 1:0:1:1 sdd 8:48  [active][undef]
Thus, if one wanted to deactivate one of the channels in the mpath0 multipathing group, one might issue the command  `pathdown sdb`. This would result in the path associated with /dev/sdb being taken offline. After taking this action, the output of `multipath -l mpath0` would change to:
mpath0 (360a9800043346d364a4a2f41592d5849) dm-7 NETAPP,LUN
[size=20G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0   [active][undef]
 \_ 0:0:1:1 sdb 8:16  [failed][faulty]
 \_ 1:0:0:1 sdc 8:32  [active][undef]
 \_ 1:0:1:1 sdd 8:48  [active][undef]
Typically, when doing such testing activities, one would be performing a large file operation to the disk device (preferably a write operation). My test sequence is usually to (a rough example of the first two steps follows the list):
  1. Start an `iostat` job, grepping for the devices I'm interested in, and capturing the output to a file
  2. Start up a file transfer (or even just a `dd` operation) into the device
  3. Start downing paths as the transfer occurs
  4. Wait for the transfer to complete, then kill the `iostat` job
  5. Review the captured output from the `iostat` job to ensure that the I/O behaviors I expected to see actually occurred
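As a rough example of steps 1 and 2 - assuming the sda through sdd paths and the mpath0 device from the earlier output, and a LUN whose contents you don't care about (the `dd` will overwrite them):
# iostat -x 2 | egrep 'sd[a-d]' > /tmp/mpath-test.out &
# dd if=/dev/zero of=/dev/mapper/mpath0 bs=1M count=4096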
In the testing environment I had available when I wrote this page, I was using a NetApp filer presenting blockmode storage via fibrechannel. The NetApp multipathing plugin supports concurrent, multi-channel operations to the target LUN. Thus, the output from my `iostat` job will show uniform I/Os across all paths to the LUN, and then show outputs drop to zero on each path that I offline. Were I using an array that only supported Active/Passive I/O operations, I would expect to see the traffic move from the downed path to one of the failover paths, instead.
So, great: you've tested that your multipathing system behaves as expected. However, once you've completed that testing, all of the paths that you've offlined have stayed offline. What to do about it?
The simplest method is to reboot the system. However, I abhor knocking my systems' `uptime` if I don't absolutely have to. Fortunately, much as Linux provides the means to offline paths, it provides the means for reviving them (well, to be more accurate, to tell it "hey, go check these paths and see if they're online again"). As with offlining paths, the methods for doing so aren't currently built into any OS-provided utilities. What you have to do is:
  1. Tell the OS to delete the device paths
  2. Tell the OS to rescan the HBAs for devices it doesn't currently know about
  3. Tell the multipath service to look for changes to paths associated with managed devices
The way you tell the OS to (gracefully) delete device paths is to write a value to a file. Specifically, one writes the value "1" to the file /sys/block/${DEV}/device/delete. Thus, if one is trying to get the system to clean up for the downed device path, /dev/sdb, one would issue the command `echo "1" > /sys/block/sdb/device/delete`.
The way you tell the OS to rescan the HBAs is to issue the command `echo "- - -" >  /sys/class/scsi_host/${HBA}/scan`. In Linux, the HBAs are numbered in the order found and named "hostN" (i.e., "host0", "host1", etc.). Thus, to rescan HBA 0, one would issue the command `echo "- - -" >  /sys/class/scsi_host/host0/scan` (for good measure, rescan all the HBAs).
The way to tell the multipath service to look for changes to paths associated with managed devices is to issue the command `multipath` (I actually use `multipath -v2` because I like the more verbose output that tells me what did or didn't happen as a result of the command). Granted, the multipath service periodically rescans the devices it manages to find state-change information, but I don't like to wait for systems to "get around to it".
All that my path-fixing script does is roll up the above three steps into one easy-to-remember command.
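For reference, a bare-bones version of such a script (a sketch of the idea rather than my verbatim tool; pass it the downed sdX names) might look like:
#!/bin/ksh
# Revive offlined paths: delete them, rescan the HBAs, then rerun multipath
# Step 1: delete each downed /dev/sdX path named on the command line
for DEV in $*
do
   if [ -f /sys/block/${DEV}/device/delete ]
   then
      echo "1" > /sys/block/${DEV}/device/delete
   fi
done
# Step 2: rescan every HBA for devices the OS doesn't currently know about
for HBA in /sys/class/scsi_host/host*
do
   echo "- - -" > ${HBA}/scan
done
# Step 3: have the multipath service re-evaluate its managed paths
multipath -v2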
Depending on how you have logging configured on your system, the results of all the path offlining and restoration will be logged. Both the SCSI subsystem and the multipathing daemon should log events. Thus, you can verify the results of your activities by looking in your system logs.
That said, if the system you're testing is hooked up to an enterprise monitoring system, you will want to let your monitoring groups know that they need to ignore the red flags you'll be generating on their monitoring dashboards.

    Tuesday, November 16, 2010

    RedHat NIC-Bonding/Trunking

    I've got, potentially, a number of RedHat Enterprise Linux 5.x systems that I need to set up NIC-trunking on. More fun, most of these systems will require that the trunks be composed of tagged-VLAN (802.1q) based NICs. Originally, our group's RedHat guy put together a script to help make this an easier task. Basically, it finds all the physical NICs on the system and then creates the requisite /etc/sysconfig/network-scripts files to configure them into a bonded interface. The script itself was serviceable, but it was strictly an interactive script. It wasn't designed to be invoked with arguments. So, you couldn't just say "bond and go" (or put it into an automated-provisioning framework).
    So, wanting a tool that could be fired off from an automated systems management console, I rewrote the script. Note that I'm not claiming the below is an ideal, optimal or even a good script. I'm just putting it out there as a reference for myself (in case my laptop dies, gets stolen, whatever) and in case someone else can use it as a basis for writing their own tool (to be honest, it's also in hopes that someone finds it and posts up a "here: do this to make it better/more elegant").
    Note: I am a KSH user. That means that, when I write scripts, I write them using Korn Shell syntax and structures. Most of these should be portable to any modern, POSIX-compliant shell:

    #!/bin/ksh
    #
    #
    # Script to create "null" bonded NICs
    #
    # Usage:
    #   ${0} <BondName> <BondMode> <NICStoBond> <IPAddress> <NetMask> <Gateway> <VLAN> <restart>
    
    BONDNAME=${1:-bond0}
    BONDMODE=${2:-4}
    BONDNICS=${3:-all}
    IPADDR=${4:-UNDEF}
    NETMASK=${5:-UNDEF}
    GATEWAY=${6:-UNDEF}
    VLANID=${7:-0}
    BOUNCE=${8:-UNDEF}
    
    # Set work/save dirs
    WORKDIR=/tmp/BOND
    NEWSCRIPTS=${WORKDIR}/scripts
    SAVEDIR=${WORKDIR}/save
    CONFDIR=/etc/sysconfig/network-scripts
    DATE=`date "+%Y%m%d%H%M%S"`
    VERIFY="yes"
    
    # Ensure that our work/save directories exist and are empty
    WorkDirs() {
       if [ ! -d ${WORKDIR} ]
       then
          mkdir -p ${WORKDIR}
          mkdir -p ${NEWSCRIPTS}
          mkdir -p ${SAVEDIR}
       else
          mv ${WORKDIR} ${WORKDIR}.SAV-${DATE} && mkdir ${WORKDIR}
          mkdir -p ${NEWSCRIPTS}
          mkdir -p ${SAVEDIR}
       fi
    }
    
    # Make sure kernel modules defined/loaded
    AddModAlias() {
       NEEDALIAS=`grep ${BONDNAME} /etc/modprobe.conf`
       if [ "${NEEDALIAS}" = "" ]
       then
          echo "alias ${BONDNAME} bonding" >> /etc/modprobe.conf
          echo "options ${BONDNAME} mode=${BONDMODE} miimon=100" >> /etc/modprobe.conf
          modprobe ${BONDNAME}
       fi
    }
    
    # Save previous config files
    SaveOrig() {
       if [ -f ${CONFDIR}/ifcfg-bond? ]
       then
          cp ${CONFDIR}/ifcfg-bond?  ${SAVEDIR}
       fi
    
       if [ -f ${CONFDIR}/ifcfg-bond?.* ]
       then
          cp ${CONFDIR}/ifcfg-bond?.* ${SAVEDIR}
       fi
    
       if [ -f ${CONFDIR}/ifcfg-eth? ]
       then
          cp ${CONFDIR}/ifcfg-eth?  ${SAVEDIR}
       fi
    
       if [ -f ${CONFDIR}/ifcfg-eth?.* ]
       then
          cp ${CONFDIR}/ifcfg-eth?.* ${SAVEDIR}
       fi
    }
    
    
    ###################################
    # Create an iteratable list of NICs
    ###################################
    IFlist() {
       if [ "${BONDNICS}" = "all" ] || [ "${BONDNICS}" = "ALL" ]
       then
          IFSET=`GetAllIFs`
       else
          IFSET=`ParseIFlist ${BONDNICS}`
       fi
       # Value returned by function
       echo ${IFSET}
    }
    
      # Iterate all NICs found in "/sys/devices"
      GetAllIFs() {
         find /sys/devices -name "*eth*" | sed 's#^.*net:##' | sort
      }
    
      # Parse passed-list of NICs
      ParseIFlist() {
         echo ${1} | sed 's/,/ /g'
      }
    #
    ###################################
    
    # Generate physical NIC file contents
    MkBaseIFfiles() {
       echo "# Interface config for ${1}"
       echo "DEVICE=${1}"
       echo "BOOTPROTO=none"
       echo "ONBOOT=yes"
       echo "USRCTL=no"
       echo "TYPE=Ethernet"
       echo "MASTER=${BONDNAME}"
       echo "SLAVE=yes"
    }
    
    # Generate bonded NIC file contents
    MkBondIFfile() {
       echo "# Interface config for ${1}"
       if [ "${2}" = "0" ]
       then
          echo "DEVICE=${1}"
          echo 'BONDING_MODULE_OPTS="mode='${BONDMODE}' miimon=100"'
       else
          echo "DEVICE=${1}.${2}"
          echo "VLAN=yes"
       fi
    
       echo "BOOTPROTO=none"
       echo "ONBOOT=yes"
       echo "USRCTL=no"
    
       # Check to see if creating a "base" bond for a 802.1q environment
       if [ "${IPADDR}" = "NULL" ] || [ "${IPADDR}" = "null" ]
       then
          echo "IPADDR="
          echo "NETMASK="
       else
          echo "IPADDR=${IPADDR}"
          echo "NETMASK=${NETMASK}"
       fi
    
       # If has no default-route
       if [ "${GATEWAY}" = "NULL" ] || [ "${GATEWAY}" = "null" ]
       then
          echo "GATEWAY="
       else
          echo "GATEWAY=${GATEWAY}"
       fi
       echo "TYPE=BOND"
    }
    
    
    if [ "${IPADDR}" = "UNDEF" ]
    then
       printf "Enter IP address: "
       read IPADDR
       echo "Entered: ${IPADDR}"
    fi
    
    if [ "${NETMASK}" = "UNDEF" ]
    then
       printf "Enter netmask address: "
       read NETMASK
       echo "Entered: ${NETMASK}"
    fi
    
    if [ "${GATEWAY}" = "UNDEF" ]
    then
       printf "Enter gateway address: "
       read GATEWAY
       echo "Entered: ${GATEWAY}"
    fi
    
    # Create workdirs
    WorkDirs
    
    # Update loaded modules as necessary
    AddModAlias
    
    # Save off prior configuration's files
    SaveOrig
    
    # Create base IF files
    NICLIST=`IFlist`
    for NIC in ${NICLIST}
    do
       MkBaseIFfiles ${NIC} > ${NEWSCRIPTS}/ifcfg-${NIC}
    done
    
    # Create bonded IF files
    if [ "${VLANID}" = "0" ]
    then
       MkBondIFfile ${BONDNAME} ${VLANID} > ${NEWSCRIPTS}/ifcfg-${BONDNAME}
    else
       MkBondIFfile ${BONDNAME} ${VLANID} > ${NEWSCRIPTS}/ifcfg-${BONDNAME}.${VLANID}
    fi
    
    # Copy new config files to system dir
    cp ${NEWSCRIPTS}/* ${CONFDIR}
    
    # Restart or just save?
    if [ "${BOUNCE}" = "UNDEF" ]
    then
       echo "Assuming multi-step config. Network not restarted."
    else
       echo "Restart requested. Attempting Network restart."
       service network restart
    fi

    Please feel free to comment with optimizations or suggestions for enhancements.
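
    As an example of a non-interactive invocation - assuming the script were saved as, say, /root/mkbond.sh (a name I'm making up for illustration) and using placeholder addresses - building an 802.3ad (mode 4) bond from eth0 and eth1, tagging it onto VLAN 1033 and restarting the network in one shot would look something like:

         # /root/mkbond.sh bond0 4 eth0,eth1 192.168.33.123 255.255.255.0 192.168.33.254 1033 restart

    Leave off the trailing "restart" argument if you just want the configuration files staged without bouncing the network.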

    Friday, November 12, 2010

    NetApp as Block Storage (fibrechannel) for RedHat Linux

    I work for an enterprise that's starting to go down the Linux path. This has mostly come about due to the fact that they are also heavily pursuing virtualization technologies. It's part of an overall effort to make their IT more efficient by increasing the utilization of server resources and reducing the sheer size of the physical infrastructure.

    The enterprise I work for is heavily Windows-oriented. Most of their "UNIX" people are converts from Windows - frequently converts of necessity or managerial-dictate rather than choice. As a UNIX-user since 1989, I'm one of the few "true" UNIX geeks they have. So, I frequently get tasked with making technologies work with their UNIX platforms.

    Most recently, there's been an effort that's required the deployment of a few RedHat-based physical servers. The applications being deployed on these servers require block-based storage. iSCSI isn't really an option in this environment, as they are just now starting to get into 10Gbps Ethernet deployments (and have no interest in deploying dedicated iSCSI networking infrastructure). The primary block mode storage platforms in use by our enterprise are solutions from EMC and, to a much lesser extent, NetApp.

    The first Linux project to be rolled out on fibrechannel storage will be using NetApp. So, I had to go through the joys of putting together a "How To" guide. While my background is heavy in UNIX, most of my career has been spent on Solaris, AIX and IRIX. Linux has only started appearing in my professional life within the last 24 months or so. It's changed a lot since I first used it back in the early- to mid-90s. Given my relative thinness in Linux, Google (and the storage systems' vendor docs) have been my boon companions.

    I'm writing this page mostly so I have a crib-sheet for later use. Delivered documentation seems to have a habit of becoming lost by our documentation custodians.

    To start off with, the storage engineers wanted to make sure that there were several items in place for this solution:

    • Up to date QLogic drivers
    • Availability of the QLogic SANsurfer utilities
    • Ability to use the Linux native multipathing solution

    QLogic HBA Drivers

    Ultimately, I had three choices for the QLogic drivers: use the ones that come with RedHat Enterprise Linux 5; use ones furnished by our hardware vendor in their driver and utilities "support pack"; or, use the ones from QLogic.

    Some parties wanted to use the ones supplied by our hardware vendor. This made a certain sense since they needed other components installed from the "support pack". The idea was that, by going with the "support pack", 100%, for firmware, drivers and utilities, we'd have a "known quantity" of compatibility tested software. That decision lasted less than two business days. The QLogic (and some other) drivers supplied as part of the "support pack" were distributed in source-code form. Our security requirements make it so that deploying such software would require maintaining externally compiled versions of those drivers. While not inherently problematic, it was unwieldy. Further exacerbating things was the fact that using these binaries would require additional coordination with components already installed as part of the core OS.

    As noted previously, Linux, in our enterprise, has primarily been slated for virtualized efforts. As such, the primary build versions for Linux were optimized for virtual hardware. A lot of the supporting components for physical deployments were stripped out at the behest of security. It became quickly apparent that maintaining this software would be an odious task and none of us wanted to be on the hook for it. So, we opted to not use the source-code-derived drivers from the "support pack".

    We also discovered that the driver packages offered by QLogic were in similar source-form. So, they, too, were disqualified from consideration.

    Fortunately, RedHat's Enterprise Linux 5.x comes with the drivers needed for the QLogic HBAs we use. Thus, we chose the path of least resistance - to use the RedHat drivers. All I had to do was document which driver revisions were included with each RHEL release and patch-level to be in use in our farms. An annoying task, but fairly easy to knock out, given the earliness of deployment of Linux in our enterprise.

    SANsurfer

    We also had a choice of suppliers for the SANsurfer utilities. We could use the ones straight from QLogic or the ones from our hardware vendor. We chose the latter, primarily based on the fact that our support contracts are with the hardware vendor, not QLogic. It was assumed that, should we run into issues, support would be more easily gotten from a company we've got direct support contracts with rather than one that we didn't. Fortunately, both our hardware vendor and QLogic provide their utilities in nice, standalone install bundles. We didn't have to worry about hassling with compiling tools or dependency tracking.

    Linux MultiPath and NetApp FC Storage

    RedHat Linux includes a fairly serviceable multipathing solution. It's extensible through plugins that provide multipathing policy and management modules.

    In order to allow the native multipathing drivers to work with NetApp LUNs, I had to grab the NetApp FCP Host Utilities Kit from NetApp's NOW site. Interestingly, NetApp seems to be one of the few vendors that hasn't "unified" their version numbering across supported platforms. Windows, Solaris, ESX and Linux all seem to be at different 5.x levels (with the latest Linux version being 5.3).

    I'll give NetApp credit: their documentation for the utilities was pretty straightforward and the software bundle was trivial to install. All I had to do was grab the RPM from the NOW site and install it on the test box. If I had any complaint, it's that they didn't include a sample /etc/multipath.conf file. Fortunately, they did include one in the installation PDFs. So, I did a quick copy-and-paste into a PuTTY vi session, cleaned up the file and saved it off for inclusion in an automated provisioning policy. Basically, the file looks like:

    defaults {
            user_friendly_names     yes
            max_fds                 max
            queue_without_daemon    no
            flush_on_last_del       yes
            }
    
    blacklist {
            wwid DevId
            devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
            devnode "^hd[a-z]"
            devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
            }
    
    devices {
            device {
                    vendor                  "NETAPP"
                    product                 "LUN"
                    getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                    prio_callout            "/sbin/mpath_prio_ontap /dev/%n"
                    features                "1 queue_if_no_path"
                    hardware_handler        "0"
                    path_grouping_policy    group_by_prio
                    failback                immediate
                    rr_weight               uniform
                    rr_min_io               128
                    path_checker            directio
                    }
            }
    

    So, if you found this page via an internet search, the above is all you should need, along with the NetApp FCP Host Utilities, to make the Linux multipathing service work with NetApp arrays.

    Once you have the above /etc/multipath.conf file in place, just start up the multipath daemon and set it up to restart at system boot. For RHEL 5.x, this is just a matter of executing:

         # service multipathd start
         # chkconfig multipathd on

    Once this service is started, any LUNs you present to your RedHat box over your fabric will be path-managed.

    LUNs "On the Fly"

    One of the things you take for granted using OSes like Solaris, AIX and IRIX is the inclusion of tools that facilitate the addition, modification and deletion of hardware "on the fly". Now, there are utilities you can install onto a Linux system, but there doesn't, yet, seem to be a core component that's the equivalent of Solaris's devfsadm. Instead, the only really universally available method for achieving similar results is to do:

         # echo "- - -" > /sys/class/scsi_host/hostN/scan

    Doing the above, after your friendly neighborhood SAN administrator has notified you "I've presented your LUNs to your system", you'll see new storage devices showing up in your `fdisk` output (without having to reboot!). It's worth noting that, even with the NetApp FCP host utilities installed, if your LUNs are visible on multiple fibrechannel paths, you'll see a /dev/sdX entry for each and every path to that device. So, if you have a LUN visible to your system through two HBAs and two SAN fabrics, you'll see four /dev/sdX devices for every "real" LUN presented.
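
    If you've got multiple HBAs and don't feel like issuing that once per hostN entry, a quick loop over the sysfs entries does the same thing (run as root):

         # for HBA in /sys/class/scsi_host/host*; do echo "- - -" > ${HBA}/scan; done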

    Multipath Magic!

    If you want to verify that the multipathing service is seeing things correctly, execute `multipath -ll`. This will yield output similar to the following:

    mpath0 (360a9800043346d364a4a2f41592d5849) dm-7 NETAPP,LUN
    [size=20G][features=1 queue_if_no_path][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 0:0:0:1 sda 8:0   [active][undef]
     \_ 0:0:1:1 sdb 8:16  [active][undef]
     \_ 1:0:0:1 sdc 8:32  [active][undef]
     \_ 1:0:1:1 sdd 8:48  [active][undef]
    


    If you want to verify it by using the utilities included in the NetApp FCP Host Utilities, execute `sanlun lun show -p all`. This will yield output similar to the following:

    filer:/vol/fcclient/lun1 (LUN 1)          Lun state: GOOD
    Lun Size:     20g (21474836480)  Controller_CF_State: Cluster Disabled
    Protocol: FCP           Controller Partner:
    DM-MP DevName: mpath0   (360a9800043346d364a4a2f41592d5849)     dm-7
    Multipath-provider: NATIVE
    --------- ---------- ------- ------------ --------------------------------------------- ---------------
       sanlun Controller                                                            Primary         Partner
         path       Path   /dev/         Host                                    Controller      Controller
        state       type    node          HBA                                          port            port
    --------- ---------- ------- ------------ --------------------------------------------- ---------------
         GOOD  primary       sda        host0                                            0c              --
         GOOD  primary       sdb        host0                                            0d              --
         GOOD  primary       sdc        host1                                            0b              --
         GOOD  primary       sdd        host1                                            0a              --
    


    In both the multipath and sanlun output, there will be two fields worth noting: the device name and the device ID. In the above output, the device name is "mpath0" and the device ID is "360a9800043346d364a4a2f41592d5849". In the Linux device tree, "mpath0" can be found at "/dev/mapper/mpath0"; the device ID can be found at "/dev/disk/by-id/360a9800043346d364a4a2f41592d5849". Use the "/dev/mapper" entries in your "/etc/fstab" and when you run `mkfs`.
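
    For example, to put a filesystem on the LUN above and have it mounted at boot, something like the following would do (ext3 and the /data mount point are just illustrative choices):

         # mkfs -t ext3 /dev/mapper/mpath0
         # mkdir /data
         # echo "/dev/mapper/mpath0  /data  ext3  defaults  1 2" >> /etc/fstab
         # mount /data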

    If you're feeling really enterprising, you can verify the multipathing works by downing FC paths in your SAN and verifying that the Linux hosts detects and reacts appropriately.

    Wednesday, November 10, 2010

    Commercial UNIX Background vs Linux DIYism

    I've been managing UNIX systems for over 18 years - 14 of them on a professional basis. In that time I've had the pleasure to support systems running:

    • Solaris
    • IRIX
    • AIX
    • HP/UX
    • RedHat Linux
    • SuSE Linux
    • SlackWare Linux
    • Coherent
    • Xenix
    • DG/UX
    • TRU-64
    • OSF
    • UNICOS
    • NeXTSTEP
    • Several different flavors of BSD
    Things have changed a lot in those 14-18 years of UNIX. Especially in recent years, where most of my work was on Solaris and AIX, I got kind of used to things being (relatively) easy. The enterprise-grade commercial UNIX solutions included lots of tools to facilitate working with complex storage and networking environments. Over the last year-and-change, I've had to switch gears and go to the Linux world (specifically, RedHat).

    Now, I've made use of Linux, off and on, since 1992. While Linux has improved greatly in that time, it hasn't improved uniformly or in the same directions that enterprise-grade, commercial UNIX systems have. Specifically, a lot of the tools that are just sort of a given in commercial UNIX operating systems aren't immediately at hand in Linux.

    Where, on a commercial UNIX operating system, you might have a tool or a suite of tools to accomplish a given task, in Linux you frequently only have the capability to do those tasks but not bundled tools. Frequently, especially when working with complex storage or networking tasks, you have to write your own tools leveraging all of the lower-level hooks that the Linux developers have made available. You end up writing a lot of scripts to interact with structures in the /proc and /sys hierarchies.

    Unfortunately, it seems like so many of the online Linux resources are dedicated towards the Linux hobbyist. When you're trying to support Linux in a production environment, you frequently end up trying to re-engineer methods and tools that are "standard equipment" on commercial UNIX releases. Because so many of the forums and other resources are geared towards hobbyists and neophytes, the questions you have aren't frequently broached on the forums (or, if they are, they aren't answered). So, there's a lot of mucking about to do.

    Fortunately, while a lot of the tools you might be used to on commercial UNIXes aren't in Linux, the Linux maintainers have at least provided the building blocks for recreating those tools. I just wish there were more and better resources documenting those building-blocks. It would make my job a lot easier. Re-engineering a tool is hard enough - having to do research to find the appropriate building-blocks is just a real time-suck.

    And, yes, having experience with all of those systems over all of those years, I realize that life as a UNIX SA/engineer/etc. has always been part DIY. It's just that you get used to only having to know a given OS's toolset rather than having to write things on your own. You take for granted the advances that the mature, commercial UNIX OSes have habituated you to and that are simply absent in Linux.

    Oh well, live and learn, I suppose. (cue The Timewarp music).