
Tuesday, September 3, 2013

Password Encryption Methods

As a systems administrator, there are times when you have to find programmatic ways to update passwords for accounts. On some operating systems, the user account modification tools don't allow you to easily set passwords in a programmatic fashion. Solaris used to be kind of a pain in this regard when it came time to do an operations-wide password reset. Fortunately, Linux is a bit nicer about this.

The Linux `usermod` utility allows you to (relatively easily) specify a password in a programmatic fashion. The one "gotcha" of the utility is that it requires a hashed password-string rather than cleartext. The question becomes: how best to generate those hashes?

The answer will likely depend on your security requirements. If MD5 hashes are acceptable, then you can use OpenSSL or the `grub-md5-crypt` utility to generate them. If, however, your security requirements call for SHA256- or even SHA512-based hashes, neither of those utilities will work for you.

Newer Linux distributions (e.g. RHEL 6+) essentially replace the `grub-md5-crypt` utility with the `grub-crypt` utility. This utility supports not only the older MD5 that its predecessor supported, but also SHA256 and SHA512.

However, what do you do when `grub-crypt` is missing (e.g., you're running RedHat 5.x) or you just want one method that will work across different Linux versions (e.g., your operations environment consists of a mix of RHEL 5 and RHEL 6 systems)? While you can use a tool like `openssl` to do the dirty work, if your security requirements dictate an SHA-based hashing algorithm, it's not currently up to the task. If you want the SHAs in a cross-distribution manner, you have to leverage more generic tools like Perl or Python.

The following examples show how to create a hashed password-string from the cleartext password "Sm4<kT*TheFace". Some tools (e.g., OpenSSL's "passwd" command) allow you to choose between a fixed salt and a random salt. From the standpoint of being able to tell "did I generate this hash", a fixed salt can be useful; a random salt, however, may be marginally more secure. The Perl and Python methods pretty much demand that you specify a salt. In the examples below, the salt I'm using is "Ay4p":
  • Perl (SHA512) Method: `perl -e 'print crypt("Sm4<kT*TheFace", "\$6\$Ay4p\$");'`
  • Python (SHA512) Method: `python -c 'import crypt; print crypt.crypt("Sm4<kT*TheFace", "$6$Ay4p$")'`
Note that you select the encryption-type by specifying a numerical representation of the standard encryption-types. The standard encryption-types for Linux operating systems (from the crypt() manpage) are:
  • 1  = MD5
  • 2a = BlowFish (not present in all Linux distributions)
  • 5  = SHA256 (Linux with GlibC 2.7+)
  • 6  = SHA512 (Linux with GlibC 2.7+) 
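To put one of these hashes to use, feed it straight to `usermod`. The following is a minimal sketch of that hand-off, assuming a hypothetical account named "appadmin"; any of the one-liners above could be substituted into the command substitution:

    # generate the SHA512 hash and pass it to usermod as the encrypted password
    HASH=$(python -c 'import crypt; print crypt.crypt("Sm4<kT*TheFace", "$6$Ay4p$")')
    usermod -p "${HASH}" appadmin

Bear in mind that passing the hash on the command line means it can show up in shell history and `ps` output, so treat the invocation accordingly.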

    Friday, January 11, 2013

    LVM Online Relayout

    Prior to coming to the Linux world, most of my complex, software-based storage taskings were performed under the Veritas Storage Foundation framework. In recent years, working primarily in virtualized environments, most storage tasks are done "behind the scenes" - either at the storage array level or within the context of VMware. Up until today, I had no cause to worry about converting filesystems from using one underlying RAID-type to another.

    Today, someone wanted to know, "how do I convert from a three-disk RAID-0 set to a six-disk RAID-10 set?" Under Storage Foundation, this is just an online relayout operation - converting from a simple volume to a layered volume. Until I dug into it, I wasn't aware that LVM was capable of layered volumes, let alone online conversion from one volume-type to another.

    At first, I thought I was going to have to tell the person (since Storage Foundation wasn't an option for them), "create your RAID-0 sets with `mdadm` and then layer RAID-1 on top of those MD-sets with LVM". Turns out, you can do it in LVM (I spun up a VM in our lab and worked through it).

    Basically the procedure assumes that you'd previously:
    1. Attached your first set of disks/LUNs to your host
    2. Used the usual LVM tools to create your volumegroup and LVM objects (in my testing scenario, I set up a three-disk RAID-0 with a 64KB stripe-width)
    3. Created and mounted your filesystem.
    4. Gone about your business.
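    As a rough sketch, steps 1 through 3 might look like the following (the device names /dev/sdb through /dev/sdd and the mount point /opt/app are hypothetical; the sizes match the test scenario shown below):

        pvcreate /dev/sdb /dev/sdc /dev/sdd
        vgcreate AppVG /dev/sdb /dev/sdc /dev/sdd
        # three-way stripe with a 64KB stripe-width, leaving some free PEs behind
        lvcreate -n AppVol -i 3 -I 64 -L 408M AppVG
        mkfs -t ext3 /dev/AppVG/AppVol
        mount /dev/AppVG/AppVol /opt/app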
    Having done the above, your underlying LVM configuration will look something like:
    # vgdisplay AppVG
      --- Volume group ---
      VG Name               AppVG
      System ID             
      Format                lvm2
      Metadata Areas        3
      Metadata Sequence No  2
      VG Access             read/write
      VG Status             resizable
      MAX LV                0
      Cur LV                1
      Open LV               0
      Max PV                0
      Cur PV                3
      Act PV                3
      VG Size               444.00 MB
      PE Size               4.00 MB
      Total PE              111
      Alloc PE / Size       102 / 408.00 MB
      Free  PE / Size       9 / 36.00 MB
      VG UUID               raOK8i-b0r5-zlcG-TEqE-uCcl-VM3L-RelQgX
    # lvdisplay /dev/AppVG/AppVol 
      --- Logical volume ---
      LV Name                /dev/AppVG/AppVol
      VG Name                AppVG
      LV UUID                6QuQSv-rklG-pPv6-Tq6I-TuI0-N50T-UdQ4lu
      LV Write Access        read/write
      LV Status              available
      # open                 1
      LV Size                408.00 MB
      Current LE             102
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     768
      Block device           253:7
    Take special note that there are free PEs available in the volumegroup. In order for the eventual relayout to work, you have to leave space in the volume group for LVM to do its reorganizing magic. I've found that a 10% set-aside has been safe in testing scenarios - possibly even overly generous. In a large, production configuration, that set-aside may not be enough.

    When you're ready to do the conversion from RAID-0, add a second set of identically-sized disks to the system. Format the new devices and use `vgextend` to add them to the volumegroup.
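    As a rough sketch (with hypothetical new devices /dev/sde through /dev/sdg), that growth step might look like:

        pvcreate /dev/sde /dev/sdf /dev/sdg
        vgextend AppVG /dev/sde /dev/sdf /dev/sdg
        # confirm that the volumegroup now has enough free PEs for the mirror
        vgdisplay AppVG | grep Free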

    Note: Realistically, so long as you increase the number of available blocks in the volumegroup by at least 100%, it likely doesn't matter whether you add the same number/composition of disks to the volumegroup. Differences in mirror compositions will mostly be a performance rather than an allowed-configuration issue.

    Once the volumegroup has been sufficiently-grown, use the command `lvconvert -m 1 /dev/<VolGroupName>/<VolName>` to change the RAID-0 set to a RAID-10 set. The `lvconvert` works with the filesystem mounted and in operation - technically, there's no requirement to take an outage window to do the operation. As the `lvconvert` runs, it will generate progress information similar to the following:
    AppVG/AppVol: Converted: 0.0%
    AppVG/AppVol: Converted: 55.9%
    AppVG/AppVol: Converted: 100.0%
    Larger volumes will take a longer period of time to convert (activity on the volume will also increase the time required for conversion). Output is generated at regular intervals. The longer the operation takes, the more lines of status output that will be generated.
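    If you'd rather poll the progress yourself, `lvs` reports a copy-percentage for mirrored volumes (the column is labeled "Copy%" on older LVM releases and "Cpy%Sync" on newer ones), so something like the following works as a crude progress monitor:

        watch -n 30 'lvs AppVG/AppVol'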

    Once the conversion has completed, you can verify that your RAID-0 set is now a RAID-10 set with the `lvdisplay` tool:
    lvdisplay /dev/AppVG/AppVol 
      --- Logical volume ---
      LV Name                /dev/AppVG/AppVol
      VG Name                AppVG
      LV UUID                6QuQSv-rklG-pPv6-Tq6I-TuI0-N50T-UdQ4lu
      LV Write Access        read/write
      LV Status              available
      # open                 1
      LV Size                408.00 MB
      Current LE             102
      Mirrored volumes       2
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     768
      Block device           253:7
    The addition of the "Mirrored Volumes" line indicates that the logical volume is now a mirrored RAID-set.

    Wednesday, October 31, 2012

    No-Post Reboots

    One of the things that's always sorta bugged me about Linux was that reboots generally required a full reset of the system. That is, if you did an `init 6`, the default behavior caused your system to drop back down to the boot PROM and go through its POST routines. On a virtualized system, this is mostly an inconvenience, as virtual hardware POST is fairly speedy. However, when you're running on physical hardware, it can be a genuine hardship (the HP-based servers I typically work with can take upwards of ten to fifteen minutes to run their POST routines).
    At any rate, a few months back I was just dinking around online and found a nifty method for doing a quick, no-BIOS reboot:
    # BOOTOPTS=`cat /proc/cmdline` ; KERNEL=`uname -r` ; \
    kexec -l /boot/vmlinuz-${KERNEL} --initrd=/boot/initrd-"${KERNEL}".img \
    --append="${BOOTOPTS}" ; reboot
    Basically, the above:
    1. Reads your /proc/cmdline file to get the boot arguments from your most recent bootup and stuffs the value into the BOOTOPTS environment variable
    2. Grabs your system's currently-running kernel release (in case you've got multiple kernels installed and want to boot back into the current one) and stuffs the value into the KERNEL environment variable
    3. Calls the `kexec` command (a nifty utility for directly booting into a new kernel), leveraging the previously-set environment variables to tell `kexec` what to do
    4. Finishes with a reboot to the `kexec`-defined kernel (and options)
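    One caveat: the initial ramdisk's filename differs between releases. On RHEL 6-style systems it's initramfs-<version>.img rather than initrd-<version>.img, so the equivalent invocation there (same logic, just the adjusted path) would be something like:

        # BOOTOPTS=`cat /proc/cmdline` ; KERNEL=`uname -r` ; \
        kexec -l /boot/vmlinuz-${KERNEL} --initrd=/boot/initramfs-"${KERNEL}".img \
        --append="${BOOTOPTS}" ; reboot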

    Testing it on a physical server that normally takes about 15 minutes to reboot (10+ minutes of POST routines), this method sped the reboot up by over 66%. On a VM, it only saves maybe a minute (though that will depend on your VM's configuration settings).

    Thursday, August 9, 2012

    Why So Big

    Recently, while working on getting a software suite ready for deployment, I had to find space in our certification testing environment (where our security guys scan hosts/apps and decide what needs to be fixed for them to be safe to deploy). Our CTA environment is unfortunately tight on resources. The particular app I have to get certified wants 16GB of RAM to run in but will accept as little as 12GB (less than that and the installer utility aborts).

    When I went to submit my server (VM, actually) requirements to our CTA team so they could prep me an appropriate install host, they freaked. "Why does it take so much memory" was the cry. So, I dug through the application stack.

    The application includes an embedded Oracle instance that wants to reserve about 9GB for its SGA and other set-asides. It's going on a 64bit RedHat server and RedHat typically wants 1GB of memory to function acceptably (can go down to half that, but you won't normally be terribly happy). That accounted for 10GB of the 12GB minimum the vendor was recommending.

    Unfortunately, the non-Oracle components of the application stack didn't seem to have a single file that described memory set asides. It looked like it was spinning up two Java processes with an aggregate heap size of about 1GB.

    Added to the prior totals, the aggregated heap sizes put me at about 11GB of the vendor-specified 12GB. That still left 1GB unaccounted for. Now, it could have been that the vendor was requesting 12GB because it was a "nice round number", or they could have been adding some slop to their equations to give the app a bit more wiggle-room.

    I could have left it there, but decided, "well, the stack is running, lets see how much it really uses". So, I fired up top. Noticed that the Oracle DB ran under one userid and that the rest of the app-stack ran under a different one. I set top to look only at the userid used by the rest of the app-stack. The output was too long to fit on one screen and I was too lazy to want to add up the RSS numbers, myself. Figured since top wasn't a good avenue, I might be able to use ps (since the command supports the Berkeley-style output options).

    Time to hit the man pages...

    After digging through the man pages and a bit of cheating (Google is your friend) I found the invocation of ps that I wanted:

    `ps -u <appuser> -U <appuser> -orss=`.

    Horse that to a nice `awk '{ sum += $1 } END { print sum }'` and I had a quick method of divining how much resident memory the application was actually eating up. What I found was that the app-stack had 52 processes (!) that had about 1.7GB of resident memory tied up. Mystery solved.
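    Strung together (with a hypothetical app-stack userid of "appsvc"), the whole check is a one-liner; `ps` reports RSS in kilobytes, so dividing by 1024 gives megabytes:

        ps -u appsvc -U appsvc -orss= | awk '{ sum += $1 } END { print sum / 1024, "MB" }'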

    Tuesday, July 31, 2012

    Finding Patterns

    I know that most of my posts are of the nature "if you're trying to accomplish 'X', here's a way that you can do it". Unfortunately, this time, my post is more of a "I've got yet-to-be-solved problems going on". So, there's no magic-bullet fix currently available for those with similar problems who found this article. That said, if you are suffering similar problems, know that you're not alone. Hopefully that's some small consolation, and what follows may even help you in investigating your own problem.

    Oh: if you have suggestions, I'm all ears. Given the nature of my configuration, there's not much in the way of useful information I've yet found via Google. Any help would be much appreciated...



    The shop I work for uses Symantec's Veritas NetBackup product to perform backups of physical servers. As part of our effort to make more of the infrastructure tools we use more enterprise-friendly, I opted to leverage NetBackup 7.1's NetBackup Access Control (NBAC) subsystem. On its own, it provides fine-grained rights-delegation and role-based access control. Horse it to Active Directory and you're able to roll out a global backup system with centralized authentication and rights-management. That is, you have all that when things work.

    For the past couple months, we've been having issues with one of the dozen NetBackup domains we've deployed into our global enterprise. When I first began troubleshooting the NBAC issues, the authentication and/or authorization failures had always been associated with a corruption of LikeWise's sqlite cachedb files. At the time the issues first cropped up, these corruptions always seemed to coincide with DRS moving the NBU master server from one ESX host to another. It seemed like, when under sufficiently heavy load - the kind of load that would trigger a DRS event - LikeWise didn't respond well to having the system paused and moved. Probably something to do with the sudden apparent time-jump that happens when a VM is paused for the last parts of the DRS action. My "solution" to the problem was to disable automated relocation for my VM.

    This seemed to stabilize things. LikeWise was no longer getting corrupted, and it seemed like I'd been able to stabilize NBAC's authentication and authorization issues. Well, they stabilized for a few weeks.

    Unfortunately, the issues have begun to manifest themselves, again, in recent weeks. We've now had enough errors that some patterns are starting to emerge. Basically, it looks like something is horribly bogging the system down around the time that the nbazd crashes are happening. I'd located all the instances of nbazd crashing from its log files ("ACI" events are logged to the /usr/openv/netbackup/logs/nbazd logfiles), and then began to try correlating them with system load shown by the host's sadc collections. I found two things: 1) I probably need to increase my sample frequency - it's currently at the default 10-minute interval - if I want to more thoroughly pin down and/or profile the events; 2) when the crashes have happened within a minute or two of an sadc poll, I've found that the corresponding poll was either delayed by a few seconds to a couple minutes or was completely missing. So, something is causing the server to grind to a standstill and nbazd is a casualty of it.

    For the sake of thoroughness (and what's likely to have matched on a Google-search and brought you here), what I've found in our logs are messages similar to the following:
    /usr/openv/netbackup/logs/nbazd/vxazd.log
    07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3004 Error encountered during ACI repository operation.
    07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3078 Fatal error encountered. (txn.c:964)
    07/28/2012 05:11:48 AM VxSS-vxazd LOG V-18-4204 Server is stopped.
    07/30/2012 01:13:31 PM VxSS-vxazd LOG V-18-4201 Server is starting.
    /usr/openv/netbackup/logs/nbazd/debug/vxazd_debug.log
    07/28/2012 05:11:48 AM Unable to set transaction mode. error = (-1)
    07/28/2012 05:11:48 AM SQL error S1000 -- [Sybase][ODBC Driver][SQL Anywhere] Connection was terminated
    07/28/2012 05:11:48 AM Database fatal error in transaction, error (-1)
    07/30/2012 01:13:31 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
    07/30/2012 01:22:40 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
    Our NBU master servers are hosted on virtual machines. It's a supported configuration and adds a lot of flexibility and resiliency to the overall enterprise-design. It also means that I have some additional metrics available to me to check. Unfortunately, when I checked those metrics, while I saw utilization spikes on the VM, those spikes corresponded to healthy operations of the VM. There weren't any major spikes (or troughs) during the grind-downs. So, to ESX, the VM appeared to be healthy.

    At any rate, I've requested our ESX folks see if there might be anything going on on the physical systems hosting my VM that aren't showing up in my VM's individual statistics. I'd previously had to disable automated DRS actions to keep LikeWise from eating itself - those DRS actions wouldn't have been happening had the hosting ESX system not been experiencing loading issues - perhaps whatever was causing those DRS actions is still afflicting this VM.

    I've also tagged one of our senior NBU operators to start picking through NBU's logs. I've asked him to look to see if there are any jobs (or combinations of jobs) that are always running during the bog-downs. If it's a scheduling issue (i.e., we're to blame for our problems), we can always reschedule jobs to exert less loading or we can scale up the VM's memory and/or CPU reservations to accommodate such problem jobs.

    For now, it's a waiting-game. At least there's an investigation path, now. It's all in finding the patterns.

    Wednesday, April 25, 2012

    When VMs Go *POOF!*

    Like many organizations, the one I work for is following the pathway to the cloud. Right now, this consists of moving towards an Infrastructure As A Service model. This model heavily leverages virtualization. As we've been pushing towards this model, we've been doing the whole physical to virtual dance, wherever possible.

    In many cases, this has worked well. However, like many large IT organizations, we have found some assets on our datacenter floors that have been running for years on both outdated hardware and outdated operating systems (and, in some cases, "years" means "in excess of a decade"). Worse, the people that set up these systems are often no longer employed by our organization. Worse still, the vendors of the software in use long-ago discontinued either the support of the software version in use or no longer maintain any version of the software.

    One such customer was migrated last year. They weren't happy about it at the time, but they made it work. Because we weren't just going to image their existing out-of-date OS into the pristine, new environment, we set them up with a virtualized operating environment that was as close to what they'd had as was possible. Sadly, what they'd had was a 32bit RedHat Linux server that had been built 12 years previously. Our current offerings only go back as far as RedHat 5, so that's what we built them. While our default build is 64bit, we offer 32bit builds - unfortunately, the customer never requested a 32bit environment. The customer did their level best to massage the new environment into running their old software. Much of this was done by tar'ing up directories on the old system and blowing them onto their new VM. They'd been running on this massaged system for nearly six months.

    If you noticed that 32-to-64bit OS-change, you probably know where this is heading...

    Unfortunately, as is inevitable, we had to take a service outage across the entire virtualized environment. We don't yet have the capability in place to live migrate VMs from one data center to another for such windows. Even if we had, the nature of this particular outage (installation of new drivers into each VM) was such that we had to reboot the customer's VM, any way.

    We figured we were good to go as we had 30+ days of nightly backups of each impacted customer VM. Sadly, this particular customer, after doing the previously-described massaging of their systems, had never bothered to reboot their system. It wasn't (yet) in our procedures to do a precautionary pre-modification reboot of the customers' VMs. The maintenance work was done and the customer's VM rebooted. Unfortunately, the system didn't come back from the reboot. Even more unfortunately, the thirty-ish backup images we had for the system were similarly unbootable and, worse, unrescuable.

    Eventually, we tracked down the customer to inform them of the situation and to find out if the system was actually critical (the VM had been offline for nearly a full week by the time we located the system owner, but no angry "where the hell is my system" calls had been received or tickets opened). We were a bit surprised to find that this was, somehow, a critical system. We'd been able to access the broken VM to a sufficient degree to determine that it hadn't been rebooted in nearly six months, that its owner hadn't logged in for nearly that same amount of time, but that the on-disk application data appeared to be intact (the filesystems they were on were mountable without errors). So, we were able to offer the customer the option of building them a new VM and helping them migrate their application data off the old VM to the new VM.

    We'd figured we'd just restore data from the old VM's backups to a directory tree on the new VM. The customer, however, wanted the original disks back as mounted disks. So, we had to change plans.

    The VMs we build make use of the Linux Volume Manager software to manage storage allocation. Each customer system is built off of standard templates. Thus, each VM's default/root volume groups all share the same group name. Trying to import another host's root volume group onto a system that already has a volume group of the same name tends to be problematic in Linux. That said, it's possible to massage (there's that word again) things to allow you to do it.

    The customer's old VM wasn't bootable to a point where we could simply FTP/SCP its /etc/lvm/backup/RootVG file off. The security settings on our virtualization environment also meant we couldn't cut and paste from the virtual console into a text file on one of our management hosts. Fortunately, our backup system does support file-level/granular restores. So, we pulled the file from the most recent successful backup.

    Once you have an /etc/lvm/backup/<VGNAME> file available, you can effect a name change on a volume group relatively easily. Basically, you:

    1. Create your new VM
    2. Copy the old VM's  /etc/lvm/backup/<VGNAME> into the new VM's /etc/lvm/backup directory (with a new name)
    3. Edit that file, changing the old volume group's object names to ones that don't collide with the currently-booted root volume group (really, you only need to change the name of the root volume group object - the volumes can be left as-is)
    4. Connect the old VM's virtual disk to the new VM
    5. Perform a rescan of the VM's scsi bus so it sees the old VM's virtual disk
    6. Do a `vgcfgrestore -f  /etc/lvm/backup/<VGNAME> <NewVGname>` 
    7. Do a `pvscan`
    8. Do a `vgscan`
    9. Do a `vgchange -ay <NewVGname>`
    10. Mount up the renamed volume group's volumes
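    Strung together, the procedure might look something like the following sketch. All of the names here are hypothetical: the old VM's root volume group is assumed to have been renamed to "OldRootVG" in the copied backup file, and "LogVol00" stands in for whichever logical volumes you actually need to mount:

        # restored copy of the old VM's /etc/lvm/backup/<VGNAME>, saved under the new name
        cp /tmp/RootVG.restored /etc/lvm/backup/OldRootVG
        vi /etc/lvm/backup/OldRootVG                     # change the volume group name inside to OldRootVG
        echo "- - -" > /sys/class/scsi_host/host0/scan   # rescan the SCSI bus (repeat per hostN, as needed)
        vgcfgrestore -f /etc/lvm/backup/OldRootVG OldRootVG
        pvscan
        vgscan
        vgchange -ay OldRootVG
        mount /dev/OldRootVG/LogVol00 /mnt/oldroot       # mount whichever volumes you need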

    As a side note (to answer "why didn't you just do a `boot linux rescue` and change things that way"): our security rules prevent us from keeping bootable ISOs (etc.) available to our production environment. Therefore, we can't use any of the more "normal" methods for diagnosing or recovering a lost system. Security rules trump diagnostics and/or recoverability needs. Dems da breaks.

    Asymmetrical NIC-bonding

    Currently, the organization I am working for is making the transition from 1Gbps networking infrastructure to 10Gbps infrastructure. The initial goal had been to first migrate all of the high network-IO servers that were using trunked 1Gbps interfaces to using 10Gbps Active/Passive configurations.
    Given the current high per-port cost of 10Gbps networking, it was requested that a way be found to not waste 10Gbps ports. Having a 10Gbps port sitting idle "in case" the active port became unavailable was seen as financially wasteful. As a result, we opted to pursue the use of asymmetrical A/P bonds that used our new 10Gbps links for the active/primary path and reused our 1Gbps infrastructure for the passive/failover path.

    Setting up bonding on Linux can be fairly trivial. However, when you start to do asymmetrical bonding, you want to ensure that your fastest paths are also your active paths. This requires some additional configuration of the bonded pairs beyond just the basic declaration of the bond memberships.
    In a basic bonding setup, you'll have three primary files in the /etc/sysconfig/network-scripts directory: ifcfg-ethX, ifcfg-ethY and ifcfg-bondZ. The ifcfg-ethX and ifcfg-ethY files are basically identical but for their DEVICE and HWADDR parameters. At their most basic, they'll each look (roughly) like:

    DEVICE=ethN
    HWADDR=AA:BB:CC:DD:EE:FF
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bondZ
    SLAVE=yes

    And the (basic) ifcfg-bondZ file will look like:

    DEVICE=bondZ
    ONBOOT=yes
    BOOTPROTO=static
    NETMASK=XXX.XXX.XXX.XXX
    IPADDR=WWW.XXX.YYY.ZZZ
    MASTER=yes
    BONDING_OPTS="mode=1 miimon=100"
    
    
    This type of configuration may produce the results you're looking for, but it's not guaranteed to. If you want to absolutely ensure that your faster NIC will be selected as the primary NIC (and that it will fail back to that NIC in the event that the faster NIC goes offline and then back online), you need to be a bit more explicit with your ifcfg-bondZ file. To do this, you'll mostly want to modify your BONDING_OPTS directive. I also tend to add some BONDING_SLAVEn directives, but that might be overkill. Your new ifcfg-bondZ file that forces the fastest path will look like:

    DEVICE=bondZ
    ONBOOT=yes
    BOOTPROTO=static
    NETMASK=XXX.XXX.XXX.XXX
    IPADDR=WWW.XXX.YYY.ZZZ
    MASTER=yes
    BONDING_OPTS="mode=1 miimon=100 primary=ethX primary_reselect=1"
    
    
    
    

    The primary= tells the bonding driver to set the ethX device as primary when the bonding-group first onlines. The primary_reselect= tells it to use an interface reselection policy of "better".

    Note: The default policy is "0". This policy simply says "return to the interface declared as primary". I chose to override with policy "1" as a hedge against the primary interface coming back in some kind of degraded state (while most of our 10Gbps media is 10Gbps-only, some of the newer ones are 100/1000/10000). I only want to fail back to the 10Gbps interface if it's still running at 10Gbps and hasn't, for some reason, negotiated down to some slower speed.

    When using the more explicit bonding configuration, the resultant configuration will resemble something like:

    Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: ethX (primary_reselect better)
    Currently Active Slave: ethX
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0

    Slave Interface: ethX
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: AA:BB:CC:DD:EE:FF

    Slave Interface: ethY
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: FF:EE:DD:CC:BB:AA
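    (The status shown above is just the contents of the bonding driver's procfs status file, so it can be checked at any time with something like the following.)

        cat /proc/net/bonding/bondZ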

    Tuesday, February 21, 2012

    Cross-Platform Sharing Pains

    All in all, the developers of Linux seem to have done a pretty good job of ensuring that Linux is able to integrate with other systems. Chief among this integration has been in the realm of sharing files between Linux and Windows hosts. Overall, the CIFS support has been pretty damned good and has allowed Linux to lead in this area compared to other, proprietary *N*X OSes.

    That said, Microsoft (and others) seem to like to create a moving target for the Open Source community to aim at. If you have Windows 2008-based fileservers in your operations and are trying to get your Linux hosts to mount shares coming off these systems, you may have run into issues with doing so. This is especially so if your Windows share-servers are set up with high security settings and you're trying to use service names to reference those share servers (i.e., the Windows-based fileserver may have a name like "dcawfs0035n" but you might have an alias like "repository5").

    Normally, when mounting a CIFS fileserver by its real name, you'll do something like:

    # mount -t cifs "//dcawfs0035n/LXsoftware" /mnt/NAS -o domain=mydomain,user=myuser

    And, assuming the credentials you supply are correct, the URI is valid and the place you're attempting to mount to exists and isn't busy, you'll end up with that CIFS share mounted on your Linux host. However, if you try to mount it via an alias (e.g., a CNAME in DNS):

    # mount -t cifs "//repository5/LXsoftware" /mnt/NAS -o domain=mydomain,user=myuser

    You'll get prompted for your password - as per normal - but, instead of being rewarded with the joy of your CIFS share mounted to your Linux host, you'll get an error similar to the following:

    mount error 5 = Input/output error
    Refer to the mount.cifs(8) manual page (e.g.man mount.cifs)

    Had you fat-fingered your password, you'd have gotten a "mount error 12" (permission denied), instead. The above results from strict name checking being performed on the share mount attempt. Because you've attempted to connect with an alias, the name-checking fails and you get the above denial. You can verify that this is the underlying cause by re-attempting the mount with the fileserver's real name. If that succeeds where the alias failed, you'll know where to go next.

    The Microsoft-published solution is found in KB281308. To summarize, you'll need to:

    • Have admin and login rights on the share-server
    • Login to the share-server
    • Fire up regedit
    • Navigate to "HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters"
    • Create a new DWORD parameter named "DisableStrictNameChecking"
    • Set its value to "1"
    • Reboot the fileserver
    • Retry your CIFS mount attempt.

    At this point, your CIFS mount should succeed.

    Interestingly, if you've ever tried to connect to this share from another Windows host not in the share-server's domain (e.g., from a host on a different Windows domain that doesn't have cross-realm trusts set up, or a standalone Windows client), you will probably have experienced connection errors, as well. Typical error messages being something on the order of "account is not allowed to logon from this resource" or just generally refusing to accept what you know to be a good set of credentials.

    Monday, May 9, 2011

    `iptables` for the Obsessive Compulsive

    While I probably don't meet the clinical definition for being obsessive compulsive, I do tend to like to keep things highly organized. This is reflected heavily in the way I like to manage computer systems.
    My employer, as part of the default security posture for production Linux systems, requires the use of iptables. If you've ever looked at an iptables file, they tend to be a spaghetti of arcana. Most tables start out fairly basic and might look something like:
    -A INPUT -i lo -j ACCEPT
       -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
       -A INPUT -p udp -m udp --dport 53 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 53 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 25 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 587 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 993 -j ACCEPT
       -A INPUT -j REJECT --reject-with icmp-host-prohibited
    
    This would probably be typical of a LAMP server that's also providing DNS and mail services. As it stands, it's fairly manageable and easy to follow if you've got even a slight familiarity with iptables or even firewalls in general.
    Where iptables starts to become unfun is when you start to get fancy with it. I started going down this "unfun" path when I put in place a defense against SSHD brute-forcers. I had to add a group of rules just to handle what, above, was done with a single line. Initially, this started to "spaghettify" my iptables configuration. It ended up making the above look like:
    -A INPUT -i lo -j ACCEPT
       -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
       -A INPUT -p udp -m udp --dport 53 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 53 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --set --name ssh_safe --rsource
       -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 300 --hitcount 3 --name ssh_safe --rsource -j LOG --log-prefix "SSH CONN. REJECT: "
       -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 300 --hitcount 3 --name ssh_safe --rsource -j DROP
       -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 25 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 587 -j ACCEPT
       -A INPUT -p tcp -m tcp --dport 993 -j ACCEPT
       -A INPUT -j REJECT --reject-with icmp-host-prohibited
    
    Not quite as straight-forward any more. Not "tidy", as I tend to refer to these kinds of things. It gave me that "tick" I get whenever I see something messy. So, how to fix it? Well, my first step was to use iptables' comments module. That allowed me to make the configuration a bit more self-documenting (if you ever look at my shell scripts or my "real" programming, they're littered with comments - makes it easier to go back and remember what the hell you did and why). However, it still "wasn't quite right". So, I decided, "I'll dump all of those SSH-related rules into a single rule group" and then reference that group from the main iptables policy:
    -A INPUT -i lo -j ACCEPT
    -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
    -A INPUT -p udp -m udp --dport 53 -j ACCEPT
    -A INPUT -p tcp -m comment --comment "Forward to SSH attack-handler" -m tcp --dport 22 -j ssh-defense
    -A INPUT -p tcp -m tcp --dport 53 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 25 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 587 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 993 -j ACCEPT
    -A ssh-defense -p tcp -m comment --comment "SSH: track" -m tcp --dport 22 -m state --state NEW -m recent --set --name ssh_safe --rsource
    -A ssh-defense -p tcp -m comment --comment "SSH: attack-log" -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 300 --hitcount 3 --name ssh_safe --rsource -j LOG --log-prefix "SSH CONN. REJECT: "
    -A ssh-defense -p tcp -m comment --comment "SSH: attack-block" -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 300 --hitcount 3 --name ssh_safe --rsource -j DROP
    -A ssh-defense -p tcp -m comment --comment "SSH: accept" -m tcp --dport 22 -j ACCEPT
    
    Ok, so the above doesn't really look any less spaghetti-like. That's ok. This isn't exactly where we de-spaghettify things. The above is mostly meant to be machine-read. If you want to see the difference, use the `iptables -L` command. Or, to really see the difference, issue `iptables -L INPUT ; iptables -L ssh-defense`:
    Chain INPUT (policy ACCEPT)
       target     prot opt source               destination
       ACCEPT     all  --  anywhere             anywhere
       DROP       all  --  anywhere             loopback/8
       ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
       ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
       ssh-defense  tcp  --  anywhere             anywhere            /* Forward to SSH attack-handler */ tcp dpt:ssh
       ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
       ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:http
       ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:https
       ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:smtp
       ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:submission
       ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:imaps
    
       Chain ssh-defense (1 references)
       target     prot opt source               destination
                  tcp  --  anywhere             anywhere            /* SSH: track */ tcp dpt:ssh state NEW recent: SET name: ssh_safe side: source
       LOG        tcp  --  anywhere             anywhere            /* SSH: attack-log */ tcp dpt:ssh state NEW recent: UPDATE seconds: 300 hit_count: 3 name: ssh_safe side: source LOG level warning prefix `SSH CONN. REJECT: '
       DROP       tcp  --  anywhere             anywhere            /* SSH: attack-block */ tcp dpt:ssh state NEW recent: UPDATE seconds: 300 hit_count: 3 name: ssh_safe side: source
       ACCEPT     tcp  --  anywhere             anywhere            /* SSH: accept */ tcp dpt:ssh
    
    Even if you don't find the above any more self-documenting or easier to handle (in a wide xterm, it looks much better), it does have one other value: it makes it harder for people to muck up whatever flow or readability that your iptables configuration has. Because you've externalized a group of directives, someone's going to have to go out of their way to intersperse random rules into your iptables configuration. If it's just your own server, this probably has little value (unless you've got MPD). However, if you have shared administration duties, it can be a sanity-saver.

    Thursday, May 5, 2011

    Linux Active Directory Integration and PAM

    Previously, I've written about using LikeWise to provide Active Directory integration to Linux and Solaris hosts. One of the down sides of LikeWise (and several other similar integration tools) is that it tends to make it such that, if a user has an account in Active Directory, they can log into the UNIX or Linux boxes you've bound to your domain. In fact, while walking someone through setting up LikeWise with the automated configuration scripts I'd written, that person asked, "you mean anyone with an AD account can log in?"

    Now, this had occurred to me when I was testing the package for the engineer who was productizing LikeWise for our enterprise build. But, it hadn't really been a priority, at the time. Unfortunately, when someone who isn't necessarily a "security first" kind of person hits you with that question/observation, you know that the folks for whom security is more of a "Job #1" are eventually going to come for you (even if you weren't the one who was responsible for engineering the solution). Besides, I had other priorities to take care of.

    This week was a semi-slack week at work. There was some kind of organizational get-together going on that had most of the IT folks out of town discussing global information technology strategies. Fortunately, I'd not had to take part in that event. So, I've spent the week revisiting some stuff I'd rolled out (or been part of the rollout of) but wasn't completely happy with. The "AD integration giving everyone access" thing was one of them. So, I began by consulting the almighty Google. When I'd found stuff that seemed promising, I fired up a test VM and started trying it out.

    Now, SSH (and several other services) weren't really a problem. Many applications allow you to internally regulate who can use the service. For example, with OpenSSH, you can modify the sshd_config file to explicitly define which users and groups can and cannot access your box through that service (for those of you who hit this page looking for tips, do a `man sshd_config` and grep for AllowUsers and AllowGroups for more in-depth information). Unfortunately, it's predictable enough to figure that people that are gonna whine about AD integration giving away the farm are gonna bitch if you tell them they have to modify the configuration of each and every service they want to protect. No, most people want to be able to go to one place and take care of things with one action or one set of consistent actions. I can't blame them: I feel the same way. Everyone wants things done easily. Part of "easily" generally implies "consistently" and/or "in one place".
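    For what it's worth, that per-service approach for OpenSSH boils down to a directive or two in sshd_config; a minimal sketch (the group names are hypothetical) looks like:

        # /etc/ssh/sshd_config
        AllowGroups wheel unix_admins

    Anyone not in a listed group gets turned away by sshd itself, regardless of what the AD integration layer would otherwise allow.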

    Fortunately, any good UNIX or Linux implementation leverages the Pluggable Authentication Management system (aka. PAM). There's about a bazillion different PAM modules out there that allow you to configure any given service's authentication to do or test a similar variety of attributes. My assumption for solving this particular issue was that, while there might be dozens or hundreds of groups (and thousands of users) in an Active Directory forest, one would only want to grant a very few groups access to an AD-bound UNIX/Linux host. So, I wasn't looking for something that made it easy to grant lots of groups access in one swell-foop. In fact, I was kind of looking for things that made that not an easy thing to do (after all, why lock stuff down if you're just going to blast it back open, again?). I was also looking for something that I could fairly reliably find on generic PAM implementations. The pam_succeed_if module is just about tailor-made for those requirements.

    LikeWise (and the other AD integration methods) add entries into your PAM system to allow users permitted by those authentication subsystems to log in, pretty much, unconditionally. Unfortunately, those PAM modules don't often include methods for controlling which users are able to log in once their AD authentication has succeeded. Since the PAM system uses stackable authentication modules, you can insert access controls earlier in the stack to cause a user's access to fail out before the AD module would otherwise grant it. If you wanted to allow users in AD_Group1 and AD_Group2 to log in, but not other groups, you'd modify your PAM stack to insert the control ahead of the AD allow module.

         account    [default=ignore success=1] pam_succeed_if.so user ingroup AD_Group1 quiet_success
         account    [default=ignore success=1] pam_succeed_if.so user ingroup AD_Group2 quiet_success
         account    [default=bad success=ignore] pam_succeed_if.so user ingroup wheel quiet_success
         account    sufficient    /lib/security/pam_lsass.so
    

    The above is processed such that if a user is a member of the AD-managed group "AD_Group1" or "AD_Group2", it sets the test's success flag to true. If the user isn't a member of those two groups, testing falls through to the next group check - is the user a member of the group wheel (if yes, fall through to the next test; if no, then there's a failure and the user's access is denied). The downside of using this particular PAM module is that it's only available to you on *N*X systems with a plethora of PAM modules. This is true for many Linux releases - and I know it to be part of RedHat-related releases - but probably won't be available on less PAM-rich *N*X systems (yet one more reason to cast Solaris on the dung-heap, frankly). If your particular *N*X system doesn't have it, you can probably find the sourcecode for it and build yourself the requisite module for your OS.

    Monday, May 2, 2011

    Vanity Linux Servers and SSH Brute-Forcers

    Let me start by saying that, for years (think basically since OpenSSH became available), I have run my personal, public-facing SSH services relatively locked-down. No matter what the default security posture for the application was - whether compiled from source or using the host operating system's defaults - the first thing I did was to ensure that PermitRootLogin was set to "no". I used to allow tunneled clear-text passwords (way back in the day), but even that I've habitually disabled for (probably) a decade, now. In other words, if you want to SSH into one of my systems, you had to do so as a regular user and you had to do it using key-based logins. Even if you did manage to break in as one of those non-privileged users, I used access controls to limit which users could elevate privileges to root.
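    In sshd_config terms, that posture boils down to a couple of directives:

        PermitRootLogin no
        PasswordAuthentication no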

    Now, I never went as far as changing the ports my SSH servers listened on. This always seemed kind of pointless. I'm sure there's plenty of script kiddies whose cracking-scripts don't look for services running on alternate ports, but I've never found much value relying on "security by obscurity".

    At any rate, I figured this was enough to keep me basically safe. And, to date, it seems to have. That said, I do periodically get annoyed at seeing my system logs filled with the "Too many authentication failures for root" and "POSSIBLE BREAK-IN ATTEMPT" messages. However, most of the solutions to such problems seemed to be log-scrapers that then blacklisted the attack sources. As I've indicated in prior posts, I'm lazy. Going through the effort of setting up log-scrapers and tying them to blacklisting scripts was more effort than I felt necessary to address something that seemed, primarily to be only a nuisance. So, I never bothered.

    I've also been a longtime user of tools like PortSentry (and its equivalents). So, I usually picked up attacks before they got terribly far. Unfortunately, as Linux has become more popular, there seems to be a lot more service-specific attacks and less broad-spectrum attacks (attacks preceded by probing of all possible entry points). Net result: I'm seeing more of the nuisance alerts in my logs.

    Still, there's that laziness thing. Fortunately, I'd recently sat through a RedHat training class. And, while I was absolutely floored when the instructor told me that even RHEL 6 still ships with PermitRootLogin set to "yes", he let me know that recent RHEL patch levels included iptables modules that made things like fail2ban somewhat redundant. Unfortunately, he didn't go into any further detail. So, I had to go and dig around for how to do it.

    Note: previously, I'd never really bothered with using iptables. I mean, for services that don't require Internet-at-large access, I'd always used things like TCPWrappers, or configured services to listen only on loopback or domain sockets, to prevent exposing them. Thus, with my systems, the only Internet-reachable ports were the ones that had to be. There never really seemed to be a point in enabling a local firewall when the system wasn't acting as a gateway to other systems. However, the possibility of leveraging iptables in a useful way kind of changed all that.

    Point of honesty, here: the other reason I'd never bothered with iptables was that its syntax was a tad arcane. While I'd once bothered to learn the syntax for ipfilter - a firewall solution with similarly arcane syntax - so that I could use a Solaris-based system as a firewall for my house, converting my ipfilter knowledge to iptables didn't seem worth the effort.

    So, I decided to dig into it. I read through manual pages. I looked at websites. I dug through my Linux boxes netfilter directories to see if I could find the relevant iptables modules and see if they were internally documented. Initially, I thought the iptables module my instructor had been referring to was the ipt_limit module. Reading up on it, the ipt_limit module looked kind of nifty. So, I started playing around with it. As I played with it (and dug around online), I found there was an even better iptables module, ipt_recent. I now assume the better module was the one he was referring to. At any rate, dinking with both, I eventually set about getting things to a state I liked.

    The first thing I did when setting up iptables was decide to be a nazi about my default security stance. That was accommodated with one simple rule: `iptables -P INPUT DROP`. If you start up iptables with no rules, you get the equivalent of a default INPUT filter rule of `iptables -P INPUT ACCEPT`. I'd seen some documentation where people like to set theirs to `iptables -P INPUT REJECT`. I like "DROP" better than "REJECT" - probably because it suits the more dickish side of me. I mean, if someone's going to chew up my system's resources by probing me or attempting to break in, why should I do them the favor of telling their TCP stack to end the connection immediately? Screw that: let their TCP stack send out SYNs and be ignored. Depending on whether they've cranked down their TCP stack, those unanswered SYNs will mean that they will end up with a bunch of connection attempts stuck in a wait sequence. Polite TCP/IP behavior says that, when you send out a SYN, you wait for an ACK for some predetermined period before you consider the attempt to be failed and execute your TCP/IP abort and cleanup sequence. That can be several tens of seconds to a few hours. During that interval, the attack source has resources tied up. If I sent a REJECT, they could go into immediate cleanup, meaning they can more quickly move onto their next attack with all their system resources freed up.

    The down side of setting your default policy to either REJECT or DROP is that it applies to all your interfaces. So, not only will your public-facing network connectivity cease, so will your loopback traffic. Depending on how tightly you want to secure your system, you could bother to iterate all of the loopback exceptions. Most people will probably find it sufficient to simply set up the rule `iptables -A INPUT -i lo -j ACCEPT`. Just bear in mind that more wily attackers can spoof things to make traffic appear to come through loopback and take advantage of that blanket exception to your DROP or REJECT rules (though this can be mitigated by setting up rules to block loopback traffic that appears on your "real" interfaces - something like `-A INPUT -i eth0 -d 127.0.0.0/8 -j DROP` will do it).

    The next thing you'll want to bear in mind with the default REJECT or DROP is that, without further fine-tuning, it will apply to each and every packet hitting that filterset. Some TCP/IP connections start on one port, but then get moved off to or involve other ports. If that happens, your connection's not gonna quite work right. One way to work around that is to use a state table to manage established or related connections. Use a rule like `iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT` to accommodate that.

    At this point you're ready to start punching the service-specific holes in your default-deny firewall. On a hobbyist or vanity type system, you might be running things like DNS, HTTP(S), SMTP, and IMAP. That will look like:

    -A INPUT -p udp -m udp --dport 53 -j ACCEPT       # DNS via UDP (typically used for individual DNS lookups)
    -A INPUT -p tcp -m tcp --dport 53 -j ACCEPT       # DNS via TCP (typically used for large zone transfers)
    -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT       # HTTP
    -A INPUT -p tcp -m tcp --dport 443 -j ACCEPT       # HTTP over SSL
    -A INPUT -p tcp -m tcp --dport 25 -j ACCEPT       # SMTP
    -A INPUT -p tcp -m tcp --dport 587 -j ACCEPT       # SMTP submission via STARTTLS
    -A INPUT -p tcp -m tcp --dport 993 -j ACCEPT       # IMAPv4 + SSL

    What the ipt_limit module gets you is the ability to rate-limit connection attempts to a service. This can be as simple as ensuring that only "so many" connections per second are allowed access to the service, limiting the number of connections per time interval per source, or outright blacklisting a source that connects too frequently.

    Doing the first can be done within the SSH and/or TCP Wrappers (or, for services run through xinetd, through your xinetd config). Downside of this is, since it's not distinguishing sources, if you're being attacked, you won't be able to get in since the overall number of connections will have been exceeded. Generally, potentially allowing others to lock you out of your own system is considered to be "not a Good Thing™ to do". But, if you want to risk it, add a rule that looks something like  `-A INPUT -m limit --limit 3/minute -m tcp -p tcp --dport 22 -j ACCEPT` to your iptables configuration and be on about your way (using the ipt_limit module).

    If you want to be a bit more targeted in your approach, the ipt_recent module can be leveraged. I used a complex of rules like the following:
    -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -m recent --set --name sshtrack --rsource
       -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 60 --hitcount 3 --name sshtrack --rsource -j LOG --log-prefix "ssh rejection: "
       -A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update  --seconds 60 --hitcount 3 --name sshtrack --rsource -j DROP
       -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
    
    What the above four rules do is:
    • For each new connection attempt to port 22, add the remote source address to the "sshtrack" tracking table
    • If this is the third such new connection within 60 seconds, update the remote source address entry in the tracking table and log rejection action
    • If this is the third such new connection within 60 seconds, update the remote source address entry in the tracking table and DROP the connection
    • Otherwise, accept the new connection.
    I could have chosen to simply "rcheck" rather than "update" the "sshtrack" table. However, by using "update", it essentially resets the time to live from the last connect attempt packet to whatever might be the next attempt. This way, you get the full sixty second window rather than (60 - ConnectInterval). If it becomes apparent that attackers start to use slow attacks to get past the rule, one can up the "seconds" from 60 to some other value. I chose 60 as a start. It might be reasonable to up it to 300 or even 900 since it's unlikely that I'm going to want to start more than three SSH sessions to the box within a 15 minute interval.
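    As an aside, the module's tracking table can be inspected through procfs if you want to see which addresses are currently being counted. The path varies with kernel version (older kernels expose it under ipt_recent, newer ones under xt_recent), so something like the following covers both:

        cat /proc/net/ipt_recent/sshtrack 2>/dev/null || cat /proc/net/xt_recent/sshtrack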

    As a bit of reference: on RHEL-based systems, you can check what iptables modules are available by listing out '/usr/include/linux/netfilter_ipv4/ipt_*'. You can then (for most) use `iptables -m [MODULE] --help` to show you the options for a given module. For example:
    # iptables -m recent --help | sed -n '/^recent v.*options:/,$p'
         recent v1.3.5 options:
         [!] --set                       Add source address to list, always matches.
         [!] --rcheck                    Match if source address in list.
         [!] --update                    Match if source address in list, also update last-seen time.
         [!] --remove                    Match if source address in list, also removes that address from list.
             --seconds seconds           For check and update commands above.
                                         Specifies that the match will only occur if source address last seen within
                                         the last 'seconds' seconds.
             --hitcount hits             For check and update commands above.
                                         Specifies that the match will only occur if source address seen hits times.
                                         May be used in conjunction with the seconds option.
             --rttl                      For check and update commands above.
                                         Specifies that the match will only occur if the source address and the TTL
                                         match between this packet and the one which was set.
                                         Useful if you have problems with people spoofing their source address in order
                                         to DoS you via this module.
             --name name                 Name of the recent list to be used.  DEFAULT used if none given.
             --rsource                   Match/Save the source address of each packet in the recent list table (default).
             --rdest                     Match/Save the destination address of each packet in the recent list table.
         ipt_recent v0.3.1: Stephen Frost .  http://snowman.net/projects/ipt_recent/
    
    This gives you the options for the "recent" iptables module and a URL for further information.

    Tuesday, November 30, 2010

    Linux (Networking): Oh How I Hate You

    Working with most commercial UNIX systems (Solaris, AIX, etc.) and even Windows, you take certain things for granted. Easy networking setup is one of those things. It seems to be particularly so for systems that are designed to work in modern, redundant networks. Setting up things like multi-homed hosts is relatively trivial. I dunno. It may just be that I'm so used to how commercial OSes do things that, when I have to deal with The Linux Way™ it seems hopelessly archaic and picayune.

    If I take a Solaris host that's got more than one network address on it, routing's pretty simple. I can declare one default route or I can declare a default route per interface/network. At the end of the day, Solaris's internal routing mechanisms just get it right. The only time there's really any mucking about is if I want/need to set up multiple static routes (or use dynamic routing protocols).

    Linux... Well, perhaps it's just the configuration I had to make work. Someone wanted me to get a system with multiple bonded interfaces, set up with VLAN tagging, to route properly. Having the commercial UNIX mindset, I figured "just declare a GATEWAY in each bonded interface's ifcfg file under /etc/sysconfig/network-scripts" and that would be the end of the day.
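    For context, the kind of per-interface file I'm talking about looks roughly like the following (a sketch of /etc/sysconfig/network-scripts/ifcfg-bond0.1033 - the device name and addresses match the routing output further down, but the exact parameter set for your distribution and bonding setup may differ):

    DEVICE=bond0.1033
    VLAN=yes
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=192.168.33.123
    NETMASK=255.255.255.0
    GATEWAY=192.168.33.254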

    Nope. It seems like Linux has a "last default route declared is the default route" design. Ok. I can deal with that. I mean, I used to have to deal with that with commercial UNIXes. So, I figured, "alright, only declare a default route in one interface's scriptfile". And, that sorta worked: I always got that one default route. Unfortunately, Linux's routing philosophy didn't allow things to fully work the way experience with other OSes might lead one to expect.

    On the system I was asked to configure, one of the interfaces happened to be on the same network as the host I was administering it from. It should be noted that this interface is a secondary interface: the host's canonical name points to an IP on a different LAN segment. Prior to configuring the secondary interface on this host, I was able to log into that primary interface with no problems. Unfortunately, adding the secondary interface that was on the same LAN segment as my administration host caused problems. The Linux routing saw to it that I could only connect to the secondary interface; I was locked out of the primary interface.

    This seemed odd. So, I started to dig around on the Linux host to figure out what the heck was going on. First up, a gander at the routing tables:

    # netstat -rnv
    Kernel IP routing table
    Destination     Gateway          Genmask         Flags   MSS Window  irtt Iface
    192.168.2.0     0.0.0.0          255.255.255.0   U         0 0          0 bond1.1002
    192.168.33.0    0.0.0.0          255.255.255.0   U         0 0          0 bond0.1033
    169.254.0.0     0.0.0.0          255.255.0.0     U         0 0          0 bond1.1002
    0.0.0.0         192.168.33.254   0.0.0.0         UG        0 0          0 bond0.1033

    Hmm... Not quite what I'm used to seeing. On a Solaris system, I'd expect something more along the lines of:

    IRE Table: IPv4
      Destination             Mask           Gateway          Device   Mxfrg Rtt   Ref Flg  Out  In/Fwd
    -------------------- ---------------  -------------------- ------  ----- ----- --- --- ----- ------
    default              0.0.0.0          192.168.8.254                1500*     0   1 UG    1836      0
    192.168.8.0          255.255.255.0    192.168.8.77         ce1     1500*     0   1 U      620      0
    192.168.11.0         255.255.255.0    192.168.11.222       ce0     1500*     0   1 U        0      0
    224.0.0.0            240.0.0.0        192.168.8.77         ce1     1500*     0   1 U        0      0
    127.0.0.1            255.255.255.255  127.0.0.1            lo0     8232*     0   1 UH   13292      0

    Yeah yeah, not identically formatted output, but similar enough that things on the Linux host don't look right if what you're used to seeing is the Solaris system's way of setting up routing. On a Solaris host, network destinations (i.e., "192.168.2.0", "192.168.33.0", "192.168.8.0" and "192.168.11.0" in the above examples) get routed through an IP address on a NIC. On Linux, however, it seems like all of the network routes were configured to go through whatever the default route was.

    Now, what `netstat -rnv` is showing for interface network routes may not be strictly representative of what Linux is actually doing, but both what Linux is doing and how it's presented are wrong - particularly if there are firewalls between you and the multi-homed Linux host. The above output is kind of a sloppy representation of Linux's symmetrical routing philosophy. Unfortunately, because of the way Linux routes, if I have a configuration where the multi-homed Linux host has two IP addresses - 192.168.2.123 and 192.168.33.123 - and I'm connecting from a host with an address of 192.168.2.76 but am trying to connect to the Linux host's 192.168.33.123 address, my connection attempt times out. While Linux may, in fact, receive my connection request at the 192.168.33.123 address, its default routing behavior seems to be to send the reply back out through its 192.168.2.123 address - ostensibly because the host I'm connecting from is on the same segment as the Linux host's 192.168.2.123 address.
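    One way to watch that decision being made is to ask the kernel, directly, which route it will use for the return traffic. A quick sketch (run on the multi-homed Linux host; the addresses are the illustrative ones from above and the output shown is approximate):

    # Which interface/source address will the reply to 192.168.2.76 use?
    ip route get 192.168.2.76
    #   192.168.2.76 dev bond1.1002  src 192.168.2.123
    # The reply leaves via bond1.1002 regardless of which local address was contacted.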

    Given my background, my first thought is "make the Linux routing table look like the routing tables you're more used to." Linux is nice enough to let me do what seem to be the right `route add` statements. However, it doesn't allow me to nuke the seemingly bogus network routes pointing at 0.0.0.0.

    Ok, apparently I'm in for some kind of fight. I've gotta sort out the routing-philosophy differences between my experience and that of the writers of the Linux networking stack. Fortunately, I have a good friend (named "Google") who's generally pretty good at getting my back. It's with Google's help that I discover that this kind of routing problem is handled through Linux's "advanced routing" functionality. I don't really quibble about what's so "advanced" about sending a response packet back out the same interface that the request packet came in on. I just kinda shrug and chalk it up to differences in implementation philosophy. It does, however, leave me with the question of, "how do I solve this difference of philosophy?"

    Apparently, I have to muck about with files that I don't have to touch on either single-homed Linux systems or multi-homed commercial UNIX systems. I have to configure additional routing tables so that I can set up policy routing. Ok, so, I'm starting to scratch my head here. By itself, this isn't inherently horrible. However, it's not one of those topics that seems to come up a lot: it's neither well-documented in Linux nor do many relevant hits get returned by Google. Thus, I'm left to take what hits I do find and start experimenting. Ultimately, I found that I had to set up five files (in addition to the normal /etc/sysconfig/network-scripts/ifcfg-* files) to get things working as I think they ought:

    /etc/iproute2/rt_tables
    /etc/sysconfig/network-scripts/route-bond0.1033
    /etc/sysconfig/network-scripts/route-bond1.1002
    /etc/sysconfig/network-scripts/rule-bond0.1033
    /etc/sysconfig/network-scripts/rule-bond1.1002

    Using the "/etc/iproute2/rt_tables" file is kind of optional. Mostly, it lets me assign logical/friendly names to the extra routing tables I need to set up. I like friendly names. They're easier to remember and can be more self-documenting than plain-Jane numeric IDs. So, I edit the "/etc/iproute2/rt_tables" and add two lines:

    2         net1002
    33        net1033

    I should probably note that what I actually wanted to add was:

    1002      net1002
    1033      net1033

    I wanted these aliases as they would be reflective of what my 802.1q VLAN IDs were. Unfortunately, Linux limits the range of numbers you can allocate table IDs from (historically, 0-255). Worse, there are reserved and semi-reserved IDs in that limited range (255/local, 254/main, 253/default and 0/unspec), further limiting your pool of table-ID candidates. So, creating an enterprise-deployable standard config file might not be practical on a network with lots of VLANs, subnets, etc. Friendly names set up, I then had to set up the per-"NIC" routing and rules files.

    I set up the "/etc/sysconfig/network-scripts/route-${IF}.${VLANID}" files with two routing statements:

    table ${TABLENAME} to ${NETBASE}/${CIDR} dev ${DEV}.${VLAN}
    table ${TABLENAME} to default via ${GATEWAY} dev ${DEV}.${VLAN}

    This might actually be overkill. I might only need one of those lines. However, it was late in the day and I was tired of experimenting. They worked, so, that was good enough. Sometimes, a sledgehammer's a fine substitute for a scalpel.
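    As a concrete (if hypothetical) filling-in of that template for the bond0.1033 interface - using the table name from rt_tables and the network/gateway shown in the earlier routing output - the route-bond0.1033 file would contain something like:

    table net1033 to 192.168.33.0/24 dev bond0.1033
    table net1033 to default via 192.168.33.254 dev bond0.1033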

    Lastly, I set up the "/etc/sysconfig/network-scripts/rule-${IF}.${VLANID}" files with a single rule, each:

    from ${NETBASE}/${CIDR} table ${TABLENAME} priority ${NUM}

    Again, the priority values I picked may be suboptimal (I set the net1002 priority to "2" and the net1033 priority to "33"). But, since they worked, I left them at those values.
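    Filled in the same way as the route files above, the two rule files would look something like the following (again, the addresses and table names are the illustrative ones used throughout this post):

    # cat /etc/sysconfig/network-scripts/rule-bond0.1033
    from 192.168.33.0/24 table net1033 priority 33

    # cat /etc/sysconfig/network-scripts/rule-bond1.1002
    from 192.168.2.0/24 table net1002 priority 2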

    I did a `service network restart` and was able to access my multi-homed Linux host by either IP address from my workstation. Just to be super-safe, I bounced the host (to shake out anything that might have been left in place by my experiments). When the box came back from the reboot, I was still able to access it through either IP address from my workstation.

    Friday, November 26, 2010

    Linux Storage Multipathing in a Mixed-Vendor (NetApp/EMC) Environment

    So, recently, I've been tasked with coming up with documentation for the junior staff that has to work with RedHat Enterprise Linux 5.x servers in a multi-vendor storage environment. In our particular environment, the two most likely candidates to be seen on a Linux server are NetApp and/or CLARiiON storage arrays.

    Previously, I've covered how to set up an RHEL 5.x system to use the Linux multipath service with NetApp Filer-based fibrechannel storage. Below, I'll expand on that, a bit, by explaining how to deal with a multi-vendor storage environment where separate storage subsystems will use separate storage multipathing solutions. In this particular case, the NetApp Filer-based fibrechannel storage will continue to be managed with the native Linux storage multipathing solution (multipathd) and the EMC CLARiiON-based storage will use the EMC-provided storage multipathing solution, PowerPath. I'm not saying such a configuration will be normal in your environment or mine; it's just an "edge-case scenario" I explored in my testing environment in case someone asked for it. It's almost a given that if you haven't tested or documented the edge-cases, someone will invariably want to know how to do them (and, conversely, if you test and document them, no one ever bothers you about them in real life).

    Prior to presentation of EMC CLARiiON-based storage to your mixed-storage system, you will want to ensure that:
    • CLARiiON-based LUNs are excluded from your multipathd setup
    • PowerPath software has been installed
    To explicitly exclude CLARiiON-based storage from multipathd's management, it will be necessary to modify your system's /etc/multipath.conf file. You will need to modify this file's blacklist stanza to resemble the following:

    blacklist {
            wwid DevId
            devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
            devnode "^hd[a-z]"
            devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"

            #############################################################################
            # Comment out the next four lines if management of CLARiiON LUNs is *WANTED*
            #############################################################################
            device {
                    vendor "DGC"
                    product "*"
            }

      }

    The lines we're most interested in are the four that begin with the device { directive. These lines are what the blacklist interpreter uses to tell itself "ignore any devices whose SCSI inquiry returns a Vendor ID of 'DGC'".

    I should note that I'd driven myself slightly nuts working out the above. I'd tried simply placing the vendor and product directives directly in the 'blacklist' block. However, I found that, if I didn't contextualize them in a 'device' sub-block, the multipathd service would pretty much just flip me the bird and ignore the directives (without spitting out errors to tell me that it was doing so or why). Thus, it would continue to grab my CLARiiON LUNs until I nested my directives properly. The 'product' definition is also probably overkill, but it works.

    Once these are in place, restart the multipath daemon to get it to reread its configuration files. Afterwards, request storage and do the usual PowerPath tasks to bring the CLARiiON devices under PowerPath's control. Properly set up, this will result in a configuration similar to the following:
    # multipath -l
    360a98000486e2f34576f2f51715a714d dm-7 NETAPP,LUN
    [size=25G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 0:0:0:1 sda 8:0   [active][undef]
     \_ 0:0:1:1 sdb 8:16  [active][undef]
     \_ 1:0:0:1 sde 8:64  [active][undef]
     \_ 1:0:1:1 sdf 8:80  [active][undef]

    # powermt display dev=all
    Pseudo name=emcpowera
    CLARiiON ID=APM00034000388 [stclnt0001u]
    Logical device ID=600601F8440E0000DBE5E2FEF1E1DF11 [LUN 13]
    state=alive; policy=BasicFailover; priority=0; queued-IOs=0;
    Owner: default=SP A, current=SP B       Array failover mode: 1
    ==============================================================================
    --------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
    ###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
    ==============================================================================
       0 qla2xxx                  sdc       SP A1     unlic   alive       0      0
       0 qla2xxx                  sdd       SP B1     unlic   alive       0      0
       1 qla2xxx                  sdg       SP A0     active  alive       0      0
       1 qla2xxx                  sdh       SP B0     active  alive       0      0

    As can be seen from the above, the NetApp LUN(s) are showing up under multipathd's control and the CLARiiON LUNs are showing up under PowerPath's control. Neither multipathing solution is seeing the other's devices.

    You'll also note that the Array failover mode is set to "1". In my test environment, the only CLARiiON I have access to is in dire need of a firmware upgrade: its firmware doesn't support mode "4" (ALUA). Since I'm using this test configuration to test both PowerPath and native multipathing, I had to set the LUN to a mode that both the array and multipathd supported to get my logs to stop getting polluted with "invalid mode" messages. Oh well, hopefully a hardware refresh is coming to my lab.

    Lastly, you'll also likely note that I'm running PowerPath in unlicensed mode. Again, this is a lab scenario where I'm tearing stuff down and rebuilding, frequently. Were it a production system, the licensing would be in place to enable all of the PowerPath functionality.
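    For reference, the rough sequence between editing /etc/multipath.conf and seeing the `multipath -l` and `powermt display` output above is: restart (or reload) multipathd so it re-reads the blacklist, rebuild the multipath maps, then have PowerPath claim and save its view of the CLARiiON paths. A sketch (the exact PowerPath commands and options can vary a bit by PowerPath release):

    # service multipathd restart
    # multipath -v2
    # powermt config
    # powermt save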

    Wednesday, November 17, 2010

    Linux Multipath Path-Failure Simulation

    Previously, I've discussed how to set up the RedHat Linux storage multipathing software to manage fibrechannel-based storage devices. I didn't, however, cover how one tests that such a configuration is working as intended.
    When it comes to testing resilient configurations, one typically has a number of testing options. In a fibrechannel fabric situation, one can do any of:
    • Offline a LUN within the array
    • Down an array's storage processors and/or HBAs
    • Pull the fibrechannel connection between the array and the fibrechannel switching infrastructure
    • Shut off a switch (or switches) in a fabric
    • Pull connections between switches in a fabric
    • Pull the fibrechannel connection between the fibrechannel switching infrastructure and the storage-consuming host system
    • Disable paths/ports within a fibrechannel switch
    • Disable HBAs on the storage-consuming host systems
    • Disable particular storage targets within the storage-consuming host systems
    Generally, I favor approaches that limit the impact of the tested scenario as much as possible and that limit the likelihood of introducing actual/lasting breakage into the tested configuration.
    I also tend to favor approaches where I have as much control of the testing scenario as possible. I'm an impatient person and having to coordinate outages and administrative events with other infrastructure groups and various "stakeholders" can be a tiresome, protracted chore. Some would say that indicates I'm not a team-player: I like to think that I just prefer to get things done efficiently and as quickly as possible. Tomayto/tomahto.
    Through most of my IT career, I've worked primarily on the server side of the house (Solaris, AIX, Linux ...and even - *ech* - Windows) - whether as a systems administrator or as an integrator. So, my testing approaches tend to be oriented from the storage-consumer's view of the world. If I don't want to have to coordinate outside of a server's ownership/management team, I'm pretty much limited to the last three items on the above list: yanking cables from the server's HBAs, disabling the server's HBAs and disabling storage targets within the server.
    Going back to the avoidance of "introducing actual/lasting breakage", I tend to like to avoid yanking cables. At the end of the day, you never know if the person doing the monkey-work of pulling the cable is going to do it like a surgeon or like a gorilla. I've, unfortunately, been burned by run-ins with more than a few gorillas. So, if I don't have to have cables physically disconnected, I avoid it.
    Being able to logically disable an HBA is a nice test scenario. It produces the kind of path-failure scenario that you're hoping to test. Unfortunately, not all HBA manufacturers seem to include the ability to logically disable the HBA from within their management utilities. Within commercial UNIX variants - like Solaris or AIX - this hasn't often proven to be a problem. Under Linux, however, the ability to logically disable HBAs from within their management utilities seems to be a bit "spotty".
    Luckily, where the HBA manufacturers sometimes leave me in the lurch, RedHat Linux leaves me some alternatives. In the spirit of Linux DIYism, those alternatives aren't really all that fun to deal with ...until you write tools, for yourself, that remove some of the pain. I wrote two tools to help myself in this area: one offlines designated storage paths and the other attempts to restore those downed storage paths.
    Linux makes it possible to change the system-perceived state of a given device path by writing the term "offline" to the file location /sys/block/${DEV}/device/state. Thus, were one to want to make the OS think that the path to /dev/sdg was down, one would execute the command `echo "offline" > /sys/block/sdg/device/state`. All my path-downing script does is let you down a given /dev/sdX device by executing `pathdown.sh <DEVICE>` (e.g., `pathdown.sh sdg`). There's minimal logic built in to verify that the named /dev/sdX device is a real, downable device, and it provides a post-action status of that device, but, other than that, it's a pretty simple script.
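    For illustration, here's a minimal sketch of what such a path-downing script might look like (the name pathdown.sh is just what I call mine; the validation is deliberately thin and the script assumes you're running it as root):

    #!/bin/sh
    # pathdown.sh - mark a SCSI block-device path offline
    DEV=$1

    if [ -z "${DEV}" ] || [ ! -w "/sys/block/${DEV}/device/state" ]; then
       echo "Usage: $(basename $0) <sdX>  (device must exist under /sys/block)" >&2
       exit 1
    fi

    # Tell the OS to treat this path as offline
    echo "offline" > /sys/block/${DEV}/device/state

    # Report the post-action state of the path
    echo "${DEV} state is now: $(cat /sys/block/${DEV}/device/state)"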
    To decide which path one wants to down, it's expected that the tester will look at the multipather's view of its managed devices using `multipath -l <DEVICE>` (e.g., `multipath -l mpath0`). This command will produce output similar to the following:
    mpath0 (360a9800043346d364a4a2f41592d5849) dm-7 NETAPP,LUN
    [size=20G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 0:0:0:1 sda 8:0   [active][undef]
     \_ 0:0:1:1 sdb 8:16  [active][undef]
     \_ 1:0:0:1 sdc 8:32  [active][undef]
     \_ 1:0:1:1 sdd 8:48  [active][undef]
    
    Thus, if one wanted to deactivate one of the channels in the mpath0 multipathing group, one might issue the command `pathdown.sh sdb`. This would result in the path associated with /dev/sdb being taken offline. After taking this action, the output of `multipath -l mpath0` would change to:
    mpath0 (360a9800043346d364a4a2f41592d5849) dm-7 NETAPP,LUN
    [size=20G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 0:0:0:1 sda 8:0   [active][undef]
     \_ 0:0:1:1 sdb 8:16  [failed][faulty]
     \_ 1:0:0:1 sdc 8:32  [active][undef]
     \_ 1:0:1:1 sdd 8:48  [active][undef]
    Typically, when doing such testing activities, one would be performing a large file operation against the disk device (preferably a write operation). My test sequence (a rough sketch of it follows the list) is typically to:
    1. Start an `iostat` job, grepping for the devices I'm interested in, and capturing the output to a file
    2. Start up a file transfer (or even just a `dd` operation) into the device.
    3. Start downing paths as the transfer occurs
    4. Wait for the transfer to complete, then kill the `iostat` job
    5. Review the captured output from the `iostat` job to ensure that the I/O behaviors I expected to see actually occurred
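    Strung together, that sequence might look something like the following (a sketch - the sdX names match the mpath0 example above, but the mount point, file size and iostat interval are purely illustrative):

    # 1. Capture extended I/O statistics in the background
    iostat -x 2 > /tmp/iostat.out &
    IOSTAT_PID=$!

    # 2. Generate sustained write I/O against the multipath device
    dd if=/dev/zero of=/mnt/mpath0/testfile bs=1M count=4096

    # 3. While the dd runs, down paths from another terminal, e.g.: pathdown.sh sdb

    # 4. Once the transfer completes, stop the iostat capture
    kill ${IOSTAT_PID}

    # 5. Review the captured output for the paths of interest
    egrep 'sda|sdb|sdc|sdd' /tmp/iostat.out | less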
    In the testing environment I had available when I wrote this page, I was using a NetApp filer presenting blockmode storage via fibrechannel. The NetApp multipathing plugin supports concurrent, multi-channel operations to the target LUN. Thus, the output from my `iostat` job will show uniform I/Os across all paths to the LUN, and then show outputs drop to zero on each path that I offline. Were I using an array that only supported Active/Passive I/O operations, I would expect to see the traffic move from the downed path to one of the failover paths, instead.
    So, great: you've tested that your multipathing system behaves as expected. However, once you've completed that testing, all of the paths that you've offlined have stayed offline. What to do about it?
    The simplest method is to reboot the system. However, I abhor knocking my systems' `uptime` if I don't absolutely have to. Fortunately, much as Linux provides the means to offline paths, it provides the means for reviving them (well, to be more accurate, to tell it "hey, go check these paths and see if they're online again"). As with offlining paths, the methods for doing so aren't currently built into any OS-provided utilities. What you have to do is:
    1. Tell the OS to delete the device paths
    2. Tell the OS to rescan the HBAs for devices it doesn't currently know about
    3. Tell the multipath service to look for changes to paths associated with managed devices
    The way you tell the OS to (gracefully) delete device paths is to write a value to a file. Specifically, one writes the value "1" to the file /sys/block/${DEV}/device/delete. Thus, if one is trying to get the system to clean up for the downed device path /dev/sdb, one would issue the command `echo "1" > /sys/block/sdb/device/delete`.
    The way you tell the OS to rescan the HBAs is to issue the command `echo "- - -" >  /sys/class/scsi_host/${HBA}/scan`. In Linux, the HBAs are numbered in the order found and named "hostN" (i.e., "host0", "host1", etc.). Thus, to rescan HBA 0, one would issue the command `echo "- - -" >  /sys/class/scsi_host/host0/scan` (for good measure, rescan all the HBAs).
    The way to tell the multipath service to look for changes to paths associated with managed devices is to issue the command `multipath` (I actually use `multipath -v2` because I like the more verbose output that tells me what did or didn't happen as a result of the command). Granted, the multipath service periodically rescans the devices it manages to find state-change information, but I don't like to wait for systems to "get around to it".
    All that my path-fixing script does is roll up the above three steps into one easy-to-remember-and-use command.
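    As with the path-downing script, here's a minimal sketch of what that roll-up might look like (again, the script name and argument handling are mine, and it assumes root privileges):

    #!/bin/sh
    # pathfix.sh - delete a downed path, rescan the HBAs, refresh multipathd's view
    DEV=$1

    if [ -z "${DEV}" ] || [ ! -e "/sys/block/${DEV}/device/delete" ]; then
       echo "Usage: $(basename $0) <sdX>" >&2
       exit 1
    fi

    # 1. Gracefully delete the downed device path
    echo "1" > /sys/block/${DEV}/device/delete

    # 2. Rescan all HBAs for devices the OS doesn't currently know about
    for HBA in /sys/class/scsi_host/host*; do
       echo "- - -" > ${HBA}/scan
    done

    # 3. Have the multipath service re-examine the paths for its managed devices
    multipath -v2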
    Depending on how you have logging configured on your system, the results of all the path offlining and restoration will be logged. Both the SCSI subsystem and the multipathing daemon should log events. Thus, you can verify the results of your activities by looking in your system logs.
    That said, if the system you're testing is hooked up to an enterprise monitoring system, you will want to let your monitoring groups know that they need to ignore the red flags you'll be generating on their monitoring dashboards.