Titular Discrepancy: SRM

Monday, March 5, 2012

AD Integration Woes

For a UNIX guy, I'm a big fan of leveraging Active Directory for centralized system and user management purposes. For me, one of the big qualifications for any application/device/etc. to be able to refer to itself as "enterprise" is that it has to be able to pull user authentication information via Active Directory. I don't care whether it does it the way that Winbind-y things do, or if they just pull data via Kerberos or LDAP. All I care is that I can offload my user-management to a centralized source.

In general, this is a good thing. However, like many things that are "in general, a good thing," it's not always a good thing. If you're in an enterprise that's evolving its Active Directory infrastructure, tying things to AD means that you're impacted when AD changes or breaks. Someone decides, "hey, we need to reorganize the directory tree" and stuff can break. Someone decides, "hey, we need to upgrade our AD servers from 2003 to 2008" and stuff can break.

Recently, I started getting complaints from the users of our storage resource management (SRM) system that they couldn't log in anymore. It'd been nearly a year since I'd set it up, so sorting it out was an exercise in trying to remember what the hell I did ...and then Googling.

The application, itself, runs on a Windows platform. The login module that it uses for centralized authentication advertises itself as "Active Directory". In truth, the login module is a hybrid LDAP/Kerberos software module. Even though it's a "Windows" application, they actually use the TomCat Java server for the web UI (and associated login management components). TomCat is what uses Kerberos for authentication data.

Sometime in recent months, someone had upgraded Active Directory. Users that had been using the software before the AD-upgrade were able to authenticate, just fine. Users that had tried to start using the software after the AD-upgrade couldn't get in. Turns out that, when Active Directory had gotten upgraded, the encryption algorithms had gotten changed. Ironically, I didn't find the answer to my problem in any of the forums related to the application: I found it in the forums for another application that used TomCat's Kerberos components.

To start the troubleshooting process, one needs to first modify TomCat's bsclogin.conf file. Normally, this file is only used to tell TomCat where to find the Kerberos configuration file. However, you can also add the directive "debug=true" to it, like so:

com.sun.security.auth.module.Krb5LoginModule required debug=true;

This enables enhanced user-login debugging messages. Once the directive is added and TomCat is restarted, login-related messages will start showing up in your ${TOMCATHOME}/logs/stdout.log file. What started showing up in mine were messages like:

[Krb5LoginModule] user entered username: thjones2@SRMdomain.NET
Acquire TGT using AS Exchange
[Krb5LoginModule] authentication failed
KDC has no support for encryption type (14)
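If stdout.log is busy, picking these failures out by eye gets old fast. The following is a minimal Python sketch, not anything that ships with the application, that scans log text for the two message patterns shown above and pairs each failure with the most recently seen username. The function name and the sample text are assumptions made for illustration; the number in parentheses is simply captured as printed (as far as I can tell, 14 is the Kerberos error code for an unsupported encryption type, KDC_ERR_ETYPE_NOSUPP, rather than an encryption-type ID).

```python
import re

# Patterns taken from the debug messages quoted above.
USERNAME = re.compile(r"\[Krb5LoginModule\] user entered username: (\S+)")
ETYPE_ERROR = re.compile(r"KDC has no support for encryption type \((\d+)\)")


def find_etype_failures(log_text):
    """Return (username, code) pairs for logins that hit the enctype error.

    Hypothetical helper: the username is whichever one was logged most
    recently before the error line (or on the same line).
    """
    failures = []
    last_user = None
    for line in log_text.splitlines():
        user_match = USERNAME.search(line)
        if user_match:
            last_user = user_match.group(1)
        error_match = ETYPE_ERROR.search(line)
        if error_match:
            failures.append((last_user, int(error_match.group(1))))
    return failures


# Sample text copied from the messages above.
sample = """\
[Krb5LoginModule] user entered username: thjones2@SRMdomain.NET
Acquire TGT using AS Exchange
[Krb5LoginModule] authentication failed
KDC has no support for encryption type (14)
"""

print(find_etype_failures(sample))
```

In practice you'd read the text from ${TOMCATHOME}/logs/stdout.log instead of the inline sample.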

With this error message in hand, I was able to find out that TomCat's Kerberos modules were using the wrong encryption routines to get data out of Active Directory. The fix was to update my Kerberos initialization file and add the following two lines to my [libdefaults] stanza (I just added them right after the dns_lookup_realm line and before the next stanza of directives):

default_tkt_enctypes = rc4-hmac
default_tgs_enctypes = rc4-hmac

Making this change (and restarting TomCat) resulted in the failing users suddenly being able to get in.
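If you have more than one box to fix, the same edit can be scripted. This is only a sketch under stated assumptions: the helper name is made up, the sample realm is a placeholder, and it handles just the simple "key = value" layout. It adds the two directives directly under the [libdefaults] header rather than after dns_lookup_realm, on the assumption that position within the stanza doesn't matter, and it leaves the text alone if the directives are already there.

```python
# The two directives from the fix described above.
REQUIRED = {
    "default_tkt_enctypes": "rc4-hmac",
    "default_tgs_enctypes": "rc4-hmac",
}


def ensure_enctypes(conf_text):
    """Return conf_text with the REQUIRED keys present in [libdefaults].

    Hypothetical helper: existing values are never overwritten, only
    missing keys are added.
    """
    lines = conf_text.splitlines()
    present = set()
    header_index = None
    in_libdefaults = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("["):
            # A new stanza header starts here; note whether it's ours.
            in_libdefaults = stripped == "[libdefaults]"
            if in_libdefaults:
                header_index = index
        elif in_libdefaults and "=" in stripped:
            present.add(stripped.split("=", 1)[0].strip())
    if header_index is None:
        raise ValueError("no [libdefaults] stanza found")
    missing = [
        f"    {key} = {value}"
        for key, value in REQUIRED.items()
        if key not in present
    ]
    # Insert any missing directives right after the stanza header.
    lines[header_index + 1:header_index + 1] = missing
    return "\n".join(lines) + "\n"


# Placeholder configuration; EXAMPLE.COM is not the real realm.
sample_conf = """\
[libdefaults]
    default_realm = EXAMPLE.COM
    dns_lookup_realm = false

[realms]
"""

print(ensure_enctypes(sample_conf))
```

Running it a second time over its own output changes nothing, which makes it safe to re-run.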

I'd normally rag on my Windows guys for this, but the Windows guys aren't exactly used to providing AD-related service notifications to anyone but Windows users. This application, while (currently) running on a Windows platform, isn't something that's traditionally thought of as AD-reliant. Factor in that true, native AD modules pretty much just auto-adjust to such changes, and it didn't occur to them to notify us of the changes.

Oh well. AD-integration is a learning experience for everyone involved, I suppose.

Friday, July 8, 2011

CLARiiON Report Data Verification

Earlier this year, the organization I work for decided to put into production an enterprise-oriented storage resource management (SRM) system. The tool we bought is actually pretty cool. We install collectors into each of our major data centers and they pull storage utilization data off of all of our storage arrays, SAN switches and storage clients (you know: the Windows and UNIX boxes that use up all that array-based storage). Then, all those collectors pump out the collected data to a reporting server at our main data center. The reporting server is capable of producing all kinds of nifty/pretty reports: configuration snapshots, performance reports, trending reports, utilization profiles, etc.

As cool as all this is, you have the essential problem of "how do I know that the data in all those pretty reports is actually accurate?" Ten or fifteen years ago, when array-based storage was fairly new and storage was still the realm of systems administrators with coding skills, you'd ask your nearest scruffy misanthrope, "could you verify the numbers on this report," and get an answer back within a few hours (and then within minutes each subsequent time you asked). Unfortunately, in the modern, GUI-driven world, asking your storage guys to verify numbers can be like pulling teeth. Many modern storage guys aren't really coders and frequently don't know the quick and easy way to get you hard numbers out of the devices they manage. In some cases, you may watch them cut and paste from the individual arrays' management web UIs into something like Microsoft Calculator. So, you'll have to wait and, oftentimes, you'll have to continually prod them for the data because it's such a pain in the ass for them to produce.

With our SRM rollout, I found myself in just such a situation. Fortunately, I've been doing Unix system administration for the best part of 20 years and, therefore, am rather familiar with scripting. I frequently wish I was able to code in better reporting languages, but I just don't have the time to keep my "real" coding skills up to par. I'm also not terribly patient. So, after waiting a couple weeks for our storage guys to get me the numbers I'd asked for, I said to myself, "screw it: there's gotta be a quicker/better way."

In the case of our CLARiiONs, that better way was to use the NaviCLI (or, these days, the NaviSECCLI). This is a tool set that has been around a looooooong time, in one form or another, and has been available for pretty much any OS that you might attach to a CLARiiON as a storage client. These days, it's a TCP/IP-based command-line tool - prior to NaviCLI, you either had platform-specific tools (IRIX circa 1997 had a CLI-based tool that did queries through the SCSI bus to the array) or you logged directly into the array's RS232 port and used its onboard tools (hopefully, you had a terminal or terminal program that allowed you to capture output) ...but I digress.

If you own EMC equipment, you've hopefully got maintenance contracts that give you rights to download tools and utilities from the EMC support site. NaviCLI is one such tool. Once you install it, you have a nifty little command-line tool that you can wrap inside of scripts. You can create these scripts to do both provisioning tasks and reporting tasks. My use, in this case, was reporting.

The SRM we bought came with a number of canned-reports - including ones for CLARiiON devices. Unfortunately, the numbers we were getting from our SRM were indicating that we only had about 77TiB on one of our arrays when the EMC order sheets said we should have had about 102TiB. That's a bit of a discrepancy. I was able to wrap some NaviCLI commands into a couple scripts (one that reported on RAID-group capacity and one that reported physical and logical disk capacities [ed.: please note that these scripts are meant to be illustrative of what you can do, but aren't really something you'd want to have as the nexus of your enterprise-reporting strategy. They're slow to run, particularly on larger arrays]) and verify that the 77TiB was sort of right and that the 102TiB was also sorta right. The group capacity script basically just spits out two numbers - total raw capacity and total capacity allocatable to clients (without reporting on how much of either is already allocated to clients). The disk capacity script reports how the disks are organized (e.g., RAID1, RAID5, Spare, etc.) - printing total number of disks in each configuration category and how much raw capacity that represented. Basically, the SRM tool was reporting the maximum number of blocks that were configured into RAID groups, not the total raw physical blocks in the array that we'd thought it was supposed to report.
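To make the raw-versus-allocatable gap concrete, here is a small Python sketch of the arithmetic. Everything in it is an assumption for illustration: the disk counts, the 1 TiB disk size, the RAID layouts and the class and function names are all made up, and none of it comes from our array or from NaviCLI output. It only shows how parity, mirroring and spares can make "what we bought" and "what can be handed to clients" two quite different, equally correct numbers.

```python
from dataclasses import dataclass

TIB = 2**40  # bytes per tebibyte


@dataclass(frozen=True)
class DiskGroup:
    """A set of identical disks sharing one layout (hypothetical model)."""
    layout: str
    disks: int
    disk_bytes: int
    data_parts: int   # disks' worth of client data per stripe
    total_parts: int  # disks per stripe, including parity or mirror


def total_raw_bytes(groups):
    """Sum of every physical disk's capacity, regardless of layout."""
    return sum(g.disks * g.disk_bytes for g in groups)


def total_allocatable_bytes(groups):
    """Capacity left for clients after parity, mirroring and spares."""
    return sum(
        g.disks * g.disk_bytes * g.data_parts // g.total_parts
        for g in groups
    )


# Made-up inventory: NOT the array discussed in this post.
inventory = [
    DiskGroup("RAID5 (4+1)", disks=100, disk_bytes=TIB, data_parts=4, total_parts=5),
    DiskGroup("RAID1", disks=20, disk_bytes=TIB, data_parts=1, total_parts=2),
    DiskGroup("Spare", disks=5, disk_bytes=TIB, data_parts=0, total_parts=1),
]

print(f"raw:         {total_raw_bytes(inventory) / TIB:.1f} TiB")
print(f"allocatable: {total_allocatable_bytes(inventory) / TIB:.1f} TiB")
```

With this made-up inventory the two totals differ by 35 TiB, without a single byte being "missing."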

Having these numbers in hand allowed us to tear apart the SRM's database queries and tables so that we could see what information it was grabbing, how it was storing/organizing it and how to improve on the vendor-supplied standard reports. Mostly, it consisted of changing the titles of some existing fields and adding some fields to the final report.

Yeah, all of this begs the question "what was the value of buying an SRM when you had to reverse-engineer it to make the presented data meaningful?" To be honest, "I dunno." I guess, at the very least, we bought a framework through which we could put together pretty reports and ones that were more specifically meaningful to us (though, to be honest, I'm a little surprised that we're the only customers of the SRM vendor to have found the canned-reports to be "sadly lacking"). It also gave me an opportunity to give our storage guys a better idea of the powerful tools they had available to them if only they were willing to dabble at the command line (even on Windows).

Still, the vendor did provide a technical resource to help us get things sorted out faster than we might have done without that assistance. So, I guess that's something?