Friday, July 8, 2011

CLARiiON Report Data Verification

Earlier this year, the organization I work for decided to put into production an enterprise-oriented storage resource management (SRM) system. The tool we bought is actually pretty cool. We install collectors into each of our major data centers and they pull storage utilization data off of all of our storage arrays, SAN switches and storage clients (you know: the Windows and UNIX boxes that use up all that array-based storage). Then, all those collectors pump out the collected data to a reporting server at our main data center. The reporting server is capable of producing all kinds of nifty/pretty reports: configuration snapshots, performance reports, trending reports, utilization profiles, etc.

As cool as all this is, you have the essential problem of "how do I know that the data in all those pretty reports is actually accurate?" Ten or fifteen years ago, when array-based storage was fairly new and storage was still the realm of systems administrators with coding skills, you'd ask your nearest scruffy misanthrope, "could you verify the numbers on this report," and get an answer back within a few hours (and then within minutes each subsequent time you asked). Unfortunately, in the modern, GUI-driven world, asking your storage guys to verify numbers can be like pulling teeth. Many modern storage guys aren't really coders and frequently don't know the quick and easy way to get you hard numbers out of the devices they manage. In some cases, you may watch them cut and paste from the individual arrays' management web UIs into something like Microsoft Calculator. So, you'll have to wait and, oftentimes, you'll have to continually prod them for the data because it's such a pain in the ass for them to produce.

With our SRM rollout, I found myself in just such a situation. Fortunately, I've been doing Unix system administration for the best part of 20 years and, therefore, am rather familiar with scripting. I frequently wish I were able to code in better reporting languages, but I just don't have the time to keep my "real" coding skills up to par. I'm also not terribly patient. So, after waiting a couple of weeks for our storage guys to get me the numbers I'd asked for, I said to myself, "screw it: there's gotta be a quicker/better way."

In the case of our CLARiiONs, that better way was to use the NaviCLI (or, these days, the NaviSECCLI). This is a tool set that has been around a looooooong time, in one form or another, and has been available for pretty much any OS that you might attach to a CLARiiON as a storage client. These days, it's a TCP/IP-based commandline tool - prior to NaviCLI, you either had platform-specific tools (IRIX circa 1997 had a CLI-based tool that did queries through the SCSI bus to the array) or you logged directly into the array's RS232 port and used its onboard tools (hopefully, you had a terminal or terminal program that allowed you to capture output) ...but I digress.

If you own EMC equipment, you've hopefully got maintenance contracts that give you rights to download tools and utilities from the EMC support site. NaviCLI is one such tool. Once you install it, you have a nifty little command-line tool that you can wrap inside of scripts. You can create these scripts to do both provisioning tasks and reporting tasks. My use, in this case, was reporting.

The SRM we bought came with a number of canned reports, including ones for CLARiiON devices. Unfortunately, the numbers we were getting from our SRM indicated that we only had about 77TiB on one of our arrays when the EMC order sheets said we should have had about 102TiB. That's a bit of a discrepancy. I was able to wrap some NaviCLI commands into a couple of scripts (one that reported on RAID-group capacity and one that reported physical and logical disk capacities) and verify that the 77TiB was sort of right and that the 102TiB was also sorta right. [ed.: please note that these scripts are meant to be illustrative of what you can do, but aren't really something you'd want to have as the nexus of your enterprise-reporting strategy. They're slow to run, particularly on larger arrays.] The group capacity script basically just spits out two numbers: total raw capacity and total capacity allocatable to clients (without reporting on how much of either is already allocated to clients). The disk capacity script reports how the disks are organized (e.g., RAID1, RAID5, Spare, etc.), printing the total number of disks in each configuration category and how much raw capacity that represents. Basically, the SRM tool was reporting the maximum number of blocks that were configured into RAID groups, not the total raw physical blocks in the array that we'd thought it was supposed to report.
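To give a feel for the group-capacity idea, here is a minimal sketch in Python (not the scripts described above, which wrapped NaviCLI directly). It assumes you have already captured the text of a RAID-group listing from the CLI, that each group reports lines labeled "Raw Capacity (Blocks):" and "Logical Capacity (Blocks):", and that a block is 512 bytes. The field labels, the function names and the sample data are all assumptions for illustration; check them against what your own array actually prints.

```python
import re

BLOCK_BYTES = 512   # assumption: capacities are reported in 512-byte blocks
TIB = 1024 ** 4     # bytes per TiB

# Assumed field labels; adjust to match your array's real output.
CAPACITY_LINE = re.compile(r"\s*(Raw|Logical) Capacity \(Blocks\):\s*(\d+)")


def sum_capacity(report: str) -> dict:
    """Total the raw and logical block counts across all RAID groups."""
    totals = {"raw": 0, "logical": 0}
    for line in report.splitlines():
        match = CAPACITY_LINE.match(line)
        if match:
            totals[match.group(1).lower()] += int(match.group(2))
    return totals


def blocks_to_tib(blocks: int) -> float:
    """Convert a block count to TiB."""
    return blocks * BLOCK_BYTES / TIB


# Hypothetical sample text standing in for captured CLI output.
SAMPLE = """\
RaidGroup ID: 0
Raw Capacity (Blocks): 2147483648
Logical Capacity (Blocks): 1073741824
RaidGroup ID: 1
Raw Capacity (Blocks): 4294967296
Logical Capacity (Blocks): 3221225472
"""

if __name__ == "__main__":
    totals = sum_capacity(SAMPLE)
    print(f"raw: {blocks_to_tib(totals['raw']):.2f} TiB, "
          f"allocatable: {blocks_to_tib(totals['logical']):.2f} TiB")
```

The gap between the two totals is the RAID overhead within the configured groups. Disks that aren't in any RAID group (hot spares, unbound disks) never show up in this listing at all, which is the same kind of gap that separated the SRM's number from the order sheet's.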

Having these numbers in hand allowed us to tear apart the SRM's database queries and tables so that we could see what information it was grabbing, how it was storing/organizing it and how to improve on the vendor-supplied standard reports. Mostly, the fix consisted of changing the titles of some existing fields and adding some fields to the final report.

Yeah, all of this begs the question "what was the value of buying an SRM when you had to reverse-engineer it to make the presented data meaningful?" To be honest, "I dunno." I guess, at the very least, we bought a framework through which we could put together pretty reports and ones that were more specifically meaningful to us (though, to be honest, I'm a little surprised that we're the only customers of the SRM vendor to have found the canned-reports to be "sadly lacking"). It also gave me an opportunity to give our storage guys a better idea of the powerful tools they had available to them if only they were willing to dabble at the command line (even on Windows).

Still, the vendor did provide a technical resource to help us get things sorted out faster than we might have done without that assistance. So, I guess that's something?

Wednesday, July 6, 2011

Show Me the Boot Info!

For crusty old systems administrators (such as yours truly), the modern Linux boot sequence can be a touch annoying. I mean, the graphical boot system is pretty and all, but, I absolutely hate having to continually click on buttons just to see the boot details. And, while I know that some Linux distributions give you the option of viewing the boot details by either disabling the graphical boot system completely (i.e., nuke out the "rhgb" option from your grub.conf's kernel line) or switching to an alternate virtual console configured to show boot messages, that's just kind of a suck solution. Besides, if your default Linux build is like the one my company uses, you don't even have the alternate VCs as an option.

Now, this is a RedHat-centric blog, since that's what we use at my place of work (we've a few devices that use embedded SuSE, but, I probably void the service agreement any time I directly access the shell on those!). So, my "solution" is going to be expressed in terms of RedHat (and, by extension, CentOS, Scientific Linux, Fedora and a few others). For many things in RedHat, they give you nifty files in /etc/sysconfig that allow you to customize behaviors. So, I'd made the silly assumption that there'd be an /etc/sysconfig/rhgb type of file. No such luck. So, I dug around in the init scripts (grep -li is great for this, by the way) to see if there were any mentions of rhgb. There was. Well, there was mention of rhgb-client in /etc/init.d/functions.

Unfortunately, even though our standard build seems to include manual pages for every installed component, I couldn't find a manual page for rhgb-client (or an infodoc, for that matter). The best I was able to find was a /usr/share/doc/rhgb-${VERSION}/HOW_IT_WORKS file (I'm assuming that ${VERSION} is consistent with the version of the RHGB RPM installed - it seemed to be). While an interesting read, it's not exactly the best, most exhaustive document I've ever read. It's about what you'd expect from a typical README file, I guess. Still, it didn't say what arguments, if any, rhgb-client would take.

Not wanting to do anything too calamitous, I called `rhgb-client --help` as a non-privileged user. I was gladdened to see that it didn't give me one of those annoying "you must be root to run this command" errors. It also gave some usage details:

rhgb-client --help
Usage: rhgb-client [OPTION...]
  -u, --update=STRING      Update a service's status
  -d, --details=STRING     Show the details page (yes/no).
  -p, --ping               See if the server is alive
  -q, --quit               Tells the server to quit
  -s, --sysinit            Inform the server that we've finished rc.sysinit

Help options:
  -?, --help               Show this help message
  --usage                  Display brief usage message

I'd hoped that since /etc/init.d/functions had shown an "--update" argument, it might take other arguments (and, correctly, assumed one would be "--help"). So, I used the above, updated the rhgb-client call in my /etc/init.d/functions script to add "--details=yes" and rebooted. Lo and behold: I got the graphical boot session but got to see all the detailed boot messages, too! Hurrah.

Still, it seemed odd that, since the RHGB components are (sorta) configurable, there wasn't a file in /etc/sysconfig to set the requisite options. I hate having to hack config files that are likely to get overwritten the next time the associated RPM gets updated. I also figure that I can't be the only person out there who wants the graphical boot system and details. So, why haven't the RHGB maintainers fixed this (and, yes, I realize that Linux is a community thing and I'm free to contribute fixes to it - I'd just hoped that someone like RedHat or SuSE would have had enough complaints from commercial UNIX converts to have already done it for me)? Oh well, one of these days, I suppose.
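In the meantime, since a package update can quietly undo the hand edit, one option is to keep a tiny script around that re-applies it. What follows is only a sketch, in Python, under some loud assumptions: that the functions file calls the client on a single line as `rhgb-client --update=...`, and that putting `--details=yes` right after that argument is acceptable. The function name and the sample text are made up for illustration and are not copied from any real init script; back up the real file before letting anything rewrite it.

```python
import re

# Assumption: the client is invoked on one line as "rhgb-client --update=<value>",
# where <value> is either double-quoted or a bare word.
UPDATE_CALL = re.compile(r'(rhgb-client\s+--update=(?:"[^"]*"|\S+))')


def add_details_flag(text: str) -> str:
    """Append --details=yes to rhgb-client --update calls that lack it."""
    patched = []
    for line in text.splitlines(keepends=True):
        # Skip lines that already carry the flag so reruns change nothing.
        if "--details" not in line:
            line = UPDATE_CALL.sub(r"\1 --details=yes", line)
        patched.append(line)
    return "".join(patched)


# Hypothetical stand-in for the relevant part of an init functions file.
SAMPLE = (
    "update_boot_stage() {\n"
    '  /usr/bin/rhgb-client --update="$1"\n'
    "}\n"
)

if __name__ == "__main__":
    print(add_details_flag(SAMPLE), end="")
```

Because the function leaves already-patched lines alone, running it after every update is harmless: it only changes the file when the flag has gone missing.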