Friday, July 8, 2011

CLARiiON Report Data Verification

Earlier this year, the organization I work for decided to put into production an enterprise-oriented storage resource management (SRM) system. The tool we bought is actually pretty cool. We install collectors into each of our major data centers and they pull storage utilization data off of all of our storage arrays, SAN switches and storage clients (you know: the Windows and UNIX boxes that use up all that array-based storage). Then, all those collectors pump out the collected data to a reporting server at our main data center. The reporting server is capable of producing all kinds of nifty/pretty reports: configuration snapshots, performance reports, trending reports, utilization profiles, etc.

As cool as all this is, you have the essential problem of "how do I know that the data in all those pretty reports is actually accurate?" Ten or fifteen years ago, when array-based storage was fairly new and storage was still the realm of systems administrators with coding skills, you'd ask your nearest scruffy misanthrope, "could you verify the numbers on this report?" and get an answer back within a few hours (and then within minutes each subsequent time you asked). Unfortunately, in the modern, GUI-driven world, asking your storage guys to verify numbers can be like pulling teeth. Many modern storage guys aren't really coders and frequently don't know the quick and easy way to get hard numbers out of the devices they manage. In some cases, you may watch them cut and paste from the individual arrays' management web UIs into something like Microsoft Calculator. So, you'll have to wait and, oftentimes, you'll have to continually prod them for the data because it's such a pain in the ass for them to produce.

With our SRM rollout, I found myself in just such a situation. Fortunately, I've been doing Unix system administration for the best part of 20 years and, therefore, am rather familiar with scripting. I frequently wish I was able to code in better reporting languages, but I just don't have the time to keep my "real" coding skills up to par. I'm also not terribly patient. So, after waiting a couple of weeks for our storage guys to get me the numbers I'd asked for, I said to myself, "screw it: there's gotta be a quicker/better way."

In the case of our CLARiiONs, that better way was to use the NaviCLI (or, these days, the NaviSECCLI). This is a tool set that has been around a looooooong time, in one form or another, and has been available for pretty much any OS that you might attach to a CLARiiON as a storage client. These days, it's a TCP/IP-based command-line tool. Prior to NaviCLI, you either had platform-specific tools (IRIX circa 1997 had a CLI-based tool that did queries through the SCSI bus to the array) or you logged directly into the array's RS232 port and used its onboard tools (hopefully, you had a terminal or terminal program that allowed you to capture output) ...but I digress.

If you own EMC equipment, you've hopefully got maintenance contracts that give you rights to download tools and utilities from the EMC support site. NaviCLI is one such tool. Once you install it, you have a nifty little command-line tool that you can wrap inside of scripts. You can create these scripts to handle both provisioning tasks and reporting tasks. My use, in this case, was reporting.

The SRM we bought came with a number of canned reports, including ones for CLARiiON devices. Unfortunately, the numbers we were getting from our SRM were indicating that we only had about 77TiB on one of our arrays when the EMC order sheets said we should have had about 102TiB. That's a bit of a discrepancy. I was able to wrap some NaviCLI commands into a couple of scripts (one that reported on RAID-group capacity and one that reported physical and logical disk capacities [ed.: please note that these scripts are meant to be illustrative of what you can do, but aren't really something you'd want to have as the nexus of your enterprise-reporting strategy. They're slow to run, particularly on larger arrays]) and verify that the 77TiB was sort of right and that the 102TiB was also sort of right. The group capacity script basically just spits out two numbers: total raw capacity and total capacity allocatable to clients (without reporting on how much of either is already allocated to clients). The disk capacity script reports how the disks are organized (e.g., RAID1, RAID5, Spare, etc.), printing the total number of disks in each configuration category and how much raw capacity that represents. Basically, the SRM tool was reporting the maximum number of blocks that were configured into RAID groups, not the total raw physical blocks in the array that we'd thought it was supposed to report.
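For anyone wanting to try something similar, here is a rough Python sketch of the kind of arithmetic the group capacity script does. It is not the script I actually ran. It assumes you've already captured the CLI's RAID-group listing as text, that each group shows up with lines labeled "Raw Capacity (Blocks)" and "Logical Capacity (Blocks)", and that a block is 512 bytes. Treat the labels, the block size, the function names and the sample numbers as assumptions made up for illustration, and check them against what your own array prints.

```python
import re

BLOCK_BYTES = 512  # assumption: capacities are reported in 512-byte blocks
TIB = 2 ** 40

# Assumed field labels; adjust these to match your CLI's real output.
RAW_LABEL = "Raw Capacity (Blocks)"
LOGICAL_LABEL = "Logical Capacity (Blocks)"


def sum_field(report_text, label):
    """Add up every 'label: <integer>' line found in the report text."""
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r"\s*:\s*(\d+)\s*$", re.MULTILINE
    )
    return sum(int(m.group(1)) for m in pattern.finditer(report_text))


def blocks_to_tib(blocks):
    """Convert a block count to TiB using the assumed block size."""
    return blocks * BLOCK_BYTES / TIB


def summarize(report_text):
    """Return total raw and total allocatable capacity, both in TiB."""
    raw = sum_field(report_text, RAW_LABEL)
    logical = sum_field(report_text, LOGICAL_LABEL)
    return {"raw_tib": blocks_to_tib(raw), "logical_tib": blocks_to_tib(logical)}


# Made-up sample text standing in for captured CLI output.
SAMPLE = """\
RaidGroup ID: 0
Raw Capacity (Blocks): 4294967296
Logical Capacity (Blocks): 3221225472

RaidGroup ID: 1
Raw Capacity (Blocks): 2147483648
Logical Capacity (Blocks): 1073741824
"""

if __name__ == "__main__":
    totals = summarize(SAMPLE)
    print("raw: %.2f TiB, allocatable: %.2f TiB"
          % (totals["raw_tib"], totals["logical_tib"]))
```

The point is just that "raw" and "allocatable" are two different sums over the same listing, which is exactly the sort of distinction that got blurred in the SRM's canned report.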

Having these numbers in hand allowed us to tear apart the SRM's database queries and tables so that we could see what information it was grabbing, how it was storing/organizing it and how to improve on the vendor-supplied standard reports. Mostly, the fix consisted of changing the titles of some existing fields and adding some fields to the final report.

Yeah, all of this raises the question "what was the value of buying an SRM when you had to reverse-engineer it to make the presented data meaningful?" To be honest, "I dunno." I guess, at the very least, we bought a framework through which we could put together pretty reports and ones that were more specifically meaningful to us (though, to be honest, I'm a little surprised that we're the only customers of the SRM vendor to have found the canned reports to be "sadly lacking"). It also gave me an opportunity to give our storage guys a better idea of the powerful tools they had available to them if only they were willing to dabble at the command line (even on Windows).

Still, the vendor did provide a technical resource to help us get things sorted out faster than we might have done without that assistance. So, I guess that's something?
