Titular Discrepancy: 2012

Wednesday, October 31, 2012

No-Post Reboots

One of the things that's always sorta bugged me about Linux was reboots generally required a full reset of the system. That is, if you did an `init 6`, the default bahavior cause your system to drop back down to the boot PROM and go through its POST routines. While on a virtualized system, this is mostly an inconvenience as virtual hardware POST is fairly speedy. However, when you're running on physical hardware, it can be a genuine hardship (the HP-based servers I typically work with can take upwards of ten to fifteen minutes to run its POST routines).
At any rate, a few months back I was just dinking around online and found a nifty method for doing a quick, no-BIOS reboot:

# BOOTOPTS=`cat /proc/cmdline` ; KERNEL=`uname -r` ; \
kexec -l /boot/vmlinuz-$KERNEL --initrd=/boot/initrd-"{$KERNEL}".img \
--append=${BOOTOPTS} ; reboot

Basically, the above:

Reads your /proc/cmdline file to get the boot arguments of your most recent bootup and stuffs the value into the BOOTOPTS environmental
Grabs your system's currently running kernel release (in case you've got multiple kernels installed and want to boot back into the current one) and stuffs the value into the KERNEL environmental
Calls the `kexec` command (a nifty utility for directly booting into a new kernel), leveraging the previously set environmentals to tell `kexec` what to do
Finishes with a reboot to the `kexec`-defined kernel (and options)

In testing it on a physical server that normally takes about 15 minutes to reboot (10+ minutes of POST routines) and speeded it up by over 66%. On a VM, it only saves maybe a minute (though that will depend on your VM's configuration settings).

Thursday, August 9, 2012

Why So Big

Recently, while working on getting a software suite ready for deployment, I had to find space in our certification testing environment (where our security guys scan hosts/apps and decide what needs to be fixed for them to be safe to deploy). Our CTA environment is unfortunately-tight on resources. The particular app I have to get certified wants 16GB or RAM to run in but will accept as little as 12GB (less than that and the installer utility aborts).

When I went to submit my server (VM, actually) requirements to our CTA team so they could prep me an appropriate install host, they freaked. "Why does it take so much memory" was the cry. So, I dug through the application stack.

The application includes an embedded Oracle instance that wants to reserve about 9GB for its SGA and other set-asides. It's going on a 64bit RedHat server and RedHat typically wants 1GB of memory to function acceptably (can go down to half that, but you won't normally be terribly happy). That accounted for 10GB of the 12GB minimum the vendor was recommending.

Unfortunately, the non-Oracle components of the application stack didn't seem to have a single file that described memory set asides. It looked like it was spinning up two Java processes with an aggregate heap size of about 1GB.

Added to the prior totals, the aggregated heap sizes put me at about 11GB of the vendor-specified 12GB. That still left an unaccounted for 1GB. Now, it could have been the vendor was requesting 12GB because it was a "nice round number" or they could have been adding some slop to their equations to give the app a bit more wiggle-room.

I could have left it there, but decided, "well, the stack is running, lets see how much it really uses". So, I fired up top. Noticed that the Oracle DB ran under one userid and that the rest of the app-stack ran under a different one. I set top to look only at the userid used by the rest of the app-stack. The output was too long to fit on one screen and I was too lazy to want to add up the RSS numbers, myself. Figured since top wasn't a good avenue, I might be able to use ps (since the command supports the Berkeley-style output options).

Time to hit the man pages...

After digging through the man pages and a bit of cheating (Google is your friend) I found the invocation of ps that I wanted:

`ps -u <appuser> -U <appuser> -orss=`.

Horse that to a nice `awk '{ sum += 1 } END { print sum}' and I had a quick method of divining how much resident memory the application was actually eating up. What I found was that the app-stack had 52 processes (!) that had about 1.7GB of resident memory tied up. Mystery solved.

Tuesday, August 7, 2012

Why Google's Two-Factor Authentication Is Junk

To be fair, I understand the goals that Google was trying to achieve. And, they're starting down a good path. However, there are some serious flaws (as I see it) with how they've decided to treat services that don't support two-factor authentication.

Google advertises that you can set per-service passwords for each application. That is to say, if you use third-party mail clients such as Thunderbird, third-party calendaring clients such as Lightning, and third-party chat clients such as Trillian, you can set up a password for each service. Conceivably, one could set one password for IMAP, one password for SMTP, one password for iCAL and yet another password for GoogleTalk. However, Google doesn't actually sandbox the passwords. By "sandbox" I mean restrict a given password to a specific protocol. If I generate four passwords with the intention of using each password once for each service - as Google's per-application passwords would logically be inferred to work - one actually weakens the security of the associated services. Instead of each service having its own password, each of the four, generated passwords can be used with any of the four targeted services. Instead of having one guessable password, there are now four guessable passwords.
Google's "per-application" passwords do not allow you to set your password strings. You have to use their password generator. While I can give Google credit for generating 16-character passwords, the strength of the generated passwords is abysmally low. Google's generated passwords are comprised solely of lower case characters. When you go to a "how strong is my password site", Google's generated passwords are ridiculously easy. The Google password is rated at "very weak" - a massive cracking array would take 14 years to break it. By contrast, the password I used on my work systems, last December, is estimated take the best part of 16,000 centuries. For the record, my work password from last year is two characters shorter than the ones Google generates.

So, what you end up with is X number of services that are each authenticatable against with X number of incredibly weak passwords.

All in all, I'd have to rate Google's efforts, at this point, pretty damned close to #FAIL: you have all the inconvenience of two-factor authentication and you actually broaden your attack surfaces if you use anything that's not HTTP/HTTPS based.

Resources:

Tuesday, July 31, 2012

Finding Patterns

I know that most of my posts are of the nature "if you're trying to accomplish 'X' here's a way that you can do it". Unfortunately, this time, my post is more of a, "I've got yet to be solved problems going on". So, there's no magic bullet fix currently available for those with similar problems who found this article. That said, if you are suffering similar problmes, know that you're not alone. Hopefully that's small consolation and what follows may even help you in investigating your own problem.

Oh: if you have suggestions, I'm all ears. Given the nature of my configuration, there's not much in the way of useful information I've yet found via Google. Any help would be much appreciated...

The shop I work for uses Symantec's Veritas NetBackup product to perform backups of physical servers. As part of our effort to make more of the infrastructure tools we use more enterprise-friendly, I opted to leverage NetBackup 7.1's NetBackup Access Control (NBAC) subsystem. On its own, it provides fine-grained rights-delegation and role-based access control. Horse it to Active Directory and you're able to roll out a global backup system with centralized authentication and rights-management. That is, you have all that when things work.

For the past couple months, we've been having issues with one of the dozen NetBackup domains we've deployed into our global enterprise. When I first began trougleshooting the NBAC issues, the authentication and/or authorization failures had always been associated with a corruption of LikeWise's sqlite cachedb files. At the time the issues first cropped up, these corruptions always seemed to coincide with DRS moving the NBU master server from one ESX host to another. It seemed like, when under sufficiently heavy load - the kind of load that would trigger a DRS event - LikeWise didn't respond well to having the system paused and moved. Probably something to do with the sudden apparent time-jump that happens when a VM is paused for the last parts of the DRS action. My "solution" to the problem was to disable automated-relocation for my VM.

This seemed to stabilize things. LikeWise was no longer getting corrupted and seems like I'd been able to stabilize NBAC's authentication and authorization issues. Well, they stabilized for a few weeks.

Unfortunately, the issues have begun to manifest themselves, again, in recent weeks. We've now had enough errors that some patterns are starting to emerge. Basically, it looks like something is horribly bogging the system down around the time that the nbazd crashes are happening. I'd located all the instances of nbazd crashing from its log files ("ACI" events are loged to the /usr/openv/netbackup/logs/nbazd logfiles), and then began to try correlating them with system load shown by the host's sadc collections. I found two things: 1) I probably need to increase my sample frequency - it's currently at the default 10-minute interval - if I want to more-thoroughly pin-down and/or profile the events; 2) when the crashes have happened within a minute or two of an sadc poll, I've found that the corresponding poll was either delayed by a few seconds to a couple minutes or was completely missing. So, something is causing the server to grind to a standstill and nbazd is a casualty of it.

For the sake of thoroughness (and what's likely to have matched on a Google-search and brought you here), what I've found in our logs are messages similar to the following:

/usr/openv/netbackup/logs/nbazd/vxazd.log

07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3004 Error encountered during ACI repository operation.
07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3078 Fatal error encountered. (txn.c:964)
07/28/2012 05:11:48 AM VxSS-vxazd LOG V-18-4204 Server is stopped.
07/30/2012 01:13:31 PM VxSS-vxazd LOG V-18-4201 Server is starting.

/usr/openv/netbackup/logs/nbazd/debug/vxazd_debug.log

07/28/2012 05:11:48 AM Unable to set transaction mode. error = (-1)
07/28/2012 05:11:48 AM SQL error S1000 -- [Sybase][ODBC Driver][SQL Anywhere] Connection was terminated
07/28/2012 05:11:48 AM Database fatal error in transaction, error (-1)
07/30/2012 01:13:31 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
07/30/2012 01:22:40 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf

Our NBU master servers are hosted on virtual machines. It's a supported configuration and adds a lot of flexibility and resiliency to the overall enterprise-design. It also means that I have some additional metrics available to me to check. Unfortunately, when I checked those metrics, while I saw utilization spikes on the VM, those spikes corresponded to healthy operations of the VM. There weren't any major spikes (or troughs) during the grind-downs. So, to ESX, the VM appeared to be healthy.

At any rate, I've requested our ESX folks see if there might be anything going on on the physical systems hosting my VM that aren't showing up in my VM's individual statistics. I'd previously had to disable automated DRS actions to keep LikeWise from eating itself - those DRS actions wouldn't have been happening had the hosting ESX system not been experiencing loading issues - perhaps whatever was causing those DRS actions is still afflicting this VM.

I've also tagged one of our senior NBU operators to start picking through NBU's logs. I've asked him to look to see if there are any jobs (or combinations of jobs) that are always running during the bog-downs. If it's a scheduling issue (i.e., we're to blame for our problems), we can always reschedule jobs to exert less loading or we can scale up the VM's memory and/or CPU reservations to accommodate such problem jobs.

For now, it's a waiting-game. At least there's an investigation path, now. It's all in finding the patterns.

Wednesday, April 25, 2012

When VMs Go POOF!

Like many organizations, the one I work for is following the pathway to the cloud. Right now, this consists of moving towards an Infrastructure As A Service model. This model heavily leverages virtualization. As we've been pushing towards this model, we've been doing the whole physical to virtual dance, wherever possible.

In many cases, this has worked well. However, like many large IT organizations, we have found some assets on our datacenter floors that have been running for years on both outdated hardware and outdated operating systems (and, in some cases, "years" means "in excess of a decade"). Worse, the people that set up these systems are often no longer employed by our organization. Worse still, the vendors of the software in use long-ago discontinued either the support of the software version in use or no longer maintain any version of the software.

One such customer was migrated last year. They weren't happy about it at the time, but, they made it work. Because we weren't just going to image their existing out of date OS into the pristine, new environment, we set them up with a virtualized operating environment that was as close to what they'd had as was possible. Sadly, what they'd had was a 32bit RedHat Linux server that had been built 12 years previously. Our current offerings only go back as far as RedHat 5, so that's what we built them. While our default build is 64bit, we offer 32bit builds - unfortunately, the customer never requested a 32bit environment The customer did their level best to massage the new environment into running their old software. Much of this was done by tar'ing up directories on the old system and blowing them onto their new VM. They'd been running on this massaged system for nearly six months.

If you noticed that 32-to-64bit OS-change, you probably know where this is heading...

Unfortunately, as is inevitable, we had to take a service outage across the entire virtualized environment. We don't yet have the capability in place to live migrate VMs from one data center to another for such windows. Even if we had, the nature of this particular outage (installation of new drivers into each VM) was such that we had to reboot the customer's VM, any way.

We figured we were good to go as we had 30+ days of nightly backups of each impacted customer VM. Sadly, this particular customer, after doing the previously-described massaging of their systems, had never bothered to reboot their system. It wasn't (yet) in our procedures to do a precautionary pre-modification reboot of the customers' VMs. The maintenance work was done and the customer's VM rebooted. Unfortunately, the system didn't come back from the reboot. Even more unfortunately, the thirty-ish backup images we had for the system were similarly unbootable and, worse, unrescuable.

Eventually, we tracked down the customer to inform them of the situation and to find out if the system was actually critical (the VM had been offline for nearly a full week by the time we located the system owner, but no angry "where the hell is my system" calls had been received or tickets opened). We were a bit surprised to find that this was, somehow, a critical system. We'd been able to access the broken VM to a sufficient degree to determine that it hadn't been rebooted in nearly six months, that it's owner hadn't logged in nearly that same amount of time but that the on-disk application data appeared to be intact (the filesystems they were on were mountable without errors). So, we were able to offer the customer the option of building them a new VM and helping them migrate their application data off the old VM to the new VM.

We'd figured we'd just restore data from the old VM's backups to a directory tree on the new VM. The customer, however, wanted the original disks back as mounted disks. So, we had to change plans.

The VMs we build make use of the Linux Volume Manager software to manage storage allocation. Each customer system is built off of standard templates. Thus, each VM's default/root volume groups all share the same group name. Trying to import another host's root volume group onto a system that already has a volume group of the same name tends to be problematic in Linux. That said, it's possible to massage (there's that word again) things to allow you to do it.

The customer's old VM wasn't bootable to a point where we could simply FTP/SCP its /etc/lvm/backup/RootVG file off. The security settings on our virtualization environment also meant we couldn't cut and paste from the virtual console into a text file on one of our management hosts. Fortunately, our backup system does support file-level/granular restores. So, we pulled the file from the most recent successful backup.

Once you have an /etc/lvm/backup/<VGNAME> file available, you can effect a name change on a volume group relatively easily. Basically, you:

Create your new VM
Copy the old VM's /etc/lvm/backup/<VGNAME> into the new VM's /etc/lvm/backup directory (with a new name)
Edit that file, changing the old volume group's object names to ones that don't collide with the currently-booted root volume groups (really, you only need to change the name of the root volume group object - the volumes can be left as is
Connect the old VM's virtual disk to the new VM
Perform a rescan of the VM's scsi bus so it sees the old VM's virtual disk
Do a `vgcfgrestore -f /etc/lvm/backup/<VGNAME> <NewVGname>`
Do a `pvscan`
Do a `vgscan`
Do a `vgchange -ay <NewVGname>`
Mount up the renamed volume group's volumes

As a side note (to answer "why didn't you just do a `boot linux rescue` and change things that way"): our security rules prevent us from keeping bootable ISOs (etc.) available to our production environment. Therefore, we can't use any of the more "normal" methods for diagnosing or recovering a lost system. Security rules trump diagnostics and/or recoverability needs. Dems da breaks.

Asymmetrical NIC-bonding

Currently, the organization I am working for is making the transition from 1Gbps networking infrastructure to 10Gbps infrastructure. The initial goal had been to first migrate all of the high network-IO servers that were using trunked 1Gbps interfaces to using 10Gbps Active/Passive configurations.
Given the current high per-port cost of 10Gbps networking, it was requested that a way be found to not waste 10Gbps ports. Having a 10Gbps port sitting idle "in case" the active port became unavailable was seen as financially wasteful. A a result, we opted to pursue the use of asymmetrical A/P bonds that used our new 10Gbps links for the active/primary path and reused our 1Gbps infrastructure for the passive/failover path.

Setting up bonding on Linux can be fairly trivial. However, when you start to do asymmetrical bonding, you want to ensure that your fastest paths are also your active paths. This requires some additional configuration of the bonded pairs beyond just the basic declaration of the bond memberships.
In a basic bonding setup, you'll have three primary files in the /etc/sysconfig/network-scripts directory: ifcfg-ethX, ifcfg-ethY and ifcfg-bondZ. The ifcfg-ethX and ifcfg-ethY files are basically identical but for their DEVICE and HWADDR parameters. At their most basic, they'll each look (roughly) like:

DEVICE=ethN
HWADDR=AA:BB:CC:DD:EE:FF
ONBOOT=yes
BOOTPROTO=none
MASTER=bondZ
SLAVE=yes

And the (basic) ifcfg-bondZ file will look like:

DEVICE=bondZ
ONBOOT=yes
BOOTPROTO=static
NETMASK=XXX.XXX.XXX.XXX
IPADDR=WWW.XXX.YYY.ZZZ
MASTER=yes
BONDING_OPTS="mode=1 miimon=100"

This type of configuration may produce the results you're looking for, but it's not guaranteed to. If you want to absolutely ensure that your faster NIC will be selected as the primary NIC (and that it will fail back to that NIC in the event that the faster NIC goes offline and then back online), you need to be a bit more explicit with your ifcfg-bondZ file. To do this, you'll mostly want to modify your BONDING_OPTS directive. I also tend to add some BONDING_SLAVEn directives, but that might be overkill. Your new ifcfg-bondZ file that forces the fastest path will look like:

DEVICE=bondZ

ONBOOT=yes

BOOTPROTO=static

NETMASK=XXX.XXX.XXX.XXX

IPADDR=WWW.XXX.YYY.ZZZ

MASTER=yes

BONDING_OPTS="mode=1 miimon=100 primary=ethX primary_reselect=1"

The primary= tells the bonding driver to set the ethX device as primary when the bonding-group first onlines. The primary_reselect= tells it to use a interface selection policy of "best".

Note: The default policy is "0". This policy simply says "return to interface declared as primary". I choose to override with policy "1" as a hedge against the primary interface coming back in some kind of degraded state (while most of our 10Gbps media is 10Gbps-only, some of the newer ones are 100/1000/1000). I only want to fail back to the 10Gbps interface if it's still running at 10Gbps and hasn't for some reason, negotiated down to some slower speed.

When using the more explicit bonding configuration, the resultant configuration will resemble something like:

Ethernet Channel Bonding Driver: v3.4.0-1 (Octobber 7, 2008)

Bonding Mode: fault-tolerance (active-backup)

Primary Slave: ethX (primary_reselect better)

Currently Active Slave: ethX

MII Status: up

MII Polling Interval (ms): 100

Up Delay (ms): 0

Down Delay (ms): 0

Slave Interface: ethX

MII Status: up

Speed: 10000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: AA:BB:CC:DD:EE:FF

Slave Interface: ethY

MII Status: up

Speed: 1000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: FF:EE:DD:CC:BB:AA

Thursday, March 22, 2012

ACL Madness

In our environment, security is a concern. So, many programs and directories that you might take being available to you on a "standard" Linux system will give you a "permission denied" on one of our systems. Traditionally, you might work around this by changing the group-ownership and permissions on the object to allow a subset of the system users the expected level of access to those files or directories. And, this will work great if all the people that need access share a common group. If they don't you have to explore other avenues. One of these avenues is the extened permissions attributes found in Linux's ACL handling (gonna ignore SELinux, here, mostly because: A) I'm lazy; B) I freaking hate SELinux - probably because of "A"; and, C) ACLs are a feature that is available across different implementations of UNIX and Linux, and is thus more portable than SELinux or other vendor-specific extended security attribute mechanisms). Even better, as a POSIX filesystem extension, you can use it on things like NFS shares and have your ACLs work across systems and *N*X platforms (assuming everone implements the POSIX ACLs the same way).

And, yes, I know things like `sudo` can be used to delegate access, but that's fraught with its own concerns, as well, least of all its lack of utility for users that aren't logging in with interactive shells.

Say you have a file <tt>/usr/local/sbin/killmenow</tt> that is currently set to mode 700 and you want to give access to it to members of the ca_opers and md_admins groups. You can do something like:

# setfacl -m g:ca_opers:r-x /usr/local/sbin/killmenow # setfacl -m g:md_admins:r-x /usr/local/sbin/killmenow

Now, members of both the ca_opers and md_admins groups can run this program.

All is well and good until someone asks, "what files have had their ACLs modified to allow this" and you've (or others on your team) gone crazy with the setfacl command. With standard file permissions, you can just use the `find` command to locate files that have a specific permissions-setting. With ACLs, `find` is mostly going to let you down. So, what to do? Fortunately, setfacl 's partner-command, `getfacl ` can come to your rescue. Doing something like

# getfacl --skip-base -R 2> /dev/null | sed -n 's/^# file://p`

Will walk the directory-structure, from downward, giving you a list of files with ACLs added to them. Once you've identified such-modified files, you can then run `getfacl` against them, individually, to show what the current ACLs are.

Friday, March 9, 2012

FaceBook Security Fail

Ok, this is a bit of a departure from my usual, work-related posting. That said, security fits into the rubric of "more serious postings", even if that security is related to social networks. Who knows: maybe I'll decide to move it, later. At any rate...

I'm a big fan of social networking. In addition to my various blogs (on Posterous, BlogSpot, Tumblr, etc.), I make heavy use of services like FaceBook, Plus and Twitter (and have "personal" and "work" personnae set up on each). On some services (FaceBook and LiveJournal) I run my accounts fairly locked-down; on others (most everything else), I run things wide-open. In either case, I'm not exactly the most censored of individuals. The way I look at it, if someone's going to use my postings against me, I may as well make it easy for them to do it up front than to put myself in a position where I'm invested in an individual or an organization only to have my posting history negatively subsequently sabotage that. It's a pick your poison kind of thing.

That said, not everyone is quite as laissez-faire about their online sharing activities. So, in general, I prefer to keep my stuff at least as locked down as the members of my sharing community do. That way, there's a lower likelihood that my activities will accidentally compromise someone else.

Today, a FaceBook friend of mine was trying to sort through they myriad security settings available to her so that she could create a "close friends only" kind of profile. She'd thought that she'd gotten things pretty locked down, until an unexpected personal message revealed to her that she had information leakage, somewhere, in her FaceBook usage. I was trying to help her ID it.

While my friend had fairly restrictively locked down her profile, she wasn't aware that certain actions could compromise those settings. Specifically, she wasn't aware that if she posted a comment to a public (or at least more open) thread, that others would be able to "see her". She'd assumed that if she set all of her security buttons-n-dials to "friends only" that anything she did would be kept friends only. With FaceBook, that's mostly the case. However, if you comment on a thread started by another friend, then everyone who is able to see that thread can "see" (aspects of) you, as well. Thus, if a friend starts a thread and has the permissions set to public and you comment on it, the entire Internet can see that you have some kind of FaceBook presence, even if they don't have permission to view your profile/timeline.

In attempting to illustrate this, I took a screen capture of a post that had been set to public. I'd done so using the post of someone I thought was a shared friend (when I'd clicked on the poster's profile, both myself and my security-conscious friend appeared to show up in the poster's "mutual friends" list). It turns out I was mistaken in that thought.

When I'd posted the screen shot to my security-conscious friend's wall, I tagged the original poster in that wall post. My security-conscious friend had set her wall to "friends only". When she informed me that the public-poster was not a mutual "friend" but a "friend of a friend", I'd made the suppostion that the tagging of the public-posting friend would be moot. After all: what kind of security model would allow me to overrided my security-conscious friend's wall security settings with something so simple as a tag-event? Turns out, FaceBook's security model would. To me, that would fall into the general heading of a "broken security model".

Oh well, now to figure out how to rattle some cages in FaceBook's site usability group to get them to fix that.

Monday, March 5, 2012

AD Integration Woes

For a UNIX guy, I'm a big fan of leveraging Active Directory form centralized system and user management purposes. For me, one of the big qualifications for any application/device/etc. to be able to refer to itself as "enterprise" is that it has to be able to pull user authentication information via Active Directory. I don't care whether it does it the way that Winbind-y things do, or if they just pull data via Kerberos or LDAP. All I care is that I can offload my user-management to a centralized source.

In general, this is a good thing. However, like many things that are "in general, a good thing" it's not always a good thing. If you're in an enterprise that's evolving it's Active Directory infrastructure, tying things to AD means that you're impacted when AD changes or breaks. Someone decides, "hey, we need to reorganize the directory tree" and stuff can break. Someone decides, "hey, we need to upgrade our AD servers from 2003 to 2008" and stuff can break.

Recently, I started getting complaints from the users of our storage resource management (SRM) system that they couldn't login any more. It'd been nearly a year since I'd set it up, so sorting it out was an exercise in trying to remember what the hell I did ...and then Googling.

The application, itself, runs on a Windows platform. The login module that it uses for centralized authentication advertises itself as "Active Directory". In truth, the login module is a hybrid LDAP/Kerberos software module. Even though it's a "Windows" application, they actually use the TomCat Java server for the web UI (and associated login management components). TomCat is what uses Kerberos for authentication data.

Sometime in recent months, someone had upgraded Active Directory. Users that had been using the software before the AD-upgrade were able to authenticate, just fine. Users that had tried to start using the software after the AD-upgrade couldn't get in. Turns out that, when Active Directory had gotten upgraded, the encryption algorithms had gotten changed. Ironically, I didn't find the answer to my problem in any of the forums related to the application: I found it in the forums for another application that used TomCat's Kerberos components.

To start the troubleshooting process, one needs to first modify TomCat's bsclogin.conf file. Normally, this file is only used to tell TomCat where to find the Kerberos configuration file. However, if you modify your bsclogin.conf file and add the directive "debug=true" to it like so:

com.sun.security.auth.module.Krb5LoginModule required debug=true;

Enhanced user login debugging messages are enabled. Once this is added and tomcat is restarted, login-related messages will start showing up in your ${TOMCATHOME}/logs/stdout.log file. What started showing up in mine were messages like:

[Krb5LoginModule] user entered username: thjones2@SRMdomain.NET Acquire TGT using AS Exchange

[Krb5LoginModule] authentication failed KDC has no support for encryption type (14)

With this error message in hand, I was able to find out that TomCat's Kerberos modules were using the wrong encryption routines to get data out of Active Directory. The fix was to update my Kerberos initialization file and add the following two lines to my [libdefaults] stanza (I just added it right after the dns_lookup_realm line and before the next stanza of directives):

default_tkt_enctypes = rc4-hmac
default_tgs_enctypes = rc4-hmac

Making this change (and restarting TomCat) resulted in the failing users suddenly being able to get in.

I'd normally rag on my Windows guys for this, but, the Windows guys aren't exactly used to providing AD-related service notifications to anyone but Windows users. This application, while (currently) running on a Windows platform, isn't something that's traditionally thought of as AD-reliant. Factor in that true, native AD modules pretty much just auto-adjust to such changes, and it didn't occur to them to notify us of the changes.

Oh well. AD-integration is a learning experience for everyone involved, I suppose.

Tuesday, February 21, 2012

Cross-Platform Sharing Pains

All in all, the developers of Linux seem to have done a pretty good job of ensuring that Linux is able to integrate with other systmes. Chief among this integration has been in the realm of sharing files between Linux and Windows hosts. Overall, the CIFS support has been pretty damned good and has allowed Linux to lead in this area compared to other, proprietary *N*X OSes.

That said, Microsoft (and others) seem to like to create a moving target for the Open Source community to aim at. If you have Windows 2008-based fileservers in your operations and are trying to get you Linux hosts to mount shares coming off these systems, you hay have run into issues with doing so. This is especially so if your Windows share-servers are set up with high security settings and you're trying to use service names to reference those share servers (i.e., the Windows-based fileserver may have a name like "dcawfs0035n" but you might have an alias like "repository5").

Normally, when mounting a CIFS fileserver by its real name, you'll do something like:

# mount -t cifs "//dcawfs0035n/LXsoftware" /mnt/NAS -o domain=mydomain,user=myuser

And, assuming the credentials you supply are correct, the URI is valid and the place you're attemtping to mount to exists and isn't busy, you'll end up with that CIFS share mounted on your Linux host. However, if you try to mount it via an alias (e.g., a CNAME in DNS):

# mount -t cifs "//repository5/LXsoftware" /mnt/NAS -o domain=mydomain,user=myuser

You'll get prompted for your password - as per normal - but, instead of being rewarded with the joy of your CIFS share mounted to you Linux host, you'll get an error similar to the following:

mount error 5 = Input/output error Refer to the mount.cifs(8) manual page (e.g.man mount.cifs)

Had you fat-fingered your password, you'd have gotten a "mount error 12" (permission denied), instead. The above results in strict name checking being performed on the share mount attempt. Because you've attempted to an alias to connect with, the name-checking will fail and you'll get the above denial. You can verify that this is the underlying cause by re-attempting the mount with the fileserver's real name. If that succeeds where the alias, failed, you'll know where to go, next.

The Microsoft-published solution is found in KB281308. To summarize, you'll need to:

Have admin and login rights on the share-server
Login to the share-server
Fire up regedit
Navigate to "HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters"
Create a new DWORD paramter named "DisableStrictNameChecking"
Set its value to "1"
Reboot the fileserver
Retry your CIFS mount attempt.

At this point, your CIFS mount should succeed.

Interestingly, if you've ever tried to connect to this share from another Windows host not in the share-server's domain (e.g., from a host on different Windows domain that doesn't have cross-realm trusts set up or a standlone Windows client), you will probably have experienced connection errors, as well. Typical error messages being something on the order of "account is not allowed to logon from this resource" or just generally refusing to accept what you know to be a good set of credentials.

Wednesday, January 18, 2012

Quick-n-Dirty Yum Repo via HTTP

Recently, we had a bit of a SNAFU in the year-end renewal of our RedHat support. As a result, all of the RHN accounts tied to the previous contract lost access to RHN's software download repositories. This meant that things like being able to yum-down RPMs on rhn_register'ed systems no longer worked and, we couldn't log into RHN and do a manual download, either.

Fortunately, because we're on moderately decent terms with RedHat and they know that the contract eventually will get sorted out, they were willing to help us get through our current access issues. Moving RHN accounts from one permanent contract to another, after first associating them with some kind of temporary entitlement is a paperwork-hassle for all parties involved and is apt to get your account(s) mis-associated down the road. Since all parties knew that this was a temporary problem but needed an immediate fix, our RedHat representative furnished us with the requisite physical media necessary to get us through till the contracts could get sorted out.

Knowing that I wasn't the only one that might need the software and that I might need to execute some burndown-rebuilds on a particular project I was working on, I wanted to make it easy to pull packages to my test systems. We're in an ESX environment, so, I spun up a small VM (only 1GB of virtual RAM, 1GHz of virtual CPU, a couple Gigabytes of virtual disk for a root volume and about 20GB of virtual disk to stage software onto and build an RPM repository on) to act as a yum repository server.

After spinning this basic VM, I had to sort out what to do as far as getting that physical media turned into a repo. I'm not a big fan of copying CDs as a stream of discrete files (been burned, too many times, by over-the-wire corruption, permission issues and the like). So, I took the DVD and made an ISO from it. I then took that ISO and scp'ed it up to the VM.

Once I had the ISO file copied up to the VM, did a quick mount of it (`mount -t iso9660 -o loop,ro /tmp/RHEL5.7_x86_64.iso /mnt/DVD` for those of you playing at home). Once I mounted it, I did a quick copy of its contents to the filesystem I'd set aside for it. I'm kind of a fan of cpio for things like this, so I cd'ed into the root of the mounted ISO and did a `find . -print | cpio -pmd /RepoDir` to create a good copy of my ISO data into a "real" filesystem (note, you'll want to make sure you do a `umask 022` first to ensure that the permission structures from the mounted ISO get copied, intact, along with the files, themselves).

With all of the DVD's files copied to the repo-server and into a writeable filesystem, it's necessary to create all the repo structures and references to support use by yum. Our standard build doesn't include the createrepo tool, so, first I had to locate its RPM in the repo filessytem and then install it onto my repo-server. Doing a quick `find . -name "*createrepo*rpm"` while cd'ed into repo fileystem turned up the path to the requisite RPM. I then did an `rpm -Uh [PathFromFind]` to install the createrepo tool's RPM files.

The createrepo tool is a nifty little tool. You just cd into the root of the directory where you copied your media to, do a `createrepo .`, and it scans the directory structures to find all the RPMs and XMLs and other files and creates the requisite data structures and pointers that allow yum to know how to pull the appropriate RPMs from the filesystem.

Once that's done, if all you care about is local access to the RPMs, you can create a basic .repo file in /etc/yum.repos.d that uses a "baseurl=file:///Path/To/Media" directive in it.

In my case, I wanted to make my new repo available to other hosts at the lab. Easiest way to make the repo available over the network is to do so via HTTP. Our standard build doesn't include the standard RedHat HTTP server, by default. So, I manually installed the requisite RPMs from the repo's filesystem. I modified the stock /etc/httpd/conf/httpd.conf and added the folowing stanzas to it:

Alias /Repo/ "/RepoDir/"
<Directory "/RepoDir">
   Options Indexes MultiViews
   AllowOverride None
   Order allow,deny
   Allow from all
</Directory>

[Note: this is probably a looser configuration than I'd have in place if I was making this a permanent solution, but this was just meant as a quick-n-dirty workaround for a temporary problem.]

I made sure to do a `chkconfig httpd on` and then did a `service httpd start` to activate the web server. I then took my web browser and made sure that the repo filesystem's contents were visable via web client. It wasn't: I forgot that our standard build has port 80 blocked by default. I did the requisite juju to add an exception to iptables for port 80 and all was good to go.

With my RPMs (etc.) now visable via HTTP, I logged into the VM that I was actually needing to install RPMs to via yum. I escalated privileges to root and created an /etc/yum.repos.d/LAB.repo file that looked similar to the following:

[lab-repo]
name=RHEL 5.7
baseurl=http://repovm.domain.name/Repo
enabled=1
gpgcheck=0

I did a quick cleanup of the consuming VM's yum repo information with a `yum clean all` and then verified taht my consuming VM was able to properly see the repos's data by doing a `yum list`. All was good to go. Depending on how temporary this actually ends up being, I'll go back and make my consuming VM's .repo file a bit more "complete" and more properly layout the repo-server's filesystem and HTTP config.