
Saturday, July 16, 2016

Retrospective Automatic Image Replication in NetBackup

In version 7.x of NetBackup, VERITAS added the Automatic Image Replication functionality, more commonly referred to as "AIR". Its primary use case is to enable a NetBackup administrator to easily configure data replication between two different (typically geographically dispersed) NetBackup domains.

Like many tools designed for a given use case, AIR can be used for things it wasn't specifically designed for. The primary downside to these not-designed-for use cases is that the documentation and tooling for them is generally pretty thin.

A customer I was assisting wanted to upgrade their appliance-based NetBackup system but didn't want to have to give up their old data. Because NetBackup appliances use Media Server Deduplication Pools (MSDP), it meant that I had a couple choices in how to handle their upgrade. I opted to try to use AIR to help me quickly and easily migrate data from their old appliance's MSDP to their new appliance's.

Sadly, as is typical of not-designed-for use cases, documentation for doing it was kind of thin on the ground. Worse, because Symantec had recently spun VERITAS back off as its own entity, many of the forums that survived the transition had reference- and discussion-links that pointed to nowhere. Fortunately, I had access to a set of laboratory systems (AWS/Azure/Google Cloud/etc. is great for this, both from the standpoint of setup speed and "ready to go" OS templates). I was also able to borrow some of my customer's NetBackup 7.7 keys to use for the testing.

I typically prefer to work with UNIX/Linux-based systems to host NetBackup. However, my customer is a Windows-based shop. My customer's planned migration was also going to have the new NetBackup domain hosted on a different VLAN from their legacy NetBackup domain. This guided my lab design: I created a cloud-based "lab" configuration using two subnets and two Windows Server 2012 instance-templates. I set up each of my instances with enough storage to host the NetBackup software on one disk and the MSDPs on another disk ...and provisioned each of my test master servers with four CPUs and 16GiB of RAM. This is considerably smaller than both their old and new appliances, but I also wasn't trying to simulate an enterprise outpost's worth of backup traffic. I also set up a mix of about twenty Windows and Linux instances to act as testing clients (the customer is beginning to add Linux systems as virtualization and Linux-based "appliances" have started to creep into their enterprise-stacks).

I set up two very generic NetBackup domains. Into each, I built an MSDP. I also set up a couple of very generic backup policies on the first NetBackup master server to back up all of the testing clients to its MSDP. I configured the policies for daily fulls and hourly incrementals, and set up each of the clients to continuously regenerate random data-sets in their filesystems. I let this run for forty-eight hours so that I could get a nice amount of seed-data into the source NBU domain's MSDP.
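The client-side churn can be as simple as a cron-driven script. A minimal sketch of a one-pass churn job, assuming a Linux client (the directory, file count, and file sizes are illustrative, not the values I actually used):

```shell
#!/bin/bash
# One churn pass: overwrite a rotating set of files with fresh random
# data so the hourly incrementals always find changed files to pick up.
# CHURNDIR and the file sizes are illustrative assumptions.
CHURNDIR="${CHURNDIR:-/tmp/backup-churn}"
mkdir -p "${CHURNDIR}"

for N in $(seq 0 9)
do
   # 64KiB of fresh random content per file, every pass
   dd if=/dev/urandom of="${CHURNDIR}/testfile-${N}.bin" \
      bs=1024 count=64 2>/dev/null
done
```

Run from cron every few minutes, something like this guarantees each incremental has new data to move.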

Note: If you're not familiar with MSDP setup, the SETTLERSOMAN website has a good, generic walkthrough.

After populating the source site's MSDP, I converted from using the MSDP by way of a direct Storage Unit (STU) definition to using it by way of a two-stage Storage Lifecycle Policy (SLP). I configured the SLP to use the source-site MSDP as the stage-one destination in the lifecycle and added the second NBU domain's MSDP as the stage-two destination. I then seeded the second NBU domain's MSDP with data by executing a full backup of all clients against the SLP.

Note: For a discussion of setting up an AIR-based replication SLP, again, the SETTLERSOMAN website has a good, generic walkthrough.

All of the above is fairly straight-forward and well documented (both within the NBU documentation and sites like SETTLERSOMAN). However, it only addresses the issue of how you get newly-generated data from one NBU domain's MSDP to another's. Getting older data from an existing MSDP to a new MSDP is a bit more involved ...and not for the command-line phobic (or, in my case, PowerShell-phobic.)

At a high level, what you do is:
  1. Use the `bpimmedia` tool to enumerate all of the backup images stored on the source-site's MSDP
  2. Grab only the backup-IDs of the enumerated backup images
  3. Feed that list of backup-IDs to the `nbreplicate` tool so that it can copy that old data to the new MSDP
Note: The vendor documentation for the `bpimmedia` and `nbreplicate` tools can be found at the VERITAS website.

When using the `bpimmedia` tool to automate image-ID enumeration, the `-l` flag puts the output into a script-parsable format. The desired capture-item is the fourth field of each line that begins with 'IMAGE':
  • In UNIX/Linux shell, use an invocation similar to: `bpimmedia -l | awk '/^IMAGE/{print $4}'`
  • In PowerShell, use an invocation similar to: `bpimmedia -l | select-string -pattern "^IMAGE" | ForEach-Object { $data = $_ -split " " ; "{0}" -f $data[3] }`
The above output can then either be captured to a file (so that a single `nbreplicate` job can be launched to handle all of the images) or each individual image-ID can be passed to its own `nbreplicate` job (typically via a command-pipeline in a foreach loop). I ended up doing the latter because, even though the documentation indicates that the tool supports specifying an image-file, `nbreplicate` did not seem to know what to do with said file when executed under PowerShell.
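Before pointing either one-liner at a live catalog, the field extraction can be sanity-checked against a canned line. In the sample below, everything other than the leading 'IMAGE' token and the fourth field is fabricated filler, not the real `bpimmedia -l` output layout:

```shell
#!/bin/bash
# Fabricated sample line: only the leading IMAGE token and the fourth
# field (the backup-ID) matter to the extraction being tested.
SAMPLE='IMAGE winclient01 8 winclient01_1468652400 daily-full 0 0'
echo "${SAMPLE}" | awk '/^IMAGE/{print $4}'
# prints: winclient01_1468652400
```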

The `nbreplicate` command has several key flags we're interested in for this exercise:
  • -backupid: The backup-identifier captured via the `bpimmedia` tool
  • -cn: The copy-number to replicate — in most circumstances, this should be "1"
  • -rcn: The copy-number to assign to the replicated backup-image — in most circumstances, this should be "1"
  • -slp_name: the name of the SLP hosted on the destination NetBackup domain
  • -target_sts: the FQDN of the destination storage-server (use `nbemmcmd -listhosts` to verify names - or the replication jobs will fail with a status 191, sub-status 174)
  • -target_user: the username of a user that has administrative rights to the destination storage-server
  • -target_pwd: the password of the -target_user username
If you don't care about minimizing the number of replication operations, this can all be put together similar to the following:
  • For Unix:
    for ID in $(bpimmedia -l | awk '/^IMAGE/{print $4}')
    do
       nbreplicate -backupid ${ID} -cn 1 -slp_name <REMOTE_SLP_NAME> \
         -target_sts <REMOTE_STORAGE_SERVER> -target_user <REMOTE_USER> \
         -target_pwd <REMOTE_USER_PASSWORD>
    done
    
  • For Windows:
     @(bpimmedia -l | select-string -pattern "^IMAGE" |
        ForEach-Object { $data = $_ -split " " ; "{0}" -f $data[3] }) |
        ForEach-Object { nbreplicate -backupid $_ -cn 1 `
          -slp_name <REMOTE_SLP_NAME> -target_sts <REMOTE_STORAGE_SERVER> `
          -target_user <REMOTE_USER> -target_pwd <REMOTE_USER_PASSWORD> }
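For anything beyond a one-off migration, it's worth knowing which submissions failed. A slightly more defensive sketch of the Unix loop, with the four REMOTE_* values as placeholders for your own domain's settings:

```shell
#!/bin/bash
# Replicate every image on the local MSDP, logging any image whose
# nbreplicate submission exits non-zero. The REMOTE_* values below are
# placeholders -- substitute your own domain's names and credentials.
REMOTE_SLP_NAME="my-air-slp"
REMOTE_STORAGE_SERVER="nbu-master2.example.com"
REMOTE_USER="replication-admin"
REMOTE_USER_PASSWORD="changeme"
FAILLOG=/tmp/nbreplicate-failures.log
: > "${FAILLOG}"

for ID in $(bpimmedia -l | awk '/^IMAGE/{print $4}')
do
   if ! nbreplicate -backupid "${ID}" -cn 1 -slp_name "${REMOTE_SLP_NAME}" \
        -target_sts "${REMOTE_STORAGE_SERVER}" -target_user "${REMOTE_USER}" \
        -target_pwd "${REMOTE_USER_PASSWORD}"
   then
      echo "${ID}" >> "${FAILLOG}"
   fi
done
```

Anything left in the failure log can be re-fed to `nbreplicate` after the first pass completes.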
    


Tuesday, July 31, 2012

Finding Patterns

I know that most of my posts are of the nature "if you're trying to accomplish 'X', here's a way you can do it". Unfortunately, this time, my post is more of an "I've got as-yet-unsolved problems going on". So, there's no magic-bullet fix currently available for those with similar problems who found this article. That said, if you are suffering similar problems, know that you're not alone. Hopefully that's some small consolation, and what follows may even help you in investigating your own problem.

Oh: if you have suggestions, I'm all ears. Given the nature of my configuration, there's not much in the way of useful information I've yet found via Google. Any help would be much appreciated...



The shop I work for uses Symantec's Veritas NetBackup product to perform backups of physical servers. As part of our effort to make the infrastructure tools we use more enterprise-friendly, I opted to leverage NetBackup 7.1's NetBackup Access Control (NBAC) subsystem. On its own, it provides fine-grained rights-delegation and role-based access control. Hitch it to Active Directory and you're able to roll out a global backup system with centralized authentication and rights-management. That is, you have all that when things work.

For the past couple of months, we've been having issues with one of the dozen NetBackup domains we've deployed into our global enterprise. When I first began troubleshooting the NBAC issues, the authentication and/or authorization failures had always been associated with a corruption of LikeWise's sqlite cachedb files. At the time the issues first cropped up, these corruptions always seemed to coincide with DRS moving the NBU master server from one ESX host to another. It seemed like, when under sufficiently heavy load - the kind of load that would trigger a DRS event - LikeWise didn't respond well to having the system paused and moved. Probably something to do with the sudden apparent time-jump that happens when a VM is paused for the last parts of the DRS action. My "solution" to the problem was to disable automated-relocation for my VM.

This seemed to stabilize things. LikeWise was no longer getting corrupted, and it seemed like I'd been able to stabilize NBAC's authentication and authorization issues. Well, they stayed stabilized for a few weeks.

Unfortunately, the issues have begun to manifest themselves again in recent weeks. We've now had enough errors that some patterns are starting to emerge. Basically, it looks like something is horribly bogging the system down around the time that the nbazd crashes are happening. I'd located all the instances of nbazd crashing from its log files ("ACI" events are logged to the /usr/openv/netbackup/logs/nbazd logfiles), and then began to try correlating them with system load shown by the host's sadc collections. I found two things: 1) I probably need to increase my sample frequency - it's currently at the default 10-minute interval - if I want to more thoroughly pin down and/or profile the events; 2) when the crashes have happened within a minute or two of an sadc poll, I've found that the corresponding poll was either delayed by a few seconds to a couple minutes or was completely missing. So, something is causing the server to grind to a standstill, and nbazd is a casualty of it.
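The correlation I was doing by hand is easy enough to script. A rough sketch, assuming stock sysstat log locations (/var/log/sa/saDD) and the timestamp format shown in the log excerpts below:

```shell
#!/bin/bash
# For each fatal-error timestamp in an nbazd log, print the sar samples
# from the surrounding hour so delayed or missing sadc polls stand out.
# The default log path and the timestamp parsing are assumptions based
# on my environment -- adjust to match your own logs.
crash_windows() {
   local AZLOG="${1:-/usr/openv/netbackup/logs/nbazd/vxazd.log}"
   grep 'V-18-3078' "${AZLOG}" | while read -r DATE TIME MERIDIAN REST
   do
      echo "== nbazd fatal at ${DATE} ${TIME} ${MERIDIAN} =="
      # sa files are named for the day-of-month in the MM/DD/YYYY field
      local DOM="${DATE#*/}" ; DOM="${DOM%%/*}"
      sar -q -f "/var/log/sa/sa${DOM}" 2>/dev/null | grep "^${TIME%:*}" || \
         echo "   (no sadc sample near ${TIME%:*} -- poll missing/delayed?)"
   done
}
```

Bumping the sadc interval down (via the sysstat cron entry) before running this makes the gaps much easier to see.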

For the sake of thoroughness (and what's likely to have matched on a Google-search and brought you here), what I've found in our logs are messages similar to the following:
/usr/openv/netbackup/logs/nbazd/vxazd.log
07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3004 Error encountered during ACI repository operation.
07/28/2012 05:11:48 AM VxSS-vxazd ERROR V-18-3078 Fatal error encountered. (txn.c:964)
07/28/2012 05:11:48 AM VxSS-vxazd LOG V-18-4204 Server is stopped.
07/30/2012 01:13:31 PM VxSS-vxazd LOG V-18-4201 Server is starting.
/usr/openv/netbackup/logs/nbazd/debug/vxazd_debug.log
07/28/2012 05:11:48 AM Unable to set transaction mode. error = (-1)
07/28/2012 05:11:48 AM SQL error S1000 -- [Sybase][ODBC Driver][SQL Anywhere] Connection was terminated
07/28/2012 05:11:48 AM Database fatal error in transaction, error (-1)
07/30/2012 01:13:31 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
07/30/2012 01:22:40 PM _authd_config.c(205) Conf file path: /usr/openv/netbackup/sec/az/bin/VRTSaz.conf
Our NBU master servers are hosted on virtual machines. It's a supported configuration and adds a lot of flexibility and resiliency to the overall enterprise-design. It also means that I have some additional metrics available to me to check. Unfortunately, when I checked those metrics, while I saw utilization spikes on the VM, those spikes corresponded to healthy operations of the VM. There weren't any major spikes (or troughs) during the grind-downs. So, to ESX, the VM appeared to be healthy.

At any rate, I've asked our ESX folks to see if there might be anything going on on the physical systems hosting my VM that isn't showing up in my VM's individual statistics. I'd previously had to disable automated DRS actions to keep LikeWise from eating itself - those DRS actions wouldn't have been happening had the hosting ESX system not been experiencing loading issues - perhaps whatever was causing those DRS actions is still afflicting this VM.

I've also tagged one of our senior NBU operators to start picking through NBU's logs. I've asked him to look to see if there are any jobs (or combinations of jobs) that are always running during the bog-downs. If it's a scheduling issue (i.e., we're to blame for our problems), we can always reschedule jobs to exert less loading or we can scale up the VM's memory and/or CPU reservations to accommodate such problem jobs.

For now, it's a waiting-game. At least there's an investigation path, now. It's all in finding the patterns.

Wednesday, September 7, 2011

NetBackup with Active Directory Authentication on UNIX Systems

While the specific hosts that I used for this exercise were all RedHat-based, it should work for any UNIX platform that both NetBackup 6.0/6.5/7.0/7.1 and Likewise Open are installed onto.

I'm a big fan of leveraging centralized-authentication services wherever possible. It makes life in a multi-host environment - particularly where hosts can number from the dozens to the thousands - a lot easier when you only have to remember one or two passwords. It's even more valuable in modern security environments where policies require frequent password changes (or, if you've ever been through the whole "we've had a security incident, all the passwords on all of the systems and applications need to be changed, immediately" exercise). Over the years, I've used things like NIS, NIS+, LDAP, Kerberos and Active Directory to do my centralized authentication. If your primary platforms are UNIX-based, NIS, NIS+, LDAP and Kerberos have traditionally been relatively straight-forward to set up and use.

I use the caveat of "relatively" because, particularly in the early implementations of each service, things weren't always dead-simple. Right now, we seem to be mid-way through the "easiness" life-cycle of using Active Directory as a centralized authentication source for UNIX operating systems and UNIX-hosted applications. Linux and OSX seem to be leading the charge in the OS space for ease of integration via native tools. There's also a number of third-party vendors out there who provide commercial and free solutions to do it for you, as well. In our enterprise, we chose LikeWise, because, at the time, it was the only free option that also worked reasonably-well with our large and complex Active Directory implementation. Unfortunately, not all of the makers of software that runs on the UNIX hosts seem to have been keeping up on the whole "AD-integration within UNIX operating environment" front.

My latest pain in the ass, in this arena, is Veritas NetBackup. While Symantec likes to tout the value of NetBackup Access Control (NBAC) in a multi-administrator - particularly one where different administrators may have radically different NetBackup skill sets or other differentiating factors - using it in a mixed-platform environment is kind of sucktackular to set up. While modern UNIX systems have the PAM framework to make writing an application's authentication framework relatively trivial, Symantec seems to still be stuck in the pre-PAM era. NBAC's group lookup components appear to still rely on direct consultation of a server's locally-maintained group files rather than just doing a call to the host OS's authentication frameworks.

When I discovered this problem, I opened a support case with Symantec. Unfortunately, their response was "set up a Windows-based authentication broker". My NetBackup environment is almost entirely RedHat-based (actually, unless/until we implement Bare Metal Restore (BMR) or other backup modules that require specific OSes be added into the mix, it is entirely RedHat-based). The idea of having to build a Windows server just to act as an authentication broker struck me as a rather stupid way to go about things. It adds yet another server to my environment and, unless I cluster that server, it introduces a single point of failure into an otherwise fairly resilient NetBackup design. I'd designed my NetBackup environment with a virtualized master server (with DRS and SRM supporting it) and multiple media servers for both throughput and redundancy.

We already use LikeWise Open to provide AD-based user and group management services for our Linux and Solaris hosts. When I was first running NetBackup through my engineering process, using the old Java auth.conf method for login management worked like a champ. The Java auth.conf-based system just assumes that any users trying to access the Java UI are users that are managed through /etc/passwd. All you have to do is add the requisite user/rights entries into the auth.conf file, and Java treats AD-provided users the same as it treats locally-managed users. Because of this, I suspected that I could work around Symantec's authorization coding lameness.

After a bit of playing around with NBAC, I discovered that, so long as the UNIX group I wanted to map rights to existed in /etc/group, NBAC would see it as a valid, mappable "UNIX PWD" group. I tested by seeing if it would at least let me map the UNIX "wheel" group to one of the NBAC privilege groups. Whereas, even if I could look up a group via getent, if it didn't exist in /etc/group, NBAC would tell me it was an invalid group. Having verified that a group's presence in /etc/group allowed NBAC to use it, I proceeded to use getent to copy my NetBackup-related groups out of Active Directory and into my /etc/group file (all you have to do is a quick `getent group [GROUPNAME] >> /etc/group` and you've populated your /etc/group file).

Unfortunately, I didn't quite have the full groups picture. When I logged in using my AD credentials, I didn't have any of the expected mapped-privileges. I remembered that I'd explicitly emptied the userids from the group entries I'd added to my /etc/group file (I'd actually sed'ed the getent output to do it ...can't remember why, at this point - probably just anticipating the issue of not including userids in the /etc/group file entries). So, I logged out of the Java UI and reran my getent's - this time leaving the userids in place. I logged back into the Java UI, and this time I had my mapped privileges. Eureka.

Still, I wasn't quite done. I knew that, if I was going to roll this solution into production, I'd have to cron-out a job to keep the /etc/group file up to date with the changing AD group memberships. I noticed, while nuking my group entry out of /etc/group, that only my userid was on the group line and not every member of the group. Ultimately, I tracked it down to LikeWise not doing full group enumeration by default. So, I was going to have to force LikeWise to enumerate the group's membership before running my getent's.

I proceeded to dig around in /opt/likewise/bin for likely candidates for forcing the enumeration. After trying several lw*group* commands, I found that doing a `lw-find-group-by-name [ADGROUP]` did the trick. Once that was run, my getent's produced fully-populated entries in my /etc/group file. I was then able to map rights to various AD groups and wrote a cron script to take care of keeping my /etc/group file in sync with Active Directory.
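The sync script itself is short. A hedged reconstruction of the approach (the group names are illustrative, not my real ones; `lw-find-group-by-name` is the LikeWise utility mentioned above):

```shell
#!/bin/bash
# Keep /etc/group entries for NBAC-mapped AD groups current.
sync_ad_group() {
   # $1 = AD group name, $2 = group file (defaults to /etc/group)
   local GROUP="$1" GROUPFILE="${2:-/etc/group}"
   # Force LikeWise to fully enumerate the group's membership first;
   # without this, getent only returns the already-cached members.
   /opt/likewise/bin/lw-find-group-by-name "${GROUP}" > /dev/null 2>&1
   local ENTRY
   ENTRY=$(getent group "${GROUP}") || return 1
   # Replace any stale entry for this group with the fresh one
   sed -i "/^${GROUP}:/d" "${GROUPFILE}"
   echo "${ENTRY}" >> "${GROUPFILE}"
}

# Cron wrapper: the group names here are made up for illustration.
for G in nbu-admins nbu-operators
do
   sync_ad_group "${G}" || true
done
```

Dropped into cron at a suitable interval, this keeps the /etc/group copies of the AD groups from drifting out of sync with the directory.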

In other words, I was able to get NBAC to work with Active Directory in an all RedHat environment and no need to set up a Windows server just to be an authentication broker. Overall, I was able to create a much lighter-weight, portable solution.