Saturday, November 2, 2013

Best practices for change management in the data center

Change management can be a complex road littered with potholes. Learn some navigation tips to apply in your organization.

The irony of working in system or network administration is that you’re there to maintain the status quo (or as I like to phrase it, “preserving order in a chaotic world”), and yet careful change management is also your job.  Effective delivery of services and resources demands that you maintain the best possible uptime while transitioning from old to new, whether replacing technology or simply improving upon it.
 
 
Change management (a close cousin of configuration management) isn’t always safe or easy. On the other hand, if we only did what was safe in IT, we might all still be running Windows NT 4 SP6a. The rollout of new systems and technologies seems to come faster and more furiously all the time. I’ve seen systems implemented one year and torn out the next to pave the way for something better. The fiscal conservative in me is sometimes appalled at the potential waste involved; the technologist in me revels in deploying new things.

Over the years I’ve picked up a few guidelines for change management that I want to share. Some came from direct experience, others from mentors, and a few more from observing worst-case scenarios in action at the companies of friends or colleagues.

When I refer to change management, I mean technology installations, upgrades, patching, and migrations (such as moving a physical server to a virtual machine). Note that there are formal Change Management processes, such as those defined by the Information Technology Infrastructure Library (ITIL), as well as dedicated software packages such as Evolven and McCabe CM that support these efforts. While some of that material overlaps with this article (and may be the subject of future columns), my commentary here is a more casual set of tips based on good practices I’ve observed at successful companies.

You can never have too much redundancy

Most IT professionals won’t need to be sold on this (the challenge may lie in convincing finance departments), but anything mission-critical needs a twin. This applies to servers, network hardware, and even storage. If you need it to run your business, make sure there are two of everything. If you can’t have two, figure out how you can cobble together a replacement system if the primary one becomes unavailable. For instance, a few years back I set up a Windows file server with all shared data hosted on a SAN volume. We didn’t have the budget for a formal clustering or load-balancing solution, so I developed a failover plan with a backup server:
  • I analyzed and tested the method for mounting the server SAN volume on the backup server.
  • I exported the file share configuration from the primary server registry on a nightly basis, saving it on the C: drive of the backup server.
  • I set the primary server’s DNS record time-to-live (TTL) to 5 minutes.
  • I disabled strict name checking in the backup server registry so clients could connect to it via any DNS name I wished (by default the Windows server OS prevents this).
  • I documented the entire failover procedure.
This meant the backup server could “become” the primary server very easily just by updating the associated DNS record and users could be redirected to it in short order (many wouldn’t even notice the interruption). This included drive mappings and file share access. Documenting it meant any of my coworkers could follow the steps, too.
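
If you’re curious what the nightly export piece of a plan like this can look like, here is a minimal sketch in Python. The server name, destination path, and scheduling are assumptions on my part; the registry key shown is where Windows keeps its file share definitions, and the strict-name-checking tweak noted in the comments lives under the LanmanServer Parameters key.

```python
# Nightly export of the primary file server's share definitions, copied to the
# backup server -- a sketch of the scheme described above, meant to run from
# Task Scheduler on the primary. Server and path names are placeholders.
import datetime
import subprocess

# Windows stores SMB share definitions (names, paths, permissions) here:
SHARES_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Shares"

# Hypothetical backup server; the export lands on its C: drive.
DEST_TEMPLATE = r"\\BACKUPSRV\c$\failover\shares-{date}.reg"

def export_share_config():
    dest = DEST_TEMPLATE.format(date=datetime.date.today().isoformat())
    # reg.exe writes the key to a .reg file that can be re-imported on the
    # backup server during a failover.
    subprocess.run(["reg", "export", SHARES_KEY, dest, "/y"], check=True)

if __name__ == "__main__":
    export_share_config()

# Note: the strict-name-checking tweak mentioned above is a separate, one-time
# change on the backup server (the DisableStrictNameChecking value under
# HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters).
```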

When it comes to redundant components, make them identical in every possible way so that supporting them is as predictable as possible – they should be the same manufacturer and model, run the same operating system, have the same drivers and hotfixes, and be plugged into the same ports on different switches or PDUs, and so forth.
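
One way to sanity-check that a pair really is identical is to compare their installed hotfix lists. Here’s a small sketch, assuming you’ve already dumped each server’s list to a text file (for example with “wmic qfe get HotFixID > hostname-hotfixes.txt” run on each box); the file names are placeholders.

```python
# Compare installed-hotfix dumps from two "twin" servers and report any drift.
# Assumes each file was produced by redirecting wmic output, which is
# typically UTF-16 encoded.
def read_hotfixes(path):
    with open(path, encoding="utf-16") as f:
        return {line.strip() for line in f if line.strip().upper().startswith("KB")}

primary = read_hotfixes("fileserver-a-hotfixes.txt")   # hypothetical file names
backup = read_hotfixes("fileserver-b-hotfixes.txt")

print("Hotfixes only on the primary:", sorted(primary - backup) or "none")
print("Hotfixes only on the backup: ", sorted(backup - primary) or "none")
```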
There is another critical tip involving redundancy…

Space out changes between redundant systems

Your redundancy will give you tremendous leverage when it comes to applying changes since you can take half of a redundant pair down to move or upgrade it, then do the same for the other half.  However, never do this without leaving a gap of time in between to make sure the first change was successful. When patching servers, for instance, don’t patch the second set of systems until several days have passed to give you some time to spot and correct any issues… during which you’ll need to rely on the systems which are still functional.

Use a centralized solution to deploy updates

For quality change management you should always opt for the least amount of complexity, which means a centralized in-house system for pushing out patches, software, antivirus updates, and configuration settings. This gives you the best opportunity to track your systems and plan your changes, as well as to report on the results. Examples include Microsoft’s Windows Server Update Services, Microsoft’s System Center Configuration Manager, Microsoft Group Policy (part of Active Directory), Symantec Endpoint Protection Manager, and Dell Management Console. These products give you a single point of reference and ensure your clients and servers aren’t just downloading updates willy-nilly from the internet (or worse, failing to do so and not informing you).
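
As a quick illustration of the “know where your updates come from” point, here is a small sketch that checks whether a Windows machine is actually pointed at your central update server. The WSUS URL is a placeholder; the policy key it reads is the standard Windows Update policy location.

```python
# Check whether this Windows machine is configured to pull updates from the
# in-house WSUS server instead of downloading them on its own. Run locally on
# the machine being checked.
import winreg

EXPECTED_WSUS = "http://wsus.example.local:8530"  # placeholder for your WSUS URL
WU_POLICY_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"

def wsus_target():
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WU_POLICY_KEY) as key:
            value, _ = winreg.QueryValueEx(key, "WUServer")
            return value
    except FileNotFoundError:
        return None  # no policy in place -- the client is on its own

if __name__ == "__main__":
    target = wsus_target()
    if target == EXPECTED_WSUS:
        print("Update source: central server (as expected)")
    else:
        print(f"WARNING: update source is {target!r}, not {EXPECTED_WSUS}")
```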

Never use a wrecking ball

I’ve watched a lot of horror movies in my day but none of them were as scary as the concept of tearing out an existing system to replace it with a new one.  Whether a file server, email server, storage device, or something else, you should always migrate to a new system leaving the old one intact until you’ve pronounced the change complete. Don’t decommission anything until it’s obsolete.

For instance, if you’re moving a Windows Server 2008 file server to a Windows Server 2012 system, copy all the data (with permissions!) from the old box to the new one and have users test their access. On one occasion during this endeavor I found issues with the network driver on the new server, which forced me to cut users back to the old system. I didn’t mind that step, since I felt fortunate to still have the old system available for use!
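
For the copy itself, robocopy is one common tool that can carry NTFS permissions along with the files. Here’s a sketch, with placeholder server, share, and log names.

```python
# Mirror a share from the old file server to the new one, permissions included.
# Server, share, and log paths are placeholders.
import subprocess

SOURCE = r"\\OLDFS\data"   # old Windows Server 2008 box (placeholder name)
DEST = r"\\NEWFS\data"     # new Windows Server 2012 box (placeholder name)

result = subprocess.run(
    [
        "robocopy", SOURCE, DEST,
        "/MIR",              # mirror the directory tree
        "/COPYALL",          # data, attributes, timestamps, ACLs, owner, auditing info
        "/R:2", "/W:5",      # limited retries so locked files don't stall the run
        r"/LOG:C:\logs\file-migration.log",
    ],
    check=False,             # robocopy exit codes below 8 indicate success
)
print("robocopy exit code:", result.returncode)
```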

I grew up in the 1970s and greatly enjoyed the show “The Dukes of Hazzard.” I especially liked the scenes where the good old Duke boys jumped a river or canyon in the General Lee – since the police were usually chasing them, they generally had no choice but to try to make the jump. I like real life to be less exciting than TV. Climbing through the window of the General Lee is no way to start a change project in the data center.

Devise change plans with multiple inputs

Just like you can never have enough redundancy, you can never have enough steps in your change plan and, like any good party, the more participants you have the better your chances will be.  

Get as much input from others as you can to spot any looming pitfalls. However, make your initial plan as thorough as you can so others don’t have to fill in the gaps for you.

So, you’re upgrading the firmware on that Cisco switch, then rebooting it? How do you make sure this is successful? Well, you could ping it and then pronounce the upgrade complete if it replies… but I think that’s just scratching the surface. Log in, review error logs, and test all functions. Log in later and make sure it didn’t lock up due to a memory leak. Reboot it. Reboot it again. Connect to it from another subnet. Maybe upon review someone else will suggest testing some core apps running on a server which connects through that switch, thereby saving you from a “Gotcha!” moment. All of these are examples of what should be on your step-by-step checklist – and ideally you’ve come up with this checklist by working on a test system, though take warning: results in your test environment aren’t always guaranteed to be duplicated in production.
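
To make the “go beyond a ping” idea concrete, here is a rough sketch of a post-change verification script. The hostnames, ports, and checks are placeholders for whatever lands on your own checklist; it doesn’t replace logging in and reading the error logs yourself.

```python
# Post-change verification that goes beyond "it answers a ping": confirm the
# switch responds, that its management interface accepts connections, and that
# a core application reached through it is still available.
import socket
import subprocess

SWITCH = "core-switch-1.example.local"      # placeholder device under change
APP_SERVER = ("app01.example.local", 443)   # placeholder app behind the switch

def ping(host):
    # "-n 2" is the Windows ping syntax; use "-c 2" on Linux or macOS.
    return subprocess.run(["ping", "-n", "2", host],
                          capture_output=True).returncode == 0

def tcp_check(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

checks = {
    "switch answers ping": ping(SWITCH),
    "switch management (SSH) reachable": tcp_check(SWITCH, 22),
    "core app reachable through the switch": tcp_check(*APP_SERVER),
}
for name, ok in checks.items():
    print(f"{'PASS' if ok else 'FAIL'}: {name}")
```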

Don’t assume that because you can do something, it must be working. Have someone else log in and try. I’ve seen plenty of issues where someone with admin rights could perform a function just fine, but a regular user account didn’t work as expected until the permissions were tweaked.

One last point on this: going down your checklist multiple times on different systems will be tedious and dull, and you may be tempted to skip steps or cut corners, thinking, “Yeah, that worked twice already so why bother?”  Murphy’s Law loves that temptation: resist it.

Utilize multiple approval methods

It’s great if you can get feedback from others on what you should add to your change plan. However, smart companies make employees put their money where their mouths are: establish an approval process to obtain sign-off from these or other appropriate parties. That may include your boss, the director of a related department, or a VP representing your customer base. The approval process ensures everyone knows about, agrees upon, and supports the proposed change(s). Let’s face it: if I know I’m going to put my name on a plan that might hurt my company’s bottom line if it bombs, I’m going to make sure the process is sound.

Not only does this security blanket cover you if something goes wrong, but it will keep people informed in the event of a failure and can help groups work together to find solutions.  

Formulate a backout plan

Every single change should have a backout plan associated with it. How are you going to put things back the way they were if something fails? Will you use snapshots, such as in a virtual environment? Will you reimport crucial registry keys or apply a backup group policy to return a Windows server configuration to its previous state? You need to document this plan and make it as clear and thorough as possible. Your creativity may well be impaired during a failed change or upgrade, and researching options is the last thing you want to do during that stressful time. Your backout plan may well be an insurance policy you won’t ever need, but insurance is also there for peace of mind.
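
For a registry-based change, the backout plan can be as simple as exporting the keys you intend to touch before the work starts and re-importing that file if things go sideways. Here’s a sketch, with a placeholder key and backup path.

```python
# Snapshot a registry key before changing it; backing out is a re-import of
# the exported file. Key and file paths are placeholders.
import subprocess

KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
BACKUP_FILE = r"C:\changes\lanmanserver-before-change.reg"

def snapshot_before_change():
    subprocess.run(["reg", "export", KEY, BACKUP_FILE, "/y"], check=True)

def back_out_change():
    # Restores the key to the values captured before the change.
    subprocess.run(["reg", "import", BACKUP_FILE], check=True)

if __name__ == "__main__":
    snapshot_before_change()
```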

If you do have to back out a change, collect as many notes, screenshots, or other pieces of supporting evidence as you can so you can figure out what went wrong and correct it next time. The strategy of “trying something a second time and hoping it works” is a recipe for an unpleasant entrée.

Choose your change schedule carefully

It almost goes without saying that most if not all changes in the data center should be planned after hours or during non-critical periods. Even upgrading redundant systems can pose a risk if your secondary server decides to go on strike at 10 am Monday. However, plan your timeframe carefully.

You COULD perform that database switchover at 11 pm Sunday.  But what if something causes a delay and the switchover is still running when users arrive at the office seven hours later?  

Maybe picking 5 pm on a Friday would be a better idea. Uh, well, just be careful you don’t get so wrapped up in your home life that you forget to check the upgrade results until you arrive at work Monday morning.

Perhaps you have a secondary site for disaster recovery (DR) purposes and have made it your primary site to test your failover capabilities. Don’t scramble to upgrade the systems in your original primary site 12 hours before you’re scheduled to reverse the process.

As I said above, your schedule should be the product of input from the stakeholders and groups involved in using, supporting, and administering these systems (where applicable).

Use auditing and individual accounts

Where possible, always use auditing in your environment (even if you have to turn it on temporarily during a change project and then turn it off afterward to preserve resources; a short sketch of that approach appears at the end of this section). This will help you keep track of the commands run on these systems and their resulting impact.

On a similar note, don’t use shared or generic accounts like “administrator” if you can avoid it; changes should be tied to individual accounts (preferably privileged accounts used only for this sort of work; for everything else, use a limited account where possible). True, this isn’t always easy in an Active Directory environment, where many things still stubbornly demand the domain “administrator” account even when comparable privileges have (seemingly) been granted to a named individual. However, pursue this policy as far as you can.

I’ll admit this tip brings to mind a quote by the comedian Bill Cosby, who taught me most of what I practice in fatherhood: “If something's broken in the house, you have one child, you know who did it!” However, this isn’t about pointing fingers, but rather about documenting what happened and under which account. If a change needs to be rolled back or a problem identified you’ll need this information.
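
Here is a rough sketch of the “turn auditing on for the change window, then back off” idea using Windows’ auditpol tool; the subcategories shown are the standard object-access ones, so adjust them to match the systems you’re actually changing.

```python
# Enable object-access auditing for the duration of a change window, then
# switch it back off. Run under the individual privileged account doing the work.
import subprocess

SUBCATEGORIES = ["File System", "Registry"]  # pick the ones relevant to your change

def set_auditing(enabled):
    state = "enable" if enabled else "disable"
    for sub in SUBCATEGORIES:
        subprocess.run(
            ["auditpol", "/set", f"/subcategory:{sub}",
             f"/success:{state}", f"/failure:{state}"],
            check=True,
        )

set_auditing(True)
try:
    pass  # ... perform the change here ...
finally:
    set_auditing(False)  # turn auditing back off to preserve resources
```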

Always schedule downtime in your monitoring system

I’m going to go out on a limb and assume that you have a comprehensive monitoring environment set up to check the health and uptime of your critical systems and notify you of any issues. When you’re planning to take any of these systems offline for change management purposes, always schedule a reasonable downtime window in your monitoring system beforehand so it will remain silent. It can be a pain to take this step, especially for multiple systems, but the alternative of letting critical alerts fire and ignoring them is too dangerous to pursue.

If you’re in the middle of an upgrade, you don’t really know what’s going on beyond the immediate task at hand, and you might find yourself fooled by circumstances. For instance, if you receive a page telling you that your Cisco IronPort is unresponsive, you might think: “Yeah, I know it’s unresponsive since I’m upgrading it!” What if you later find out that the page was for the OTHER IronPort, the one supposedly in good working condition, which had been dead for thirty minutes?
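
If your monitoring system has a scriptable interface, the downtime scheduling itself can be automated as part of the change plan. Here’s a sketch using the Nagios external-command file as an example; the command-file path, host name, and window length are all assumptions to adapt to your own environment.

```python
# Schedule a fixed downtime window for the host being changed so its alerts
# stay quiet -- while its twin keeps being monitored as usual.
import time

COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # typical default; verify yours
HOST = "ironport-a"                                    # the unit actually under change
WINDOW_SECONDS = 2 * 60 * 60                           # two-hour change window

def schedule_downtime(host, start, duration, author="jdoe", comment="planned upgrade"):
    end = start + duration
    # Format: SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
    line = (f"[{int(time.time())}] SCHEDULE_HOST_DOWNTIME;{host};{int(start)};"
            f"{int(end)};1;0;{int(duration)};{author};{comment}\n")
    with open(COMMAND_FILE, "a") as cmd:
        cmd.write(line)

schedule_downtime(HOST, time.time(), WINDOW_SECONDS)
```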

Bringing it all together

There is often an inordinate amount of pressure (both external and internal) on IT personnel to finish one task and rush to the next so they can continue to demonstrate value to the organization. However, that pressure runs counter to the core mission of IT: keeping things running with a minimal amount of downtime and disruption.

Many good change management techniques boil down to common sense, being conservative, and playing it safe. Hopefully these guidelines will help make change in your environment as predictable and controlled as possible, so you can embrace the possibilities rather than fear them.
