Change management can be a complex road littered with potholes. Learn some navigation tips to apply in your organization.
The
irony of working in system or network administration is that you’re
there to maintain the status quo (or as I like to phrase it, “preserving
order in a chaotic world”), and yet careful change management is also
your job. Effective delivery of services and resources demands that you
maintain the best possible uptime while transitioning from old to new,
whether replacing technology or simply improving upon it.
Change
management (closely related to configuration management) isn’t always safe
or easy. On the other hand, if we only did what was safe in IT we might
all still be running Windows NT 4 SP6a. Rollout of new systems and
technologies seems to be coming faster and even more furious all the
time. I’ve seen systems implemented one year and then torn out the next
to pave the way for something better. The fiscal conservative in me is
sometimes appalled at the possible waste involved with this; the
technologist side of my nature revels in the deployment of new things.
Over
the years I’ve picked up a few guidelines for change management which I
wanted to share. Some came from direct experience, others from
mentors, and a few more from observing worst-case scenarios in action at
the companies of friends or colleagues.
When I refer to change
management, I’m referring to technological installations, upgrades,
patching, and migrations (such as a physical server to a virtual
machine). Note that there are formal Change Management processes such as those defined in the Information Technology Infrastructure Library (ITIL). There are also dedicated software packages such as Evolven and McCabe CM that
support these endeavors. While some of this material overlaps with this
article (and may be the subject of future columns), my commentary here
entails a more casual set of tips based on good practices I’ve observed
throughout successful companies.
You can never have too much redundancy
Most
IT professionals won’t need to be sold on this (the challenge may lie
in convincing finance departments), but anything mission-critical needs a
twin. This applies to servers, network hardware, and even storage. If
you need it to run your business, make sure there are two of everything.
If you can’t have two, figure out how you can cobble together a
replacement system if the primary one becomes unavailable. For instance,
a few years back I set up a Windows file server with all shared data
hosted on a SAN volume. We didn’t have the budget for an official
clustering or load balancing solution, so I developed a failover plan
with a backup server:
- I analyzed and tested the method for mounting the server SAN volume on the backup server.
- I exported the file share configuration from the primary server registry on a nightly basis, saving it on the C: drive of the backup server.
- I set the primary server DNS record time-to-live (TTL) to 5 minutes.
- I disabled strict name checking in the backup server registry so clients could connect to it via any DNS name I wished (by default the Windows server OS prevents this).
- I documented the entire failover procedure.
This
meant the backup server could “become” the primary server very easily
just by updating the associated DNS record and users could be redirected
to it in short order (many wouldn’t even notice the interruption). This
included drive mappings and file share access. Documenting it meant any
of my coworkers could follow the steps, too.
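To put a rough number on why the short TTL matters, here is a minimal sketch that estimates how long a client might keep using the old address after the DNS record is changed. The function name, the two-minute figure for editing the record, and the one-hour comparison TTL are assumptions for illustration, not measurements from the setup described above. It also ignores any client or application cache that doesn't honor the TTL.

```python
from datetime import timedelta


def worst_case_redirect(ttl: timedelta, dns_update_time: timedelta) -> timedelta:
    """Upper bound on how long a client may keep using the old address.

    A client that cached the record just before the update keeps it for a
    full TTL, so the bound is the time to make the change plus one TTL.
    """
    return dns_update_time + ttl


# Assumed figures: 5-minute TTL (from the list above), 2 minutes to edit the record.
short_ttl = worst_case_redirect(timedelta(minutes=5), timedelta(minutes=2))
# A one-hour TTL for comparison.
long_ttl = worst_case_redirect(timedelta(hours=1), timedelta(minutes=2))

print(f"5-minute TTL: up to {short_ttl}")
print(f"1-hour TTL:   up to {long_ttl}")
```

The point of the comparison is simply that the TTL dominates the wait, which is why it is worth lowering before you ever need the failover.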
When it comes to
redundant components, make them identical in every possible way to make
supporting them as predictable as you can – they should be the same
manufacturer/model, run the same operating system, have the same drivers
and hotfixes, be plugged into the same ports on different switches or
PDUs, and so forth.
There is another critical tip involving redundancy…
Space out changes between redundant systems
Your
redundancy will give you tremendous leverage when it comes to applying
changes since you can take half of a redundant pair down to move or
upgrade it, then do the same for the other half. However, never do this
without leaving a gap of time in between to make sure the first change
was successful. When patching servers, for instance, don’t patch the
second set of systems until several days have passed to give you some
time to spot and correct any issues… during which you’ll need to rely on
the systems which are still functional.
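One way to picture the gap is as two patch waves built from your redundant pairs. The sketch below is only an illustration: the host names, the start date, and the `plan_waves` helper are hypothetical, and the right soak period depends on your environment.

```python
from datetime import date, timedelta


def plan_waves(pairs, start, soak_days=3):
    """Return (date, hosts) waves: first half of every pair, then the second.

    The second wave is scheduled soak_days after the first so problems
    have time to surface while the untouched half carries the load.
    """
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    return [(start, first), (start + timedelta(days=soak_days), second)]


# Hypothetical host names and dates.
pairs = [("file01", "file02"), ("mail01", "mail02")]
waves = plan_waves(pairs, date(2024, 3, 2), soak_days=4)
for when, hosts in waves:
    print(when.isoformat(), ", ".join(hosts))
```

Keeping both members of a pair out of the same wave is the whole design: at no point are both halves in an unproven state.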
Use a centralized solution to deploy updates
For
quality change management you should always opt for the least amount of
complexity, which means a centralized in-house system for pushing out
patches, software, antivirus updates and configuration settings. This
will give you the best opportunity to track your systems and plan
out your changes, as well as to report on the results. Examples include
Microsoft’s Windows Server Update Services, Microsoft’s System Center Configuration Manager, Microsoft Group Policy (part of Active Directory), Symantec Endpoint Protection Manager and Dell Management Console.
These products will give you a single point of reference and ensure
your clients and servers aren’t just downloading updates willy-nilly
from the internet (or worse, failing to do so and not informing you).
Never use a wrecking ball
I’ve
watched a lot of horror movies in my day but none of them were as scary
as the concept of tearing out an existing system to replace it with a
new one. Whether a file server, email server, storage device, or
something else, you should always migrate to a new system leaving the
old one intact until you’ve pronounced the change complete. Don’t
decommission anything until it’s obsolete.
For instance, if
updating a Windows Server 2008 file server to a Windows Server 2012 system, copy all
the data (with permissions!) from the old box to the new and have users
test the access. On one occasion during this endeavor I found some
issues with the network driver on the new server which forced me to cut
users back to the old system. I didn’t mind this step since I felt
fortunate to have the old system available for use!
I grew
up in the 1970s and greatly enjoyed the show “The Dukes of Hazzard.” I
especially liked the scenes where the good old Duke boys jumped a river
or canyon in the General Lee – since the police were usually chasing
them they generally had no choice but to try to make that jump. I like
real life to be less exciting than TV. Climbing through the window of
the General Lee is no way to start a change project in the data center.
Devise change plans with input from multiple people
Just
like you can never have enough redundancy, you can never have enough
steps in your change plan and, like any good party, the more
participants you have the better your chances will be.
Get
as much input from others as you can to spot any looming pitfalls.
However, make your initial plan as thorough as you can so others don’t
have to fill in the gaps for you. So, you’re upgrading the firmware on
that Cisco switch, then rebooting it? How do you make sure this is
successful? Well, you could ping it and then pronounce the upgrade
complete if it replies… but I think that’s just scratching the surface.
Log in, review error logs, and test all functions. Log in later and make
sure it didn’t lock up due to a memory leak. Reboot it. Reboot it
again. Connect to it from another subnet. Maybe upon review someone else
will suggest testing some core apps running on a server which connects
through that switch, thereby saving you from a “Gotcha!” moment. All of
these are examples of what should be on your step-by-step checklist –
and ideally you’ve come up with this checklist by working on a test
system, though take warning: results in your test environment aren’t
always guaranteed to be duplicated in production.
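A step-by-step checklist like this can be captured in a small script so that every step runs every time and the results are recorded. What follows is a minimal sketch under stated assumptions: the check names mirror the switch example above, but each lambda is a placeholder you would replace with a real test, and `run_checklist` and `change_complete` are names invented for this example.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_checklist(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check in order and record pass/fail; never stop early or skip."""
    results = []
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a check that crashes counts as a failure
        results.append((name, passed))
    return results


def change_complete(results) -> bool:
    """The change is complete only if there were checks and all of them passed."""
    return bool(results) and all(passed for _, passed in results)


# Placeholder checks: replace each lambda with a real test for your device.
checks = [
    ("responds to ping", lambda: True),
    ("login succeeds", lambda: True),
    ("error log is clean", lambda: False),
    ("reachable from another subnet", lambda: True),
]
results = run_checklist(checks)
for name, passed in results:
    print(f"[{'PASS' if passed else 'FAIL'}] {name}")
print("Change complete:", change_complete(results))
```

Note that a successful ping alone would not mark this change complete; the failed log review keeps it open.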
Don’t
assume that because you can do something, it must be working. Have
someone else log in and try. I’ve seen plenty of issues whereby someone
with admin rights could perform a function just fine but regular user
rights didn’t work as expected, at least until they were tweaked.
One
last point on this: going down your checklist multiple times on
different systems will be tedious and dull, and you may be tempted to
skip steps or cut corners, thinking, “Yeah, that worked twice already so
why bother?” Murphy’s Law loves that temptation: resist it.
Utilize multiple approval methods
It’s
great if you can get feedback from others on what you should add to
your change plan. However, smart companies make employees put their
money where their mouths are: enact an approval process to obtain
sanction from these or other appropriate parties. This may include your
boss, the director of a related department, or the VP of your customer
base. This approval process will ensure everyone knows about, agrees
upon, and supports the proposed change(s). Let’s face it: if I know I’m
going to put my name on a plan which might impact my company’s bottom
line if it bombs, I’m going to make sure the process is sound.
Not
only does this security blanket cover you if something goes wrong, but
it will keep people informed in the event of a failure and can help
groups work together to find solutions.
Formulate a backout plan
Every
single change should have a backout plan associated with it. How are
you going to put things back to the way they were if something fails?
Will you use snapshots, such as in a virtual environment? Will you
reimport crucial registry keys or apply a backup group policy to return a
Windows server configuration to its previous state? You need to
document this plan and make it as clear and detailed as possible. Your
creativity may well be impaired during a failed change/upgrade and
researching options is the last thing you want to do during that
stressful time. Your backout plan may well be an insurance policy you
won’t ever need, but insurance is also there for peace of mind.
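The snapshot idea can be sketched in a few lines: capture the state before the change and restore it automatically if anything fails. This is an illustration only. The dictionary stands in for whatever you would really snapshot (registry keys, a policy backup, a VM), and `with_backout` is a hypothetical helper, not part of any product mentioned here.

```python
import copy
from contextlib import contextmanager


@contextmanager
def with_backout(config: dict):
    """Snapshot config before a change and restore it if the change raises."""
    snapshot = copy.deepcopy(config)
    try:
        yield config
    except Exception:
        config.clear()
        config.update(snapshot)  # put things back the way they were
        raise


# Hypothetical settings standing in for registry keys or policy values.
server = {"smb_signing": True, "shares": ["finance", "hr"]}

try:
    with with_backout(server) as cfg:
        cfg["shares"].append("engineering")
        cfg["smb_signing"] = False
        raise RuntimeError("validation failed after change")
except RuntimeError as err:
    print("Backed out:", err)

print(server)
```

The error is re-raised after the restore so the failure still gets noticed and investigated rather than silently swallowed.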
If
you do have to back out a change, first gather as many notes,
screenshots, and other supporting evidence as you can so you
can figure out what went wrong and correct it for next time. The
strategy of “trying something a second time and hoping it works” is a
recipe for an unpleasant entrée.
Choose your change schedule carefully
It
almost goes without saying that most if not all changes in the data
center should be planned after hours or during non-critical periods.
Even upgrading redundant systems can pose a risk if your secondary
server decides to go on strike at 10 am Monday. However, plan your
timeframe carefully.
You COULD perform that database
switchover at 11 pm Sunday. But what if something causes a delay and
the switchover is still running when users arrive at the office seven
hours later?
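A quick calculation makes the risk concrete: work backward from when users arrive, subtracting the expected duration, a margin for overruns, and the time a full backout would take. The durations and dates below are assumed purely for illustration, and `latest_safe_start` is a name made up for this sketch.

```python
from datetime import datetime, timedelta


def latest_safe_start(users_arrive, expected, overrun_margin, backout):
    """Latest start time that still leaves room for an overrun and a full backout."""
    return users_arrive - (expected + overrun_margin + backout)


# Assumed durations; 6 am Monday is seven hours after 11 pm Sunday.
users_arrive = datetime(2024, 3, 4, 6, 0)    # a Monday
planned_start = datetime(2024, 3, 3, 23, 0)  # the Sunday before
deadline = latest_safe_start(
    users_arrive,
    expected=timedelta(hours=4),
    overrun_margin=timedelta(hours=2),
    backout=timedelta(hours=3),
)
print("Latest safe start:", deadline)
print("11 pm start is safe:", planned_start <= deadline)
```

With these assumed numbers the 11 pm start is too late, even though the expected four hours would fit comfortably on paper.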
Maybe picking 5 pm on a Friday would be a
better idea. Uh, well, just be careful you don’t find yourself wrapped
up in your home life such that you forget to check the upgrade results
until you arrive at work Monday morning.
Perhaps you have
a secondary site you use for disaster recovery (DR) purposes and you’ve
made it your primary site to test your failover capabilities? Don’t
scramble to upgrade the systems in your original primary site 12 hours
before you’re scheduled to reverse the process.
As I said
above, your schedule should be the product of input from the stakeholders and
groups involved with using, supporting, and administering these systems
(where applicable).
Use auditing and individual accounts
Where
possible, always use auditing in your environment (even if you have to
turn it on temporarily during a change project then turn it off to
preserve resources). This will help keep track of commands run on these
systems and the resulting impact.
On a similar note,
don’t use shared or generic accounts like “administrator” if you can
avoid it; these commands should be linked to individual accounts
(preferably privileged accounts used only to perform this sort of work;
you should normally use a limited account where possible). True, this
isn’t always easy in an Active Directory environment, where many things
still stubbornly demand use of the domain “administrator” account even
when comparable privileges have (seemingly) been granted to a named
individual. However, pursue this policy as far as you can.
I’ll
admit this tip brings to mind a quote by the comedian Bill Cosby, who
taught me most of what I practice in fatherhood: “If something's broken
in the house, you have one child, you know who did it!” However, this
isn’t about pointing fingers, but rather about documenting what happened
and under which account. If a change needs to be rolled back or a
problem identified you’ll need this information.
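As a rough illustration of what "documenting what happened and under which account" can look like, here is a minimal sketch of an audit record that refuses shared accounts. The account names, the list of shared accounts, and the `audit_entry` helper are all assumptions for the example. A real environment would rely on the operating system's own auditing rather than a script like this.

```python
from datetime import datetime, timezone

# Assumed list of shared accounts; adjust to your site.
SHARED_ACCOUNTS = {"administrator", "admin", "root"}


def audit_entry(account: str, command: str, log: list) -> dict:
    """Append who ran what, and when, to the log; refuse shared accounts."""
    if account.lower() in SHARED_ACCOUNTS:
        raise PermissionError(f"{account!r} is a shared account; use a named one")
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "account": account,
        "command": command,
    }
    log.append(entry)
    return entry


log = []
audit_entry("jsmith-adm", "restart dns service", log)  # hypothetical account name
try:
    audit_entry("Administrator", "edit group policy", log)
except PermissionError as err:
    print("Rejected:", err)

for e in log:
    print(e["time"], e["account"], e["command"])
```

Rejecting shared accounts outright is a strict design choice. Where a shared account truly can't be avoided, you might instead record who was using it at the time.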
Always schedule downtime in your monitoring system
I’m
going to go out on a limb and assume that you have a comprehensive
monitoring environment set up to check the health and uptime of your
critical systems and notify you of any issues. When you’re planning to
take any of these systems offline for change management purposes you
should always schedule a reasonable downtime period in your monitoring
system beforehand so it will remain silent. It can be a pain to take
this step, especially for multiple systems, but the strategy of ignoring
critical alerts is too dangerous to pursue.
If you’re in
the middle of an upgrade you don’t really know what’s going on other
than the immediate task at hand and you might find yourself fooled by
circumstances. For instance, if you receive a page telling you that
your Cisco Ironport is unresponsive, you might think: “Yeah, I know it’s
unresponsive since I’m upgrading it!” What if you later find out that
page was for the OTHER Ironport supposedly in good working condition
which has been dead for thirty minutes?
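The logic a monitoring system applies here is simple, and seeing it spelled out shows why scheduling downtime beats ignoring pages. The sketch below is a simplified model with hypothetical host names and times. Real monitoring products have their own downtime features, and `should_page` is not taken from any of them.

```python
from datetime import datetime


def should_page(host, at, downtimes):
    """Page unless this specific host has a scheduled downtime covering `at`."""
    return not any(
        h == host and start <= at <= end for h, start, end in downtimes
    )


# Hypothetical host names; only ironport1 is being upgraded.
downtimes = [
    ("ironport1", datetime(2024, 3, 3, 22, 0), datetime(2024, 3, 4, 0, 0)),
]
now = datetime(2024, 3, 3, 22, 30)

print("ironport1:", should_page("ironport1", now, downtimes))
print("ironport2:", should_page("ironport2", now, downtimes))
```

Because the downtime is tied to one host and one time window, an alert for the other device still gets through, and so does an alert for the upgraded device once the window has closed.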
Bringing it all together
There
is often an inordinate amount of pressure (both external and internal)
on IT personnel to finish one task and rush to the next so they may
continue to demonstrate value to the organization. However, that
pressure is antithetical to the concept of IT itself: keeping things
running with a minimal amount of downtime and disruption.