I would be a rich person if I were able to count the number
of times an IT person has claimed that the slowness in an
application or system is because of the network. As I type this, I can hear the
cheers of network support teams across the world shouting “Hallelujah, somebody
other than us said it.” OK, so maybe not across the world, but at least in Tampa
and Jacksonville.
The network is and always will be an easy target for system
issues unless teams are given the tools to quickly dissect issues. A
number of these tools are free and come with the operating system or database.
What we’re really talking about here is training for IT staff. I don’t mean
reading the books, studying for a test,
passing, and then becoming a “real” engineer. I mean understanding the seven layers of the OSI model:
· Application
· Presentation
· Session
· Transport
· Network
· Data Link
· Physical
It is impossible to completely
segregate each layer, so it is entirely logical that one layer having issues
can bleed into the others. This is really where problem management, thorough
root cause analysis, and careful incident management come into play.
I love (not an exaggeration)
troubleshooting. I have been asked repeatedly to help teach people how to
troubleshoot. While the basics can be taught, troubleshooting is far more often an
art; a skill honed over the course of many years.
· Start with the obvious. This may sound silly, but it’s not uncommon for engineers to
“assume” that a system issue must be something complicated and exotic.
o Has the issue happened before? If so, when? What is the failure frequency? Does it hit the same user/system or another?
§ If so, what resolved the issue then? This is where strong incident management
comes into play. Recording issues along with their resolutions allows trending to point
out repeat offenders (see the first sketch after this list).
· Are errors logged? If so, use those resources to look up potential resolutions.
Windows, Linux, and Unix operating systems ALL record errors, and applications can be
configured to write errors to logs. A word of caution: massive error writing
can cause massive log entries, so verbose logging should only be used for true
errors, not day-to-day application support. This is where the beauty of the internet
and support agreements comes into play. Sites such as eventid.net and vendor sites
offering searchable databases can make fast work of troubleshooting (there’s a
log-scanning sketch after this list).
· Create a team across functional areas to troubleshoot and perform root cause analysis work.
o Consider triage training for your teams (ITIL and MOF are excellent guidelines to follow).
· Create a “no blame” environment.
· Track changes made to the environment – this should really be given your highest
focus, as most issues occur because of changes made to the environment (a
change-correlation sketch follows this list).
o Was the system having memory issues before the application was updated?
o Was communication between environments an issue before a firewall change?
· Track vendor releases for potential issue resolution.
o My favorite catch phrase is “Trust, but confirm”
– I would never have taken that tack in 2001 with Microsoft. Through the years,
however, Microsoft has heeded the message from customers that we would not
accept crappy code any longer.
o So, test, test, and test again – but UPDATE.
· Is it the network? It could be, but more likely
it’s an architecture issue: insufficient bandwidth was architected
for the ever-evolving needs of a mobile world. BYOD brings its own issues in
that everyone connects – and they connect across the network. Is it the
network? No, it’s the increased need for bandwidth.
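Before wrapping up, a few quick sketches to make the points above concrete. First, the incident-trending idea: once issues and their resolutions are recorded, even a few lines of Python can surface repeat offenders. The record fields below are hypothetical and not tied to any particular ticketing product.

    # Minimal sketch: spot recurring incidents in recorded tickets.
    # The fields (system, error, resolution) are illustrative only.
    from collections import Counter

    incidents = [
        {"system": "app01", "error": "disk full", "resolution": "purged temp files"},
        {"system": "app01", "error": "disk full", "resolution": "purged temp files"},
        {"system": "db02", "error": "listener timeout", "resolution": "restarted listener"},
    ]

    counts = Counter((i["system"], i["error"]) for i in incidents)
    for (system, error), n in counts.items():
        if n > 1:
            print(f'{system}: "{error}" has recurred {n} times - check past resolutions')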
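Second, the error-log idea: the same kind of quick pass works against operating system logs before you ever open a browser. This sketch assumes a syslog-style text file; the path is an assumption, so point it at whatever your system actually writes.

    # Minimal sketch: tally error lines from a syslog-style file so the
    # most frequent messages can be searched on eventid.net or vendor sites.
    from collections import Counter

    error_counts = Counter()
    with open("/var/log/syslog", encoding="utf-8", errors="replace") as log:
        for line in log:
            if "error" in line.lower():
                # Keep only the message portion so identical errors group together.
                message = line.split("]:")[-1].strip()
                error_counts[message] += 1

    for message, n in error_counts.most_common(5):
        print(f"{n:4d}x {message}")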
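And third, the change-tracking idea: if change records carry timestamps, correlating them with an incident’s start time takes almost no code. The change entries below are made up for illustration.

    # Minimal sketch: flag recorded changes that landed shortly before an incident.
    # Timestamps and descriptions are invented for the example.
    from datetime import datetime, timedelta

    changes = [
        ("2014-03-10 22:00", "firewall rule update"),
        ("2014-03-11 01:30", "application patch on app01"),
    ]
    incident_start = datetime.strptime("2014-03-11 08:15", "%Y-%m-%d %H:%M")
    window = timedelta(hours=24)

    for when, description in changes:
        change_time = datetime.strptime(when, "%Y-%m-%d %H:%M")
        if timedelta(0) <= incident_start - change_time <= window:
            print(f"Suspect change within 24 hours of the incident: {description} ({when})")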
As always with technology, communication is key. It’s not
unheard of for an engineer to have a “back pocket” of tricks for resolving issues.
Management cannot accept this. Those steps and resolutions must be
documented in order to make the overall environment a stronger and more
successful one. Reward the guys with the bag of tricks, but be sure they help
the less capable troubleshooters. This is not about job security for an individual;
it’s about the strength of the whole.
And, if you haven’t figured it out yet, it’s RARELY JUST about the network.
Comment from Bill: My favorite – “we didn’t change anything and it’s broke/slow now.”
My reply: I couldn’t agree more, Bill. Something I should have said in the blog: capacity management and system impact are areas that are ever changing. You cannot assume that the traffic across an environment is the same as it was when it was established. You need to create baselines, and regularly create and review metrics. In this way, you can stay on top of changes before they bring down an environment or worsen the user experience.
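As a rough illustration – the metric names and numbers below are invented for the example – a baseline check can be as simple as:

    # Minimal sketch: flag metrics that have drifted from an established baseline.
    # Metric names and thresholds here are invented for illustration.
    baseline = {"wan_utilization_pct": 40, "avg_response_ms": 120}
    current = {"wan_utilization_pct": 72, "avg_response_ms": 135}

    for metric, base in baseline.items():
        now = current[metric]
        if now > base * 1.5:  # alert at 50% above the baseline
            print(f"{metric}: {now} vs baseline {base} - investigate before users notice")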
Comment: Great entry!