Tuesday, October 1, 2013

No, it's not the network


I would be a rich person if I were able to count the number of times an IT person has claimed that the reason for slowness in an application or system is because of the network. As I type this, I can hear the cheers of network support teams across the world shouting “Hallelujah, somebody said it other than us.” Ok, so maybe not across the world but at least Tampa and Jacksonville.

The network is and always will be an easy target for system issues unless teams are given the tools to easily isolate issues. A number of these tools are free and come with the operating system or database. What we're really talking about here is training for IT staff. I don't mean reading the books, studying for a test, passing and then becoming a "real" engineer. I mean understanding the OSI Model's Seven Layers:

·         Application

·         Presentation

·         Session

·         Transport

·         Network

·         Data Link

·         Physical

It is impossible to completely segregate each layer, so it is entirely logical that issues in one layer can bleed into the others. This is really where problem management, thorough root cause analyses and careful incident management come into play.

I love (not an exaggeration) troubleshooting. I have been asked repeatedly to help teach people how to troubleshoot. While this is possible, troubleshooting is far more often an art: a skill honed over the course of many years.

·         Start with the obvious. This may sound silly but it’s not uncommon for engineers to “assume” that a system issue must be something complicated and exotic.

o   Has the issue happened before? If so, when? What is the failure frequency? To the same user/system or another?

§  If so, what resolved the issue then?

This is where strong incident management comes into play. Recording issues with resolutions allows for trending to point out repeated issues.
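The trending idea above can be sketched in a few lines. The incident records and field names below are hypothetical, standing in for whatever your ticketing system exports:

```python
from collections import Counter

# Hypothetical incident records; in practice these would come from an
# incident-management system export or a ticketing-database query.
incidents = [
    {"summary": "app server out of memory", "resolved_by": "restart service"},
    {"summary": "App Server out of memory", "resolved_by": "restart service"},
    {"summary": "login page timeout", "resolved_by": "cleared session cache"},
]

def normalize(summary: str) -> str:
    """Fold case and whitespace so near-duplicate summaries group together."""
    return " ".join(summary.lower().split())

counts = Counter(normalize(i["summary"]) for i in incidents)

# Any summary seen more than once is a candidate for problem management.
repeats = {summary: n for summary, n in counts.items() if n > 1}
print(repeats)  # {'app server out of memory': 2}
```

Even this crude grouping surfaces the "we've seen this before" incidents that deserve a real root cause analysis instead of another quick fix.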

·         Are errors logged? If so, use resources to look up potential resolutions.

Windows, Linux, and Unix operating systems ALL record errors, and applications can be configured to write errors to logs. A word of caution: verbose logging can generate massive log volumes, so reserve it for troubleshooting true errors rather than leaving it enabled for routine application support.

This is where the beauty of the internet and support agreements come into play. Internet sites such as eventid.net and vendor sites offering searchable databases can make fast work of troubleshooting.
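As a rough sketch of putting logged errors to work, here is one way to pull event IDs out of a plain-text log so they can be looked up on a site like eventid.net. The log format and "EventID" field below are invented for illustration; real formats vary by OS and application:

```python
import re

# Sample log lines; real ones would come from an Event Viewer export,
# files under /var/log, or an application's log directory.
log_lines = [
    "2013-10-01 09:12:01 INFO  service started",
    "2013-10-01 09:14:33 ERROR EventID=4201 connection refused",
    "2013-10-01 09:14:35 ERROR EventID=4201 connection refused",
    "2013-10-01 09:20:10 WARN  retry scheduled",
]

# Pull out the ERROR lines and their event IDs so each unique ID can be
# researched in a vendor knowledge base or searchable event database.
pattern = re.compile(r"ERROR\s+EventID=(\d+)")
error_ids = [m.group(1) for line in log_lines if (m := pattern.search(line))]

print(error_ids)  # ['4201', '4201']
```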

·         Create a team across functional areas to troubleshoot and perform root cause analysis work.

o   Consider triage training for your teams (ITIL and MOF are excellent guidelines to follow)

·         Create a “no blame” environment

·         Track changes made to the environment – this should really be given your highest focus as most issues occur because of changes made to the environment.

o   Was a system having memory issues before the application was updated?

o   Was communication between environments an issue before a firewall change?

·         Track vendor releases for potential issue resolution.

o   My favorite catch phrase is “Trust, but confirm” – I would never have taken that tack in 2001 with Microsoft. Through the years however, Microsoft has heeded the message from customers that we would not accept crappy code any longer.

o   So, test, test and test again but UPDATE.

·         Is it the network? It could be, but more likely it's an architecture issue: the bandwidth was never architected for the ever-evolving needs of a mobile world. BYOD brings its own issues in that everyone connects – and they connect across the network. Is it the network? No, it's the increased need for bandwidth.
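One quick way to test the "is it the network?" claim is to compare actual link utilization against capacity. A minimal sketch, assuming you can take two readings of an interface's byte counter (the numbers below are invented):

```python
# Two readings of an interface's transmitted-bytes counter, taken
# interval_s seconds apart (illustrative values, not real measurements).
bytes_before = 1_250_000_000
bytes_after = 1_437_500_000
interval_s = 30
link_capacity_bps = 100_000_000  # a 100 Mbit/s link

# Throughput in bits per second over the sample window.
throughput_bps = (bytes_after - bytes_before) * 8 / interval_s
utilization = throughput_bps / link_capacity_bps

print(f"{utilization:.0%}")  # 50%
```

A link running at 50% isn't saturated; sustained readings near 100% point to the bandwidth problem described above rather than a network fault.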

As always with technology, communication is key. It's not unheard of for an engineer to have a "back pocket of tricks" to resolve issues. Management cannot accept this. These steps and resolutions must be documented in order to make the overall environment stronger and more successful. Reward the guys with the bag of tricks, but be sure they help the less capable troubleshooters. This is not about job security for an individual; it's about the strength of the whole.

And, if you haven’t figured it out yet, it’s RARELY JUST about the network.

3 comments:

  1. my favorite -- we didn't change anything and it's broken/slow now

    William_fiore_jr@yahoo.com

    1. I couldn't agree more Bill. Something I should have said in the blog: capacity management and system impact are areas that are ever changing. You cannot assume that the traffic across an environment is the same as it was when it was established. You need to create baselines and regularly collect and review metrics. That way, you can stay on top of changes before they bring down an environment or worsen the user experience.
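A minimal sketch of that baselining idea, with a simple deviation check. The sample values and the three-sigma threshold are illustrative assumptions, not a recommendation for any particular tool:

```python
import statistics

# Hypothetical daily response-time samples (ms) collected as a baseline.
baseline = [120, 118, 125, 122, 119, 121, 124]

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def is_anomalous(sample_ms: float, sigmas: float = 3.0) -> bool:
    """Flag samples more than `sigmas` standard deviations above baseline."""
    return sample_ms > mean + sigmas * stdev

print(is_anomalous(123))  # False: within normal variation
print(is_anomalous(180))  # True: worth investigating before users complain
```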
