Tuesday, October 4, 2011

How did that happen?


The City of San Francisco – July 2008; London Ambulance Service full system outage – June 2011; Lloyds Banking Group – August 2011; Japan's Mizuho Bank system failure

At first glance, these events seem to have nothing in common. Yet all were incidents caused by 1) human error, or 2) a lack of, or poorly defined, processes. The term incident here is as defined by ITIL: "An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident – for example, failure of one disk from a mirror set." Each of these incidents caused, or had the potential to cause, real disruption to service.

Let's take each incident individually and discuss how the issue could have been avoided or, at a minimum, mitigated. As armchair quarterbacks, it will be easy for us to see where things went wrong.

In July 2008, a disgruntled IT worker prevented city officials from accessing a primary system, which handled about 60 percent of the network traffic for city departments. According to KTVU, the employee had an issue with his supervisor and was arrested after refusing to provide network passwords to his supervisors. Again according to KTVU, the employee had a history of conflicts with supervisors and had two felony arrests for burglary and theft. While no outage was reported, without those passwords the systems would have been down and unrecoverable had there been a power failure.

Root Cause: The human element cannot be overlooked in a circumstance such as this. Someone has to hold the keys to the kingdom. There should, however, be a process in place for managing superuser accounts. Should the holder be someone with a poor employment history and prior felony arrests? Doubtful. Technology personnel are in positions of trust in a way that implies you are confident they will not engage in illegal or immoral actions that could harm your environment. Background checks and reference calls should mitigate this risk.
Avoidable? Yes. It's always recommended that you take a multi-layered approach to security. In a very small shop this is difficult, but in a larger organization it's inconceivable that one person should be able to strip a company or a city of its ability to function. In a small company you may not have the forensic tools to counter the actions of an irate administrator, but there are processes that can be put into place to prevent an occurrence like the one in San Francisco. Do you have password management processes? Superuser "break glass" processes? A policy against using shared accounts across environments? A process for systematically changing administrative passwords? None of these can absolutely prevent an incident such as the San Francisco hostage situation, but they can mitigate the risk.
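To make that concrete, here is a minimal sketch, in Python, of what a scheduled administrative password rotation with vault escrow might look like. The host list, the `vault.store` call, and the `set_password_on_host` helper are all assumptions for illustration, not a real inventory or API; the point is simply that credentials are rotated on a schedule and escrowed somewhere more than one person can reach them through an approved break-glass process.

```python
import secrets
import string

# Illustrative host inventory -- an assumption, not a real environment.
HOSTS = ["app-01", "app-02", "db-01"]

def generate_password(length: int = 24) -> str:
    """Generate a strong random password for an administrative account."""
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*"
    return "".join(secrets.choice(alphabet) for _ in range(length))

def rotate_admin_passwords(vault, hosts=HOSTS, account="root"):
    """Rotate the superuser password on every host and escrow it in a vault.

    `vault` is assumed to expose store(key, secret), e.g. a thin wrapper
    around your secrets-management tool of choice. The escrowed copy is what
    a "break glass" process would retrieve with approval, so no single
    administrator is the only holder of the keys to the kingdom.
    """
    for host in hosts:
        new_password = generate_password()
        set_password_on_host(host, account, new_password)  # hypothetical helper
        vault.store(f"{host}/{account}", new_password)     # escrowed, auditable copy

def set_password_on_host(host: str, account: str, password: str) -> None:
    """Placeholder for whatever mechanism actually changes the credential
    (configuration management, directory service, etc.)."""
    raise NotImplementedError
```

Run on a schedule, a process like this means that when an administrator walks out the door (or refuses to talk), the passwords they held are neither secret nor permanent.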
In June 2011, the London Ambulance Service performed a system upgrade. The upgrade had unintended consequences, and the end result was that the service had to resort to paper records until the old system could be restored. The system being replaced was over 20 years old.

Root Cause: This one is multi-tiered. Allowing obsolescence, particularly in a mission-critical environment, can be an expensive bad practice. Additionally, going to production and failing generally indicates inadequate testing, whether due to a lack of testing resources or poor use case development.
Avoidable? Yes. Do you have an end-of-life/end-of-support strategy? Do you maintain your systems at "N" or "N-1"? System refreshes are rarely inexpensive or trouble-free, but with proper testing, failures should be mitigated. Upgrading a 20-year-old system has considerable potential downfalls. Who has maintained the documentation on the system's upgrades? Who knows the ins and outs of a system that was put into place so long ago that today's laptops are probably more powerful? Was business requirement gathering adequate? Were the right people engaged at the right time?
In August 2011, Lloyds Banking Group had a server cooling system failure. End result? The online banking system was completely down. Paper records were the fallback on one of the busiest stock-trading days in history.

Following the earthquake and tsunami in Japan, Mizuho Bank's systems were inundated with money transfer requests that far exceeded normal capacity. The systems could not handle the backlog and could not recover; 620,000 people were unable to receive their pay until three days later.

Root Cause: The bank was unable to respond to a catastrophic increase in traffic across its backend processing systems.
Avoidable? No. Maybe. Doubtful. (Yep, that was a little dicey.) Any company can overbuild its systems to the point where they can manage any amount of traffic, but is it the smartest use of revenue dollars to build for something that may never occur? That's a question the business has to answer.
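To see why a backlog can become unrecoverable, here is a rough back-of-the-envelope sketch. The arrival and capacity numbers are invented for illustration only, not Mizuho's actual volumes; the point is that once requests arrive faster than the system can clear them, the queue grows until capacity again exceeds the arrival rate, and thin headroom means a very long drain.

```python
def backlog_over_time(arrival_per_hour, capacity_per_hour, hours):
    """Track how a processing backlog grows (or shrinks) hour by hour.

    Illustrative numbers only: the rates below are assumptions, not figures
    from the Mizuho incident.
    """
    backlog = 0
    history = []
    for hour in range(hours):
        backlog += arrival_per_hour[hour] - capacity_per_hour
        backlog = max(backlog, 0)  # a backlog can't go negative
        history.append(backlog)
    return history

# Normal day: 50k transfers/hour against 60k/hour of capacity -- no backlog.
# Disaster surge: 150k/hour for 12 hours against that same 60k/hour.
surge = [150_000] * 12 + [50_000] * 12
print(backlog_over_time(surge, capacity_per_hour=60_000, hours=24))
# After 12 hours of surge the backlog is about 1.08 million transfers, and
# with only 10k/hour of spare headroom it takes days to drain. That is the
# trade-off the business has to price: pay for idle headroom year-round, or
# accept the delay when the once-in-a-generation event arrives.
```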

 
I know you've heard this repeatedly, but it's crucial to have a close, trusting relationship with business leaders. There is a cost to running a technology shop the right way. There is an associated risk with taking shortcuts. As long as everyone understands and supports that, it's a win-win, and there "shouldn't" be finger-pointing in the event of a catastrophic incident. That's where metrics and reporting become significant. Letting stakeholders know the state of their environments, the risks associated with them, and the potential recoverability (or lack thereof) should allow those stakeholders to make educated, risk-aligned decisions regarding an environment.
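As a small illustration of the kind of reporting that makes those conversations concrete, the sketch below converts an availability target into the downtime it actually permits, and a set of incident durations into the availability actually delivered. The targets and incident minutes are made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

def delivered_availability(incident_minutes):
    """Availability actually delivered over a year, given outage durations."""
    downtime = sum(incident_minutes)
    return 100 * (1 - downtime / MINUTES_PER_YEAR)

# "Three nines" sounds great until you see it in minutes.
print(round(allowed_downtime_minutes(99.9)))    # ~526 minutes per year
print(round(allowed_downtime_minutes(99.99)))   # ~53 minutes per year

# Hypothetical year: two short outages and one six-hour cooling failure.
print(round(delivered_availability([45, 30, 360]), 3))  # ~99.917%
```

Numbers like these, reported regularly, are what let a stakeholder decide whether the gap between the target and the delivery is a risk worth paying to close.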
You may be thinking, yeah, but how often does a tsunami happen? How often do these companies go down? They've probably got great availability numbers. How often do catastrophic incidents occur? Seldom. I'm sure those technology departments worked themselves into a tizzy recovering from their system outages. That's because that's what we do. I have yet to work in a technology department that didn't have world-class firefighters. World-class "let's just keep it up and prevent issues" fighters, though? Um, not so much. I mean, where's the glory in having an environment that self-heals, self-adjusts for growth, and just doesn't go down? It's there. It's predictable. It means there is a balance between work and life. That's glory enough for me.

Next blog topic: Biggest threats to a company that have nothing to do with security.

