Monday, October 10, 2011

Biggest Threats to a company that have nothing to do with security

  • A manufacturing company, year 1997, a LAN Admin checks a box about authentication requirements.
    • End Result: All users attempting to log in are prompted repeatedly for additional credentials.
  • An international law firm, year 2000, an engineer pushes a new group policy to all workstations.
    • End Result: 45 minutes after all of the domain controllers receive the policy and push to the workstations, all users are unable to use any software listed under c:\program files\*.*
  • A large telecommunications company, year 2005, an engineer pushes a new antivirus policy to all Windows servers in the DMZ.
    • End Result: All of the servers slow to a crawl as all files opened or created are scanned in real-time.
  • An internet bank, year 2010, a contractor pushes a software security patch to all systems.
    • End Result: The software is installed on 3 different Windows Server versions, in some cases leaving mission-critical legacy systems in an unstable state.

While none of these individual incidents may seem to present a huge threat, the attitude that permits them to occur does. Without processes and policies that discourage unscheduled or "un-approved" actions, people WILL do unauthorized work, and at some point there will be negative ramifications.

What is the cost of unauthorized work in downtime? While the answer to that question depends upon the environment, any cost is too much. Change control, while different from ITIL's change management, at a minimum controls the changes in an environment. The first aspect of change that has to be managed is the attitude of the people involved. Unfortunately, technologists too often forget that they are working in a live environment, with impact to end users who are then unable to do their jobs and make money for the company they work for. That is largely because impact is frequently undocumented.
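
To make that cost a little more concrete, here is a rough, illustrative way to estimate it; the formula and the figures are assumptions for the sake of example, not numbers taken from the incidents above.

```python
# Rough, illustrative downtime cost estimate -- all figures are hypothetical.
def downtime_cost(affected_users, loaded_hourly_cost, hours_down, productivity_loss=1.0):
    """Estimate the productivity cost of an outage.

    affected_users     -- number of people who could not work normally
    loaded_hourly_cost -- average fully loaded cost per person-hour
    hours_down         -- duration of the outage in hours
    productivity_loss  -- fraction of work lost during the outage (0.0 - 1.0)
    """
    return affected_users * loaded_hourly_cost * hours_down * productivity_loss


# Example: 1500 users, $60/hour loaded cost, 3-hour outage, 75% productivity loss
print(downtime_cost(1500, 60, 3, 0.75))  # 202500.0
```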

Why do we need change control?

Millions of dollars of lost business opportunity can be attributed to environmental changes, whether they are technology changes or otherwise. Being able to predict availability and stability contributes to a stronger bottom line. These reasons alone are sufficient to warrant a change control policy. In a regulated or publicly traded business, changes that cause issues can also have regulatory and reputational impact as well as monetary impact.

How do you control unauthorized changes?

Having policies and procedures that reflect an attitude of intolerance toward unauthorized work is a beginning. The best initial step is to write a policy that bases documentation requirements upon risk to the environment. For example, making a change to one end user's desktop "should" be fairly non-impacting to the organization overall. Document what work is considered sufficiently low risk to be permitted with minimal paperwork or documentation; that type of work can generally be done with an email request or, if the corporation has a ticket system, with a ticket request. Enforcing a change control policy will not make you popular, but it will make your environment far more stable. The policy has to have support and be enforced for all technology groups. The policy also has to have "bite"; in other words, the terms for non-compliance must be severe, up to and including termination.

Include in the policy:

  • A methodology for rating the risks related to the changes (see the sketch after this list)
  • Approvals needed
  • Change exception process
  • Emergency change process/approvals
  • Testing requirements
  • Rollback process
  • Communication plan
  • How to document implementation

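As a minimal sketch of the first item above, a change could be scored on a few dimensions and mapped to an approval level. The categories, weights, and thresholds below are illustrative assumptions, not a standard.

```python
# Minimal sketch of a change risk-rating methodology.
# The categories, weights, and thresholds are illustrative assumptions.

RISK_FACTORS = {
    "scope":         {"single user": 1, "department": 2, "enterprise": 3},
    "reversibility": {"automatic rollback": 1, "manual rollback": 2, "no rollback": 3},
    "criticality":   {"non-production": 1, "internal production": 2, "customer-facing": 3},
}

APPROVAL_LEVELS = [
    (4, "Standard change - ticket or email approval"),
    (7, "Normal change - change advisory board approval"),
    (9, "High risk - CAB plus business stakeholder sign-off"),
]


def rate_change(scope, reversibility, criticality):
    """Return a numeric risk score and the approval level it maps to."""
    score = (RISK_FACTORS["scope"][scope]
             + RISK_FACTORS["reversibility"][reversibility]
             + RISK_FACTORS["criticality"][criticality])
    for threshold, level in APPROVAL_LEVELS:
        if score <= threshold:
            return score, level
    return score, APPROVAL_LEVELS[-1][1]


# Example: an enterprise-wide push to customer-facing systems with only a manual rollback path
print(rate_change("enterprise", "manual rollback", "customer-facing"))
# (8, 'High risk - CAB plus business stakeholder sign-off')
```
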
Steps to take after you have a policy in place include:

  • Train the technologists,
  • Train the business (not to the same depth as the technologists, but they need to be aware that, as stakeholders, they will need to approve changes),
  • Set up meetings to discuss changes, whether on a daily, weekly or monthly basis depending upon the scope and number of changes in your environment,
  • Review high-impact or high-risk changes on a regular basis,
  • Review the change outcomes and improve upon success rates. Measure, document, and report (a minimal sketch follows this list).
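
Here is a minimal sketch of what "measure, document, and report" could look like against a simple log of change records; the record fields, values, and report layout are illustrative assumptions.

```python
# Sketch of change-outcome reporting over a simple list of change records.
# The record fields ("risk", "emergency", "outcome") are illustrative assumptions.

changes = [
    {"id": "CHG-101", "risk": "high", "emergency": False, "outcome": "success"},
    {"id": "CHG-102", "risk": "low",  "emergency": True,  "outcome": "rolled back"},
    {"id": "CHG-103", "risk": "high", "emergency": False, "outcome": "failed"},
    {"id": "CHG-104", "risk": "low",  "emergency": False, "outcome": "success"},
]


def change_report(records):
    """Summarize change volume, success rate, and high-risk changes needing review."""
    total = len(records)
    successes = sum(1 for c in records if c["outcome"] == "success")
    emergencies = sum(1 for c in records if c["emergency"])
    high_risk_failures = [c["id"] for c in records
                          if c["risk"] == "high" and c["outcome"] != "success"]
    return {
        "total changes": total,
        "success rate": f"{successes / total:.0%}",
        "emergency changes": emergencies,
        "high-risk changes to review": high_risk_failures,
    }


print(change_report(changes))
# {'total changes': 4, 'success rate': '50%', 'emergency changes': 1,
#  'high-risk changes to review': ['CHG-103']}
```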

Of course, the best way to control unauthorized changes is to have systems in place that prevent them and, in the event one does slip through, roll it back. These systems can be cost prohibitive for a small or medium business and, in that kind of environment, more bureaucratic than necessary. Maintain the stance that the systems need be no more complex than the risk profile of your business warrants. If you are in a non-regulated business, the change approval process may not need to be as stringent as it would be in a regulated business.
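
Even without a commercial product, a small shop can approximate that kind of detection. The sketch below hashes a set of monitored files and compares them against a stored baseline to flag unexpected changes; the file paths and baseline name are assumptions for illustration, and it only detects drift rather than rolling it back.

```python
# Minimal sketch of file-integrity style change detection using only the
# standard library. Paths and the baseline file name are illustrative.
import hashlib
import json
from pathlib import Path

MONITORED = [Path("/etc/app/config.ini"), Path("/etc/app/policy.xml")]  # assumed paths
BASELINE_FILE = Path("baseline.json")


def hash_file(path):
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_baseline():
    """Record the current, approved state of the monitored files."""
    baseline = {str(p): hash_file(p) for p in MONITORED if p.exists()}
    BASELINE_FILE.write_text(json.dumps(baseline, indent=2))


def detect_drift():
    """Return monitored files whose contents no longer match the baseline."""
    baseline = json.loads(BASELINE_FILE.read_text())
    drifted = []
    for path_str, old_digest in baseline.items():
        path = Path(path_str)
        if not path.exists() or hash_file(path) != old_digest:
            drifted.append(path_str)
    return drifted


if __name__ == "__main__":
    if not BASELINE_FILE.exists():
        save_baseline()  # first run: record the approved state
    else:
        print("Changed since baseline:", detect_drift())
```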

What happened in the situations documented above?

LAN Admin @ Manufacturing Co – 100 employee impact – there was no central helpdesk, so pockets of employees were complaining to each other before someone called the LAN Admin. Approx 2.5 hrs downtime.

GPO push to desktops – 1500 employee impact – the helpdesk began getting calls and notified the engineering team, who scrutinized the change that had been made and realized the issue. Approx 3 hrs downtime.

Antivirus push to systems – 3000 servers received the push – operations engaged engineering when alerting revealed that a change had been made; the push occurred on a Saturday evening at 7 pm. Engineering was able to engage the appropriate personnel and, after investigating, corrected the issue before customer impact. Cleanup was completed at approximately 3 am. 0 downtime.

2010 software push – several of the systems had received the push but had not attempted installation. Fewer than 10 systems ended up having to be rebuilt. Approx 5 hrs downtime.

Next steps:

See next blog. We'll progress through other areas that can contribute to instabilities and lapses in any business.

Tuesday, October 4, 2011

How did that happen?


The City of San Francisco – July 2008, London Ambulance Service full system outage – June 2011, Lloyds Banking Group – August 2011, Japan's Mizuho Bank system failure

At first glance, these events seem to have nothing in common. Yet all were incidents caused by 1) human error, or 2) a lack of, or poorly defined, processes. The term incident is used here as defined by ITIL: "An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident – for example, failure of one disk from a mirror set." Each of these incidents caused, or had the potential to cause, delays in service.

Let's take each incident individually and discuss how the issue could have been avoided or, at a minimum, mitigated. As armchair quarterbacks, it will be easy for us to figure out how the incidents could have been avoided.

In July 2008, a disgruntled IT worker prevented city officials from accessing a primary system that handled about 60 percent of the network traffic for city departments. According to KTVU, the employee had an issue with his supervisor and was arrested after refusing to provide network passwords to his supervisors. Again according to KTVU, the employee had had previous issues with supervisors and had two felony arrests for burglary and theft. While no outages were reported, without those passwords the systems would have been down and unrecoverable in the event of a power failure.

Root Cause: The human element cannot be overlooked in a circumstance such as this. Someone has to hold the keys to the kingdom, and there should be a process in place for managing the superuser accounts. Should the keyholder be someone who has a poor employment history and is a convicted felon? Doubtful. Technology personnel hold positions of trust; placing someone in such a role implies confidence that they will not engage in illegal or immoral actions that could cause harm to your environment. Background checks and reference calls should mitigate this risk.
Avoidable? Yes. It's always recommended that you take a multi-layer approach to security. In a very small shop this is difficult, but in a larger organization it's incomprehensible that one person should be able to strip a company or a city of its ability to function. A small company may not have the forensics tools to counter the actions of an irate administrator, but there are processes that can be put into place to prevent an occurrence like the one in San Francisco. Do you have password management processes? Superuser "break glass" processes? A policy against using shared accounts across environments? A process for systematically changing administrative passwords? None of these can absolutely prevent an incident such as the San Francisco hostage situation, but they can mitigate the risk.
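
As one small piece of that puzzle, even a short script can put structure around rotating administrative passwords on a schedule. In the sketch below, the vault is a hypothetical in-memory stand-in for whatever secret store you actually use, and the 90-day rotation policy is an assumption.

```python
# Sketch of scheduled administrative password rotation.
# "Vault" is a hypothetical placeholder for a real secret store;
# the rotation interval and password policy are illustrative assumptions.
import secrets
import string
from datetime import datetime, timedelta

ROTATION_INTERVAL = timedelta(days=90)  # assumed policy: rotate every 90 days
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*"


def generate_password(length=24):
    """Generate a cryptographically strong random password."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class Vault:
    """Hypothetical in-memory stand-in for a real secret store."""

    def __init__(self):
        self._secrets = {}  # account -> (password, last_rotated)

    def rotate(self, account):
        self._secrets[account] = (generate_password(), datetime.now())

    def needs_rotation(self, account):
        record = self._secrets.get(account)
        return record is None or datetime.now() - record[1] > ROTATION_INTERVAL


vault = Vault()
for account in ["domain-admin", "sql-sa", "fw-root"]:
    if vault.needs_rotation(account):
        vault.rotate(account)
        print(f"rotated {account}")
```
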
In June 2011, the London Ambulance Service performed a system upgrade. The upgrade had unintended consequences, with the end result that the service had to resort to paper records until the old system could be restored. The system being replaced was over 20 years old.

Root Cause: This one is multi-tiered. Allowing obsolescence, particularly in a mission-critical environment, can be an expensive bad practice. Additionally, going to production and failing generally indicates inadequate testing, whether due to a lack of testing resources or poor use case development.
Avoidable? Yes. Do you have an end of life/end of support strategy? Do you maintain your systems at "N" or "N-1"? System refreshes are rarely inexpensive or without issue, but with proper testing failures should be mitigated. Upgrading a 20-year-old system has considerable potential downfalls. Who has maintained the documentation on the system's upgrades? Who knows the ins and outs of a system that was put into place so long ago that today's laptops are probably more powerful? Was business requirement gathering adequate? Were the right people engaged at the right time?
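
As a minimal sketch of how an inventory could be checked against an "N or N-1" rule, consider the snippet below; the product names, version lists, and inventory are invented for illustration.

```python
# Sketch of an "N or N-1" lifecycle check against a system inventory.
# Product names, versions, and the inventory itself are illustrative assumptions.

# Supported major versions, newest first: index 0 is "N", index 1 is "N-1".
SUPPORTED_VERSIONS = {
    "DispatchApp": ["5.2", "5.1", "4.9", "3.0"],
    "RecordsDB":   ["12",  "11",  "10"],
}


def out_of_policy(inventory):
    """Return systems running anything older than N-1 for their product."""
    flagged = []
    for system, (product, version) in inventory.items():
        allowed = SUPPORTED_VERSIONS.get(product, [])[:2]  # N and N-1 only
        if version not in allowed:
            flagged.append((system, product, version))
    return flagged


inventory = {
    "dispatch01": ("DispatchApp", "5.2"),
    "dispatch02": ("DispatchApp", "3.0"),   # well past end of life
    "db01":       ("RecordsDB", "10"),
}
print(out_of_policy(inventory))
# [('dispatch02', 'DispatchApp', '3.0'), ('db01', 'RecordsDB', '10')]
```
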
In August 2011, Lloyds Banking Group had a server cooling system failure. End result? The online banking system was completely down, and paper records were the fallback on one of the busiest stock-trading days in history.

Following the tsunami and earthquake in Japan, the Mizuho banking system was inundated with money transfer requests that far exceeded its normal capacity. The systems could not handle the backlog and could not recover; 620,000 people were unable to receive their pay until three days later.

Root Cause: The bank was unable to respond to a catastrophic increase in traffic across its backend processing systems.
Avoidable? No. Maybe. Doubtful. (Yep, that was a little dicey.) Any company can overbuild its systems to the point where they can manage any amount of traffic, but is it the smartest use of revenue dollars to build for something that may never occur? That's a question the business has to answer.

 
I know you've heard this repeatedly, but it's crucial to have a close, trusting relationship with business leaders. There is a cost to running a technology shop the right way, and there is an associated risk with taking shortcuts. As long as everyone understands and supports that, it's a win-win, and there "shouldn't" be finger pointing in the event of a catastrophic incident. That's where metrics and reporting become significant. Letting stakeholders know the state of their environments, the risk associated with them, and the potential recoverability (or lack thereof) should allow those stakeholders to make educated, risk-aligned decisions regarding an environment.
You may be thinking, yeah, but how often does a tsunami happen? How often do these companies go down? They've probably got great availability numbers. How often do catastrophic incidents occur? Seldom. I'm sure those technology departments worked themselves into a tizzy recovering from their system outages. That's because that's what we do. I have yet to work in a technology department that didn't have world-class firefighters. World-class "let's just keep it up and prevent issues" fighters, though? Um, not so much. I mean, where's the glory in having an environment that self-heals, self-adjusts for growth, and just doesn't go down? It's there. It's predictable. It means there is a balance between work and life. That's glory enough for me.

Next blog topic: Biggest threats to a company that have nothing to do with security.

Resources: