580 flight delays, nine cancellations, thousands of frustrated passengers and millions in lost productivity—that is the legacy of the United Airlines IT failure in late August.
United initially explained, “The outage was caused when a piece of communication equipment in one of our data centers failed and disabled communications with our airports and web site. We have fully redundant systems and we are working with the manufacturers to determine why the backup equipment did not work as it was supposed to.” This situation is a classic in the world of risk management. It resembles an exercise from a recent workshop I led, in which a creative team was asked to design a scenario involving the independent failure of multiple components.
United stated that its hardware failed to communicate. In these situations, there are usually broader human failures to communicate. IT risk professionals can use this incident to remind their CIOs and business leaders of the history of failures at a range of companies. Yet the message is about more than looking back at past quagmires. Leaders will also propose actions to better manage such risks in the future.
From United’s pain, IT professionals can learn and act on three critical lessons:
- It is about systems. Organizations are getting better at managing risk in components. But the real world is not about neat and tidy components. It is the world of systems—interrelationships and dependencies. Life is messy. Airlines and logistics in general are great examples of systems. In this and other failures, multiple devices were involved; further, Hurricane Isaac recovery was underway.
- It is the little things. There is a tendency in risk evaluation (especially continuity exercises) to look for the big causes, such as fire or flood. But many of the worst messes (especially in IT-land) start with little causes that cascade.
- Expect failures related to people, process and technology. In particular, expect real-world situations where one failure complicates another. Remember the three horsemen of risk—change, complexity and human fatigue.
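To make the first two lessons concrete, here is a minimal sketch of how one small failure can spread through a system of dependencies. The component names and the dependency map are hypothetical and chosen only for illustration; they do not describe United's actual architecture.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each key depends on the components listed in its value.
DEPENDS_ON = {
    "airport_checkin": ["network_router"],
    "web_site": ["network_router"],
    "gate_displays": ["airport_checkin"],
    "crew_scheduling": ["database"],
    "network_router": [],
    "database": [],
}


def cascade(initial_failure, depends_on):
    """Return every component that goes down, directly or indirectly,
    once initial_failure fails."""
    # Invert the map so we can ask "who depends on this component?"
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the first failure.
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in failed:
                failed.add(child)
                queue.append(child)
    return failed


if __name__ == "__main__":
    # One "little" device fails; list everything it takes down with it.
    print(sorted(cascade("network_router", DEPENDS_ON)))
```

In this toy map, a single router failure takes down check-in, the web site and the gate displays, while crew scheduling is untouched. A component-by-component review would miss that pattern; a view of the dependencies does not.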
These three lessons lead to action for improvement:
- Scenario analysis that is more lifelike and robust. Scenario analysis is known as the “heart of risk management” for good reason. Poor scenario analysis results in poor ability to recognize problems unfolding, prioritize actions, improve capabilities and react appropriately.
- Plan Bs built directly on scenarios, resourced and ready to respond to the inevitable shocks to the system. Key is the system’s (people, process and technology) ability to take a hit and keep flying. This is about strengthening capabilities in oversight, management and business process to deal with inevitable failures in capabilities or shocks from the environment.
- Practice. A classic test is to walk into a data center, randomly unplug a cable and/or turn off a switch. How ready are you really? As part of your test, write your news release. What does your president want to say? Are you able to deliver? If not, it is time to communicate.
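The "randomly unplug a cable" test can be planned on paper before anyone touches hardware. The sketch below picks drill targets at random from an inventory and attaches the questions raised above. The inventory names and the `plan_drill` helper are hypothetical, and the code only produces a plan; it does not disable anything.

```python
import random

# Hypothetical inventory of components (an assumption for illustration).
INVENTORY = [
    "core_switch_a",
    "core_switch_b",
    "wan_link_primary",
    "wan_link_backup",
    "checkin_db_replica",
]

# Questions taken from the practice step described above.
DRILL_QUESTIONS = [
    "Did the backup take over as designed?",
    "What would the news release say?",
    "What does the president want to say?",
    "Are we able to deliver on that statement?",
]


def plan_drill(inventory, count=1, seed=None):
    """Pick `count` distinct components at random and pair each with
    the drill questions. Passing a seed makes the plan repeatable."""
    if count > len(inventory):
        raise ValueError("count exceeds the number of components")
    rng = random.Random(seed)
    targets = rng.sample(inventory, count)
    return [{"target": t, "questions": list(DRILL_QUESTIONS)} for t in targets]


if __name__ == "__main__":
    for item in plan_drill(INVENTORY, count=2):
        print(f"Simulate loss of: {item['target']}")
        for question in item["questions"]:
            print(f"  - {question}")
```

Choosing targets at random is the point: it keeps the team from rehearsing only the failures it already expects.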
Resources for you: For more on the risk-management process, see ISACA’s Risk IT based on COBIT. For more on good IT-management practices, see COBIT 5 (which incorporates concepts from Risk IT). For more on the specifics of scenarios, systems, root cause and Plan Bs, see The Operational Risk Handbook.
Principal Analyst & Advisor, ValueBridge Advisors
Brian Barnier, ValueBridge Advisors, is an analyst and advisor who shares his commentaries from the floor of the NYSE and the NASDAQ MarketSite. He is the author of The Operational Risk Handbook (Harriman House, London, 2011) and contributed to Risk Management in Finance (Wiley, New York, 2009).