Resilience Challenges, Metrics, and Responses in Today’s Landscape

Author: Spiros Alexiou, Ph.D., CISA, CSX-F, CIA
Date Published: 21 May 2024
Read Time: 16 minutes

The variety and magnitude of the threats organizations and their IT infrastructures face today are staggering. Events that are difficult to foresee and plan for happen every day, from cybersecurity incidents involving insiders, vendors, and third parties, to environmental issues such as fires, floods, and natural disasters, and from health crises triggered by pandemics to manmade disasters such as wars. Although enterprises strive to take mitigating measures, the question remains: Will they be enough when (not if) a major disruption occurs? Every organization hopes so, but given the unpredictability of such events and their potential effects, it is hard to conceive of a surefire way to answer “yes.”

Resilience is defined as “the ability to anticipate, prepare for, and adapt to changing conditions and withstand, respond to, and recover rapidly from disruptions.”1 It typically addresses low-probability but high-impact events that represent large-scale stresses to normal operations. It is distinct from reliability, which refers to delivering a fairly constant service even in the presence of fluctuating conditions as part of normal operations.

A significant challenge with building resilience is that it is a thankless job, one that does not result in bonuses being handed out for increasing sales or cutting costs. Yet it is absolutely necessary because no one lives in a world where adverse events never happen. Furthermore, adverse events are not limited to natural disasters and cyberattacks. In today’s complex world, even supposedly harmless, routine tasks can lead to major disruptions. Thus, it is critical to explore common challenges with resilience, possible solutions, and metrics that can be used to measure their effectiveness.

Getting Started With Resilience

The main attribute of a resilient organization is minimized reliance on any particular aspect of the enterprise’s function. The less one depends on anything, the easier it is to tolerate its absence or reduction. This is true for both organizations and individuals. For example, a shortage of coffee would have no consequences for people who do not drink coffee. So, to achieve resilience, rather than ask “what can go wrong?” the question should be, “what are we most dependent on?”

If the answer is IT infrastructure, there should—at a minimum—be a backup saved. If it is raw materials, then alternative sources should be available. If it is cash stored in a bank, plans should be in place for the possibility that the bank, and therefore the cash, might become inaccessible. If an enterprise is most dependent on its physical office, then there should be a backup plan for its loss or inaccessibility—for example, the ability to work remotely. As the recent COVID-19 pandemic demonstrated,2 an organization should be able to carry out essential operations without an office building, assuming an alternative physical space with relevant infrastructure, power, and networks is available.

However, it is clear that contractual guarantees via insurance are not a viable resilience strategy, as litigation is often “neither effective nor efficient.”3 This is especially true today, as brand reputation and customer loyalty are often more important than the product itself. Failure to service one’s customer(s) for a prolonged period is extremely damaging, and even getting the insurance money to rebuild does not guarantee the customer(s) will come back.

An enterprise can achieve a higher degree of resilience by identifying its most disruptive potential losses, systematically ranking them, and taking action to mitigate the hypothetical disruption. This is, of course, plain old business impact analysis (BIA), although often BIA is performed on a subset of enterprise assets and operations. The key point is that disruptive events are not always IT-related. Arguably, most events that make headlines by putting organizations’ livelihoods at risk do not involve IT. However, with rising automation and digitalization, the already significant potential for an IT-related disruption is increasing rapidly, and cybersecurity is a major concern. But it is not the only one.

Automation and intelligent systems have their own pitfalls; nothing works perfectly forever, and great care should be taken to put remedies in place for when things do not work as intended. In general, a traditional system that simply performs what it has been programmed to do has less potential of malfunctioning than an intelligent system whose actions are not strictly prescribed and controlled. And, of course, a security incident within an automated, intelligent system can have dire consequences.

Measuring and Achieving Higher Resilience

Higher resilience is desirable, of course, but what an organization really wants to know is whether it is doing enough to meet minimum requirements, since higher resilience is generally associated with higher costs. Metrics that support a quantitative assessment can therefore prove valuable. The standard metrics for disaster recovery are the recovery time objective (RTO) and the recovery point objective (RPO), which specify, respectively, the maximum acceptable time before essential operations resume and the maximum period, counted back from the disruptive event, over which the organization can afford to lose data.
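As a rough illustration of how these two objectives are checked against an actual incident, the following Python sketch compares measured downtime and the data loss window with the stated targets. The class name, function name, timestamps, and target values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def assess_recovery(objectives, last_backup, disruption, restored):
    """Return (downtime, data_loss_window, rto_met, rpo_met) for one incident."""
    downtime = restored - disruption      # how long essential operations were down
    data_loss = disruption - last_backup  # data written since the last good copy
    return downtime, data_loss, downtime <= objectives.rto, data_loss <= objectives.rpo


# Hypothetical incident: backup at 02:00, failure at 14:30, service restored at 17:00.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=24))
downtime, data_loss, rto_met, rpo_met = assess_recovery(
    objectives,
    last_backup=datetime(2024, 5, 21, 2, 0),
    disruption=datetime(2024, 5, 21, 14, 30),
    restored=datetime(2024, 5, 21, 17, 0),
)
print(f"Downtime {downtime} (RTO met: {rto_met}); data loss window {data_loss} (RPO met: {rpo_met})")
```

A check of this kind covers a single system and a single incident; it says nothing about how well the enterprise as a whole withstood the disruption.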

However, RTO and RPO relate only to the recovery of individual IT assets and measure neither the ability of the enterprise to withstand disruption nor its ability to recover as a whole. For example, reinforcing a building’s structural support certainly contributes toward a more resilient organization in terms of earthquakes, but this cannot be conveyed by RTO or RPO metrics. So, what are some other methods one can use to measure resilience?

Outsourcing Resilience
One approach is to outsource resilience and thus avoid the problem—to a degree. This involves a number of IT failover or cloud-type solutions. The key is that one must assume that the outsourced solution itself will be resilient. It may or may not be the case that the organization to which a subset of the resilience is outsourced is more resilient than the outsourcer, despite being a higher-value, more attractive target.4 However, the outsourcer cannot verify this independently, except when a negative incident occurs and becomes known. Furthermore, the compensation that the outsourcer would receive if an incident did occur—and if the outsourcer could prove that it was the provider’s fault (which is difficult to do)—is typically low. Such compensation may not be envisaged in the contract at all.5

In most cases, outsourcing works resilience-wise on the assumption that the resilience measures the organization cannot control or audit offer more assurance than the resilience measures the organization can directly manage and audit. Typically, cloud assurance relies on third-party auditors’ system and organization controls (SOC) 2 reports, which are (or should be) annual reports designed to provide assurance about the security control effectiveness of the cloud provider.6 While they are valuable, they are not guarantees. In short, the concept of resilience includes resilience against the failure of any outsourced solution.

The Hand Rule
One fairly simple metric derives from the so-called Hand rule,7 named after Learned Hand, a US federal judge. Hand attempted to quantify a measure of due diligence by ruling that a preventive measure is owed when its cost (burden) is smaller than the risk of the adverse event it addresses, with risk defined as probability multiplied by impact. This line of reasoning does introduce a metric, as it provides incentives for taking preventive or mitigating measures while establishing a ceiling over them. However, the Hand rule is more appropriate in a legal setting.
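The comparison behind the Hand rule can be sketched in a few lines of Python. This sketch assumes the usual textbook statement of the rule, in which a precaution is expected when its burden B is less than probability P multiplied by loss L. The function name, the two measures, and all figures are hypothetical.

```python
def hand_rule_requires(burden, probability, impact):
    """True when the precaution costs less than the expected loss it averts (B < P * L)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return burden < probability * impact


# Hypothetical figures, for illustration only: (burden, probability, impact).
measures = {
    "raise generators above flood level": (40_000, 0.02, 5_000_000),
    "second data center": (900_000, 0.02, 5_000_000),
}
for name, (burden, probability, impact) in measures.items():
    print(name, hand_rule_requires(burden, probability, impact))
```

The expected loss acts as the ceiling: under these assumed numbers, the first measure falls below it and the second does not.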

In practice, there are many ways of incurring costs, and not all of them are equally easy or effective, so this criterion does not have a direct causal effect on the desirable outcome, that is, the mitigation of risk. Typically, when a disruptive event occurs, the law does not mandate that assistance be provided to the affected organization. Insurers, too, assess costs and liabilities based on the measures taken and their degree of prevention rather than on the money spent. So, it is not clear whether money or resources in general are useful assurance tools.

Nevertheless, a milder form of the Hand rule, the principle of proportionality,8 is useful as a guide and has been incorporated into resilience directives.9 Proportionality dictates that any measure taken must:

  • Be both necessary and suitable to achieve the desired result
  • Not impose an excessive burden in relation to the desired result

Digital Trust: One Component of Resilience
Digital trust—for example, the ability to trust that a remote worker performing a task is the actual employee assigned to the task—is an important enabler of resilience, as the COVID-19 crisis demonstrated. However, it should be kept in mind that first, this is only one enabler and by no means a guarantor for resilience; and second, multifactor authentication (MFA), which is at the heart of digital trust and currently its best defense, is not foolproof.10

Other Attempts and Suggestions
Some work has been done to define resilience metrics for a number of topics of current interest—for example, climate change,11 Internet resilience,12 and power.13 Similar efforts by the European Union Agency for Cybersecurity (ENISA)14 reinforce that resilience is a new area of focus lacking a standardized framework or standard practices. ENISA recommends, in addition to raising awareness of resilience, adopting a common understanding of, and good practices or standards for, resilience.

According to ENISA, “Organizations use their own specific approaches and means of measuring resilience, if they use any at all.”15 ENISA also correctly recognizes that metrics must be practical and useful to avoid overreliance on key performance indicators (KPIs), and specifically recommends the following actions:

  • Start with a small set of clearly defined, actionable, meaningful, and generally accepted metrics.
  • Focus on specific systems and
  • Consider data availability constraints and
  • Define and refine thresholds based on several measurement periods, not as an absolute value upfront. Review and evaluate metrics on a regular basis to avoid KPI worship and only keep useful, actionable metrics.
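One way to read the recommendation on thresholds is to derive them from observed values once enough measurement periods exist. The Python sketch below does this with a mean plus two standard deviations. That formula, the minimum of three periods, the function name, and the sample values are all assumptions made for illustration and are not taken from ENISA.

```python
from statistics import mean, stdev


def derive_threshold(samples, k=2.0, min_periods=3):
    """Return mean + k * sample standard deviation, or None if too few periods exist."""
    if len(samples) < min_periods:
        return None  # too early to fix a threshold
    return mean(samples) + k * stdev(samples)


# Hypothetical metric: minutes needed to restore a test system, one value per month.
restore_minutes = [42, 38, 45, 40, 50, 44]
threshold = derive_threshold(restore_minutes)
latest = 61
print(f"threshold={threshold:.1f} min, latest={latest} min, breach={latest > threshold}")
```

Recomputing the threshold as new periods arrive is one simple way to refine it over time rather than fixing an absolute value upfront.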

Building Resilience From Scratch
A comprehensive approach to addressing and quantifying risk mitigation is to create a systematic list of dependencies of critical business functions and available alternative options. Specific regulations and directives for information and communications technology (ICT) apply to, for example, the financial sector.16 Among other things, the EU Digital Operational Resilience Act (DORA) requires that organizations “shall identify all information assets and ICT assets, including those on remote sites, network resources, and hardware equipment, and shall map those considered critical.”17

Steps to Build Resilience
The first step to achieving resilience is to map assets. Once they are mapped, appropriate measures to achieve higher resilience can be implemented. For example, when it has been established that customer records in the enterprise database are critical, creating air-gapped copies in one or more separate disaster sites can improve resilience. Then, as budget and other considerations allow, the organization can start to systematically build resilience into the most critical functions, starting with the items that are considered prerequisites and perhaps using a score to quantify criticality.

For example, power, networks, and telecommunications are often prerequisites of IT systems that are taken for granted. Single points of failure (SPOF) are especially important targets for gradual mitigation or elimination because of their criticality. The next step is to assess the risk and devise a concrete plan to improve resilience by methodically addressing the resilience deficiencies on the list: the highest-risk items are addressed first using the available budget, and lower-risk items are left for a later stage, as time and budget permit. While this would undoubtedly be the most desirable approach, the process is often budget-driven in practice (i.e., resilience is improved only for the assets that fall within the available budget).
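The steps described above (map dependencies, flag single points of failure, score risk, and spend the budget on the highest-ranked items) could be sketched in Python as follows. The 1-to-5 scales, the choice to rank SPOFs ahead of everything else, the asset names, and the cost figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    criticality: int        # 1 (low) to 5 (critical), an assumed scale
    likelihood: int         # 1 (rare) to 5 (frequent), an assumed scale
    alternatives: int       # number of tested fallback options
    mitigation_cost: float  # estimated cost of adding resilience

    @property
    def is_spof(self):
        return self.alternatives == 0

    @property
    def risk_score(self):
        return self.criticality * self.likelihood


def plan(dependencies, budget):
    """Rank SPOFs first, then by risk score; fund items in that order while budget lasts."""
    ranked = sorted(dependencies, key=lambda d: (not d.is_spof, -d.risk_score))
    funded, deferred = [], []
    for dep in ranked:
        if dep.mitigation_cost <= budget:
            funded.append(dep.name)
            budget -= dep.mitigation_cost
        else:
            deferred.append(dep.name)  # left for a later stage
    return funded, deferred


# Hypothetical dependency map for one critical business function.
dependencies = [
    Dependency("data center power", 5, 2, 0, 60_000),
    Dependency("customer database", 5, 3, 1, 30_000),
    Dependency("telecom link", 4, 3, 0, 25_000),
    Dependency("office building", 3, 2, 1, 80_000),
]
funded, deferred = plan(dependencies, budget=100_000)
print("Funded now:", funded)
print("Deferred:", deferred)
```

Even this toy example shows the budget effect noted above: the item with the highest risk score ends up deferred because the SPOFs consume most of the available funds.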

Hidden Assumptions
Employees involved in operations will certainly be asked to contribute to this mapping of assets. The difficulty here is that operations employees are accustomed to the usual state, with things working as designed. However, for operations to work well, a number of assumptions are made, and it is not guaranteed that all will be covered by resilience improvement initiatives. For example, when one discusses resilience with a database administrator (DBA), the discussion will likely include replication and disk capacity, but the operating system or power availability may not be on the DBA’s radar. Even events that one may not normally consider significant, such as a minor technical slip-up18 or misconfiguration issue during a maintenance upgrade,19 can cause substantial disruption. One does not normally consider the protocols followed during software upgrades as resilience-related, yet in the case of Optus, the “lack of robust, tested resilience protocols exacerbated the prolonged recovery period.”20

Physical threats may also be underestimated. For example, utility companies have multiple buildings that can take over if one or more fail. But whereas natural disasters are random events that do not have a purpose, a group of people with a purpose to take out a particular utility can defeat resilience measures by a simultaneous physical attack on all buildings. While individual buildings have their own physical security measures, often they are not adequate against a coordinated attack, something that is more likely to occur in today’s increasingly polarized world. Audit professionals in particular, with their global view, are uniquely well-positioned to review or draft a list of enterprise assets, an attack on which could maximize damage. However, learning to think beyond one’s immediate area of responsibility can be cultivated, leading to a more resilient organizational culture across all departments.

At least some of these assumptions can and should be tested. An example is a disaster site that is not a mirror image of the primary site. While IT can demonstrate how to ready every system in the disaster recovery site within the specified time frame, this does not mean that operations can resume as normal—security rules may not allow systems to exchange data. Figuring out which data fails to transmit and the reasons why is neither a simple nor fast process, and can easily derail the set recovery targets.
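One assumption that can be tested ahead of a crisis is whether the disaster site permits the same data flows as the primary site. The Python sketch below compares two sets of allowed flows and lists those missing at the disaster site. It assumes that rules from both sites can be exported into a common form using the same logical system names; the `Flow` type, the system names, and the ports are hypothetical.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    source: str
    destination: str
    port: int


def missing_flows(primary, disaster_site):
    """Return flows permitted at the primary site but not at the disaster site."""
    return sorted(set(primary) - set(disaster_site))


# Hypothetical rule exports, already normalized to logical system names.
primary_rules = {
    Flow("app", "db", 5432),
    Flow("app", "payments", 443),
    Flow("batch", "db", 5432),
}
dr_rules = {
    Flow("app", "db", 5432),
}
for flow in missing_flows(primary_rules, dr_rules):
    print(f"Not permitted at disaster site: {flow.source} -> {flow.destination}:{flow.port}")
```

Running such a comparison during routine tests is far cheaper than discovering the blocked flows while the recovery clock is running.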

Human Capital
One other extremely important aspect of resilience that is often overlooked is human capital. In a crisis, it is not the backups that will save an organization. It is first and foremost its own people—those who know the business and who need to understand what happened to steer the organization toward normal operations. Crises are unpredictable. The best hope for surviving a crisis lies in the hands of people who understand the operations in detail, are very good at what they do, and are able to navigate the troubled waters of the crisis to reach a safe harbor.

Mitigation Measures

Mitigation measures to improve resilience are in general driven by complex cost-benefit considerations that are hard to generalize. For example, an enterprise may use generators as a resilience measure against power failures in its data center. However, such generators are usually placed in the basement or on the ground floor, making them susceptible to floods. Whether to mitigate such risk by redesigning the building and placing generators on a higher floor, or building a backup data center, say, on a mountaintop, which would be safe from flooding, but possibly exposed to fires, is ultimately a cost-benefit decision. No one-size-fits-all solution exists. Situations need to be evaluated on a case-by-case basis. That said, risk maps per area and environmental disruption (e.g., fire, flood, storm, earthquake) are commercially available today.

Often organizations consider who should have responsibility for resilience. The truth is that resilience is everybody’s business, but someone has to be designated responsible for coordinating the efforts to improve resilience. It does not matter so much whether it is IT or another function that is responsible for resilience. Rather, what is important is that someone in the enterprise has the overall authority, as well as the ear of C-level management and the board of directors, so that risk and budgets can be discussed at the appropriate level.

Conclusion

Building resilience into an organization is a complex but important quest that goes beyond addressing natural disasters and cyberattacks. The best approach is a systematic consideration of every step that constitutes a critical process or supports a critical process, and a prioritization and systematic strengthening of resilience. IT risk and audit professionals, who have a global view and do not share the assumptions that operational personnel take for granted, are uniquely positioned to review and support this effort, ask questions, identify risk, and suggest appropriate mitigation measures.

Endnotes

1 National Renewable Energy Laboratory, “Resilience Metrics and Valuation,” USA, https://www.nrel.gov/security-resilience/metrics-valuation.html
2 Adavade; “Operational Resilience: Preparing for the Next Global Crisis,” ISACA® Journal, vol. 3, 2022, https://www.isaca.org/archives
3 Brown; “Achieving Organizational Resilience Through Digital Trust,” Forbes, https://www.forbes.com/sites/forbestechcouncil/2022/09/19/achieving-organizational-resilience-through-digital-trust/?sh=391f896311e2
4 Alexiou; “Is Business Continuity Management Still Relevant?,” ISACA® Journal, vol. 3, 2022, https://www.isaca.org/archives
5 Amazon Web Services (AWS), “AWS Customer Agreement,” https://aws.amazon.com/agreement/; Microsoft, “Microsoft Cloud Agreement,” 2017, https://download.microsoft.com/download/2/C/8/2C8CAC17-FCE7-4F51-9556-4D77C7022DF5/MCA2017Agr_EMEA_EU-EFTA_ENG_Sep20172_CR.pdf
6 Association of International Certified Professional Accountants; The Chartered Institute of Management Accountants; “SOC 2 - SOC for Service Organizations: Trust Services Criteria,” https://www.aicpa-cima.com/topic/audit-assurance/audit-and-assurance-greater-than-soc-2
7 Cooter; “Hand Rule Damages for Incompensable Losses,” San Diego Law Review, vol. 40, iss. 1097, 2003, https://digital.sandiego.edu/cgi/viewcontent.cgi?article=3188&context=sdlr#:~:text=In%20general%2C%20Hand%20rule%20damages,yield%20reasonable%20values%20of%20damages
8 Eur-Lex, “Principle of Proportionality,” European Union, https://eur-lex.europa.eu/EN/legal-content/glossary/principle-of-proportionality.html
9 European Commission, Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on Digital Operational Resilience for the Financial Sector, European Union, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32022R2554
10 SecurityScorecard, “Hackers Are Using These 3 Techniques to Bypass MFA,” 7 December 2022, https://securityscorecard.com/blog/techniques-to-bypass-mfa/
11 U.S. Climate Resilience Toolkit, “Resilience Metrics,” USA, https://toolkit.climate.gov/tool/resilience-metrics
12 Ford, M.; Internet Society Pulse Internet Resilience Index, Internet Society, https://datatracker.ietf.org/meeting/118/materials/slides-118-gaia-internet-society-pulse-internet-resilience-index-iri-00; Internet Society, “Internet Resilience,” Pulse, https://pulse.internetsociety.org/resilience
13 Op cit National Renewable Energy Laboratory
14 European Union Agency for Cybersecurity, “Resilience Metrics”
15 Ibid.
16 Op cit European Commission
17 Ibid.
18 Siddiqui; “Optus Outage Exposes Australia’s Internet Resilience,” Internet Society, 15 November 2023, https://pulse.internetsociety.org/blog/optus-outage-exposes-australias-internet-resilience
19 Cheung; “The Rogers Outage of 2022: Takeaways for SREs,” DevOps.com, 15 August 2022, https://devops.com/the-rogers-outage-of-2022-takeaways-for-sres/
20 Op cit Siddiqui

Spiros Alexiou, Ph.D., CISA, CIA, CSX-F

Is an IT auditor at a large company where he has worked for 16 years. He has more than 25 years of experience in IT systems, has designed and audited resilience, and has taught and written about business continuity. He can be reached at spiralexiou@gmail.com.
