Klaus Julisch, Ph.D. and Damian Walch
Today’s data centers are extremely complex systems, consisting of thousands or tens of thousands of servers, racks, switches and other equipment, all of which have intricate interdependencies. Moreover, due to the rising frequency and severity of disasters1, 2 and the high value of data and processes hosted, data centers are required to provide disaster recovery capabilities. To ensure and improve the quality of these capabilities, it is common to conduct disaster recovery (DR) assessments.
In practice, assessing the DR programs of modern data centers is a challenging task. Given the complexity of the data centers and the interdependencies among systems, it is easy to get mired in the details. When cloud computing, globalization and virtualization are added, the task can become almost unmanageable.
This article presents a practice-tested framework that structures and prioritizes the assessment of DR programs, testing the most business-critical aspects first. This article specifically shows where DR failures and shortcomings are most common or severe.3 Depending on the scope and budget of a specific assessment, the framework offers an incremental path for adding as much depth and detail as needed. As data center operations are increasingly outsourced, dealing with the client-provider relationship during assessments is also covered.
This article offers guidance to IT executives that enables them to better focus and guide DR assessments. Auditors in charge of conducting assessments will benefit from the proposed framework to structure and prioritize their work.
DR is a management as well as a technical discipline. Further, management is overseen and controlled by a governance function. The success of a DR program, therefore, depends on all three areas (governance, management and technical operations) working effectively by themselves and in combination. Accordingly, any assessment of a DR program has to cover all three layers. In doing so, it is important to start with governance (see figure 1) and work down toward the technical and operational aspects. The reason for this prioritization is that a DR program with strong governance tends to be resilient, in the sense that it continuously finds and eliminates weaknesses in the lower layers. Conversely, when governance is weak, management and technical operations tend to degrade over time and become dysfunctional, even if they were designed effectively at the time of the assessment. For these reasons, any assessment should start with governance and proceed from the top down.
The assessment objective is different for each layer of the DR program:
The three tiers in figure 1 are sometimes referred to as “governance, process and technology” or “policy, people and process, and technology,” with some variations in semantics. The terminology as defined here is, however, more descriptive for the purpose of this article and will, therefore, be used throughout this text.
Assessing Disaster Recovery Governance
Governance sets the strategy of what DR activities to undertake and, to a lesser degree, how to carry them out. To address the what-related questions, governance bodies use business impact analysis (BIA), which is generally prepared by the business, to determine the criticality of DR for business success and future growth. The DR vision, objectives and budget are accordingly derived based on how important the availability and recovery of IT systems are to the business. The vision and objectives drive the DR methods to which the organization adheres (i.e., the how of governance).
When defining the how, governance bodies do not spell out operational details of disaster recovery, but rather define guiding principles and rules that management must follow, including roles, responsibilities, decision rights and processes, and standards or professional practices such as those from the Business Continuity Institute4 (BCI) or the Disaster Recovery Institute International5 (DRII). Governance bodies must further specify the metrics and reports used to measure management’s performance in executing the DR vision and objectives.
When assessing DR governance, the following areas, which are particularly frequent problem areas, are a good start:6
Assessing Management
While governance defines the general rules and principles of DR, management is responsible for implementing and operating an effective DR program. This is arguably the most challenging aspect of DR. The core processes that management has to execute are:
When assessing DR management, one should verify that these processes are carried out according to the mandate given by governance and in compliance with industry good practice. More detailed guidance on assessing any management processes can be found elsewhere.8, 9 To focus DR assessments, one should start with the following processes, where shortcomings are particularly common or severe:
A more technical and in-depth assessment of DR management would include a review of the available backup capacity, technology choices (e.g., use of virtualization), and completeness and correctness of dependency maps.
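As an illustration of what a completeness check on a dependency map might look like, the following is a minimal sketch, not a prescribed audit procedure. It assumes the dependency map is available as a simple mapping from each system to the systems it directly depends on, and that the set of systems covered by the DR program is known. The function name `find_dependency_gaps` and all system names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_dependency_gaps(
    dependency_map: Dict[str, Set[str]],
    dr_covered: Set[str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Check a dependency map against the set of DR-covered systems.

    Returns two sorted lists:
      - (system, dependency) pairs where a DR-covered system relies,
        directly or transitively, on a system that is not DR-covered;
      - DR-covered systems that have no entry in the dependency map.
    """
    uncovered: List[Tuple[str, str]] = []
    unmapped: List[str] = []

    for system in sorted(dr_covered):
        if system not in dependency_map:
            # The map is incomplete: this system was never documented.
            unmapped.append(system)
            continue

        # Walk direct and transitive dependencies, visiting each once.
        seen: Set[str] = set()
        stack = list(dependency_map[system])
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep not in dr_covered:
                uncovered.append((system, dep))
            stack.extend(dependency_map.get(dep, ()))

    return sorted(uncovered), unmapped


# Hypothetical example data, for illustration only.
example_map = {
    "billing-app": {"billing-db", "auth-service"},
    "auth-service": {"ldap"},
    "billing-db": set(),
}
example_covered = {"billing-app", "billing-db", "auth-service", "reporting"}

gaps, missing = find_dependency_gaps(example_map, example_covered)
print("Dependencies outside DR coverage:", gaps)
print("Covered systems missing from the map:", missing)
```

In this example, `ldap` is flagged because two covered systems depend on it while it is not covered itself, and `reporting` is flagged because it is covered but absent from the map. A check of this kind addresses completeness only; whether the recorded dependencies are correct still has to be validated against the actual environment.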
Assessing Technical Operations
While management may be the most challenging aspect of the framework, the operations of technology and infrastructure may be the most straightforward, provided that one has addressed the other two aspects of the framework. Determining whether the backup, restoration and recovery strategies are implemented, managed and maintained in accordance with management’s directive is relatively black and white. The most common shortcomings encountered here are misconfigurations and omissions of all kinds, including:
It has become common to outsource IT operations. As figure 2 shows, this also affects the DR program.
Roles and Responsibilities
As illustrated, the bottom tier (i.e., the technical operation of the DR infrastructure) becomes the sole responsibility of the outsourcing provider. Management is largely the provider’s responsibility, but the client retains certain management rights in, for example, the planning and execution of DR tests or the execution of audits. These management rights depend, however, on DR governance, which is split into client-side, provider-side and shared governance. The shared governance is partially defined by the outsourcing contract and service level agreements (SLAs) and partially left open based on implicit agreements.
An assessment of the technical operations, the client-side governance or the provider-side governance can be done according to the framework outlined previously.
Assessing Shared Governance and Management Processes
Assessments of shared governance and management are more complex, as they involve two parties: the client and the provider. The Achilles heel of such assessments is always the outsourcing contract, which can leave too many issues unspecified and subject to implicit agreements. In extreme cases, the outsourcing contract may not even authorize the client to audit or assess the provider’s DR program. The most visible symptoms of outsourcing relationships that rely too much on implicit agreements rather than contracts and SLAs are dissatisfied clients, poor DR performance and burdened client/provider relationships.
When such situations are encountered, it is of limited value to the outsourcing client to point out all the ways in which the provider’s DR program falls short of the client’s expectations. Unless such shortcomings are in clear violation of the outsourcing contract, it will be difficult, or even impossible, for the client to request rectification or restitution. Therefore, it is recommended to review the outsourcing contract and SLAs and to identify changes that need to be made to the contract when it is renegotiated. Additionally, it is frequently beneficial to review and revise the client’s approach to managing vendor risk, including its criteria for selecting outsourcing providers.10
When the outsourcing contract follows good practices, these issues do not arise and the assessment can proceed very much along the lines of the previously outlined framework. In such cases, it is best to focus specifically on the interface between the client-side and provider-side management teams. This interface is frequently the source of errors and communication breakdowns, which negatively impact the management processes (see Assessing Management section).
In most organizations, the DR function is under constant cost pressure, and efficiency is of the essence. To support this imperative, this article presents a framework that can help auditors as well as executives focus and prioritize their DR assessments. The framework distinguishes the governance, management and operations tiers and shows what issues are most commonly encountered during assessments of each tier. Many of these issues are mistakes or omissions that can result from the inherent complexities and cost pressures of disaster recovery. The framework also makes an important contribution to the field of DR assessment because it adds minimally to these costs and complexities. The framework applies to outsourced IT operations, in which many issues have their origin in overly vague outsourcing contracts. In these cases, it has been shown that the contract must be made more rigorous before any DR-specific issues can be resolved.
The authors thank Elias Schibli and Florian Widmer for their valuable comments on earlier versions of this article.
1 Hiles, Andrew; The Definitive Handbook of Business Continuity Management, 3rd Edition, John Wiley & Sons, 2011
2 McClean, Denis [ed.]; World Disaster Report, International Federation of Red Cross and Red Crescent Societies, 2010, www.ifrc.org/en/publications-and-reports/world-disasters-report/
3 Based on the experiences of the authors
4 Bird, Lyndon [ed.]; Good Practice Guidelines 2010, Global Edition, The Business Continuity Institute, 2010
5 DRI International, Professional Practices, www.drii.org and www.drj.com/GAP/gap.pdf
6 Doughty, Ken; “IT Governance: Pass or Fail?,” ISACA Journal, vol. 2, 2005
7 Op cit, Hiles
8 Ibid.
9 Gregory, Peter H.; CISA Certified Information Systems Auditor All-In-One Exam Guide, McGraw-Hill, 2010
10 Davis, Chris; Mike Schiller; Kevin Wheeler; IT Auditing: Using Controls to Protect Information Assets, 2nd Edition, McGraw-Hill, 2011, ch. 11 and 14
Klaus Julisch, Ph.D., is a manager in Deloitte’s Enterprise Risk Services division. With more than 10 years of experience designing, managing and assessing security and resiliency solutions for Fortune 500 companies, his work has resulted in numerous patents and has been published internationally.
Damian Walch is a director at Deloitte with responsibility for delivering disaster recovery, business continuity, information security and risk-related services. Walch has more than 18 years of experience in the field of information systems, with specialized experience in development and deployment of disaster recovery, high-availability information security governance programs, enterprise risk management programs and regulatory compliance.
The ISACA Journal is published by ISACA. Membership in the association, a voluntary organization serving IT governance professionals, entitles one to receive an annual subscription to the ISACA Journal.
© 2012 ISACA. All rights reserved.