Recovery in the Cloud 

 

In a previous column, I wrote about information security in the age of cloud computing.1 That era, of course, is still in the future, but if present trends continue, the Age of Cloud Computing will soon be upon us. Any organization planning to utilize cloud computing services should be well aware of the risks and should implement a robust control structure to counter them. Among the foremost risks is disruption of service, which includes both downtime and data loss.

Promise and Concern

There are aspects of cloud computing that raise concern, and others that promise significant recoverability. Cloud computing is a set of utility services backed by a virtualized infrastructure that is highly scalable in response to the ebb and flow of business demands. It is geographically dispersed, designed for user self-service and (most important for a discussion of recoverability) self-healing.2 The internal structure of cloud computing is based on an architecture in which services are offered from multiple physical data centers, all of this masked from the end user. Data are routinely replicated from site to site, so that if one location is disrupted, damaged or destroyed, one or more of the others can continue processing.

Virtualization is the basis for recovery. A vendor can rapidly re-create an entire processing environment by porting the virtualized image from one location to another, including the server(s), storage and network interface.

If It Works

This seems like all promise, no risk; the concern derives from a single, significant caveat: if it works. And sometimes, it does not work. A few recent examples demonstrate the potential fragility of cloud computing:

  • In October 2009, the telecommunications carrier T-Mobile lost the data it was storing for users of Microsoft’s Sidekick system. Although a solution was eventually developed, some individuals lost their contact lists, calendars, notes, tasks, photographs, etc.3
  • In December 2009, Amazon’s EC2 cloud services were disrupted for six hours by a power failure in its Virginia, USA, data center. Due to connectivity issues, the redundancy protection designed into Amazon’s system failed to utilize unaffected data centers.4, 5
  • Workday, a small Software as a Service (SaaS) provider, suffered a 15-hour outage in September 2009. Ironically, the cause was the backup to a system with built-in redundancy, which took itself offline.6

In a sense, it is unfair to highlight the failures of cloud computing services, inasmuch as organizations’ internal computing systems are no less vulnerable to failure than those of service providers. The fact that specific cloud computing failures are newsworthy does point out the relative rarity of outages, even (perhaps particularly) at this early stage of the technology. Fair or unfair, the risk of failure is a concern that every acquirer of cloud computing services should recognize and respond to. Specifications of mean time to failure, mean time to repair and service level agreements should be part of every contract with a cloud computing provider. And every customer should have some independent, audited assurance that the services acquired are resilient or at least readily recoverable within defined time frames.
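To make the contract terms concrete, the standard steady-state relationship between mean time to failure (MTTF), mean time to repair (MTTR) and availability can be worked through in a few lines. This is a minimal sketch, not a prescribed method: the function names, the 30-day period and the 99.9 percent target are illustrative assumptions and do not come from any vendor's contract.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a service is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def allowed_downtime_hours(sla_availability: float, period_hours: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0.0 <= sla_availability <= 1.0:
        raise ValueError("availability target must be between 0 and 1")
    return (1.0 - sla_availability) * period_hours


def meets_sla(downtime_hours: float, period_hours: float,
              sla_availability: float) -> bool:
    """True if the observed downtime fits within the downtime budget."""
    return downtime_hours <= allowed_downtime_hours(sla_availability, period_hours)


if __name__ == "__main__":
    period = 30 * 24  # hypothetical 30-day measurement period, in hours
    outage = 6        # a six-hour outage, like the one described above
    print(f"availability over the period: {availability(period - outage, outage):.4f}")
    print(f"budget at a 99.9% target: {allowed_downtime_hours(0.999, period):.2f} hours")
    print(f"meets a 99.9% target: {meets_sla(outage, period, 0.999)}")
```

Under these assumed figures, a single six-hour outage consumes far more than a month's downtime budget at a 99.9 percent target, which is why the measurement period and the definition of downtime deserve as much attention in the contract as the headline percentage.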

Auditing Cloud Computing Recoverability

Of course, gaining that assurance is not a simple matter. A customer might try to include a right to audit in a contract, but this is difficult to obtain. If audits were possible, it would then fall to that company’s internal auditors to determine that the terms of the contract are met, which again would be quite challenging from an outside viewpoint. Third-party audits are another solution, with the proviso that the scope and objectives of such an audit explicitly include recoverability. This is often not the case,7 and so the terms of a third-party audit should be carefully scrutinized. Frankly, I am not aware of any cloud computing vendor that offers a third-party audit report of its recoverability practices. I would welcome learning of vendors who do so.

No such audit can offer assurance that a computing service would never fail, nor can it offer assurance of the amount of recovery time if a failure does occur. But certain assertions are auditable: the number and frequency of recovery tests, the time interval for replication among sites, the existence of a disaster recovery plan with specified roles for emergencies, the training of staff in their emergency roles, and so on. It would be valuable to have assurance that a cloud computing vendor could use its data centers to reinforce one another. Do the data centers each have sufficient (i.e., excess) capacity? Are they far enough from one another that an incident affecting one will not also disable the other(s)? Contrariwise, if instantaneous recovery with no data loss is a customer requirement, are the data centers close enough to each other to make this possible?
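The capacity and distance questions lend themselves to simple arithmetic once a vendor discloses the underlying figures. The sketch below is an illustration under stated assumptions: the `DataCenter` record, its capacity and load units, and the sample sites are all hypothetical, and the distance is a great-circle (haversine) approximation rather than a measure of network latency or shared regional risk.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius used by the haversine formula


@dataclass(frozen=True)
class DataCenter:
    name: str
    lat: float       # latitude in degrees
    lon: float       # longitude in degrees
    capacity: float  # total processing capacity (arbitrary units, assumed)
    load: float      # current load in the same units


def distance_km(a: DataCenter, b: DataCenter) -> float:
    """Great-circle distance between two sites (haversine formula)."""
    lat1, lat2 = math.radians(a.lat), math.radians(b.lat)
    dlat = lat2 - lat1
    dlon = math.radians(b.lon - a.lon)
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def can_absorb_failure(sites: list[DataCenter], failed_name: str) -> bool:
    """True if the surviving sites have enough spare capacity for the failed site's load."""
    failed = next(s for s in sites if s.name == failed_name)
    spare = sum(s.capacity - s.load for s in sites if s.name != failed_name)
    return spare >= failed.load


if __name__ == "__main__":
    # Hypothetical sites and figures, for illustration only.
    sites = [
        DataCenter("east", 0.0, 0.0, capacity=100, load=60),
        DataCenter("west", 0.0, 1.0, capacity=100, load=50),
        DataCenter("north", 1.0, 0.0, capacity=100, load=20),
    ]
    for site in sites:
        print(site.name, "loss can be absorbed:", can_absorb_failure(sites, site.name))
    print(f"east to west: {distance_km(sites[0], sites[1]):.1f} km")
```

An auditor would compare the computed separation against two thresholds that pull in opposite directions: a minimum set by the disaster scenarios in scope and a maximum set by the replication method the customer requires.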

Auditors should note that the questions to be answered in an examination of a cloud computing vendor are the same as those for internal recovery capabilities. The audit challenges are the same as those for outsourcing, which is an attribute of cloud computing, but not its definition. The core of the cloud computing promise is that the use of computing services is distinct from the ownership and operation of equipment and facilities.

RaaS

Use of the cloud as an alternative to recovery data centers is an intriguing but relatively little-explored possibility.8 I refer to it as RaaS, recovery as a service. A customer would acquire storage as a service to back up all or part of its data. In normal times, only one application would run, constantly updating the virtualized database, either through straightforward replication or by applying log files. If ever the customer’s primary data center were incapacitated, its applications, infrastructure and network would be inflated in the cloud to continue operations while the primary site was being recovered or repaired. Such an arrangement would provide exceptional flexibility for customers and reduce the level of effort of disaster recovery testing. In fact, an organization could run recovery tests as often as it would like, from the comfort of a tester’s office.
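The idea of keeping a cloud copy current "by applying log files" can be illustrated with a toy model. This is a sketch under assumptions of my own, not a description of any vendor's service: `LogRecord`, `Replica` and the sequence-number scheme are hypothetical names, and the in-memory dictionary stands in for the virtualized database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    seq: int    # position of the change in the primary's log
    key: str
    value: str


@dataclass
class Replica:
    """Stand-in for the copy of the database kept in the cloud."""
    data: dict = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, records: list[LogRecord]) -> int:
        """Apply shipped log records in order; return how many were applied."""
        applied = 0
        for rec in sorted(records, key=lambda r: r.seq):
            if rec.seq <= self.last_applied:
                continue  # already applied, so a re-sent batch is harmless
            if rec.seq != self.last_applied + 1:
                break     # gap in the log: wait for the missing records
            self.data[rec.key] = rec.value
            self.last_applied = rec.seq
            applied += 1
        return applied

    def lag(self, primary_seq: int) -> int:
        """Changes on the primary that would be lost if it failed right now."""
        return max(0, primary_seq - self.last_applied)


if __name__ == "__main__":
    replica = Replica()
    replica.apply([LogRecord(1, "order-17", "open"), LogRecord(2, "order-17", "paid")])
    print(replica.data, "lag:", replica.lag(primary_seq=3))
```

The `lag` figure is the quantity a recovery test would want to observe, since it bounds the data lost if the primary site failed at that moment.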

The economics of RaaS are not sufficiently advanced to make this a routinely adopted service. The cost of storage, even in virtualized form, is an inhibitor, as is the expense of software and network capacity to replicate databases. These strike me as obstacles that will be overcome as the service, and cloud computing as a whole, matures.

I believe that recoverability as such is not a central problem of cloud computing; recoverability is intrinsic to the architecture of the cloud. Reliability is the real challenge, changing “if it works” to “it works.” Reliability is a matter of confidence, which in turn comes from experience. In time, we will consider the cloud as reliable as, say, a company’s centralized computers. This may not be very reassuring, but we have learned to survive with that level of reliability in our data centers. The cloud may be much better…when it works.

Endnotes

1 “Cloudy Daze,” ISACA Journal, ISACA, USA, vol. 1, 2010
2 Burgener, Eric; “Replication and Cloud Computing Are Inseparable,” April 2009, www.infostor.com/index/articles/display/6599123393/articles/infostor/backup-and_recovery/cloud-storage/replication-and_cloud.html
3 Vance, Ashlee; “Some Users May Lose Data on a T-Mobile Smartphone,” New York Times, USA, 11 October 2009
4 Thibodeau, Patrick; “Amazon’s Data Center Outage Reads Like a Thriller,” Computerworld, 11 December 2009, www.computerworld.com/s/article/9142154/Amazon_s_data_center_outage_reads_like_a_thriller
5 Kay, Mike; “Cloud Computing: Harnessing the Storm,” HP, 5 January 2010, www.communities.hp.com/online/blogs/information-faster/archive/2010/01/05/cloud-computingharnessing-the-storm.aspx
6 Weier, Mary Hayes; “Who Do You Blame for Cloud Computing Failures?,” InformationWeek, 12 October 2009, www.informationweek.com/cloud-computing/blog/archives/2009/10/who_do_you_blam.html
7 I have tried to avoid references to specific audit standards, which differ from country to country. In the US, a Service Auditor Report (SAS 70) or, in Canada, a CICA 5970 report, generally excludes recoverability from the scope of a system of internal controls. Readers should consult the auditing standards in each country as appropriate. Overall, it would be best to obtain an audit specifically focused on recovery capabilities.
8 Some vendors do tout their recoverability capabilities, and some are even offering recovery services such as I describe here. But, these are very recent announcements and raise as many questions as they answer.

Steven J. Ross, CISA, MBCP, CISSP, a retired director from Deloitte, is the founder of Risk Masters Inc. He can be reached at stross@riskmastersinc.com.



The ISACA Journal is published by ISACA. Membership in the association, a voluntary organization serving IT governance professionals, entitles one to receive an annual subscription to the ISACA Journal.

Opinions expressed in the ISACA Journal represent the views of the authors and advertisers. They may differ from policies and official statements of ISACA and/or the IT Governance Institute® and their committees, and from opinions endorsed by authors’ employers, or the editors of this Journal. ISACA Journal does not attest to the originality of authors’ content.

© 2010 ISACA. All rights reserved.
