August 30, 2011 By Brian Heaton
When part of Amazon’s Elastic Compute Cloud (EC2) crashed on April 21, government agencies in the midst of moving to the cloud received a grim reminder of the need to secure critical databases and files. Although cloud technology is new to many, experts say the concepts behind disaster recovery and prevention remain essentially unchanged.
From having multiple backup contingencies in place to making sure cloud provider service agreements are clear on system redundancies, the same due diligence performed in pre-cloud times is required to ensure that data stays accessible in the event of a crash.
Though many technology professionals want to take advantage of virtualized cloud computing, Terry Weipert, a partner with technology consulting and outsourcing company Accenture, said stronger planning models should be in place first. “You still have to do all the backup and archiving that is needed, just like if you were managing your own hardware or data center,” she explained. “If you are requiring Amazon or another provider to [do] that, you [must] have governance and policy in place to know that they have followed those procedures.”
Thomas Shelford, president of IT services firm Corsis, agrees. He said companies that experienced disruption when Amazon’s EC2 service went down may have been under the false impression that Amazon was too big to fail. “It’s really a cultural issue where a lot of companies feel they don’t need to have a system administrator in-house because the cloud [provider] takes care of redundancies,” he said.
However, that’s a dangerous misconception. “The cloud provides capacity on demand, but not architecture on demand,” Shelford said. “The types of backups you did in the old days still apply.”
While setting policy and enforcing agreements may seem simple enough, the message clearly wasn’t received by all, given the number of private- and public-sector users whose sites went down during Amazon’s crash. Commercial websites, such as the popular location-tagging mobile platform Foursquare, the online knowledge database Quora and the social news website Reddit, were all temporarily offline.
The U.S. Department of Energy’s OpenEI website was another outage casualty. The site, which promotes collaboration in clean energy research, was out of commission for almost two days.
Online movie rental giant Netflix and ShareFile, a file storage and transmittal firm, are both Amazon customers that got through the outage relatively unscathed. They did it by following the old adage of “don’t put all your eggs in one basket.” Both companies had detailed plans in place to handle outages and designed system architectures assuming that failures would ultimately occur, making the situation easier to manage.
1. Incorporate failover for all points in the system. Every server image should be deployable in multiple regions and data centers, so the system can keep running even if there are outages in more than one region.
2. Develop the right architecture for your software. Architectural nuances can make a huge difference to a system’s failover response. A carefully created system will keep the database in sync with a copy of the database elsewhere, allowing for a seamless failover.
3. Carefully negotiate service-level agreements. SLAs should provide reasonable compensation for the business losses you may suffer from an outage. Simply receiving prorated credit for your hosting costs during downtime won’t compensate for the costs of a large system failure.
4. Design, implement and test a disaster recovery strategy. One component of such a plan is the ability to draw on resources, such as failover instances, at a secondary provider. Provisions for data recovery and backup servers are also essential. Run simulations and periodic testing to ensure your plans will work.
5. In coding your software, plan for worst-case scenarios. In every part of your code, assume that the resources it needs to work might become unavailable, and that any part of the environment could go haywire. Simulate potential problems in your code, so that the software will respond correctly to cloud outages.
6. Keep your risks in perspective, and plan accordingly. In cases where even a brief downtime would incur massive costs or impair vital government services, multiple redundancies and split-second failover can be worth the investment. In other cases, keep in mind that eliminating the risk of a brief failure can be quite costly.
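As a rough illustration of tips 1 and 5, the sketch below tries a request against a list of endpoints in order and moves on to the next one when a call fails. It is a minimal example under stated assumptions, not any company's actual implementation: the function names, the exception class and the `example.test` host names are all hypothetical, and the outage is simulated rather than real.

```python
from typing import Callable, List, Sequence


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the list has been tried and failed."""


def call_with_failover(
    endpoints: Sequence[str],
    request: Callable[[str], str],
    attempts_per_endpoint: int = 2,
) -> str:
    """Try each endpoint in order; return the first successful response."""
    errors: List[str] = []
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return request(endpoint)
            except Exception as exc:
                # Assume any failure means the resource is unavailable,
                # record it, and keep going instead of crashing.
                errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


def simulated_request(endpoint: str) -> str:
    """Hypothetical stand-in for a network call; one region is 'down'."""
    if endpoint == "us-east.example.test":
        raise ConnectionError("simulated regional outage")
    return f"ok from {endpoint}"


if __name__ == "__main__":
    regions = ["us-east.example.test", "us-west.example.test"]
    print(call_with_failover(regions, simulated_request))
```

Passing the request function in as a parameter makes it easy to simulate an outage in a test, which is the kind of rehearsal tip 5 recommends.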
ShareFile has a farm of server instances spread out across Amazon’s East Coast and West Coast data centers. Although the company hosts its operational databases at a co-located data center near its headquarters in North Carolina, all of its clients’ files are in the cloud.
To protect those files — and ensure that client uploads and downloads are done without interruption — ShareFile created a proprietary “heartbeat” system that pings each of the servers in the cloud to verify that they’re online and responding to the requests. It’s a technology that’s been around for decades. While the system gives ShareFile more information than a simple “yes, I’m here” response, that’s all it really boils down to. If the response is less than satisfactory, or there isn’t a response at all, the company drops that server.
Amazon has a variety of backup options now, but ShareFile CEO Jesse Lipson said that when his company agreed to be an Amazon cloud beta customer years ago, there wasn’t a backup system in place, so the company developed its own.
“The good thing about it is that the system is pretty fault tolerant,” Lipson said. “If a server goes offline for any reason, it’s likely to disrupt only a small number of customers because we’re heartbeating the servers every minute. It’ll be dropped out, and even if, by chance, a customer caught it in that minute, all they’d have to do is try again, and the upload and download will work.”
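ShareFile's heartbeat system is proprietary and its details are not public, but the general idea can be sketched in a few lines. The example below is an assumption-laden illustration: the `ServerPool` class, the `drop_after` parameter and the probe function are hypothetical names, and the probe is a plain callable rather than a real network check.

```python
from typing import Callable, Dict, Iterable, List


class ServerPool:
    """Tracks active servers and drops those that miss heartbeats."""

    def __init__(self, servers: Iterable[str], drop_after: int = 1) -> None:
        # drop_after: consecutive failed heartbeats before a server is dropped.
        self.drop_after = drop_after
        self.missed: Dict[str, int] = {name: 0 for name in servers}

    def heartbeat(self, probe: Callable[[str], bool]) -> List[str]:
        """Probe every active server once and return the names dropped."""
        dropped: List[str] = []
        for name in list(self.missed):
            try:
                healthy = probe(name)
            except Exception:
                # No response at all counts the same as a bad response.
                healthy = False
            if healthy:
                self.missed[name] = 0
                continue
            self.missed[name] += 1
            if self.missed[name] >= self.drop_after:
                del self.missed[name]
                dropped.append(name)
        return dropped

    def active(self) -> List[str]:
        """Return the servers still eligible to receive traffic."""
        return sorted(self.missed)


if __name__ == "__main__":
    pool = ServerPool(["east-1", "east-2", "west-1"])
    # Hypothetical probe: pretend east-2 has stopped responding.
    print(pool.heartbeat(lambda name: name != "east-2"))
    print(pool.active())
```

In a setup like the one Lipson describes, a scheduler would call `heartbeat` about once a minute, and the probe would send a real request to each server and check that the reply is satisfactory.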
ShareFile also saves every file that’s uploaded or downloaded by customers into a disaster recovery data center outside the cloud. Though Lipson admitted that the practice is duplicative and expensive, it’s an extra layer of security that adds to ShareFile’s — and its customers’ — peace of mind. “We didn’t have to use it during the EC2 crash,” Lipson said, “but the long-term idea is that we could recover files from completely outside of Amazon.”
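The practice of keeping a second copy outside the primary provider can be illustrated with a small sketch. It is not ShareFile's code: two local directories stand in for the cloud store and the disaster recovery data center, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def store_with_dr_copy(source: Path, primary_dir: Path, dr_dir: Path) -> str:
    """Copy a file to both locations and verify each copy by checksum."""
    digest = sha256_of(source)
    for target_dir in (primary_dir, dr_dir):
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != digest:
            raise IOError(f"checksum mismatch writing {target}")
    return digest


def recover(name: str, primary_dir: Path, dr_dir: Path) -> bytes:
    """Read from the primary store, falling back to the DR copy."""
    for directory in (primary_dir, dr_dir):
        candidate = directory / name
        if candidate.exists():
            return candidate.read_bytes()
    raise FileNotFoundError(name)
```

The duplication doubles the storage cost, which matches Lipson's description of the approach as expensive, but `recover` still succeeds when the primary location is gone entirely.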
Netflix’s story is similar to ShareFile’s. When Netflix moved to the cloud, its staff foresaw the likelihood of such a cloud crash and designed its system around the possibility. In its Tech Blog, Netflix representatives said the company’s IT architecture avoids using Elastic Block Store — which provides Amazon cloud users with persistent storage — as its main storage service. Instead, Netflix uses a mix of storage services and a distributed management system to ensure redundancy.
Staff at Netflix admitted in a blog post that there was a bit of internal scrambling to manually reroute customer traffic during the outage, and the company is looking at automating much more of the process.
Despite the Amazon crash, the experts were unanimous that the cloud is still the way to go in the future. Weipert said that while the process of backing up data is “definitely more involved,” the learning curve can be somewhat overcome by keeping the process simple.
“Just like your current environment, you still have the same issues of trying to dynamically manage an event,” Weipert said of a potential cloud crash. “You don’t lose data in the cloud. There is a way to trace in the cloud computing environment, but you really have to have a plan and be able to do things dynamically.”
Shelford agreed and stressed that the Amazon crash and others should be treated as lessons learned.
“Cloud computing offers significant cost-savings opportunities for government institutions that should be taken advantage of,” Shelford said. “The lesson here is that countermeasures were relatively easy to implement. The moral of the story is, you’ve got to stick with traditional best practices.”