I’m reading through Twitter streams, Amazon Forums, and other news sources trying to get a sense of how users are responding and what their problems are. It’s pretty appalling out there. B2B companies are admitting they have no recent backups and just have to wait for service to come back online. One company that does cardiac monitoring based in the Amazon Cloud claims patients’ lives are at stake and is desperately seeking assistance. The list goes on.
There’s some basic insurance any company using the Amazon Cloud needs to take out the first chance they get. It’s not hard and it’s not expensive. It’s also not push-button hot failover to multiple Clouds, and it won’t fix your problems if you’re caught in the current outage. But it will at least give you a little more maneuvering room. Many of the accounts I’m reading boil down to a lack of options other than waiting, because there is no accessible backup data. In other words, they’d love to bring up their sites again in another Amazon Region, but they can’t because they’re missing access to a reasonably current data backup, or the Amazon Machine Images are all in the affected region, or issues along those lines.
Companies need the Cloud equivalent of offsite backup. At a minimum, you need to be sure you can get access to a backup of your infrastructure: all the AMIs and data needed to restart. Storage is cheap. Heck, if you’re totally paranoid, turn the tables and back up the Cloud to your own datacenter, which consists of just the backup infrastructure. At least that way you’ve always got the data. Yes, there will be latency issues and that data will not be up to the minute. But look at all that’s happened. Suppose you could’ve spun up in another region having lost 2 hours of data. Not good, not good at all. But is it really worse than waiting over 24 hours, or would you be feeling blessed about now if you could’ve done it 2 hours into the emergency? These are the kinds of trade-offs to be thinking about for disaster recovery. It’s chewing gum and baling wire until you get an architecture that’s more resilient, but it sure beats not having any choices and waiting.
Another thing: make sure you test your backups. Do they restore? Can you go through the exercise of spinning up in another region to see that it works? Don’t just test once and forget about it. Pick an interval and retest. Make it routine so you know it works.
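To make that routine concrete, here is a minimal sketch of the two checks I have in mind: does a restored file match the checksum recorded at backup time, and has the retest interval elapsed? Everything here is an assumption for illustration. The function names, the file names, the dates, and the 30-day interval are all hypothetical, and a plain file copy stands in for whatever your real restore process is.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_is_intact(restored: Path, expected_sha256: str) -> bool:
    """Compare a restored file against the checksum recorded at backup time."""
    return sha256_of(restored) == expected_sha256


def retest_due(last_test: datetime, now: datetime, interval: timedelta) -> bool:
    """True once the chosen retest interval has elapsed since the last drill."""
    return now - last_test >= interval


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical data file; record its checksum alongside the backup.
        source = Path(workdir) / "data.db"
        source.write_bytes(b"customer records")
        recorded = sha256_of(source)

        # A plain copy stands in for a real restore into another region.
        restored = Path(workdir) / "restored.db"
        shutil.copy(source, restored)
        print("restore intact:", restore_is_intact(restored, recorded))

    # Hypothetical schedule: retest every 30 days.
    last_drill = datetime(2020, 1, 1)
    print("retest due:", retest_due(last_drill, datetime(2020, 2, 15), timedelta(days=30)))
```

A checksum only proves the bytes came back. It does not prove the site actually starts in another region, so treat this as the cheapest first rung of the drill, not the whole thing.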
Staging all the data to other locations is not that expensive compared to continuously running dual failover infrastructure. That’s one of the beauties of elasticity.
There’s a lot of grumbling about how hard and how expensive it is to fail over to other regions. Nothing is harder than explaining to your customers why your site is down. So at least get some cheap insurance in place so you have options the next time this happens. And there will be a next time, no matter whether it is Amazon, some other Cloud provider, or your own datacenter. There is always a next time.
While you’re at it, consider some other cheap insurance:
– Do you have a way to communicate with your customers when your site is down? An ops blog that you’re sure is hosted in a different cloud is cheap and cheerful.
– Can you at least get your web site home page showing? Think about how to arrange DNS access and hosting that don’t rely 100% on one Cloud provider.
– Is there something about your app that would make partial access in an outage valuable? For example, on a customer service app, being able to log trouble tickets as email during an outage or scheduled downtime would be extremely helpful. Mail is cheap and easy to offer as alternate infrastructure, and it is also easy to imagine piping the email messages through a converter that would file them as tickets when the site came back up. It’s not hard to imagine being able to queue many kinds of transactions this way in an emergency. What are the key limited-functionality areas your users will want to have access to in an emergency?
– For some apps, it is easier to provide high availability for reading than for writing. Can you arrange that in an emergency, reading is still possible, just not writing or creating new objects? Customers are a lot more tractable if they know they still have access to their data, but just can’t create new data for a while. For example, a bookmarking site that lets me access my bookmarks but not create new ones during an outage is much less threatening than one that just brings up its Fail Whale equivalent on me.
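The email-to-ticket converter mentioned in the list above could be sketched roughly like this. It is only an illustration using Python’s standard `email` package: the `Ticket` fields, the function names, and the sample message are all hypothetical, and a real converter would also have to handle multipart mail, attachments, and duplicates.

```python
from dataclasses import dataclass
from email import message_from_string
from email.utils import parseaddr
from typing import Iterable, List


@dataclass
class Ticket:
    """Hypothetical ticket record; a real app would have its own schema."""
    customer: str
    subject: str
    body: str


def email_to_ticket(raw_message: str) -> Ticket:
    """Turn one raw email queued during the outage into a ticket."""
    message = message_from_string(raw_message)
    _, address = parseaddr(message.get("From", ""))
    payload = message.get_payload()
    # Multipart messages return a list of parts; this sketch keeps plain text only.
    body = payload if isinstance(payload, str) else ""
    return Ticket(
        customer=address,
        subject=message.get("Subject", "(no subject)"),
        body=body.strip(),
    )


def replay_queue(raw_messages: Iterable[str]) -> List[Ticket]:
    """File every queued email as a ticket once the site is back up."""
    return [email_to_ticket(raw) for raw in raw_messages]


if __name__ == "__main__":
    # Made-up sample message for illustration.
    sample = (
        "From: Pat Example <pat@example.com>\n"
        "Subject: Cannot log in\n"
        "\n"
        "The login page times out.\n"
    )
    for ticket in replay_queue([sample]):
        print(ticket)
```

The point of the design is that the mailbox is the queue: nothing has to be running on your side during the outage except mail delivery, and the replay happens whenever you are back.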
Welcome to the world of Disaster Recovery. Disasters have a User Experience too. Have you planned your customers’ Disaster UX yet?