There has been quite a bit of hubbub over WordPress’s recent outage. A number of high-profile blogs, including TechCrunch, GigaOm, CNN, and your very own SmoothSpan, use WordPress. Matt Mullenweg told Read/WriteWeb:
“The cause of the outage was a very unfortunate code change that overwrote some key options in the options table for a number of blogs. We brought the site down to prevent damage and have been bringing blogs back after we’ve verified that they’re 100% okay.”
Apparently, WordPress has three data centers, 1300 servers, and is home to on the order of 10 million blogs. TechCrunch is back and talking about it, but as I write this, GigaOm is still out. Given the nature of the outage, WordPress presumably has to hand-tweak that option information back in for all the blogs that got zapped. If it is restoring from backup, that can be painful too.
While one can lay blame at the doorstep of whatever programmer made the mistake, the reality is that programmers make mistakes. It is unavoidable. The important question is what has been done from an Operations and Architecture standpoint that either reduces or increases the likelihood that such mistakes cause a problem. In this case, I blame multitenancy. When you can make a single code change that zaps all your customers very quickly like this, you had to have help from your architecture to pull it off.
Don’t get me wrong, I’m all for multitenancy. In fact, it’s essential for many SaaS operations. But companies need a plan to manage the risks inherent in multitenancy. The primary risk is the speed with which a change, once rolled out, can affect your entire customer base. When operations are set up so that every tenant is in the same “hotel”, this problem is compounded, because it means everyone gets hit.
What to do?
First, your architecture needs to support multiple hotels, and it needs to include tools that make it easy for your operations personnel to manage which tenants are in which hotels, which codelines run on which hotels (more on that one in a minute), and to rapidly rehost tenants to a different hotel, if desired. These capabilities pave the way for a tremendous increase in operational flexibility that makes it far easier to do all sorts of things and possible to do some things that are completely impossible with a single hotel.
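To make the idea of multiple hotels concrete, here is a minimal sketch in Python of the bookkeeping involved: which tenants live in which hotel, which codeline each hotel runs, and how a tenant gets rehosted. Every name in it (`Hotel`, `HotelRegistry`, `place`, `codeline_for`, the hotel and blog names) is hypothetical and invented for illustration. It says nothing about how WordPress or any other vendor actually does this, and a real system would also have to move the tenant's data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hotel:
    """One isolated deployment that hosts a group of tenants (hypothetical model)."""
    name: str
    codeline: str  # the release this hotel is currently running
    tenants: set[str] = field(default_factory=set)


class HotelRegistry:
    """Tracks which tenants are in which hotel and which codeline each hotel runs."""

    def __init__(self) -> None:
        self.hotels: dict[str, Hotel] = {}
        self.tenant_to_hotel: dict[str, str] = {}

    def add_hotel(self, name: str, codeline: str) -> None:
        self.hotels[name] = Hotel(name, codeline)

    def place(self, tenant: str, hotel: str) -> None:
        """Put a tenant in a hotel; if it already lives elsewhere, this rehosts it."""
        if hotel not in self.hotels:
            raise KeyError(f"unknown hotel: {hotel}")
        old = self.tenant_to_hotel.get(tenant)
        if old is not None:
            self.hotels[old].tenants.discard(tenant)
        self.hotels[hotel].tenants.add(tenant)
        self.tenant_to_hotel[tenant] = hotel

    def codeline_for(self, tenant: str) -> str:
        """Return the codeline the tenant is exposed to via its hotel."""
        return self.hotels[self.tenant_to_hotel[tenant]].codeline


registry = HotelRegistry()
registry.add_hotel("hotel-a", codeline="v1.0")
registry.add_hotel("hotel-b", codeline="v1.1")
registry.place("blog-1", "hotel-a")
registry.place("blog-2", "hotel-a")
registry.place("blog-1", "hotel-b")  # rehost blog-1 onto the newer codeline
```

The point of the sketch is that a bad change to the `v1.1` codeline can only reach tenants in `hotel-b`; everyone in `hotel-a` is untouched until operations decides otherwise.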
Second, I highly encourage the use of a Cloud data center, such as Amazon Web Services. Here again, the reason is operational flexibility. Spinning up more servers rapidly, for any number of reasons, is easy to do. And because it is so cheap to run a lot of extra servers temporarily (for example, to give your customers a beta test of a new release), cost stops being an obstacle.
Last step: use a feathered release cycle. When you roll out a code change, no matter how well-tested it is, don’t deploy to all the hotels at once. A feathered release cycle delivers the code change to one hotel at a time, and waits an appropriate length of time to see that nothing catastrophic has occurred. It’s amazing what a difference a day makes in understanding the potential pitfalls of a new release. Given the operational flexibility of being able to manage multiple hotels, you can adopt all sorts of release feathering strategies. Starting with smaller customers, with brand new customers, with your freemium customers, or with your beta-testing customers are all possibilities that can result in considerable risk mitigation for the majority of your customer base.
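A feathered release can be sketched as a simple loop: deploy to one hotel, wait, check that nothing has gone wrong, and only then move on. The Python below is a rough illustration under stated assumptions, not a real deployment tool. The function `feathered_rollout` and its `deploy`, `is_healthy`, and `wait` callbacks are hypothetical names, and the hotel names and the simulated failure exist only to show the behavior.

```python
from __future__ import annotations

from typing import Callable, Iterable


def feathered_rollout(
    hotels: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wait: Callable[[], None],
) -> list[str]:
    """Deploy to one hotel at a time and halt at the first unhealthy hotel.

    Returns the hotels that received the change and passed the health check.
    """
    passed: list[str] = []
    for hotel in hotels:
        deploy(hotel)
        wait()  # soak period; in practice this might be a day, as noted above
        if not is_healthy(hotel):
            break  # stop here so the remaining hotels never see the change
        passed.append(hotel)
    return passed


# Illustrative run: lowest-risk hotels first, with a simulated failure on "small".
deployed: list[str] = []
order = ["beta", "freemium", "small", "large"]
passed = feathered_rollout(
    order,
    deploy=deployed.append,
    is_healthy=lambda hotel: hotel != "small",
    wait=lambda: None,  # no real waiting in this sketch
)
```

In this run the change reaches `beta`, `freemium`, and `small`, the problem shows up on `small`, and `large` (presumably where most of the customers live) is never touched. The ordering of the list is where the feathering strategy lives.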
If you’re a customer looking at SaaS solutions, ask about their capacity for multiple hotels and release feathering. It just may save you considerable pain.