Reddit, Foursquare, Engine Yard and Quora were among the many sites that went down recently because of a prolonged outage of Amazon’s cloud services. On Thursday, April 21, Amazon Elastic Block Store (EBS) went offline and took down many of the web and database servers that depended on that storage. Amazon worked aggressively to set things right, and by Sunday, April 24, most of the services had been restored. As promised, and as would be expected, Amazon has now published a detailed explanation of what went wrong, why the failure was so widely felt, and why it took so long to restore all the services. Some say that, measured against Amazon’s promised availability, an outage this long means Amazon would need to maintain full availability for more than a decade to meet its service level commitments.
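The "more than a decade" remark is simple arithmetic, and a rough sketch makes it concrete. The numbers below are assumptions for illustration only: a 99.95% annual uptime target (the figure commonly quoted for EC2 at the time) and a hypothetical 48-hour outage. Neither number is stated above, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years for a back-of-envelope estimate


def allowed_downtime_hours(availability: float) -> float:
    """Hours of downtime per year that still satisfy the availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def years_to_amortize(outage_hours: float, availability: float) -> float:
    """Years of otherwise perfect uptime needed for the long-run average
    to climb back to the availability target after a single outage."""
    return outage_hours / allowed_downtime_hours(availability)


if __name__ == "__main__":
    target = 0.9995        # ASSUMPTION: 99.95% annual uptime target
    outage_hours = 48.0    # ASSUMPTION: hypothetical outage length
    print(f"Allowed downtime per year: {allowed_downtime_hours(target):.2f} hours")
    print(f"Years of perfect uptime needed: {years_to_amortize(outage_hours, target):.1f}")
```

Under these assumptions the yearly downtime budget is a little over four hours, so a two-day outage uses up roughly eleven years of it. A longer outage or a stricter target pushes the figure higher.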
Now, let’s examine what happened and how. To start with some basics: Amazon has its facilities spread around the world. Most users would know that its cloud computing data centers are in five different locations: Virginia, Northern California, Ireland, Singapore, and Tokyo. Within each of these regions, the cloud services are further separated into what Amazon calls Availability Zones. Each availability zone is self-contained, with its own physically and logically separate group of computers. Amazon explains that this arrangement lets customers choose the level of redundancy appropriate to their own needs. Customers who want still more robustness can, for a premium, host their systems in multiple regions. The logic is that hosting in multiple availability zones within the same region should provide robustness comparable to hosting across multiple regions, but at much better economics for the customer.
Amazon offers several services as part of this arrangement, and Elastic Block Store (EBS) is an important one. With EBS, Amazon provides mountable disk volumes to virtual machines running on the better-known Elastic Compute Cloud (EC2). This is attractive to customers because it gives their virtual machines a large amount of reliable storage, typically used for database hosting and the like. The power of this feature can be seen in the fact that it is used not only from EC2 but also by another Amazon service, Amazon Relational Database Service (RDS), as its data store. Amazon has designed EBS for high availability, and it replicates data between multiple systems. Given the volume and variety involved, this process is highly automated. If for some reason an EBS node loses its connection to its replica, alternate storage within the same zone is immediately made available to hold a new replica.
According to Amazon, engineers were making a network configuration change to the zone on April 21 as part of routine maintenance in its Virginia operations. During the process, traffic from the affected routers was apparently moved onto a low-capacity network instead of onto the backup. The low-capacity network is meant for inter-node communication, not for large-scale replication and data transfer within the system, so the additional traffic made it malfunction. With the primary network down for maintenance and the secondary network malfunctioning, the EBS nodes could no longer reach their replicas and lost their ability to replicate. This is where the unintended consequences of automation began to rear their ugly head. Every node in this network behaved as if its data were at risk and began to search frantically for available nodes with free space for replication. By the time Amazon restored the primary network, the damage was done: all the available space within the cluster had been used up, while the remaining nodes kept searching for free space that no longer existed.
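The storm described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of how EBS is actually implemented: every node that lost its replica asks for one free slot per round, slots are handed out until the cluster runs dry, and unsatisfied nodes simply ask again in the next round. The class, function names and numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cluster:
    free_slots: int  # spare replica capacity left in the zone (hypothetical unit)


def remirror_round(cluster: Cluster, nodes_searching: int) -> Tuple[int, int]:
    """Run one round in which every unreplicated node asks for a slot.

    Returns (satisfied, still_searching).
    """
    satisfied = min(cluster.free_slots, nodes_searching)
    cluster.free_slots -= satisfied
    return satisfied, nodes_searching - satisfied


def simulate(free_slots: int, nodes: int, rounds: int) -> List[int]:
    """Return the number of search requests issued in each round."""
    cluster = Cluster(free_slots)
    searching = nodes
    requests_per_round = []
    for _ in range(rounds):
        requests_per_round.append(searching)
        _, searching = remirror_round(cluster, searching)
    return requests_per_round


if __name__ == "__main__":
    # 100 nodes lose their replicas at once, but only 30 spare slots exist.
    print(simulate(free_slots=30, nodes=100, rounds=4))  # [100, 70, 70, 70]
```

Once the spare capacity is gone, the request load never drops: the 70 nodes left over keep asking every round for space that is not there, which is the behavior that kept the pressure on the rest of the system.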
The result was a massive deadlock, with nodes trying to find replicas when no nodes had free space, and it hurt the performance of the control system. The slow control system in turn severely delayed new service requests, such as creating a new volume. A long backlog of requests built up waiting for the control system to act, and over time it reached catastrophic proportions, with some requests being returned with failure messages. Now comes the second and most crucial part of the outage: unlike other services, the control system spans the whole region rather than an individual availability zone. The impact was therefore felt across different availability zones. Remember the idea of a single point of failure? It was demonstrated here in its full might.
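A back-of-envelope queue model shows why a slow control system turns into failed requests. This is a minimal sketch, assuming a single shared request queue for the whole region with a fixed size limit; the function name, the rates and the limit are made up for illustration and do not describe Amazon's actual control plane.

```python
from typing import List, Tuple


def backlog_over_time(
    arrivals_per_tick: int,
    served_per_tick: int,
    queue_limit: int,
    ticks: int,
) -> List[Tuple[int, int]]:
    """Track (queued, total_rejected) after each tick for one shared queue."""
    queued = 0
    rejected_total = 0
    history = []
    for _ in range(ticks):
        queued += arrivals_per_tick             # new requests from every zone
        queued -= min(queued, served_per_tick)  # the control system drains what it can
        if queued > queue_limit:                # overflow is returned as failures
            rejected_total += queued - queue_limit
            queued = queue_limit
        history.append((queued, rejected_total))
    return history


if __name__ == "__main__":
    # Healthy: the control system keeps up, so nothing queues.
    print(backlog_over_time(arrivals_per_tick=3, served_per_tick=4, queue_limit=20, ticks=3))
    # Degraded: arrivals outpace service, the backlog fills, then requests fail.
    print(backlog_over_time(arrivals_per_tick=10, served_per_tick=4, queue_limit=20, ticks=5))
```

Because the queue in this sketch is shared by the whole region, a request from a perfectly healthy zone waits behind the same backlog as requests from the troubled one, which is the single point of failure described above.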
Slowly and deliberately, Amazon began the course correction by tending to the control system and adding more nodes to the cluster. Over time the backlog on the control system was cleared, but it took painful effort and a lot of time. Outages of public cloud systems have made news in the past, but clearly the body of knowledge and the maturity levels ought to improve things over time. Cloud service providers make high availability the cornerstone of their offerings, and this outage in many ways puts such claims in question. Even though this outage hit Amazon’s Virginia operations, many users of AWS there managed to keep their systems available. A majority of those installations had a fallback in the form of coverage across multiple regions or multiple zones. Such moves necessarily bring the cost and complexity equation into consideration.
It’s a little odd to see that when nodes with free space became unavailable, Amazon’s own systems almost launched a denial-of-service attack on their own environment. Amazon now claims that this aspect of its crisis response has been set right, but one may have to wait until the next outage to see what else could give way. It may be noted that Amazon’s cloud services suffered a major outage in 2008, and on diagnosis the failure pattern looks somewhat similar. Clearly, the systems need to operate differently under different circumstances: while it is normal for nodes to keep replicating when there are storage or access concerns, the system ought to behave differently when the crisis is of a different nature. With the increasing adoption of public cloud services, the volume, complexity and range of workloads will certainly increase, and the systems will get tested for availability and reliability under varying circumstances. All business and IT users will seek answers to such questions as they consider moving their workloads onto the cloud.
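One common way to make a system behave differently in a crisis is to have each node wait longer and longer between retries instead of hammering its peers. The sketch below shows capped exponential backoff with random jitter. It illustrates the general technique only and is not a description of what Amazon changed; the function name and parameters are assumptions made for this example.

```python
import random
from typing import List, Optional


def backoff_delays(
    base: float,
    cap: float,
    attempts: int,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) before each retry attempt.

    The ceiling doubles with every attempt up to `cap`, and the actual delay
    is drawn uniformly from [0, ceiling] so that nodes do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # A node that cannot find free space waits before asking again.
    for attempt, delay in enumerate(backoff_delays(base=1.0, cap=60.0, attempts=8)):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

The design choice here is that a node which keeps failing quickly falls silent for up to a minute at a time, so the combined retry traffic shrinks rather than staying constant while the underlying shortage lasts.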
It is interesting to see how Netflix, a showcase customer of Amazon’s cloud services, managed to survive this outage. Netflix says, “When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage”. Netflix admits that although its service ran without intervention, it did so with a higher than usual error rate and higher latency than normal through the morning, which is the low-traffic time of day for Netflix streaming. The major engineering decisions it made to avoid such outages include designing applications to be stateless and maintaining multiple redundant hot copies of the data spread across zones. Netflix calls its approach “Cloud Solutions for the Cloud”: the claim is that instead of fork-lifting its existing applications from its own data centers to Amazon’s and simply using EC2, it has fully embraced the cloud paradigm. Essentially, Netflix has automated its zone fail-over and recovery process and hosted its services in multiple regions, while reducing its dependence on EBS.
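The idea of stateless applications reading from redundant hot copies spread across zones can be sketched in a few lines. This is an illustration of the general pattern only, not Netflix's code: the zone names, the `ZoneUnavailable` exception and the reader functions are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


class ZoneUnavailable(Exception):
    """Raised by a (hypothetical) zone reader when its zone cannot serve the key."""


def read_with_failover(
    key: str,
    replicas: Dict[str, Callable[[str], str]],
    preferred_order: Iterable[str],
) -> Tuple[str, str]:
    """Try each zone's copy in order and return (zone, value) from the first that answers."""
    failed = []
    for zone in preferred_order:
        try:
            return zone, replicas[zone](key)
        except ZoneUnavailable:
            failed.append(zone)  # remember the failure and move to the next copy
    raise ZoneUnavailable(f"all zones failed for {key!r}: {failed}")


if __name__ == "__main__":
    def broken_zone(key: str) -> str:
        raise ZoneUnavailable("storage in this zone is stuck")

    def healthy_zone(key: str) -> str:
        return f"value-of-{key}"

    replicas = {"zone-a": broken_zone, "zone-b": healthy_zone}
    print(read_with_failover("user:42", replicas, ["zone-a", "zone-b"]))
```

Because the application itself holds no state, any instance in any zone can run this read path. The price is keeping every copy hot, which is the cost and complexity trade-off the article keeps returning to.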
Clearly there are ways to get the best out of the cloud, except that they may come with different economics and call for a greater ability to engineer and manage operations. Amazon may have to become more transparent about its design, and its operational metrics need to cover many more areas of operation than the narrow set that users get to see now. To sum up, I would hesitate to call the AWS outage a failure of the cloud, but this journey into the cloud calls for more preparation and a better thought-out design on the user’s side.