Thursday, June 14, 2012

Anatomy of a Cloud Failure

Technoracle has published many articles on cloud computing, a technology for virtualizing computer functionality.  Virtualization occurs when the physical topology of a network or system no longer aligns with its logical topology.

Today an event happened that should once again serve as a reminder that cloud computing might not always be the best solution.  About an hour ago, Pacific Daylight Time, reports started surging in that instances of Neo4j in the cloud were failing.  These reports initially focused on Heroku, a generally very reliable cloud provider.  It quickly became apparent that the outages were hitting Amazon Web Services EC2 instances as well as other cloud providers.  The messages at YCombinator show how groups became aware of these outages:

michaelfairley, 48 minutes ago:
This is a more widespread EC2/EBS issue:

DigitalSea, 41 minutes ago:
I couldn't see any red circles indicating an issue with EC2/EBS.

michaelfairley, 37 minutes ago:
The circle is green with a little "note" on it. "8:50 PM PDT We are investigating degraded performance for some volumes in a single AZ in the us-east-1 region."

DigitalSea, 29 minutes ago:
Wouldn't that only affect a small subset of visitors. For example why would I be seeing any issues if I'd be hitting an Asia-pacific volume instead of a us-east region one? Seems like it goes deeper than that.

mechanical_fish, 4 minutes ago:
One problem which we've seen before is: If a large percentage of AWS infrastructure goes down, the customers don't just quietly suffer. Instead they scramble to try and launch infrastructure in other zones or regions, which creates a cascading series of load spikes throughout the AWS system. AWS is a fascinating science experiment. Pity about the websites, though.

michaelfairley, 12 minutes ago:
It's now yellow with this: "9:27 PM PDT We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region." AWS has been historically bad at reporting the severity of their outages promptly.

The conversation can be viewed here -

The spread of this failure from one cloud provider to another in such rapid succession shows how fragile interconnected systems are and how susceptible they are to these types of events.  With time, it is hoped that the lessons learned from events like this one will help us all build better systems.
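
The cascading load spikes that mechanical_fish describes in the thread can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not a description of how AWS actually behaves: the zone names, load figures, and the rule that a failed zone's load is split evenly among the survivors are all hypothetical. It shows that whether one failure stays contained or takes everything down can depend entirely on how much spare capacity the remaining zones had.

```python
def simulate_cascade(zone_loads, capacity, failed_zone):
    """Toy model of a cascading failure.

    Assumptions (hypothetical, for illustration only):
    - every zone has the same capacity;
    - when a zone fails, its load is split evenly across the survivors;
    - any zone pushed above capacity fails in turn.

    Returns (zones that failed, in order; final load of each survivor).
    """
    loads = dict(zone_loads)      # copy so the caller's data is untouched
    failed = []
    pending = [failed_zone]       # zones that are about to fail

    while pending:
        zone = pending.pop(0)
        displaced = loads.pop(zone)
        failed.append(zone)
        if not loads:             # nothing left to absorb the load
            break

        # Customers relaunch elsewhere: spread the load over survivors.
        share = displaced / len(loads)
        for name in loads:
            loads[name] += share

        # Any survivor now over capacity is queued to fail next.
        for name, load in loads.items():
            if load > capacity and name not in pending:
                pending.append(name)

    return failed, loads


# Plenty of headroom: the failure stays contained.
contained = simulate_cascade(
    {"zone-a": 60, "zone-b": 60, "zone-c": 60}, capacity=100, failed_zone="zone-a"
)
print(contained)   # (['zone-a'], {'zone-b': 90.0, 'zone-c': 90.0})

# Slightly busier zones: the same single failure takes down everything.
cascade = simulate_cascade(
    {"zone-a": 70, "zone-b": 70, "zone-c": 70}, capacity=100, failed_zone="zone-a"
)
print(cascade)     # (['zone-a', 'zone-b', 'zone-c'], {})
```

In this model, raising each zone's load from 60 to 70 units is the only difference between a contained outage and a total one, which is one way to picture why failures in tightly coupled systems can spread so quickly.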

At the time of this posting, the event is still unfolding.