Overwhelmed network devices took down Amazon Web Services, affecting millions

Article by Gadjo Sevilla | Dec 14, 2021

The news: Amazon Web Services (AWS) has provided a post-mortem on the massive outage that took out a bevy of services for seven hours last week.

How we got here: AWS stated that a large surge of connection activity and congestion overwhelmed servers and network connections, leading to a widespread outage.

Last week’s outage originated in AWS’ US-East region in Northern Virginia, which has a reputation for failures. Known internally as IAD, the location opened in 2006, just as Amazon launched AWS, per Insider.
Services running on AWS servers that went offline included Disney Plus and Delta Air Lines; online games like PUBG, League of Legends, and Valorant; various Amazon services like Kindle ebooks, Amazon Music, and Ring cameras; as well as Tinder, Roku, Coinbase, and Venmo, among others.
US-East has the largest concentration of AWS data centers in the world but has become an inside joke among some employees for how often it needs fixes. A major AWS customer told Insider that the IAD data centers are typically "a large failure point" for those who rely on US-East as their primary AWS region.
At least nine of the 17 largest outages in AWS history originated from IAD data centers, according to AWS Maniac, a blog tracking AWS service disruptions.
AWS is responding by disabling the scaling activities that triggered last week’s event.

The problem: We’re starting to see cracks in cloud infrastructure providers' ability to keep a growing list of internet services and apps online. AWS’ recent outage exposes the fragility of busy data center regions.

The proliferation of streaming services, online gaming platforms, IoT devices, and online services is taking its toll on internet infrastructure that is either several decades old or simply can’t scale fast enough to meet demand.
Recent outages are also taking longer to resolve, which suggests that this rapid growth is becoming unmanageable.

The bigger picture: We’re seeing what happens when highly centralized data center regions are overwhelmed, and how over-reliance on a handful of providers can take down wide swaths of the internet when their systems fail.

AWS hopes to track and communicate outages of this magnitude more effectively. “We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture across multiple AWS regions,” AWS said.