Roblox suffers a three-day outage, blames downtime on surge in web traffic, overwhelmed servers

The news: Popular gaming platform Roblox was offline for three days, including the peak Halloween weekend, per The Wall Street Journal.

How we got here: Roblox, an online game platform with more than 43 million daily players, went dark last Thursday and struggled to get back online until late Sunday.

  • More than half of US kids 16 and younger play Roblox, per The Verge. The platform hosts 9.5 million developers and offers over 24 million “experiences,” including games, in-game concerts, promos, and events. 
  • Initial reports attributed the outage to a Chipotle promotion: the restaurant chain was giving away $1 million worth of free burritos through a Roblox event that started an hour before servers went down. Roblox denied that the promo caused the downtime.
  • Roblox founder and CEO David Baszucki said the outage “involved a combination of several factors. A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our back-end service communications while under heavy load … the failure was caused by the growth in the number of servers in our data centers.”

The bigger picture: 2021 has seen an increasing number of internet service outages, exposing the fragility of an overburdened server infrastructure that relies on a handful of service providers. 

In Roblox’s case, a sudden uptick in users and the resulting server expansion seem to have caused a chain reaction that overwhelmed its capacity and kicked users off its servers. The same problem is likely to keep affecting other internet services as they scale.

  • The rise of video streaming, the spike in video game use, and remote workers’ reliance on services like Zoom have placed a palpable strain on internet networks that rely on content delivery networks (CDNs) to reach end users.
  • Outages now last longer and are more complex to solve. Facebook-owned platforms, including WhatsApp and Instagram, went offline for six hours last month, affecting more than 3.5 billion people and even locking some employees out of their offices.

What’s next: Simply adding servers to handle more traffic and users seems to be creating more complex problems for internet infrastructure. Recent outages have also taken longer to resolve, suggesting that massive growth is quickly becoming unmanageable.

  • Building redundancies by diversifying bandwidth providers, anticipating sudden user growth, and increasing network resilience could help web-based services like Roblox recover from outages faster.