The Great Facebook Crash of 2010
If you were unaware, Facebook crashed Thursday. I heard about the crash on the radio, the news, online, Twitter, smoke signals, and Morse code, to name a few. We were all disconnected for a very painful and never-ending 2.5 hours. I’m going to open up to Facebook…ready? I took you for granted, my very special friend, and without you I was forced to work for those entire 2.5 hours nonstop. And that, my friends, was just wrong. So what did our friends at Facebook have to say for themselves? Check out this blog post from one of their engineers yesterday:
Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
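For the curious, here is a rough Python sketch of the failure mode the engineer describes. It is only a toy model, not Facebook's actual code: the names `is_valid`, `PersistentStore`, `Cache`, `client_read`, and `simulate` are all made up for illustration, and the "positive integer" validity rule is an assumption. It just counts how many times the persistent store gets queried when the bad value lives only in the cache versus when it lives in the store itself.

```python
from dataclasses import dataclass


def is_valid(value):
    # Hypothetical validity rule: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


@dataclass
class PersistentStore:
    value: object
    queries: int = 0  # number of reads that reached the store

    def read(self):
        self.queries += 1
        return self.value


@dataclass
class Cache:
    value: object = None


def client_read(cache, store):
    """Return the config value, repairing the cache from the store if it looks invalid."""
    if is_valid(cache.value):
        return cache.value
    # The cached value looks bad, so "fix" it with the persistent copy.
    fresh = store.read()
    cache.value = fresh
    return fresh


def simulate(store_value, cached_value, requests):
    """Run a number of client reads and report how many hit the persistent store."""
    store = PersistentStore(store_value)
    cache = Cache(cached_value)
    for _ in range(requests):
        client_read(cache, store)
    return store.queries


# Bad value only in the cache: a single store query repairs it.
transient = simulate(store_value=10, cached_value=-1, requests=1000)

# Bad value in the persistent store: every request falls through to the store.
persistent = simulate(store_value=-1, cached_value=-1, requests=1000)

print(transient, persistent)
```

In this toy version the transient case costs one query and the persistent case costs one query per request, which is the same shape as the flood that overwhelmed the database cluster. One common way to soften this kind of loop is to limit how often a client retries the repair, though the quoted post does not say that is what Facebook did.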
Ouch, the crash sounded painful, but at least we got our friend back. I’m going to let you off easy this time, Facebook, but please do not let it happen again. Do you understand how long 2.5 hours is without being connected? No posting, no pictures, no friends…geez, no life.