In its most severe outage in years, Facebook is experiencing service issues across its network; the disruption has been largely resolved but is still ongoing for some users. It was initially attributed to a flaw in the platform’s configuration value system, an automated mechanism designed to verify and update configuration values across its infrastructure.
CNN is reporting that more than 500,000 users have experienced a loss of service to the technology giant’s platform. Those affected are completely unable to log into Messenger, Facebook, or Instagram. The trolling has already begun on platforms like X, where Elon Musk posted, “If you’re reading this post, it’s because our servers are working.” The issue also landed on Super Tuesday, one of the most important political dates ahead of the general election.
“We’re aware people are having trouble accessing our services. We are working on this now,” Meta spokesperson Andy Stone wrote in a post on the social media site X Tuesday. The outages are widespread and visible through Meta’s own MetaStatus site, which tracks issues across the network. At one point, Meta was reporting major issues for Ads Manager, Facebook and Instagram Shops, Meta Business Suite, Instagram Boost, Meta Admin Center, Facebook Login, and the Graph API. The hashtags #instagramdown and #facebooknotworking are both trending across social media, with shares down a reported 1% already, according to CNBC.
According to Facebook’s technical explanation, the system malfunctioned when attempting to rectify an error condition. The system’s primary function is to detect and replace invalid configuration values in the cache with updated values from the persistent store. However, it failed to handle cases where the persistent store itself contained invalid values.
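To make the flaw concrete, here is a minimal Python sketch of a cache-repair mechanism of the kind described above. All names, the validity rule, and the data structures are illustrative assumptions, not Facebook’s actual code; the point is only that the repair path silently trusts the persistent store.

```python
# Hypothetical sketch of the repair mechanism described in the article.
# The validity rule and all names are illustrative assumptions.

VALID = {"on", "off"}  # assumed set of acceptable values, for illustration

persistent_store = {"feature_flag": "enabled"}  # the store itself holds an invalid value
cache = {"feature_flag": "banana"}              # stale, invalid cached copy

def is_valid(value):
    return value in VALID

def read_config(key):
    """Return the cached value, 'repairing' it from the persistent store
    when the cached copy looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        # Flaw: this assumes the persistent store always holds a valid
        # value. When the store itself is bad, the "repair" re-installs
        # an invalid value and every client keeps retrying.
        value = persistent_store[key]  # simulated database query
        cache[key] = value
    return value

print(read_config("feature_flag"))  # repaired to "enabled" -- still invalid
```

The repair works fine for a transient cache glitch, but once the persistent copy is bad, every read re-queries the backing store, which is exactly the failure mode the postmortem describes.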
This oversight triggered a domino effect: a change to a configuration value was misinterpreted as invalid, prompting every client to query a database cluster simultaneously. The resulting influx of queries overwhelmed the cluster, creating a feedback loop in which each error generated still more queries.
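The feedback loop can be simulated in a few lines. In this sketch (all names and numbers are illustrative assumptions), a failed database query is misread as “the value is invalid,” which evicts the cache key and guarantees that the next client queries the overloaded database again:

```python
# Minimal simulation of the feedback loop: a query error is treated as
# an invalid value, so the cache key is deleted and the next reader must
# hit the database again. Names and capacities are assumptions.

class OverloadedDB:
    def __init__(self, capacity):
        self.capacity = capacity   # queries it can serve before timing out
        self.queries = 0

    def query(self, key):
        self.queries += 1
        if self.queries > self.capacity:
            raise TimeoutError("db overloaded")
        return "valid"

db = OverloadedDB(capacity=5)
cache = {}
evictions = 0

for client in range(20):           # 20 clients all try to "fix" the value
    try:
        cache["cfg"] = db.query("cfg")
    except TimeoutError:
        # Bug: the error is interpreted as an invalid value, so the key
        # is evicted, forcing yet another database query next time.
        cache.pop("cfg", None)
        evictions += 1

print(evictions)  # 15 of the 20 clients hit the error path and re-evict
```

Only the first five clients get served; every later one fails, evicts the key, and keeps the storm going even after the original bad value is fixed, mirroring the loop described above.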
To mitigate the situation, Facebook had to halt traffic to the affected database cluster, effectively shutting down the site until the databases could recover. As a short-term measure, they disabled the system responsible for correcting configuration values and pledged to explore new designs to prevent similar incidents.
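A standard design pattern for breaking this kind of loop, of the sort the postmortem alludes to when it mentions systems that “deal more gracefully with feedback loops and transient spikes,” is a circuit breaker: after enough consecutive failures, clients stop hammering the database for a cooldown period instead of retrying immediately. This is a generic sketch under that assumption, not Facebook’s actual design:

```python
# Generic circuit-breaker sketch (an assumed mitigation pattern, not
# Facebook's implementation). After max_failures consecutive errors the
# breaker "opens" and rejects calls until the cooldown elapses.

import time

class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None      # timestamp when the breaker tripped

    def allow(self, now=None):
        """Return True if a query may proceed."""
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Half-open: let one attempt through to probe recovery.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic() if now is None else now

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

With a breaker in front of the database cluster, failing clients back off automatically rather than compounding the overload, which removes the need to shut down the whole site to let the databases recover.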
This outage underscores the critical importance of system reliability and highlights the challenges of managing complex configurations at scale. Facebook reassures users of its commitment to improving platform performance and reliability, recognizing the impact such disruptions have on its global user base.
Facebook Engineering Official Statement
“Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.”