On June 12, 2025, a significant disruption hit Google Cloud, causing hours-long outages across more than fifty distinct Google Cloud products and rippling across the internet to the many platforms and customer workloads that rely on Google Cloud infrastructure.
The outage originated with an invalid automated quota update to Google's API management system. The faulty update was replicated globally and sent Service Control, the core component of Google Cloud's API infrastructure responsible for authorization, policy enforcement, and quota management, into a crash loop, causing external API requests to be rejected. A feature added on May 29, 2025 to perform additional quota policy checks lacked proper error handling and feature-flag protection, so when policy metadata containing unintended blank fields was introduced on June 12, the new code path failed catastrophically.
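To make the failure mode concrete, the sketch below shows the kind of defensive pattern the incident description implies was missing: a new policy check gated behind a feature flag, with unexpected or blank metadata handled gracefully rather than allowed to crash the serving task. This is an illustrative Python sketch, not Google's actual Service Control code; names such as `PolicyMetadata`, `FEATURE_FLAGS`, and `check_quota_policy` are hypothetical.

```python
# Hypothetical sketch: guarding a new policy check with a feature flag and
# defensive error handling, so malformed metadata degrades gracefully
# instead of crashing the serving binary. All names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyMetadata:
    project_id: Optional[str]   # blank/None fields are the failure mode here
    quota_limit: Optional[int]

FEATURE_FLAGS = {"extra_quota_policy_check": False}  # off by default, ramped up gradually

def check_quota_policy(meta: PolicyMetadata) -> bool:
    """Return True if the request passes the quota policy check."""
    if not FEATURE_FLAGS["extra_quota_policy_check"]:
        return True  # new code path disabled until the flag is rolled out

    try:
        # Validate inputs before use; a None here is the analogue of the
        # unintended blank fields described in the incident write-up.
        if meta.project_id is None or meta.quota_limit is None:
            # Fail open (or fail closed, per policy) rather than raising an
            # unhandled error that would crash-loop the process.
            return True
        return meta.quota_limit > 0
    except Exception:
        # Last-resort guard: never let a policy-check exception take down
        # the entire API-serving task.
        return True
```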
The impact was felt across a wide range of Google Cloud services, including Google Compute Engine, Google Cloud Storage, Google Cloud SQL, and Vertex AI. The disruption extended beyond Google's own services to third-party platforms that rely on Google Cloud, such as Spotify, Snapchat, Discord, and Cloudflare. Cloudflare, a major internet infrastructure provider, saw its authentication systems falter because of its reliance on Google Cloud, causing session-validation and login-workflow problems for its customers.
The outage had significant consequences for businesses and users globally. Millions of users were unable to access critical applications, disrupting daily workflows and online activities, while companies running on Google Cloud Platform saw degraded application performance and stalled data-processing tasks. Financial markets also reacted, with Alphabet's stock dipping on the news.
Google's Site Reliability Engineering (SRE) team responded swiftly, identifying the root cause within minutes and beginning recovery. Engineers bypassed the offending quota check, which restored most regions within about two hours. Recovery took far longer in the larger Google Cloud regions, however, because Service Control tasks restarting en masse overloaded the underlying infrastructure; in us-central1 the quota policy database became overloaded, stretching the recovery out much further.
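The prolonged recovery in us-central1 highlights a general pattern: when many tasks restart at once and hammer a shared backend, retries need randomized exponential backoff ("jitter") so they do not arrive in lockstep. The sketch below is a generic illustration of full-jitter backoff, not Google's implementation; the `connect` callable and the retry parameters are hypothetical.

```python
# Hypothetical sketch: randomized exponential backoff ("full jitter") for
# tasks reconnecting to a shared backend after a mass restart, to avoid the
# herd effect described above. The connect() function is illustrative.
import random
import time

def reconnect_with_backoff(connect, max_attempts=8, base_delay=0.5, max_delay=60.0):
    """Call connect() with full-jitter exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except Exception:
            # Full jitter: sleep a random amount up to the exponential cap,
            # so thousands of restarting tasks do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
    raise RuntimeError("backend still unavailable after retries")
```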
In response, Google has pledged to prevent recurrences by improving error handling, enhancing testing and monitoring, and blocking metadata propagation that lacks adequate safeguards. The company also plans to improve external communications so customers receive timely information during incidents, and to ensure its monitoring and communication infrastructure stays operational even when core services are down. These commitments amount to an acknowledgment that Google's communication during the outage fell short.
The Google Cloud services disruption serves as a reminder of the potential vulnerabilities in cloud infrastructure and the importance of robust backup planning and system diversification for businesses reliant on cloud services. The incident has sparked renewed discussions about cloud reliability and the need for effective testing, dependency management, and cascading failure prevention in complex cloud systems.