On October 20, 2025, Amazon Web Services (AWS) experienced a significant outage in its US-East-1 region (Northern Virginia), disrupting multiple services and affecting numerous dependent platforms worldwide. The event highlighted the interconnected nature of cloud infrastructure, as issues in a single region cascaded to impact global users. The outage stemmed primarily from internal DNS resolution problems, which led to API errors and connectivity failures across various AWS services.
Timeline of Events
The disruption unfolded over several hours, with the following key milestones based on official reports and third-party observations:
- Start of Issues: AWS reported increased error rates beginning at 11:49 PM PDT on October 19, 2025 (6:49 AM UTC on October 20). Third-party monitoring detected degradation around 7:55 AM UTC, with requests timing out or returning service errors; some reports place the onset closer to 3:11 AM Eastern Time (approximately 7:11 AM UTC).
- Identification and Initial Mitigation: By 12:26 AM PDT on October 20, AWS identified the root cause as DNS resolution issues for regional DynamoDB endpoints. The primary issue was mitigated by 2:24 AM PDT (9:24 AM UTC), allowing services to begin recovering. Recovery signs were observed around 9:22 AM UTC, with the issue appearing cleared by 9:35 AM UTC in some monitoring.
- Ongoing Recovery: Post-mitigation, some internal subsystems remained impaired, leading to temporary throttling of operations such as EC2 instance launches. Significant recovery for customers was noted by 12:28 PM PDT (7:28 PM UTC), with all services returning to normal by 3:01 PM PDT (10:01 PM UTC).
- Duration: From the first errors at 11:49 PM PDT to the all-clear at 3:01 PM PDT, the disruption spanned roughly 15 hours; the underlying DNS issue was mitigated within about two and a half hours, with the remaining time spent restoring impaired subsystems and working through throttled operations.
User reports on downtime trackers spiked around 6:00 AM Pacific Time (1:00 PM UTC), dropping below 5,000 by early afternoon as recovery progressed.
Facts and Cause
The outage was officially attributed to DNS resolution failures affecting the regional endpoints of Amazon DynamoDB, a key NoSQL database service. This internal issue caused widespread API errors and connectivity problems across multiple AWS services in US-East-1, without any evident external network disruptions. As a result, services dependent on these endpoints experienced cascading failures, impacting AWS customers and downstream applications.
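To make the failure mode concrete: when a regional endpoint hostname cannot be resolved, the error typically surfaces to clients as a connection failure rather than a service response. Below is a minimal sketch, assuming boto3 and a hypothetical table name, of how such a failure would appear to an application calling DynamoDB in US-East-1; it is illustrative, not a reproduction of AWS's internal tooling.

```python
# Minimal illustrative sketch: how a DNS/connectivity failure on the regional
# DynamoDB endpoint typically surfaces to a boto3 client. The table name and
# retry settings are assumptions for illustration.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",  # dynamodb.us-east-1.amazonaws.com, the endpoint at issue
    config=Config(retries={"max_attempts": 3, "mode": "standard"}),
)

try:
    dynamodb.describe_table(TableName="example-table")  # hypothetical table
except EndpointConnectionError as exc:
    # Raised when the endpoint hostname cannot be resolved or reached.
    print(f"Endpoint unreachable (DNS/connectivity): {exc}")
except ClientError as exc:
    # Service-side errors (throttling, access denied, missing table, etc.).
    print(f"API error: {exc.response['Error']['Code']}")
```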
Key facts include:
- Affected Region: Exclusively US-East-1, AWS’s largest and most utilized region, which hosts critical infrastructure for many global services.
- Impacted AWS Services: DynamoDB endpoints were the epicenter, but the outage rippled to services like EC2 (with throttled launches during recovery), AWS Support, and broader API operations.
- Broader Impact: External platforms relying on AWS went down, including Snapchat (complete blackout), Ring doorbells (stopped recording), Fortnite (login failures), Robinhood (portfolio access issues), Slack and Atlassian (collaboration tools disrupted), UK banks like Lloyds and Halifax (service loss), and Amazon Alexa (devices unresponsive). Millions of users were affected globally, underscoring US-East-1’s role as a de facto backbone for internet services.
- No External Factors: Monitoring showed no coinciding network events, confirming the problem was internal to AWS’s architecture.
Hypotheses
While the official cause points to DynamoDB DNS resolution issues, analyses suggest this acted as a single point of failure (SPOF) within AWS’s infrastructure. Hypotheses from technical breakdowns include:
- A misconfiguration or overload in DNS resolvers specific to DynamoDB, potentially exacerbated by high traffic or an internal software bug, leading to resolution timeouts (a simple resolution probe of the kind that would surface such timeouts is sketched after this list).
- Cascading effects due to over-reliance on US-East-1 for global features, even in multi-region setups, as some services default to or depend on this region’s endpoints.
- No evidence of malicious activity like cyberattacks; it appears to be an operational failure.
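As referenced in the first hypothesis, a resolution probe is one way such timeouts would have been observed from the outside. The sketch below uses only the Python standard library; the hostname, sampling interval, and 2-second "slow" threshold are assumptions rather than values from any AWS report.

```python
# Illustrative DNS-resolution probe (standard library only). It repeatedly resolves
# the regional DynamoDB endpoint and flags slow or failed lookups; thresholds and
# intervals are arbitrary choices for the sketch.
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
SLOW_THRESHOLD_S = 2.0  # assumed threshold for flagging a slow lookup

def probe_once(hostname: str) -> None:
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        elapsed = time.monotonic() - start
        addrs = sorted({info[4][0] for info in infos})
        status = "SLOW" if elapsed > SLOW_THRESHOLD_S else "ok"
        print(f"{status}: resolved {hostname} in {elapsed:.3f}s -> {addrs}")
    except socket.gaierror as exc:
        elapsed = time.monotonic() - start
        print(f"FAIL: could not resolve {hostname} after {elapsed:.3f}s ({exc})")

if __name__ == "__main__":
    for _ in range(5):  # a handful of samples; a real probe would run continuously
        probe_once(ENDPOINT)
        time.sleep(10)
```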
These align with past AWS outages, where regional dependencies amplify localized problems.
Mitigations and Lessons Learned
AWS resolved the issue by addressing the DynamoDB DNS problems and implementing temporary throttling on resource-intensive operations (e.g., EC2 launches) to stabilize recovery without overwhelming impaired subsystems. Services were gradually restored, with full operations confirmed by late afternoon PDT.
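From the customer side, control-plane throttling of this kind generally shows up as throttling error codes (for EC2, RequestLimitExceeded) on calls such as RunInstances. The sketch below is a hedged example of tolerating that temporarily rather than failing outright; the instance parameters are placeholders, and the retry settings are illustrative rather than recommended values.

```python
# Sketch under assumptions: AMI and instance parameters are placeholders. boto3's
# "adaptive" retry mode backs off automatically on throttling responses such as
# EC2's RequestLimitExceeded.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

try:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
except ClientError as exc:
    if exc.response["Error"]["Code"] == "RequestLimitExceeded":
        # Still throttled after retries: defer the launch rather than hammering
        # an impaired control plane.
        print("Launch request throttled; retry later")
    else:
        raise
```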
For customers and architects, key mitigations and lessons include:
- Multi-Region Architectures: Design applications to failover to other regions (e.g., US-East-2 or US-West-2) automatically, reducing dependency on US-East-1. However, challenges like cross-region resource definitions in tools such as CloudFormation highlight the need for better multi-region support.
- Resiliency Best Practices: Implement chaos engineering, redundant DNS setups, and monitoring for SPOFs to prevent cascades. Use services like Route 53 for diversified DNS resolution; a failover-record sketch follows this list.
- Monitoring and Alerts: Rely on tools like the AWS Health Dashboard for real-time updates and integrate third-party monitoring (e.g., ThousandEyes) for external visibility; a Health API polling sketch also follows this list.
- Broader Implications: The event underscores the fragility of cloud monocultures; organizations should consider hybrid or multi-cloud strategies to mitigate vendor-specific risks.
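As a concrete illustration of the multi-region and DNS points above, the following sketch configures Route 53 failover records: a health-checked primary pointing at a US-East-1 deployment and a standby pointing at US-West-2. The hosted zone ID, domain, and target hostnames are hypothetical, and a real disaster-recovery setup involves far more than DNS.

```python
# Illustrative Route 53 failover sketch. Zone ID, record names, and target
# endpoints are hypothetical placeholders, not values from this incident.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # hypothetical hosted zone
RECORD_NAME = "api.example.com"

# Health check probing the primary (us-east-1) deployment.
health_check_id = route53.create_health_check(
    CallerReference="primary-use1-check-001",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-use1.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_change(set_id, role, target, health_check=None):
    """Build an UPSERT for one leg of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check:
        record["HealthCheckId"] = health_check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "DNS failover: us-east-1 primary, us-west-2 standby",
        "Changes": [
            failover_change("use1", "PRIMARY", "api-use1.example.com", health_check_id),
            failover_change("usw2", "SECONDARY", "api-usw2.example.com"),
        ],
    },
)
```

One design note: Route 53's data plane is designed to keep answering queries and evaluating health checks even when its control plane is impaired, so failover of this kind can redirect traffic without requiring API calls during the incident itself.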
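For the monitoring point, the AWS Health Dashboard also has a programmatic counterpart. The sketch below polls the AWS Health API for open events in US-East-1; note that this API is only available on Business, Enterprise On-Ramp, and Enterprise support plans, and the filters shown are illustrative assumptions, which is one reason to pair it with independent third-party monitoring.

```python
# Minimal sketch of polling the AWS Health API for open events. Requires a support
# plan that includes the Health API; the filters below are illustrative assumptions.
import boto3

# The Health API is a global service; with this region setting boto3 uses its
# us-east-1 endpoint.
health = boto3.client("health", region_name="us-east-1")

events = health.describe_events(
    filter={
        "regions": ["us-east-1"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)["events"]

for event in events:
    print(event["service"], event["eventTypeCode"], event["statusCode"], event["startTime"])
```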
Overall, while AWS restored services efficiently, the outage reinforces the importance of distributed systems in an increasingly cloud-dependent world. Customers are advised to review their architectures for similar vulnerabilities.
