Beefree SDK customers experienced slowdowns and errors when loading the editors, particularly on the Auth and Utils endpoints. SDK API calls were affected by the same slowness and errors.
On Saturday, March 8, 2025, we updated our infrastructure and services to support a multi-region deployment, enabling our SDK to operate across two AWS regions—Dublin, Ireland, and Frankfurt, Germany.
The scheduled maintenance was successfully completed at 11:36 AM EST, with additional adjustments made on March 9 and the morning of March 10 (CET).
However, as traffic increased on Monday, March 10, 2025, the Beefree SDK and Beefree SDK API began experiencing intermittent issues, which later escalated into a sustained outage.
During the escalation, we investigated the root cause and explored multiple possible fixes. Given the time constraints and uncertainty of an immediate resolution, we prioritized stability and swiftly redirected traffic to a single region, which restored normal service.
Services are operating normally, and no further issues have been detected. We are actively monitoring for any residual effects.
The issue was highly traffic-dependent, as no slowdowns or failures were observed in the days prior or during load testing. Autoscaling worked correctly, ruling it out as a contributing factor. The problem originated from the use of shared resources within the EKS cluster, specifically DNS resolution.
A preliminary analysis indicates that the primary cause was a slowdown in CoreDNS within our AWS infrastructure. As traffic surged, CoreDNS pods became overwhelmed with DNS queries, leading to delayed responses. However, CPU and memory usage remained stable, showing no signs of resource exhaustion.
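For illustration only, the minimal Python sketch below shows one way this kind of degradation can be observed from inside a cluster: timing repeated DNS lookups that pass through the cluster resolver (CoreDNS). The hostnames are placeholders, not our actual service endpoints, and this is not the tooling used during the incident.

```python
import socket
import statistics
import time

# Placeholder hostnames; in practice these would be in-cluster service names
# or external API hosts resolved through CoreDNS.
HOSTNAMES = ["auth.example.internal", "utils.example.internal"]
SAMPLES = 20

def measure_dns_latency(hostname: str, samples: int) -> list[float]:
    """Time repeated DNS lookups for a hostname, in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, 443)
        except socket.gaierror:
            # A failed lookup also signals resolver trouble; record it as a miss.
            latencies.append(float("nan"))
            continue
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

if __name__ == "__main__":
    for host in HOSTNAMES:
        results = [r for r in measure_dns_latency(host, SAMPLES) if r == r]  # drop NaN misses
        if results:
            print(f"{host}: p50={statistics.median(results):.1f} ms, "
                  f"max={max(results):.1f} ms over {len(results)} successful lookups")
        else:
            print(f"{host}: all lookups failed")
```

Under normal conditions these lookups complete in a few milliseconds; sustained spikes or failures, with pod CPU and memory otherwise healthy, are consistent with the resolver itself being overloaded rather than the workloads.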
We are conducting further investigations to validate this finding and ensure there are no additional contributing factors.
10:38 AM EDT
Our monitoring systems triggered phone alerts, notifying developers of the issue. Our automated monitoring service also posted an alert in our internal Slack channel. Internal escalation was officially initiated.
10:41 AM EDT
A cross-functional incident response meeting was held with developers and support team members. It quickly became evident that the issue was causing slowdowns across the entire infrastructure. The team began troubleshooting to identify the root cause.
11:45 AM EDT
The issue was identified as an anomaly in the internal network management of the newly deployed multi-region AWS setup. To mitigate the impact, we redirected all traffic for essential services to a single region (AWS Dublin, Ireland).
12:05 PM EDT
Traffic redirection was fully completed, and the infrastructure slowdowns resolved almost immediately. We began monitoring the effectiveness of the solution.
12:35 PM EDT
After 30 minutes of stable monitoring, we officially closed the escalation.
Our team is conducting a full Root Cause Analysis (RCA) to fully understand the underlying cause. We will post updates here as they become available, including when the completed RCA can be requested from our team.
We will evaluate additional monitoring, testing, and safeguards to prevent recurrence.
We sincerely apologize for the disruption and appreciate your patience. We will provide an update once we complete our analysis.