Partial Outage – Login Issues (403 & 504 Errors)

Incident Report for Beefree SDK

Postmortem

Duration: 1 hour, 37 minutes

Impact:

Beefree SDK customers experienced slowdowns and errors when loading the editors, particularly on the Auth and Utils endpoints. Calls to the Beefree SDK API were also affected by slowness and errors.

Summary:

On Saturday, March 8, 2025, we updated our infrastructure and services to support a multi-region deployment, enabling our SDK to operate across two AWS regions—Dublin, Ireland, and Frankfurt, Germany.

The scheduled maintenance was successfully completed at 11:36 AM EST, with additional adjustments made on March 9 and the morning of March 10 (CET).

However, as traffic increased on Monday, March 10, 2025, the Beefree SDK and Beefree SDK API began experiencing intermittent issues, which later escalated into a sustained partial outage.

During the escalation, we investigated the root cause and explored multiple possible fixes. Given the time constraints and uncertainty of an immediate resolution, we prioritized stability and swiftly redirected traffic to a single region, which restored normal service.

Current status:

Services are operating normally, with no further issues detected. We are actively monitoring for any residual effects.

Root cause:

The issue was highly traffic-dependent, as no slowdowns or failures were observed in the days prior or during load testing. Autoscaling worked correctly, ruling it out as a contributing factor. The problem originated from the use of shared resources within the EKS cluster, specifically DNS resolution.

A preliminary analysis indicates that the primary cause was a slowdown in CoreDNS within our AWS infrastructure. As traffic surged, CoreDNS pods became overwhelmed with DNS queries, leading to delayed responses. However, CPU and memory usage remained stable, showing no signs of resource exhaustion.

We are conducting further investigations to validate this finding and ensure there are no additional contributing factors.
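
For readers who want to see how this class of problem surfaces: because DNS resolution is a shared resource inside the cluster, saturation tends to appear as slow lookups at the application layer rather than as high CPU or memory on the CoreDNS pods. The short Python sketch below is purely illustrative; the hostname and load figures are placeholders and do not reflect our internal services or test configuration.

# Minimal, illustrative sketch (not our internal tooling): under concurrent load,
# resolver saturation shows up as rising lookup latency even when CPU and memory
# look healthy. The hostname below is a placeholder, not a real service name.
import socket
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

HOSTNAME = "auth.internal.example"   # placeholder in-cluster service name
ATTEMPTS = 200
CONCURRENCY = 50

def timed_lookup(_: int) -> float:
    """Resolve HOSTNAME once and return the elapsed time in milliseconds."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(HOSTNAME, 443)
    except socket.gaierror:
        pass  # failed resolutions still count toward observed latency
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        samples = sorted(pool.map(timed_lookup, range(ATTEMPTS)))
    p50 = statistics.median(samples)
    p95 = samples[int(len(samples) * 0.95) - 1]
    print(f"DNS lookup latency over {ATTEMPTS} attempts: p50={p50:.1f} ms, p95={p95:.1f} ms")

In a healthy cluster the p50 and p95 values stay close together; a widening gap under load is consistent with the resolver-level delays described above.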

Issue timeline

10:38 AM EDT
Our monitoring systems triggered phone alerts, notifying developers of the issue. Our automated monitoring service also posted an alert in our internal Slack channel. Internal escalation was officially initiated.

10:41 AM EDT
A cross-functional incident response meeting was held with developers and support team members. It quickly became evident that the issue was causing slowdowns across the entire infrastructure. The team began troubleshooting to identify the root cause.

11:45 AM EDT
The issue was identified as an anomaly in the internal network management of the newly deployed multi-region AWS setup. To mitigate the impact, we redirected all traffic for essential services to a single region (AWS Dublin, Ireland).

12:05 PM EDT
Traffic redirection was fully completed, and the infrastructure slowdowns resolved almost immediately. We began monitoring the effectiveness of the solution.

12:35 PM EDT
After 30 minutes of stable monitoring, we officially closed the escalation.

Next steps:

Our team is currently conducting a full Root Cause Analysis (RCA) to fully understand the underlying cause. We will post updates here as they become available; once the RCA is complete, it will be available upon request from our team.

We will evaluate additional monitoring, testing, and safeguards to prevent recurrence.
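
As an illustration of the kind of synthetic check we are evaluating, the sketch below polls a login endpoint and flags the 403, 504, and timeout responses observed during this incident. The URL, interval, and timeout values are placeholder assumptions, not our actual monitoring configuration.

# Illustrative synthetic check (placeholder values, not production monitoring):
# periodically request a login endpoint and flag 403/504 responses or timeouts.
import time
import urllib.error
import urllib.request

ENDPOINT = "https://auth.example.com/login"  # placeholder login endpoint URL
TIMEOUT_S = 5        # per-request timeout, placeholder value
INTERVAL_S = 30      # polling interval, placeholder value

def check_once() -> str:
    """Issue one request and classify the outcome as OK or ALERT."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=TIMEOUT_S) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code          # 403/504 responses arrive as HTTPError
    except OSError:
        return "ALERT: request timed out or failed to connect"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if status in (403, 504):
        return f"ALERT: HTTP {status} after {elapsed_ms:.0f} ms"
    return f"OK: HTTP {status} in {elapsed_ms:.0f} ms"

if __name__ == "__main__":
    while True:
        print(check_once())
        time.sleep(INTERVAL_S)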

No further action is needed on your part at this time. If you are still experiencing issues, please contact our support team.

We sincerely apologize for the disruption and appreciate your patience. We will provide an update once we complete our analysis.

Posted Mar 11, 2025 - 15:10 EDT

Resolved

The issue has been resolved, and all services are now operating normally. We will continue to monitor the situation closely over the next few days to ensure stability. Thank you for your patience.
Posted Mar 10, 2025 - 12:36 EDT

Monitoring

We have identified the root cause of the issue and implemented a solution to address the intermittent issues affecting key services. Our team is closely monitoring the situation to ensure system stability.
Posted Mar 10, 2025 - 12:18 EDT

Update

We are currently identifying the root cause of the issue and are actively working on a solution.

Our team is dedicated to resolving the problem as quickly as possible, and we will continue to provide updates on our progress.
Posted Mar 10, 2025 - 11:19 EDT

Investigating

We are currently experiencing a partial outage affecting our login system. As a result, users may encounter 403 (Forbidden) and 504 (Gateway Timeout) errors when attempting to access our services via the login endpoint, and access to certain endpoints may be delayed or unavailable.

Our team is actively investigating the issue and working towards a resolution.
Posted Mar 10, 2025 - 11:09 EDT
This incident affected: Authorization and API.