We are sharing the details on the outage as provided by Twilio, our telecom backbone carrier:
Preliminary Reason for Outage - Reference #13002 Dropped phone calls
Summary
On March 31, 2025 from 15:09:00 UTC until 16:45:00 UTC, a service disruption occurred. The issue affected inbound and outbound voice traffic. The incident was caused by a failure in Twilio's caching proxy, which fetches the media files to play. A subsystem related to telemetry and diagnostics degraded the performance of the system, and this resulted in timeouts for the affected TwiML invocations. Once the throughput issue was identified, capacity was increased, which alleviated the problem.
Timeline
● [2025-03-31 15:09 UTC]: First occurrence of the event that triggered the incident.
● [2025-03-31 15:12 UTC]: Monitors for the failing system detect high load and alert Twilio to the issue.
● [2025-03-31 15:22 UTC]: On-call engineering team begins investigating the issue.
● [2025-03-31 16:16 UTC]: On-call engineering team starts the initial remediation process by adding capacity to the caching proxy layer to compensate for the high load.
● [2025-03-31 16:27 UTC]: Twilio Status Page is updated to alert customers that Programmable Voice and Flex are experiencing issues.
● [2025-03-31 16:38 UTC]: The process of adding capacity to the caching proxy layer ends.
● [2025-03-31 16:44 UTC]: The root cause is identified: an issue with the integration with our internal logging systems saturated the caching proxy layer, which then failed to serve requests.
● [2025-03-31 16:44 UTC]: Service starts to show recovery.
● [2025-03-31 16:56 UTC]: Service functioning normally.
● [2025-03-31 21:07 UTC]: On-call engineering team deploys a patch to prevent recurrence of the issue.
● [2025-04-01 04:16 UTC]: The patch deployment finishes.
Preliminary RFO: #13002 UPDATED: April 1, 2025
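The intervals implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the entries above, while the event labels and the `minutes_since_start` helper are names chosen here and are not part of Twilio's report.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC).
# The keys are illustrative labels, not terms used by Twilio.
EVENTS = {
    "first_occurrence": "2025-03-31 15:09",
    "alert_fired": "2025-03-31 15:12",
    "investigation_started": "2025-03-31 15:22",
    "remediation_started": "2025-03-31 16:16",
    "status_page_updated": "2025-03-31 16:27",
    "service_normal": "2025-03-31 16:56",
}


def minutes_since_start(events):
    """Return whole minutes elapsed between the first occurrence and each event."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(events["first_occurrence"], fmt)
    return {
        name: int((datetime.strptime(ts, fmt) - start).total_seconds() // 60)
        for name, ts in events.items()
    }


ELAPSED = minutes_since_start(EVENTS)
for name, minutes in ELAPSED.items():
    print(f"{name}: +{minutes} min")
```

Under these timestamps, the alert fired 3 minutes after the first occurrence, the status page was updated after 78 minutes, and service was functioning normally after 107 minutes.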
Customer Impact
On March 31, 2025 from 15:10 UTC until 16:44 UTC, approximately 25% of invocations of the two affected TwiML verbs failed with error code 11200. As a result, impacted customer call flows that execute either of the two verbs dropped. In total, approximately 13.3% of phone calls were dropped during this incident across customers using this functionality.
Cause
An increase in specific requests sent to a subsystem related to telemetry and diagnostics caused an internal buffer to overflow, generating error events. These requests originated from the caching proxy that retrieves the media files for the affected TwiML invocations. The process that collects these error events became saturated and consumed a significant amount of resources across the entire caching layer fleet, causing timeouts.
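The failure mode described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Twilio's implementation: the `TelemetryBuffer` class, the byte-based capacity, and the one-error-event-per-rejected-record behaviour are all hypothetical, and nothing drains the buffer during the simulated burst.

```python
class TelemetryBuffer:
    """Hypothetical bounded buffer; an assumption, not Twilio's actual design."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.accepted = 0
        self.error_events = []  # one entry per rejected record

    def submit(self, record):
        """Store the record, or emit an error event if it would overflow."""
        if self.used_bytes + len(record) > self.capacity_bytes:
            # Overflow: the record is dropped and an error event is generated.
            # In this model, every such event is extra work for the collector.
            self.error_events.append(f"overflow: dropped {len(record)} bytes")
            return False
        self.used_bytes += len(record)
        self.accepted += 1
        return True


def simulate(capacity_bytes, requests, payload_bytes):
    """Send a burst of equally sized records; return (accepted, error_events)."""
    buf = TelemetryBuffer(capacity_bytes)
    for _ in range(requests):
        buf.submit(b"x" * payload_bytes)
    return buf.accepted, len(buf.error_events)


print(simulate(1000, 50, 100))  # small buffer, large payloads
print(simulate(5000, 50, 100))  # larger buffer
print(simulate(1000, 50, 20))   # smaller payloads
```

In this model, a burst of 50 records of 100 bytes against a 1000-byte buffer produces 40 error events, while either a larger buffer or smaller payloads produces none. The real system's sizes and rates are not given in the report.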
Resolution
To resolve the immediate issue, the caching layer fleet was scaled up, mitigating the incident on 2025-03-31 at 16:56 UTC. On 2025-04-01 at 04:16 UTC, we took the following actions to prevent recurrence of this issue:
● Increased the buffer capacity to prevent a buffer overflow.
● Reduced the size of individual log payloads to minimize the amount of data processed by the telemetry and diagnostics system.
● Disabled the process that collects the error events from the logging system.
Additional details on the actions Twilio will commit to in order to prevent a recurrence will be shared in the Final Reason for Outage document.
Conclusion
We sincerely apologize for this incident. Improving the quality, resilience, and robustness of Twilio's platform remains our top priority. We are applying the lessons learned from these events to implement improvements that will prevent failures of this kind from impacting our customers in the future.