Intermittent Calls Dropping

Incident Report for MotionCX

Postmortem

We are sharing the details of the outage as provided by Twilio, our telecom backbone carrier:

Preliminary Reason for Outage - Reference #13002: Dropped phone calls

Summary

On March 31, 2025, from 15:09 UTC until 16:45 UTC, a service disruption occurred that affected inbound and outbound voice traffic. The incident was caused by a failure in Twilio's caching proxy, which fetches the files to be played on calls. A subsystem related to telemetry and diagnostics degraded the performance of this system, resulting in timeouts for the affected TwiML verb invocations. Once the throughput issue was identified, capacity was increased, which alleviated the problem.

Timeline

● [2025-03-31 15:09 UTC]: First occurrence of the event that triggered the incident.

● [2025-03-31 15:12 UTC]: Monitors for the failing system detect high load and alert Twilio to the issue.

● [2025-03-31 15:22 UTC]: On-call engineering team begins investigating the issue.

● [2025-03-31 16:16 UTC]: On-call engineering team starts the initial remediation process by adding additional capacity to the caching proxy layer to compensate for the high load.

● [2025-03-31 16:27 UTC]: Twilio Status Page is updated to alert customers that Programmable Voice and Flex are experiencing issues.

● [2025-03-31 16:38 UTC]: The process of adding additional capacity to the caching proxy layer ends.

● [2025-03-31 16:44 UTC]: The root cause is identified: an issue with the integration with internal logging systems saturated the caching proxy layer, causing it to fail to serve requests.

● [2025-03-31 16:44 UTC]: Service starts to show recovery.

● [2025-03-31 16:56 UTC]: Service functioning normally.

● [2025-03-31 21:07 UTC]: On-call engineering team deploys a patch to prevent recurrence of the issue.

● [2025-04-01 04:16 UTC]: The patch deployment finishes.

Customer Impact

On March 31, 2025, from 15:10 UTC until 16:44 UTC, approximately 25% of invocations of the affected TwiML verbs failed with error code 11200. This caused impacted customer call flows that execute either of the two verbs to drop. In total, approximately 13.3% of phone calls across customers using this functionality were dropped during this incident.
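Error code 11200 is Twilio's indicator for an HTTP retrieval failure (the platform could not fetch a TwiML document or media file). As an illustrative sketch only, not part of this report, an operator could filter debugger-style alert events for this code to measure their own exposure; the event dictionary shape below is an assumption, not the exact Twilio Debugger payload:

```python
# Hypothetical sketch: filter alert events for Twilio error code 11200
# (HTTP retrieval failure). The dict shape is assumed for illustration.

def filter_retrieval_failures(events):
    """Return only the events whose error code is 11200."""
    return [e for e in events if str(e.get("error_code")) == "11200"]

if __name__ == "__main__":
    events = [
        {"sid": "NOxxx1", "error_code": 11200, "call_sid": "CAxxx1"},
        {"sid": "NOxxx2", "error_code": 12100, "call_sid": "CAxxx2"},
    ]
    print(len(filter_retrieval_failures(events)))  # prints 1
```

Counting these events against total call volume is one way a customer could independently verify the ~13.3% dropped-call figure for their own traffic.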

Cause

An increase in specific requests sent to a subsystem related to telemetry and diagnostics caused an internal buffer to overflow, generating error events. These requests originated from the caching proxy that retrieves media files for the affected TwiML invocations. The process that collects these error events became saturated, consuming a significant amount of resources across the entire caching-layer fleet and causing timeouts.
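The failure mode described above, an error-event collector overwhelmed by its own backlog, can be sketched in miniature. This is a hypothetical illustration (the class and field names are assumptions, not Twilio's implementation): a buffer with no capacity limit grows without bound under an error storm, while a bounded buffer sheds load instead of saturating the host:

```python
from collections import deque

# Hypothetical sketch of the failure mode: an unbounded error-event buffer
# vs. a bounded one that drops the oldest events under load.

class ErrorEventBuffer:
    def __init__(self, maxlen=None):
        self.events = deque(maxlen=maxlen)  # maxlen=None -> unbounded growth
        self.dropped = 0

    def record(self, event):
        if self.events.maxlen is not None and len(self.events) == self.events.maxlen:
            self.dropped += 1  # deque evicts the oldest event on append
        self.events.append(event)

if __name__ == "__main__":
    unbounded = ErrorEventBuffer()
    bounded = ErrorEventBuffer(maxlen=1000)
    for i in range(10_000):  # simulated error storm
        unbounded.record(i)
        bounded.record(i)
    print(len(unbounded.events), len(bounded.events), bounded.dropped)
    # prints: 10000 1000 9000
```

The unbounded variant mirrors the incident: memory and processing cost scale with the error rate, so the collector's resource use explodes exactly when the system is already unhealthy.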

Resolution

To resolve the immediate issue, the caching layer fleet was scaled up, mitigating the incident on 2025-03-31 at 16:56 UTC. On 2025-04-01 at 04:16 UTC, we took the following actions to prevent recurrence of this issue:

● Increased the buffer capacity to prevent a buffer overflow.

● Reduced the size of individual log payloads to minimize the amount of data processed by the telemetry and diagnostics system.

● Disabled the process that collects the error events from the logging system.
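The second remediation above, shrinking individual log payloads, can be sketched as follows. This is a minimal hypothetical illustration; the byte limit and function name are assumptions, not details from the RFO:

```python
# Hypothetical sketch of the "reduce log payload size" remediation:
# truncate each payload before it reaches the telemetry pipeline.

MAX_PAYLOAD_BYTES = 512  # assumed limit, for illustration only

def truncate_payload(payload: str, limit: int = MAX_PAYLOAD_BYTES) -> str:
    """Cap a log payload at `limit` UTF-8 bytes, marking any truncation."""
    encoded = payload.encode("utf-8")
    if len(encoded) <= limit:
        return payload
    return encoded[:limit].decode("utf-8", errors="ignore") + "...[truncated]"

if __name__ == "__main__":
    print(truncate_payload("short"))                                # unchanged
    print(truncate_payload("x" * 1000).endswith("...[truncated]"))  # prints True
```

Capping payload size bounds the work the telemetry and diagnostics system does per event, which complements the larger buffer: together they make the collector's resource use independent of how noisy any single caching-proxy host becomes.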

Additional details on the actions Twilio will be committing to in order to prevent a recurrence will be shared in the Final Reason for Outage document.

Conclusion

We sincerely apologize for this incident. Improving the quality, resilience, and robustness of Twilio's platform remains our top priority. We are applying the lessons learned from these events to implement improvements that will prevent failures of this kind from impacting our customers in the future.

Preliminary RFO: #13002 UPDATED: April 1, 2025

Posted Apr 01, 2025 - 14:59 EDT

Resolved

This issue has been closed on the carrier side; all systems have been operational for the past several hours. We are reaching out to Twilio for a post-mortem and will provide additional information regarding the root-cause analysis and steps taken to safeguard against recurrence as it becomes available.
Posted Mar 31, 2025 - 15:17 EDT

Update

We are continuing to monitor for any further issues.
Posted Mar 31, 2025 - 15:15 EDT

Update

Twilio reports that the affected systems are now operating normally. We will continue to monitor for system stability. We'll provide another update in 30 minutes or as soon as more information becomes available.
Posted Mar 31, 2025 - 14:27 EDT

Update

Twilio continues scaling up systems and deploying new hosts in order to ease customer issues. We will update in 1 hour or as soon as more information becomes available.
Posted Mar 31, 2025 - 13:40 EDT

Monitoring

Call traffic has resumed. We are continuing to monitor and test as we see services operating as expected.
Posted Mar 31, 2025 - 12:55 EDT

Update

Some test calls are getting through, and we believe the issue is subsiding. We do not yet have confirmation that it is fully resolved; however, we are continuing to monitor and test.
Posted Mar 31, 2025 - 12:30 EDT

Update

We are in contact with Twilio, our telecom backbone carrier, and they have an error reported on their side. We are coordinating with Twilio support to provide information for troubleshooting. We will continue to update this thread as we get more information. We do not yet have an ETA for resolution.
Posted Mar 31, 2025 - 12:14 EDT

Investigating

We are currently investigating an issue where inbound calls are not connecting to IVR workflows or are dropping shortly after connecting. We will update this incident as we gather more information.
Posted Mar 31, 2025 - 11:54 EDT
This incident affected: MotionCX Applications (Interaction Routing) and Interaction Channels (Voice Services).