Skip to content

Conversation

@jamsea
Copy link
Contributor

@jamsea jamsea commented Oct 1, 2025

Problem

This PR fixes an issue where the on_client_disconnected callback would not fire in approximately 5% of calls when using FastAPIWebsocketTransport with Telnyx (and potentially other telephony providers).

Root Cause

The issue is a race condition with task cancellation that occurs when a remote client disconnects:

  1. Remote client disconnects → _receive_messages() loop exits
  2. The code should call trigger_client_disconnected() to fire the event
  3. BUT before it can execute, stop() is called (due to pipeline cleanup)
  4. stop()_stop_tasks()cancels the _receive_task
  5. Task cancellation prevents trigger_client_disconnected() from running

This race condition happens intermittently (~5% of the time) depending on the timing between the remote disconnect and the pipeline cleanup.

Background: Intentional Design from Commit 8569b61

This fix respects the intentional design from commit 8569b615 (Sept 15, 2025) which established that:

on_client_disconnected should only be triggered when the remote client actually disconnects, NOT when the bot ends the conversation.

The issue is NOT with this design decision, but with the implementation being vulnerable to task cancellation.

Solution

The fix uses asyncio.shield() in a finally block to protect the disconnect callback from being cancelled:

finally:
    # Use shield to prevent cancellation from stopping the disconnect callback
    try:
        await asyncio.shield(trigger_disconnect_if_needed())
    except asyncio.CancelledError:
        # Even if we're cancelled, try to trigger the disconnect
        await trigger_disconnect_if_needed()
        raise

How asyncio.shield() works

asyncio.shield() protects a coroutine from being cancelled. When the outer task is cancelled:

  • The shielded coroutine continues running to completion
  • The callback fires successfully
  • Then the CancelledError is re-raised for proper cleanup

This ensures:

  • ✅ Remote client disconnects → event always fires (protected from cancellation)
  • ✅ Bot calls stop()/cancel() → event does not fire (respects is_closing flag)

Related

  • Respects design intent from commit 8569b615
  • Issue reported in production since v0.0.86
  • Different from (and more correct than) the approach in origin/fastapi_disconnect_issue branch which was created but never merged

@jamsea jamsea requested review from aconchillo and filipi87 October 1, 2025 02:17
@codecov
Copy link

codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/pipecat/transports/websocket/fastapi.py 0.00% 8 Missing ⚠️
Files with missing lines Coverage Δ
src/pipecat/transports/websocket/fastapi.py 0.00% <0.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…cketTransport

This fix addresses an issue where the on_client_disconnected callback
would not fire in approximately 5% of calls when using FastAPIWebsocketTransport.

The problem was caused by a race condition introduced after v0.0.86 when
event handlers were changed to run in parallel. When stop() or cancel()
initiated the disconnect, the _closing flag would be set, preventing
trigger_client_disconnected() from being called in _receive_messages().

Now disconnect() always calls trigger_client_disconnected() when
closing the WebSocket, ensuring the event fires reliably whether the
disconnect is initiated by the transport or the remote client.

Fixes the same issue as commit 019c1a6 from origin/fastapi_disconnect_issue branch.
@jamsea jamsea force-pushed the fix/fastapi-websocket-disconnect-event branch from b75a901 to ae8b9f0 Compare October 1, 2025 02:23
Comment on lines -300 to -301
if not self._client.is_closing:
await self._client.trigger_client_disconnected()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that reaching this point means the client has closed the socket, which will trigger on_client_disconnected.

Usually, the task is cancelled(stop) inside the user’s code at that point. Like, inside the on_client_disconnected callback.

I’m not sure how the race condition you mentioned would specifically impact this case. Like who is invoking stop at the same time ?

Have you been able to reproduce the issue locally ? Do you have the steps we should follow ?

Also tagging @aconchillo here for thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the same. Also. we only want to trigger on_client_disconnected when the client really disconnects.

@ayubSubhaniya
Copy link

++ to the issue, there is a race condition when on_client_disconnect doesn't happen and runner.run() exits, where runner is runner = PipelineRunner(handle_sigint=False).

It is easily reproducible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants