Skip to content

Conversation

@cdoern
Copy link
Contributor

@cdoern cdoern commented Nov 11, 2025

What does this PR do?

Add explicit connection cleanup and shorter timeouts to OpenAI client fixtures. Fixes CI deadlock after 25+ tests due to connection pool exhaustion. Also adds 60s timeout to test_conversation_context_loading as safety net.

Test Plan

tests pass

Add explicit connection cleanup and shorter timeouts to OpenAI client
fixtures. Fixes CI deadlock after 25+ tests due to connection pool
exhaustion. Also adds 60s timeout to test_conversation_context_loading
as safety net.

Signed-off-by: Charlie Doern <[email protected]>
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 11, 2025
@cdoern cdoern marked this pull request as draft November 11, 2025 01:18
@cdoern cdoern marked this pull request as ready for review November 11, 2025 01:21
@cdoern
Copy link
Contributor Author

cdoern commented Nov 12, 2025

I did some digging, and I don't think there is an API deadlock causing the regular CI timeouts. The hang is due to HTTP connection pool exhaustion in the OpenAI client, not the conversations API itself. So I believe this solution is right.

Evidence Against API Deadlock:

  1. Only happens in CI, not locally - Logic deadlocks are deterministic and would reproduce everywhere
  2. Occurs after 25+ tests - Progressive resource exhaustion pattern, not immediate deadlock
  3. Hangs before HTTP request is made, meaning the OpenAI client itself is blocked waiting for an available
    connection

The OpenAI Python client maintains an internal httpx connection pool (default max 100 connections). Without explicit cleanup:

  • Each test session creates connections
  • In CI environments, connections don't get garbage collected aggressively
  • After 25+ tests, the pool is exhausted
  • New requests block indefinitely waiting for an available connection

This PR adds:

  1. client.close() - Explicitly releases HTTP connections
  2. timeout=30.0 - Safety net to prevent infinite waits
  3. max_retries=0 - Fail fast instead of retry with exhausted pool

@franciscojavierarceo @leseb @ashwinb WDYT about this fix?

Copy link
Collaborator

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this to see if it helps CI failures. It's not exactly clear to me what the root cause is that this fixes, so we may want to keep looking into root cause in parallel.

@bbrowning bbrowning merged commit 37853ca into llamastack:main Nov 12, 2025
22 checks passed
@bbrowning
Copy link
Collaborator

Merging as the tests passed here which should unblock our CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants