Skip to content

Conversation

@devin-ai-integration
Copy link
Contributor

Fix issue #3871: kickoff_for_each() hangs after completion

Summary

This PR fixes a threading issue where kickoff_for_each() would hang indefinitely after successful completion. The root cause was that event handlers scheduled in the ThreadPoolExecutor were not being awaited, leaving non-daemon worker threads active and preventing process exit.

Changes made:

  1. Added _wait_for_event_handlers() helper method that safely waits on event bus futures with a 30-second timeout and proper error handling
  2. Modified kickoff() to wait on all three lifecycle event emissions (CrewKickoffStartedEvent, CrewKickoffCompletedEvent, CrewKickoffFailedEvent)
  3. Added is_tracing_disabled() helper to respect CREWAI_DISABLE_TRACING, CREWAI_DISABLE_TRACKING, and OTEL_SDK_DISABLED environment variables
  4. Updated tracing enable logic to check disable flags first, preventing tracing setup when explicitly disabled
  5. Added comprehensive tests covering the threading issue and tracing disable functionality

Files changed:

  • lib/crewai/src/crewai/crew.py: Core fix for event handler waiting
  • lib/crewai/src/crewai/events/listeners/tracing/utils.py: New tracing disable helper
  • lib/crewai/tests/test_kickoff_for_each_hang.py: New test file with 4 test cases

Review & Testing Checklist for Human

⚠️ Risk Level: YELLOW - Core execution path changes with potential performance impact

  • Test with real crews: Run kickoff_for_each() with actual LLM calls and tracing enabled to verify the hang is fixed and there's no significant performance degradation
  • Verify timeout value: The 30-second timeout per event emission may need adjustment based on production event handler behavior. Test with slow event handlers to ensure the timeout is appropriate
  • Test tracing disable flags: Verify that setting CREWAI_DISABLE_TRACING=true properly disables tracing and doesn't break existing setups
  • Check error scenarios: Test what happens when event handlers fail or timeout - ensure the warnings are logged appropriately and execution continues

Recommended test plan:

  1. Create a simple crew with 2-3 tasks
  2. Run kickoff_for_each() with 5-10 inputs and verify it completes without hanging
  3. Test with CREWAI_DISABLE_TRACING=true to ensure tracing is properly disabled
  4. Monitor execution time to ensure the blocking waits don't cause unacceptable slowdowns
  5. Check logs for any timeout warnings during normal operation

Notes

  • The tests mock _run_sequential_process to avoid actual LLM calls, so they verify timing behavior but not full integration
  • The 30-second timeout is a conservative default - may need tuning based on production event handler performance
  • All event handlers for CrewKickoff* events are synchronous (verified during investigation), so the blocking wait should be safe
  • The fix ensures each kickoff() call fully processes its event handlers before returning, preventing accumulation of pending executor tasks

Session details:

This commit fixes the threading issue where kickoff_for_each() would hang
indefinitely after successful completion. The root cause was that event
handlers scheduled in the ThreadPoolExecutor were not being awaited,
leaving non-daemon worker threads active and preventing process exit.

Changes:
1. Added _wait_for_event_handlers() helper method to safely wait on event
   bus futures with timeout (30s default) and proper error handling
2. Modified kickoff() to wait on all three lifecycle event emissions:
   - CrewKickoffStartedEvent
   - CrewKickoffCompletedEvent
   - CrewKickoffFailedEvent
3. Added is_tracing_disabled() helper to respect CREWAI_DISABLE_TRACING,
   CREWAI_DISABLE_TRACKING, and OTEL_SDK_DISABLED environment variables
4. Updated tracing enable logic to check disable flags first, preventing
   tracing setup when explicitly disabled
5. Added comprehensive tests covering:
   - kickoff_for_each() waiting for event handlers
   - kickoff() waiting for event handlers on error
   - Tracing disable flags being respected

This ensures that each kickoff() call fully processes its event handlers
before returning, preventing the accumulation of pending executor tasks
that would block process exit in kickoff_for_each().

Fixes #3871

Co-Authored-By: João <[email protected]>
@devin-ai-integration
Copy link
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

- Remove trailing whitespace from docstring in utils.py (W293)
- Fix test mocking to use class-level patches instead of instance-level
  patches so they work with crew.copy() in kickoff_for_each()
- Adjust timing thresholds slightly to account for CI variance

Co-Authored-By: João <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant