@prashant182 prashant182 commented Aug 6, 2025

Summary

  • Fixed a critical resource cleanup bug in LocalBackend.close(): async service close methods were called but never awaited, so their cleanup code did not run
  • Made LocalBackend.close() async to properly handle async service cleanup and match the base Backend class interface
  • Updated __exit__ method to handle both sync and async contexts appropriately

Problem

The close() method was synchronous but attempted to call close() on services that may have async close methods. This meant:

  • Async close methods returned coroutine objects instead of executing
  • Memory leaks and zombie processes accumulated over time
  • Port conflicts occurred when restarting services
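
A minimal sketch of the failure mode described above, not the actual LocalBackend code. `AsyncService` and `buggy_close` are hypothetical names used only for illustration:

```python
import asyncio
import inspect
from typing import Any


class AsyncService:
    """Hypothetical stand-in for a service whose close() is async."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


def buggy_close(service: AsyncService) -> Any:
    # Calling an async method without awaiting it only creates a coroutine
    # object; the body of close() never runs.
    return service.close()


service = AsyncService()
result = buggy_close(service)
is_coroutine = inspect.iscoroutine(result)
closed_after_buggy_call = service.closed
result.close()  # discard the un-awaited coroutine to avoid a RuntimeWarning

# Awaiting the coroutine (here through asyncio.run) is what runs the cleanup.
asyncio.run(service.close())
closed_after_await = service.closed
```

In this sketch the synchronous call leaves `service.closed` unchanged, which is how cleanup can be skipped without any error being raised.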

Solution

  • Made LocalBackend.close() async and added proper async/sync detection
  • Service close methods are now properly awaited if they're async coroutines
  • Maintained backward compatibility for synchronous close methods
  • Updated __exit__ method to handle event loop contexts correctly
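
One way the async/sync detection could look, as a sketch under stated assumptions rather than the code in this PR. `SketchBackend`, `SyncService`, `AsyncService`, and the `_services` dict are hypothetical; the real LocalBackend may store and close its services differently:

```python
import asyncio
import inspect


class SyncService:
    """Hypothetical service with a plain synchronous close()."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class AsyncService:
    """Hypothetical service with an async close()."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        await asyncio.sleep(0)
        self.closed = True


class SketchBackend:
    """Simplified stand-in for LocalBackend (assumption, not the real class)."""

    def __init__(self, services: dict) -> None:
        self._services = dict(services)

    async def close(self) -> None:
        for service in self._services.values():
            close = getattr(service, "close", None)
            if close is None:
                continue
            result = close()
            # Sync close() has already run at this point; async close()
            # returned an awaitable that still has to be awaited.
            if inspect.isawaitable(result):
                await result
        self._services.clear()


sync_service = SyncService()
async_service = AsyncService()
backend = SketchBackend({"sync": sync_service, "async": async_service})
asyncio.run(backend.close())
```

Checking the return value with `inspect.isawaitable` rather than inspecting the method itself also covers a sync method that happens to return an awaitable.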

Impact

This fix prevents:

  • GPU memory leaks from improperly closed vLLM engines
  • Zombie training processes
  • Port conflicts on service restart

Critical for production stability and cost management in ML training environments.

Test Plan

  • All existing tests pass
  • Code formatting and linting checks pass
  • Manual verification that both sync and async service close methods work correctly

@corbt
Contributor

corbt commented Aug 7, 2025

@JonesAndrew could you take a look at this? I think you played with the closing logic at one point.

@giladfrid009
Contributor

I would also propose finishing the active wandb runs stored in LocalBackend._wandb_runs, which flushes their logged results and updates their status to "finished" rather than "crashed".
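
A rough sketch of what that proposal might look like. It assumes `_wandb_runs` is a dict mapping a name to a run object; that layout is a guess, not something confirmed in this thread. `FakeRun` is a hypothetical stand-in so the example runs without wandb installed, and it models only the `finish()` method that real wandb runs provide:

```python
class FakeRun:
    """Hypothetical stand-in for a wandb run; only finish() is modeled."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.finished = False

    def finish(self) -> None:
        self.finished = True


def finish_wandb_runs(runs: dict) -> list:
    """Finish every tracked run, then forget it. Returns the finished names."""
    finished = []
    # Iterate over a copy so the dict can be cleared safely afterwards.
    for name, run in list(runs.items()):
        run.finish()
        finished.append(name)
    runs.clear()
    return finished


run_a, run_b = FakeRun("model-a"), FakeRun("model-b")
tracked = {"model-a": run_a, "model-b": run_b}
finished_names = finish_wandb_runs(tracked)
```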

@corbt
Contributor

corbt commented Aug 7, 2025

@giladfrid009 makes sense, would you mind opening a PR?

Collaborator

@JonesAndrew JonesAndrew left a comment

Thanks for the PR!

def __enter__(self):
    return self

def __exit__(self, *excinfo):
Collaborator

I think I'd like to make this an __aexit__ if it needs to be async, just so no one tries to use the sync context manager inside an event loop and gets confused about why it doesn't work.

Author

Makes sense, I'll get it out shortly

Collaborator

@bradhilton Do you know whether your close changes interact with this at all? I might be a little sleepy right now, but I think this PR might not be needed anymore. I'm wondering if you have more context from the change you just made.

Collaborator

yep, I think I addressed the issue

prashant182 and others added 3 commits August 8, 2025 23:38
The close() method was calling service.close() without checking if it was
async, causing resource leaks when services had async cleanup methods.
This fix:
- Makes LocalBackend.close() async to match the base Backend class
- Properly awaits async service close methods
- Maintains backward compatibility for sync close methods
- Handles __exit__ context manager compatibility

Prevents GPU memory leaks and zombie processes in production deployments.
…nforce async usage inside event loop

- Implement __aenter__/__aexit__ to properly await cleanup
- Make __exit__ raise if used under a running event loop, guiding to 'async with'
- Keep sync 'with' working when no loop is running by calling asyncio.run(close())
… manager (__aenter__/__aexit__) for LocalBackend
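
The context manager behavior described in these commit messages can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: `SketchBackend` is a hypothetical stand-in for LocalBackend, and the error message wording is invented.

```python
import asyncio


class SketchBackend:
    """Simplified stand-in for LocalBackend (assumption, not the real class)."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        await asyncio.sleep(0)
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, *excinfo):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            loop_running = False  # no loop in this thread
        else:
            loop_running = True
        if loop_running:
            # asyncio.run() cannot be called from a running loop, so fail
            # loudly and point the caller at the async form.
            raise RuntimeError("event loop is running; use 'async with'")
        asyncio.run(self.close())

    async def __aenter__(self):
        return self

    async def __aexit__(self, *excinfo):
        await self.close()


# Sync usage, no event loop running: __exit__ drives close() to completion.
with SketchBackend() as sync_backend:
    pass


async def main():
    async with SketchBackend() as backend:
        pass
    try:
        with SketchBackend():  # sync form inside a running loop
            pass
    except RuntimeError:
        raised = True
    else:
        raised = False
    return backend.closed, raised


async_closed, raised_in_loop = asyncio.run(main())
```

Raising from `__exit__` under a running loop trades convenience for clarity: the misuse surfaces immediately instead of silently skipping cleanup.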
Author

@prashant182 prashant182 left a comment

Please check

6 participants