Fix closing sessions #6114

Merged: 83 commits merged from fix-closing-sessions into main on Jan 15, 2025
Conversation

tofarr (Collaborator) commented Jan 7, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

This PR improves the handling of multiple conversations and session management in OpenHands. It ensures that user workspaces are preserved even after disconnections or server restarts, and implements a smart session management system that automatically handles conversation limits.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Improved multi-conversation support with automatic session management and workspace preservation. Users can now maintain multiple conversations across different tabs while ensuring their work is preserved, even after disconnections or server restarts.


Summary of Changes

  • Added user_id tracking to sessions for better user-specific resource management
  • Implemented proper closing of stale sessions to prevent resource leaks
  • Added "agent stopped" event emission for better frontend state management
  • Enhanced recovery mechanism to preserve workspace/files after disconnection
  • Added smart session management for handling multiple conversations

Acceptance Criteria for Multi-conversation Runtime Management

Recovery

  • Start a conversation
  • Disconnect
  • Restart the server
  • Verify workspace/files are preserved

Conversation Limits

  • Start 4 conversations in different tabs
  • First conversation goes to "agent stopped"
  • Sending a new message starts it back up, and another conversation goes to "agent stopped"
  • Verify the workspace is fully recovered

Testing Instructions

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2a0de2-nikolaik \
  --name openhands-app-b2a0de2 \
  docker.all-hands.dev/all-hands-ai/openhands:b2a0de2

tofarr marked this pull request as ready for review January 7, 2025 19:51
Comment on lines 192 to 189
if sid in self._detached_conversations:
    conversation, _ = self._detached_conversations.pop(sid)
    self._active_conversations[sid] = (conversation, 1)
    logger.info(f'Reusing detached conversation {sid}')
    return conversation
Collaborator:
why did we lose this?

Collaborator:

I guess we just leave _attached_conversations until the whole thing closes? That seems reasonable actually...

tofarr (author) commented Jan 7, 2025:

The concept of stored detached conversations was replaced with a general concept of session staleness. A session is considered stale and subject to closing if both of the following hold (see the sketch below):

  • It does not have any connections to it, AND
  • It has not had an update within the close_delay (now 15 seconds by default).

Note: I think there may actually have been a bug here before my changes, where the stale check was initialized along with the run loop and was not always being hit.
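For illustration, a minimal sketch of that staleness rule in Python (the class shape and names such as connection_count, last_updated_at, and _is_stale are assumptions for this sketch, not the actual OpenHands code):

import time

class Session:
    def __init__(self, close_delay: float = 15 * 60):
        self.close_delay = close_delay       # seconds; 15 minutes here
        self.connection_count = 0            # live socket connections
        self.last_updated_at = time.time()   # bumped on every update

    def _is_stale(self) -> bool:
        # Stale (and subject to closing) only when BOTH criteria hold.
        no_connections = self.connection_count == 0
        idle_too_long = time.time() - self.last_updated_at > self.close_delay
        return no_connections and idle_too_long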

Collaborator:

I assume you mean 15 minutes? 😅 15 seconds seems unbelievably low, just a quick tab away

tofarr (author):

Correct. It is 15 minutes. (I actually changed this from 15 seconds to 15 minutes on Monday.)

    sids = {sid for sid, _ in items}
    return sids

async def get_running_agent_loops_in_cluster(
Collaborator:

Suggested change:
- async def get_running_agent_loops_in_cluster(
+ async def get_running_agent_loops_remotely(

this seems like maybe a better name?

tofarr (author):

Done!

openhands/server/session/session.py (thread resolved)
logger.info(
    f'Attached conversations: {len(self._active_conversations)}'
)
logger.info(
    f'Detached conversations: {len(self._detached_conversations)}'
)
Collaborator:

why remove?


async def _cleanup_session_later(self, sid: str):
    # Once there have been no connections to a session for a reasonable period, we close it
    try:
        await asyncio.sleep(self.config.sandbox.close_delay)
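For context, a common pattern for this kind of delayed cleanup is to hold the task so it can be cancelled if a client reconnects. A hedged sketch, not the code from this PR (the handler names and _close_session are assumptions):

import asyncio

class SessionManager:
    def __init__(self, close_delay: float):
        self.close_delay = close_delay
        self._cleanup_tasks: dict[str, asyncio.Task] = {}

    def on_last_disconnect(self, sid: str) -> None:
        # Start the countdown when the last connection drops.
        self._cleanup_tasks[sid] = asyncio.create_task(
            self._cleanup_session_later(sid)
        )

    def on_reconnect(self, sid: str) -> None:
        # A client came back before the delay elapsed: abort the close.
        task = self._cleanup_tasks.pop(sid, None)
        if task is not None:
            task.cancel()

    async def _cleanup_session_later(self, sid: str) -> None:
        try:
            await asyncio.sleep(self.close_delay)
            await self._close_session(sid)  # assumed close routine
        except asyncio.CancelledError:
            pass  # the session became active again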
Collaborator:

We need to remove this config right?

tofarr (author) commented Jan 7, 2025:

AFAIK, there are OSS users that use this value - they have a use case where they want a session to persist for 8 hours while there is no connection to it. (As opposed to the 15 seconds we have by default)

diwu-sf (Contributor):

Yep, we have been using a long N-hours close_delay to keep our workspaces running even after every browser closes.

With this new PR, is there a better way to achieve the same effect?

tofarr (author) commented Jan 9, 2025:

@diwu-sf - The settings you currently use should be fine - but you may get away with a shorter delay because the new behavior is that a conversation will be stopped if all three of the following are true:

  1. It has not been updated in close_delay seconds.
  2. There are no connections to it.
  3. The agent is not in a running state. (This one is new!)

Now that I think about it, one thing that may affect you is that we have introduced a limit of 3 concurrent conversations per user. So if you already have 3 running and start another, one of the old ones will be stopped regardless of the 3 criteria above; this is designed to stop the system from crashing when users try to start too many concurrent docker containers. If this will affect you, we can introduce a config setting for this too.
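A minimal sketch of that eviction behavior (the class, the stop() call, and its "agent stopped" side effect are assumptions for illustration, not the PR's actual implementation):

from collections import OrderedDict

MAX_CONCURRENT_CONVERSATIONS = 3  # the limit described above

class ConversationLimiter:
    def __init__(self):
        # sid -> conversation, ordered least recently used first
        self._running = OrderedDict()

    async def start(self, sid, conversation):
        # Hitting the cap stops the oldest conversation regardless of
        # the three staleness criteria above.
        if len(self._running) >= MAX_CONCURRENT_CONVERSATIONS:
            _old_sid, oldest = self._running.popitem(last=False)
            await oldest.stop()  # assumed to emit "agent stopped" to the frontend
        self._running[sid] = conversation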

tofarr mentioned this pull request Jan 14, 2025
@@ -111,14 +111,11 @@ async def start(
        )
        self._initializing = False

-    def close(self):
+    async def close(self):
Collaborator:

I've been trying to get rid of async close methods. Not sure if that's a goal worth pursuing, doesn't have to block this PR

tofarr (author):

The reason this is now async is that it sends a final message down any connected socket indicating that the session is closing. (This is so that if a user deletes a conversation to which they are connected, they get an appropriate message.)
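A minimal sketch of an async close in this spirit (the event payload and both helper names are assumptions, not the PR's API):

async def close(self):
    # Tell any connected clients the session is going away, then tear down.
    await self._emit_to_clients({'status': 'session_closing'})  # assumed helper
    await self._shutdown_runtime()                              # assumed teardown hook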

Collaborator:

FWIW, not about agent session, but it's generally a goal worth pursuing where possible IMHO. Timing can make things more complex/fragile in the execution of a multi-agent run, if some events may in theory come after their controller is closed or vice versa.

@@ -16,7 +17,7 @@
 from openhands.runtime.base import Runtime
 from openhands.security import SecurityAnalyzer, options
 from openhands.storage.files import FileStore
-from openhands.utils.async_utils import call_async_from_sync, call_sync_from_async
+from openhands.utils.async_utils import call_sync_from_async
 from openhands.utils.shutdown_listener import should_continue

 WAIT_TIME_BEFORE_CLOSE = 300
Collaborator:

I'm inclined to reduce this to something like 30, which would make problems more apparent and easier to debug. Any concerns with that?

tofarr (author):

The remote runtime can occasionally take more than 30 seconds to start for me. I'll reduce it to 90 for now, and we can revisit later.

controller = self.controller
if controller:
    return controller.state.agent_state
if time.time() > self._started_at + WAIT_TIME_BEFORE_CLOSE:
Collaborator:

If you take my comment above this probably needs to change
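For reference, a hedged sketch of how that fallback might read once the constant is reduced (the AgentState enum and its values are assumptions standing in for OpenHands' own definitions):

import time
from enum import Enum

class AgentState(Enum):  # assumed stand-in for the real enum
    LOADING = 'loading'
    STOPPED = 'stopped'

WAIT_TIME_BEFORE_CLOSE = 90  # reduced from 300 per the discussion above

def get_agent_state(self):
    controller = self.controller
    if controller:
        return controller.state.agent_state
    # No controller yet: after the grace period, report the session as
    # stopped so it becomes eligible for cleanup.
    if time.time() > self._started_at + WAIT_TIME_BEFORE_CLOSE:
        return AgentState.STOPPED
    return AgentState.LOADING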

    )
    running_sids.union(running_cluster_sids)
    return running_sids

    logger.warning(f'error_cleaning_stale: {str(e)}')
Collaborator:

This is a very large block to have a blanket exception catch. It worries me a bit. This should probably be logged at error level at least, since it's unexpected.

tofarr (author):

Catching Exception (rather than BaseException) means this does not catch things like KeyboardInterrupt; it is really just to make sure that we don't stop cleaning up stale conversations due to an unexpected error.
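A minimal sketch of that pattern (the loop function, manager.cleanup_stale, and the 15-second interval are assumptions for illustration):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def run_stale_cleanup_loop(manager):
    # One failed pass must not stop stale-conversation cleanup for good.
    while True:
        try:
            await manager.cleanup_stale()  # assumed cleanup entry point
        except Exception as e:
            # Exception (not BaseException) deliberately lets
            # KeyboardInterrupt and SystemExit propagate.
            logger.warning(f'error_cleaning_stale: {str(e)}')
        await asyncio.sleep(15)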

rbren (Collaborator) commented Jan 15, 2025

Comments here are non-blocking.

Let's just make sure the latest commit has been thoroughly tested (especially in a multi-replica mode, and especially with multiple users connected) before merging

tofarr enabled auto-merge (squash) January 15, 2025 16:41
tofarr disabled auto-merge January 15, 2025 16:57
tofarr merged commit 8795ee6 into main Jan 15, 2025
13 checks passed
tofarr deleted the fix-closing-sessions branch January 15, 2025 17:04
xingyaoww (Collaborator) commented Jan 15, 2025

Looks like this caused the /download endpoint to stop working.

e.g., This commit adds new test cases that rely on the copy_from method, which uses /download.

CI error: https://github.com/All-Hands-AI/OpenHands/actions/runs/12793391296/job/35666481217

I just reproduced this locally; after reverting this PR, the test started working. With these commits, the zip downloaded from the runtime has a size of 0B.

xingyaoww added commits that referenced this pull request Jan 16, 2025
csmith49 pushed a commit to csmith49/OpenHands that referenced this pull request Jan 19, 2025
kripper commented Jan 21, 2025

@tofarr #6382
