Fix closing sessions #6114

Merged: 83 commits merged from fix-closing-sessions into main on Jan 15, 2025
Conversation

tofarr (Collaborator) commented Jan 7, 2025

End-user friendly description of the problem this fixes or functionality that this introduces

This PR improves the handling of multiple conversations and session management in OpenHands. It ensures that user workspaces are preserved even after disconnections or server restarts, and implements a smart session management system that automatically handles conversation limits.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Improved multi-conversation support with automatic session management and workspace preservation. Users can now maintain multiple conversations across different tabs while ensuring their work is preserved, even after disconnections or server restarts.


Summary of Changes

  • Added user_id tracking to sessions for better user-specific resource management
  • Implemented proper closing of stale sessions to prevent resource leaks
  • Added "agent stopped" event emission for better frontend state management
  • Enhanced recovery mechanism to preserve workspace/files after disconnection
  • Added smart session management for handling multiple conversations

Acceptance Criteria for Multi-conversation Runtime Management

Recovery

  • Start a conversation
  • Disconnect
  • Restart the server
  • Verify workspace/files are preserved

Conversation Limits

  • Start 4 conversations in different tabs
  • First conversation goes to "agent stopped"
  • Sending a new message starts it back up, and another conversation goes to "agent stopped"
  • Verify the workspace is fully recovered

Testing Instructions

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2a0de2-nikolaik \
  --name openhands-app-b2a0de2 \
  docker.all-hands.dev/all-hands-ai/openhands:b2a0de2

tofarr marked this pull request as ready for review January 7, 2025 19:51
Comment on lines 192 to 189
if sid in self._detached_conversations:
    conversation, _ = self._detached_conversations.pop(sid)
    self._active_conversations[sid] = (conversation, 1)
    logger.info(f'Reusing detached conversation {sid}')
    return conversation
Collaborator:
why did we lose this?

Collaborator:

I guess we just leave _attached_conversations until the whole thing closes? That seems reasonable actually...

tofarr (author) commented Jan 7, 2025:

The concept of stored detached conversations was replaced with a general concept of session staleness. A session is considered stale and subject to closing if both of the following hold (see the sketch below):

  • It does not have any connections to it, AND
  • It has not had an update within the close_delay (now 15 seconds by default).

Note: I think there may actually have been a bug here before my changes, where the stale check was initialized along with the run loop and was not always being hit.
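For illustration, a minimal sketch of that staleness rule in Python (the class shape and names such as connection_count, last_updated_at, and _is_stale are assumptions for this sketch, not the actual OpenHands code):

import time

class Session:
    def __init__(self, close_delay: float = 15 * 60):
        self.close_delay = close_delay       # seconds; 15 minutes here
        self.connection_count = 0            # live socket connections
        self.last_updated_at = time.time()   # bumped on every update

    def _is_stale(self) -> bool:
        # Stale (and subject to closing) only when BOTH criteria hold.
        no_connections = self.connection_count == 0
        idle_too_long = time.time() - self.last_updated_at > self.close_delay
        return no_connections and idle_too_long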

Collaborator:

I assume you mean 15 minutes? 😅 15 seconds seems unbelievably low, just a quick tab away

tofarr (author):

Correct. It is 15 minutes. (I actually changed this from 15 seconds to 15 minutes on Monday.)

    sids = {sid for sid, _ in items}
    return sids

async def get_running_agent_loops_in_cluster(
Collaborator:

Suggested change:
- async def get_running_agent_loops_in_cluster(
+ async def get_running_agent_loops_remotely(

this seems like maybe a better name?

tofarr (author):

Done!

openhands/server/session/session.py (thread resolved)
logger.info(
    f'Attached conversations: {len(self._active_conversations)}'
)
logger.info(
    f'Detached conversations: {len(self._detached_conversations)}'
)
Collaborator:

why remove?


async def _cleanup_session_later(self, sid: str):
    # Once there have been no connections to a session for a reasonable period, we close it
    try:
        await asyncio.sleep(self.config.sandbox.close_delay)
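For context, a common pattern for this kind of delayed cleanup is to hold the task so it can be cancelled if a client reconnects. A hedged sketch, not the code from this PR (the handler names and _close_session are assumptions):

import asyncio

class SessionManager:
    def __init__(self, close_delay: float):
        self.close_delay = close_delay
        self._cleanup_tasks: dict[str, asyncio.Task] = {}

    def on_last_disconnect(self, sid: str) -> None:
        # Start the countdown when the last connection drops.
        self._cleanup_tasks[sid] = asyncio.create_task(
            self._cleanup_session_later(sid)
        )

    def on_reconnect(self, sid: str) -> None:
        # A client came back before the delay elapsed: abort the close.
        task = self._cleanup_tasks.pop(sid, None)
        if task is not None:
            task.cancel()

    async def _cleanup_session_later(self, sid: str) -> None:
        try:
            await asyncio.sleep(self.close_delay)
            await self._close_session(sid)  # assumed close routine
        except asyncio.CancelledError:
            pass  # the session became active again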
Collaborator:

We need to remove this config right?

tofarr (author) commented Jan 7, 2025:

AFAIK, there are OSS users that use this value - they have a use case where they want a session to persist for 8 hours while there is no connection to it. (As opposed to the 15 seconds we have by default)

diwu-sf (Contributor):

Yep, we have been using a long N-hours close_delay to keep our workspaces running even after every browser closes.

With this new PR, is there a better way to achieve the same effect?

tofarr (author) commented Jan 9, 2025:

@diwu-sf - The settings you currently use should be fine - but you may get away with a shorter delay because the new behavior is that a conversation will be stopped if all three of the following are true:

  1. It has not been updated in close_delay seconds.
  2. There are no connections to it.
  3. The agent is not in a running state. (This one is new!)

Now that I think about it, one thing that may affect you is that we have introduced a limit of 3 concurrent conversations per user. So if you already have 3 running and start another, one of the old ones will be stopped regardless of the 3 criteria above; this is designed to stop the system from crashing when users try to start too many concurrent docker containers. If this will affect you, we can introduce a config setting for this too.
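A minimal sketch of that eviction behavior (the class, the stop() call, and its "agent stopped" side effect are assumptions for illustration, not the PR's actual implementation):

from collections import OrderedDict

MAX_CONCURRENT_CONVERSATIONS = 3  # the limit described above

class ConversationLimiter:
    def __init__(self):
        # sid -> conversation, ordered least recently used first
        self._running = OrderedDict()

    async def start(self, sid, conversation):
        # Hitting the cap stops the oldest conversation regardless of
        # the three staleness criteria above.
        if len(self._running) >= MAX_CONCURRENT_CONVERSATIONS:
            _old_sid, oldest = self._running.popitem(last=False)
            await oldest.stop()  # assumed to emit "agent stopped" to the frontend
        self._running[sid] = conversation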

tofarr mentioned this pull request Jan 14, 2025
@@ -111,14 +111,11 @@ async def start(
        )
        self._initializing = False

-    def close(self):
+    async def close(self):
Collaborator:

I've been trying to get rid of async close methods. Not sure if that's a goal worth pursuing, doesn't have to block this PR

tofarr (author):

The reason this is now async is that it sends a final message down any connected socket indicating that the session is closing. (This is so that if a user deletes a conversation to which they are connected, they get an appropriate message.)
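A minimal sketch of an async close in this spirit (the event payload and both helper names are assumptions, not the PR's API):

async def close(self):
    # Tell any connected clients the session is going away, then tear down.
    await self._emit_to_clients({'status': 'session_closing'})  # assumed helper
    await self._shutdown_runtime()                              # assumed teardown hook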

Collaborator:

FWIW, not about agent session, but it's generally a goal worth pursuing where possible IMHO. Timing can make things more complex/fragile in the execution of a multi-agent run, if some events may in theory come after their controller is closed or vice versa.

@@ -16,7 +17,7 @@
 from openhands.runtime.base import Runtime
 from openhands.security import SecurityAnalyzer, options
 from openhands.storage.files import FileStore
-from openhands.utils.async_utils import call_async_from_sync, call_sync_from_async
+from openhands.utils.async_utils import call_sync_from_async
 from openhands.utils.shutdown_listener import should_continue

 WAIT_TIME_BEFORE_CLOSE = 300
Collaborator:

I'm inclined to reduce this to something like 30, which would make problems more apparent and easier to debug. Any concerns with that?

tofarr (author):

The remote runtime can occasionally take more than 30 seconds to start for me. I'll reduce it to 90 for now, and we can revisit later.

controller = self.controller
if controller:
    return controller.state.agent_state
if time.time() > self._started_at + WAIT_TIME_BEFORE_CLOSE:
Collaborator:

If you take my comment above this probably needs to change
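For reference, a hedged sketch of how that fallback might read once the constant is reduced (the AgentState enum and its values are assumptions standing in for OpenHands' own definitions):

import time
from enum import Enum

class AgentState(Enum):  # assumed stand-in for the real enum
    LOADING = 'loading'
    STOPPED = 'stopped'

WAIT_TIME_BEFORE_CLOSE = 90  # reduced from 300 per the discussion above

def get_agent_state(self):
    controller = self.controller
    if controller:
        return controller.state.agent_state
    # No controller yet: after the grace period, report the session as
    # stopped so it becomes eligible for cleanup.
    if time.time() > self._started_at + WAIT_TIME_BEFORE_CLOSE:
        return AgentState.STOPPED
    return AgentState.LOADING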

    )
    running_sids.union(running_cluster_sids)
    return running_sids

    logger.warning(f'error_cleaning_stale: {str(e)}')
Collaborator:

This is a very large block to have a blanket exception catch. It worries me a bit. This should probably be logged at error level at least, since it's unexpected.

tofarr (author):

Catching Exception (rather than BaseException) means this does not catch things like KeyboardInterrupt; it is really just to make sure that we don't stop cleaning up stale conversations due to an unexpected error.
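A minimal sketch of that pattern (the loop function, manager.cleanup_stale, and the 15-second interval are assumptions for illustration):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def run_stale_cleanup_loop(manager):
    # One failed pass must not stop stale-conversation cleanup for good.
    while True:
        try:
            await manager.cleanup_stale()  # assumed cleanup entry point
        except Exception as e:
            # Exception (not BaseException) deliberately lets
            # KeyboardInterrupt and SystemExit propagate.
            logger.warning(f'error_cleaning_stale: {str(e)}')
        await asyncio.sleep(15)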

rbren (Collaborator) commented Jan 15, 2025

Comments here are non-blocking.

Let's just make sure the latest commit has been thoroughly tested (especially in a multi-replica mode, and especially with multiple users connected) before merging

tofarr enabled auto-merge (squash) January 15, 2025 16:41
tofarr disabled auto-merge January 15, 2025 16:57
tofarr merged commit 8795ee6 into main Jan 15, 2025
13 checks passed
tofarr deleted the fix-closing-sessions branch January 15, 2025 17:04
xingyaoww (Collaborator) commented Jan 15, 2025

Looks like this caused the /download endpoint to stop working.

e.g., This commit adds new test cases that rely on the copy_from method, which uses /download.

CI error: https://github.com/All-Hands-AI/OpenHands/actions/runs/12793391296/job/35666481217

I just reproduced this locally; after reverting this PR, the test started working. With these commits, the zip downloaded from the runtime has a size of 0B.

xingyaoww added commits that referenced this pull request Jan 16, 2025
csmith49 pushed a commit to csmith49/OpenHands that referenced this pull request Jan 19, 2025
kripper commented Jan 21, 2025

@tofarr #6382
