Fix closing sessions #6114

Merged on Jan 15, 2025 (83 commits)
Changes from 81 commits
Commits
3e364cb
Closing stale sessions
tofarr Jan 6, 2025
18f02e7
Merge branch 'main' into fix-closing-sessions
tofarr Jan 6, 2025
3750e5e
User id
tofarr Jan 6, 2025
753c054
Added user_id to session
tofarr Jan 6, 2025
80603e4
WIP
tofarr Jan 6, 2025
3e5ad1a
Merge branch 'main' into fix-closing-sessions
tofarr Jan 6, 2025
187b3e8
Refactor conversations
tofarr Jan 7, 2025
7c55584
Closing existing session
tofarr Jan 7, 2025
eb3bb1b
Fix test
tofarr Jan 7, 2025
675a9a0
Test fixes
tofarr Jan 7, 2025
882a7e7
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
1845a41
WIP
tofarr Jan 7, 2025
c78c549
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
559fa85
Emit stopped event when stopping session
tofarr Jan 7, 2025
48e27a4
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
b2a0de2
WIP
tofarr Jan 7, 2025
7910c12
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
9c649fc
Changed name as suggested
tofarr Jan 7, 2025
bf9cd2a
Merge branch 'fix-closing-sessions' of github.com:All-Hands-AI/OpenHa…
tofarr Jan 7, 2025
8ff5e95
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
be9eaac
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
bab53a0
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
88af9f8
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
a008351
Remote check fix
tofarr Jan 7, 2025
0c92868
Merge branch 'fix-closing-sessions' of github.com:All-Hands-AI/OpenHa…
tofarr Jan 7, 2025
0d0d5f9
Merge branch 'main' into fix-closing-sessions
tofarr Jan 7, 2025
ddad146
Timezones are in UTC
tofarr Jan 7, 2025
30d8dc5
Minio did not delete directories. Now it does.
tofarr Jan 8, 2025
72dac8c
Renamed oh_event to session_msg
tofarr Jan 8, 2025
235725a
Fix types
tofarr Jan 8, 2025
411e0d1
Merge branch 'main' into fix-closing-sessions
tofarr Jan 8, 2025
73744f5
Merge branch 'main' into fix-closing-sessions
tofarr Jan 8, 2025
edfbc2e
Merge branch 'main' into fix-closing-sessions
tofarr Jan 8, 2025
02d6c56
WIP
tofarr Jan 8, 2025
334c849
Merge branch 'fix-closing-sessions' of github.com:All-Hands-AI/OpenHa…
tofarr Jan 8, 2025
26a0183
Merge branch 'main' into fix-closing-sessions
tofarr Jan 8, 2025
d38783c
Merge branch 'main' into fix-closing-sessions
tofarr Jan 8, 2025
09df0d0
Not sending messages if not needed
tofarr Jan 8, 2025
365ccb7
Do not send messages unless needed
tofarr Jan 8, 2025
bcc4657
Removed unneeded code
tofarr Jan 8, 2025
76496b6
Revert
tofarr Jan 8, 2025
563a25d
Fix iteration bug
tofarr Jan 8, 2025
f98f71d
Merge branch 'main' into fix-closing-sessions
tofarr Jan 8, 2025
3611ad1
Added try catch for more resiliency
tofarr Jan 8, 2025
24db670
Merge branch 'fix-closing-sessions' of github.com:All-Hands-AI/OpenHa…
tofarr Jan 8, 2025
b061c26
Setting keep runtime alive to false
tofarr Jan 8, 2025
869f47d
Merge branch 'main' into fix-closing-sessions
tofarr Jan 9, 2025
b77c8be
Merge branch 'main' into fix-closing-sessions
tofarr Jan 9, 2025
d88e5cc
Fix for FD
tofarr Jan 9, 2025
2af5ae0
Event stream close handles its own unsubscribe
tofarr Jan 9, 2025
43b7754
Reduced default close delay to 60 seconds
tofarr Jan 9, 2025
78416b4
Changed close delay default to 15 seconds
tofarr Jan 9, 2025
0f92daf
Merge branch 'main' into fix-closing-sessions
tofarr Jan 10, 2025
73d42f9
Updated started at
tofarr Jan 10, 2025
b21764a
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
b4a4f7b
Closing
tofarr Jan 13, 2025
6f677aa
Make more like main
tofarr Jan 13, 2025
773b453
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
e197e70
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
5f316aa
Feature: User id propagation
tofarr Jan 13, 2025
4861011
Merge branch 'main' into feat-user-id-propagation
tofarr Jan 13, 2025
71497e3
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
f80cd6f
Merge branch 'main' into feat-user-id-propagation
tofarr Jan 13, 2025
a187670
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
9c5bcdc
Revert because it would break eval
tofarr Jan 13, 2025
31b9d82
Merge branch 'feat-user-id-propagation' into fix-closing-sessions
tofarr Jan 13, 2025
beb4077
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
b5cfd80
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
48fb12f
Fix for FD leak
tofarr Jan 13, 2025
af7c572
Merge branch 'main' into fix-closing-sessions
tofarr Jan 13, 2025
1dafcf7
FD Leak fix
tofarr Jan 14, 2025
3e77d7e
Consistent null check
tofarr Jan 14, 2025
dc65c7b
Fix null check
tofarr Jan 14, 2025
ab22b75
WIP
tofarr Jan 14, 2025
5a76048
Merge branch 'main' into fix-closing-sessions
tofarr Jan 14, 2025
e274031
Stop a runtime only if it is started or is taking too long to start
tofarr Jan 14, 2025
00aae30
Merge branch 'main' into fix-closing-sessions
tofarr Jan 14, 2025
4bf3e7d
User id to str
tofarr Jan 14, 2025
d4fecbc
Clean up the get_state method
tofarr Jan 14, 2025
602b370
Merge branch 'main' into fix-closing-sessions
tofarr Jan 14, 2025
411514b
Merge branch 'main' into fix-closing-sessions
tofarr Jan 15, 2025
3f0eac2
Reduced cleanup interval
tofarr Jan 15, 2025
6997cb8
Merge branch 'fix-closing-sessions' of github.com:All-Hands-AI/OpenHa…
tofarr Jan 15, 2025
4 changes: 2 additions & 2 deletions openhands/core/config/sandbox_config.py
@@ -41,7 +41,7 @@ class SandboxConfig:

     remote_runtime_api_url: str = 'http://localhost:8000'
     local_runtime_url: str = 'http://localhost'
-    keep_runtime_alive: bool = True
+    keep_runtime_alive: bool = False
     rm_all_containers: bool = False
     api_key: str | None = None
     base_container_image: str = 'nikolaik/python-nodejs:python3.12-nodejs22'  # default to nikolaik/python-nodejs:python3.12-nodejs22 for eventstream runtime
@@ -60,7 +60,7 @@ class SandboxConfig:
     runtime_startup_env_vars: dict[str, str] = field(default_factory=dict)
     browsergym_eval_env: str | None = None
     platform: str | None = None
-    close_delay: int = 900
+    close_delay: int = 15
     remote_runtime_resource_factor: int = 1
     enable_gpu: bool = False
     docker_runtime_kwargs: str | None = None
7 changes: 3 additions & 4 deletions openhands/runtime/builder/remote.py
@@ -9,6 +9,7 @@
 from openhands.core.logger import openhands_logger as logger
 from openhands.runtime.builder import RuntimeBuilder
 from openhands.runtime.utils.request import send_request
+from openhands.utils.http_session import HttpSession
 from openhands.utils.shutdown_listener import (
     should_continue,
     sleep_if_should_continue,
@@ -18,12 +19,10 @@
 class RemoteRuntimeBuilder(RuntimeBuilder):
     """This class interacts with the remote Runtime API for building and managing container images."""

-    def __init__(
-        self, api_url: str, api_key: str, session: requests.Session | None = None
-    ):
+    def __init__(self, api_url: str, api_key: str, session: HttpSession | None = None):
         self.api_url = api_url
         self.api_key = api_key
-        self.session = session or requests.Session()
+        self.session = session or HttpSession()
         self.session.headers.update({'X-API-Key': self.api_key})

     def build(
@@ -35,6 +35,7 @@
 from openhands.runtime.base import Runtime
 from openhands.runtime.plugins import PluginRequirement
 from openhands.runtime.utils.request import send_request
+from openhands.utils.http_session import HttpSession


 class ActionExecutionClient(Runtime):
@@ -55,7 +56,7 @@ def __init__(
         attach_to_existing: bool = False,
         headless_mode: bool = True,
     ):
-        self.session = requests.Session()
+        self.session = HttpSession()
         self.action_semaphore = threading.Semaphore(1)  # Ensure one action at a time
         self._runtime_initialized: bool = False
         self._vscode_token: str | None = None  # initial dummy value
7 changes: 4 additions & 3 deletions openhands/runtime/utils/request.py
@@ -4,6 +4,7 @@
 import requests
 from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

+from openhands.utils.http_session import HttpSession
 from openhands.utils.tenacity_stop import stop_if_should_exit


@@ -34,7 +35,7 @@ def is_retryable_error(exception):
     wait=wait_exponential(multiplier=1, min=4, max=60),
 )
 def send_request(
-    session: requests.Session,
+    session: HttpSession,
     method: str,
     url: str,
     timeout: int = 10,
@@ -48,11 +49,11 @@ def send_request(
             _json = response.json()
         except (requests.exceptions.JSONDecodeError, json.decoder.JSONDecodeError):
             _json = None
-        finally:
-            response.close()
         raise RequestHTTPError(
             e,
             response=e.response,
             detail=_json.get('detail') if _json is not None else None,
         ) from e
+    finally:
+        response.close()
     return response
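The last hunk above moves `response.close()` from the inner JSON-parsing `try` to a `finally` clause on the outer one, so the response is closed whether the request succeeds or raises. The following is a minimal sketch of that pattern only. `FakeResponse` and `send` are hypothetical stand-ins and do not use `requests` or the project's `HttpSession`, whose implementation is not shown in this diff.

```python
class FakeResponse:
    """Hypothetical stand-in for an HTTP response; records whether close() ran."""

    def __init__(self, status_code: int):
        self.status_code = status_code
        self.closed = False

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f'HTTP {self.status_code}')

    def close(self) -> None:
        self.closed = True


def send(response: FakeResponse) -> FakeResponse:
    """Check the status and always release the response afterwards."""
    try:
        response.raise_for_status()
    except RuntimeError as e:
        # Wrap the low-level error, keeping the original as the cause.
        raise ValueError('request failed') from e
    finally:
        # Runs on both the success path and the error path.
        response.close()
    return response
```

With the `finally` attached to the outer `try`, a caller never has to remember to close the response on the error path.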
2 changes: 1 addition & 1 deletion openhands/server/routes/manage_conversations.py
@@ -130,7 +130,7 @@ async def search_conversations(
         for conversation in conversation_metadata_result_set.results
         if hasattr(conversation, 'created_at')
     )
-    running_conversations = await session_manager.get_agent_loop_running(
+    running_conversations = await session_manager.get_running_agent_loops(
         get_user_id(request), set(conversation_ids)
     )
     result = ConversationInfoResultSet(
31 changes: 19 additions & 12 deletions openhands/server/session/agent_session.py
@@ -1,4 +1,5 @@
 import asyncio
+import time
 from typing import Callable, Optional

 from openhands.controller import AgentController
@@ -16,7 +17,7 @@
 from openhands.runtime.base import Runtime
 from openhands.security import SecurityAnalyzer, options
 from openhands.storage.files import FileStore
-from openhands.utils.async_utils import call_async_from_sync, call_sync_from_async
+from openhands.utils.async_utils import call_sync_from_async
 from openhands.utils.shutdown_listener import should_continue

 WAIT_TIME_BEFORE_CLOSE = 300
Collaborator:

I'm inclined to reduce this to something like 30, which would make problems more apparent and easier to debug. Any concerns with that?

Collaborator Author:

The remote runtime can occasionally take more than 30 seconds to start for me. I'll reduce it to 90 for now, and we can revisit later.
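The exchange above is about how long to wait for a slow start before giving up. As a rough illustration, here is a deadline-based wait measured from the recorded start time rather than from a running counter. The function name `wait_until_started` and its parameters are hypothetical and are not taken from the PR.

```python
import asyncio
import time
from typing import Callable


async def wait_until_started(
    is_starting: Callable[[], bool],
    started_at: float,
    timeout: float,
    interval: float = 0.01,
) -> bool:
    """Return True once startup has finished, False if the deadline passes first."""
    while is_starting():
        # The deadline is anchored to the start time, so time already spent
        # starting before this call counts against the budget.
        if time.time() > started_at + timeout:
            return False
        await asyncio.sleep(interval)
    return True
```

A shorter `timeout` surfaces stuck startups sooner, at the cost of giving up on runtimes that are merely slow, which is the trade-off the two comments weigh.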

@@ -36,7 +37,8 @@ class AgentSession:
     controller: AgentController | None = None
     runtime: Runtime | None = None
     security_analyzer: SecurityAnalyzer | None = None
-    _initializing: bool = False
+    _starting: bool = False
+    _started_at: float = 0
     _closed: bool = False
     loop: asyncio.AbstractEventLoop | None = None

@@ -88,7 +90,8 @@ async def start(
         if self._closed:
             logger.warning('Session closed before starting')
             return
-        self._initializing = True
+        self._starting = True
+        self._started_at = time.time()
         self._create_security_analyzer(config.security.security_analyzer)
         await self._create_runtime(
             runtime_name=runtime_name,
@@ -109,24 +112,19 @@ async def start(
         self.event_stream.add_event(
             ChangeAgentStateAction(AgentState.INIT), EventSource.ENVIRONMENT
         )
-        self._initializing = False
+        self._starting = False

-    def close(self):
+    async def close(self):
Collaborator:

I've been trying to get rid of async close methods. Not sure if that's a goal worth pursuing; it doesn't have to block this PR.

Collaborator Author:

This is now async because it sends a final message down any connected socket indicating that the session is closing. (That way, if a user deletes a conversation to which they are connected, they get an appropriate message.)

Collaborator:

FWIW, not about agent session, but it's generally a goal worth pursuing where possible IMHO. Timing can make things more complex/fragile in the execution of a multi-agent run, if some events may in theory come after their controller is closed or vice versa.

         """Closes the Agent session"""
         if self._closed:
             return
         self._closed = True
-        call_async_from_sync(self._close)
-
-    async def _close(self):
-        seconds_waited = 0
-        while self._initializing and should_continue():
+        while self._starting and should_continue():
             logger.debug(
                 f'Waiting for initialization to finish before closing session {self.sid}'
             )
             await asyncio.sleep(WAIT_TIME_BEFORE_CLOSE_INTERVAL)
-            seconds_waited += WAIT_TIME_BEFORE_CLOSE_INTERVAL
-            if seconds_waited > WAIT_TIME_BEFORE_CLOSE:
+            if time.time() <= self._started_at + WAIT_TIME_BEFORE_CLOSE:
                 logger.error(
                     f'Waited too long for initialization to finish before closing session {self.sid}'
                 )
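The review thread on `close` explains that the method became async so it can send a final message to any connected socket. The following sketch shows that shape under stated assumptions: `SketchSession`, the `send` callable, and the message payload are all hypothetical, and the real session and socket classes are not reproduced here.

```python
import asyncio
from typing import Awaitable, Callable


class SketchSession:
    """Hypothetical session that notifies its client once when closing."""

    def __init__(self, send: Callable[[dict], Awaitable[None]]):
        self._send = send  # assumed async callable that writes to the socket
        self._closed = False

    async def close(self) -> None:
        if self._closed:
            return  # idempotent: a second close sends nothing
        self._closed = True
        # Awaiting here is what forces close() to be a coroutine.
        await self._send({'status': 'closing'})


async def demo() -> list[dict]:
    sent: list[dict] = []

    async def fake_send(msg: dict) -> None:
        sent.append(msg)

    session = SketchSession(fake_send)
    await session.close()
    await session.close()
    return sent
```

Setting the closed flag before the first `await` keeps a concurrent second call from sending the message twice.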
@@ -311,3 +309,12 @@ def _maybe_restore_state(self) -> State | None:
             else:
                 logger.debug('No events found, no state to restore')
         return restored_state
+
+    def get_state(self) -> AgentState | None:
+        controller = self.controller
+        if controller:
+            return controller.state.agent_state
+        if time.time() > self._started_at + WAIT_TIME_BEFORE_CLOSE:
Collaborator:

If you take my comment above, this probably needs to change.

+            # If 5 minutes have elapsed and we still don't have a controller, something has gone wrong
+            return AgentState.ERROR
+        return None
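The decision made by `get_state` can be read as a small pure function: report the controller's state if there is one, report an error once the startup deadline has passed, and otherwise report nothing yet. The sketch below restates that logic. The `SketchState` enum, the `derive_state` name, and the injectable `now` parameter are hypothetical and are there only to make the behaviour easy to check.

```python
import time
from enum import Enum
from typing import Optional


class SketchState(Enum):
    """Hypothetical minimal stand-in for the project's AgentState."""

    RUNNING = 'running'
    ERROR = 'error'


def derive_state(
    controller_state: Optional[SketchState],
    started_at: float,
    timeout: float,
    now: Optional[float] = None,
) -> Optional[SketchState]:
    """Mirror the three outcomes of get_state for a given clock reading."""
    if controller_state is not None:
        return controller_state
    current = time.time() if now is None else now
    if current > started_at + timeout:
        # No controller after the deadline: treat startup as failed.
        return SketchState.ERROR
    return None  # still starting, no state to report yet
```

For example, with `started_at=0` and `timeout=300`, a reading at `now=10` gives `None`, while a reading at `now=301` gives `SketchState.ERROR`.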