-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding exception handling for room start tasks #30
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jzhang20133.
I suggest to go a step further: stop serving a client if it fails, and stop writing to the store if it fails.
I'll merge #24 first, I don't think it will affect this PR too much, but you may have to rebase.
pycrdt_websocket/yroom.py
Outdated
self.log.debug("Writing Y update to YStore") | ||
self._task_group.start_soon(self.ystore.write, update) | ||
except Exception as e: | ||
self.log.error("Error broadcast updates for room:", exc_info=e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.log.error("Error broadcast updates for room:", exc_info=e) | |
self.log.error("Error broadcasting YRoom updates", exc_info=e) |
pycrdt_websocket/yroom.py
Outdated
for client in self.clients: | ||
self.log.debug("Sending Y update to client with endpoint: %s", client.path) | ||
message = create_update_message(update) | ||
self._task_group.start_soon(client.send, message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove a client if we cannot send them updates? Otherwise it will likely fail on every updated.
for client in self.clients: | |
self.log.debug("Sending Y update to client with endpoint: %s", client.path) | |
message = create_update_message(update) | |
self._task_group.start_soon(client.send, message) | |
for client in self.clients: | |
try: | |
self.log.debug("Sending Y update to client with endpoint: %s", client.path) | |
message = create_update_message(update) | |
self._task_group.start_soon(client.send, message) | |
except Exception as e: | |
self.log.error("Error sending Y update to client with endpoint: %s", client.path, exc_info=e) | |
self.clients.remove(client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For send method in YDocWebsocketHandler class, it already have a try except and will capture all exception and log debug message. https://github.com/jupyterlab/jupyter-collaboration/blob/adcde3229c986ffe498b8e6a18857195e1605ace/projects/jupyter-server-ydoc/jupyter_server_ydoc/handlers.py#L222 It is unlikely this client.send flow that raise exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also if just one send failure for temporary reason, we should probably not remove client. But for some specific exception like WebSocketClosedError, we should remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also if just one send failure for temporary reason, we should probably not remove client.
Yeah, I think we need some tolerance here. Perhaps we have a time-based tolerance before removing the client. If it's not responding for some fixed time, say 1 minute, remove it. For tornado-based servers, we should likely match the websocket ping timeout configured by jupyter_server..
But for some specific exception like WebSocketClosedError, we should remove.
As @davidbrochart already mentioned, this package is agnostic to the server framework, so we can't use this error type. However, if we add a time-based tolerance for exception, we don't have to be so specific about the exception type here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will remove that WebsocketClosedError for now.
pycrdt_websocket/yroom.py
Outdated
if self.ystore: | ||
self.log.debug("Writing Y update to YStore") | ||
self._task_group.start_soon(self.ystore.write, update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we stop writing to YStore if it fails?
if self.ystore: | |
self.log.debug("Writing Y update to YStore") | |
self._task_group.start_soon(self.ystore.write, update) | |
if self.ystore: | |
try: | |
self._task_group.start_soon(self.ystore.write, update) | |
self.log.debug("Writing Y update to YStore") | |
except Exception as e: | |
self.ystore = None | |
self.log.error("Error writing Y update to YStore", exc_info=e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the Update write is caused by a temporary issue and next write restores, would the missing of one Y update cause any issue for future updates or it could auto recovers? I have been purposely adding exception in ystore update flow but I did not notice any obvious issue in UI.
@@ -229,7 +232,7 @@ async def serve(self, websocket: Websocket): | |||
) | |||
tg.start_soon(client.send, message) | |||
except Exception as e: | |||
self.log.debug("Error serving endpoint: %s", websocket.path, exc_info=e) | |||
self.log.error("Error serving endpoint: %s", websocket.path, exc_info=e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove a client if we cannot serve them?
self.log.error("Error serving endpoint: %s", websocket.path, exc_info=e) | |
self.log.error("Error serving endpoint: %s", websocket.path, exc_info=e) | |
self.clients.remove(websocket) |
64114b9
to
598301f
Compare
b52e3c8
to
b806317
Compare
@davidbrochart and @Zsailer would you like to review this PR again? |
@jzhang20133 in the future could you create a new branch from which you send a PR? I see your branch is |
I just rebased. |
@@ -16,6 +16,7 @@ | |||
from anyio.abc import TaskGroup, TaskStatus | |||
from anyio.streams.memory import MemoryObjectReceiveStream, MemoryObjectSendStream | |||
from pycrdt import Doc, Subscription | |||
from tornado.websocket import WebSocketClosedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pycrdt-websocket
is server agnostic, Tornado cannot be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Just to explain the reasoning here, @jzhang20133. Branching helps maintainers, so they can help you rebase or push changes directly to your branch without contaminating their local |
I'd like to align this PR with #31, so that we don't directly handle exceptions in |
Sounds good, I will address those. @davidbrochart @Zsailer |
Sometimes, we have observed issues that task group in websocket_server is no longer active and user will be stuck after that and no longer able to access files and have to restart nb servers to unblock.
_task_group
instance variable in websocket_server will become inactive when one of its tasks of starting room or its children task group task fail with exception.One place where an exception can occur is in the
_broadcast_update
method, which is a key task in children task group of_task_group
in websocket_server. In this PR, the exception is handled and logged to prevent the parent task from crashing while still displaying the issue that will allow us better handle them in the future.Resolving open source issues:
jupyterlab/jupyter-collaboration#290
jupyterlab/jupyter-collaboration#245