User pod launching timeout with large single user image when cluster scaling up #3010
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
KubeSpawner registers the pod with the k8s api-server, and at that point JupyterHub's server startup timeout starts. You can configure this timeout; I typically increase it to 20 minutes when large images are involved. https://z2jh.jupyter.org/en/stable/resources/reference.html#singleuser-starttimeout
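For reference, that setting lives in the chart's Helm values; a minimal sketch (the 20-minute value is just an example, adjust it to your image pull time):

```yaml
# config.yaml passed to helm upgrade -- example value only
singleuser:
  startTimeout: 1200  # seconds (20 minutes)
```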
Thanks for the reference. We tried to increase the timeout, but our user has to wait a long time (in our case around 10 minutes) to get to the UI after logging in, even though we still have resources on an existing node to assign to the user. This happens when the user placeholder is enabled with node-density scheduling, or with default scheduling without the placeholder enabled. Since the new node has no pods running, the user pod will be scheduled there as soon as the node is ready (before the single-user image is pulled). This is kind of an edge case, as it is not triggered very often. It has to satisfy the following conditions:
- the cluster is scaling up and a new node has just become Ready,
- the single-user image is large (a few GB) or the network is slow, so the pull takes longer than the startup timeout, and
- a user pod is scheduled onto the new node before the image puller has finished.
This is a tricky topic. I've considered it in the past, and I recall I couldn't conclude on a strategy to resolve it that was simple and robust enough to be worth working towards.
If you have a strategy to handle this situation that you believe will be robust and simple - for example, not assuming too much about the cloud providers or requiring users to set up node pools with various taints etc. - then I'm open to reviewing it! I've not re-read all the notes from past considerations of this, but it is a topic I've considered in depth in the past without reaching a clear resolution. Here are some past topics that relate to some degree:
Thanks @consideRatio. I wasn't aware there had been quite extensive discussions. I think my solution is simple enough, so I'm sending in PR #3011 for review. There is one simple Go binary added to manage the taints, currently hosted in our repo. Happy to donate it to JupyterHub as well.
I understand now from the reproduction steps that we were thinking about two different situations. You considered the case where there are multiple nodes available for the user to schedule on, and I considered the case where the user pod's start triggers a scale-up because it can't fit on an existing node. Hmmm, if you are using the user-scheduler provided by the jupyterhub chart, you can make user pods pack onto the most-used node, which may help you. Then there is the use of user-placeholder pods complicating things as well. Are you currently using user-scheduler and/or user-placeholder pods @xcompass? I'd like to pinpoint the problems we could look to resolve with this new strategy that aren't resolved using existing techniques. This will be required for the documentation we need for a feature like this anyhow.
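For reference, packing and placeholders are configured via the chart's scheduling values, roughly like this (a minimal sketch; the replica count is just an example):

```yaml
# config.yaml -- pack user pods onto the most utilized node and keep
# spare capacity warm with placeholder pods
scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 2   # example value; size to the headroom you want
```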
Yes, we are using user-scheduler and user-placeholder. I remember we did the testing with both enabled. I'll redo a test to make sure. |
Bug description
During cluster scale-up, once a node is provisioned and ready in k8s, user pods may get scheduled to the newly provisioned node. However, if the continuous image puller is still pulling the single-user image because the image is large (a few GB) or the network is slow, the user will get a timeout error in the UI if the image pull has not completed by the set timeout. Increasing the timeout might be a quick fix, but it is not a good user experience, as the user may have to wait a long time to get access (for us, it may take 10 minutes).
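For reference, the image pullers involved here are enabled via the chart's prePuller values, roughly like this (a minimal sketch):

```yaml
# config.yaml -- pre-pull the single-user image
prePuller:
  hook:
    enabled: true        # pull images before a helm upgrade completes
  continuous:
    enabled: true        # DaemonSet that pulls images onto nodes as they join
```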
Expected behaviour
The user pods should not be scheduled to the new node before image pulling is done.
Actual behaviour
See Bug description
How to reproduce
Our current workaround
Add a taint before starting the image puller (or when provisioning the node) to prevent user pods from being scheduled to the node, then remove the taint after the image puller is done.
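A rough sketch of the idea, assuming a hypothetical taint key `hub.jupyter.org/image-pulling` (not necessarily the key used in PR #3011): the taint is placed on the node at provisioning time and removed once the pull completes, and the image-puller pods carry a matching toleration so they can still run on the tainted node.

```yaml
# Sketch only -- the taint key below is a hypothetical name.
# Taint set on a freshly provisioned node until the single-user image is pulled:
apiVersion: v1
kind: Node
metadata:
  name: example-new-node            # placeholder node name
spec:
  taints:
    - key: hub.jupyter.org/image-pulling
      value: "true"
      effect: NoSchedule
---
# Pod-spec fragment: toleration the image-puller pods would need so they
# can be scheduled onto the tainted node and pull the image:
tolerations:
  - key: hub.jupyter.org/image-pulling
    operator: Exists
    effect: NoSchedule
```

Adding and removing that taint automatically is what the Go binary mentioned in the comments above is for.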
We have the code ready to send as a PR, but we want to see if there is a better way to solve it. Happy to send the PR anytime.