User pod launching timeout with large single user image when cluster scaling up #3010

Open
xcompass opened this issue Feb 2, 2023 · 7 comments

@xcompass
Contributor

xcompass commented Feb 2, 2023

Bug description

During cluster scale-up, once a new node is provisioned and reported Ready in k8s, user pods may get scheduled onto it. However, if the continuous image puller is still pulling the single-user image, because the image is large (a few GB) or the network is slow, users get a timeout error in the UI whenever the pull does not finish within the configured timeout. Increasing the timeout is a quick fix, but it makes for a poor user experience, since users may have to wait a long time to get access (for us, up to 10 minutes).

Expected behaviour

The user pods should not be scheduled to the new node before image pulling is done.

Actual behaviour

See Bug description

How to reproduce

  1. Create a large single-user image, or use a slow registry/network
  2. Trigger a scale-up event
  3. Spawn a few user pods; some of them will be scheduled to the new node as soon as it is ready
  4. The user sees a timeout error in the UI

Our current workaround

Add a taint before starting the image puller (or when provisioning the node) to prevent user pods from being scheduled to the node, then remove the taint after the image puller has finished.
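
For illustration, here is a minimal sketch of the taint/toleration mechanics this workaround relies on. The taint key is hypothetical, and how the taint is applied and removed depends on your provisioning tooling (cloud provider node pool config, a small controller, or kubectl):

```yaml
# Node spec fragment: a hypothetical taint placed on a freshly provisioned
# node so that user pods, which do not tolerate it, cannot be scheduled
# there until the taint is removed.
spec:
  taints:
    - key: hub.jupyter.org/image-pulling   # hypothetical key
      effect: NoSchedule
---
# Pod spec fragment: a toleration the image puller pods would need so they
# can still run on the tainted node and pre-pull the single-user image.
tolerations:
  - key: hub.jupyter.org/image-pulling
    operator: Exists
    effect: NoSchedule
```

Once the image puller completes on the node, the taint is removed and user pods can be scheduled there as usual.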

We have the code ready to send a PR, but want to check first whether there is a better way to solve this. Happy to send the PR anytime.

@xcompass xcompass added the bug label Feb 2, 2023
@welcome

welcome bot commented Feb 2, 2023

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@consideRatio
Member

KubeSpawner registers the pod with the k8s api-server, and at that point in time JupyterHub's server startup timeout starts counting. You can configure this timeout, and I typically increase it to 20 minutes when large images are involved.

https://z2jh.jupyter.org/en/stable/resources/reference.html#singleuser-starttimeout
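
For reference, this timeout maps to the following Helm values for the jupyterhub chart (the value is in seconds; 1200 reflects the 20 minutes mentioned above):

```yaml
# values.yaml for the jupyterhub Helm chart
singleuser:
  startTimeout: 1200  # seconds to wait for the user server (pod) to start
```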

@xcompass
Contributor Author

xcompass commented Feb 3, 2023

Thanks for the reference. We tried increasing the timeout, but then our users have to wait a long time (in our case around 10 minutes) to reach the UI after logging in, even though there are still resources on the existing nodes to assign to them. This happens with user placeholders enabled and node-density scheduling, as well as with default scheduling without placeholders. Since the new node has no pods running, user pods are scheduled there as soon as it becomes Ready (before the single-user image has been pulled).

This is somewhat of an edge case, since it is not triggered very often. It has to satisfy all of the following conditions:

  1. A new node is being provisioned (scale-up)
  2. User placeholders are in use (if using node-density scheduling)
  3. A large single-user image is in use and is pre-pulled by the continuous image puller
  4. A user tries to log in after the new node is ready but before the single-user image has been pulled

@consideRatio
Member

This is a tricky topic. I've considered it in the past, and I recall I couldn't settle on a strategy to resolve it that was simple and robust enough to be worth working towards.

  • If the strategy used to solve this isn't robust enough, it may cause other problems for users.
  • If the strategy used to solve this isn't simple enough, it may add long-term maintenance complexity - and capacity to maintain open source projects is always limited.

If you have a strategy to handle this situation that you believe will be robust and simple - for example, one that doesn't assume too much about the cloud providers or require users to set up node pools with various taints etc. - then I'm open to reviewing it!

I haven't re-read all my notes from past considerations of this, but it is a topic I've considered in depth in the past without reaching a clear resolution. Here are some past topics that relate to some degree:

@xcompass
Contributor Author

xcompass commented Feb 3, 2023

Thanks @consideRatio. I wasn't aware there had been such extensive discussion. I think my solution is simple enough, so I'm sending in PR #3011 for review. It adds one small Go binary to manage the taints, currently hosted in our repo. Happy to donate that to jupyterhub as well.

@consideRatio
Member

I understand now from the reproduction steps that we were thinking about two different situations: you considered the case where there are multiple nodes available for the user pod to schedule on, while I considered the case where the user pod's start triggers a scale-up because it can't fit on any existing node.

Hmmm, if you are using the user-scheduler provided by the jupyterhub chart, you can make user pods pack onto the most used node, which may help you. Then the use of user-placeholder pods complicates things as well.
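
As a rough sketch, the relevant chart options look like this (see the z2jh configuration reference for details; the replica count is just an example):

```yaml
scheduling:
  userScheduler:
    enabled: true      # pack user pods onto the most utilized nodes
  podPriority:
    enabled: true      # lets real user pods evict placeholder pods
  userPlaceholder:
    enabled: true
    replicas: 4        # example headroom; tune to your expected burst
```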

Are you currently using user-scheduler and/or user-placeholder pods @xcompass? I'd like to pinpoint the problems this new strategy could resolve that aren't already resolved by existing techniques. That will be required for the documentation we need for a feature like this anyhow.

@xcompass
Contributor Author

xcompass commented Feb 7, 2023

Yes, we are using user-scheduler and user-placeholder. I remember we did the testing with both enabled. I'll redo a test to make sure.
