Jupyter Kernel Restart does not release the RAM usage by the kernel running in JEG kubernetes cluster #1195

Open
sharmasw opened this issue Nov 14, 2022 · 6 comments

Comments

@sharmasw

sharmasw commented Nov 14, 2022

Description

We have JEG running in a Kubernetes cluster. When we spawn a pod to execute a Jupyter notebook, everything works well, but when the user restarts the kernel, the RAM used by the kernel is not released by the pod immediately. We either have to wait an indefinite time for it to be released or, if we continue using the kernel, it eventually runs out of memory and Kubernetes kills the pod.

Screenshots / Logs

Start of the kernel:
[screenshot]

After executing some commands:
[screenshot]

1st restart:
[screenshot]

2nd restart, immediately afterwards, without executing any code:
[screenshot]

3rd restart, without executing any code:
[screenshot]

If we wait for some indefinite time (4 minutes in this example), the memory is released:
[three screenshots]

Any clue or suggestion as to why this happens? We just want all the RAM in use to be released once the restart action is performed.

Environment

  • Enterprise Gateway Version 2.6.0
  • Platform: Kubernetes
  • Jupyter Server 1.15.6

Resource configuration

  • KERNEL_CPUS 50m
  • KERNEL_MEMORY 4096Mi
  • KERNEL_MEMORY_LIMIT 4096Mi

We also have other configurations, such as:

  • KERNEL_CPUS 50m
  • KERNEL_MEMORY 1024Mi or 256Mi
  • KERNEL_MEMORY_LIMIT 1024Mi or 256Mi
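
To make the request/limit distinction concrete, the following is a minimal sketch of how values like these could be turned into a Kubernetes-style `resources` mapping for the kernel pod. It is not Enterprise Gateway's actual code: the helper name `build_resources` is hypothetical, `KERNEL_CPUS_LIMIT` is assumed by analogy with `KERNEL_MEMORY_LIMIT`, and the sketch assumes the plain variables are requests while the `_LIMIT` variables are limits.

```python
def build_resources(env):
    """Build a Kubernetes-style resources mapping from KERNEL_* variables.

    Assumption (not confirmed in this thread): KERNEL_CPUS and KERNEL_MEMORY
    are requests, and the *_LIMIT variants are limits. Unset or empty
    variables are omitted from the result.
    """
    requests = {}
    limits = {}

    if env.get("KERNEL_CPUS"):
        requests["cpu"] = env["KERNEL_CPUS"]
    if env.get("KERNEL_MEMORY"):
        requests["memory"] = env["KERNEL_MEMORY"]
    if env.get("KERNEL_CPUS_LIMIT"):  # assumed variable name
        limits["cpu"] = env["KERNEL_CPUS_LIMIT"]
    if env.get("KERNEL_MEMORY_LIMIT"):
        limits["memory"] = env["KERNEL_MEMORY_LIMIT"]

    resources = {}
    if requests:
        resources["requests"] = requests
    if limits:
        resources["limits"] = limits
    return resources


# The first configuration listed above.
first_config = build_resources(
    {
        "KERNEL_CPUS": "50m",
        "KERNEL_MEMORY": "4096Mi",
        "KERNEL_MEMORY_LIMIT": "4096Mi",
    }
)
```

Under these assumptions the first configuration yields a memory request equal to the memory limit, so a container that grows past 4096Mi becomes a candidate for an out-of-memory kill, which is consistent with the pod being killed as described above.
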
@welcome

welcome bot commented Nov 14, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@kevin-bates
Member

Hi @sharmasw - I'm not familiar with how/when resources are deallocated over a pod's lifecycle. I guess the information you provide is not too surprising. When kernels are restarted, we retain the namespace and give the new pod the same name as the previous one (because the kernel_id is also preserved). I imagine this might be why k8s defers the cleanup that you observe, and it might transfer the resources to the new pod, provided it's on the same node as the previous one.

Can you share your resource configuration in case others want to look into this? Are these specified as limits or requests, and via envs, or just configured directly into the pod's launch script?

Does anyone else know how resources are deallocated in k8s? @lresende, @rahul26goyal

If we can make that determination, we can possibly update KubernetesProcessProxy.terminate_container_resources() to explicitly deallocate resources on shutdowns.

@rahul26goyal
Contributor

Hi @sharmasw
Can you please share the kernel type? Is it a custom kernel?
I see that the pod name has not changed across restarts which is unlikely for the kernels which EG supports today.
As @kevin-bates mentioned, we kill the existing kernel pod and create a new one when you restart a kernel.
Please correct me if I have misunderstood anything here.

@kevin-bates
Member

> I see that the pod name has not changed across restarts which is unlikely for the kernels which EG supports today.

Pod names are preserved across restarts. By default, they are composed of kernel username and kernel id, both of which are static values in this context.
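
As an illustration of why the name cannot change on restart, here is a minimal sketch of a name derived only from the kernel username and kernel id. The helper `kernel_pod_name` and its sanitizing rule are hypothetical and are not Enterprise Gateway's actual implementation; the only point is that identical inputs always give an identical name.

```python
import re


def kernel_pod_name(kernel_username, kernel_id):
    """Hypothetical helper: derive a pod name from username and kernel id."""
    raw = f"{kernel_username}-{kernel_id}".lower()
    # Keep only lowercase alphanumerics and '-', a conservative subset of
    # the characters Kubernetes accepts in object names.
    return re.sub(r"[^a-z0-9-]", "-", raw).strip("-")


# Illustrative values only; the kernel id is preserved across a restart.
name_before_restart = kernel_pod_name("alice", "1234-abcd")
name_after_restart = kernel_pod_name("alice", "1234-abcd")
```

Because neither input changes when a kernel is restarted, the replacement pod ends up with the same name in the same namespace as the pod it replaces.
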

@sharmasw
Author

> If we can make that determination, we can possibly update KubernetesProcessProxy.terminate_container_resources() to explicitly deallocate resources on shutdowns.

Hi @kevin-bates, could you elaborate on what could actually be done to explicitly deallocate the resources? We looked into the Kubernetes Python library and did not find any documentation or function that talks about deallocating unused resources from a given pod.

@kevin-bates
Member

Hi @sharmasw - well, I'm afraid you answered the question. If the API does not expose a means to deallocate resources sooner, I'm not sure there's much we can do. Had there been a way to address this via the API, we could introduce those calls into KubernetesProcessProxy.terminate_container_resources().

This behavior implies that resources may be indexed by pod name (and probably namespace) - which seems very odd. I just confirmed that the Docker container ID changes across restarts - so it's definitely a different instance.
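
A check like the one described above can be sketched as follows, assuming the pod object has been captured in JSON form before and after a restart (for example from `kubectl get pod <name> -o json`). The helper names `container_ids` and `is_new_instance` are hypothetical; the `status.containerStatuses[].containerID` field is part of the standard pod status.

```python
def container_ids(pod):
    """Return {container name: container ID} from a pod object in JSON form."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    return {s["name"]: s.get("containerID") for s in statuses}


def is_new_instance(pod_before, pod_after):
    """Return True if any started container in pod_after has a different ID.

    Containers in pod_after that have no ID yet (not started) are ignored,
    so an empty or pending pod_after yields False.
    """
    before = container_ids(pod_before)
    after = container_ids(pod_after)
    return any(
        bool(cid) and before.get(name) != cid for name, cid in after.items()
    )


# Illustrative snapshots with made-up IDs, taken before and after a restart.
snapshot_before = {
    "metadata": {"name": "alice-1234-abcd"},
    "status": {"containerStatuses": [{"name": "kernel", "containerID": "docker://aaa"}]},
}
snapshot_after = {
    "metadata": {"name": "alice-1234-abcd"},
    "status": {"containerStatuses": [{"name": "kernel", "containerID": "docker://bbb"}]},
}
restarted_as_new_instance = is_new_instance(snapshot_before, snapshot_after)
```

A `True` result with an unchanged pod name matches the observation above: the name is reused, but the container behind it is a different instance.
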

> Can you share your resource configuration in case others want to look into this? Are these specified as limits or requests, and via envs, or just configured directly into the pod's launch script?

Since the pod name (and namespace) are the same, perhaps the resources are treated as high-water marks or something. (This is definitely the kind of thing that is difficult to locate w/o knowing the code or how the scheduler works, as it's probably not an ordinary use case.)
