EG continually receives empty payloads from launcher after initial connection #1157
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗 |
Hi @blair-anson - thanks for opening this issue. The JSON decoding issues may be a versioning thing where the kernel images are relative to a different version than EG. I suspect EG is older than the kernel images in this case. What version of EG are you running (this should be present in the EG logs, within the first page of output)? |
@kevin-bates thank you for the prompt response, it is greatly appreciated.
Ahh I see. I made the change and the working directory is now working as expected. Thanks!
|
Only the active kernels request should be getting proxied to the Gateway. Terminal and session requests will be handled by the server. You should also see periodic requests (about every 10 seconds) for kernel specs as well - which will also be proxied to the Gateway. Jupyter Lab (IMHO) does way too much polling and this is why I added kernel-spec caching to EG. I'm hoping Lab will be able to leverage the event framework that is currently being added to Jupyter Server to alleviate some of this. |
Back to your issue. Have you modified |
No, I have not modified I do build EG from source following the make process and packaged up in my own docker image. So I tried using the image I thought it might be the AWS loadbalancer, so I removed the target groups but the errors still appear continuously in the EG logs. The only thing left is the Kubernetes configuration. I did have a lot of trouble trying to get the service exposed via an ingress to an AWS loadbalancer so that is the next likely suspect.
|
hi @blair-anson Issue-1: using
|
Yes the kernel does launch and run, and I can execute cells in the notebook. It is just that after executing a cell the JEG log gets spammed with these error messages. But even with those error messages filling up the log, executing additional cells in the notebook still works.
|
I looked into the JEG codebase and traced the exception to this function
Specifically this line to decode the data received from asyncio
The data received is empty and hence the json read fails. I added a simple data length check.
    if not buffer:  # send is complete, process payload
        if len(data) > 0:  ## <<----- check for empty data
            self.log.debug("Received payload '{}'".format(data))
            payload = self._decode_payload(data)
            self.log.debug("Decrypted payload '{}'".format(payload))
            self._post_connection(payload)
        break
    data = data + buffer.decode(encoding='utf-8')  # append what we received until we get no more...
|
Hi @blair-anson - what is perplexing to both @rahul26goyal and me is that that code is responsible for receiving the kernel connection information from the remote kernel (pod in this case). So if that isn't working, the kernel should not be working because the server has no way to communicate with it. The other odd thing is the "spam" that occurs. It sounds like there's a correlation between the empty payload messages and cell execution yet the remote kernel sends the connection payload exactly once. Might you have other kernel pods running from a previously failed attempt? Can you check all namespaces (since you have kernels running in their own namespace - e.g., In the opening description, you show the log output of the launch and an empty payload, so there's no way a kernel with ID Could you please do the following...
Thank you. It's good that you're able to use the kernel, we just need to figure out why. 😄 |
Also, if you're in a position to try EG 3.0.0rc0, that would be great! That would serve as another datapoint and bring you up to date on the main branch. |
FWIW - I just deployed EG 2.6.0 in my kubernetes cluster and get the following log entries... The expected startup banner...
The expected payload response after 8 poll attempts...
The expected shutdown response...
More questions...
|
No, I only see one EG namespace with one EG pod in it
I am using tag
Here are the logs as requested and the steps I followed....
I also wanted to mention that when I do the same process using JupyterHub instead of a local JupyterLab I sometimes see a kernel being created when JH connects to EG, even if no notebook is open by default. Then when I open a notebook and start a kernel, a second kernel is created in EG. I suspect this may also be related, and is the likely cause of the 6 kernels from step 3). jupyterlab.log |
logs from the JupyterHub test |
Thank you for the detailed response! This is extremely odd - I definitely see what you mean by "spamming the log"! We've never seen this before and can't reproduce this. Are you able to try running a deployment with zero changes - or the minimum necessary to operate within AWS? Might the kernel pod's restart policy (should be The kernel is not shutting down gracefully either, so it seems something else is in-play here. If a Lab session is ended with an open notebook, that notebook will be opened and its kernel started on next invocation. EG will not start kernels upon its startup - only on request via its REST API. I'm at a loss as to what might be going on here. It's like something outside of the software is hitting the response address ( I'm also not seeing a discernible pattern in the spam log timestamps. There may be a backoff pattern happening but that's tough to determine based on general log latency. I'm hoping you can try a vanilla deployment of some form so we can get a better comparison. The kernel pod should be logging whenever it goes to send the connection information (which your other log shows) but we don't see the spam calls. Also, given that they continue to occur after the kernel pod and its namespace are terminated is, well, alarming. |
Yes, later this week I plan to start with the basic deployment and test each modification. I will let you know how it goes. |
Hi @blair-anson - Could you please send a copy of the |
I've gone ahead and updated the title to reflect the actual issue. |
@blair-anson - I suspect this issue might be related to the periodic callback that calls When you added this protection, did things behave correctly? Hmm. One difference that would occur using the "protective measure" vs. not, is that in the case where the empty payload is processed, an exception (multiple actually) is raised and caught, but the connection is never closed! With the protection in place, the connection is closed. Therefore, I recommend we make two changes...
I'm going to experiment with timeout values to see if I can reproduce this in some manner, but I think we should make those two changes in our 3.0 release. |
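The failure mode described above, where an exception raised while handling an empty payload leaves the connection open, can be shown in isolation. This is only an illustrative sketch and not the change made in EG: FakeConn and both handler functions are hypothetical, and json.loads stands in for the real payload decoding.

```python
import json


class FakeConn:
    """Hypothetical stand-in for a socket; records whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def handle_close_after(conn, data):
    payload = json.loads(data)  # raises on an empty string...
    conn.close()  # ...so this line is never reached
    return payload


def handle_close_in_finally(conn, data):
    try:
        return json.loads(data)
    finally:
        conn.close()  # runs whether or not decoding raised


def closed_after_failure(handler):
    """Feed an empty payload to the handler and report whether the connection got closed."""
    conn = FakeConn()
    try:
        handler(conn, "")
    except json.JSONDecodeError:
        pass  # the caller catches the error, as the server does
    return conn.closed


if __name__ == "__main__":
    print(closed_after_failure(handle_close_after))
    print(closed_after_failure(handle_close_in_finally))
```

With the close placed after the decode, a caught exception leaves the connection open; placing it in a finally clause closes it on both paths.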
@kevin-bates apologies for the delay. Here are the files requested. You will notice in
Kind of... I found that starting a kernel often timed out the first time. Then reloading the jupyter page sometimes connected to the new kernel, and other times I had to start a new kernel... then after a while I ended up with 2 running kernels for 1 notebook. Once the first kernel was running then things seemed to work as expected, and adding additional kernels was no problem. I did increase the timeout but setting up the connection to the first kernel still seems flaky. As I am running JupyterHub remotely to EG, I am not sure if it is a network problem, a kubernetes problem, or an EG configuration problem. I have not yet had time to have a proper look at what is going on. Alternatively I may just move to hosting JupyterHub on the same kubernetes cluster as EG to see if that resolves the problem. |
Thanks - no worries about the delay. I was thinking, do you deploy the kernel-image-puller and, if not, are your nodes "pre-configured" to include the necessary kernel images? Often, if the image is not present, the initial request will time out within the time it takes to pull the image, so I'm wondering if that's related to the "first time" issues you're seeing. This scenario fits the notion where you'd wind up with two kernels running since the first does eventually start up (once the image is available). I'll try to check the files you provided early next week - thanks for their inclusion. |
This may be my misunderstanding, so let me outline what I am seeing.
I have EG setup on an EKS kubernetes cluster. I have mounting of a user's NFS folder working, and user impersonation appears to be working as well. However the default folder, and the "home" folder for the kernel is set as /home/jovyan
According to the documentation, if I set EG_MIRROR_WORKING_DIRS and KERNEL_WORKING_DIR then the kernel working directory will change, and I assume I will see /home/user1 set as the kernel home folder but I don't see that change applied.
I run a local jupyter instance with the environment variables set like so...
But when I run in the notebook pwd or ls ~ I can see the home folder is /home/jovyan and not /home/user1.
The jupyter logs look like this...
The enterprise-gateway pod logs have a number of these errors...