cannot start `sh': Resource temporarily unavailable #15
I have just this week become aware of a significant issue with jobeinabox when running it on a host that's providing other services. Although I haven't yet confirmed it, I think that can give rise to the error message you're quoting. I've also seen that error message arising on CentOS and RHEL servers; those OSs don't play well with Docker. So, can you please advise what OS the host is running, what its configuration is (memory size, number of CPUs), and whether the host is providing other services as well as Jobe?
Hello, the Docker host is an Ubuntu 20.04 VM with 6 virtual CPUs and 10 GB RAM. The hypervisor is, I think, Proxmox. Besides the Docker system there is a natively installed JOBE (which runs correctly, on a different port of course) and gitlab-ce. But we want to switch from the natively installed JOBE server to the Docker container for security reasons (otherwise, for example, a Python program can read the host file system!)
OK, thanks for the info. Would you mind opening a shell inside the container, navigating to /var/www/html/jobe, and typing the following command, please:
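(The command itself isn't preserved in this extract; from the replies that follow it is evidently the Jobe self-test. A minimal sketch, assuming the container is named jobeinabox:)

    docker exec -it jobeinabox bash    # open a shell in the container (container name assumed)
    cd /var/www/html/jobe
    python3 testsubmit.py              # run the Jobe self-test suite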
Please paste the entire output into your response or attach it as a file. Also, could you let me know the output from the following two commands when executed on the host, please:
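(These two commands aren't preserved either. From the later remark that "we checked the password file and did a ps to be sure", they were presumably along these lines:)

    # assumed reconstruction: which high system UIDs exist, and what's running under them
    tail /etc/passwd
    ps -eo uid,user,pid,cmd | sort -n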
As an aside, with regard to security issues, Jobe is pretty secure and has had several security audits over the years. Is it really a problem that the jobs can read the file system? I'd hope that no sensitive information is in world-readable files. A Jobe task has significantly fewer rights than a logged-in user on the system (e.g. limited processes, memory, job time).
Hello, when I run python3 testsubmit.py I get a huge list of failed tests, like
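(The pasted failures aren't preserved here; presumably each failing test reported the error quoted in the issue title:)

    cannot start `sh': Resource temporarily unavailable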
Attached is the complete report! And this is the output of the other two commands:
Well, as for the security issue: yes, I believe that Jobe is secure, but it's hard to say whether there aren't any files with world read permissions. That's why I prefer to have everything inside a Docker container!
This is very interesting. You're the second person reporting this problem and the symptoms are almost identical. Did you notice that the Java jobs ran fine? And that 3 of the 10 identical C jobs thrown at the server in quick succession passed, whereas the same job had failed multiple times earlier in the test run?

In discussion with the other guy reporting this problem I've identified the problem as being the process limit. Java uses a large process limit (ulimit NPROC) of several hundred, whereas the default for most jobs is 30. It turns out that the C jobs (and probably all the others too) run fine if you raise the process limit. It also seems that some of the higher-numbered Jobe users aren't affected, but because Jobe users are allocated in order jobe00, jobe01, etc., the higher-numbered ones never get to run jobs unless you toss a lot of jobs at Jobe in a short space of time (which is why 3 of the 10 ran, or so I conjecture).

I had been theorising that because the user namespaces in the container are shared with the host (unless you're running the Docker daemon with isolated user namespaces - do you know?), there must be processes running on the host with the same UID as that of the Jobe users in the container. You do indeed have such users - the ones created by Jobe on the host - but there's no way they can be clocking up sufficient processes to block the ones in the container (if that's even possible - I'm not sure how ulimits are enforced within the container).

I'm now frankly baffled and not sure what to suggest. The other guy is also running a host system of Ubuntu 20.04, but that should be fine. It could be significant that I only recently changed the base OS for jobeinabox from Ubuntu 20.04 to Ubuntu 22.04, but I wouldn't have thought that would matter. And I just now fired up jobeinabox on a 20.04 host with no problems. Clutching at straws here, but ... would you mind building your own jobeinabox image from scratch, please (see here), and confirming nothing changes? And then, to be quite sure, editing the Dockerfile to change the base OS back to Ubuntu 20.04 and building with that? But you'll also need to edit the line
to
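(Neither the original line nor its replacement is preserved in this extract. Judging from the reply below, which mentions JDK 16, it was presumably the JDK install line in the Dockerfile, something like the following; the exact package name is a guess:)

    # assumed sketch: in addition to FROM ubuntu:20.04, change the JDK line
    RUN apt-get install -y openjdk-16-jdk-headless    # was: a later openjdk package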
I'd really like to get to the heart of this - two such reports in the space of a week suggests something "interesting" has happened.
OK, I rebuilt the image from the Dockerfile, which I downgraded to Ubuntu 20.04 and JDK 16. Unfortunately with no effect, see the output of python3 testsubmit.py!
Many thanks for that - it's very helpful to at least eliminate that possibility, but it doesn't leave us any closer to an explanation. The dialogue with the other guy is here if you want to see where the issue is at. You'll see that they just switched to running jobeinabox on an AWS server. I'm frankly baffled, with little to suggest. I can't debug if I can't replicate the problem. You could perhaps check that the problem goes away if you change line 44 of jobe/application/libraries/LanguageTask.php from
to
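(The two values aren't preserved. Given the later reply about setting numprocs to 200, the change was presumably to the default per-job process limit, something like:)

    // assumed sketch of the edit in LanguageTask.php
    'numprocs'  => 200,    // was 30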
But even if it does (as I expect), I'd be unhappy running a server with such inexplicable behaviour. The tests should nearly all run with a value of 2. The only thing I can see in common between your system and the other guy's is that you're both running additional services in containers - they're running Moodle, you're running gitlab. Are you perhaps able to stop gitlab (and any other Docker processes) and check Jobe again? This isn't quite as silly as it sounds - they do all share the same UID space. Do you have any other suggestions yourself?
I've thought of one other test you could do, if you wouldn't mind, please? In the container:
Uncomment the line
and set the value to 800 instead. Similarly uncomment and set SYS_GID_MAX to 800. Then:
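(The file name and the follow-up commands aren't preserved in this extract. Since SYS_UID_MAX and SYS_GID_MAX live in /etc/login.defs, and since the jobe users would need to be re-created to pick up the new range, the procedure was presumably roughly as follows; re-running the Jobe install script to re-create the users is an assumption:)

    # assumed reconstruction of the missing steps, run inside the container
    vi /etc/login.defs        # uncomment SYS_UID_MAX and SYS_GID_MAX, set both to 800
    cd /var/www/html/jobe
    ./install                 # re-create jobe00, jobe01, ... with UIDs below the new maximum
    python3 testsubmit.py     # re-run the self-test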
Does it work OK now?
I've tried the first approach and set numprocs to 200, and this looks really good. I only got 3 errors! I will continue with your second approach!
Now with your second approach.... GREAT!! No errors were reported (but I still have numprocs set to 200, shall I reduce it?)
Yes please - I'd like to be reassured that you still get no errors with it set back to 30. Many thanks for your debugging support. This is definitely progress, of a sort. If you still get no errors, as I would hope, I'd be fairly confident that you have a working Jobe server.

But I really would like to know why those top few system-level user IDs are causing problems. The implication is that something else in your system is using them too, and running lots of processes with them. But what? They're not being used by the host (we checked the password file and did a ps to be sure), so I can only assume that another container is using them. Are you able to list all running containers (docker ps), then exec the following two commands in all of them to see if we can find the culprit?
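(Again the exact commands aren't preserved; presumably the same pair of checks as above, wrapped in docker exec for each container listed by docker ps:)

    # assumed reconstruction - repeat for each container name
    docker exec <container> tail /etc/passwd
    docker exec <container> ps -eo uid,user,pid,cmd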
If that last command throws up a list of processes within any of the containers, we've found the problem!
OK, I reduced numprocs to 30 again, and have NO errors when I run python3 testsubmit.py. I have 3 running Docker containers: a GitLab runner, a self-developed application and Jobe. Here are the results of the commands. GitLab runner:
Self developed Application
JOBE
By the way, when I run these commands on the host system, I get the output below. On the host system gitlab and jobe are installed! Maybe this could help too! But I'm happy to have jobeinabox running fine. It would be nice to know what I have to change in the Dockerfile, so that I can build a working image :-)!
Good news that you still get no errors after reducing numprocs to 30 again. I think you could comfortably use that jobeinabox container if you wanted, but it's really no solution to the problem, as you'd have to repeat all the UID fiddling every time you ran up a new container. I need to understand exactly what's happening and fix it.

None of your containers seems to be using any of the default Jobe UIDs. However, I do note that gitlab-runner is using 999, which is the same UID that jobe uses. I don't see how this could cause the problem, but I will pore again over the runguard code. No more time today, though - I have some "real work" to do :-) Are you easily able to fire up a new jobeinabox container and check if it runs OK while the gitlab container is stopped? No problem if not - you've given me something to ponder, regardless. Many thanks again for the debugging help. Stay tuned - I hope to come back within a day or two.
OK, thanks a lot for your help!
Aha. That's it! All suddenly is clear. nginx is using UID 998 - the same as jobe00. It creates lots of worker processes, so jobe00 doesn't get a chance. We had a misunderstanding earlier when I asked you to run those commands on the host. You gave me the output
I failed to notice that you ran the commands in the container, not on the host! Many thanks - I now know exactly what the problem is. All I need to do now is figure out how to fix it. That requires some thought. Stay tuned.
I've pushed a change to Jobe to allow customised setting of the UIDs allocated to the Jobe processes. I've also pushed a new version of the Dockerfile and updated the latest jobeinabox image on Docker Hub to make use of the new functionality. Are you able to check with jobeinabox:latest to confirm that the problem has been solved, please? Thanks again for the great help in reporting and debugging.
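(A quick way to check, assuming the standard image name and run command from the jobeinabox README and a container named jobe:)

    docker pull trampgeek/jobeinabox:latest
    docker rm -f jobe                                   # remove the old container if one is running
    docker run -d -p 4000:80 --name jobe trampgeek/jobeinabox:latest
    docker exec jobe bash -c 'cd /var/www/html/jobe && python3 testsubmit.py'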
Hello, we have been using JOBE for some years now, and now we are switching to a new (virtual) server. I start the JOBE Docker container just as on the old server!
But launching a CodeRunner question I get the following error message:
Do you have any idea what could go wrong?
best regards
jtuttas