Workers left alive on MOM node #780

Open
WardLT opened this issue Apr 13, 2022 · 0 comments
Labels
bug Something isn't working

WardLT commented Apr 13, 2022

Describe the bug
Workers are not cleaned up from a launch node after a job is terminated by the scheduler.

My app places the funcX manager on the MOM node of a Cray supercomputer so that workers can launch MPI applications via system calls. The manager process is killed when the job exits, but the workers keep running afterwards.

Looking at the logs, I note the workers report receiving a Signal 15 but do not exit. Is that expected?

(miniconda-3/latest//home/lward/exalearn/edw/env) lward@thetalogin6:~/.funcx/nwchem/HighThroughputExecutor/worker_logs/70f647195873> more funcx_worker_32.log
1649883351.811096 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:85 __init__ Initializing worker 32
1649883351.813936 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:86 __init__ Worker is of type: RAW
1649883351.814639 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:95 __init__ Trying to connect to : tcp://127.0.0.1:52075
1649883351.815704 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:109 start Starting worker
1649884526.686283 2022-04-13 21:15:26 ERROR MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:101 handler Signal handler called with signal 15
1649884762.299373 2022-04-13 21:19:22 ERROR MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:101 handler Signal handler called with signal 15

To Reproduce
TBD. My app has a complex setup, but I can create a minimal example on request.

Expected behavior
Everything dies when Cobalt commands it.

Environment

  • OS: CentOS

  • OS & Container technology: None

  • Python version @ 3.8.12

  • funcx version @ 58493f

  • funcx-endpoint version @ 58493f

Distributed Environment

  • Where are you running the funcX script from? Login node
  • Where does the endpoint run? Login node
  • What is your endpoint-uuid? ff59d7d1-e2f5-4a38-8bb8-ba6de588c7c7
  • Attach endpoint logs at ~/.funcx/<ENDPOINT_NAME> if this is an endpoint issue.
    Please let us know if you'd prefer to share logs privately.

worker-no-die.tar.gz

WardLT added the bug label Apr 13, 2022