Workers left alive on MOM node #780

WardLT · 2022-04-13T21:37:38Z

Describe the bug
Workers are not cleaned up from a launch node after a job is terminated by the scheduler.

My app places the FuncX manager on the MOM node of a Cray supercomputer so that workers can launch MPI applications via system calls. The manager process is killed when the job exits, but the worker processes keep running afterwards.

Looking at the logs, I see that the workers report receiving signal 15 (SIGTERM) but do not exit. Is that expected?

(miniconda-3/latest//home/lward/exalearn/edw/env) lward@thetalogin6:~/.funcx/nwchem/HighThroughputExecutor/worker_logs/70f647195873> more funcx_worker_32.log
1649883351.811096 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:85 __init__ Initializing worker 32
1649883351.813936 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:86 __init__ Worker is of type: RAW
1649883351.814639 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:95 __init__ Trying to connect to : tcp://127.0.0.1:52075
1649883351.815704 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:109 start Starting worker
1649884526.686283 2022-04-13 21:15:26 ERROR MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:101 handler Signal handler called with signal 15
1649884762.299373 2022-04-13 21:19:22 ERROR MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:101 handler Signal handler called with signal 15

To Reproduce
TBD. My app has a complex setup, but I can create a minimal example on request.

Expected behavior
Everything dies when Cobalt commands it.
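To make the expectation concrete, here is a minimal sketch of the shutdown behavior I have in mind: whoever owns the worker processes sends SIGTERM, waits briefly, and escalates to SIGKILL for anything still alive. The `stop_worker` helper and the one-second grace period are assumptions for illustration, not part of funcX or Cobalt.

```python
import signal
import subprocess
import sys


def stop_worker(proc: subprocess.Popen, grace: float = 1.0) -> int:
    """Send SIGTERM, then escalate to SIGKILL if the process is still alive.

    Returns the process's return code (negative signal number on POSIX
    when the process was killed by a signal).
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # Hypothetical stand-in for a worker that ignores SIGTERM.
    stubborn = subprocess.Popen(
        [
            sys.executable,
            "-c",
            "import signal, time\n"
            "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
            "print('ready', flush=True)\n"
            "while True:\n"
            "    time.sleep(0.1)\n",
        ],
        stdout=subprocess.PIPE,
        text=True,
    )
    stubborn.stdout.readline()  # wait until SIGTERM is being ignored
    code = stop_worker(stubborn)
    stubborn.stdout.close()
    print("killed by SIGKILL:", code == -signal.SIGKILL)
```

This only helps if the process that started the workers is still alive to do the escalation. In my case the manager is killed first, so the workers would need to exit on their own when they receive SIGTERM.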

Environment

OS: CentOS
OS & Container technology: None
Python version @ 3.8.12
Python version @ 3.8.12
funcx version @ 58493f
funcx-endpoint version @ 58493f

Distributed Environment

Where are you running the funcX script from? Login node
Where does the endpoint run? Login node
What is your endpoint-uuid? ff59d7d1-e2f5-4a38-8bb8-ba6de588c7c7
Attach endpoint logs at ~/.funcx/<ENDPOINT_NAME> if this is an endpoint issue.
Please let us know if you'd prefer to share logs privately.

worker-no-die.tar.gz

WardLT added the "bug" label (Something isn't working) on Apr 13, 2022
