Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup redirected stdout crash before re-redirecting #2311

Merged
merged 2 commits into from
Jul 26, 2024
Merged

Conversation

sbailey
Copy link
Contributor

@sbailey sbailey commented Jul 26, 2024

This PR fixes #2269 by merging any per-rank log files leftover from a previous crash before proceeding with a new stdout/stderr log redirect per-rank. Previously these logfiles would get overwritten, losing the record of the crash.

I tested this with $CFS/desi/users/sjbailey/dev/stdout_redirect/mpi_test_redirect which purposefully crashes a requested rank. e.g.

# run without crashing any ranks; generates a blat.log file
$> srun -n 4 python mpi_test_redirect
Starting up; will crash rank -1
INFO:parallel.py:465:stdouterr_redirected: Begin log redirection to blat.log at Thu Jul 25 17:46:51 2024
INFO:parallel.py:524:stdouterr_redirected: End log redirection to blat.log at Thu Jul 25 17:46:51 2024
All done
$> ls
blat.log  mpi_test_redirect

# run while crashing rank 2, leaves behind backup of previous run and per-rank files from this run
$> srun -n 4 python mpi_test_redirect 2
Starting up; will crash rank 2
INFO:parallel.py:465:stdouterr_redirected: Begin log redirection to blat.log at Thu Jul 25 17:48:12 2024
srun: error: nid200053: task 2: Segmentation fault
srun: Terminating StepId=28590762.10
slurmstepd: error: *** STEP 28590762.10 ON nid200053 CANCELLED AT 2024-07-26T00:48:15 ***
srun: error: nid200053: tasks 0-1,3: Terminated
srun: Force Terminated StepId=28590762.10
$> ls
blat.log.0  blat.log-rank0  blat.log-rank1  blat.log-rank2  blat.log-rank3  mpi_test_redirect

# run again without crashing; will cleanup blat.log-rank* files, back them up to blat.log.1
# then generate blat.log
$> srun -n 4 python mpi_test_redirect
Starting up; will crash rank -1
INFO:parallel.py:465:stdouterr_redirected: Begin log redirection to blat.log at Thu Jul 25 17:50:24 2024
INFO:parallel.py:524:stdouterr_redirected: End log redirection to blat.log at Thu Jul 25 17:50:24 2024
All done
$> ls
blat.log  blat.log.0  blat.log.1  mpi_test_redirect

Detail: previously the per-rank files had names like output.log_0 which was confusingly similar to the backup names like output.log.0. I changed this to output.log-rank0 instead to make it clearer which files are leftover per-rank files and which are the backups.

Note: if a crash happens cleanly with an exception, stdouterr_redirected recovers and is able to merge the per-rank files. The leftover per-rank logfiles case comes from hard crashes like OOM or in this test case, from a purposeful segfault.

@sbailey sbailey requested a review from akremin July 26, 2024 00:56
Copy link
Member

@akremin akremin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this useful update to make the pipeline logging more robust. I have two very minor comments. Both should take two seconds to implement or they can both be ignored, as I don't feel strongly about either of them.

py/desispec/parallel.py Show resolved Hide resolved
py/desispec/parallel.py Outdated Show resolved Hide resolved
@sbailey sbailey merged commit 0061f3f into main Jul 26, 2024
26 checks passed
@sbailey sbailey deleted the stdouterr_redirect branch July 26, 2024 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stdouterr_redirected recovery from mid-stream crash
2 participants