Fix broken NVTX reports #2911
This is needed to fix #2530.
Here's a summary of what is passing/failing:
Error messages are:
and
I think I fixed this at some point (at least, on clima). Is this still an issue?
Yeah, the original failure seems to be fixed, but it does look like one issue remains (https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/330#019151a2-dc7f-4525-aba6-b92ea170dd76):
┌ Info: Progress
│ simulation_time = "4 hours, 49 minutes"
│ n_steps_completed = 193
│ wall_time_per_step = "945 milliseconds, 292 microseconds"
│ wall_time_total = "15 minutes, 7 seconds"
│ wall_time_remaining = "12 minutes, 5 seconds"
│ wall_time_spent = "3 minutes, 2 seconds"
│ percent_complete = "20.1%"
│ sypd = 0.261
│ date_now = 2024-08-14T09:56:47.422
└ estimated_finish_date = 2024-08-14T10:08:52.461
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generating '/tmp/slurm-35905/nsys-report-09db.qdstrm'
Should we keep this issue open for this new error? The title is sufficiently general 🤷🏻‍♂️
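As a rough consistency check on the progress log in this comment, the sketch below recomputes the timing fields from one another. The numbers are copied from the log; the relations between them (time spent ≈ steps × time per step, remaining = total − spent) are assumptions made for illustration, not a statement of how ClimaAtmos computes these fields.

```python
# Values copied from the progress log in this comment.
n_steps_completed = 193
wall_time_per_step_s = 0.945292      # "945 milliseconds, 292 microseconds"
wall_time_total_s = 15 * 60 + 7      # "15 minutes, 7 seconds"
wall_time_spent_s = 3 * 60 + 2       # "3 minutes, 2 seconds"
wall_time_remaining_s = 12 * 60 + 5  # "12 minutes, 5 seconds"

# Assumption: time spent is roughly steps completed times mean time per step.
estimated_spent_s = n_steps_completed * wall_time_per_step_s

# Assumption: remaining time is the estimated total minus the time spent.
estimated_remaining_s = wall_time_total_s - wall_time_spent_s

# Ratio of time spent to estimated total, for comparison with percent_complete.
spent_fraction_percent = 100 * estimated_spent_s / wall_time_total_s

print(f"spent:     {estimated_spent_s:.0f} s (log: {wall_time_spent_s} s)")
print(f"remaining: {estimated_remaining_s} s (log: {wall_time_remaining_s} s)")
print(f"fraction:  {spent_fraction_percent:.1f}% (log: 20.1%)")
```

Under these assumptions the fields agree with each other, so the last progress report before the target application terminated was about 3 minutes into an estimated 15-minute run.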
Yes, at least this seems to be consistent. It is always with that particular job:
The pipeline is still failing:
Without nsight, the jobs run to completion: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/344
That's a different error: the reports are now being generated, but the job is OOMing. I'm going to close this and open a new issue.
Opened #3375.
We need to fix the broken NVTX reports, both on central and on clima.