Fix broken NVTX reports #2911

Closed
charleskawczynski opened this issue Apr 17, 2024 · 10 comments · Fixed by #3372

Comments

@charleskawczynski
Member

We need to fix the broken NVTX reports, both on central and on clima.

@charleskawczynski
Member Author

This is needed to fix #2530.

@charleskawczynski
Member Author

charleskawczynski commented Apr 22, 2024

Here's a summary of what is passing/failing:

Central
GPU: GPU dry baroclinic wave                           | qdstrm error 27%
GPU: GPU moist Held-Suarez                             | qdstrm error 16%
GPU: GPU moist Held-Suarez cloud diagnostics per stage | qdstrm error 17%
:umbrella: GPU: gpu_aquaplanet_dyamond                 | qdstrm error 27%
GPU: Prognostic EDMFX aquaplanet                       | qdstrm error 55%

Clima
dry baroclinic wave                                    | rpc returns EmptyMessage
moist Held-Suarez                                      | rpc returns EmptyMessage
moist Held-Suarez - 4 gpus                             | multi-rpc returns EmptyMessage
dry baroclinic wave - 4 gpus                           | success
gpu_aquaplanet_dyamond - strong scaling - 1 GPU        | success
gpu_aquaplanet_diagedmf - 1 GPU                        | success
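
To see at a glance how the failures group, the summary above can be tallied by cluster and failure mode. This is only a minimal sketch: the `RESULTS` list is transcribed by hand from the table, and the `tally` helper is a hypothetical name, not part of any CI tooling.

```python
from collections import Counter

# (cluster, job, outcome) rows transcribed by hand from the summary above.
RESULTS = [
    ("Central", "GPU dry baroclinic wave", "qdstrm error"),
    ("Central", "GPU moist Held-Suarez", "qdstrm error"),
    ("Central", "GPU moist Held-Suarez cloud diagnostics per stage", "qdstrm error"),
    ("Central", "gpu_aquaplanet_dyamond", "qdstrm error"),
    ("Central", "Prognostic EDMFX aquaplanet", "qdstrm error"),
    ("Clima", "dry baroclinic wave", "rpc returns EmptyMessage"),
    ("Clima", "moist Held-Suarez", "rpc returns EmptyMessage"),
    ("Clima", "moist Held-Suarez - 4 gpus", "multi-rpc returns EmptyMessage"),
    ("Clima", "dry baroclinic wave - 4 gpus", "success"),
    ("Clima", "gpu_aquaplanet_dyamond - strong scaling - 1 GPU", "success"),
    ("Clima", "gpu_aquaplanet_diagedmf - 1 GPU", "success"),
]


def tally(results):
    """Count jobs per (cluster, outcome) pair (hypothetical helper)."""
    return Counter((cluster, outcome) for cluster, _job, outcome in results)


counts = tally(RESULTS)
for (cluster, outcome), n in sorted(counts.items()):
    print(f"{cluster:8s} {outcome:32s} {n}")
```

On this data, every Central job fails with the qdstrm import error, while on Clima half of the jobs succeed and the rest fail with an rpc timeout.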

@charleskawczynski
Member Author

Error messages are:

qdstrm

Generating '/tmp/slurm-40915165/nsys-report-4d1c.qdstrm'
[1/1] [====27%                     ] report.nsys-rep
Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/QdstrmImporter/main.cpp(34): Throw in function {anonymous}::Importer::Importer(const boost::filesystem::path&, const boost::filesystem::path&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::RuntimeException>
std::exception::what: RuntimeException
[QuadDCommon::tag_message*] = Status: AnalysisFailed
Error {
  Type: RuntimeError
  SubError {
    Type: InvalidArgument
    Props {
      Items {
        Type: OriginalExceptionClass
        Value: "N5boost10wrapexceptIN11QuadDCommon24InvalidArgumentExceptionEEE"
      }
      Items {
        Type: OriginalFile
        Value: "/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/Modules/EventCollection.cpp"
      }
      Items {
        Type: OriginalLine
        Value: "1055"
      }
      Items {
        Type: OriginalFunction
        Value: "void QuadDAnalysis::EventCollection::CheckOrder(QuadDAnalysis::EventCollectionHelper::EventContainer&, const QuadDAnalysis::ConstEvent&) const"
      }
      Items {
        Type: ErrorText
        Value: "Wrong event order has been detected when adding events to the collection:\nnew event ={ StartNs=403098042813 StopNs=403129613160 GlobalId=349883374385042 Event={ TraceProcessEvent=[{ Correlation=139850 EventClass=1 TextId=920 ReturnValue=0 },] } Type=48 }\nlast event ={ StartNs=448052547615 StopNs=448084509068 GlobalId=349883374385042 Event={ TraceProcessEvent=[{ Correlation=209574 EventClass=1 TextId=920 ReturnValue=0 },] } Type=48 }"
      }
    }
  }
}
Generated:
    /central/scratch/esm/slurm-buildkite/climaatmos-ci/18394/climaatmos-ci/target_gpu_implicit_baroclinic_wave/output_active/report.qdstrm

and

/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/AgentAPI/Src/SessionImpl.cpp(18): rpc Start(.Agent.StartRequest) returns (.Agent.EmptyMessage);
 is canceled because the timeout period is expired
🚨 Error: The command exited with status 1
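
The `ErrorText` in the first log says the importer rejected a new event because it starts before the last event already in the collection. As a rough illustration, the sketch below pulls the timestamps out of the two event descriptions and measures the gap. It assumes only the `StartNs=`/`StopNs=` field format seen in this log; `parse_event_times` is a hypothetical helper, not part of Nsight Systems.

```python
import re

# Timing fields copied from the ErrorText in the log above.
NEW_EVENT = "StartNs=403098042813 StopNs=403129613160 GlobalId=349883374385042"
LAST_EVENT = "StartNs=448052547615 StopNs=448084509068 GlobalId=349883374385042"


def parse_event_times(text):
    """Return (start_ns, stop_ns) parsed from an event description string."""
    start = int(re.search(r"StartNs=(\d+)", text).group(1))
    stop = int(re.search(r"StopNs=(\d+)", text).group(1))
    return start, stop


new_start, new_stop = parse_event_times(NEW_EVENT)
last_start, last_stop = parse_event_times(LAST_EVENT)

# Assumption: CheckOrder wants events added in non-decreasing start-time order,
# so a new event that starts earlier than the last one triggers the error.
out_of_order = new_start < last_start
gap_seconds = (last_start - new_start) / 1e9

print(f"out of order: {out_of_order}; new event starts {gap_seconds:.2f} s earlier")
```

For the values in this log, the new event starts roughly 45 seconds before the last recorded one, and both carry the same `GlobalId`.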

@Sbozzolo
Member

I think I fixed this at some point (at least, on clima). Is this still an issue?

@charleskawczynski
Member Author

Yeah, the original failure seems to be fixed, but one issue remains (https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/330#019151a2-dc7f-4525-aba6-b92ea170dd76):

┌ Info: Progress
│   simulation_time = "4 hours, 49 minutes"
│   n_steps_completed = 193
│   wall_time_per_step = "945 milliseconds, 292 microseconds"
│   wall_time_total = "15 minutes, 7 seconds"
│   wall_time_remaining = "12 minutes, 5 seconds"
│   wall_time_spent = "3 minutes, 2 seconds"
│   percent_complete = "20.1%"
│   sypd = 0.261
│   date_now = 2024-08-14T09:56:47.422
└   estimated_finish_date = 2024-08-14T10:08:52.461
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generating '/tmp/slurm-35905/nsys-report-09db.qdstrm'

Should we keep this issue open for this new error? The title is sufficiently general 🤷🏻‍♂️

@Sbozzolo
Member

Yes. At least this failure seems to be consistent: it always occurs with that particular job:

https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/329#01910503-1d89-4183-94b7-8c69e98619a2

Sbozzolo reopened this Oct 9, 2024

@Sbozzolo
Member

Sbozzolo commented Oct 9, 2024

Without nsight, the jobs run to completion: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/344

@charleskawczynski
Member Author

> Without nsight, the jobs run to completion: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/344

That's a different error: the reports are being generated, but the job is now running out of memory (OOM). I'm going to close this and open a new issue.

@charleskawczynski
Member Author

Opened #3375.
