Collecting system metrics during runs. #37

venkat-1 · 2024-07-09T17:06:33Z

We need to collect system metrics during each run, including GPU memory errors, NIC errors, and retransmits. Ideally, collection runs in a background thread or process with as little overhead as possible. This will help us detect slowdowns during runs and identify where they originate.
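As a starting point for discussion, here is a minimal sketch of a low-overhead background sampler. It assumes a Linux host that exposes per-interface counters under `/sys/class/net/<iface>/statistics/`. The names `read_nic_error_counters` and `MetricSampler` are hypothetical and not part of any existing tool in this project. GPU memory errors and retransmit counts would need their own collector functions, which are not shown here.

```python
import threading
import time
from pathlib import Path


def read_nic_error_counters(root="/sys/class/net"):
    """Return {"<iface>.<counter>": int} from Linux sysfs; empty if unavailable."""
    counters = {}
    base = Path(root)
    if not base.is_dir():
        return counters
    for iface in sorted(base.iterdir()):
        for name in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
            path = iface / "statistics" / name
            try:
                counters[f"{iface.name}.{name}"] = int(path.read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this interface
    return counters


class MetricSampler:
    """Call `collect()` every `interval_s` seconds on a daemon thread."""

    def __init__(self, collect, interval_s=10.0):
        self._collect = collect
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples = []  # list of (unix_time, metrics_dict)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), self._collect()))
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._interval_s)

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Usage would look like `sampler = MetricSampler(read_nic_error_counters, interval_s=30).start()` before the run and `sampler.stop()` afterwards. Comparing consecutive samples then shows when a counter started to climb. A thread keeps the sketch simple; a separate process may be preferable if the training loop holds the GIL for long stretches.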

Another aspect to consider is to run periodic tests on the GPUs, such as hgemm and Igemm and any peer-to-peer tests, to see if there are any low-performing GPUs over time. So, in addition to running at the beginning of a job, we can run this after a specific duration, such as after every checkpoint.

nscottnichols · 2024-07-09T17:23:29Z

venkat-1 · 2024-07-09T17:27:43Z

It needs to target multiple systems. For now, at least, target Polaris (NVIDIA + SS11), Aurora (Intel + SS11), and Sophia (NVIDIA + IB).

nscottnichols · 2024-07-09T17:30:37Z

> Another aspect to consider is to run periodic tests on the GPUs, such as hgemm and Igemm and any peer-to-peer tests, to see if there are any low-performing GPUs over time. So, in addition to running at the beginning of a job, we can run this after a specific duration, such as after every checkpoint.

We have GEMM benchmarks as part of the node performance overview. I can adapt these tests to be run as part of a pre-execution hook that we run during our job submissions and to target other systems.
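One possible shape for such a hook, sketched under assumptions: the benchmark is wrapped in a callable that returns a `{gpu_id: TFLOPS}` mapping, and a GPU counts as low-performing when it falls below a fraction of the node median. `HealthCheckHook`, `find_slow_gpus`, and the 0.9 default are hypothetical choices for illustration, not the existing node performance overview code.

```python
from statistics import median


def find_slow_gpus(tflops_by_gpu, min_fraction=0.9):
    """Return sorted IDs of GPUs whose throughput is below min_fraction * median."""
    if not tflops_by_gpu:
        return []
    threshold = min_fraction * median(tflops_by_gpu.values())
    return sorted(g for g, t in tflops_by_gpu.items() if t < threshold)


class HealthCheckHook:
    """Run a benchmark at job start and again after every checkpoint."""

    def __init__(self, run_benchmark, min_fraction=0.9):
        self._run_benchmark = run_benchmark  # callable -> {gpu_id: TFLOPS}
        self._min_fraction = min_fraction
        self.history = []  # list of (label, slow_gpu_ids)

    def _check(self, label):
        slow = find_slow_gpus(self._run_benchmark(), self._min_fraction)
        self.history.append((label, slow))
        return slow

    def on_job_start(self):
        return self._check("start")

    def on_checkpoint(self, step):
        return self._check(f"checkpoint@{step}")
```

Keeping the benchmark behind a plain callable means the same hook could wrap a different GEMM or peer-to-peer test on each target system, and `history` records when a GPU first dropped below the threshold.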

nscottnichols · 2024-07-09T17:56:54Z

@venkat-1 I created a new issue for testing, #40. Also, for metric tracking, are we looking for something that can be post-processed, something with a live view or feed, or something integrated with W&B?

venkat-1 assigned venkat-1 and nscottnichols Jul 9, 2024

nscottnichols closed this as completed Jul 9, 2024

nscottnichols reopened this Jul 9, 2024

nscottnichols mentioned this issue Jul 9, 2024

Develop pre-/mid-execution test harness #40

Open
