Commit

Update slides
cscjlan committed May 16, 2024
1 parent 21a24ec commit f8518db
Showing 1 changed file with 46 additions and 50 deletions.
96 changes: 46 additions & 50 deletions application-performance/docs/02-gpu-performance.md
@@ -1,27 +1,30 @@
---
title: Single node performance optimization
title: GPU performance optimization
event: CSC Summer School in High-Performance Computing 2024
lang: en
---

# GPU performance optimization {.section}

# Introduction
- GPUs (Graphics Processing Units) are widely used in High-Performance Computing (HPC) applications.
- GPUs are powerful and complex processors designed for parallel computing.
- GPUs require explicit expression of parallelism by the programmer.

- GPUs (Graphics Processing Units) are widely used in High-Performance Computing (HPC) applications
- GPUs are powerful and complex processors designed for parallel computing
- GPUs require explicit expression of parallelism by the programmer

# General Principles for High GPU Performance

<div class=column>
:::::: {.columns}
::: {.column width="50%"}
- Keep all the compute resources busy (idle resources are a waste)
- Minimize the synchronization at all levels
- Minimize the data transfers between host and device
- Keep the data in faster memory and use an appropriate access pattern
</div>
<div class=column>
![](img/lumi_node.png){.center width=40%}
</div>
:::
::: {.column width="50%"}
![](img/lumi_node.png){.center width=50%}
:::
::::::
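
The advice to minimize host-device transfers can be illustrated with a simple latency plus bandwidth cost model. The sketch below is not a measurement: `LATENCY_S` and `BANDWIDTH_B_PER_S` are hypothetical values chosen for illustration, and `transfer_time` is a helper introduced here, not part of any GPU API.

```python
# Hypothetical link parameters, chosen only for illustration.
LATENCY_S = 10e-6          # fixed cost per transfer (assumed)
BANDWIDTH_B_PER_S = 25e9   # sustained host-device bandwidth (assumed)


def transfer_time(n_bytes, n_transfers=1):
    """Estimated time to move n_bytes split evenly over n_transfers copies."""
    per_copy = LATENCY_S + (n_bytes / n_transfers) / BANDWIDTH_B_PER_S
    return n_transfers * per_copy


total_bytes = 4_000_000
many_small = transfer_time(total_bytes, n_transfers=1000)
one_large = transfer_time(total_bytes, n_transfers=1)

print(f"1000 small copies: {many_small * 1e3:.2f} ms")
print(f"1 large copy:      {one_large * 1e3:.2f} ms")
```

Under these assumptions the fixed per-transfer cost dominates the small copies, which is why batching data into fewer, larger transfers usually pays off. Real numbers should come from a profiler.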

# GPU performance analysis {.section}

@@ -30,6 +33,7 @@ lang: en
![](img/perf-analysis-single-gpu.svg){.center width=60%}

# Measuring performance

- Don’t speculate about performance – measure it!
- Performance analysis tools help to
- Find hot-spots
@@ -43,12 +47,10 @@ lang: en

# Hardware performance counters

- Hardware performance counters are special registers on CPU \& GPU that count
hardware events
- They enable more accurate statistics and low overhead
- Special registers on CPU \& GPU that count hardware events
- Enable more accurate statistics and low overhead
- In some cases they can be used for tracing without any extra
instrumentation

- Number of counters is much smaller than the number of events that can be
recorded
- Different devices have different counters
@@ -71,71 +73,65 @@ lang: en
- Start with an overview!
- Call tree information, what routines are most expensive?

# Sampling vs. Tracing
# Sampling

<div class=column>
Sampling
:::::: {.columns}
::: {.column width="50%"}

- Application is stopped at predetermined intervals
- Information is collected about the state of application
- Lightweight, but may give skewed results
- Statistical information

</div>
<div class=column>
Tracing
:::
::: {.column width="50%"}

- Records events, e.g., every function call
- Usually requires modification of the executable, *i.e.*, instrumentation
- More accurate, but may affect program behavior
- Often generates lots of data

</div>
![](img/sampling.png){.left width=100%}

# Sampling vs. Tracing
:::
::::::
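
Why sampling is lightweight but may give skewed results can be sketched with a toy simulation. This is not a real profiler: `TIMELINE` is a made-up execution timeline in milliseconds, and `function_at`, `sample_profile` and `exact_profile` are hypothetical helpers introduced here.

```python
from collections import Counter

# Hypothetical timeline: (function name, start, end) in milliseconds.
TIMELINE = [
    ("init", 0, 3),
    ("compute", 3, 43),
    ("io", 43, 47),
    ("compute", 47, 97),
    ("finalize", 97, 100),
]


def function_at(timeline, t):
    """Return the function active at time t, or None if nothing is running."""
    for name, start, end in timeline:
        if start <= t < end:
            return name
    return None


def sample_profile(timeline, interval):
    """Estimate time shares by inspecting the state every `interval` ms."""
    total = max(end for _, _, end in timeline)
    hits = Counter()
    t = 0
    while t < total:
        name = function_at(timeline, t)
        if name is not None:
            hits[name] += 1
        t += interval
    n_samples = sum(hits.values())
    return {name: count / n_samples for name, count in hits.items()}


def exact_profile(timeline):
    """Exact time shares computed from the full timeline."""
    total = sum(end - start for _, start, end in timeline)
    shares = Counter()
    for name, start, end in timeline:
        shares[name] += (end - start) / total
    return dict(shares)


print("sampled:", sample_profile(TIMELINE, interval=10))
print("exact:  ", exact_profile(TIMELINE))
```

With a 10 ms interval the dominant `compute` routine is estimated well, while the short `io` and `finalize` calls fall between samples and do not appear at all. A shorter interval recovers them at the price of more overhead.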

<div class=column>
Sampling
# Tracing

![](img/sampling.png){.left width=80%}
:::::: {.columns}
::: {.column width="50%"}

</div>
<div class=column>
Tracing
- Records events, e.g., every function call
- Usually requires modification of the executable, i.e., code instrumentation
- More accurate than sampling, but may affect program behavior
- Often generates lots of data

![](img/tracing.png){.left width=80%}
:::
::: {.column width="50%"}

</div>
![](img/tracing.png){.left width=100%}

# Tau Analysis Utilities
:::
::::::
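
Tracing by instrumentation can be sketched in a few lines. The example assumes a hand-written `traced` decorator and a global `TRACE` list, both hypothetical. Real tools such as TAU insert comparable enter/exit hooks automatically.

```python
import functools
import time

TRACE = []  # recorded events: (kind, function name, timestamp in seconds)


def traced(func):
    """Instrument func so that every call records an enter and an exit event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        TRACE.append(("enter", func.__name__, time.perf_counter()))
        try:
            return func(*args, **kwargs)
        finally:
            TRACE.append(("exit", func.__name__, time.perf_counter()))
    return wrapper


@traced
def square(x):
    return x * x


@traced
def sum_of_squares(n):
    return sum(square(i) for i in range(n))


result = sum_of_squares(3)
for kind, name, timestamp in TRACE:
    print(f"{timestamp:.6f} {kind:5s} {name}")
```

Every call produces two events, so three inner calls plus one outer call already yield eight records. This shows how tracing every function call can generate a lot of data and perturb the timing of short routines.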

<small>
# Tau Analysis Utilities

- TAU is a powerful performance evaluation toolkit
- <https://www.cs.uoregon.edu/research/tau/home.php>
- A performance evaluation toolkit
- Runs on all HPC platforms, relatively easy to install
- Targets all parallel programming/execution paradigms (GPU, MPI, OpenMP, pthreads, ...)
- Programming languages: Fortran, C, C++, UPC, Java, Python, ...
- Programming languages: Fortran, C, C++, UPC, Java, Python, ...

# Tau Analysis Utilities cont.

- TAU has instrumentation, measurement and analysis tools
- User-friendly graphical interface
- Profiling: Measures total time spent in each routine
- Tracing: Shows events and their timings across processes on a timeline
- I/O performance evaluation
- Memory debugging

</small>

# Omniperf Tools

- <https://amdresearch.github.io/omniperf/getting_started.html>
- system performance profiling tool for machine
learning/HPC workloads running on AMD MI GPUs.
- presently targets usage on MI100 and MI200 accelerators.
learning/HPC workloads running on AMD MI GPUs
- presently targets usage on MI100 and MI200 accelerators
- profiling, roofline model, tracing
- built on top of `roctracer` and `rocprof`
- supports both a web-based GUI and a command-line analyzer for user convenience.


# Web Resources
- TAU homepage
- <https://www.cs.uoregon.edu/research/tau/home.php>
- Omniperf
- <https://amdresearch.github.io/omniperf/getting_started.html>
- supports both a web-based GUI and a command-line analyzer for user convenience
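
The roofline model mentioned above bounds attainable performance by the smaller of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch follows, with hypothetical peak values rather than figures for any specific MI GPU. `attainable_flops` is a helper defined here, not an Omniperf function.

```python
# Hypothetical device peaks, for illustration only.
PEAK_FLOPS = 20e12          # peak compute in FLOP/s (assumed)
PEAK_BW = 1.6e12            # peak memory bandwidth in bytes/s (assumed)


def attainable_flops(arithmetic_intensity):
    """Roofline bound for a kernel with the given FLOP/byte ratio."""
    return min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW)


# Intensity at which a kernel stops being memory bound.
ridge_point = PEAK_FLOPS / PEAK_BW

for ai in (0.25, 4.0, 100.0):
    bound = "memory" if ai < ridge_point else "compute"
    print(f"AI = {ai:6.2f} FLOP/byte: "
          f"{attainable_flops(ai) / 1e12:5.2f} TFLOP/s ({bound} bound)")
```

Kernels to the left of the ridge point are limited by memory traffic, so improving data reuse or access patterns helps more there than reducing arithmetic.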
