
Add a native profiler. #2024

Merged: 12 commits merged into master from tb/native_profiler on Aug 14, 2023

Conversation

@maleadt (Member) commented Aug 9, 2023

This PR adds a native profiler, built on top of CUPTI, that should make it easier to do some simple profiling without having to resort to NSight. Output is loosely based on the old nvprof tool, rendered using PrettyTables.jl. Requires Julia 1.9, and CUDA 11.2.

The old CUDA.@profile, which only activated an external profiler, has been moved to CUDA.@profile external=true. As such, this probably will need to be a breaking release.
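For existing code the migration is mechanical; a sketch, where my_code() is a placeholder for whatever expression you were profiling:

julia> CUDA.@profile my_code()                 # now runs the native CUPTI-based profiler
julia> CUDA.@profile external=true my_code()   # old behaviour: only activate an external profiler (e.g. NSight)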

TODO:

  • Documentation
  • NVTX integration: seems like a bug in CUPTI, I've contacted NVIDIA

Small demo:

julia> CUDA.@profile Array(CUDA.rand(Float16, 1024, 1024).+1);
Profiler ran for 486.61 µs, capturing 26 events.

Host-side activity: calling CUDA APIs took 439.64 µs (90.35% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────┤
│   43.36% │  211.0 µs │     1 │  211.0 µs │  211.0 µs │  211.0 µs │ cuMemcpyDtoHAsync   │
│   42.28% │ 205.76 µs │     2 │ 102.88 µs │   6.68 µs │ 199.08 µs │ cuLaunchKernel      │
│    3.33% │  16.21 µs │     2 │   8.11 µs │   2.62 µs │  13.59 µs │ cuMemAllocAsync     │
│    0.20% │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ cuStreamSynchronize │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────┘

Device-side activity: GPU was busy for 96.08 µs (19.75% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬──────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼──────────────────────────────────────────────────────
│   18.13% │ 88.21 µs │     1 │ 88.21 µs │ 88.21 µs │ 88.21 µs │ [copy device to pageable memory]                    
│    0.88% │  4.29 µs │     1 │  4.29 µs │  4.29 µs │  4.29 µs │ rand_                                               
│    0.73% │  3.58 µs │     1 │  3.58 µs │  3.58 µs │  3.58 µs │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArra 
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────────────────────────────────────────────────
                                                                                                      1 column omitted

Also features a trace mode where events are listed chronologically:

julia> CUDA.@profile trace=true Array(CUDA.rand(Float16, 1024, 1024).+1);
Profiler ran for 977.04 µs, capturing 26 events.

Host-side activity: calling CUDA APIs took 920.77 µs (94.24% of the trace)
┌────┬───────────┬───────────┬─────────────────────┬──────────────────────────┐
│ ID │     Start │      Time │                Name │ Details                  │
├────┼───────────┼───────────┼─────────────────────┼──────────────────────────┤
│  2 │  19.07 µs │  13.59 µs │     cuMemAllocAsync │ 2.000 MiB, device memory │
│  6 │  41.01 µs │ 200.51 µs │      cuLaunchKernel │ -                        │
│  8 │ 247.24 µs │   2.38 µs │     cuMemAllocAsync │ 2.000 MiB, device memory │
│ 12 │ 254.15 µs │   7.39 µs │      cuLaunchKernel │ -                        │
│ 18 │ 279.66 µs │  690.7 µs │   cuMemcpyDtoHAsync │ -                        │
│ 22 │ 972.03 µs │   1.19 µs │ cuStreamSynchronize │ -                        │
└────┴───────────┴───────────┴─────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 120.4 µs (12.32% of the trace)
┌────┬───────────┬───────────┬─────────┬────────┬──────┬───────────┬───────────┬──────────────┬───────────────────────
│ ID │     Start │      Time │ Threads │ Blocks │ Regs │     SSMem │      Size │   Throughput │ Name                 
├────┼───────────┼───────────┼─────────┼────────┼──────┼───────────┼───────────┼──────────────┼───────────────────────
│  6 │ 238.18 µs │   4.29 µs │      64 │   3408 │   17 │ 256 bytes │         - │            - │ rand_                
│ 12 │ 258.68 µs │   3.58 µs │     768 │    284 │   32 │   0 bytes │         - │            - │ _Z16broadcast_kernel 
│ 18 │ 289.68 µs │ 112.53 µs │       - │      - │    - │         - │ 2.000 MiB │ 17.356 GiB/s │ [copy device to page 
└────┴───────────┴───────────┴─────────┴────────┴──────┴───────────┴───────────┴──────────────┴───────────────────────
                                                                                                      1 column omitted

We do some filtering and pre-processing to make the output a little more compact; this can be disabled using raw=true:

julia> CUDA.@profile trace=true raw=true Array(CUDA.rand(Float16, 1024, 1024).+1);
Profiler ran for 1.71 ms, capturing 36 events.

Host-side activity: calling CUDA APIs took 1.62 ms (95.02% of the trace)
┌──────┬─────────┬───────────┬────────────┬──────────────────────────────────┬──────────────────────────┐
│   ID │   Start │      Time │     Thread │                             Name │ Details                  │
├──────┼─────────┼───────────┼────────────┼──────────────────────────────────┼──────────────────────────┤
│ 1866 │  0.0 ns │   1.13 ms │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1867 │ 1.13 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1868 │ 1.14 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1869 │ 1.14 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1870 │ 1.15 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1871 │ 1.16 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1872 │ 1.16 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1873 │ 1.16 ms │  14.07 µs │ 1552268352 │                 cuCtxSynchronize │ -                        │
├──────┼─────────┼───────────┼────────────┼──────────────────────────────────┼──────────────────────────┤
│ 1874 │ 1.19 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1875 │ 1.19 ms │  14.07 µs │ 1552268352 │                  cuMemAllocAsync │ 2.000 MiB, device memory │
│ 1876 │ 1.22 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1877 │ 1.22 ms │ 715.26 ns │ 1552268352 │ cuOccupancyMaxPotentialBlockSize │ -                        │
│ 1878 │ 1.22 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1879 │ 1.22 ms │ 218.87 µs │ 1552268352 │                   cuLaunchKernel │ -                        │
│ 1880 │ 1.44 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1881 │ 1.44 ms │    3.1 µs │ 1552268352 │                  cuMemAllocAsync │ 2.000 MiB, device memory │
│ 1882 │ 1.45 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1883 │ 1.45 ms │ 476.84 ns │ 1552268352 │ cuOccupancyMaxPotentialBlockSize │ -                        │
│ 1884 │ 1.45 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1885 │ 1.45 ms │   7.63 µs │ 1552268352 │                   cuLaunchKernel │ -                        │
│ 1886 │ 1.46 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1887 │ 1.46 ms │   2.15 µs │ 1552268352 │            cuPointerGetAttribute │ -                        │
│ 1888 │ 1.47 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1889 │ 1.47 ms │ 715.26 ns │ 1552268352 │                    cuStreamQuery │ -                        │
│ 1890 │ 1.47 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1891 │ 1.47 ms │  226.5 µs │ 1552268352 │                cuMemcpyDtoHAsync │ -                        │
│ 1892 │  1.7 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1893 │  1.7 ms │ 715.26 ns │ 1552268352 │                    cuStreamQuery │ -                        │
│ 1894 │  1.7 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1895 │  1.7 ms │ 953.67 ns │ 1552268352 │              cuStreamSynchronize │ -                        │
│ 1896 │  1.7 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
├──────┼─────────┼───────────┼────────────┼──────────────────────────────────┼──────────────────────────┤
│ 1897 │  1.7 ms │   1.43 µs │ 1552268352 │                 cuCtxSynchronize │ -                        │
│ 1898 │ 1.71 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
└──────┴─────────┴───────────┴────────────┴──────────────────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 120.4 µs (7.06% of the trace)
┌──────┬─────────┬───────────┬────────────────────────────────┬────────┬─────────┬────────┬──────┬───────────┬────────
│   ID │   Start │      Time │                         Device │ Stream │ Threads │ Blocks │ Regs │     SSMem │   DSM 
├──────┼─────────┼───────────┼────────────────────────────────┼────────┼─────────┼────────┼──────┼───────────┼────────
│ 1879 │ 1.44 ms │   4.29 µs │ NVIDIA RTX 6000 Ada Generation │     13 │      64 │   3408 │   17 │ 256 bytes │ 0 byt 
│ 1885 │ 1.46 ms │   3.34 µs │ NVIDIA RTX 6000 Ada Generation │     13 │     768 │    284 │   32 │   0 bytes │ 0 byt 
│ 1891 │ 1.48 ms │ 112.77 µs │ NVIDIA RTX 6000 Ada Generation │     13 │       - │      - │    - │         - │       
└──────┴─────────┴───────────┴────────────────────────────────┴────────┴─────────┴────────┴──────┴───────────┴────────
                                                                                                     4 columns omitted

Fixes #2017

Any suggestions for improvements are welcome. Reporting of metrics/performance counters, source-code correlation, and other advanced features are currently not on the table; just use NSight for that. This functionality is not intended to replace those tools, which work perfectly fine, but are just a bit cumbersome to set up for most users' needs.

@maleadt added the "enhancement" (New feature or request) label on Aug 9, 2023

@maleadt (Member, Author) commented Aug 9, 2023

Segfaults pretty quickly on anything older than Julia 1.9. I guess this depends on foreign-thread adoption, as CUPTI calls us from an unmanaged worker thread.

@maleadt force-pushed the tb/native_profiler branch 3 times, most recently from 629c9ce to da34fa7 on August 12, 2023
@maleadt changed the title from "WIP: Add a native profiler." to "Add a native profiler." on Aug 12, 2023
@maleadt (Member, Author) commented Aug 12, 2023

The hang on CUDA 11.1 seems real.

EDIT: let's just not support CUDA < 11.2; who's using that anyway?

[skip julia]
[skip cuda]
[skip subpackages]
[skip benchmarks]

codecov bot commented Aug 14, 2023

Codecov Report

Patch coverage: 8.63% and project coverage change: -42.45% ⚠️

Comparison is base (4cd4d14) 59.03% compared to head (d3fcc7c) 16.58%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2024       +/-   ##
===========================================
- Coverage   59.03%   16.58%   -42.45%     
===========================================
  Files         152      152               
  Lines       12851    13214      +363     
===========================================
- Hits         7586     2192     -5394     
- Misses       5265    11022     +5757     
Files Changed          Coverage            Δ
lib/cupti/wrappers.jl  3.30%   <0.00%>     (-96.70%) ⬇️
src/CUDA.jl            100.00% <ø>         (ø)
src/profile.jl         12.00%  <12.00%>    (ø)

... and 66 files with indirect coverage changes


@maleadt maleadt merged commit a06920e into master Aug 14, 2023
@maleadt maleadt deleted the tb/native_profiler branch August 14, 2023 10:22