
Add a native profiler. #2024

Merged: 12 commits merged into master from tb/native_profiler on Aug 14, 2023

Conversation

@maleadt (Member) commented Aug 9, 2023

This PR adds a native profiler, built on top of CUPTI, that should make it easier to do some simple profiling without having to resort to NSight. Output is loosely based on the old nvprof tool, rendered using PrettyTables.jl. Requires Julia 1.9, and CUDA 11.2.

The old CUDA.@profile, which only activated an external profiler, has been moved to CUDA.@profile external=true. As such, this probably will need to be a breaking release.
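For existing code the migration is mechanical; a sketch, where my_code() is a placeholder for whatever expression you were profiling:

julia> CUDA.@profile my_code()                 # now runs the native CUPTI-based profiler
julia> CUDA.@profile external=true my_code()   # old behaviour: only activate an external profiler (e.g. NSight)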

TODO:

  • Documentation
  • NVTX integration: seems like a bug in CUPTI, I've contacted NVIDIA

Small demo:

julia> CUDA.@profile Array(CUDA.rand(Float16, 1024, 1024).+1);
Profiler ran for 486.61 µs, capturing 26 events.

Host-side activity: calling CUDA APIs took 439.64 µs (90.35% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────┤
│   43.36% │  211.0 µs │     1 │  211.0 µs │  211.0 µs │  211.0 µs │ cuMemcpyDtoHAsync   │
│   42.28% │ 205.76 µs │     2 │ 102.88 µs │   6.68 µs │ 199.08 µs │ cuLaunchKernel      │
│    3.33% │  16.21 µs │     2 │   8.11 µs │   2.62 µs │  13.59 µs │ cuMemAllocAsync     │
│    0.20% │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ cuStreamSynchronize │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────┘

Device-side activity: GPU was busy for 96.08 µs (19.75% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬──────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼──────────────────────────────────────────────────────
│   18.13% │ 88.21 µs │     1 │ 88.21 µs │ 88.21 µs │ 88.21 µs │ [copy device to pageable memory]                    
│    0.88% │  4.29 µs │     1 │  4.29 µs │  4.29 µs │  4.29 µs │ rand_                                               
│    0.73% │  3.58 µs │     1 │  3.58 µs │  3.58 µs │  3.58 µs │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArra 
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────────────────────────────────────────────────
                                                                                                      1 column omitted

Also features a trace mode where events are listed chronologically:

julia> CUDA.@profile trace=true Array(CUDA.rand(Float16, 1024, 1024).+1);
Profiler ran for 977.04 µs, capturing 26 events.

Host-side activity: calling CUDA APIs took 920.77 µs (94.24% of the trace)
┌────┬───────────┬───────────┬─────────────────────┬──────────────────────────┐
│ ID │     Start │      Time │                Name │ Details                  │
├────┼───────────┼───────────┼─────────────────────┼──────────────────────────┤
│  2 │  19.07 µs │  13.59 µs │     cuMemAllocAsync │ 2.000 MiB, device memory │
│  6 │  41.01 µs │ 200.51 µs │      cuLaunchKernel │ -                        │
│  8 │ 247.24 µs │   2.38 µs │     cuMemAllocAsync │ 2.000 MiB, device memory │
│ 12 │ 254.15 µs │   7.39 µs │      cuLaunchKernel │ -                        │
│ 18 │ 279.66 µs │  690.7 µs │   cuMemcpyDtoHAsync │ -                        │
│ 22 │ 972.03 µs │   1.19 µs │ cuStreamSynchronize │ -                        │
└────┴───────────┴───────────┴─────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 120.4 µs (12.32% of the trace)
┌────┬───────────┬───────────┬─────────┬────────┬──────┬───────────┬───────────┬──────────────┬───────────────────────
│ ID │     Start │      Time │ Threads │ Blocks │ Regs │     SSMem │      Size │   Throughput │ Name                 
├────┼───────────┼───────────┼─────────┼────────┼──────┼───────────┼───────────┼──────────────┼───────────────────────
│  6 │ 238.18 µs │   4.29 µs │      64 │   3408 │   17 │ 256 bytes │         - │            - │ rand_                
│ 12 │ 258.68 µs │   3.58 µs │     768 │    284 │   32 │   0 bytes │         - │            - │ _Z16broadcast_kernel 
│ 18 │ 289.68 µs │ 112.53 µs │       - │      - │    - │         - │ 2.000 MiB │ 17.356 GiB/s │ [copy device to page 
└────┴───────────┴───────────┴─────────┴────────┴──────┴───────────┴───────────┴──────────────┴───────────────────────
                                                                                                      1 column omitted

We do some filtering and pre-processing to make the output a little more compact; this can be disabled using raw=true:

julia> CUDA.@profile trace=true raw=true Array(CUDA.rand(Float16, 1024, 1024).+1);
Profiler ran for 1.71 ms, capturing 36 events.

Host-side activity: calling CUDA APIs took 1.62 ms (95.02% of the trace)
┌──────┬─────────┬───────────┬────────────┬──────────────────────────────────┬──────────────────────────┐
│   ID │   Start │      Time │     Thread │                             Name │ Details                  │
├──────┼─────────┼───────────┼────────────┼──────────────────────────────────┼──────────────────────────┤
│ 1866 │  0.0 ns │   1.13 ms │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1867 │ 1.13 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1868 │ 1.14 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1869 │ 1.14 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1870 │ 1.15 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1871 │ 1.16 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1872 │ 1.16 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1873 │ 1.16 ms │  14.07 µs │ 1552268352 │                 cuCtxSynchronize │ -                        │
├──────┼─────────┼───────────┼────────────┼──────────────────────────────────┼──────────────────────────┤
│ 1874 │ 1.19 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1875 │ 1.19 ms │  14.07 µs │ 1552268352 │                  cuMemAllocAsync │ 2.000 MiB, device memory │
│ 1876 │ 1.22 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1877 │ 1.22 ms │ 715.26 ns │ 1552268352 │ cuOccupancyMaxPotentialBlockSize │ -                        │
│ 1878 │ 1.22 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1879 │ 1.22 ms │ 218.87 µs │ 1552268352 │                   cuLaunchKernel │ -                        │
│ 1880 │ 1.44 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1881 │ 1.44 ms │    3.1 µs │ 1552268352 │                  cuMemAllocAsync │ 2.000 MiB, device memory │
│ 1882 │ 1.45 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1883 │ 1.45 ms │ 476.84 ns │ 1552268352 │ cuOccupancyMaxPotentialBlockSize │ -                        │
│ 1884 │ 1.45 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1885 │ 1.45 ms │   7.63 µs │ 1552268352 │                   cuLaunchKernel │ -                        │
│ 1886 │ 1.46 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1887 │ 1.46 ms │   2.15 µs │ 1552268352 │            cuPointerGetAttribute │ -                        │
│ 1888 │ 1.47 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1889 │ 1.47 ms │ 715.26 ns │ 1552268352 │                    cuStreamQuery │ -                        │
│ 1890 │ 1.47 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1891 │ 1.47 ms │  226.5 µs │ 1552268352 │                cuMemcpyDtoHAsync │ -                        │
│ 1892 │  1.7 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1893 │  1.7 ms │ 715.26 ns │ 1552268352 │                    cuStreamQuery │ -                        │
│ 1894 │  1.7 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
│ 1895 │  1.7 ms │ 953.67 ns │ 1552268352 │              cuStreamSynchronize │ -                        │
│ 1896 │  1.7 ms │ 238.42 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
├──────┼─────────┼───────────┼────────────┼──────────────────────────────────┼──────────────────────────┤
│ 1897 │  1.7 ms │   1.43 µs │ 1552268352 │                 cuCtxSynchronize │ -                        │
│ 1898 │ 1.71 ms │    0.0 ns │ 1552268352 │                  cuCtxGetCurrent │ -                        │
└──────┴─────────┴───────────┴────────────┴──────────────────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 120.4 µs (7.06% of the trace)
┌──────┬─────────┬───────────┬────────────────────────────────┬────────┬─────────┬────────┬──────┬───────────┬────────
│   ID │   Start │      Time │                         Device │ Stream │ Threads │ Blocks │ Regs │     SSMem │   DSM 
├──────┼─────────┼───────────┼────────────────────────────────┼────────┼─────────┼────────┼──────┼───────────┼────────
│ 1879 │ 1.44 ms │   4.29 µs │ NVIDIA RTX 6000 Ada Generation │     13 │      64 │   3408 │   17 │ 256 bytes │ 0 byt 
│ 1885 │ 1.46 ms │   3.34 µs │ NVIDIA RTX 6000 Ada Generation │     13 │     768 │    284 │   32 │   0 bytes │ 0 byt 
│ 1891 │ 1.48 ms │ 112.77 µs │ NVIDIA RTX 6000 Ada Generation │     13 │       - │      - │    - │         - │       
└──────┴─────────┴───────────┴────────────────────────────────┴────────┴─────────┴────────┴──────┴───────────┴────────
                                                                                                     4 columns omitted

Fixes #2017

Any suggestions for improvements are welcome. Reporting of metrics/performance counters, source-code correlation, and other advanced features are currently not on the table; just use NSight for that. This functionality is not intended to replace those tools, which work perfectly fine, but are just a bit cumbersome to set up for most users' needs.

@maleadt added the "enhancement" (New feature or request) label on Aug 9, 2023

@maleadt (Member, Author) commented Aug 9, 2023

Segfaults pretty quickly on anything older than Julia 1.9. I guess this depends on foreign-thread adoption, as CUPTI calls us from an unmanaged worker thread.

@maleadt force-pushed the tb/native_profiler branch 3 times, most recently from 629c9ce to da34fa7 on August 12, 2023
@maleadt changed the title from "WIP: Add a native profiler." to "Add a native profiler." on Aug 12, 2023
@maleadt (Member, Author) commented Aug 12, 2023

The hang on CUDA 11.1 seems real.

EDIT: let's just not support CUDA < 11.2; who's using that anyway?

[skip julia]
[skip cuda]
[skip subpackages]
[skip benchmarks]

codecov bot commented Aug 14, 2023

Codecov Report

Patch coverage: 8.63% and project coverage change: -42.45% ⚠️

Comparison is base (4cd4d14) 59.03% compared to head (d3fcc7c) 16.58%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2024       +/-   ##
===========================================
- Coverage   59.03%   16.58%   -42.45%     
===========================================
  Files         152      152               
  Lines       12851    13214      +363     
===========================================
- Hits         7586     2192     -5394     
- Misses       5265    11022     +5757     
Files Changed          Coverage            Δ
lib/cupti/wrappers.jl  3.30%   <0.00%>     (-96.70%) ⬇️
src/CUDA.jl            100.00% <ø>         (ø)
src/profile.jl         12.00%  <12.00%>    (ø)

... and 66 files with indirect coverage changes


@maleadt maleadt merged commit a06920e into master Aug 14, 2023
@maleadt maleadt deleted the tb/native_profiler branch August 14, 2023 10:22