Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cargo: build with frame pointers #10226

Closed
wants to merge 1 commit into from
Closed

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Dec 22, 2024

Problem

Frame pointers are typically disabled by default (depending on CPU architecture), to improve performance. This frees up a CPU register, and avoids a couple of instructions per function call -- but this often doesn't matter on modern CPU architectures, and benchmarks did not show measurable overhead. However, it makes stack unwinding much more inefficient, since it has to use DWARF debug information instead, and gives worse results with e.g. perf and eBPF profiles. With continuous profiling, cheaper stack unwinding will likely be a net win, and allow us to use higher sampling resolution.

The Rust standard library and jemalloc already enable frame pointers by default.

For more information, see https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html.

Resolves #10224.

Summary of changes

Enable frame pointers in all builds, and use frame pointers for pprof-rs stack sampling.

@erikgrinaker erikgrinaker changed the title cargo: build with fram pointers cargo: build with frame pointers Dec 22, 2024
Copy link

github-actions bot commented Dec 22, 2024

7095 tests run: 6797 passed, 0 failed, 298 skipped (full report)


Flaky tests (5)

Postgres 17

Postgres 16

  • test_physical_replication_config_mismatch_too_many_known_xids: release-arm64
  • test_physical_replication_config_mismatch_max_locks_per_transaction: release-arm64

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 31.2% (8399 of 26877 functions)
  • lines: 48.0% (66685 of 139059 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
de36026 at 2024-12-29T12:58:56.126Z :recycle:

@erikgrinaker
Copy link
Contributor Author

I ran some benchmarks for WalStreamDecoder::complete_record() on a Linux amd64 box. This isn't necessarily representative (it's mostly very optimized CRC32 checksumming), but it is a CPU-bound hot path that doesn't involve allocations or IO. The results don't show significant overhead -- there is a small amount of variation though, so I ran each benchmark for 30 seconds multiple times, with similar results.

complete_record/size=64 time:   [41.561 ns 41.891 ns 42.152 ns]
                        thrpt:  [1.4141 GiB/s 1.4229 GiB/s 1.4341 GiB/s]
                 change:
                        time:   [+0.7893% +1.1356% +1.4278%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4077% -1.1228% -0.7831%]

complete_record/size=1024
                        time:   [144.64 ns 144.73 ns 144.83 ns]
                        thrpt:  [6.5847 GiB/s 6.5893 GiB/s 6.5935 GiB/s]
                 change:
                        time:   [-3.4100% -3.3531% -3.2898%] (p = 0.00 < 0.05)
                        thrpt:  [+3.4017% +3.4694% +3.5303%]

complete_record/size=8192
                        time:   [965.16 ns 965.27 ns 965.39 ns]
                        thrpt:  [7.9029 GiB/s 7.9039 GiB/s 7.9048 GiB/s]
                 change:
                        time:   [+0.4465% +0.4593% +0.4730%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4708% -0.4572% -0.4445%]

complete_record/size=131072
                        time:   [13.298 µs 13.352 µs 13.425 µs]
                        thrpt:  [9.0927 GiB/s 9.1425 GiB/s 9.1799 GiB/s]
                 change:
                        time:   [+5.9949% +6.8990% +7.8233%] (p = 0.00 < 0.05)
                        thrpt:  [-7.2557% -6.4537% -5.6558%]

@erikgrinaker
Copy link
Contributor Author

Interestingly, the frame-pointer feature of pprof-rs is an order of magnitude slower than just using libunwind without frame pointers (11 µs vs. 1.4 µs). libunwind performance did not change with or without frame pointers.

This was using a benchmark of pprof-rs TraceImpl::trace() with a stack depth of 40 on Linux.

This probably isn't worth it then. There was some hope that it might resolve #10225, as seen in grafana/pyroscope-rs#124, but if it requires the frame-pointer feature then it'll cause a 10x slowdown of traces which isn't acceptable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build with frame pointers for improved profiling
1 participant