Link-Time Optimization (LTO), Profile-Guided Optimization (PGO), Post-Link Optimization (PLO) benchmark results #765

Open
zamazan4ik opened this issue May 24, 2024 · 1 comment


zamazan4ik commented May 24, 2024

Hi!

As was proposed here, I decided to test how resvg performs when built with more advanced compiler optimizations: LTO, PGO, and PLO. I have recently been testing Profile-Guided Optimization (PGO) on many projects across different software domains; all the results are available at https://github.com/zamazan4ik/awesome-pgo . Here are my results for this project. I hope they will be helpful to someone.

Test environment

  • Fedora 39
  • Linux kernel 6.8.9
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - rustc 1.78.0
  • resvg version: the latest master branch at the time of writing, commit 4b4e8970de29407e6257aac3d2f501b60e88236a
  • Turbo Boost disabled

Benchmark

For benchmark purposes, I use a simple scenario: converting an SVG file to a PNG file with the resvg input.svg output.png command. For PGO, I use the cargo-pgo tool. The Release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, and the PGO-optimized build with cargo pgo optimize build.

taskset -c 0 is used during all measurements to reduce the OS scheduler's influence on the results. All measurements are done on the same machine with the same background "noise" (as far as I can guarantee).

As the training input for the resvg input.svg output.png command, I use this file.

Additionally, I decided to re-enable LTO for the tool. You disabled this optimization nearly 5 years ago due to some compiler bugs. I guess the LTO implementation in the compiler has become much more stable over the last 5 years, so we can consider enabling it once again. For the benchmarks, I enabled it by adding the following to the Cargo.toml file:

[profile.release]
codegen-units = 1
lto = true

Post-Link Optimization (with LLVM BOLT) is also done with cargo-pgo, using the same training workload as the PGO step.
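For anyone who wants to reproduce the builds, the steps described above can be sketched roughly as the script below. This is a minimal sketch, not the exact commands used for this report: the cargo pgo bolt subcommands and the -bolt-instrumented binary suffix follow my reading of the cargo-pgo README, and the target triple, output directory, and /tmp/train.png path are assumptions. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of a PGO + BOLT build with cargo-pgo.
# Dry run by default: commands are printed, not executed (set DRY_RUN=0 to run them).
set -eu

TRIPLE="${TRIPLE:-x86_64-unknown-linux-gnu}"    # assumption: host target triple
BIN_DIR="target/${TRIPLE}/release"              # assumption: cargo-pgo output directory
TRAIN_SVG="${TRAIN_SVG:-large_svg_file_1.svg}"  # training input file

# Print a command and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

pgo_bolt_workflow() {
    # 1. PGO-instrumented build
    run cargo pgo build
    # 2. Training run: collect PGO profiles
    run "${BIN_DIR}/resvg" "${TRAIN_SVG}" /tmp/train.png
    # 3. PGO-optimized, BOLT-instrumented build
    run cargo pgo bolt build --with-pgo
    # 4. Training run: collect BOLT profiles (binary name is an assumption)
    run "${BIN_DIR}/resvg-bolt-instrumented" "${TRAIN_SVG}" /tmp/train.png
    # 5. Final PGO + BOLT optimized build
    run cargo pgo bolt optimize --with-pgo
}

pgo_bolt_workflow
```

For a PGO-only build, steps 3 to 5 would be replaced by a single cargo pgo optimize build.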

Results

Firstly, let's check the scenario where the training workload and the benchmark workload are the same. Such a benchmark is still useful for cases where you need to convert the same file many times (e.g. as part of CI without caching):

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.349 s ±  0.011 s    [User: 3.082 s, System: 0.257 s]
  Range (min … max):    3.333 s …  3.368 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.062 s ±  0.018 s    [User: 2.802 s, System: 0.250 s]
  Range (min … max):    3.040 s …  3.120 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      2.631 s ±  0.008 s    [User: 2.368 s, System: 0.255 s]
  Range (min … max):    2.622 s …  2.644 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      2.611 s ±  0.007 s    [User: 2.347 s, System: 0.256 s]
  Range (min … max):    2.598 s …  2.622 s    15 runs

Summary
  taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png ran
    1.01 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
    1.17 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
    1.28 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png

where:

  • resvg_release - regular Release build
  • resvg_release_lto - Release + LTO
  • resvg_lto_optimized - Release + LTO + PGO optimized
  • resvg_lto_bolt_optimized - Release + LTO + PGO optimized + BOLT optimized

According to the results, LTO and PGO measurably improve performance: LTO alone is about 1.09x faster than the regular Release build, and LTO + PGO is about 1.27x faster. BOLT, however, adds less than 1% on top of PGO.
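The ratios in the hyperfine summary can be re-derived from the reported mean times. Here is a small sketch, assuming awk is available; speedup is a helper name introduced only for this illustration:

```shell
# speedup BASE_MEAN NEW_MEAN: how many times faster NEW is than BASE,
# rounded to two decimals.
speedup() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", base / cur }'
}

# Mean times (seconds) from the large_svg_file_1.svg run above.
speedup 3.349 3.062   # Release vs. Release + LTO          -> 1.09
speedup 3.349 2.631   # Release vs. LTO + PGO              -> 1.27
speedup 3.349 2.611   # Release vs. LTO + PGO + BOLT       -> 1.28
speedup 2.631 2.611   # LTO + PGO vs. LTO + PGO + BOLT     -> 1.01
```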

What if the training and benchmark workloads are different files? For this test, I used the same training file as above but benchmarked with another file. Here we go:

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      2.398 s ±  0.006 s    [User: 2.260 s, System: 0.131 s]
  Range (min … max):    2.391 s …  2.414 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      2.130 s ±  0.008 s    [User: 1.991 s, System: 0.133 s]
  Range (min … max):    2.123 s …  2.157 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      1.846 s ±  0.006 s    [User: 1.707 s, System: 0.134 s]
  Range (min … max):    1.838 s …  1.859 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      1.864 s ±  0.021 s    [User: 1.723 s, System: 0.135 s]
  Range (min … max):    1.851 s …  1.935 s    15 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png ran
    1.01 ± 0.01 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
    1.15 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
    1.30 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png

We got a performance boost once again for a different file. I suppose it's because these two files exercise similar code paths inside the tool, but I cannot say more since I am not an SVG expert at all :)

However, there are cases showing that training on only one file is not sufficient. E.g., let's use this file for the benchmark (the training file remains the same as in the tests above):

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.415 s ±  0.003 s    [User: 1.040 s, System: 0.357 s]
  Range (min … max):    1.409 s …  1.421 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.439 s ±  0.004 s    [User: 1.055 s, System: 0.365 s]
  Range (min … max):    1.429 s …  1.445 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.488 s ±  0.002 s    [User: 1.107 s, System: 0.361 s]
  Range (min … max):    1.483 s …  1.491 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.497 s ±  0.002 s    [User: 1.116 s, System: 0.363 s]
  Range (min … max):    1.493 s …  1.502 s    15 runs

Summary
  taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png ran
    1.02 ± 0.00 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
    1.05 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
    1.06 ± 0.00 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png

Here we see a performance decrease from all optimizations: the optimized builds are 2% to 6% slower than the regular Release build (even with LTO alone, which is strange). It shows that the PGO training set should be wider.
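One way to widen the training set is to run the instrumented binary over a whole directory of SVG files before the optimize step. The following is only a sketch of that idea, not something measured in this report: the binary path, the training-svgs directory name, and the /tmp output location are hypothetical, and it prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -eu

# Assumption: location of the binary produced by `cargo pgo build`.
INSTRUMENTED="${INSTRUMENTED:-target/x86_64-unknown-linux-gnu/release/resvg}"

# Print a command and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# train_on_dir DIR: run the instrumented binary once per SVG file in DIR.
train_on_dir() {
    local dir="$1" svg
    for svg in "$dir"/*.svg; do
        [ -e "$svg" ] || continue   # skip the unexpanded glob when DIR has no SVG files
        run "$INSTRUMENTED" "$svg" "/tmp/$(basename "$svg" .svg).png"
    done
}

# Example (hypothetical directory name):
#   train_on_dir ./training-svgs
#   run cargo pgo optimize build
```

This assumes that cargo-pgo merges all profiles collected across the training runs when building the optimized binary, which matches my understanding of the tool but is worth verifying.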

Just for reference, I also measured the tool slowdown during the PGO and PLO training phases:

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png'
Benchmark 1: taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.670 s ±  0.062 s    [User: 3.397 s, System: 0.262 s]
  Range (min … max):    3.638 s …  3.891 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      4.593 s ±  0.010 s    [User: 4.223 s, System: 0.338 s]
  Range (min … max):    4.572 s …  4.610 s    15 runs

Summary
  taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png ran
    1.25 ± 0.02 times faster than taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png

where:

  • resvg_lto_instrumented - Release + LTO + PGO instrumentation
  • resvg_lto_bolt_instrumented - Release + LTO + PGO optimization + BOLT instrumentation

Also, I want to report the binary size changes (without stripping, which can influence the binary size a lot):

  • Release: 3.6 MiB
  • Release + LTO: 3.1 MiB
  • Release + LTO + PGO instrumentation: 7.8 MiB
  • Release + LTO + PGO optimization: 4.8 MiB
  • Release + LTO + PGO optimization + BOLT instrumentation: 20 MiB
  • Release + LTO + PGO optimization + BOLT optimization: 8.7 MiB
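To put the sizes in perspective relative to the regular Release build, the percent changes can be computed from the numbers in the list above. This is a small sketch, assuming awk is available; size_change is a helper name introduced only for this illustration:

```shell
# size_change BASE SIZE: percent change of SIZE relative to BASE, rounded to a whole percent.
size_change() {
    awk -v base="$1" -v size="$2" 'BEGIN { printf "%+.0f%%\n", (size / base - 1) * 100 }'
}

# Sizes from the list above, relative to the 3.6 regular Release build.
size_change 3.6 3.1   # Release + LTO                  -> -14%
size_change 3.6 4.8   # Release + LTO + PGO optimized  -> +33%
size_change 3.6 8.7   # ... + BOLT optimized           -> +142%
```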

Further steps

I can suggest the following action points:

  • Enable LTO. I expect in general performance boost "for free" and the binary size reduction.
  • Perform more PGO benchmarks with other datasets (if you are interested enough in it). If they show improvements, add a note to the documentation (the README file, I guess) about the possible performance gains for resvg from PGO.
  • You can probably get some insights into how the code can be optimized further from the changes the compiler makes with PGO, such as more aggressive inlining. This can be done by comparing flamegraphs before and after applying PGO.
  • Testing Post-Link Optimization techniques (like LLVM BOLT) with wider datasets would be interesting too (Clang and rustc already use BOLT in addition to PGO). However, I recommend starting with regular PGO since it's a much more stable technology with far fewer limitations.

I would be happy to answer your questions about PGO.

P.S. Please do not treat the issue like a bug or something like that - it's just a benchmark report. Since the "Discussions" functionality is disabled in this repo, I created the Issue instead.

@RazrFalcon
Collaborator

Oh wow, that's a much bigger improvement than I was expecting. Thanks for looking into it!

I will try to find time to learn cargo-pgo

Enable LTO. I expect in general performance boost "for free" and the binary size reduction.

Yep, will do in the next release.

Perform more PGO benchmarks with other datasets

The only dataset available in CI is the resvg test suite.

And I will probably add build instructions with a PGO section.
