Thiemo Wiedemeyer edited this page Feb 25, 2016 · 48 revisions

Benchmark

Benchmarking instructions

Setup:

  1. Report CPU and GPU models
  2. Report OS version (include kernel version if Linux), compiler version, and API versions (OpenGL/CUDA/OpenCL, if you can find them)
  3. Report the date of testing
  4. Build with -DENABLE_CXX11=ON -DENABLE_PROFILING=ON
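
Part of the setup report can be gathered automatically. The following is a minimal sketch, assuming a Python 3.7+ interpreter is available on the test machine; the helper names `first_line` and `collect_setup_report` are hypothetical, and the GPU model and API versions still have to be added by hand.

```python
import datetime
import platform
import shutil
import subprocess


def first_line(cmd):
    """Return the first output line of cmd, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    lines = (out.stdout or out.stderr).splitlines()
    return lines[0].strip() if lines else None


def collect_setup_report():
    """Collect the items from the setup list that the standard library can see."""
    return {
        "date": datetime.date.today().strftime("%b %d, %Y"),
        "os": platform.system(),
        "kernel": platform.release(),   # kernel release on Linux
        "machine": platform.machine(),
        "cpu": platform.processor() or None,
        # Assumes gcc is the compiler used for the build; adjust for others.
        "compiler": first_line(["gcc", "--version"]),
    }


if __name__ == "__main__":
    for key, value in collect_setup_report().items():
        print(f"{key}: {value}")
```

Entries that cannot be determined come back as None, so they are easy to spot and fill in manually.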

Test cases:

You can use export LOGFILE=/dev/null to suppress internal logging.

You can also use -frames 3000 to limit the number of frames in one test.

  • Linux

(Optionally, also watch top -d1 to report per-thread usage: press H for thread view, V for tree view, and I for Irix mode to report per-core usage.)

  1. CPU/TurboJPEG: LIBVA_DRIVER_NAME=none ./bin/Protonect -noviewer cpu
  2. OpenGL/TurboJPEG: LIBVA_DRIVER_NAME=none ./bin/Protonect -noviewer gl
  3. OpenCL/VAAPI: ./bin/Protonect -noviewer cl (maybe use -gpu=1 to select the GPU)
  4. CUDA/VAAPI: ./bin/Protonect -noviewer cuda
  5. CUDA/TegraJPEG: ./bin/Protonect -noviewer cuda
  • Windows
  1. CPU/TurboJPEG: .\install\bin\Protonect.exe -noviewer cpu
  2. OpenGL/TurboJPEG: .\install\bin\Protonect.exe -noviewer gl
  3. OpenCL/TurboJPEG: .\install\bin\Protonect.exe -noviewer cl (maybe use -gpu=1 to select the GPU)
  4. CUDA/TurboJPEG: .\install\bin\Protonect.exe -noviewer cuda
  • Mac OS X
  1. CPU/TurboJPEG: ./bin/Protonect -noviewer cpu
  2. OpenGL/VT: ./bin/Protonect -noviewer gl
  3. OpenCL/VT: ./bin/Protonect -noviewer cl
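
The Linux test cases above can be scripted so that every run uses the same environment and frame limit. This is only a sketch: `LINUX_CASES` and `build_run` are hypothetical names, the table copies the first four Linux entries, and placing `-frames` after the pipeline name is an assumption about argument order.

```python
import os

# (label, pipeline argument, extra environment), copied from the Linux list above.
LINUX_CASES = [
    ("CPU/TurboJPEG", "cpu", {"LIBVA_DRIVER_NAME": "none"}),
    ("OpenGL/TurboJPEG", "gl", {"LIBVA_DRIVER_NAME": "none"}),
    ("OpenCL/VAAPI", "cl", {}),
    ("CUDA/VAAPI", "cuda", {}),
]


def build_run(pipeline, extra_env, frames=3000, binary="./bin/Protonect"):
    """Return (argv, env) for one test case without starting it."""
    env = dict(os.environ)
    env["LOGFILE"] = "/dev/null"   # suppress internal logging
    env.update(extra_env)          # e.g. LIBVA_DRIVER_NAME=none
    argv = [binary, "-noviewer", pipeline, "-frames", str(frames)]
    return argv, env


if __name__ == "__main__":
    for label, pipeline, extra_env in LINUX_CASES:
        argv, env = build_run(pipeline, extra_env)
        print(label, " ".join(argv))
        # To actually run a case: subprocess.run(argv, env=env)
```

Keeping command construction separate from execution makes it easy to review the exact command lines before starting a long benchmark session.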

Raw data

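Columns: Configuration; Depth (min, 5%, median, 95%, max, mean, std); RGB (min, 5%, median, 95%, max, mean, std); per-thread per-core usage. Times are run times per frame in ms. Some rows omit the usage column or add n= after the Depth statistics.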

@ Feb 23, 2016: Intel i7-4770K, GTX 980Ti; Ubuntu 14.04, kernel 4.2.0-30, gcc 4.8.5 (wiedemeyer)
CPU/TurboJPEG  196.568 198.383 201.401 208.951 226.687 mean=201.931 std=3.89492  12.0908 12.3903 13.3141 14.3214 17.7562 mean=13.3479 std=0.678972  CPU:90% TurboJPEG:40% USB:6% Reg:3%  
Nvidia-OpenGL/TurboJPEG  3.2732 3.36939 3.67837 8.56402 35.9935 mean=5.27091 std=2.45964  12.2045 12.4162 13.4634 19.0064 28.3374 mean=14.2262 std=2.05367  OpenGL:11% TurboJPEG:43% USB:6% Reg:20%  
Nvidia-OpenCL/TurboJPEG  1.0699 1.08166 1.09989 1.15623 2.33229 mean=1.11208 std=0.0589829  12.1233 12.4032 13.7723 19.765 24.256 mean=14.6285 std=2.30791  OpenCL:3% TurboJPEG:43% USB:6% Reg:20%  
CUDA/TurboJPEG  0.840885 0.84678 0.857848 0.916531 1.63734 mean=0.86993 std=0.0300092  12.0679 12.3273 13.1907 14.2099 19.2813 mean=13.1633 std=0.633368  CUDA:3% TurboJPEG:43% USB:6% Reg:20%  
CPU/VAAPI  193.444 194.614 196.841 202.536 207.836 mean=197.321 std=2.37429  4.10754 4.21363 4.56406 5.62728 10.9242 mean=4.66903 std=0.462371  CPU:95% VAAPI:2% USB:5% Reg:3%  
Nvidia-OpenGL/VAAPI  3.21803 3.43304 3.60592 5.92615 40.4823 mean=3.79538 std=1.25804  4.14062 4.16862 4.33703 7.23047 10.0539 mean=4.81332 std=1.08225  OpenGL:11% VAAPI:2% USB:6% Reg:20%  
Nvidia-OpenCL/VAAPI  1.07179 1.07902 1.09182 1.14629 3.62958 mean=1.106 std=0.0905646  4.12983 4.16519 4.34095 5.65602 7.28449 mean=4.50889 std=0.517033  OpenCL:3% VAAPI:2% USB:6% Reg:20%  
CUDA/VAAPI  0.846309 0.850773 0.858518 0.912932 1.64333 mean=0.87033 std=0.032205  4.15287 4.17488 4.51934 6.75331 11.153 mean=4.84019 std=0.964148  CUDA:3% VAAPI:2% USB:6% Reg:20%  

@ Feb 22, 2016: Intel i7-4800MQ, Quadro K2100M; Windows 8.1, Visual Studio 2013, Intel OpenCL SDK 2016, CUDA 7.5 (xlz)
Intel-OpenGL/TurboJPEG 12.3286 12.4666 12.8285 13.4629 92.2784 mean=12.9065 std=1.47811 14.4425 14.5801 14.7143 14.8709 16.2421 mean=14.7265 std=0.11763
Nvidia-OpenGL/TurboJPEG 4.19632 4.49891 11.4372 11.7519 48.1245 mean=9.57569 std=3.10212 14.2703 14.36 14.4805 15.3769 21.5556 mean=14.5977 std=0.357192
CPU-OpenCL/TurboJPEG 9.69081 10.1245 10.2793 10.7054 18.7844 mean=10.3386 std=0.239183 14.8527 14.9876 15.5312 23.0857 24.0778 mean=16.9761 std=2.89156
Intel-OpenCL/TurboJPEG 3.9652 4.29706 4.52362 4.7555 5.37398 mean=4.5282 std=0.144457 14.3285 14.5087 14.685 14.8055 15.6749 mean=14.6764 std=0.0935275
Nvidia-OpenCL/TurboJPEG 7.43736 7.49856 7.52745 7.57079 14.9557 mean=7.534 std=0.139388 14.3045 14.3931 14.4965 14.6159 17.438 mean=14.5128 std=0.156611
CUDA/TurboJPEG 3.88461 3.904 3.91578 3.93213 4.81784 mean=3.91704 std=0.0205062 14.5436 15.2929 15.5483 15.8201 18.3921 mean=15.5582 std=0.202376

@ Feb 22, 2016: Intel i7-4800MQ, Quadro K2100M; Ubuntu 14.04, kernel 4.2.0-29, gcc 4.8.4, CUDA 7.5  (xlz)
CPU/TurboJPEG 178.611 181.418 184.926 198.933 249.969 mean=187.806 std=8.54019 13.5352 13.8274 14.908 16.5734 26.2598 mean=15.1921 std=1.82302
Intel-OpenGL/TurboJPEG 13.5375 14.7761 18.1625 18.6433 25.157 mean=17.8306 std=1.20715 13.5665 14.5667 15.9952 18.3494 24.3842 mean=16.1912 std=1.21878
Intel-OpenCL/VAAPI 9.48047 10.1894 10.8433 15.6502 25.3932 mean=11.8891 std=1.97348 4.02646 4.10556 4.30542 6.7557 9.98244 mean=4.85615 std=0.936518
CUDA/VAAPI 3.82277 4.05362 4.08011 4.11954 5.0675 mean=4.08193 std=0.0569049 4.04337 4.10738 4.43007 5.76721 15.7759 mean=4.75454 std=1.21339

@ Feb 22, 2016: Intel i7-4600U, Intel HD4400; Debian stretch, kernel 4.4.1, gcc 5.3.1  (xlz)
CPU/TurboJPEG 218.622 219.892 229.477 272.806 390.69 mean=233.894 std=18.9372 15.1118 15.6991 17.6618 25.1604 53.6918 mean=18.1084 std=3.157
OpenGL/TurboJPEG 14.4316 15.0638 15.8955 19.7889 28.9187 mean=16.432 std=1.71578 15.1342 16.22 18.9046 24.2028 64.0743 mean=19.2296 std=2.80734
OpenCL/VAAPI 12.2266 12.855 13.2127 13.8175 18.7401 mean=13.2687 std=0.342778 4.81057 4.89762 5.00952 5.38739 9.53695 mean=5.07463 std=0.196652

@ Feb 22, 2016: ARM Cortex-A15, Tegra K1; Ubuntu 14.04, kernel 3.10.40, gcc 4.8.4, CUDA 6.5  (xlz)
CPU/TurboJPEG 1160.53 1187.43 1194.16 1197.62 1270.47 mean=1193.79 std=8.513 n=116 30.8582 37.8927 38.0711 42.2842 54.9553 mean=38.5263 std=2.07462 
OpenGL/TurboJPEG 17.5932 18.2723 21.1319 23.3073 47.4586 mean=21.5441 std=1.61175 n=6324 41.1681 45.3502 45.8704 48.3545 52.594 mean=46.3226 std=1.05828
CUDA/TegraJPEG 9.67744 10.173 10.6871 11.4994 12.7231 mean=10.7886 std=0.478002 n=9013 11.7575 11.8062 11.8787 12.0801 18.6544 mean=11.8949 std=0.147521

@ Feb 19, 2016: Intel i7-4770K, GTX 980Ti; Ubuntu 14.04, kernel 4.2.0-29, gcc 4.8.5 (wiedemeyer)
CPU/TurboJPEG  194.328 196.617 200.837 212.289 225.055 mean=201.911 std=4.65475  12.0173 12.2656 13.1827 19.4198 22.4747 mean=13.6461 std=1.78962  CPU:90% TurboJPEG:40% USB:5% Reg:3%  
Nvidia-OpenGL/TurboJPEG  3.22797 3.36801 8.02571 9.06995 108.705 mean=7.09752 std=2.96411  11.9735 12.2769 13.5689 19.4342 28.3156 mean=14.3881 std=2.23828  OpenGL:26% TurboJPEG:44% USB:6% Reg:20%  
Nvidia-OpenCL/VAAPI  1.07144 1.08136 1.0924 1.145 2.46014 mean=1.1035 std=0.0599953  4.14765 4.1658 4.66865 7.72171 11.1335 mean=4.98485 std=1.18519  OpenCL:3% VAAPI:2% USB:6% Reg:18%  
CUDA/VAAPI  0.857415 0.861542 0.868286 0.924719 3.31855 mean=0.882014 std=0.0699696  4.12401 4.14701 4.6825 10.9971 11.2794 mean=5.18491 std=1.60745  CUDA:5% VAAPI:2% USB:6% Reg:22%  

@ Feb 18, 2016: Intel i7-4800MQ, Quadro K2100M; Windows 8.1, Visual Studio 2013, Intel OpenCL SDK 2016, CUDA 7.5  (xlz) 
Intel-OpenGL/TurboJPEG  12.4188 12.5586 12.878 13.6519 87.9265 mean=13.0355 std=1.60253  12.7974 14.1773 14.3069 15.4161 24.7421 mean=14.4687 std=0.733611  N/A  
Nvidia-OpenGL/TurboJPEG  4.1188 4.42405 11.1658 12.0736 40.8151 mean=9.3866 std=3.08662  13.9013 14.0636 14.1788 14.7703 25.0648 mean=14.2977 std=0.620528  N/A  
Nvidia-OpenCL/TurboJPEG  9.54604 9.67225 9.86232 9.96686 14.1282 mean=9.8599 std=0.129916  13.9557 14.0743 14.1716 14.2906 18.5743 mean=14.183 std=0.127545  N/A  
Intel-OpenCL/TurboJPEG  4.45599 4.85779 5.3706 5.76746 6.92993 mean=5.35864 std=0.278621  14.0522 14.178 14.3122 14.5418 16.2471 mean=14.3355 std=0.161894  N/A  
CPU-OpenCL/TurboJPEG  9.27766 10.114 10.4177 10.859 17.6673 mean=10.409 std=0.276858  14.4666 14.6331 15.1691 22.6319 24.5121 mean=16.1868 std=2.52044  N/A  
CUDA/TurboJPEG  3.85118 3.89224 3.90669 3.9542 5.60971 mean=3.91251 std=0.0374401  14.0438 14.0986 14.1818 14.4149 25.0375 mean=14.3003 std=0.616492  N/A  

@ Feb 17, 2016: Intel i7-4600U, Intel HD4400; Debian stretch, kernel 4.4.1, gcc 5.3.1  (xlz)
CPU/TurboJPEG  211.717 222.087 233.171 256.851 304.558 mean=234.497 std=12.0616  15.7237 15.8093 16.5118 20.6042 37.9908 mean=17.2682 std=1.97223  CPU:95% TurboJPEG:50% USB:10% Reg:3%  
OpenGL/TurboJPEG  14.2609 14.8663 21.6813 23.0952 37.1771 mean=20.4175 std=2.95671  15.2525 16.8032 19.4003 22.8167 41.6874 mean=19.4453 std=2.10631  OpenGL:17% TurboJPEG:60% USB:20% Reg:16%  
Intel-OpenCL/VAAPI  12.9236 13.5946 14.1522 16.4632 29.1926 mean=14.4144 std=1.05776  4.81327 4.8892 4.99418 5.45149 11.5202 mean=5.08095 std=0.298308  OpenCL:6% VAAPI:3% USB:15% Reg:15%  

@ Feb 17, 2016: ARM Cortex-A15, Tegra K1; Ubuntu 14.04, kernel 3.10.40, gcc 4.8.4, CUDA 6.5  (xlz)
CPU/TurboJPEG  1196.93 1225.1 1232.61 1319.89 1356.19 mean=1242.61 std=30.2808  31.6025 38.3982 38.6873 43.1643 55.0731 mean=39.4813 std=2.26751  CPU:98% TurboJPEG:60% USB:36% Reg:3%  
OpenGL/TurboJPEG  17.2772 20.1502 21.9076 23.6485 49.9671 mean=21.702 std=1.41032  41.5806 46.2578 47.2497 50.1534 59.9174 mean=47.5074 std=1.48139  OpenGL:47% TurboJPEG:65% USB:60% Reg:64%  
CUDA/TegraJPEG  9.59201 10.1711 10.7408 11.5411 20.2425 mean=10.8238 std=0.529091  11.8931 11.962 12.1092 12.3543 20.1383 mean=12.256 std=0.846912  CUDA:4% TegraJPEG:4% USB:59% Reg:76%  

@ Feb 17, 2016: Intel i7-4800MQ, Quadro K2100M; Ubuntu 14.04, kernel 4.2.0-29, gcc 4.8.4, CUDA 7.5  (xlz)
CPU/TurboJPEG  177.368 178.725 184.239 232.401 237.202 mean=192.605 std=17.612  13.5816 13.9677 14.6258 22.746 24.4688 mean=15.3834 std=2.26277  CPU:91% TurboJPEG:45% USB:7% Reg:2%  
Intel-OpenGL/TurboJPEG  8.55666 13.9514 16.1583 18.1522 26.7371 mean=16.1974 std=1.87477  13.583 13.6906 14.8395 16.6675 24.4041 mean=14.887 std=1.2095  OpenGL:9% TurboJPEG:45% USB:9% Reg:12%  
Intel-OpenCL/VAAPI  9.70148 10.455 11.9606 16.42 23.1066 mean=12.6258 std=2.05783  4.03962 4.10536 4.64751 6.73393 10.8338 mean=4.99849 std=0.907321  OpenCL:4% VAAPI:2% USB:9% Reg:13%  
CUDA/VAAPI  3.81637 4.03557 4.06498 4.10873 7.7775 mean=4.07313 std=0.101962  4.04017 4.09998 4.5888 8.64204 16.0589 mean=5.15824 std=1.53683  CUDA:15% VAAPI:2% USB:9% Reg:15%  
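
The result rows above follow a regular whitespace-separated layout, so they can be parsed for further analysis. The sketch below assumes that layout (five bare numbers, then `key=value` statistics, once for Depth and once for RGB, then optional `Name:NN%` usage entries); `parse_row` and `FIELDS` are hypothetical names, and header lines starting with `@` must be skipped by the caller.

```python
FIELDS = ("min", "p5", "median", "p95", "max")


def parse_row(line):
    """Parse one result row into config name, depth stats, RGB stats and usage."""
    tokens = line.split()
    config, rest = tokens[0], tokens[1:]
    groups, usage = [], {}
    for tok in rest:
        if "=" in tok:                       # mean=..., std=..., optional n=...
            key, value = tok.split("=", 1)
            groups[-1][key] = float(value)
        elif ":" in tok:                     # usage entry such as CPU:90%
            key, value = tok.split(":", 1)
            usage[key] = float(value.rstrip("%"))
        elif tok == "N/A":                   # usage not reported
            continue
        else:                                # bare number: one of the five quantiles
            if not groups or "mean" in groups[-1]:
                groups.append({"values": []})
            groups[-1]["values"].append(float(tok))
    if len(groups) != 2:
        raise ValueError("expected Depth and RGB statistics in: " + line)

    def finish(group):
        out = dict(zip(FIELDS, group["values"]))
        out.update({k: v for k, v in group.items() if k != "values"})
        return out

    return {"config": config, "depth": finish(groups[0]),
            "rgb": finish(groups[1]), "usage": usage}


if __name__ == "__main__":
    sample = ("CUDA/TurboJPEG  0.840885 0.84678 0.857848 0.916531 1.63734 "
              "mean=0.86993 std=0.0300092  12.0679 12.3273 13.1907 14.2099 19.2813 "
              "mean=13.1633 std=0.633368  CUDA:3% TurboJPEG:43% USB:6% Reg:20%")
    row = parse_row(sample)
    print(row["config"], row["depth"]["median"], row["rgb"]["median"])
```

Because statistics are matched by key rather than by column position, rows that carry an extra `n=` entry parse the same way as rows without it.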

gnuplot:
set term pdfcairo enhanced size 6.5,4.333 font "Helvetica Neue"
set output "benchmark.pdf"
set xrange [0:53]
set yrange [24.5:-1]
set key at 52,1
set ytics out scale 0 offset 0,-0.2 right
set xtics ("" 0,"16.7 (60Hz)" 16.667, "33.3 (30Hz)" 33.333)
set style fill solid 0.2 noborder
set grid xtics
set title "{/=20 Run time per frame (ms)}"
Hz(ms) = sprintf("%.0fHz", 1000.0/ms)

plot "/tmp/1" u (stringcolumn(1) eq "@"?0:1/0):($0-0.1):(stringcolumn(2).' '.stringcolumn(3).' '.stringcolumn(4).' '.stringcolumn(5).' '.stringcolumn(6).' '.stringcolumn(7).' '.stringcolumn(8).' '.stringcolumn(9).' '.stringcolumn(10)) with labels offset 0.2,0 left notitle, \
           "" u (stringcolumn(1) eq "@"?1/0:0):($0-0.5):(0):4:($0-0.5-0.2):($0-0.5+0.2):ytic(1) with boxxyerrorbars lc 1 title "Depth median", \
           "" u   4:($0-0.5):3:5:($0-0.5-0.1):($0-0.5+0.1) with boxxyerrorbars lc 1 fillstyle solid 0.4 title "Depth 5-95% percentile", \
           "" u   (stringcolumn(1) eq "@"?1/0:0):($0-0.5):(Hz($4)) with labels left offset 0.2,0 font "Helvetica Neue,9" notitle, \
           "" u (stringcolumn(1) eq "@"?1/0:0):($0-0.1):(0):11:($0-0.1-0.2):($0-0.1+0.2) with boxxyerrorbars lc 3 title "RGB median", \
           "" u  11:($0-0.1):10:12:($0-0.1-0.1):($0-0.1+0.1) with boxxyerrorbars lc 3 fillstyle solid 0.4 title "RGB 5-95% percentile", \
           "" u   (stringcolumn(1) eq "@"?1/0:0):($0-0.1):(Hz($11)) with labels left offset 0.2,0 font "Helvetica Neue,9" notitle

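The `Hz(ms)` helper in the gnuplot script converts a per-frame run time into a frame-rate label. A small Python sketch of the same arithmetic follows, together with a check against the 30Hz tick marked on the x axis; `hz_label` and `meets_rate` are hypothetical names.

```python
def hz_label(ms):
    """Mirror the gnuplot helper: sprintf("%.0fHz", 1000.0/ms)."""
    return "%.0fHz" % (1000.0 / ms)


def meets_rate(ms, rate_hz=30.0):
    """True if a per-frame time of ms milliseconds fits into one frame period."""
    return ms <= 1000.0 / rate_hz


if __name__ == "__main__":
    # Depth medians taken from the Feb 23, 2016 rows above.
    for name, median_ms in [("CPU/TurboJPEG", 201.401), ("CUDA/TurboJPEG", 0.857848)]:
        print(name, hz_label(median_ms), meets_rate(median_ms))
```

A stage is only fast enough for a given rate if its per-frame time stays below the frame period, which is 1000/30 ≈ 33.3 ms at 30Hz.
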
Platform acceleration of JPEG decoding

  • VA-API (Intel, Linux): Good
  • Intel Media SDK (Intel, Windows): Possible to implement. The relevant component is mfx_mft_mjpgvd_64.dll (91CD2D6E-897B-4FA1-B0D7-51DC88010E0A, Intel Hardware M-JPEG decoder MFT); it's probably an abstraction over DXVA/D3D11.
  • VDPAU (Nvidia): No. Does not support JPEG at all.
  • Tegra: Among all of Nvidia's products, only Tegra has a hardware JPEG decoder. (A separate Tegra libjpeg decoder is being worked on.)
  • AMD implements a JPEG decoder with OpenCL, but we don't want it to compete with depth decoding for resources. (I evaluated GPUJPEG, and it was not good.)
  • Samsung's Exynos4 provides a JPEG codec via V4L2, but this is for mobile devices.
  • I looked at mpv and ffmpeg. They have no hardware acceleration for JPEG at all.
  • Chromium uses VAAPI and V4L2.
  • On Mac, a new decoder is provided by @fran6co. (@fran6co: The Mac decoder is not hardware accelerated yet; if they ever decide to do it, my implementation is going to have it.)