Add HIP backend #2008

Merged (40 commits) on Aug 18, 2023

athas (Member) commented Aug 11, 2023

No description provided.

athas changed the title from "First draft of a HIP backend." to "Add HIP backend" on Aug 11, 2023
athas marked this pull request as ready for review on August 17, 2023, 11:11
athas added the run-benchmarks label ("Makes GA run the benchmark suite.") on Aug 17, 2023
athas (Member, Author) commented Aug 17, 2023

The entire test suite and benchmark suite now work with the HIP backend. (The CI errors are unrelated to the backend: some of the NVIDIA GPUs in the cluster are unavailable.) Performance relative to the OpenCL backend on the MI100 GPU is as follows:

futhark-benchmarks/accelerate/canny/canny.fut
  data/lena256.in:                                                      1.31x
  data/lena512.in:                                                      1.29x

futhark-benchmarks/accelerate/crystal/crystal.fut
  #0 ("200i32 30.0f32 5i32 1i32 1.0f32"):                               1.56x
  #4 ("2000i32 30.0f32 50i32 1i32 1.0f32"):                             1.03x
  #5 ("4000i32 30.0f32 50i32 1i32 1.0f32"):                             1.01x

futhark-benchmarks/accelerate/fft/fft.fut
  data/1024x1024.in:                                                    3.08x (mem: 0.44x@device)
  data/128x128.in:                                                      1.10x (mem: 0.97x@device)
  data/128x512.in:                                                      4.86x (mem: 0.67x@device)
  data/256x256.in:                                                      1.16x
  data/512x512.in:                                                     10.56x (mem: 0.49x@device)
  data/64x256.in:                                                       1.08x (mem: 0.99x@device)

futhark-benchmarks/accelerate/fluid/fluid.fut
  benchmarking/medium.in:                                               1.07x

futhark-benchmarks/accelerate/hashcat/hashcat.fut
  rockyou.dataset:                                                      0.93x

futhark-benchmarks/accelerate/kmeans/kmeans.fut
  data/k5_n200000.in:                                                   0.65x (mem: 0.96x@device)
  data/k5_n50000.in:                                                    0.74x (mem: 0.96x@device)
  data/trivial.in:                                                      0.87x

futhark-benchmarks/accelerate/mandelbrot/mandelbrot.fut
  #0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."):         1.08x
  #1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.90x
  #2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.72x
  #3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.71x
  #4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.70x

futhark-benchmarks/accelerate/nbody/nbody-bh.fut
  data/1000-bodies.in:                                                  0.75x
  data/10000-bodies.in:                                                 0.78x
  data/100000-bodies.in:                                                0.85x

futhark-benchmarks/accelerate/nbody/nbody.fut
  data/1000-bodies.in:                                                  0.60x (mem: 1.09x@device)
  data/10000-bodies.in:                                                 1.11x
  data/100000-bodies.in:                                                0.99x

futhark-benchmarks/accelerate/pagerank/pagerank.fut
  data/random_medium.in:                                                1.27x
  data/small.in:                                                        1.22x (mem: 1.04x@device)

futhark-benchmarks/accelerate/ray/trace.fut
  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."):         1.51x

futhark-benchmarks/accelerate/smoothlife/smoothlife.fut
  #0 ("128i32"):                                                        0.85x
  #1 ("256i32"):                                                        0.91x
  #2 ("512i32"):                                                        7.92x (mem: 0.72x@device)
  #3 ("1024i32"):                                                       2.87x (mem: 0.51x@device)

futhark-benchmarks/accelerate/tunnel/tunnel.fut
  #0 ("10.0f32 800i32 600i32"):                                         0.47x
  #1 ("10.0f32 1000i32 1000i32"):                                       0.46x
  #2 ("10.0f32 2000i32 2000i32"):                                       0.49x
  #3 ("10.0f32 4000i32 4000i32"):                                       0.49x
  #4 ("10.0f32 8000i32 8000i32"):                                       0.49x

futhark-benchmarks/babelstream/babelstream.fut:f32_add
  [33554432]f32 [33554432]f32:                                          1.15x

futhark-benchmarks/babelstream/babelstream.fut:f32_copy
  [33554432]f32:                                                        1.05x

futhark-benchmarks/babelstream/babelstream.fut:f32_dot
  [33554432]f32 [33554432]f32:                                          0.97x

futhark-benchmarks/babelstream/babelstream.fut:f32_mul
  [33554432]f32:                                                        1.08x

futhark-benchmarks/babelstream/babelstream.fut:f32_nstream
  [33554432]f32 [33554432]f32 [33554432]f32:                            3.25x

futhark-benchmarks/babelstream/babelstream.fut:f32_triad
  [33554432]f32 [33554432]f32:                                          1.14x

futhark-benchmarks/babelstream/babelstream.fut:f64_add
  [33554432]f64 [33554432]f64:                                          1.07x

futhark-benchmarks/babelstream/babelstream.fut:f64_copy
  [33554432]f64:                                                        1.09x

futhark-benchmarks/babelstream/babelstream.fut:f64_dot
  [33554432]f64 [33554432]f64:                                          1.03x

futhark-benchmarks/babelstream/babelstream.fut:f64_mul
  [33554432]f64:                                                        1.10x

futhark-benchmarks/babelstream/babelstream.fut:f64_nstream
  [33554432]f64 [33554432]f64 [33554432]f64:                            1.64x

futhark-benchmarks/babelstream/babelstream.fut:f64_triad
  [33554432]f64 [33554432]f64:                                          1.06x

futhark-benchmarks/finpar/LocVolCalib.fut
  LocVolCalib-data/large.in:                                            0.76x
  LocVolCalib-data/medium.in:                                           0.79x
  LocVolCalib-data/small.in:                                            0.83x

futhark-benchmarks/finpar/OptionPricing.fut
  OptionPricing-data/large.in:                                          1.03x (mem: 1.04x@device)
  OptionPricing-data/medium.in:                                         1.23x (mem: 1.23x@device)
  OptionPricing-data/small.in:                                          1.30x (mem: 1.04x@device)

futhark-benchmarks/jgf/crypt/crypt.fut
  crypt-data/medium.in:                                                 1.08x

futhark-benchmarks/jgf/crypt/keys.fut
  crypt-data/userkey0.txt:                                              0.54x

futhark-benchmarks/jgf/series/series.fut
  data/10000.in:                                                        1.03x
  data/100000.in:                                                       1.01x
  data/1000000.in:                                                      1.01x

futhark-benchmarks/micro/intra.fut:scan_reduce
  [1000000][6]f32:                                                      2.14x

futhark-benchmarks/micro/mmm/lud-internal.fut
  [128][32][32]f32 [128][32][32]f32 [128][128][32][32]f32:              1.21x

futhark-benchmarks/micro/mmm/mmm-batch.fut
  [64][128][32][32]f32 [64][128][32][32]f32:                            1.13x

futhark-benchmarks/micro/mmm/mmm.fut
  #2 ("[[1.0f32, 2.0f32, 3.0f32], [3.0f32, 4.0f32, 5.0f32..."):         1.58x
  [1024][1024]f32 [1024][1024]f32:                                      1.33x
  [2048][4096]f32 [4096][2048]f32:                                      1.44x

futhark-benchmarks/micro/mmm/sgemm.fut
  #0 ("2.0f32 3.0f32 [[1.0f32, 2.0f32, 3.0f32], [3.0f32, ..."):         1.53x
  f32 f32 [1024][1024]f32 [1024][1024]f32 [1024][1024]f32:              1.31x
  f32 f32 [2048][4096]f32 [4096][2048]f32 [2048][2048]f32:              1.43x

futhark-benchmarks/micro/reduce-segmented.fut:prod_mat4_i32
  10000000i32 1i32 [10000000]i32 [10000000]i32 [...]:                   1.02x
  1000000i32 10i32 [10000000]i32 [10000000]i32 [...]:                   0.79x
  100000i32 100i32 [10000000]i32 [10000000]i32 [...]:                   1.03x
  10000i32 1000i32 [10000000]i32 [10000000]i32 [...]:                   0.90x
  1000i32 10000i32 [10000000]i32 [10000000]i32 [...]:                   0.88x
  100i32 100000i32 [10000000]i32 [10000000]i32 [...]:                   1.05x
  10i32 1000000i32 [10000000]i32 [10000000]i32 [...]:                   0.72x
  1i32 10000000i32 [10000000]i32 [10000000]i32 [...]:                   0.73x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i16
  100000000i32 1i32 [100000000]i16:                                     1.05x
  10000000i32 10i32 [100000000]i16:                                     0.70x
  1000000i32 100i32 [100000000]i16:                                     0.75x
  100000i32 1000i32 [100000000]i16:                                     0.85x
  10000i32 10000i32 [100000000]i16:                                     1.41x
  1000i32 100000i32 [100000000]i16:                                     1.91x
  100i32 1000000i32 [100000000]i16:                                     2.30x
  10i32 10000000i32 [100000000]i16:                                     1.40x
  1i32 100000000i32 [100000000]i16:                                     1.32x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i32
  100000000i32 1i32 [100000000]i32:                                     1.03x
  10000000i32 10i32 [100000000]i32:                                     0.75x
  1000000i32 100i32 [100000000]i32:                                     0.97x
  100000i32 1000i32 [100000000]i32:                                     1.03x
  10000i32 10000i32 [100000000]i32:                                     1.16x
  1000i32 100000i32 [100000000]i32:                                     1.45x
  100i32 1000000i32 [100000000]i32:                                     1.79x
  10i32 10000000i32 [100000000]i32:                                     1.16x
  1i32 100000000i32 [100000000]i32:                                     1.14x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i64
  100000000i32 1i32 [100000000]i64:                                     1.01x
  10000000i32 10i32 [100000000]i64:                                     1.00x
  1000000i32 100i32 [100000000]i64:                                     1.02x
  100000i32 1000i32 [100000000]i64:                                     1.04x
  10000i32 10000i32 [100000000]i64:                                     1.07x
  1000i32 100000i32 [100000000]i64:                                     1.16x
  100i32 1000000i32 [100000000]i64:                                     1.43x
  10i32 10000000i32 [100000000]i64:                                     1.10x
  1i32 100000000i32 [100000000]i64:                                     1.08x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i8
  100000000i32 1i32 [100000000]i8:                                      1.04x
  10000000i32 10i32 [100000000]i8:                                      0.69x
  1000000i32 100i32 [100000000]i8:                                      0.75x
  100000i32 1000i32 [100000000]i8:                                      0.80x
  10000i32 10000i32 [100000000]i8:                                      1.43x
  1000i32 100000i32 [100000000]i8:                                      1.97x
  100i32 1000000i32 [100000000]i8:                                      2.36x
  10i32 10000000i32 [100000000]i8:                                      1.28x
  1i32 100000000i32 [100000000]i8:                                      1.35x

futhark-benchmarks/micro/reduce-segmented.fut:sum_iota_i32
  100000000i32 1i32:                                                    1.11x
  10000000i32 10i32:                                                    1.30x
  1000000i32 100i32:                                                    1.73x
  100000i32 1000i32:                                                    2.14x
  10000i32 10000i32:                                                    2.09x
  1000i32 100000i32:                                                    2.06x
  100i32 1000000i32:                                                    2.13x
  10i32 10000000i32:                                                    2.06x
  1i32 100000000i32:                                                    2.12x

futhark-benchmarks/micro/reduce.fut:lss_f32
  [10000000]i32:                                                        0.80x
  [1000000]i32:                                                         0.89x
  [100000]i32:                                                          1.12x (mem: 0.40x@device)
  [10000]i32:                                                           1.29x

futhark-benchmarks/micro/reduce.fut:lss_f64
  [10000000]i32:                                                        0.79x
  [1000000]i32:                                                         0.83x
  [100000]i32:                                                          1.02x (mem: 0.44x@device)
  [10000]i32:                                                           1.21x

futhark-benchmarks/micro/reduce.fut:lss_i32
  [10000000]i32:                                                        0.82x
  [1000000]i32:                                                         0.89x
  [100000]i32:                                                          1.06x (mem: 0.40x@device)
  [10000]i32:                                                           1.38x

futhark-benchmarks/micro/reduce.fut:lss_i8
  [10000000]i32:                                                        0.84x
  [1000000]i32:                                                         0.93x
  [100000]i32:                                                          1.03x (mem: 0.65x@device)
  [10000]i32:                                                           1.21x

futhark-benchmarks/micro/reduce.fut:lss_iota_f32
  #0 ("10000i32"):                                                      1.13x
  #1 ("100000i32"):                                                     0.96x
  #2 ("1000000i32"):                                                    0.85x
  #3 ("10000000i32"):                                                   0.83x
  #4 ("100000000i32"):                                                  0.76x

futhark-benchmarks/micro/reduce.fut:lss_iota_f64
  #0 ("10000i32"):                                                      1.15x
  #1 ("100000i32"):                                                     0.93x
  #2 ("1000000i32"):                                                    0.85x
  #3 ("10000000i32"):                                                   0.82x
  #4 ("100000000i32"):                                                  0.74x

futhark-benchmarks/micro/reduce.fut:lss_iota_i32
  #0 ("10000i32"):                                                      1.15x
  #1 ("100000i32"):                                                     0.96x
  #2 ("1000000i32"):                                                    0.86x
  #3 ("10000000i32"):                                                   0.80x
  #4 ("100000000i32"):                                                  0.75x

futhark-benchmarks/micro/reduce.fut:lss_iota_i8
  #0 ("10000i32"):                                                      1.12x
  #1 ("100000i32"):                                                     1.06x
  #2 ("1000000i32"):                                                    0.88x
  #3 ("10000000i32"):                                                   0.84x
  #4 ("100000000i32"):                                                  0.76x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_f32
  #0 ("10000i32"):                                                      1.48x
  #1 ("100000i32"):                                                     1.23x
  #2 ("1000000i32"):                                                    1.15x
  #3 ("10000000i32"):                                                   0.93x
  #4 ("100000000i32"):                                                  0.84x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_f64
  #0 ("10000i32"):                                                      1.46x
  #1 ("100000i32"):                                                     1.36x
  #2 ("1000000i32"):                                                    0.98x
  #3 ("10000000i32"):                                                   0.73x
  #4 ("100000000i32"):                                                  0.61x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_i32
  #0 ("10000i32"):                                                      1.42x
  #1 ("100000i32"):                                                     1.31x
  #2 ("1000000i32"):                                                    1.16x
  #3 ("10000000i32"):                                                   0.89x
  #4 ("100000000i32"):                                                  0.78x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_i8
  #0 ("10000i32"):                                                      1.41x
  #1 ("100000i32"):                                                     1.36x
  #2 ("1000000i32"):                                                    1.23x
  #3 ("10000000i32"):                                                   0.93x
  #4 ("100000000i32"):                                                  0.85x

futhark-benchmarks/micro/reduce.fut:prod_mat4_f32
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.90x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.24x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.42x (mem: 0.99x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.56x

futhark-benchmarks/micro/reduce.fut:prod_mat4_f64
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.75x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.06x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.37x (mem: 0.99x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.50x

futhark-benchmarks/micro/reduce.fut:prod_mat4_i32
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.87x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.21x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.40x (mem: 0.99x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.48x

futhark-benchmarks/micro/reduce.fut:prod_mat4_i8
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.91x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.28x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.51x
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.63x

futhark-benchmarks/micro/reduce.fut:sum_f32
  [100000000]i32:                                                       1.00x
  [10000000]i32:                                                        0.93x
  [1000000]i32:                                                         0.88x
  [100000]i32:                                                          1.06x
  [10000]i32:                                                           1.14x

futhark-benchmarks/micro/reduce.fut:sum_f64
  [100000000]i32:                                                       1.02x
  [10000000]i32:                                                        0.91x
  [1000000]i32:                                                         0.88x
  [100000]i32:                                                          1.00x
  [10000]i32:                                                           1.18x

futhark-benchmarks/micro/reduce.fut:sum_i32
  [100000000]i32:                                                       1.01x
  [10000000]i32:                                                        0.92x
  [1000000]i32:                                                         0.92x
  [100000]i32:                                                          1.03x
  [10000]i32:                                                           1.12x

futhark-benchmarks/micro/reduce.fut:sum_i8
  [100000000]i32:                                                       0.99x
  [10000000]i32:                                                        0.93x
  [1000000]i32:                                                         0.93x
  [100000]i32:                                                          1.09x
  [10000]i32:                                                           1.15x

futhark-benchmarks/micro/reduce.fut:sum_iota_f32
  #0 ("10000i32"):                                                      1.07x
  #1 ("100000i32"):                                                     1.04x
  #2 ("1000000i32"):                                                    0.83x
  #3 ("10000000i32"):                                                   0.93x
  #4 ("100000000i32"):                                                  1.07x

futhark-benchmarks/micro/reduce.fut:sum_iota_f64
  #0 ("10000i32"):                                                      1.15x
  #1 ("100000i32"):                                                     1.01x
  #2 ("1000000i32"):                                                    0.81x
  #3 ("10000000i32"):                                                   0.91x
  #4 ("100000000i32"):                                                  1.05x

futhark-benchmarks/micro/reduce.fut:sum_iota_i32
  #0 ("10000i32"):                                                      1.17x
  #1 ("100000i32"):                                                     1.03x
  #2 ("1000000i32"):                                                    0.79x
  #3 ("10000000i32"):                                                   0.80x
  #4 ("100000000i32"):                                                  0.83x

futhark-benchmarks/micro/reduce.fut:sum_iota_i8
  #0 ("10000i32"):                                                      1.16x
  #1 ("100000i32"):                                                     1.02x
  #2 ("1000000i32"):                                                    0.84x
  #3 ("10000000i32"):                                                   0.85x
  #4 ("100000000i32"):                                                  0.74x

futhark-benchmarks/micro/reduce.fut:sum_scaled_f32
  [100000000]i32:                                                       1.04x
  [10000000]i32:                                                        0.97x
  [1000000]i32:                                                         0.96x
  [100000]i32:                                                          1.09x
  [10000]i32:                                                           1.20x (mem: 1.67x@device)

futhark-benchmarks/micro/reduce.fut:sum_scaled_f64
  [100000000]i32:                                                       1.03x
  [10000000]i32:                                                        0.99x
  [1000000]i32:                                                         0.99x
  [100000]i32:                                                          1.02x
  [10000]i32:                                                           1.14x (mem: 1.03x@device)

futhark-benchmarks/micro/reduce.fut:sum_scaled_i32
  [100000000]i32:                                                       1.04x
  [10000000]i32:                                                        0.99x
  [1000000]i32:                                                         1.00x
  [100000]i32:                                                          1.08x
  [10000]i32:                                                           1.18x (mem: 1.67x@device)

futhark-benchmarks/micro/reduce.fut:sum_scaled_i8
  [100000000]i32:                                                       1.02x
  [10000000]i32:                                                        1.02x
  [1000000]i32:                                                         1.02x
  [100000]i32:                                                          1.05x
  [10000]i32:                                                           1.23x

futhark-benchmarks/micro/reduce_by_index-segmented.fut:sum_i32
  100000i32 [256][4000]i32 [256][4000]i32:                              1.11x
  1000i32 [256][4000]i32 [256][4000]i32:                                1.12x (mem: 0.92x@device)
  10i32 [256][4000]i32 [256][4000]i32:                                  1.14x

futhark-benchmarks/micro/reduce_by_index.fut:absmax_i32
  100000i32 [1000000]i32 [1000000]i32:                                  0.73x (mem: 1.30x@device)
  10000i32 [1000000]i32 [1000000]i32:                                   0.80x (mem: 1.14x@device)
  1000i32 [1000000]i32 [1000000]i32:                                    0.81x (mem: 1.14x@device)
  100i32 [1000000]i32 [1000000]i32:                                     0.95x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]i32:                                      0.86x

futhark-benchmarks/micro/reduce_by_index.fut:sum_f32
  100000i32 [1000000]i32 [1000000]f32:                                  1.18x (mem: 0.85x@device)
  10000i32 [1000000]i32 [1000000]f32:                                   0.47x (mem: 0.72x@device)
  1000i32 [1000000]i32 [1000000]f32:                                    0.02x (mem: 0.72x@device)
  100i32 [1000000]i32 [1000000]f32:                                     0.87x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]f32:                                      0.80x

futhark-benchmarks/micro/reduce_by_index.fut:sum_i32
  100000i32 [1000000]i32 [1000000]i32:                                  0.81x
  10000i32 [1000000]i32 [1000000]i32:                                   0.91x
  1000i32 [1000000]i32 [1000000]i32:                                    0.71x
  100i32 [1000000]i32 [1000000]i32:                                     0.92x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]i32:                                      0.89x

futhark-benchmarks/micro/reduce_by_index.fut:sum_i32_f32
  100000i32 [1000000]i32 [1000000]i32 [1000000]f32:                     0.96x (mem: 0.81x@device)
  10000i32 [1000000]i32 [1000000]i32 [1000000]f32:                      0.49x (mem: 0.72x@device)
  1000i32 [1000000]i32 [1000000]i32 [1000000]f32:                       0.02x (mem: 0.71x@device)
  100i32 [1000000]i32 [1000000]i32 [1000000]f32:                        0.75x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]i32 [1000000]f32:                         0.63x

futhark-benchmarks/micro/reduce_by_index.fut:sum_vec_i32
  10000i32 [10000]i32 [1000000]i32:                                     1.11x
  10000i32 [1000]i32 [1000000]i32:                                      1.09x
  10i32 [10000]i32 [1000000]i32:                                        1.13x (mem: 0.98x@device)
  10i32 [1000]i32 [1000000]i32:                                         1.12x (mem: 0.98x@device)

futhark-benchmarks/micro/scan-segmented.fut:sum_i32
  [10000000][1]i32:                                                     5.34x
  [1000000][10]i32:                                                     4.94x
  [100000][100]i32:                                                     4.86x
  [10000][1000]i32:                                                     4.86x
  [1000][10000]i32:                                                     4.72x
  [100][100000]i32:                                                     4.51x
  [10][1000000]i32:                                                     4.51x
  [1][10000000]i32:                                                     4.48x

futhark-benchmarks/micro/scan-segmented.fut:sum_iota_i32
  #0 ("1i32 10000000i32"):                                              3.68x
  #1 ("10i32 1000000i32"):                                              3.72x
  #2 ("100i32 100000i32"):                                              3.84x
  #3 ("1000i32 10000i32"):                                              3.99x
  #4 ("10000i32 1000i32"):                                              4.12x
  #5 ("100000i32 100i32"):                                              4.13x
  #6 ("1000000i32 10i32"):                                              4.16x
  #7 ("10000000i32 1i32"):                                              4.34x

futhark-benchmarks/micro/scan.fut:lss_f32
  [1000000]i32:                                                         1.09x
  [100000]i32:                                                          0.85x (mem: 0.96x@device)
  [10000]i32:                                                           0.74x (mem: 0.60x@device)

futhark-benchmarks/micro/scan.fut:lss_f64
  [1000000]i32:                                                         1.03x
  [100000]i32:                                                          0.83x (mem: 0.97x@device)
  [10000]i32:                                                           0.77x (mem: 0.69x@device)

futhark-benchmarks/micro/scan.fut:lss_i32
  [1000000]i32:                                                         1.09x
  [100000]i32:                                                          0.85x (mem: 0.96x@device)
  [10000]i32:                                                           0.74x (mem: 0.60x@device)

futhark-benchmarks/micro/scan.fut:lss_i8
  [1000000]i32:                                                         1.42x
  [100000]i32:                                                          0.95x (mem: 0.95x@device)
  [10000]i32:                                                           0.81x (mem: 0.49x@device)

futhark-benchmarks/micro/scan.fut:lss_iota_f32
  #0 ("10000i32"):                                                      0.74x (mem: 0.53x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.95x@device)
  #2 ("1000000i32"):                                                    1.04x
  #3 ("10000000i32"):                                                   1.12x

futhark-benchmarks/micro/scan.fut:lss_iota_f64
  #0 ("10000i32"):                                                      0.75x (mem: 0.65x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.97x@device)
  #2 ("1000000i32"):                                                    0.98x
  #3 ("10000000i32"):                                                   1.01x

futhark-benchmarks/micro/scan.fut:lss_iota_i32
  #0 ("10000i32"):                                                      0.74x (mem: 0.53x@device)
  #1 ("100000i32"):                                                     0.88x (mem: 0.95x@device)
  #2 ("1000000i32"):                                                    1.01x
  #3 ("10000000i32"):                                                   1.10x

futhark-benchmarks/micro/scan.fut:lss_iota_i8
  #0 ("10000i32"):                                                      0.79x (mem: 0.37x@device)
  #1 ("100000i32"):                                                     0.98x (mem: 0.94x@device)
  #2 ("1000000i32"):                                                    1.38x
  #3 ("10000000i32"):                                                   1.63x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_f32
  #0 ("10000i32"):                                                      0.80x (mem: 0.29x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.93x@device)
  #2 ("1000000i32"):                                                    0.74x
  #3 ("10000000i32"):                                                   0.71x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_f64
  #0 ("10000i32"):                                                      0.97x (mem: 0.65x@device)
  #1 ("100000i32"):                                                     1.13x (mem: 0.97x@device)
  #2 ("1000000i32"):                                                    1.58x
  #3 ("10000000i32"):                                                   1.87x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_i32
  #0 ("10000i32"):                                                      0.82x (mem: 0.29x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.93x@device)
  #2 ("1000000i32"):                                                    0.71x
  #3 ("10000000i32"):                                                   0.68x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_i8
  #0 ("10000i32"):                                                      1.34x (mem: 0.00x@device)
  #1 ("100000i32"):                                                     1.34x (mem: 0.72x@device)
  #2 ("1000000i32"):                                                    1.30x (mem: 0.97x@device)
  #3 ("10000000i32"):                                                   1.84x

futhark-benchmarks/micro/scan.fut:prod_mat4_f32
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  0.72x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      0.84x (mem: 0.97x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.78x (mem: 0.65x@device)

futhark-benchmarks/micro/scan.fut:prod_mat4_f64
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.63x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.05x (mem: 0.98x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.95x (mem: 0.77x@device)

futhark-benchmarks/micro/scan.fut:prod_mat4_i32
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  0.70x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      0.85x (mem: 0.97x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.81x (mem: 0.65x@device)

futhark-benchmarks/micro/scan.fut:prod_mat4_i8
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.31x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.00x (mem: 0.94x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.93x (mem: 0.44x@device)

futhark-benchmarks/micro/scan.fut:sum_f32
  [100000000]i32:                                                       3.03x
  [10000000]i32:                                                        2.82x
  [1000000]i32:                                                         1.97x
  [100000]i32:                                                          1.28x (mem: 0.90x@device)
  [10000]i32:                                                           1.36x (mem: 0.14x@device)

futhark-benchmarks/micro/scan.fut:sum_f64
  [100000000]i32:                                                       2.26x
  [10000000]i32:                                                        2.30x
  [1000000]i32:                                                         1.75x
  [100000]i32:                                                          1.39x (mem: 0.94x@device)
  [10000]i32:                                                           1.27x (mem: 0.35x@device)

futhark-benchmarks/micro/scan.fut:sum_i32
  [100000000]i32:                                                       4.64x
  [10000000]i32:                                                        3.84x
  [1000000]i32:                                                         2.21x
  [100000]i32:                                                          1.59x (mem: 0.91x@device)
  [10000]i32:                                                           1.54x (mem: 0.24x@device)

futhark-benchmarks/micro/scan.fut:sum_i8
  [100000000]i32:                                                       7.15x
  [10000000]i32:                                                        4.83x
  [1000000]i32:                                                         2.30x (mem: 0.99x@device)
  [100000]i32:                                                          1.53x (mem: 0.88x@device)
  [10000]i32:                                                           1.53x (mem: 0.22x@device)

futhark-benchmarks/micro/scan.fut:sum_iota_f32
  #0 ("10000i32"):                                                      1.67x (mem: 0.23x@device)
  #1 ("100000i32"):                                                     1.73x (mem: 0.90x@device)
  #2 ("1000000i32"):                                                    2.32x
  #3 ("10000000i32"):                                                   3.98x
  #4 ("100000000i32"):                                                  4.51x

futhark-benchmarks/micro/scan.fut:sum_iota_f64
  #0 ("10000i32"):                                                      1.59x (mem: 0.52x@device)
  #1 ("100000i32"):                                                     1.59x (mem: 0.95x@device)
  #2 ("1000000i32"):                                                    1.94x
  #3 ("10000000i32"):                                                   2.87x
  #4 ("100000000i32"):                                                  2.90x

futhark-benchmarks/micro/scan.fut:sum_iota_i32
  #0 ("10000i32"):                                                      1.67x (mem: 0.45x@device)
  #1 ("100000i32"):                                                     1.66x (mem: 0.92x@device)
  #2 ("1000000i32"):                                                    2.31x
  #3 ("10000000i32"):                                                   4.15x
  #4 ("100000000i32"):                                                  4.74x

futhark-benchmarks/micro/scan.fut:sum_iota_i8
  #0 ("10000i32"):                                                      1.67x
  #1 ("100000i32"):                                                     1.56x (mem: 0.78x@device)
  #2 ("1000000i32"):                                                    2.32x (mem: 0.98x@device)
  #3 ("10000000i32"):                                                   5.62x
  #4 ("100000000i32"):                                                  7.98x

futhark-benchmarks/micro/scan.fut:sum_scaled_f32
  [100000000]i32:                                                       2.73x
  [10000000]i32:                                                        2.51x
  [1000000]i32:                                                         1.96x
  [100000]i32:                                                          1.27x (mem: 0.91x@device)
  [10000]i32:                                                           1.22x (mem: 0.12x@device)

futhark-benchmarks/micro/scan.fut:sum_scaled_f64
  [100000000]i32:                                                       2.05x
  [10000000]i32:                                                        2.10x
  [1000000]i32:                                                         1.74x
  [100000]i32:                                                          1.42x (mem: 0.94x@device)
  [10000]i32:                                                           1.36x (mem: 0.44x@device)

futhark-benchmarks/micro/scan.fut:sum_scaled_i32
  [100000000]i32:                                                       2.80x
  [10000000]i32:                                                        2.57x
  [1000000]i32:                                                         2.00x
  [100000]i32:                                                          1.26x (mem: 0.91x@device)
  [10000]i32:                                                           1.25x (mem: 0.18x@device)

futhark-benchmarks/micro/scan.fut:sum_scaled_i8
  [100000000]i32:                                                       5.10x
  [10000000]i32:                                                        3.89x
  [1000000]i32:                                                         1.87x (mem: 0.98x@device)
  [100000]i32:                                                          1.15x (mem: 0.84x@device)
  [10000]i32:                                                           1.15x (mem: 0.00x@device)

futhark-benchmarks/micro/transpose.fut:map_transpose_i32
  1000i32 100000i32 1i32 [100000000]i32:                                0.63x
  1000i32 1000i32 100i32 [100000000]i32:                                0.91x
  1000i32 1i32 100000i32 [100000000]i32:                                1.07x
  10i32 10000000i32 1i32 [100000000]i32:                                0.64x
  10i32 1000i32 10000i32 [100000000]i32:                                1.07x
  10i32 1i32 10000000i32 [100000000]i32:                                1.08x
  1i32 10000000i32 10i32 [100000000]i32:                                0.69x
  1i32 1000i32 100000i32 [100000000]i32:                                1.11x
  1i32 1i32 100000000i32 [100000000]i32:                                1.07x

futhark-benchmarks/micro/transpose.fut:map_transpose_i64
  1000i32 100000i32 1i32 [100000000]i64:                                0.91x
  1000i32 1000i32 100i32 [100000000]i64:                                1.01x
  1000i32 1i32 100000i32 [100000000]i64:                                1.04x
  10i32 10000000i32 1i32 [100000000]i64:                                0.91x
  10i32 1000i32 10000i32 [100000000]i64:                                1.05x
  10i32 1i32 10000000i32 [100000000]i64:                                1.04x
  1i32 10000000i32 10i32 [100000000]i64:                                1.00x
  1i32 1000i32 100000i32 [100000000]i64:                                1.05x
  1i32 1i32 100000000i32 [100000000]i64:                                1.04x

futhark-benchmarks/micro/transpose.fut:map_transpose_i8
  1000i32 100000i32 1i32 [100000000]i8:                                 0.60x
  1000i32 1000i32 100i32 [100000000]i8:                                 0.69x
  1000i32 1i32 100000i32 [100000000]i8:                                 1.04x
  10i32 10000000i32 1i32 [100000000]i8:                                 0.60x
  10i32 1000i32 10000i32 [100000000]i8:                                 0.68x
  10i32 1i32 10000000i32 [100000000]i8:                                 1.04x
  1i32 10000000i32 10i32 [100000000]i8:                                 0.64x
  1i32 1000i32 100000i32 [100000000]i8:                                 0.94x
  1i32 1i32 100000000i32 [100000000]i8:                                 1.04x

futhark-benchmarks/micro/transpose.fut:transpose_i32
  100000000i32 1i32 [100000000]i32:                                     0.77x
  10000000i32 10i32 [100000000]i32:                                     0.69x
  1000000i32 100i32 [100000000]i32:                                     0.96x
  100000i32 1000i32 [100000000]i32:                                     1.04x
  10000i32 10000i32 [100000000]i32:                                     1.05x
  1000i32 100000i32 [100000000]i32:                                     1.10x
  100i32 1000000i32 [100000000]i32:                                     1.01x
  10i32 10000000i32 [100000000]i32:                                     0.67x
  1i32 100000000i32 [100000000]i32:                                     1.07x
  25000000i32 4i32 [100000000]i32:                                      0.78x
  2i32 50000000i32 [100000000]i32:                                      0.64x
  4i32 25000000i32 [100000000]i32:                                      0.62x
  50000000i32 2i32 [100000000]i32:                                      0.77x

futhark-benchmarks/micro/transpose.fut:transpose_i64
  100000000i32 1i32 [100000000]i64:                                     1.04x
  10000000i32 10i32 [100000000]i64:                                     0.99x
  1000000i32 100i32 [100000000]i64:                                     1.03x
  100000i32 1000i32 [100000000]i64:                                     1.05x
  10000i32 10000i32 [100000000]i64:                                     1.06x
  1000i32 100000i32 [100000000]i64:                                     1.05x
  100i32 1000000i32 [100000000]i64:                                     1.03x
  10i32 10000000i32 [100000000]i64:                                     0.96x
  1i32 100000000i32 [100000000]i64:                                     1.04x
  25000000i32 4i32 [100000000]i64:                                      1.04x
  2i32 50000000i32 [100000000]i64:                                      0.99x
  4i32 25000000i32 [100000000]i64:                                      0.97x
  50000000i32 2i32 [100000000]i64:                                      1.06x

futhark-benchmarks/micro/transpose.fut:transpose_i8
  100000000i32 1i32 [100000000]i8:                                      0.65x
  10000000i32 10i32 [100000000]i8:                                      0.64x
  1000000i32 100i32 [100000000]i8:                                      0.67x
  100000i32 1000i32 [100000000]i8:                                      0.67x
  10000i32 10000i32 [100000000]i8:                                      0.68x
  1000i32 100000i32 [100000000]i8:                                      0.94x
  100i32 1000000i32 [100000000]i8:                                      0.91x
  10i32 10000000i32 [100000000]i8:                                      0.54x
  1i32 100000000i32 [100000000]i8:                                      1.07x
  25000000i32 4i32 [100000000]i8:                                       0.49x
  2i32 50000000i32 [100000000]i8:                                       0.62x
  4i32 25000000i32 [100000000]i8:                                       0.62x
  50000000i32 2i32 [100000000]i8:                                       0.52x

futhark-benchmarks/misc/bfast/bfast-cloudy.fut
  data/africa.in:                                                       1.14x (mem: 0.60x@device)
  data/peru.in:                                                         1.00x
  data/sahara-cloudy.in:                                                1.22x (mem: 0.60x@device)

futhark-benchmarks/misc/bfast/bfast.fut
  data/sahara.in:                                                       1.13x

futhark-benchmarks/misc/heston/heston32.fut
  data/100000_quotes.in:                                                1.09x (mem: 0.97x@device)
  data/10000_quotes.in:                                                 1.05x
  data/1062_quotes.in:                                                  1.09x

futhark-benchmarks/misc/heston/heston64.fut
  data/100000_quotes.in:                                                1.07x
  data/10000_quotes.in:                                                 1.00x
  data/1062_quotes.in:                                                  0.92x

futhark-benchmarks/misc/knn-by-kdtree/buildKDtree.fut
  valid-data/kdtree-ppl-32-m-2097152.in:                                1.34x

futhark-benchmarks/misc/knn-by-kdtree/driver-knn.fut
  256i32 [2097152][7]f32 [10000000][7]f32:                              1.28x

futhark-benchmarks/misc/ocean-sim/tke.fut
  [200][200][100]f32 [200][200][100]f32 [200][200][100]f32 [...]:       0.85x
  data/tke32-small.in:                                                  1.13x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNested
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      0.86x
  data/tridiag32-small.in:                                              1.40x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNestedConst
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      0.85x
  data/tridiag32-small.in:                                              1.39x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNestedSeq
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      1.04x
  data/tridiag32-small.in:                                              1.07x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNestedSeqConst
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      1.08x
  data/tridiag32-small.in:                                              1.07x

futhark-benchmarks/misc/poseidon/poseidon-bench.fut:arity11
  [17600000]u64:                                                        1.15x

futhark-benchmarks/misc/poseidon/poseidon-bench.fut:arity8
  [22400000]u64:                                                        0.58x

futhark-benchmarks/parboil/histo/histo.fut
  data/default.in:                                                      1.31x
  data/large.in:                                                        1.29x

futhark-benchmarks/parboil/lbm/lbm.fut
  data/120_120_150_ldc.in:                                              0.41x

futhark-benchmarks/parboil/mri-q/mri-q.fut
  data/large.in:                                                        1.08x
  data/small.in:                                                        1.12x

futhark-benchmarks/parboil/sgemm/sgemm.fut
  data/medium.in:                                                       0.67x
  data/small.in:                                                        1.50x
  data/tiny.in:                                                         1.46x

futhark-benchmarks/parboil/stencil/stencil.fut
  data/default.in:                                                      1.03x
  data/small.in:                                                        1.10x

futhark-benchmarks/parboil/tpacf/tpacf.fut
  data/large.in:                                                        2.47x
  data/medium.in:                                                       2.46x
  data/small.in:                                                        0.99x (mem: 0.97x@device)

futhark-benchmarks/pbbs/breadthFirstSearch/breadthFirstSearch.fut
  data/3Dgrid_J_64000000.in:                                            1.04x
  data/rMatGraph_J_12_16000000.in:                                      1.37x
  data/randLocalGraph_J_10_20000000.in:                                 1.38x

futhark-benchmarks/pbbs/comparisonSort/merge_sort.fut:sort_f64
  data/almostSortedSeq_100M.in:                                         1.00x
  data/exptSeq_100M.in:                                                 1.00x
  data/randomSeq_100M.in:                                               1.00x

futhark-benchmarks/pbbs/comparisonSort/merge_sort.fut:sort_f64_pair
  data/randomSeq_100M_double_pair_double.in:                            1.00x

futhark-benchmarks/pbbs/comparisonSort/quick_sort.fut:sort_f64
  data/almostSortedSeq_100M.in:                                         1.10x
  data/exptSeq_100M.in:                                                 1.10x
  data/randomSeq_100M.in:                                               1.09x

futhark-benchmarks/pbbs/comparisonSort/quick_sort.fut:sort_f64_pair
  data/randomSeq_100M_double_pair_double.in:                            1.11x

futhark-benchmarks/pbbs/convexHull/convexhull.fut
  data/2DinSphere_100K.in:                                              0.68x
  data/2DinSphere_100M.in:                                              1.02x
  data/2DinSphere_10K.in:                                               0.69x
  data/2DinSphere_10M.in:                                               0.76x
  data/2DinSphere_1M.in:                                                0.71x
  data/2Dkuzmin_100K.in:                                                0.65x
  data/2Dkuzmin_100M.in:                                                1.18x
  data/2Dkuzmin_10K.in:                                                 0.72x
  data/2Dkuzmin_10M.in:                                                 0.83x
  data/2Dkuzmin_1M.in:                                                  0.64x
  data/2DonSphere_100K.in:                                              0.69x
  data/2DonSphere_100M.in:                                              1.26x
  data/2DonSphere_10K.in:                                               0.67x
  data/2DonSphere_10M.in:                                               1.01x
  data/2DonSphere_1M.in:                                                0.77x

futhark-benchmarks/pbbs/histogram/histogram.fut
  almostEqualSeq_100M:                                                  1.00x
  exptSeq_100M:                                                         0.98x
  randomSeq_100M:                                                       0.99x
  randomSeq_100M_100K:                                                  0.90x
  randomSeq_100M_256:                                                   3.34x

futhark-benchmarks/pbbs/integerSort/radix_sort.fut:sort_i32
  exptSeq_100M_int:                                                     1.60x
  randomSeq_100M_int:                                                   1.58x

futhark-benchmarks/pbbs/integerSort/radix_sort.fut:sort_i32_pair
  randomSeq_100M_256_int_pair_int:                                      1.37x
  randomSeq_100M_int_pair_int:                                          1.42x

futhark-benchmarks/pbbs/maximalIndependentSet/maximalIndependentSet.fut
  data/3Dgrid_JR_64000000.in:                                           0.98x
  data/rMatGraph_JR_12_16000000.in:                                     1.11x
  data/randLocalGraph_JR_10_20000000.in:                                0.91x

futhark-benchmarks/pbbs/maximalMatching/maximalMatching.fut
  data/2Dgrid_E_64000000.in:                                            1.02x
  data/rMatGraph_E_10_20000000.in:                                      1.03x
  data/randLocalGraph_E_10_20000000.in:                                 1.04x

futhark-benchmarks/pbbs/minSpanningForest/minSpanningForest.fut
  data/3Dgrid_WE_8000000.in:                                            1.00x
  data/rMatGraph_WE_12_2250000.in:                                      0.98x
  data/randLocalGraph_WE_10_2000000.in:                                 0.99x

futhark-benchmarks/pbbs/ray/ray.fut
  data/angel.in:                                                        0.98x
  data/dragon.in:                                                       0.93x
  data/happy.in:                                                        0.96x

futhark-benchmarks/rodinia/backprop/backprop.fut
  data/medium.in:                                                       1.07x
  data/small.in:                                                        1.27x

futhark-benchmarks/rodinia/bfs/bfs_asympt_ok_but_slow.fut
  data/4096nodes.in:                                                    1.08x
  data/512nodes_high_edge_variance.in:                                  1.09x
  data/64kn_32e-var-1-256-skew.in:                                      1.55x
  data/graph1MW_6.in:                                                   1.22x

futhark-benchmarks/rodinia/bfs/bfs_filt_padded_fused.fut
  data/4096nodes.in:                                                    1.11x
  data/512nodes_high_edge_variance.in:                                  1.14x (mem: 0.97x@device)
  data/64kn_32e-var-1-256-skew.in:                                      1.04x (mem: 0.00x@device)
  data/graph1MW_6.in:                                                   1.24x

futhark-benchmarks/rodinia/bfs/bfs_heuristic.fut
  data/4096nodes.in:                                                    1.12x
  data/512nodes_high_edge_variance.in:                                  1.13x (mem: 0.93x@device)
  data/64kn_32e-var-1-256-skew.in:                                      1.04x
  data/graph1MW_6.in:                                                   1.22x

futhark-benchmarks/rodinia/bfs/bfs_iter_work_ok.fut
  data/4096nodes.in:                                                    1.16x
  data/512nodes_high_edge_variance.in:                                  1.20x (mem: 0.71x@device)
  data/64kn_32e-var-1-256-skew.in:                                      1.07x
  data/graph1MW_6.in:                                                   1.26x

futhark-benchmarks/rodinia/cfd/cfd.fut
  data/fvcorr.domn.097K.toa:                                            1.02x
  data/fvcorr.domn.193K.toa:                                            1.01x

futhark-benchmarks/rodinia/hotspot/hotspot.fut
  data/1024.in:                                                         1.11x
  data/512.in:                                                          1.01x
  data/64.in:                                                           1.81x

futhark-benchmarks/rodinia/kmeans/kmeans.fut
  data/100.in:                                                          1.24x
  data/204800.in:                                                       0.96x (mem: 0.99x@device)
  data/kdd_cup.in:                                                      1.14x

futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
  data/10_boxes.in:                                                     1.16x
  data/3_boxes.in:                                                      0.70x

futhark-benchmarks/rodinia/lud/lud.fut
  data/16by16.in:                                                       1.01x
  data/2048.in:                                                         0.97x
  data/256.in:                                                          0.97x
  data/512.in:                                                          0.97x
  data/64.in:                                                           0.94x

futhark-benchmarks/rodinia/myocyte/myocyte.fut
  data/medium.in:                                                       1.72x
  data/small.in:                                                        0.85x

futhark-benchmarks/rodinia/nn/nn.fut
  data/medium.in:                                                       0.96x

futhark-benchmarks/rodinia/nw/nw.fut
  data/large.in:                                                        1.04x
  data/medium.in:                                                       1.03x
  data/small.in:                                                        1.31x
  data/tiny.in:                                                         1.36x

futhark-benchmarks/rodinia/particlefilter/particlefilter.fut
  data/128_128_10_image_10000_particles.in:                             1.07x
  data/128_128_10_image_400000_particles.in:                            1.10x

futhark-benchmarks/rodinia/pathfinder/pathfinder.fut
  data/medium.in:                                                       1.12x

futhark-benchmarks/rodinia/srad/srad.fut
  data/image.in:                                                        0.63x

futhark-benchmarks/rsbench/rsbench.fut
  data/large.in:                                                        1.48x
  data/small.in:                                                        1.35x

futhark-benchmarks/xsbench/xsbench.fut
  data/large.in:                                                        1.02x
  data/small.in:                                                        0.98x

Some benchmarks are substantially faster (e.g. FFT). I think this is because the HIP backend allows up to 1024 threads in a thread block (compared to only 256 for AMD's OpenCL implementation), which lets intra-group parallelism apply.

Some are strangely slower (e.g. mandelbrot). I'll have to look into it. It may be something simple like not properly querying for how many threads to launch.

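Most things that depend on scans are faster, as the HIP backend uses the highly tuned single-pass scan code generation. The sum scans in micro/scan.fut reach up to 7.98x, although a few others there are slower (e.g. prod_mat4_f32 at 0.72x on the largest input).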

I also see that e.g. sgemm is a lot slower than with OpenCL on the medium dataset (0.67x), even though the small and tiny datasets are faster. This also merits further investigation.

But overall, this backend looks pretty operational to me. Certainly worth using for some programs, and with some tweaks we can probably make it superior to the OpenCL backend in all cases, on AMD hardware.
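
To back up an overall impression like this with numbers, the comparison output above can be summarised mechanically. The following is a minimal sketch, not part of this PR: it assumes the plain-text layout shown above (an unindented benchmark line followed by indented `dataset: ratio` lines), treats a ratio above 1 as HIP being faster, and uses hypothetical helper names (`parse_speedups`, `geometric_mean`, `slower_than`) and an arbitrary 0.9 cut-off for "slower".

```python
import math
import re

# An indented result line, e.g.
#   "  data/small.in:      0.99x (mem: 0.97x@device)"
# The runtime ratio is the first "<number>x" after the dataset name;
# the optional "(mem: ...)" part is ignored.
RESULT_LINE = re.compile(
    r"^\s+(?P<dataset>.+?):\s+(?P<ratio>\d+(?:\.\d+)?)x(?:\s|$)"
)


def parse_speedups(text):
    """Return (benchmark, dataset, ratio) tuples from comparison output."""
    results = []
    benchmark = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = RESULT_LINE.match(line)
        if match and benchmark is not None:
            results.append(
                (benchmark, match.group("dataset"), float(match.group("ratio")))
            )
        elif not line[0].isspace():
            # Unindented lines name the benchmark (and entry point, if any).
            benchmark = line.strip()
    return results


def geometric_mean(ratios):
    """Geometric mean, the usual way to average ratios."""
    positive = [r for r in ratios if r > 0]
    if not positive:
        raise ValueError("no positive ratios to average")
    return math.exp(sum(math.log(r) for r in positive) / len(positive))


def slower_than(results, threshold=0.9):
    """Results whose ratio falls below the (arbitrarily chosen) threshold."""
    return [r for r in results if r[2] < threshold]


# A few lines copied from the output above, used as sample input.
SAMPLE = """\
futhark-benchmarks/parboil/sgemm/sgemm.fut
  data/medium.in:                                                       0.67x
  data/small.in:                                                        1.50x
  data/tiny.in:                                                         1.46x

futhark-benchmarks/parboil/tpacf/tpacf.fut
  data/small.in:                                                        0.99x (mem: 0.97x@device)
"""

if __name__ == "__main__":
    results = parse_speedups(SAMPLE)
    mean = geometric_mean([ratio for _, _, ratio in results])
    print(f"{len(results)} results, geometric mean {mean:.2f}x")
    for benchmark, dataset, ratio in slower_than(results):
        print(f"slower: {benchmark} {dataset} ({ratio:.2f}x)")
```

Running the same functions over the full output would give a list of the regressions worth investigating first. The geometric mean is used because the values are ratios, so an arithmetic mean would overweight the large speedups.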

@athas athas merged commit befe604 into master Aug 18, 2023
@athas athas deleted the hip branch August 18, 2023 11:08
Labels
run-benchmarks Makes GA run the benchmark suite.