Add HIP backend #2008

athas · 2023-08-11T06:16:18Z

This is necessary when targeting C++ (as with HIP), because C++ does not allow a 'goto' to jump past initialized variable declarations, which is what we do when handling bounds checks. This could also be implemented in GenericC itself, but it is slightly simpler this way.
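To illustrate the restriction, here is a minimal sketch of the kind of code that is valid C but rejected by a C++ compiler, together with one way to restructure it. The function `checked_prefix_sum` and its error convention are hypothetical and are not taken from the generated code; the thread does not spell out the exact transformation used, so treat the hoisted declaration below as just one possible approach.

```cpp
#include <cstdint>

// Hypothetical stand-in for generated code: returns the sum of xs[0..i],
// or -1 if index i is out of bounds. Not actual Futhark output.
//
// The C-style version that a C++ compiler rejects would look like this:
//
//     if (i < 0 || i >= n) goto error;
//     int64_t acc = 0;   // C++: the jump to 'error' bypasses this
//     ...                //      initialization, so the program is ill-formed
//   error:
//     return -1;
//
// Declaring the variable without an initializer before the first goto,
// and assigning to it afterwards, keeps the jump well-formed in C++.
int64_t checked_prefix_sum(const int32_t *xs, int64_t n, int64_t i) {
  int64_t acc;  // declared up front, deliberately without an initializer

  if (i < 0 || i >= n) goto error;  // bounds check jumps forward

  acc = 0;
  for (int64_t j = 0; j <= i; j++) acc += xs[j];
  return acc;

error:
  return -1;
}
```

The same source also compiles as C, which is presumably why this pattern only became a problem once a C++-based target was added.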

athas · 2023-08-17T19:16:25Z

The entire test and benchmark suites now work with the HIP backend. (The CI errors are due to some of the NVIDIA GPUs in the cluster being unavailable, unrelated to the backend.) The performance compared to the OpenCL backend, on the MI100 GPU, is as follows:

futhark-benchmarks/accelerate/canny/canny.fut
  data/lena256.in:                                                      1.31x
  data/lena512.in:                                                      1.29x

futhark-benchmarks/accelerate/crystal/crystal.fut
  #0 ("200i32 30.0f32 5i32 1i32 1.0f32"):                               1.56x
  #4 ("2000i32 30.0f32 50i32 1i32 1.0f32"):                             1.03x
  #5 ("4000i32 30.0f32 50i32 1i32 1.0f32"):                             1.01x

futhark-benchmarks/accelerate/fft/fft.fut
  data/1024x1024.in:                                                    3.08x (mem: 0.44x@device)
  data/128x128.in:                                                      1.10x (mem: 0.97x@device)
  data/128x512.in:                                                      4.86x (mem: 0.67x@device)
  data/256x256.in:                                                      1.16x
  data/512x512.in:                                                     10.56x (mem: 0.49x@device)
  data/64x256.in:                                                       1.08x (mem: 0.99x@device)

futhark-benchmarks/accelerate/fluid/fluid.fut
  benchmarking/medium.in:                                               1.07x

futhark-benchmarks/accelerate/hashcat/hashcat.fut
  rockyou.dataset:                                                      0.93x

futhark-benchmarks/accelerate/kmeans/kmeans.fut
  data/k5_n200000.in:                                                   0.65x (mem: 0.96x@device)
  data/k5_n50000.in:                                                    0.74x (mem: 0.96x@device)
  data/trivial.in:                                                      0.87x

futhark-benchmarks/accelerate/mandelbrot/mandelbrot.fut
  #0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."):         1.08x
  #1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.90x
  #2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.72x
  #3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.71x
  #4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."):         0.70x

futhark-benchmarks/accelerate/nbody/nbody-bh.fut
  data/1000-bodies.in:                                                  0.75x
  data/10000-bodies.in:                                                 0.78x
  data/100000-bodies.in:                                                0.85x

futhark-benchmarks/accelerate/nbody/nbody.fut
  data/1000-bodies.in:                                                  0.60x (mem: 1.09x@device)
  data/10000-bodies.in:                                                 1.11x
  data/100000-bodies.in:                                                0.99x

futhark-benchmarks/accelerate/pagerank/pagerank.fut
  data/random_medium.in:                                                1.27x
  data/small.in:                                                        1.22x (mem: 1.04x@device)

futhark-benchmarks/accelerate/ray/trace.fut
  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."):         1.51x

futhark-benchmarks/accelerate/smoothlife/smoothlife.fut
  #0 ("128i32"):                                                        0.85x
  #1 ("256i32"):                                                        0.91x
  #2 ("512i32"):                                                        7.92x (mem: 0.72x@device)
  #3 ("1024i32"):                                                       2.87x (mem: 0.51x@device)

futhark-benchmarks/accelerate/tunnel/tunnel.fut
  #0 ("10.0f32 800i32 600i32"):                                         0.47x
  #1 ("10.0f32 1000i32 1000i32"):                                       0.46x
  #2 ("10.0f32 2000i32 2000i32"):                                       0.49x
  #3 ("10.0f32 4000i32 4000i32"):                                       0.49x
  #4 ("10.0f32 8000i32 8000i32"):                                       0.49x

futhark-benchmarks/babelstream/babelstream.fut:f32_add
  [33554432]f32 [33554432]f32:                                          1.15x

futhark-benchmarks/babelstream/babelstream.fut:f32_copy
  [33554432]f32:                                                        1.05x

futhark-benchmarks/babelstream/babelstream.fut:f32_dot
  [33554432]f32 [33554432]f32:                                          0.97x

futhark-benchmarks/babelstream/babelstream.fut:f32_mul
  [33554432]f32:                                                        1.08x

futhark-benchmarks/babelstream/babelstream.fut:f32_nstream
  [33554432]f32 [33554432]f32 [33554432]f32:                            3.25x

futhark-benchmarks/babelstream/babelstream.fut:f32_triad
  [33554432]f32 [33554432]f32:                                          1.14x

futhark-benchmarks/babelstream/babelstream.fut:f64_add
  [33554432]f64 [33554432]f64:                                          1.07x

futhark-benchmarks/babelstream/babelstream.fut:f64_copy
  [33554432]f64:                                                        1.09x

futhark-benchmarks/babelstream/babelstream.fut:f64_dot
  [33554432]f64 [33554432]f64:                                          1.03x

futhark-benchmarks/babelstream/babelstream.fut:f64_mul
  [33554432]f64:                                                        1.10x

futhark-benchmarks/babelstream/babelstream.fut:f64_nstream
  [33554432]f64 [33554432]f64 [33554432]f64:                            1.64x

futhark-benchmarks/babelstream/babelstream.fut:f64_triad
  [33554432]f64 [33554432]f64:                                          1.06x

futhark-benchmarks/finpar/LocVolCalib.fut
  LocVolCalib-data/large.in:                                            0.76x
  LocVolCalib-data/medium.in:                                           0.79x
  LocVolCalib-data/small.in:                                            0.83x

futhark-benchmarks/finpar/OptionPricing.fut
  OptionPricing-data/large.in:                                          1.03x (mem: 1.04x@device)
  OptionPricing-data/medium.in:                                         1.23x (mem: 1.23x@device)
  OptionPricing-data/small.in:                                          1.30x (mem: 1.04x@device)

futhark-benchmarks/jgf/crypt/crypt.fut
  crypt-data/medium.in:                                                 1.08x

futhark-benchmarks/jgf/crypt/keys.fut
  crypt-data/userkey0.txt:                                              0.54x

futhark-benchmarks/jgf/series/series.fut
  data/10000.in:                                                        1.03x
  data/100000.in:                                                       1.01x
  data/1000000.in:                                                      1.01x

futhark-benchmarks/micro/intra.fut:scan_reduce
  [1000000][6]f32:                                                      2.14x

futhark-benchmarks/micro/mmm/lud-internal.fut
  [128][32][32]f32 [128][32][32]f32 [128][128][32][32]f32:              1.21x

futhark-benchmarks/micro/mmm/mmm-batch.fut
  [64][128][32][32]f32 [64][128][32][32]f32:                            1.13x

futhark-benchmarks/micro/mmm/mmm.fut
  #2 ("[[1.0f32, 2.0f32, 3.0f32], [3.0f32, 4.0f32, 5.0f32..."):         1.58x
  [1024][1024]f32 [1024][1024]f32:                                      1.33x
  [2048][4096]f32 [4096][2048]f32:                                      1.44x

futhark-benchmarks/micro/mmm/sgemm.fut
  #0 ("2.0f32 3.0f32 [[1.0f32, 2.0f32, 3.0f32], [3.0f32, ..."):         1.53x
  f32 f32 [1024][1024]f32 [1024][1024]f32 [1024][1024]f32:              1.31x
  f32 f32 [2048][4096]f32 [4096][2048]f32 [2048][2048]f32:              1.43x

futhark-benchmarks/micro/reduce-segmented.fut:prod_mat4_i32
  10000000i32 1i32 [10000000]i32 [10000000]i32 [...]:                   1.02x
  1000000i32 10i32 [10000000]i32 [10000000]i32 [...]:                   0.79x
  100000i32 100i32 [10000000]i32 [10000000]i32 [...]:                   1.03x
  10000i32 1000i32 [10000000]i32 [10000000]i32 [...]:                   0.90x
  1000i32 10000i32 [10000000]i32 [10000000]i32 [...]:                   0.88x
  100i32 100000i32 [10000000]i32 [10000000]i32 [...]:                   1.05x
  10i32 1000000i32 [10000000]i32 [10000000]i32 [...]:                   0.72x
  1i32 10000000i32 [10000000]i32 [10000000]i32 [...]:                   0.73x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i16
  100000000i32 1i32 [100000000]i16:                                     1.05x
  10000000i32 10i32 [100000000]i16:                                     0.70x
  1000000i32 100i32 [100000000]i16:                                     0.75x
  100000i32 1000i32 [100000000]i16:                                     0.85x
  10000i32 10000i32 [100000000]i16:                                     1.41x
  1000i32 100000i32 [100000000]i16:                                     1.91x
  100i32 1000000i32 [100000000]i16:                                     2.30x
  10i32 10000000i32 [100000000]i16:                                     1.40x
  1i32 100000000i32 [100000000]i16:                                     1.32x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i32
  100000000i32 1i32 [100000000]i32:                                     1.03x
  10000000i32 10i32 [100000000]i32:                                     0.75x
  1000000i32 100i32 [100000000]i32:                                     0.97x
  100000i32 1000i32 [100000000]i32:                                     1.03x
  10000i32 10000i32 [100000000]i32:                                     1.16x
  1000i32 100000i32 [100000000]i32:                                     1.45x
  100i32 1000000i32 [100000000]i32:                                     1.79x
  10i32 10000000i32 [100000000]i32:                                     1.16x
  1i32 100000000i32 [100000000]i32:                                     1.14x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i64
  100000000i32 1i32 [100000000]i64:                                     1.01x
  10000000i32 10i32 [100000000]i64:                                     1.00x
  1000000i32 100i32 [100000000]i64:                                     1.02x
  100000i32 1000i32 [100000000]i64:                                     1.04x
  10000i32 10000i32 [100000000]i64:                                     1.07x
  1000i32 100000i32 [100000000]i64:                                     1.16x
  100i32 1000000i32 [100000000]i64:                                     1.43x
  10i32 10000000i32 [100000000]i64:                                     1.10x
  1i32 100000000i32 [100000000]i64:                                     1.08x

futhark-benchmarks/micro/reduce-segmented.fut:sum_i8
  100000000i32 1i32 [100000000]i8:                                      1.04x
  10000000i32 10i32 [100000000]i8:                                      0.69x
  1000000i32 100i32 [100000000]i8:                                      0.75x
  100000i32 1000i32 [100000000]i8:                                      0.80x
  10000i32 10000i32 [100000000]i8:                                      1.43x
  1000i32 100000i32 [100000000]i8:                                      1.97x
  100i32 1000000i32 [100000000]i8:                                      2.36x
  10i32 10000000i32 [100000000]i8:                                      1.28x
  1i32 100000000i32 [100000000]i8:                                      1.35x

futhark-benchmarks/micro/reduce-segmented.fut:sum_iota_i32
  100000000i32 1i32:                                                    1.11x
  10000000i32 10i32:                                                    1.30x
  1000000i32 100i32:                                                    1.73x
  100000i32 1000i32:                                                    2.14x
  10000i32 10000i32:                                                    2.09x
  1000i32 100000i32:                                                    2.06x
  100i32 1000000i32:                                                    2.13x
  10i32 10000000i32:                                                    2.06x
  1i32 100000000i32:                                                    2.12x

futhark-benchmarks/micro/reduce.fut:lss_f32
  [10000000]i32:                                                        0.80x
  [1000000]i32:                                                         0.89x
  [100000]i32:                                                          1.12x (mem: 0.40x@device)
  [10000]i32:                                                           1.29x

futhark-benchmarks/micro/reduce.fut:lss_f64
  [10000000]i32:                                                        0.79x
  [1000000]i32:                                                         0.83x
  [100000]i32:                                                          1.02x (mem: 0.44x@device)
  [10000]i32:                                                           1.21x

futhark-benchmarks/micro/reduce.fut:lss_i32
  [10000000]i32:                                                        0.82x
  [1000000]i32:                                                         0.89x
  [100000]i32:                                                          1.06x (mem: 0.40x@device)
  [10000]i32:                                                           1.38x

futhark-benchmarks/micro/reduce.fut:lss_i8
  [10000000]i32:                                                        0.84x
  [1000000]i32:                                                         0.93x
  [100000]i32:                                                          1.03x (mem: 0.65x@device)
  [10000]i32:                                                           1.21x

futhark-benchmarks/micro/reduce.fut:lss_iota_f32
  #0 ("10000i32"):                                                      1.13x
  #1 ("100000i32"):                                                     0.96x
  #2 ("1000000i32"):                                                    0.85x
  #3 ("10000000i32"):                                                   0.83x
  #4 ("100000000i32"):                                                  0.76x

futhark-benchmarks/micro/reduce.fut:lss_iota_f64
  #0 ("10000i32"):                                                      1.15x
  #1 ("100000i32"):                                                     0.93x
  #2 ("1000000i32"):                                                    0.85x
  #3 ("10000000i32"):                                                   0.82x
  #4 ("100000000i32"):                                                  0.74x

futhark-benchmarks/micro/reduce.fut:lss_iota_i32
  #0 ("10000i32"):                                                      1.15x
  #1 ("100000i32"):                                                     0.96x
  #2 ("1000000i32"):                                                    0.86x
  #3 ("10000000i32"):                                                   0.80x
  #4 ("100000000i32"):                                                  0.75x

futhark-benchmarks/micro/reduce.fut:lss_iota_i8
  #0 ("10000i32"):                                                      1.12x
  #1 ("100000i32"):                                                     1.06x
  #2 ("1000000i32"):                                                    0.88x
  #3 ("10000000i32"):                                                   0.84x
  #4 ("100000000i32"):                                                  0.76x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_f32
  #0 ("10000i32"):                                                      1.48x
  #1 ("100000i32"):                                                     1.23x
  #2 ("1000000i32"):                                                    1.15x
  #3 ("10000000i32"):                                                   0.93x
  #4 ("100000000i32"):                                                  0.84x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_f64
  #0 ("10000i32"):                                                      1.46x
  #1 ("100000i32"):                                                     1.36x
  #2 ("1000000i32"):                                                    0.98x
  #3 ("10000000i32"):                                                   0.73x
  #4 ("100000000i32"):                                                  0.61x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_i32
  #0 ("10000i32"):                                                      1.42x
  #1 ("100000i32"):                                                     1.31x
  #2 ("1000000i32"):                                                    1.16x
  #3 ("10000000i32"):                                                   0.89x
  #4 ("100000000i32"):                                                  0.78x

futhark-benchmarks/micro/reduce.fut:prod_iota_mat4_i8
  #0 ("10000i32"):                                                      1.41x
  #1 ("100000i32"):                                                     1.36x
  #2 ("1000000i32"):                                                    1.23x
  #3 ("10000000i32"):                                                   0.93x
  #4 ("100000000i32"):                                                  0.85x

futhark-benchmarks/micro/reduce.fut:prod_mat4_f32
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.90x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.24x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.42x (mem: 0.99x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.56x

futhark-benchmarks/micro/reduce.fut:prod_mat4_f64
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.75x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.06x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.37x (mem: 0.99x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.50x

futhark-benchmarks/micro/reduce.fut:prod_mat4_i32
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.87x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.21x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.40x (mem: 0.99x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.48x

futhark-benchmarks/micro/reduce.fut:prod_mat4_i8
  [10000000]i32 [10000000]i32 [10000000]i32 [10000000]i32:              0.91x
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.28x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.51x
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          1.63x

futhark-benchmarks/micro/reduce.fut:sum_f32
  [100000000]i32:                                                       1.00x
  [10000000]i32:                                                        0.93x
  [1000000]i32:                                                         0.88x
  [100000]i32:                                                          1.06x
  [10000]i32:                                                           1.14x

futhark-benchmarks/micro/reduce.fut:sum_f64
  [100000000]i32:                                                       1.02x
  [10000000]i32:                                                        0.91x
  [1000000]i32:                                                         0.88x
  [100000]i32:                                                          1.00x
  [10000]i32:                                                           1.18x

futhark-benchmarks/micro/reduce.fut:sum_i32
  [100000000]i32:                                                       1.01x
  [10000000]i32:                                                        0.92x
  [1000000]i32:                                                         0.92x
  [100000]i32:                                                          1.03x
  [10000]i32:                                                           1.12x

futhark-benchmarks/micro/reduce.fut:sum_i8
  [100000000]i32:                                                       0.99x
  [10000000]i32:                                                        0.93x
  [1000000]i32:                                                         0.93x
  [100000]i32:                                                          1.09x
  [10000]i32:                                                           1.15x

futhark-benchmarks/micro/reduce.fut:sum_iota_f32
  #0 ("10000i32"):                                                      1.07x
  #1 ("100000i32"):                                                     1.04x
  #2 ("1000000i32"):                                                    0.83x
  #3 ("10000000i32"):                                                   0.93x
  #4 ("100000000i32"):                                                  1.07x

futhark-benchmarks/micro/reduce.fut:sum_iota_f64
  #0 ("10000i32"):                                                      1.15x
  #1 ("100000i32"):                                                     1.01x
  #2 ("1000000i32"):                                                    0.81x
  #3 ("10000000i32"):                                                   0.91x
  #4 ("100000000i32"):                                                  1.05x

futhark-benchmarks/micro/reduce.fut:sum_iota_i32
  #0 ("10000i32"):                                                      1.17x
  #1 ("100000i32"):                                                     1.03x
  #2 ("1000000i32"):                                                    0.79x
  #3 ("10000000i32"):                                                   0.80x
  #4 ("100000000i32"):                                                  0.83x

futhark-benchmarks/micro/reduce.fut:sum_iota_i8
  #0 ("10000i32"):                                                      1.16x
  #1 ("100000i32"):                                                     1.02x
  #2 ("1000000i32"):                                                    0.84x
  #3 ("10000000i32"):                                                   0.85x
  #4 ("100000000i32"):                                                  0.74x

futhark-benchmarks/micro/reduce.fut:sum_scaled_f32
  [100000000]i32:                                                       1.04x
  [10000000]i32:                                                        0.97x
  [1000000]i32:                                                         0.96x
  [100000]i32:                                                          1.09x
  [10000]i32:                                                           1.20x (mem: 1.67x@device)

futhark-benchmarks/micro/reduce.fut:sum_scaled_f64
  [100000000]i32:                                                       1.03x
  [10000000]i32:                                                        0.99x
  [1000000]i32:                                                         0.99x
  [100000]i32:                                                          1.02x
  [10000]i32:                                                           1.14x (mem: 1.03x@device)

futhark-benchmarks/micro/reduce.fut:sum_scaled_i32
  [100000000]i32:                                                       1.04x
  [10000000]i32:                                                        0.99x
  [1000000]i32:                                                         1.00x
  [100000]i32:                                                          1.08x
  [10000]i32:                                                           1.18x (mem: 1.67x@device)

futhark-benchmarks/micro/reduce.fut:sum_scaled_i8
  [100000000]i32:                                                       1.02x
  [10000000]i32:                                                        1.02x
  [1000000]i32:                                                         1.02x
  [100000]i32:                                                          1.05x
  [10000]i32:                                                           1.23x

futhark-benchmarks/micro/reduce_by_index-segmented.fut:sum_i32
  100000i32 [256][4000]i32 [256][4000]i32:                              1.11x
  1000i32 [256][4000]i32 [256][4000]i32:                                1.12x (mem: 0.92x@device)
  10i32 [256][4000]i32 [256][4000]i32:                                  1.14x

futhark-benchmarks/micro/reduce_by_index.fut:absmax_i32
  100000i32 [1000000]i32 [1000000]i32:                                  0.73x (mem: 1.30x@device)
  10000i32 [1000000]i32 [1000000]i32:                                   0.80x (mem: 1.14x@device)
  1000i32 [1000000]i32 [1000000]i32:                                    0.81x (mem: 1.14x@device)
  100i32 [1000000]i32 [1000000]i32:                                     0.95x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]i32:                                      0.86x

futhark-benchmarks/micro/reduce_by_index.fut:sum_f32
  100000i32 [1000000]i32 [1000000]f32:                                  1.18x (mem: 0.85x@device)
  10000i32 [1000000]i32 [1000000]f32:                                   0.47x (mem: 0.72x@device)
  1000i32 [1000000]i32 [1000000]f32:                                    0.02x (mem: 0.72x@device)
  100i32 [1000000]i32 [1000000]f32:                                     0.87x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]f32:                                      0.80x

futhark-benchmarks/micro/reduce_by_index.fut:sum_i32
  100000i32 [1000000]i32 [1000000]i32:                                  0.81x
  10000i32 [1000000]i32 [1000000]i32:                                   0.91x
  1000i32 [1000000]i32 [1000000]i32:                                    0.71x
  100i32 [1000000]i32 [1000000]i32:                                     0.92x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]i32:                                      0.89x

futhark-benchmarks/micro/reduce_by_index.fut:sum_i32_f32
  100000i32 [1000000]i32 [1000000]i32 [1000000]f32:                     0.96x (mem: 0.81x@device)
  10000i32 [1000000]i32 [1000000]i32 [1000000]f32:                      0.49x (mem: 0.72x@device)
  1000i32 [1000000]i32 [1000000]i32 [1000000]f32:                       0.02x (mem: 0.71x@device)
  100i32 [1000000]i32 [1000000]i32 [1000000]f32:                        0.75x (mem: 0.98x@device)
  10i32 [1000000]i32 [1000000]i32 [1000000]f32:                         0.63x

futhark-benchmarks/micro/reduce_by_index.fut:sum_vec_i32
  10000i32 [10000]i32 [1000000]i32:                                     1.11x
  10000i32 [1000]i32 [1000000]i32:                                      1.09x
  10i32 [10000]i32 [1000000]i32:                                        1.13x (mem: 0.98x@device)
  10i32 [1000]i32 [1000000]i32:                                         1.12x (mem: 0.98x@device)

futhark-benchmarks/micro/scan-segmented.fut:sum_i32
  [10000000][1]i32:                                                     5.34x
  [1000000][10]i32:                                                     4.94x
  [100000][100]i32:                                                     4.86x
  [10000][1000]i32:                                                     4.86x
  [1000][10000]i32:                                                     4.72x
  [100][100000]i32:                                                     4.51x
  [10][1000000]i32:                                                     4.51x
  [1][10000000]i32:                                                     4.48x

futhark-benchmarks/micro/scan-segmented.fut:sum_iota_i32
  #0 ("1i32 10000000i32"):                                              3.68x
  #1 ("10i32 1000000i32"):                                              3.72x
  #2 ("100i32 100000i32"):                                              3.84x
  #3 ("1000i32 10000i32"):                                              3.99x
  #4 ("10000i32 1000i32"):                                              4.12x
  #5 ("100000i32 100i32"):                                              4.13x
  #6 ("1000000i32 10i32"):                                              4.16x
  #7 ("10000000i32 1i32"):                                              4.34x

futhark-benchmarks/micro/scan.fut:lss_f32
  [1000000]i32:                                                         1.09x
  [100000]i32:                                                          0.85x (mem: 0.96x@device)
  [10000]i32:                                                           0.74x (mem: 0.60x@device)

futhark-benchmarks/micro/scan.fut:lss_f64
  [1000000]i32:                                                         1.03x
  [100000]i32:                                                          0.83x (mem: 0.97x@device)
  [10000]i32:                                                           0.77x (mem: 0.69x@device)

futhark-benchmarks/micro/scan.fut:lss_i32
  [1000000]i32:                                                         1.09x
  [100000]i32:                                                          0.85x (mem: 0.96x@device)
  [10000]i32:                                                           0.74x (mem: 0.60x@device)

futhark-benchmarks/micro/scan.fut:lss_i8
  [1000000]i32:                                                         1.42x
  [100000]i32:                                                          0.95x (mem: 0.95x@device)
  [10000]i32:                                                           0.81x (mem: 0.49x@device)

futhark-benchmarks/micro/scan.fut:lss_iota_f32
  #0 ("10000i32"):                                                      0.74x (mem: 0.53x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.95x@device)
  #2 ("1000000i32"):                                                    1.04x
  #3 ("10000000i32"):                                                   1.12x

futhark-benchmarks/micro/scan.fut:lss_iota_f64
  #0 ("10000i32"):                                                      0.75x (mem: 0.65x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.97x@device)
  #2 ("1000000i32"):                                                    0.98x
  #3 ("10000000i32"):                                                   1.01x

futhark-benchmarks/micro/scan.fut:lss_iota_i32
  #0 ("10000i32"):                                                      0.74x (mem: 0.53x@device)
  #1 ("100000i32"):                                                     0.88x (mem: 0.95x@device)
  #2 ("1000000i32"):                                                    1.01x
  #3 ("10000000i32"):                                                   1.10x

futhark-benchmarks/micro/scan.fut:lss_iota_i8
  #0 ("10000i32"):                                                      0.79x (mem: 0.37x@device)
  #1 ("100000i32"):                                                     0.98x (mem: 0.94x@device)
  #2 ("1000000i32"):                                                    1.38x
  #3 ("10000000i32"):                                                   1.63x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_f32
  #0 ("10000i32"):                                                      0.80x (mem: 0.29x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.93x@device)
  #2 ("1000000i32"):                                                    0.74x
  #3 ("10000000i32"):                                                   0.71x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_f64
  #0 ("10000i32"):                                                      0.97x (mem: 0.65x@device)
  #1 ("100000i32"):                                                     1.13x (mem: 0.97x@device)
  #2 ("1000000i32"):                                                    1.58x
  #3 ("10000000i32"):                                                   1.87x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_i32
  #0 ("10000i32"):                                                      0.82x (mem: 0.29x@device)
  #1 ("100000i32"):                                                     0.87x (mem: 0.93x@device)
  #2 ("1000000i32"):                                                    0.71x
  #3 ("10000000i32"):                                                   0.68x

futhark-benchmarks/micro/scan.fut:prod_iota_mat4_i8
  #0 ("10000i32"):                                                      1.34x (mem: 0.00x@device)
  #1 ("100000i32"):                                                     1.34x (mem: 0.72x@device)
  #2 ("1000000i32"):                                                    1.30x (mem: 0.97x@device)
  #3 ("10000000i32"):                                                   1.84x

futhark-benchmarks/micro/scan.fut:prod_mat4_f32
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  0.72x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      0.84x (mem: 0.97x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.78x (mem: 0.65x@device)

futhark-benchmarks/micro/scan.fut:prod_mat4_f64
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.63x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.05x (mem: 0.98x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.95x (mem: 0.77x@device)

futhark-benchmarks/micro/scan.fut:prod_mat4_i32
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  0.70x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      0.85x (mem: 0.97x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.81x (mem: 0.65x@device)

futhark-benchmarks/micro/scan.fut:prod_mat4_i8
  [1000000]i32 [1000000]i32 [1000000]i32 [1000000]i32:                  1.31x
  [100000]i32 [100000]i32 [100000]i32 [100000]i32:                      1.00x (mem: 0.94x@device)
  [10000]i32 [10000]i32 [10000]i32 [10000]i32:                          0.93x (mem: 0.44x@device)

futhark-benchmarks/micro/scan.fut:sum_f32
  [100000000]i32:                                                       3.03x
  [10000000]i32:                                                        2.82x
  [1000000]i32:                                                         1.97x
  [100000]i32:                                                          1.28x (mem: 0.90x@device)
  [10000]i32:                                                           1.36x (mem: 0.14x@device)

futhark-benchmarks/micro/scan.fut:sum_f64
  [100000000]i32:                                                       2.26x
  [10000000]i32:                                                        2.30x
  [1000000]i32:                                                         1.75x
  [100000]i32:                                                          1.39x (mem: 0.94x@device)
  [10000]i32:                                                           1.27x (mem: 0.35x@device)

futhark-benchmarks/micro/scan.fut:sum_i32
  [100000000]i32:                                                       4.64x
  [10000000]i32:                                                        3.84x
  [1000000]i32:                                                         2.21x
  [100000]i32:                                                          1.59x (mem: 0.91x@device)
  [10000]i32:                                                           1.54x (mem: 0.24x@device)

futhark-benchmarks/micro/scan.fut:sum_i8
  [100000000]i32:                                                       7.15x
  [10000000]i32:                                                        4.83x
  [1000000]i32:                                                         2.30x (mem: 0.99x@device)
  [100000]i32:                                                          1.53x (mem: 0.88x@device)
  [10000]i32:                                                           1.53x (mem: 0.22x@device)

futhark-benchmarks/micro/scan.fut:sum_iota_f32
  #0 ("10000i32"):                                                      1.67x (mem: 0.23x@device)
  #1 ("100000i32"):                                                     1.73x (mem: 0.90x@device)
  #2 ("1000000i32"):                                                    2.32x
  #3 ("10000000i32"):                                                   3.98x
  #4 ("100000000i32"):                                                  4.51x

futhark-benchmarks/micro/scan.fut:sum_iota_f64
  #0 ("10000i32"):                                                      1.59x (mem: 0.52x@device)
  #1 ("100000i32"):                                                     1.59x (mem: 0.95x@device)
  #2 ("1000000i32"):                                                    1.94x
  #3 ("10000000i32"):                                                   2.87x
  #4 ("100000000i32"):                                                  2.90x

futhark-benchmarks/micro/scan.fut:sum_iota_i32
  #0 ("10000i32"):                                                      1.67x (mem: 0.45x@device)
  #1 ("100000i32"):                                                     1.66x (mem: 0.92x@device)
  #2 ("1000000i32"):                                                    2.31x
  #3 ("10000000i32"):                                                   4.15x
  #4 ("100000000i32"):                                                  4.74x

futhark-benchmarks/micro/scan.fut:sum_iota_i8
  #0 ("10000i32"):                                                      1.67x
  #1 ("100000i32"):                                                     1.56x (mem: 0.78x@device)
  #2 ("1000000i32"):                                                    2.32x (mem: 0.98x@device)
  #3 ("10000000i32"):                                                   5.62x
  #4 ("100000000i32"):                                                  7.98x

futhark-benchmarks/micro/scan.fut:sum_scaled_f32
  [100000000]i32:                                                       2.73x
  [10000000]i32:                                                        2.51x
  [1000000]i32:                                                         1.96x
  [100000]i32:                                                          1.27x (mem: 0.91x@device)
  [10000]i32:                                                           1.22x (mem: 0.12x@device)

futhark-benchmarks/micro/scan.fut:sum_scaled_f64
  [100000000]i32:                                                       2.05x
  [10000000]i32:                                                        2.10x
  [1000000]i32:                                                         1.74x
  [100000]i32:                                                          1.42x (mem: 0.94x@device)
  [10000]i32:                                                           1.36x (mem: 0.44x@device)

futhark-benchmarks/micro/scan.fut:sum_scaled_i32
  [100000000]i32:                                                       2.80x
  [10000000]i32:                                                        2.57x
  [1000000]i32:                                                         2.00x
  [100000]i32:                                                          1.26x (mem: 0.91x@device)
  [10000]i32:                                                           1.25x (mem: 0.18x@device)

futhark-benchmarks/micro/scan.fut:sum_scaled_i8
  [100000000]i32:                                                       5.10x
  [10000000]i32:                                                        3.89x
  [1000000]i32:                                                         1.87x (mem: 0.98x@device)
  [100000]i32:                                                          1.15x (mem: 0.84x@device)
  [10000]i32:                                                           1.15x (mem: 0.00x@device)

futhark-benchmarks/micro/transpose.fut:map_transpose_i32
  1000i32 100000i32 1i32 [100000000]i32:                                0.63x
  1000i32 1000i32 100i32 [100000000]i32:                                0.91x
  1000i32 1i32 100000i32 [100000000]i32:                                1.07x
  10i32 10000000i32 1i32 [100000000]i32:                                0.64x
  10i32 1000i32 10000i32 [100000000]i32:                                1.07x
  10i32 1i32 10000000i32 [100000000]i32:                                1.08x
  1i32 10000000i32 10i32 [100000000]i32:                                0.69x
  1i32 1000i32 100000i32 [100000000]i32:                                1.11x
  1i32 1i32 100000000i32 [100000000]i32:                                1.07x

futhark-benchmarks/micro/transpose.fut:map_transpose_i64
  1000i32 100000i32 1i32 [100000000]i64:                                0.91x
  1000i32 1000i32 100i32 [100000000]i64:                                1.01x
  1000i32 1i32 100000i32 [100000000]i64:                                1.04x
  10i32 10000000i32 1i32 [100000000]i64:                                0.91x
  10i32 1000i32 10000i32 [100000000]i64:                                1.05x
  10i32 1i32 10000000i32 [100000000]i64:                                1.04x
  1i32 10000000i32 10i32 [100000000]i64:                                1.00x
  1i32 1000i32 100000i32 [100000000]i64:                                1.05x
  1i32 1i32 100000000i32 [100000000]i64:                                1.04x

futhark-benchmarks/micro/transpose.fut:map_transpose_i8
  1000i32 100000i32 1i32 [100000000]i8:                                 0.60x
  1000i32 1000i32 100i32 [100000000]i8:                                 0.69x
  1000i32 1i32 100000i32 [100000000]i8:                                 1.04x
  10i32 10000000i32 1i32 [100000000]i8:                                 0.60x
  10i32 1000i32 10000i32 [100000000]i8:                                 0.68x
  10i32 1i32 10000000i32 [100000000]i8:                                 1.04x
  1i32 10000000i32 10i32 [100000000]i8:                                 0.64x
  1i32 1000i32 100000i32 [100000000]i8:                                 0.94x
  1i32 1i32 100000000i32 [100000000]i8:                                 1.04x

futhark-benchmarks/micro/transpose.fut:transpose_i32
  100000000i32 1i32 [100000000]i32:                                     0.77x
  10000000i32 10i32 [100000000]i32:                                     0.69x
  1000000i32 100i32 [100000000]i32:                                     0.96x
  100000i32 1000i32 [100000000]i32:                                     1.04x
  10000i32 10000i32 [100000000]i32:                                     1.05x
  1000i32 100000i32 [100000000]i32:                                     1.10x
  100i32 1000000i32 [100000000]i32:                                     1.01x
  10i32 10000000i32 [100000000]i32:                                     0.67x
  1i32 100000000i32 [100000000]i32:                                     1.07x
  25000000i32 4i32 [100000000]i32:                                      0.78x
  2i32 50000000i32 [100000000]i32:                                      0.64x
  4i32 25000000i32 [100000000]i32:                                      0.62x
  50000000i32 2i32 [100000000]i32:                                      0.77x

futhark-benchmarks/micro/transpose.fut:transpose_i64
  100000000i32 1i32 [100000000]i64:                                     1.04x
  10000000i32 10i32 [100000000]i64:                                     0.99x
  1000000i32 100i32 [100000000]i64:                                     1.03x
  100000i32 1000i32 [100000000]i64:                                     1.05x
  10000i32 10000i32 [100000000]i64:                                     1.06x
  1000i32 100000i32 [100000000]i64:                                     1.05x
  100i32 1000000i32 [100000000]i64:                                     1.03x
  10i32 10000000i32 [100000000]i64:                                     0.96x
  1i32 100000000i32 [100000000]i64:                                     1.04x
  25000000i32 4i32 [100000000]i64:                                      1.04x
  2i32 50000000i32 [100000000]i64:                                      0.99x
  4i32 25000000i32 [100000000]i64:                                      0.97x
  50000000i32 2i32 [100000000]i64:                                      1.06x

futhark-benchmarks/micro/transpose.fut:transpose_i8
  100000000i32 1i32 [100000000]i8:                                      0.65x
  10000000i32 10i32 [100000000]i8:                                      0.64x
  1000000i32 100i32 [100000000]i8:                                      0.67x
  100000i32 1000i32 [100000000]i8:                                      0.67x
  10000i32 10000i32 [100000000]i8:                                      0.68x
  1000i32 100000i32 [100000000]i8:                                      0.94x
  100i32 1000000i32 [100000000]i8:                                      0.91x
  10i32 10000000i32 [100000000]i8:                                      0.54x
  1i32 100000000i32 [100000000]i8:                                      1.07x
  25000000i32 4i32 [100000000]i8:                                       0.49x
  2i32 50000000i32 [100000000]i8:                                       0.62x
  4i32 25000000i32 [100000000]i8:                                       0.62x
  50000000i32 2i32 [100000000]i8:                                       0.52x

futhark-benchmarks/misc/bfast/bfast-cloudy.fut
  data/africa.in:                                                       1.14x (mem: 0.60x@device)
  data/peru.in:                                                         1.00x
  data/sahara-cloudy.in:                                                1.22x (mem: 0.60x@device)

futhark-benchmarks/misc/bfast/bfast.fut
  data/sahara.in:                                                       1.13x

futhark-benchmarks/misc/heston/heston32.fut
  data/100000_quotes.in:                                                1.09x (mem: 0.97x@device)
  data/10000_quotes.in:                                                 1.05x
  data/1062_quotes.in:                                                  1.09x

futhark-benchmarks/misc/heston/heston64.fut
  data/100000_quotes.in:                                                1.07x
  data/10000_quotes.in:                                                 1.00x
  data/1062_quotes.in:                                                  0.92x

futhark-benchmarks/misc/knn-by-kdtree/buildKDtree.fut
  valid-data/kdtree-ppl-32-m-2097152.in:                                1.34x

futhark-benchmarks/misc/knn-by-kdtree/driver-knn.fut
  256i32 [2097152][7]f32 [10000000][7]f32:                              1.28x

futhark-benchmarks/misc/ocean-sim/tke.fut
  [200][200][100]f32 [200][200][100]f32 [200][200][100]f32 [...]:       0.85x
  data/tke32-small.in:                                                  1.13x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNested
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      0.86x
  data/tridiag32-small.in:                                              1.40x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNestedConst
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      0.85x
  data/tridiag32-small.in:                                              1.39x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNestedSeq
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      1.04x
  data/tridiag32-small.in:                                              1.07x

futhark-benchmarks/misc/ocean-sim/tridiag-test.fut:tridagNestedSeqConst
  [57600][115]f32 [57600][115]f32 [57600][115]f32 [57600][115]f32:      1.08x
  data/tridiag32-small.in:                                              1.07x

futhark-benchmarks/misc/poseidon/poseidon-bench.fut:arity11
  [17600000]u64:                                                        1.15x

futhark-benchmarks/misc/poseidon/poseidon-bench.fut:arity8
  [22400000]u64:                                                        0.58x

futhark-benchmarks/parboil/histo/histo.fut
  data/default.in:                                                      1.31x
  data/large.in:                                                        1.29x

futhark-benchmarks/parboil/lbm/lbm.fut
  data/120_120_150_ldc.in:                                              0.41x

futhark-benchmarks/parboil/mri-q/mri-q.fut
  data/large.in:                                                        1.08x
  data/small.in:                                                        1.12x

futhark-benchmarks/parboil/sgemm/sgemm.fut
  data/medium.in:                                                       0.67x
  data/small.in:                                                        1.50x
  data/tiny.in:                                                         1.46x

futhark-benchmarks/parboil/stencil/stencil.fut
  data/default.in:                                                      1.03x
  data/small.in:                                                        1.10x

futhark-benchmarks/parboil/tpacf/tpacf.fut
  data/large.in:                                                        2.47x
  data/medium.in:                                                       2.46x
  data/small.in:                                                        0.99x (mem: 0.97x@device)

futhark-benchmarks/pbbs/breadthFirstSearch/breadthFirstSearch.fut
  data/3Dgrid_J_64000000.in:                                            1.04x
  data/rMatGraph_J_12_16000000.in:                                      1.37x
  data/randLocalGraph_J_10_20000000.in:                                 1.38x

futhark-benchmarks/pbbs/comparisonSort/merge_sort.fut:sort_f64
  data/almostSortedSeq_100M.in:                                         1.00x
  data/exptSeq_100M.in:                                                 1.00x
  data/randomSeq_100M.in:                                               1.00x

futhark-benchmarks/pbbs/comparisonSort/merge_sort.fut:sort_f64_pair
  data/randomSeq_100M_double_pair_double.in:                            1.00x

futhark-benchmarks/pbbs/comparisonSort/quick_sort.fut:sort_f64
  data/almostSortedSeq_100M.in:                                         1.10x
  data/exptSeq_100M.in:                                                 1.10x
  data/randomSeq_100M.in:                                               1.09x

futhark-benchmarks/pbbs/comparisonSort/quick_sort.fut:sort_f64_pair
  data/randomSeq_100M_double_pair_double.in:                            1.11x

futhark-benchmarks/pbbs/convexHull/convexhull.fut
  data/2DinSphere_100K.in:                                              0.68x
  data/2DinSphere_100M.in:                                              1.02x
  data/2DinSphere_10K.in:                                               0.69x
  data/2DinSphere_10M.in:                                               0.76x
  data/2DinSphere_1M.in:                                                0.71x
  data/2Dkuzmin_100K.in:                                                0.65x
  data/2Dkuzmin_100M.in:                                                1.18x
  data/2Dkuzmin_10K.in:                                                 0.72x
  data/2Dkuzmin_10M.in:                                                 0.83x
  data/2Dkuzmin_1M.in:                                                  0.64x
  data/2DonSphere_100K.in:                                              0.69x
  data/2DonSphere_100M.in:                                              1.26x
  data/2DonSphere_10K.in:                                               0.67x
  data/2DonSphere_10M.in:                                               1.01x
  data/2DonSphere_1M.in:                                                0.77x

futhark-benchmarks/pbbs/histogram/histogram.fut
  almostEqualSeq_100M:                                                  1.00x
  exptSeq_100M:                                                         0.98x
  randomSeq_100M:                                                       0.99x
  randomSeq_100M_100K:                                                  0.90x
  randomSeq_100M_256:                                                   3.34x

futhark-benchmarks/pbbs/integerSort/radix_sort.fut:sort_i32
  exptSeq_100M_int:                                                     1.60x
  randomSeq_100M_int:                                                   1.58x

futhark-benchmarks/pbbs/integerSort/radix_sort.fut:sort_i32_pair
  randomSeq_100M_256_int_pair_int:                                      1.37x
  randomSeq_100M_int_pair_int:                                          1.42x

futhark-benchmarks/pbbs/maximalIndependentSet/maximalIndependentSet.fut
  data/3Dgrid_JR_64000000.in:                                           0.98x
  data/rMatGraph_JR_12_16000000.in:                                     1.11x
  data/randLocalGraph_JR_10_20000000.in:                                0.91x

futhark-benchmarks/pbbs/maximalMatching/maximalMatching.fut
  data/2Dgrid_E_64000000.in:                                            1.02x
  data/rMatGraph_E_10_20000000.in:                                      1.03x
  data/randLocalGraph_E_10_20000000.in:                                 1.04x

futhark-benchmarks/pbbs/minSpanningForest/minSpanningForest.fut
  data/3Dgrid_WE_8000000.in:                                            1.00x
  data/rMatGraph_WE_12_2250000.in:                                      0.98x
  data/randLocalGraph_WE_10_2000000.in:                                 0.99x

futhark-benchmarks/pbbs/ray/ray.fut
  data/angel.in:                                                        0.98x
  data/dragon.in:                                                       0.93x
  data/happy.in:                                                        0.96x

futhark-benchmarks/rodinia/backprop/backprop.fut
  data/medium.in:                                                       1.07x
  data/small.in:                                                        1.27x

futhark-benchmarks/rodinia/bfs/bfs_asympt_ok_but_slow.fut
  data/4096nodes.in:                                                    1.08x
  data/512nodes_high_edge_variance.in:                                  1.09x
  data/64kn_32e-var-1-256-skew.in:                                      1.55x
  data/graph1MW_6.in:                                                   1.22x

futhark-benchmarks/rodinia/bfs/bfs_filt_padded_fused.fut
  data/4096nodes.in:                                                    1.11x
  data/512nodes_high_edge_variance.in:                                  1.14x (mem: 0.97x@device)
  data/64kn_32e-var-1-256-skew.in:                                      1.04x (mem: 0.00x@device)
  data/graph1MW_6.in:                                                   1.24x

futhark-benchmarks/rodinia/bfs/bfs_heuristic.fut
  data/4096nodes.in:                                                    1.12x
  data/512nodes_high_edge_variance.in:                                  1.13x (mem: 0.93x@device)
  data/64kn_32e-var-1-256-skew.in:                                      1.04x
  data/graph1MW_6.in:                                                   1.22x

futhark-benchmarks/rodinia/bfs/bfs_iter_work_ok.fut
  data/4096nodes.in:                                                    1.16x
  data/512nodes_high_edge_variance.in:                                  1.20x (mem: 0.71x@device)
  data/64kn_32e-var-1-256-skew.in:                                      1.07x
  data/graph1MW_6.in:                                                   1.26x

futhark-benchmarks/rodinia/cfd/cfd.fut
  data/fvcorr.domn.097K.toa:                                            1.02x
  data/fvcorr.domn.193K.toa:                                            1.01x

futhark-benchmarks/rodinia/hotspot/hotspot.fut
  data/1024.in:                                                         1.11x
  data/512.in:                                                          1.01x
  data/64.in:                                                           1.81x

futhark-benchmarks/rodinia/kmeans/kmeans.fut
  data/100.in:                                                          1.24x
  data/204800.in:                                                       0.96x (mem: 0.99x@device)
  data/kdd_cup.in:                                                      1.14x

futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
  data/10_boxes.in:                                                     1.16x
  data/3_boxes.in:                                                      0.70x

futhark-benchmarks/rodinia/lud/lud.fut
  data/16by16.in:                                                       1.01x
  data/2048.in:                                                         0.97x
  data/256.in:                                                          0.97x
  data/512.in:                                                          0.97x
  data/64.in:                                                           0.94x

futhark-benchmarks/rodinia/myocyte/myocyte.fut
  data/medium.in:                                                       1.72x
  data/small.in:                                                        0.85x

futhark-benchmarks/rodinia/nn/nn.fut
  data/medium.in:                                                       0.96x

futhark-benchmarks/rodinia/nw/nw.fut
  data/large.in:                                                        1.04x
  data/medium.in:                                                       1.03x
  data/small.in:                                                        1.31x
  data/tiny.in:                                                         1.36x

futhark-benchmarks/rodinia/particlefilter/particlefilter.fut
  data/128_128_10_image_10000_particles.in:                             1.07x
  data/128_128_10_image_400000_particles.in:                            1.10x

futhark-benchmarks/rodinia/pathfinder/pathfinder.fut
  data/medium.in:                                                       1.12x

futhark-benchmarks/rodinia/srad/srad.fut
  data/image.in:                                                        0.63x

futhark-benchmarks/rsbench/rsbench.fut
  data/large.in:                                                        1.48x
  data/small.in:                                                        1.35x

futhark-benchmarks/xsbench/xsbench.fut
  data/large.in:                                                        1.02x
  data/small.in:                                                        0.98x

Some things are substantially faster (e.g. FFT). I think this is because the HIP backend allows up to 1024 threads in a thread block (compared to only 256 for AMD's OpenCL implementation), which allows intragroup parallelism to apply.
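
To make the block-size argument concrete, here is a minimal sketch of the reasoning, not the compiler's actual heuristic. It assumes a simplified model in which an inner parallel loop can run inside a single thread block only when its size does not exceed the backend's block-size limit. The two limits are the ones quoted above; the function name and the sample sizes are hypothetical.

```python
# Simplified model (an assumption, not Futhark's real decision procedure):
# an inner parallel loop can be executed by one thread block only if it
# needs no more threads than the backend permits per block.

OPENCL_MAX_BLOCK = 256   # limit quoted above for AMD's OpenCL implementation
HIP_MAX_BLOCK = 1024     # limit quoted above for the HIP backend


def fits_in_block(inner_size: int, max_block: int) -> bool:
    """Return True if `inner_size` threads fit in a single thread block."""
    return 0 < inner_size <= max_block


# Hypothetical inner sizes, chosen only to show where the two limits differ.
for inner in (128, 256, 512, 1024, 2048):
    print(inner,
          fits_in_block(inner, OPENCL_MAX_BLOCK),
          fits_in_block(inner, HIP_MAX_BLOCK))
```

Under this model, inner sizes of 512 and 1024 fit in a block only with the HIP limit, while sizes up to 256 fit with both.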

Some are strangely slower (e.g. mandelbrot). I'll have to look into it. It may be something simple like not properly querying for how many threads to launch.

Everything that depends on scans is faster, as the HIP backend uses the highly tuned single-pass scan code generation.

I also see that e.g. sgemm is a lot slower than with OpenCL. This also merits further investigation.

But overall, this backend looks pretty operational to me. Certainly worth using for some programs, and with some tweaks we can probably make it superior to the OpenCL backend in all cases, on AMD hardware.
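
For a single-number summary of a report like the one above, one option is the geometric mean of the runtime ratios. The sketch below is only an illustration and assumes the layout seen in the table: benchmark names start in column 0, dataset rows are indented, and the first `<number>x` after the label's colon is the runtime ratio, where values above 1 mean HIP is faster. The helper names are hypothetical, and the sample rows are copied from the table.

```python
import math
import re

# Runtime ratio: the first "<number>x" following the colon that ends the
# dataset label. Any "(mem: ...x@device)" suffix is ignored.
RATIO_RE = re.compile(r":\s+(\d+(?:\.\d+)?)x")


def parse_ratios(report: str) -> list[float]:
    """Collect the runtime ratios from the indented dataset rows."""
    ratios = []
    for line in report.splitlines():
        # Benchmark names start in column 0; dataset rows are indented.
        if not line.startswith(" "):
            continue
        match = RATIO_RE.search(line)
        if match:
            ratios.append(float(match.group(1)))
    return ratios


def geometric_mean(values: list[float]) -> float:
    """Geometric mean, a common way to average ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# A few rows copied from the table above.
SAMPLE = """\
futhark-benchmarks/parboil/tpacf/tpacf.fut
  data/large.in:                                                        2.47x
  data/small.in:                                                        0.99x (mem: 0.97x@device)

futhark-benchmarks/rodinia/srad/srad.fut
  data/image.in:                                                        0.63x
"""

ratios = parse_ratios(SAMPLE)
print(ratios, round(geometric_mean(ratios), 2))
```

The geometric mean is used rather than the arithmetic mean so that a 2x speedup and a 2x slowdown cancel out.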

First draft of a HIP backend.

f120446

athas changed the title ~~First draft of a HIP backend.~~ Add HIP backend Aug 11, 2023

athas added 28 commits August 15, 2023 10:24

Merge branch 'master' into hip

b663d80

Test HIP in CI.

02dc097

Add manpage for hip backend.

91787a3

More consistent this way.

7503cdf

Put attribute in front of Futhark functions.

34ed897

Hack around codegen deficiency.

6ef5f4c

Also single-pass scan for HIP.

4a83e4c

These tests also do not work on HIP.

36f5b55

These need to be error syncs, not just barriers.

f98b35b

Merge branch 'master' into hip

05aafe9

Merge branch 'master' into hip

38fdc30

Merge branch 'master' into hip

d288878

Do not allow assertions in operator for single-pass scan.

55318ee

This must be an error sync.

615e5f2

Merge branch 'master' into hip

ecfbd26

Ensure zero-termination.

0445786

Ensure we error sync before heading into control flow.

cd0ea6f

Map function might have errors; sync here.

667eaed

Also benchmark HIP backend.

beebbd9

Merge branch 'master' into hip

cb6d433

Not necessarily OpenCL.

422f0ee

Do not warn about unused constants.

a4793bf

This function takes no arguments.

b9ceadb

Ignore another clang warning.

f6fec8e

Use correct macro.

25cb1a5

Implement device setup.

a9845b5

Merge branch 'master' into hip

0bd3cb9

athas marked this pull request as ready for review August 17, 2023 11:11

athas added 7 commits August 17, 2023 13:19

Oops, this should be hip.

02f3752

Merge branch 'master' into hip

9aa82e4

Merge branch 'master' into hip

7ddcd83

Mention CUDA backend here.

0921b84

Add to CHANGELOG.

f7ee86e

Support HIP in library tests.

8ef90ab

Avoid huge expressions.

3a8002c

athas added the run-benchmarks label ("Makes GA run the benchmark suite.") Aug 17, 2023

Also store HIP results.

825cb3f

athas added 3 commits August 17, 2023 21:40

Unify this for all backends.

195bc73

Move __CUDA_ARCH__ check to right place.

ae80c3e

No such thing as -arch in HIP.

ea758c8

athas merged commit befe604 into master Aug 18, 2023

athas deleted the hip branch August 18, 2023 11:08
