
Different result for personal argmax on CPU and GPU if array size is large enough #476

rveltz opened this issue Nov 5, 2024 · 2 comments


rveltz commented Nov 5, 2024

Hi,
I tried coding an argmax with KernelAbstractions for a particle simulation. Sadly, the results from the Metal and CPU backends differ.
Basically, I have a `field::Array{Float32, 4}` and I want to compute `argmax(field[x1,x2,x3,:])` for many (`Nnmc` of them) index triples `(x1,x2,x3)` in parallel. In the code below, the triple is fixed to `(x1, x2, x3) = (1, 1, 1)`, so every thread should return the same index.

I found that the argmax differs depending on whether the code is run on the CPU or on Metal, and only when `field` is large enough. This is the bulk of the issue.
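For concreteness, here is the per-thread computation the kernel is meant to reproduce, written as a plain serial function and checked against Base's `argmax` (a minimal CPU-only sketch; the small array shape is just a stand-in for the real sizes):

```julia
# The kernel's per-thread loop as a plain function, compared with Base's argmax.
function loop_argmax(field)
    ind_u = 0
    val_max = 0f0              # same initialisation as in the kernel below
    for ii in axes(field, 4)
        v = field[1, 1, 1, ii]
        if v > val_max
            val_max = v
            ind_u = ii
        end
    end
    return ind_u
end

field = rand(Float32, 4, 4, 4, 16)
@assert loop_argmax(field) == argmax(@view field[1, 1, 1, :])
```

With strictly positive random entries, the running-max loop starting from `0f0` and Base's `argmax` must agree, so both backends should return this same index for every thread.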

using Revise, LinearAlgebra
using Metal
using KernelAbstractions

function _sample_gpu(field;
                     Nnmc = 1000,
                     TA = Array)
    result = TA(zeros(Float32, 2, Nnmc))
    npb = size(field, 4)
    # launch the kernel on the backend matching `result`
    backend = get_backend(result)
    nth = backend isa KernelAbstractions.GPU ? 256 : 8
    kernel! = _sample_mtl!(backend, nth)
    kernel!(result, TA(field), npb, ndrange = Nnmc)
    KernelAbstractions.synchronize(backend)  # kernel launches are asynchronous
    result
end

@kernel function _sample_mtl!(result, @Const(field), nd)
    nₙₘ = @index(Global)
    voxel₁ = voxel₂ = voxel₃ = 1
    # compute argmax of field[voxel₁, voxel₂, voxel₃, :]
    _val_max::Float32 = 0f0
    ind_u = 0
    for ii in axes(field, 4)
        val = field[voxel₁, voxel₂, voxel₃, ii]
        if val > _val_max
            _val_max = val
            ind_u = ii
        end
    end

    result[1, nₙₘ] = nₙₘ
    # save argmax
    result[2, nₙₘ] = ind_u
end

all_od = rand(Float32, 100, 108, 100, 1000);
res_a = _sample_gpu(all_od)

res_g = _sample_gpu(all_od, TA = MtlArray) |> Array

norm(res_g[2,:] - res_a[2,:], Inf)
# returns 232.0f0

If `field` is smaller, the discrepancy disappears:

all_od = rand(Float32, 100, 107, 100, 1000);
res_a = _sample_gpu(all_od)

res_g = _sample_gpu(all_od, TA = MtlArray) |> Array

norm(res_g[2,:] - res_a[2,:], Inf)
# returns 0.0f0
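One way to localise the discrepancy (a suggestion, not part of the original report) is to compare each backend against Base's `argmax` on the same data, rather than against each other; this assumes the `_sample_gpu` definition above is already loaded. Row 2 of the result should be constant and equal to the reference index on both backends:

```julia
using LinearAlgebra, Metal

all_od = rand(Float32, 100, 108, 100, 1000);

# Reference index for the fixed voxel, computed with Base on the CPU.
ref = argmax(@view all_od[1, 1, 1, :])

# Check each backend against the reference.
res_a = _sample_gpu(all_od)
res_g = _sample_gpu(all_od, TA = MtlArray) |> Array
all(==(Float32(ref)), res_a[2, :])   # CPU backend vs Base
all(==(Float32(ref)), res_g[2, :])   # Metal backend vs Base
```

Whichever backend fails this check is the one computing the wrong index.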
christiangnrd (Contributor) commented:

Can you share the output of

using Metal
Metal.versioninfo()

?


rveltz commented Nov 6, 2024

julia> using Metal

julia> Metal.versioninfo()
macOS 14.7.0, Darwin 23.6.0

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.4.2
- GPUArrays: 10.3.1
- GPUCompiler: 0.27.8
- KernelAbstractions: 0.9.29
- ObjectiveC: 3.1.0
- LLVM: 9.1.3
- LLVMDowngrader_jll: 0.3.0+2

1 device:
- Apple M2 Max (64.000 KiB allocated)
