
Different result for personal argmax on CPU and GPU if array size is large enough #476

rveltz opened this issue Nov 5, 2024 · 2 comments


rveltz commented Nov 5, 2024

Hi,
I tried coding an argmax with KernelAbstractions for a particle simulation. Sadly, the results from the Metal and CPU backends differ.
Basically, I have a `field::Array{Float32, 4}` and I want to compute `argmax(field[x1,x2,x3,:])` for many (`Nnmc` of them) index triples `(x1,x2,x3)` in parallel. In the code below, the triple is fixed to `(x1, x2, x3) = (1, 1, 1)`, so every thread should return the same index.

I found that the argmax differs depending on whether the code is run on the CPU or on Metal, and only when `field` is large enough. This is the bulk of the issue.
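For concreteness, here is the per-thread computation the kernel is meant to reproduce, written as a plain serial function and checked against Base's `argmax` (a minimal CPU-only sketch; the small array shape is just a stand-in for the real sizes):

```julia
# The kernel's per-thread loop as a plain function, compared with Base's argmax.
function loop_argmax(field)
    ind_u = 0
    val_max = 0f0              # same initialisation as in the kernel below
    for ii in axes(field, 4)
        v = field[1, 1, 1, ii]
        if v > val_max
            val_max = v
            ind_u = ii
        end
    end
    return ind_u
end

field = rand(Float32, 4, 4, 4, 16)
@assert loop_argmax(field) == argmax(@view field[1, 1, 1, :])
```

With strictly positive random entries, the running-max loop starting from `0f0` and Base's `argmax` must agree, so both backends should return this same index for every thread.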

using Revise, LinearAlgebra
using Metal
using KernelAbstractions

function _sample_gpu(field;
                     Nnmc = 1000,
                     TA = Array)
    result = TA(zeros(Float32, 2, Nnmc))
    npb = size(field, 4)
    # launch the kernel on the backend matching `result`
    backend = get_backend(result)
    nth = backend isa KernelAbstractions.GPU ? 256 : 8
    kernel! = _sample_mtl!(backend, nth)
    kernel!(result, TA(field), npb, ndrange = Nnmc)
    KernelAbstractions.synchronize(backend)  # kernel launches are asynchronous
    result
end

@kernel function _sample_mtl!(result, @Const(field), nd)
    nₙₘ = @index(Global)
    voxel₁ = voxel₂ = voxel₃ = 1
    # compute argmax of field[voxel₁, voxel₂, voxel₃, :]
    _val_max::Float32 = 0f0
    ind_u = 0
    for ii in axes(field, 4)
        val = field[voxel₁, voxel₂, voxel₃, ii]
        if val > _val_max
            _val_max = val
            ind_u = ii
        end
    end

    result[1, nₙₘ] = nₙₘ
    # save argmax
    result[2, nₙₘ] = ind_u
end

all_od = rand(Float32, 100, 108, 100, 1000);
res_a = _sample_gpu(all_od)

res_g = _sample_gpu(all_od, TA = MtlArray) |> Array

norm(res_g[2,:] - res_a[2,:], Inf)
# returns 232.0f0

If `field` is smaller, the discrepancy disappears:

all_od = rand(Float32, 100, 107, 100, 1000);
res_a = _sample_gpu(all_od)

res_g = _sample_gpu(all_od, TA = MtlArray) |> Array

norm(res_g[2,:] - res_a[2,:], Inf)
# returns 0.0f0
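One way to localise the discrepancy (a suggestion, not part of the original report) is to compare each backend against Base's `argmax` on the same data, rather than against each other; this assumes the `_sample_gpu` definition above is already loaded. Row 2 of the result should be constant and equal to the reference index on both backends:

```julia
using LinearAlgebra, Metal

all_od = rand(Float32, 100, 108, 100, 1000);

# Reference index for the fixed voxel, computed with Base on the CPU.
ref = argmax(@view all_od[1, 1, 1, :])

# Check each backend against the reference.
res_a = _sample_gpu(all_od)
res_g = _sample_gpu(all_od, TA = MtlArray) |> Array
all(==(Float32(ref)), res_a[2, :])   # CPU backend vs Base
all(==(Float32(ref)), res_g[2, :])   # Metal backend vs Base
```

Whichever backend fails this check is the one computing the wrong index.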
christiangnrd (Contributor) commented:

Can you share the output of

using Metal
Metal.versioninfo()

?


rveltz commented Nov 6, 2024

julia> using Metal

julia> Metal.versioninfo()
macOS 14.7.0, Darwin 23.6.0

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.4.2
- GPUArrays: 10.3.1
- GPUCompiler: 0.27.8
- KernelAbstractions: 0.9.29
- ObjectiveC: 3.1.0
- LLVM: 9.1.3
- LLVMDowngrader_jll: 0.3.0+2

1 device:
- Apple M2 Max (64.000 KiB allocated)
