New native half-precision floating-point arithmetic not working on Ampere Altra (Aarch64 with FP16 enable architecture) #49987

dpanigo · 2023-05-29T16:02:37Z

dpanigo
May 29, 2023

Julia v.1.9.0 has amazing new features (native code caching, pkg extensions, new sorting algorithms, etc.)
For econometric purposes, one of the most salient upgrades is native FP16 arithmetic.
Theoretically, it is a game changer for training models in economics where single/double precision requirements are unusual (and range issues can be deal with using standardized data).
In this post (https://julialang.org/blog/2023/04/julia-1.9-highlights/#native_half-precision_floating-point_arithmetic) it is emphasized that this new feature is only available for hardware with the appropriate architecture (Aarch64 with FP16 ALUs. like Apple's M series or Fujitsu's A64FX).
I'm not sure if I'm doing something wrong but I was trying to use this new feature through the t2a-standard-4 Google Cloud Machine (with Ampere Altra Aarch64 CPUs, with -allegedly- support for FP16 arithmetic). Unfortunately, my Julia code (see below) was unable to obtain the expected runtime gains. On the contrary, Matrix Factorization (particularly the Cholesky Decomposition) is significantly slower with FP16 operations (as in the x86 architecture, showing that FP16 ALUs are not being used).
Has anyone tested this new Julia’s feature in other Aarch64 hardware?
Thanks in advance, Demian

############################
## TESTING CODE
############################
using Distributions, LinearAlgebra, BenchmarkTools, DataFrames, CSV

function CREATE_DATABASE(N, K, T)
    expvars_data = Array{T}(rand(N, K))
    return expvars_data
end

function FACT(expvars_data, method)
    if method == "SVD"
        fact = svd(expvars_data)
    elseif method == "QR"
        fact = qr(expvars_data)
    elseif method == "Cholesky"
        fact = cholesky(expvars_data'expvars_data)
    else
        error("Invalid method specified")
    end
    return [fact]
end

#depvar_data, expvars_data=CREATE_DATABASE(100, 25, T[1])
#LS(depvar_data, expvars_data, "SVD")

function benchmark_FACT(expvars_data)
    results = DataFrame(zeros(3, 1), :auto)
    results[1, 1] = mean(@benchmark(FACT(expvars_data, "SVD"))).time
    results[2, 1] = mean(@benchmark(FACT(expvars_data, "QR"))).time
    try
        results[3, 1] = mean(@benchmark(FACT(expvars_data, "Cholesky"))).time
    catch
        results[3, 1] = NaN
    end
    return results
end

## Inicialization
N=0
K=0
expvars_data=0.
results=0.

#SMALL DATA TIMMINGS
Nsmall = [1000, 500, 100]
Ksmall = [25, 20, 15]
Type=[Float64, Float32, Float16]
timmings=DataFrame()
for i in Nsmall
    for j in Ksmall
        for h in Type
            println("Benchmark results for N = $i, K = $j and T = $h:")
            N=i
            K=j
            T=h
            expvars_data = CREATE_DATABASE(N, K, T)
            results=benchmark_FACT(expvars_data)
            timmings=hcat(timmings, results)
            rename!(timmings,:x1 => ":t-$N-$K-$T")
            display(timmings)
        end
    end
end

CSV.write("methods_timmings_FACTORIZATION_small.csv", timmings)

vchuravy · 2023-05-29T17:18:28Z

vchuravy
May 29, 2023
Maintainer

What is versioninfo(verbose=true)

and please use ``` for code -formatting

3 replies

dpanigo May 29, 2023
Author

Valentin:
The output of versioninfo(verbose=true) is:

Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (aarch64-linux-gnu)
      Ubuntu 23.04
  uname: Linux 6.2.0-1005-gcp #5-Ubuntu SMP Thu Apr  6 11:01:32 UTC 2023 aarch64 aarch64
  CPU: Neoverse-N1: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz         12 s          0 s         21 s        721 s          0 s
       #2     0 MHz          9 s          0 s         26 s        722 s          0 s
       #3     0 MHz         10 s          0 s         18 s        727 s          0 s
       #4     0 MHz         19 s          0 s         28 s        692 s          0 s
  Memory: 15.575454711914062 GB (15379.66015625 MB free)
  Uptime: 76.23 sec
  Load Avg:  0.06  0.03  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, neoverse-n1)
  Threads: 1 on 4 virtual cores

dpanigo May 29, 2023
Author

I meant... " my code doesn't have any formating".
Sorry again

dpanigo May 30, 2023
Author

Valentin...here is some additional information about the Google Cloud Machine
Machine configuration
Machine type: t2a-standard-4
CPU platform: Altra Ampere
Architecture: ARM64

oscardssmith · 2023-05-29T17:31:53Z

oscardssmith
May 29, 2023
Collaborator

I believe the problem here is that BLAS/LAPACK do not have 16 bit support yet. It might work if you use https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl.

2 replies

dpanigo May 29, 2023
Author

I will try that Oscar, Thank you very much

dpanigo May 30, 2023
Author

Dear Oscar:
Unfortunately, RecursiveFactorization.lu() is (also) unable to use the -allegedly available- FP16 ALUs of Ampere Altra machines offered in Google Cloud.
Not sure if the problem is with these machines or there is some compatibility issue with Julia native half-precision floating-point arithmetic.
I provide the code below for reproducibility purposes.

using Distributions, LinearAlgebra, BenchmarkTools, DataFrames, CSV, RecursiveFactorization

function CREATE_DATABASE(N, K, T)
    expvars_data = Array{T}(rand(N, K))
    return expvars_data
end

function FACT(expvars_data, method)
    if method == "SVD"
        fact = svd(expvars_data)
    elseif method == "QR"
        fact = qr(expvars_data)
    elseif method == "Cholesky"
        fact = cholesky(expvars_data'expvars_data)
    elseif method == "LU"
        fact = lu(expvars_data'expvars_data)
    elseif method == "LU2"
        fact = RecursiveFactorization.lu(expvars_data'expvars_data)
    else
        error("Invalid method specified")
    end
    return [fact]
end

function benchmark_FACT(expvars_data)
    results = DataFrame(zeros(5, 1), :auto)
    results[1, 1] = mean(@benchmark(FACT(expvars_data, "SVD"))).time
    results[2, 1] = mean(@benchmark(FACT(expvars_data, "QR"))).time
    try
        results[3, 1] = mean(@benchmark(FACT(expvars_data, "Cholesky"))).time
    catch
        results[3, 1] = NaN
    end
    try
        results[4, 1] = mean(@benchmark(FACT(expvars_data, "LU"))).time
    catch
        results[4, 1] = NaN
    end
    try
        results[5, 1] = mean(@benchmark(FACT(expvars_data, "LU2"))).time
    catch
        results[5, 1] = NaN
    end
    return results
end

N=0
K=0
expvars_data=0.
results=0.

Nsmall = [1000]
Ksmall = [25]
Type=[Float64, Float32, Float16]
timmings=DataFrame()
for i in Nsmall
    for j in Ksmall
        for h in Type
            println("Benchmark results for N = $i, K = $j and T = $h:")
            N=i
            K=j
            T=h
            expvars_data = CREATE_DATABASE(N, K, T)
            results=benchmark_FACT(expvars_data)
            timmings=hcat(timmings, results)
            rename!(timmings,:x1 => ":t-$N-$K-$T")
            display(timmings)
        end
    end
end

CSV.write("methods_timmings_FACTORIZATION_small.csv", timmings)

dpanigo · 2023-05-30T21:57:23Z

dpanigo
May 30, 2023
Author

As additional information, I found the same issue with basic operations (matrix multiplication, division or transpose).
See the code below (this time with the required formating -apologies for copy/paste without formating-).

using Distributions, LinearAlgebra, BenchmarkTools, DataFrames, CSV

function CREATE_DATABASE(N, K, T)
    matA = Array{T}(rand(K, N))
    matB = Array{T}(rand(N, K))
    return matA, matB
end

function BASIC(matA, matB, option)
   if option == 1
        res = matB*matA
    elseif option == 2
        res = matA*matB
    elseif option == 3
        res = matB/matA'
    end
    return [res]
end

function benchmark_BASIC(matA, matB)
    results = DataFrame(zeros(3, 1), :auto)
    results[1, 1] = mean(@benchmark(BASIC(matA,matB,1))).time
    results[2, 1] = mean(@benchmark(BASIC(matA,matB,2))).time
    results[3, 1] = mean(@benchmark(BASIC(matA,matB,3))).time
    return results
end

# Inicialization
N=0
K=0
matA=0.
matB=0.
results=0.

#SMALL DATA TIMMINGS
#Nsmall = [1000, 500, 100]
Nsmall = [1000]
#Ksmall = [25, 20, 15]
Ksmall = [100]
Type=[Float64, Float32, Float16]
timmings=DataFrame()
for i in Nsmall
    for j in Ksmall
        for h in Type
            println("Benchmark results for N = $i, K = $j and T = $h:")
            N=i
            K=j
            T=h
            matA, matB = CREATE_DATABASE(N, K, T)
            results=benchmark_BASIC(matA,matB)
            timmings=hcat(timmings, results)
            rename!(timmings,:x1 => ":t-$N-$K-$T")
            display(timmings)
        end
    end
end

CSV.write("methods_timmings_BASIC.csv", timmings)

0 replies

dpanigo · 2023-06-01T14:36:56Z

dpanigo
Jun 1, 2023
Author

Following a Valentin's suggestion I have been working on simpler FP16 cases, like saxpy scalar-vector operations.
In the code below I use 3 different saxpy alternatives in FP64, 32 and 16.
The fatest one (BLAS) do not have support for FP16 arithmetic yet.
On the contrary, using FP16 data and arithmetics, the slowest one (the broadcast alternative) become almost as fast as the BLAS saxpy function that uses FP32.
The FORTRAN translated alternative (obtained here: https://discourse.julialang.org/t/another-blas-and-julia-comparison/15411) is unaffected by datatype.
So Oscar suggestion appears to be appropiate.
The problem is not with the Julia-Ampere Altra interaction itself, but with (still) many functions (notably including BLAS ones) that do no allow FP16 arithmetic yet.
Until then, It will be necessary to work with custom-made FP16 QR and Cholesky decomposition functions .

using LinearAlgebra, BenchmarkTools

function saxpy!(N,SA,SX,INCX::Int,SY,INCY::Int)
   if (N ≤ 0) return nothing end
   if (SA==0.0) return nothing end
   if (INCX==1) && (INCY==1)
      M = N%4
      if (M≠0)
         @inbounds Threads.@threads for I = 1:M
            SY[I] = SY[I] + SA*SX[I]
         end
      end
      if (N < 4) return nothing end
      MP1 = M + 1
      @inbounds Threads.@threads for I = MP1:4:N
         SY[I] = SY[I] + SA*SX[I]
         SY[I+1] = SY[I+1] + SA*SX[I+1]
         SY[I+2] = SY[I+2] + SA*SX[I+2]
         SY[I+3] = SY[I+3] + SA*SX[I+3]
      end
   else
      IX = 1
      IY = 1
      if (INCX<0)
         IX = (-N+1)*INCX + 1
      end
      if (INCY<0)
         IY = (-N+1)*INCY + 1
      end
      @inbounds Threads.@threads for I = 1:N
       SY[IY] = SY[IY] + SA*SX[IX]
       IX = IX + INCX
       IY = IY + INCY
      end
   end
   return nothing
end

#DATA CREATION
N1::Int64=10^4
a1::Float64 = 0.3141592653589793
x1 = convert(Array{Float64},collect(1:N1));
y1 = convert(Array{Float64},collect(1:N1));
N2::Int32=10^4
a2::Float32 = 0.3141592653589793
x2 = convert(Array{Float32},collect(1:N2));
y2 = convert(Array{Float32},collect(1:N2));
N3::Int16=10^4
a3::Float16 = 0.3141592653589793
x3 = convert(Array{Float16},collect(1:N3));
y3 = convert(Array{Float16},collect(1:N3));


@btime BLAS.axpy!($a1,$x1,$y1);
@btime saxpy!($N1,$a1,$x1,1,$y1,1);
@btime $y1 .= $a1.*$x1 + $y1;

@btime BLAS.axpy!($a1,$x2,$y2);
@btime saxpy!($N2,$a2,$x2,1,$y2,1);
@btime $y2 .= $a2.*$x2 + $y2;

try @btime BLAS.axpy!($a3,$x3,$y3);
catch e
    println("FP16 not supported by BLAS")
end
@btime saxpy!($N3,$a3,$x3,1,$y3,1);
@btime $y3 .= $a3.*$x3 + $y3;

0 replies

ctkelley · 2023-08-19T16:21:49Z

ctkelley
Aug 19, 2023

You might be interested in 102407 and 102612

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New native half-precision floating-point arithmetic not working on Ampere Altra (Aarch64 with FP16 enable architecture) #49987

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 5 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

New native half-precision floating-point arithmetic not working on Ampere Altra (Aarch64 with FP16 enable architecture) #49987

dpanigo May 29, 2023

Replies: 5 comments · 5 replies

vchuravy May 29, 2023 Maintainer

dpanigo May 29, 2023 Author

dpanigo May 29, 2023 Author

dpanigo May 30, 2023 Author

oscardssmith May 29, 2023 Collaborator

dpanigo May 29, 2023 Author

dpanigo May 30, 2023 Author

dpanigo May 30, 2023 Author

dpanigo Jun 1, 2023 Author

ctkelley Aug 19, 2023

dpanigo
May 29, 2023

Replies: 5 comments 5 replies

vchuravy
May 29, 2023
Maintainer

dpanigo May 29, 2023
Author

dpanigo May 29, 2023
Author

dpanigo May 30, 2023
Author

oscardssmith
May 29, 2023
Collaborator

dpanigo May 29, 2023
Author

dpanigo May 30, 2023
Author

dpanigo
May 30, 2023
Author

dpanigo
Jun 1, 2023
Author

ctkelley
Aug 19, 2023