Inference failure when inlining #1602

charleskawczynski · 2024-01-31T02:36:52Z

I need to dig a bit more into why this happens, however, what I've seen is that if PhasePartition (in Thermodynamics.jl) in this reproducer is inlined, then the broadcasted anonymous function in ClimaCore.jl/Fields/broadcast.jl:

# types aren't isbits
Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

is not inferred. Here's the reproducer:

#=
julia --project=perf
include("../perf/climacore_type_bc_inf.jl")
=#
using Revise
import ClimaCore;
import Thermodynamics as TD
import Thermodynamics.Parameters as TDP
import CloudMicrophysics as CM
import ClimaCore.Fields as Fields
import ClimaCore.Operators as Operators
import ClimaCore.Spaces as Spaces
import ClimaCore.Geometry as Geometry
import ClimaComms
@isdefined(TU) || include(joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"));
import .TestUtilities as TU;
FT = Float64;
zelem=25
cspace = TU.CenterExtrudedFiniteDifferenceSpace(FT;zelem, helem=10);
fspace = Spaces.FaceExtrudedFiniteDifferenceSpace(cspace);
space = cspace;
@show ClimaComms.device(cspace)

# From ClimaAtmos
using ClimaParams
import ClimaParams as CP
aliases = string.(fieldnames(TD.Parameters.ThermodynamicsParameters));
toml_dict = CP.create_toml_dict(FT);
const thermo_params = TD.Parameters.ThermodynamicsParameters(toml_dict)
const cm_params = CM.Parameters.Parameters0M(FT)

x = (;
    ᶜts       = Fields.Field(TD.PhaseEquil{FT}, space),
    ᶜS_ρq_tot = Fields.Field(FT, space),
    ρ         = Fields.Field(FT, space),
)

total_specific_humidity(tp, ts) = TD.PhasePartition(tp, ts).tot
function compute_precipitation_cache!(x)
    (; ᶜts, ρ, ᶜS_ρq_tot) = x
    @. ᶜS_ρq_tot =
        ρ * CM.Microphysics0M.remove_precipitation(
            cm_params,
            TD.PhasePartition(thermo_params, ᶜts),
        )
    # @. ᶜS_ρq_tot = ρ * TD.PhasePartition(thermo_params, ᶜts).tot # allocates!
    # @. ᶜS_ρq_tot = ρ * total_specific_humidity(thermo_params, ᶜts) # allocation-free
    return nothing
end

compute_precipitation_cache!(x)

import JET
JET.@test_opt compute_precipitation_cache!(x)

function do_work!(x)
    for i in 1:200
        compute_precipitation_cache!(x)
    end
end

import Profile, ProfileCanvas
@info "Compiling first..."
do_work!(x) # compile first
@info "Collecting profile..."
Profile.clear()
prof = Profile.@profile do_work!(x)
results = Profile.fetch()
Profile.clear()
@info "Generating html..."
ProfileCanvas.html_file("flame.html", results)

@info "Benchmarking..."
import BenchmarkTools
trial = BenchmarkTools.@benchmark compute_precipitation_cache!($x)
show(stdout, MIME("text/plain"), trial)

This function is ~100x slower as a result. cc @glwagner and @dennisYatunin (just FYI). It's possible that we're hitting a heuristic (PhasePartition calls PhasePartition_equil with a handful of arguments), or maybe it's the use of phase_type, which is somewhat of an edge case in Thermodynamics.

Whatever the issue is, this is a useful reproducer because one other thing I noticed is that @. ᶜS_ρq_tot = ρ * TD.PhasePartition(thermo_params, ᶜts).tot allocates, making it slower. It's important that we track and document these issues, and spread best-practice patterns to avoid shortcomings with our broadcast machinery, or improve its robustness.

In addition, JET seems to miss this inference failure, however, it is caught in the flame graph (and the timing difference is observed through BenchmarkTools).

The text was updated successfully, but these errors were encountered:

charleskawczynski · 2024-01-31T02:43:37Z

Maybe even more interesting, removing

Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

entirely not only fixes the issue, but the function is 20% faster and seems to have less noise.

I'll see if this can be removed. It was added in a very large PR (#767), so it's possible that it was a kluge to get things working.

charleskawczynski · 2024-01-31T18:34:21Z

Let's add an allocation unit test to reproduce this.

glwagner · 2024-02-02T02:51:58Z

Can you clarify? Are you saying that there is a problem when an @inline annotation is added to a function? To reproduce this, we need to know which function to add/remove @inline from?

glwagner · 2024-02-02T02:54:17Z

Maybe even more interesting, removing
Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)
entirely not only fixes the issue, but the function is 20% faster and seems to have less noise.

I'll see if this can be removed. It was added in a very large PR (#767), so it's possible that it was a kluge to get things working.

It looks like this is supposed to handle constructions like FT.(x) where FT is something like Float32. Is that right?

Something like this should not be called during time-stepping. It might be even better to avoid this kind of pattern altogether, if possible. Not sure what the use case is.

simonbyrne · 2024-02-02T04:36:15Z

Maybe even more interesting, removing

Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

It should no longer be required, now that JuliaGPU/CUDA.jl#2000 is in CUDA.jl

charleskawczynski · 2024-02-02T13:21:37Z

Can you clarify? Are you saying that there is a problem when an @inline annotation is added to a function? To reproduce this, we need to know which function to add/remove @inline from?

Yes, when we inline Thermodynamic's PhaseEquil, the reproducer I gave (a full reproducer with git clone etc. is in aviatesk/JET.jl#608) results inference failure.

It looks like this is supposed to handle constructions like FT.(x) where FT is something like Float32. Is that right?

Yes but, interestingly, this broadcasted method also seems to catch when FT is a type and a method. Here is a minimal reproducer of what I mean:

function Base.Broadcast.broadcasted(
    fs::Base.BroadcastStyle,
    ::Type{T},
    args...,
) where {T}
    println("Boom! $T, $args")
    type_cast(x...) = T(x...)
    Base.Broadcast.broadcasted(fs, type_cast, args...)
end

# Base's definition (FYI):
# function Base.Broadcast.broadcasted(
#     fs::Base.BroadcastStyle,
#     f::T,
#     args...,
# ) where {T}
#     Base.Broadcast.Broadcasted(fs, f, args...)
# end

struct Foo{T}
    x::T
end
struct Bar{T}
    x::T
end
struct Baz{T}
    x::T
end
function Bar(x::Int)
    print("wow ")
    Bar{Float64}(x);
end;
function get_Baz(x::Int)
    print("nice ")
    Baz{Float64}(x);
end;
function main_foo!(x)
    x .= x .+ getproperty.(Foo.(x), :x)
    return nothing
end;
function main_bar!(x)
    x .= x .+ getproperty.(Bar.(x), :x) # this Bar is a method, not the type!
    return nothing
end;
function main_baz!(x)
    x .= x .+ getproperty.(get_Baz.(x), :x) # broadcasted does not catch it if there is no method
    return nothing
end;

a = [1,2,3];
main_foo!(a);
main_bar!(a);
main_baz!(a);

Something like this should not be called during time-stepping. It might be even better to avoid this kind of pattern altogether, if possible. Not sure what the use case is.

It should no longer be required, now that JuliaGPU/CUDA.jl#2000 is in CUDA.jl

Agreed, I think we can remove it, we just need to fix the resulting issues. Which, at the moment is (from this build):

ERROR: LoadError: MethodError: adapt_structure(::CUDA.KernelAdaptor, ::Base.Broadcast.Broadcasted{ClimaCore.Fields.FieldStyle{ClimaCore.DataLayouts.VIJFHStyle{4, CUDA.CuArray{Float32, N, CUDA.Mem.DeviceBuffer} where N}}, ClimaCore.Operators.CenterPlaceholderSpace, Type{ClimaCore.Geometry.Contravariant3Vector}, Tuple{ClimaCore.Fields.Field{ClimaCore.DataLayouts.VIJFH{ClimaCore.Geometry.Covariant12Vector{Float32}, 4, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, ClimaCore.Operators.PlaceholderSpace}, ClimaCore.Fields.Field{ClimaCore.DataLayouts.VIJFH{ClimaCore.Geometry.LocalGeometry{(1, 2, 3), ClimaCore.Geometry.LatLongZPoint{Float32}, Float32, StaticArraysCore.SMatrix{3, 3, Float32, 9}}, 4, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}}, ClimaCore.Operators.PlaceholderSpace}}}) is ambiguous.
  |  
  | Candidates:
  | adapt_structure(to::CUDA.KernelAdaptor, bc::Base.Broadcast.Broadcasted{Style, <:Any, Type{T}}) where {Style, T}
  | @ CUDA /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/CUDA/rXson/src/compiler/execution.jl:174
  | adapt_structure(to, bc::Base.Broadcast.Broadcasted{Style}) where Style<:ClimaCore.Fields.AbstractFieldStyle
  | @ ClimaCore.Fields /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Fields/broadcast.jl:38
  |  
  | Possible fix, define
  | adapt_structure(::CUDA.KernelAdaptor, ::Base.Broadcast.Broadcasted{…} where Axes) where {…}
  |  
  | Stacktrace:
  | [1] adapt(to::CUDA.KernelAdaptor, x::Base.Broadcast.Broadcasted{…})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/Adapt.jl:40
  | [2] (::Base.Fix1{…})(y::Base.Broadcast.Broadcasted{…})
  | @ Base ./operators.jl:1118
  | [3] map(f::Base.Fix1{…}, t::Tuple{…})
  | @ Base ./tuple.jl:292
  | [4] adapt_structure(to::CUDA.KernelAdaptor, xs::Tuple{Base.Broadcast.Broadcasted{…}, Base.Broadcast.Broadcasted{…}})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/base.jl:3
  | [5] adapt(to::CUDA.KernelAdaptor, x::Tuple{Base.Broadcast.Broadcasted{…}, Base.Broadcast.Broadcasted{…}})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/Adapt.jl:40
  | [6] adapt_structure(to::CUDA.KernelAdaptor, sbc::ClimaCore.Operators.StencilBroadcasted{…})
  | @ ClimaCore.Operators /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Operators/finitedifference.jl:238
  | [7] adapt(to::CUDA.KernelAdaptor, x::ClimaCore.Operators.StencilBroadcasted{…})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/Adapt.jl:40
  | [8] cudaconvert(arg::ClimaCore.Operators.StencilBroadcasted{…})
  | @ CUDA /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/CUDA/rXson/src/compiler/execution.jl:188
  | [9] map (repeats 2 times)
  | @ ./tuple.jl:294 [inlined]
  | [10] macro expansion
  | @ /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/CUDA/rXson/src/compiler/execution.jl:102 [inlined]
  | [11] copyto!(out::ClimaCore.Fields.Field{…}, bc::ClimaCore.Operators.StencilBroadcasted{…})
  | @ ClimaCore.Operators /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Operators/finitedifference.jl:3366
  | [12] materialize!(dest::ClimaCore.Fields.Field{…}, opbc::ClimaCore.Operators.StencilBroadcasted{…})
  | @ ClimaCore.Operators /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Operators/common.jl:61
  | [13] set_ᶠuₕ³!(ᶠuₕ³::ClimaCore.Fields.Field{…}, Y::ClimaCore.Fields.FieldVector{…})
  | @ ClimaAtmos /central/scratch/esm/slurm-buildkite/climaatmos-ci/16362/climaatmos-ci/src/cache/precomputed_quantities.jl:138
  | [14] macro expansion
  | @ /central/scratch/esm/slurm-buildkite/climaatmos-ci/16362/climaatmos-ci/src/cache/precomputed_quantities.jl:424 [inlined]
  | [15] set_precomputed_quantities!(Y::ClimaCore.Fields.FieldVector{…}, p::@NamedTuple{…}, t::Float32)

charleskawczynski · 2024-03-05T15:51:13Z

Interestingly, I've found another issue, from analyzing and simplifying this issue.

charleskawczynski added the bug Something isn't working label Jan 31, 2024

charleskawczynski self-assigned this Jan 31, 2024

charleskawczynski added performance Inference labels Jan 31, 2024

charleskawczynski mentioned this issue Jan 31, 2024

JET misses failed inference (a reproducer) aviatesk/JET.jl#608

Open

This was referenced Feb 13, 2024

Improve the performance of buoyancy_gradients CliMA/ClimaAtmos.jl#2530

Closed

Use ClimaParameters T_init_min, with new T_min value CliMA/Thermodynamics.jl#193

Merged

This was referenced Feb 26, 2024

Add FD and cuda allocation tests #1626

Merged

Improve type broadcasting #1628

Closed

charleskawczynski mentioned this issue Mar 5, 2024

Fix type broadcasting plus axis tensors #1636

Merged

charleskawczynski closed this as completed in #1636 Mar 5, 2024

charleskawczynski mentioned this issue Mar 5, 2024

Inline PhasePartition CliMA/Thermodynamics.jl#209

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference failure when inlining #1602

Inference failure when inlining #1602

charleskawczynski commented Jan 31, 2024 •

edited

Loading

charleskawczynski commented Jan 31, 2024

charleskawczynski commented Jan 31, 2024

glwagner commented Feb 2, 2024

glwagner commented Feb 2, 2024 •

edited

Loading

simonbyrne commented Feb 2, 2024

charleskawczynski commented Feb 2, 2024 •

edited

Loading

charleskawczynski commented Mar 5, 2024

Inference failure when inlining #1602

Inference failure when inlining #1602

Comments

charleskawczynski commented Jan 31, 2024 • edited Loading

charleskawczynski commented Jan 31, 2024

charleskawczynski commented Jan 31, 2024

glwagner commented Feb 2, 2024

glwagner commented Feb 2, 2024 • edited Loading

simonbyrne commented Feb 2, 2024

charleskawczynski commented Feb 2, 2024 • edited Loading

charleskawczynski commented Mar 5, 2024

charleskawczynski commented Jan 31, 2024 •

edited

Loading

glwagner commented Feb 2, 2024 •

edited

Loading

charleskawczynski commented Feb 2, 2024 •

edited

Loading