
InternalError: Check failed: func->buffer_map.size() == 0 (3 vs. 0) : This pass must be called after MakePackedAPI #259

Closed
Cunxiao2002 opened this issue Dec 8, 2024 · 4 comments
Labels: bug (Something isn't working)

@Cunxiao2002
Contributor

When I used bitblas to tune a matmul operator on an A100, I encountered the following error.
(screenshot of the InternalError traceback)

Environment Configuration:
Ubuntu 22.04.2 LTS
bitblas: 2bd1dee
cuda12.1

The specific test script is as follows:

from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.arch import CUDA
from bitblas.base.utils import apply_and_build
import tvm
from tvm.script import tir as T
import bitblas

@tvm.script.ir_module
class FusedSingleOp:
    @T.prim_func(private=True)
    def dense1(lv11: T.Buffer((T.int64(1024), T.int64(1024)), "float16"), B: T.Buffer((T.int64(1024), T.int64(1024)), "float16"), T_matmul_NT: T.Buffer((T.int64(1024), T.int64(1024)), "float16")):
        T.func_attr({"layout_free_buffers": [1], "op_attrs": {"op_name": "nn.dense", "out_dtype": "float16", "units": None}, "op_pattern": 4, "tir.noalias": T.bool(True)})
        # with T.block("root"):
        for i0, i1, k in T.grid(T.int64(1024), T.int64(1024), T.int64(1024)):
            with T.block("T_matmul_NT"):
                v_i0, v_i1, v_k = T.axis.remap("SSR", [i0, i1, k])
                T.reads(lv11[v_i0, v_k], B[v_i1, v_k])
                T.writes(T_matmul_NT[v_i0, v_i1])
                with T.init():
                    T_matmul_NT[v_i0, v_i1] = T.float16(0)
                T_matmul_NT[v_i0, v_i1] = T_matmul_NT[v_i0, v_i1] + lv11[v_i0, v_k] * B[v_i1, v_k]

ir_module = FusedSingleOp
func = ir_module["dense1"]
target = tvm.target.Target("cuda")
arch = CUDA(target)
policy = DefaultPolicy(func=func, arch=arch)
try:
    tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target)
except Exception:
    tags = None
# Tune with Tensor Core if possible
if tags:
    policy = TensorCorePolicy(func=tensorized_func, arch=arch, tags=tags)

configs = policy.emit_config(topk=20)

cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)

print(best.code)

When I tested the same script on a V100 with the same environment configuration, the error "InternalError: Check failed: func->buffer_map.size() == 0 (3 vs. 0) : This pass must be called after MakePackedAPI" did not occur. Instead, it resulted in:
(screenshot of a different error traceback)

@LeiWang1999
Contributor

@Cunxiao2002, thanks for reporting; this is an interesting bug that I can also reproduce on my A100.

It seems related to @T.prim_func(private=True): if we replace @T.prim_func(private=True) with plain @T.prim_func (so that private takes its default value, False), the pipeline works.

I think we should inspect the IR modules in these two cases before they enter the unsupported_dtype_legalize pass.

@LeiWang1999 LeiWang1999 self-assigned this Dec 8, 2024
@LeiWang1999 LeiWang1999 added the bug Something isn't working label Dec 8, 2024
@LeiWang1999
Contributor

Fixed, please check out:

import tvm
from tvm.script import tir as T
import bitblas

bitblas.set_log_level("DEBUG")

@tvm.script.ir_module
class FusedSingleOp:
    @T.prim_func(private=True)
    def dense1(lv11: T.Buffer((T.int64(1024), T.int64(1024)), "float16"), B: T.Buffer((T.int64(1024), T.int64(1024)), "float16"), T_matmul_NT: T.Buffer((T.int64(1024), T.int64(1024)), "float16")):
        T.func_attr({"layout_free_buffers": [1], "op_attrs": {"op_name": "nn.dense", "out_dtype": "float16", "units": None}, "op_pattern": 4, "tir.noalias": T.bool(True)})
        # with T.block("root"):
        for i0, i1, k in T.grid(T.int64(1024), T.int64(1024), T.int64(1024)):
            with T.block("T_matmul_NT"):
                v_i0, v_i1, v_k = T.axis.remap("SSR", [i0, i1, k])
                T.reads(lv11[v_i0, v_k], B[v_i1, v_k])
                T.writes(T_matmul_NT[v_i0, v_i1])
                with T.init():
                    T_matmul_NT[v_i0, v_i1] = T.float16(0)
                T_matmul_NT[v_i0, v_i1] = T_matmul_NT[v_i0, v_i1] + lv11[v_i0, v_k] * B[v_i1, v_k]

ir_module = FusedSingleOp
target = tvm.target.Target("cuda")

with target:
    mod = bitblas.ApplyFastTuning(topk=1)(ir_module)

print(mod)
from tvm import relax
exec = relax.build(mod, target="cuda")
dev = tvm.device("cuda", 0)
vm = relax.VirtualMachine(exec, dev)

If you want to apply tuning to a single operator, use specialized_function = func.with_attr("global_symbol", g_var.name_hint) to make the function non-private.

@LeiWang1999
Contributor

Closed, as this has been fixed.

@Cunxiao2002
Contributor Author

thanks!
