Cambricon #87 (Draft)

wants to merge 53 commits into base: master

Conversation

FuncSherl
Collaborator

Draft for comparing code.

xuhao and others added 30 commits May 23, 2024 06:39
cambricon: fix mlu adopt problems
cambricon: fix add/sub ops test
cambricon: merge master on 0527 and fix some mlu problems
cambricon: fix bugs in mean dim
cambricon: fix bugs in grid over limits
cambricon: Cambricon merge 0603 and adopt mlu
cambricon: fix bugs in vectnorm and varmean
cambricon: fix some bugs in tests
return inp1, inp2, inp3


def cross_entropy_loss_args(dtype, batch, size):
Collaborator

Functions cross_entropy_loss_args, cumsum_args, and so on have already been implemented in the corresponding test functions; they could be deleted here.

Collaborator Author

@FuncSherl FuncSherl Jul 1, 2024

OK, will do in #80.

device = device or torch.mlu.current_device()
gen = torch.mlu.default_generators[device]
state_copy = gen.get_state()
c0, c1 = state_copy.view(torch.int64)[-2:]
Collaborator

How many bits is state_copy? After viewing it as torch.int64, is the resulting tensor longer than two elements?
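A minimal sketch of how to check this (assuming torch.mlu mirrors the torch.cuda generator API; the device index is a hypothetical example):

import torch

device = 0  # hypothetical device index
gen = torch.mlu.default_generators[device]
state_copy = gen.get_state()            # uint8 tensor holding the raw generator state
print(state_copy.numel(), "bytes")      # must be a multiple of 8 to view as torch.int64
as_int64 = state_copy.view(torch.int64)
print(as_int64.numel(), "int64 words")  # shows whether [-2:] really picks out two counters
c0, c1 = as_int64[-2:]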

@@ -0,0 +1,303 @@
from itertools import chain
Collaborator

pointwise_static.py is no longer useful. You could delete this file.

@@ -0,0 +1,47 @@
import torch
Collaborator

All forms of the pow function are collected in pow.py; the other files could be deleted.

@@ -0,0 +1,47 @@
import torch
Collaborator

ditto

@@ -0,0 +1,16 @@
import triton
Collaborator

ditto

ref_inp2 = to_reference(inp2, True)

ref_out = torch.pow(inp1, ref_inp2)
ref_out = torch.pow(inp1, ref_inp2.cpu())
Collaborator

Why not run the reference on MLU?

@@ -0,0 +1,105 @@
import torch
Collaborator

enable.py is no longer needed.

@FuncSherl
Collaborator Author

This MR is for comparing the current cambricon code with master, to make it easier to walk through the code changes; it is not intended to be merged.

@FuncSherl FuncSherl closed this Jul 1, 2024
@FuncSherl FuncSherl reopened this Jul 1, 2024

raw_res = tl.cumsum(inp_vals, axis=1)
result = raw_res + kep[:, None]
kep = result[:, BLOCK_N-1]
Collaborator

Does Triton support tensor slicing?

Collaborator

I have the same question here. Tensor slicing of this kind is not supported in Triton 2.2.
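For reference, a hedged sketch of one way to carry the running total without a tensor slice, replacing result[:, BLOCK_N - 1] with a column mask and a reduction (variable names follow the diff; assumes only Triton 2.2-level tl ops):

raw_res = tl.cumsum(inp_vals, axis=1)
result = raw_res + kep[:, None]
# select the last column via mask + reduction instead of slicing
col_idx = tl.arange(0, BLOCK_N)
is_last_col = col_idx == (BLOCK_N - 1)
kep = tl.sum(tl.where(is_last_col[None, :], result, 0.0), axis=1)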

inp = tl.load(input_ptrs, mask=mask, other=-float("inf")).to(tl.float32)
# get max for each block
tmp1 = tl.where(tmp0 < inp, inp, tmp0)
Collaborator

Use tl.maximum here.
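A minimal sketch of the suggestion, using the same names as the diff:

inp = tl.load(input_ptrs, mask=mask, other=-float("inf")).to(tl.float32)
# get max for each block
tmp1 = tl.maximum(tmp0, inp)  # same result as tl.where(tmp0 < inp, inp, tmp0)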

@@ -24,15 +24,18 @@ def __init__(self, op_name, torch_op, arg_func, dtype, batch, sizes):
def set_gems(self, gems_op):
self.gems_op = gems_op

def set_gems(self, gems_op):
Collaborator

Is this duplication intended?

task_num = tl.cdiv(M, BLOCK_M)
iter_num = tl.cdiv(task_num, num_prog)
if task_num % num_prog != 0:
iter_num = iter_num + 1
Collaborator

Does this conflict with tl.cdiv?

Collaborator

Why add 1 when iter_num is already tl.cdiv(task_num, num_prog)?
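A plain-Python sketch of the point, with hypothetical sizes, showing that the extra increment over-counts whenever the division is not exact:

M, BLOCK_M, num_prog = 1000, 128, 3               # hypothetical sizes
task_num = (M + BLOCK_M - 1) // BLOCK_M           # what tl.cdiv(M, BLOCK_M) computes -> 8
iter_num = (task_num + num_prog - 1) // num_prog  # tl.cdiv already rounds up -> 3
if task_num % num_prog != 0:                      # 8 % 3 != 0, so this branch fires
    iter_num = iter_num + 1                       # iter_num becomes 4: one loop too many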

task_num = tl.cdiv(M, BLOCK_M)
iter_num = tl.cdiv(task_num, num_prog)
if task_num % num_prog != 0:
iter_num = iter_num + 1
Collaborator

ditto

@@ -11,7 +11,7 @@
@triton.jit
def gelu_none_and_mul_kernel(x, y):
x_fp32 = x.to(tl.float32)
x_gelu = 0.5 * x_fp32 * (1 + tl.math.erf(x_fp32 * 0.7071067811))
x_gelu = 0.5 * x_fp32 * (1 + tl.extra.mlu.libdevice.erf(x_fp32 * 0.7071067811))
Collaborator

What version of Triton are you targeting now? This path to erf seems to exist only in versions after 2.3.
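A hedged sketch of one way to keep the kernel portable across Triton builds. The tl.extra.mlu.libdevice path is assumed to exist only in the Cambricon/newer builds and tl.math.erf in stock Triton 2.2; the kernel body is a reconstruction from the diff, not the full original:

import triton
import triton.language as tl

# resolve the erf symbol once at import time; the JIT kernel picks it up as a global
if hasattr(getattr(tl, "extra", None), "mlu"):
    _erf = tl.extra.mlu.libdevice.erf
else:
    _erf = tl.math.erf

@triton.jit
def gelu_none_and_mul_kernel(x, y):
    x_fp32 = x.to(tl.float32)
    x_gelu = 0.5 * x_fp32 * (1 + _erf(x_fp32 * 0.7071067811))
    return x_gelu * y  # assumed "_and_mul" behavior; the diff only shows the erf line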

"M",
"N",
],
)
Collaborator

Using only M and N in the tuning key, when M, N, and K may all affect performance, can cause unexpected behavior. For example, the best config tuned for (m, n, k1) may not be the best config for (m, n, k2), so earlier runs affect the performance of later ones.
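A minimal sketch of the suggestion, with a hypothetical config list and kernel signature, adding K to the autotune key so shapes that differ only in K are tuned independently:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 64}),
    ],
    key=["M", "N", "K"],  # include K so (m, n, k1) and (m, n, k2) do not share one cached config
)
@triton.jit
def mm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # body omitted; only the tuning key is illustrated here
    pass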

@@ -58,6 +60,96 @@ def log_softmax_kernel(
tl.store(output_ptrs, softmax_output, mask=mask)


@libentry()
Collaborator

We have a new implementation in #76 that improves performance a lot; maybe you can test that too.
