Optimize softmax #69

iclementine · 2024-06-14T10:38:59Z

Use different kernels (inner & non_inner) for softmax forward:
a. inner: for reduction along the last dim (the input is preprocessed to be contiguous)
b. non_inner: for reduction along other dimensions (the input is preprocessed to be contiguous)
Both have a ONE_TILE_PER_CTA static condition:
a. when ONE_TILE_PER_CTA is True, load only one tile per CTA without looping over the reduction dim
b. when ONE_TILE_PER_CTA is False, use the online softmax normalizer algorithm to save one sweep over the input.
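
For readers unfamiliar with the online normalizer, here is a minimal pure-Python sketch of the idea, not the Triton kernels in this PR. The function names and the `tile_size` parameter are made up for illustration. It keeps a running maximum and a running sum that is rescaled whenever the maximum changes, so the max and the normalizer come out of a single sweep over the row.

```python
import math

def stats_two_pass(row):
    """Reference: one sweep for the max, a second sweep for the sum of exponentials."""
    m = max(row)
    d = sum(math.exp(x - m) for x in row)
    return m, d

def stats_online(row, tile_size):
    """One sweep over the row in tiles, tracking a running max and a rescaled running sum."""
    m = -math.inf  # running maximum
    d = 0.0        # running sum of exp(x - m)
    for start in range(0, len(row), tile_size):
        tile = row[start:start + tile_size]
        m_new = max(m, max(tile))
        # Rescale the old sum to the new maximum, then add this tile's contribution.
        d = d * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in tile)
        m = m_new
    return m, d

def softmax_row(row, tile_size):
    """Hypothetical dispatcher mirroring the ONE_TILE_PER_CTA idea for a single row."""
    one_tile = len(row) <= tile_size  # the whole row fits in one tile: no loop needed
    m, d = stats_two_pass(row) if one_tile else stats_online(row, tile_size)
    return [math.exp(x - m) / d for x in row]
```

With a row that fits in one tile, the statistics are computed directly on the loaded data; otherwise the tiled loop produces the same `(m, d)` pair up to floating-point rounding.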

We can leave other optimizations (optimizing according to input layout, or two-pass reduction) for future PRs.

1. ensure that decorator cascading works as expected, i.e. an inner decorator can use arguments provided by an outer decorator
2. ensure that the grid function can use all the arguments provided by decorators (Autotuner & Heuristics)
3. simplify LibEntry: extract captured constant arguments from CompiledKernel instead of traversing layers of decorators.
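
A toy illustration of points 1 and 2, in plain Python. The decorators below (`with_config`, `with_heuristics`) and the `launch` function are hypothetical stand-ins; they are not Triton's Autotuner, Heuristics, or LibEntry, and only show the intended data flow: the outer decorator supplies a parameter, the inner one derives another from it, and the grid callable sees both.

```python
def with_config(config):
    """Hypothetical stand-in for an autotuner: injects tuned meta-parameters."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            return fn(*args, **{**config, **kwargs})
        return wrapper
    return decorator

def with_heuristics(rules):
    """Hypothetical stand-in for heuristics: derives meta-parameters from what it receives."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            derived = {name: rule(kwargs) for name, rule in rules.items()}
            return fn(*args, **{**kwargs, **derived})
        return wrapper
    return decorator

@with_config({"TILE_N": 1024})  # outer decorator: provides TILE_N
@with_heuristics({"ONE_TILE_PER_CTA": lambda meta: meta["N"] <= meta["TILE_N"]})  # inner: uses TILE_N
def launch(grid, **meta):
    # The grid callable receives every parameter, including those added by decorators.
    return grid(meta), meta

def grid(meta):
    # Ceiling division: number of tiles needed to cover N with tiles of TILE_N.
    return (meta["M"], -(-meta["N"] // meta["TILE_N"]))
```

Calling `launch(grid, M=4, N=512)` passes `TILE_N` from the outer decorator into the inner rule, and the resulting `ONE_TILE_PER_CTA` flag reaches both the function and the grid callable.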

…e ONE_TILE_PER_CTA static condition(to decide whether to load only one tile per cta.

StrongSpoon · 2024-06-17T01:30:35Z

need to rebase after merging pr68

StrongSpoon · 2024-06-18T08:41:17Z

tests/test_reduction_ops.py

@@ -535,8 +535,8 @@ def _torch_rms_norm(x, residual, weight, eps):

 @pytest.mark.parametrize("shape", REDUCTION_SHAPES)
 @pytest.mark.parametrize("dtype", FLOAT_DTYPES)
-def test_accuracy_softmax(shape, dtype):
-    dim = 1
+@pytest.mark.parametrize("dim", [0, 1])


"dim" passed to the function here is an integer.
Do you mean we should add tests for non-inner reduction in general?
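
To make the inner / non-inner distinction concrete for an integer `dim`, here is a small sketch. It assumes a contiguous input viewed as `(M, N, K)` with the reduction along `N`; the helper names `split_shape` and `is_inner` are illustrative and not taken from this PR.

```python
import math

def split_shape(shape, dim):
    """View a contiguous shape as (M, N, K), where N is the size of the reduced dimension."""
    dim = dim % len(shape)           # allow negative dims such as -1
    M = math.prod(shape[:dim])       # product of the leading dimensions (1 if none)
    N = shape[dim]                   # length of the reduction dimension
    K = math.prod(shape[dim + 1:])   # product of the trailing dimensions (1 if none)
    return M, N, K

def is_inner(shape, dim):
    """Under this assumption, the reduction is 'inner' when nothing trails the reduced dim."""
    return split_shape(shape, dim)[2] == 1
```

For a 2-D shape such as `(4, 8)`, `dim=1` gives `K == 1` (inner), while `dim=0` gives `K == 8` (non-inner), so parametrizing the test over `[0, 1]` would exercise both paths.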

iclementine · 2024-06-18T10:18:13Z

> need to rebase after merging pr68

I didn't resolve the conflicts properly, so I opened a new pull request, #76.

iclementine added 7 commits June 14, 2024 17:42

add test_libentry into CI

360b9db

add test_libentry

1119a46

assert not raising certain kind of exception

aa9b854

clean code

73b81fb

use different kernels(inner & non_inner for softmax forward, both hav…

062400d

…e ONE_TILE_PER_CTA static condition(to decide whether to load only one tile per cta.

use different kernels(inner & non_inner for softmax forward, both hav…

d611962

…e ONE_TILE_PER_CTA static condition(to decide whether to load only one tile per cta.

StrongSpoon reviewed Jun 18, 2024

iclementine closed this Jun 19, 2024

iclementine deleted the optimize_softmax branch June 19, 2024 03:07
