
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 #9838

Merged 5 commits into vllm-project:main on Oct 31, 2024

Conversation

mzusman
Contributor

@mzusman mzusman commented Oct 30, 2024

  1. Fix an illegal memory access in the causal_conv1d forward kernel. It was overlooked in [Kernel][Model] Improve continuous batching for Jamba and Mamba #9189 since, AFAIK, it only manifests on H100 machines.
  2. Loosen the diff tolerance on the bfloat16 results in test_selective_state_update_with_batch_indices to match the tolerances used in test_selective_state_update_with_heads_with_batch_indices. Without this change the test also fails on H100, where the diff is apparently a little higher.
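The tolerance change in point 2 can be illustrated with a minimal sketch. The rtol/atol values and the closeness check below are illustrative only; they are not the actual values or implementation used in vLLM's tests:

```python
# Minimal model of an element-wise closeness check like torch.allclose.
# rtol/atol values here are examples, not the ones in the vLLM test suite.
def allclose(a, b, rtol, atol):
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

ref = [1.000, 2.000, -3.000]
out = [1.004, 2.008, -3.012]   # bfloat16-sized rounding error

# A tight fp32-style tolerance rejects the bfloat16 result...
tight = allclose(out, ref, rtol=1e-5, atol=1e-8)
# ...while a looser bfloat16-appropriate tolerance accepts it.
loose = allclose(out, ref, rtol=1e-2, atol=1e-2)
```

The point is that bfloat16 has roughly 8 bits of mantissa, so differences on the order of 1e-3 relative error are expected and a tolerance tuned for fp32 will report spurious failures.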

CC @tlrmchlsmth


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small, essential subset of tests meant to catch errors quickly. You can run additional CI tests on top of those from your fastcheck build in the Buildkite UI (linked in the PR checks section) by unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to the Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

update, and small fix in causal_conv1d kernel

Signed-off-by: mzusman <[email protected]>
@mzusman mzusman force-pushed the bug_fix_causal_conv1d_h100 branch from f6947db to 162361f Compare October 30, 2024 12:28
Collaborator

@tlrmchlsmth tlrmchlsmth left a comment


Thanks for the fix!

Verified that pytest tests/kernels/test_causal_conv1d.py was hitting an illegal memory access, and that it is now fixed on an H100.

Comment on lines 450 to 452
if ((offset + kWidth - 2) >= kNElts){
// do not load to index 1 if we're not gonna read from there
reinterpret_cast<vec_t *>(x_vals_load)[1] = smem_exchange[last_thread + 1];
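The guard above can be modeled in a few lines of Python. The kWidth/kNElts values below are example sizes for illustration; the kernel's actual values come from its template parameters:

```python
# Model of the guard: a thread only loads smem_exchange[last_thread + 1]
# into slot 1 of x_vals_load when the convolution window will actually
# read past the first kNElts elements.
def needs_slot1(offset, kWidth, kNElts):
    return (offset + kWidth - 2) >= kNElts

kWidth, kNElts = 4, 8  # example sizes, not the kernel's real parameters

# With a small offset the window stays inside the first kNElts elements,
# so loading slot 1 would touch memory the thread never reads.
inside = needs_slot1(0, kWidth, kNElts)
# With a large offset the window crosses the boundary, so slot 1 is needed.
crosses = needs_slot1(7, kWidth, kNElts)
```

Skipping the load when the window stays inside the boundary is what avoids the out-of-bounds access behind the illegal-memory-access error.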
Collaborator


Could you explain this a bit more?

Contributor Author


Yeah, I've added more explanation.

@tlrmchlsmth
Collaborator

@mzusman, is tests/kernels/test_causal_conv1d.py passing for you on an H100? I am seeing a couple of torch.allclose failures and wondering if the tolerances need to be bumped up a bit

FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-64-2048-4-True-True-itype0] - AssertionError: assert False
FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-4096-2048-4-True-True-itype0] - AssertionError: assert False

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 30, 2024
@mzusman
Contributor Author

mzusman commented Oct 31, 2024

@mzusman, is tests/kernels/test_causal_conv1d.py passing for you on an H100? I am seeing a couple of torch.allclose failures and wondering if the tolerances need to be bumped up a bit

FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-64-2048-4-True-True-itype0] - AssertionError: assert False
FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-4096-2048-4-True-True-itype0] - AssertionError: assert False

I've dug into those two failures. They occur in a specific edge case where the final-state data is split across the kernel's two iterations (in most cases the data ends up entirely in the last smem_exchange buffer). Since smem_exchange gets overwritten in the last iteration, I had to add another block of code inside the loop itself to handle this case.

I guess it was overlooked previously because it only fails on specific sequence lengths that depend on the chunk size, while our tests generate sequence lengths randomly (for the varlen part, specifically).
To catch this edge case, I've added sequence length 1025 to test_causal_conv1d; it fails without this fix.
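A quick sketch of why 1025 hits the edge case while round lengths like 2048 do not. The chunk size and conv width below are assumed values for illustration, not necessarily the kernel's actual configuration:

```python
# Sketch of the boundary condition: the final state needs the last
# (WIDTH - 1) input elements, and the bug triggers when those elements
# straddle a chunk boundary, i.e. are split across two kernel iterations.
CHUNK = 1024   # assumed chunk size, for illustration only
WIDTH = 4      # assumed convolution width, for illustration only

def final_state_split(seqlen):
    # First and last element indices the final state depends on.
    start = seqlen - (WIDTH - 1)
    end = seqlen - 1
    # Split when they fall into different chunks.
    return start // CHUNK != end // CHUNK

# seqlen = 1025: the last 3 elements are indices 1022, 1023, 1024,
# which span chunk 0 and chunk 1 -> the edge case fires.
split_1025 = final_state_split(1025)
# seqlen = 2048: the last 3 elements all live in the final chunk.
split_2048 = final_state_split(2048)
```

Randomly drawn sequence lengths rarely land within WIDTH - 1 elements of a chunk boundary, which would explain why the earlier tests never tripped over this.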

Collaborator

@tlrmchlsmth tlrmchlsmth left a comment


Tests are green for me now, thanks for the additional bug fix!

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) October 31, 2024 14:13
@tlrmchlsmth tlrmchlsmth merged commit 9fb12f7 into vllm-project:main Oct 31, 2024
79 checks passed
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Nov 4, 2024
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Nov 4, 2024
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Nov 5, 2024
JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024