
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 #9838

Merged 5 commits into vllm-project:main on Oct 31, 2024

Conversation

mzusman
Contributor

@mzusman mzusman commented Oct 30, 2024

  1. Fix an illegal memory access in the causal_conv1d forward kernel. It was overlooked in [Kernel][Model] Improve continuous batching for Jamba and Mamba #9189 since, AFAIK, it only manifests on H100 machines.
  2. Loosen the diff tolerance on the bfloat16 results in test_selective_state_update_with_batch_indices to match the tolerances used in test_selective_state_update_with_heads_with_batch_indices. Without this change the test also fails on H100, where the diff is apparently a little higher.
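The tolerance change in point 2 can be illustrated with a minimal sketch. The rtol/atol values and the closeness check below are illustrative only; they are not the actual values or implementation used in vLLM's tests:

```python
# Minimal model of an element-wise closeness check like torch.allclose.
# rtol/atol values here are examples, not the ones in the vLLM test suite.
def allclose(a, b, rtol, atol):
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

ref = [1.000, 2.000, -3.000]
out = [1.004, 2.008, -3.012]   # bfloat16-sized rounding error

# A tight fp32-style tolerance rejects the bfloat16 result...
tight = allclose(out, ref, rtol=1e-5, atol=1e-8)
# ...while a looser bfloat16-appropriate tolerance accepts it.
loose = allclose(out, ref, rtol=1e-2, atol=1e-2)
```

The point is that bfloat16 has roughly 8 bits of mantissa, so differences on the order of 1e-3 relative error are expected and a tolerance tuned for fp32 will report spurious failures.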

CC @tlrmchlsmth


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small, essential subset of tests meant to catch errors quickly. You can run additional CI tests on top of those from your fastcheck build in the Buildkite UI (linked in the PR checks section) by unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to the Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

update, and small fix in causal_conv1d kernel

Signed-off-by: mzusman <[email protected]>
@mzusman mzusman force-pushed the bug_fix_causal_conv1d_h100 branch from f6947db to 162361f Compare October 30, 2024 12:28
Collaborator

@tlrmchlsmth tlrmchlsmth left a comment


Thanks for the fix!

Verified that pytest tests/kernels/test_causal_conv1d.py was hitting an illegal memory access, and that it is now fixed on an H100.

Comment on lines 450 to 452
if ((offset + kWidth - 2) >= kNElts){
// do not load to index 1 if we're not gonna read from there
reinterpret_cast<vec_t *>(x_vals_load)[1] = smem_exchange[last_thread + 1];
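The guard above can be modeled in a few lines of Python. The kWidth/kNElts values below are example sizes for illustration; the kernel's actual values come from its template parameters:

```python
# Model of the guard: a thread only loads smem_exchange[last_thread + 1]
# into slot 1 of x_vals_load when the convolution window will actually
# read past the first kNElts elements.
def needs_slot1(offset, kWidth, kNElts):
    return (offset + kWidth - 2) >= kNElts

kWidth, kNElts = 4, 8  # example sizes, not the kernel's real parameters

# With a small offset the window stays inside the first kNElts elements,
# so loading slot 1 would touch memory the thread never reads.
inside = needs_slot1(0, kWidth, kNElts)
# With a large offset the window crosses the boundary, so slot 1 is needed.
crosses = needs_slot1(7, kWidth, kNElts)
```

Skipping the load when the window stays inside the boundary is what avoids the out-of-bounds access behind the illegal-memory-access error.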
Collaborator


Could you explain this a bit more?

Contributor Author


Yeah, I've added more explanation.

@tlrmchlsmth
Collaborator

@mzusman, is tests/kernels/test_causal_conv1d.py passing for you on an H100? I am seeing a couple of torch.allclose failures and wondering if the tolerances need to be bumped up a bit

FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-64-2048-4-True-True-itype0] - AssertionError: assert False
FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-4096-2048-4-True-True-itype0] - AssertionError: assert False

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 30, 2024
@mzusman
Contributor Author

mzusman commented Oct 31, 2024

@mzusman, is tests/kernels/test_causal_conv1d.py passing for you on an H100? I am seeing a couple of torch.allclose failures and wondering if the tolerances need to be bumped up a bit

FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-64-2048-4-True-True-itype0] - AssertionError: assert False
FAILED tests/kernels/test_causal_conv1d.py::test_causal_conv1d_varlen[False-4096-2048-4-True-True-itype0] - AssertionError: assert False

I've dug into those two failures. They occur in a specific edge case where the final-state data is split across the kernel's two iterations (in most cases the data ends up entirely in the last smem_exchange buffer). Since smem_exchange gets overwritten in the last iteration, I had to add another block of code inside the loop itself to handle this case.

I guess it was overlooked previously because it only fails on specific sequence lengths that depend on the chunk size, while our tests generate sequence lengths randomly (for the varlen part, specifically).
To catch this edge case, I've added sequence length 1025 to test_causal_conv1d; it fails without this fix.
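A quick sketch of why 1025 hits the edge case while round lengths like 2048 do not. The chunk size and conv width below are assumed values for illustration, not necessarily the kernel's actual configuration:

```python
# Sketch of the boundary condition: the final state needs the last
# (WIDTH - 1) input elements, and the bug triggers when those elements
# straddle a chunk boundary, i.e. are split across two kernel iterations.
CHUNK = 1024   # assumed chunk size, for illustration only
WIDTH = 4      # assumed convolution width, for illustration only

def final_state_split(seqlen):
    # First and last element indices the final state depends on.
    start = seqlen - (WIDTH - 1)
    end = seqlen - 1
    # Split when they fall into different chunks.
    return start // CHUNK != end // CHUNK

# seqlen = 1025: the last 3 elements are indices 1022, 1023, 1024,
# which span chunk 0 and chunk 1 -> the edge case fires.
split_1025 = final_state_split(1025)
# seqlen = 2048: the last 3 elements all live in the final chunk.
split_2048 = final_state_split(2048)
```

Randomly drawn sequence lengths rarely land within WIDTH - 1 elements of a chunk boundary, which would explain why the earlier tests never tripped over this.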

Collaborator

@tlrmchlsmth tlrmchlsmth left a comment


Tests are green for me now, thanks for the additional bug fix!

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) October 31, 2024 14:13
@tlrmchlsmth tlrmchlsmth merged commit 9fb12f7 into vllm-project:main Oct 31, 2024
79 checks passed
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Nov 4, 2024
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Nov 4, 2024
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Nov 5, 2024
JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024