After torch.compile with 0.2.0, speed becomes very slow #727
Comments
Yes, that is observed in #709; you can cherry-pick the changes there.
I tested it, but that hotfix can break the graph when compiling with fullgraph. If I don't use fullgraph, the compile succeeds and the speed is OK, but I need fullgraph when using vLLM's fast inductor compilation.
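(For context, a minimal sketch of the two modes being compared; `f` is a stand-in, not the reporter's model. With `fullgraph=True`, torch.compile raises an error at the first graph break instead of silently splitting the graph and running the broken piece eagerly.)

```python
import torch

def f(x):
    # Stand-in function; imagine an attention call here that Dynamo
    # cannot trace.
    return torch.relu(x) * 2

compiled_partial = torch.compile(f)               # tolerates graph breaks, falls back to eager
compiled_full = torch.compile(f, fullgraph=True)  # raises on any graph break

x = torch.randn(8)
print(compiled_full(x))
```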
@MichoChan how do you use vLLM with torch.compile?
torch.compile inside vLLM itself is fine, but when I use vLLM's compilation implementation in my own framework, my model code causes a graph break during compile, which then trips the assert `not self._called, "VllmBackend can only be called once"`. I am using fullgraph with FlashInfer 0.2.0. vLLM already registers attention as a custom op for torch.compile; I used the same method with FlashInfer 0.1.6 and everything worked. So FlashInfer 0.2.0 can't be used with torch.compile in fullgraph mode.
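(A hedged sketch of the "register attention as a custom op" pattern the comment refers to; the op name and body here are hypothetical, not vLLM's actual registration. Wrapping the attention call in a custom op makes Dynamo treat it as a single opaque node, so untraceable Python inside it cannot break the graph.)

```python
import torch

@torch.library.custom_op("mylib::paged_attention", mutates_args=())
def paged_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in body; a real implementation would call the FlashInfer
    # wrapper (e.g. its decode/prefill run method) here.
    return torch.softmax(q @ k.transpose(-2, -1), dim=-1) @ v

@paged_attention.register_fake
def _(q, k, v):
    # Shape/dtype propagation so the op can be traced with fake tensors.
    return torch.empty_like(q)

def model_fwd(q, k, v):
    # Dynamo sees one opaque op instead of tracing into the attention body.
    return torch.ops.mylib.paged_attention(q, k, v)

compiled = torch.compile(model_fwd, fullgraph=True)
q = k = v = torch.randn(2, 4, 8)
print(compiled(q, k, v).shape)
```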
Can you explain this? I don't see why fullgraph works for v0.1.6 but not for v0.2.0.
Sorry, not 0.2.0; it's master with #709 that can't use fullgraph. I tested and found that #709 can break the graph: `BatchDecodeMlaWithPagedKVCacheWrapper.run` breaks the graph.
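(One way to confirm where the break happens is `torch._dynamo.explain`; a minimal sketch below, where `decode_step` is a hypothetical stand-in for a function that calls `BatchDecodeMlaWithPagedKVCacheWrapper.run` internally.)

```python
import torch
import torch._dynamo as dynamo

def decode_step(x):
    # Stand-in; in the real repro this would invoke the MLA decode wrapper.
    return torch.relu(x)

explanation = dynamo.explain(decode_step)(torch.randn(4))
print(explanation.graph_break_count)  # number of graph breaks Dynamo hit
print(explanation.break_reasons)      # why each break occurred
```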