Release v0.6.2.post1+rocm · ROCm/vllm

What's Changed

Make rpdtracer import only when required by @Rohan138 in #216
Improve profiling setup and documentation, sync benchmarks with main by @AdrianAbeyta in #218
Install the requirements before invoking setup.py, since it now imports setuptools_scm by @gshtras in #221
llama3.2 + cross attn test by @maleksan85 in #220
Optimize CAR for ROCm by @iotamudelta in #225
Custom PA perf improvements by @sanyalington in #222
Upstream merge 24 10 08 by @gshtras in #226
customPA write fp8 small ctx fix; enable customPA write fp8 by default by @sanyalington in #227
Added timeout for vLLM build on ROCm by @maleksan85 in #230
Add fp8 for dbrx by @charlifu in #231
Update Buildkite env variable by @dhonnappa-amd in #232
CUDA graph + num-scheduler-steps bug fix by @seungrokj in #236
[Model] [BUG] Fix code path logic to load mllama model by @tjtanaa in #234
prefix-enabled FA perf issue by @seungrokj in #239
Custom PA Partition size 256 to improve performance by @sanyalington in #238
[Build/CI] Minor changes to fix internal CI process. by @Alexei-V-Ivanov-AMD in #235
[BUGFIX] Restored handling of ROCm FA output as before adaptation of llama3.2 by @maleksan85 in #241

Full Changelog: v0.6.2+rocm...v0.6.2.post1+rocm