Add support for a rope extension method #6553
Conversation
```python
if max_position == 131072:
    # Note(simon): this is a special case for a model that doesn't
    # supply rope_scaling. We should remove this once the model is updated.
    RotaryEmbedding = ExtendedRotaryEmbedding
```
Why don't you just do

```python
rotary_emb = ExtendedRotaryEmbedding(head_size, rotary_dim, max_position, base,
                                     is_neox_style, dtype)
```

instead? Rebinding `RotaryEmbedding` could break speculative decoding, for instance, since you may want to use a different RoPE impl for the draft and target models.
Also, the key in `_ROPE_DICT` should probably include a model identifier of some kind to avoid a similar bug.
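For illustration, here is a minimal, hedged sketch (not the actual vLLM code) of a wider `_ROPE_DICT` key, so that two models sharing `head_size`/`rotary_dim`/`max_position` but differing in `base` or in their scaling config do not collide. `RotaryEmbeddingStub` and `get_rope_cached` are placeholder names introduced only for this sketch.

```python
from typing import Any, Dict, Optional, Tuple


class RotaryEmbeddingStub:
    # Placeholder stand-in for vLLM's RotaryEmbedding, only to make the sketch runnable.
    def __init__(self, head_size: int, rotary_dim: int, max_position: int,
                 base: float, is_neox_style: bool) -> None:
        self.args = (head_size, rotary_dim, max_position, base, is_neox_style)


# Hypothetical cache: the key includes base and the (frozen) rope_scaling dict,
# so an entry built for one model is not reused for a different one.
_ROPE_DICT: Dict[Tuple[Any, ...], RotaryEmbeddingStub] = {}


def get_rope_cached(head_size: int, rotary_dim: int, max_position: int,
                    base: float, is_neox_style: bool,
                    rope_scaling: Optional[dict]) -> RotaryEmbeddingStub:
    # Freeze the (possibly None) scaling dict so it is hashable.
    scaling_key = tuple(sorted(rope_scaling.items())) if rope_scaling else None
    key = (head_size, rotary_dim, max_position, base, is_neox_style, scaling_key)
    if key not in _ROPE_DICT:
        _ROPE_DICT[key] = RotaryEmbeddingStub(head_size, rotary_dim,
                                              max_position, base, is_neox_style)
    return _ROPE_DICT[key]
```

With a key like this, two calls that differ only in `base` or in `rope_scaling` produce distinct cache entries instead of silently sharing one.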
Good point.
Also, is there something else we can key off of here to avoid false positives? Maybe `base`?
I'm going to make the change, but without adding a model id, because that's pretty intrusive and won't be needed after the proper fix. Do you think that's okay?
I don't know how common this particular `max_position` is (probably not that common), but this would enable the special case for any model that happens to use it. Could you also key on `base`?
LGTM. Let's make sure to fix the hack as soon as we can.
vllm/config.py (Outdated)

```
@@ -151,6 +151,15 @@ def __init__(
        self.hf_text_config = get_hf_text_config(self.hf_config)
        self.dtype = _get_and_verify_dtype(self.hf_text_config, dtype)

        if (getattr(self.hf_config, "max_position_embeddings", 0) == 131072
                and getattr(self.hf_config, "role_scaling", None) is None):
```
typo "role_scaling"
```
@@ -769,7 +799,11 @@ def get_rope(
        # for backward compatible
        if scaling_type != "su" and scaling_type != "longrope":
            scaling_factor = rope_scaling["factor"]
```
This change actually fails here.
The code hits this error:

```
Traceback (most recent call last):
  File "/home/ubuntu/workspace/rubikon/vllm/examples/llm_engine_example.py", line 55, in <module>
    main(args)
  File "/home/ubuntu/workspace/rubikon/vllm/examples/llm_engine_example.py", line 45, in main
    engine = initialize_engine(args)
  File "/home/ubuntu/workspace/rubikon/vllm/examples/llm_engine_example.py", line 40, in initialize_engine
    return LLMEngine.from_engine_args(engine_args)
  File "/home/ubuntu/workspace/rubikon/vllm/vllm/engine/llm_engine.py", line 387, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ubuntu/workspace/rubikon/vllm/vllm/engine/arg_utils.py", line 659, in create_engine_config
    model_config = ModelConfig(
  File "/home/ubuntu/workspace/rubikon/vllm/vllm/config.py", line 173, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/home/ubuntu/workspace/rubikon/vllm/vllm/config.py", line 1463, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError
```
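The assertion fires because `_get_and_verify_max_len` still requires a "factor" key in every `rope_scaling` dict. Below is a hedged sketch of the kind of guard that would avoid it; the helper name `scaled_max_len` and the key names are illustrative, and the actual merged fix may differ.

```python
from typing import Optional


def scaled_max_len(derived_max_len: int, rope_scaling: Optional[dict]) -> int:
    # Guard the "factor" lookup: scaling types such as "su" / "longrope"
    # (and an extended-rope config without a factor) should not trip the
    # `assert "factor" in rope_scaling` seen in the traceback above.
    if rope_scaling is None:
        return derived_max_len
    scaling_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    if scaling_type in ("su", "longrope"):
        return derived_max_len
    assert "factor" in rope_scaling, "rope_scaling must define 'factor'"
    return int(derived_max_len * rope_scaling["factor"])


# Example: a longrope-style config without "factor" no longer asserts.
assert scaled_max_len(4096, {"type": "longrope"}) == 4096
assert scaled_max_len(4096, {"type": "linear", "factor": 2.0}) == 8192
```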
* Add support for a rope extension method (vllm-project#6553)
* [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693)

Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>