
[Bug]: When using lora and setting num-scheduler-steps simultaneously, the output does not meet expectations. #11086

Open
luoling1993 opened this issue Dec 11, 2024 · 2 comments
Labels
bug Something isn't working

Comments

luoling1993 commented Dec 11, 2024

Your current environment

(Output of `python collect_env.py` not provided.)

Model Input Dumps

No response

🐛 Describe the bug

vLLM version: 0.6.4.post1
I trained a LoRA adapter based on Qwen2.5-7B-Instruct, and I start the vLLM service using pm2 with the following configuration:

apps:
  - name: "vllm"
    script: "/home/lucas/envs/nlp-vllm/bin/python"
    args:
      - "-m"
      - "vllm.entrypoints.openai.api_server"
      - "--port=18101"  # 端口设置

      # # Meta-Llama-3.1-8B-Instruct
      # - "--served-model-name=Meta-Llama-3.1-8B-Instruct"
      # - "--model=/data/llms/Meta-Llama-3.1-8B-Instruct"
      # - "--tokenizer=/data/llms/Meta-Llama-3.1-8B-Instruct"

      # qwen
      - "--served-model-name=Qwen2.5-7B-Instruct"
      - "--model=/data/llms/Qwen2.5-7B-Instruct"
      - "--tokenizer=/data/llms/Qwen2.5-7B-Instruct"

      # - "--max-model-len=8192"  # 最大模型长度
      - "--max-model-len=4096"
      - "--gpu-memory-utilization=0.9"  # GPU内存利用率

      # speedup
      # - "--enable-chunked-prefill"  # NOTE: LoRA is not supported with chunked prefill yet
      - "--enable-prefix-caching"
      # - "--num-scheduler-steps=8" # NOTE: LoRA, will always use base model(BUG)

      - "--enable-lora"
      - "--max-lora-rank=64"
      - "--lora-modules"
      # - '{"name": "nl2filter", "path": "/home/lucas/workspace/github_project/LLaMA-Factory/saves/Meta-Llama-3.1-8B-Instruct/lora/nl2filter-all", "base_model_name": "Meta-Llama-3.1-8B-Instruct"}'
      - '{"name": "nl2filter", "path": "/home/lucas/workspace/github_project/LLaMA-Factory/saves/Qwen2.5-7B-Instruct/lora/nl2filter-all", "base_model_name": "Qwen2.5-7B-Instruct"}'
   
    env:
      CUDA_VISIBLE_DEVICES: "0"

    log_date_format: "YYYY-MM-DD HH:mm:ss"
    error_file: "/home/lucas/workspace/pm2_logs/error.log"
    out_file: "/home/lucas/workspace/pm2_logs/out.log"

When calling the service, I use model_name=nl2filter. Everything works fine when the num-scheduler-steps parameter is not set. However, with --num-scheduler-steps=8, the service starts up normally and the call returns, but the result is not from the LoRA model; it comes from the base model, i.e. Qwen2.5-7B-Instruct without any LoRA weights applied. There are no errors or warnings.
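
For reference, the request goes through the OpenAI-compatible endpoint; a minimal sketch of such a call is below (the prompt, host, and use of the official openai Python client are assumptions for illustration, not the exact client code used):

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server configured above (port 18101).
client = OpenAI(base_url="http://localhost:18101/v1", api_key="EMPTY")

# Requesting model="nl2filter" should route to the LoRA adapter; with
# --num-scheduler-steps=8 the completion instead matches the base Qwen2.5-7B-Instruct output.
response = client.chat.completions.create(
    model="nl2filter",
    messages=[{"role": "user", "content": "example prompt"}],
)
print(response.choices[0].message.content)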

jeejeelee (Collaborator) commented Dec 11, 2024

# - "--enable-chunked-prefill"  # NOTE: LoRA is not supported with chunked prefill yet

This was just added; it should be included in the next release, see: #9057
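
(As a quick sanity check, the installed build can be confirmed with a one-liner; a generic sketch, assuming a standard pip install:)

import vllm

# The reporter is on 0.6.4.post1; per the comment above, LoRA with chunked prefill
# is only expected in a later release.
print(vllm.__version__)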

jeejeelee (Collaborator) commented Dec 11, 2024

# - "--num-scheduler-steps=8" # NOTE: LoRA, will always use base model(BUG)

I remember there was a bug here and there was a PR fixing it; see: #9689
