
[Model] Add Reasoning Parser for Granite Models #14202

Open · wants to merge 18 commits into main

Conversation

@alex-jw-brooks (Contributor) commented Mar 4, 2025

This PR adds a reasoning parser for Granite 3.2 models! These models have an optional chat template kwarg, thinking, which changes the system prompt to enable reasoning. 😄

The format of the text is expected to be:

Here is my thought process: <reasoning_content> Here is my response: <content>

There have been reports of quantized versions of the model using Here's instead of Here is, though, so this PR matches on both.
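For illustration, here is a minimal sketch of how both prefixes could be matched with a single pattern. The pattern names and helper function below are hypothetical and are not taken from the PR's actual parser implementation:

import re

# Hypothetical patterns: accept either "Here is" or "Here's" before each section marker.
THINK_START = re.compile(r"Here('s| is) my thought process:")
RESPONSE_START = re.compile(r"Here('s| is) my response:")

def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning_content, content); fall back to content only."""
    think = THINK_START.search(text)
    resp = RESPONSE_START.search(text)
    if think and resp:
        return text[think.end():resp.start()].strip(), text[resp.end():].strip()
    return "", text.strip()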

Examples

Start the server with a Granite 3.2 language model that supports reasoning, using the granite parser:

python vllm/entrypoints/openai/api_server.py \
    --device cuda \
    --model ibm-granite/granite-3.2-8b-instruct \
    --tokenizer ibm-granite/granite-3.2-8b-instruct \
    --enable-reasoning \
    --reasoning-parser granite
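
Equivalently, the same reasoning flags should also work with the vllm serve entrypoint, assuming your build exposes it (a sketch, not taken from this PR):

vllm serve ibm-granite/granite-3.2-8b-instruct \
    --enable-reasoning \
    --reasoning-parser granite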

The snippets below are copied from the docs, with the only change being the addition of chat_template_kwargs with thinking=True. Without this, reasoning is disabled and everything is generally parsed into content.

No streaming:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [
    {
        "role": "user",
        "content": "9.11 and 9.8, which is greater?"
    }
]
response = client.chat.completions.create(model=model, messages=messages, extra_body={"chat_template_kwargs": {"thinking": True}})

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)

With streaming:

import json

import requests

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

models = requests.get(
    f"{openai_api_base}/models",
    headers={
        "Authorization": f"Bearer {openai_api_key}"
    },
).json()
model = models["data"][0]["id"]

# Streaming chat completions
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = requests.post(
    f"{openai_api_base}/chat/completions",
    headers={"Authorization": f"Bearer {openai_api_key}"},
    json={
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"thinking": True},
        "stream": True
    },
)



print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False
# Process the streaming response
if response.status_code == 200:
    for line in response.iter_lines():
        if line:  # Filter out keep-alive new lines
            # Decode the line and parse the JSON
            decoded_line = line.decode("utf-8")
            if decoded_line.startswith("data:"):
                data = decoded_line[5:].strip()  # Remove "data:" prefix
                if data == "[DONE]":  # End of stream
                    print("\nclient: Stream completed.")
                    break
                try:
                    # Parse the JSON data
                    chunk = json.loads(data)
                    reasoning_content = chunk["choices"][0]["delta"].get(
                        "reasoning_content", "")
                    content = chunk["choices"][0]["delta"].get("content", "")

                    if reasoning_content:
                        if not printed_reasoning_content:
                            printed_reasoning_content = True
                            print("reasoning_content:", end="", flush=True)
                        print(reasoning_content, end="", flush=True)
                    elif content:
                        if not printed_content:
                            printed_content = True
                            print("\ncontent:", end="", flush=True)
                        # Extract and print the content
                        print(content, end="", flush=True)
                except json.JSONDecodeError:
                    print("Error decoding JSON:", decoded_line)
else:
    print(f"Error: {response.status_code} - {response.text}")

Example output (run from the streaming snippet above)

reasoning_content:
This is a straightforward comparison of two numbers. The task is to determine which is larger: 9.11 or 9.8. 

I need to recall the value of these decimal numbers and compare them. Given both are very close, it requires precise comprehension to understand which has the larger value—specifically focusing on the tenths and hundredths places.


content:

9.8 is greater than 9.11. 

Let's break down the comparison:

- Both numbers are above 9, so we're comparing the decimal parts.
- 9.11 has a '11' in the hundredths place.
- 9.8 has an '80' in the hundredths place, which is larger (even if it's ten times, 80 > 11).

Therefore, 9.8 > 9.11.
client: Stream completed.

github-actions bot commented Mar 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation and frontend labels Mar 4, 2025
@DarkLight1337 DarkLight1337 requested a review from mgoin March 4, 2025 15:28
@mgoin (Member) commented Mar 4, 2025

Nice use of this new feature! Will try in a bit cc @gaocegege

@gaocegege (Contributor) left a comment

Thanks for the contribution!

Could you please rebase onto upstream? In a previous PR to support reasoning outputs in structured outputs (https://github.com/vllm-project/vllm/pull/12955/files#diff-ea8b8ff63961713ccb62d78e53e96404b587b7828cb9fee08a9e5576bf563673R1065), we moved the CLI argument --reasoning-parser to https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L1076, so you may need to add a new choice there.
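
For illustration only, the change in arg_utils.py would roughly amount to adding "granite" to the argument's accepted values; the exact argument definition (defaults, help text) will differ from this sketch:

# Hypothetical excerpt; names and defaults are illustrative, not the actual vLLM code.
parser.add_argument(
    "--reasoning-parser",
    type=str,
    choices=["deepseek_r1", "granite"],  # add "granite" as a new choice
    default=None,
    help="Select the reasoning parser for extracting reasoning content from model output.",
)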

@gaocegege (Contributor)

Hi, I updated the docs in PR #14114.

Maybe you should rebase the docs too. Just FYI.

@@ -19,6 +19,10 @@ def get_reasoner(tokenizer: PreTrainedTokenizer,
         return None
     elif reasoning_backend == "deepseek_r1":
         return DeepSeekReasoner.from_tokenizer(tokenizer)
+    elif reasoning_backend == "granite":
+        logger.warning(
alex-jw-brooks (Contributor, Author) replied:

Adding a warning for now since this is already a large PR, but I think adding a GraniteReasoner for guided decoding could be a follow-up later?

A contributor replied:
SGTM!

@alex-jw-brooks (Contributor, Author)

Awesome thanks @gaocegege! It's been rebased 😄

@alex-jw-brooks alex-jw-brooks requested a review from gaocegege March 6, 2025 08:40
@gaocegege (Contributor) left a comment

Thanks for your contribution! 🎉 👍

@gaocegege (Contributor)

@mgoin Please give it another review, thanks!

    response_start)
reasoning_content = current_text[
    start_reasoning_content:end_reasoning_content]
response_content = current_text[current_chunk_end + 1:]
@b8zhong (Contributor) commented Mar 7, 2025

Suggested change:

-response_content = current_text[current_chunk_end + 1:]
+response_content = current_text[current_chunk_end + 1:]
+parsed_content = True

The parsed_content flag doesn't seem to be updated, so it might be helpful to set it here.
Very minor suggestion, totally optional.

alex-jw-brooks (Contributor, Author) replied:

Hey @b8zhong, thanks for the suggestion! For now, I'd prefer to keep it as is since it returns immediately after parsing the response content. I.e., once this condition is met, there is no need to keep going, so updating the flag won't do anything 🙂

mergify bot commented Mar 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alex-jw-brooks.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 7, 2025
@gaocegege (Contributor)

@alex-jw-brooks Hi, could you please resolve the conflicts?

@alex-jw-brooks (Contributor, Author)

Thanks for the nudge @gaocegege! Rebased 😄

Labels: documentation, frontend, structured-output

5 participants