
[Model] Add Reasoning Parser for Granite Models #14202

Open · wants to merge 18 commits into main

Conversation

@alex-jw-brooks (Contributor) commented Mar 4, 2025

This PR adds a reasoning parser for Granite 3.2 models! These models have an optional chat template kwarg, thinking, which changes the system prompt to enable reasoning. 😄

The format of the text is expected to be:

Here is my thought process: <reasoning_content> Here is my response: <content>

There have been reports of quantized versions of the model using Here's instead of Here is, though, so this PR matches on both.
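For illustration, here is a minimal sketch of how both prefixes could be matched with a single pattern. The pattern names and helper function below are hypothetical and are not taken from the PR's actual parser implementation:

import re

# Hypothetical patterns: accept either "Here is" or "Here's" before each section marker.
THINK_START = re.compile(r"Here('s| is) my thought process:")
RESPONSE_START = re.compile(r"Here('s| is) my response:")

def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning_content, content); fall back to content only."""
    think = THINK_START.search(text)
    resp = RESPONSE_START.search(text)
    if think and resp:
        return text[think.end():resp.start()].strip(), text[resp.end():].strip()
    return "", text.strip()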

Examples

Start the server with a Granite 3.2 language model that supports reasoning, using the granite parser:

python vllm/entrypoints/openai/api_server.py \
    --device cuda \
    --model ibm-granite/granite-3.2-8b-instruct \
    --tokenizer ibm-granite/granite-3.2-8b-instruct \
    --enable-reasoning \
    --reasoning-parser granite
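
Equivalently, the same reasoning flags should also work with the vllm serve entrypoint, assuming your build exposes it (a sketch, not taken from this PR):

vllm serve ibm-granite/granite-3.2-8b-instruct \
    --enable-reasoning \
    --reasoning-parser granite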

The snippets below are copied from the docs, with the only change being the addition of chat_template_kwargs with thinking=True. Without this, reasoning is disabled and everything is generally parsed into content.

No streaming:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [
    {
        "role": "user",
        "content": "9.11 and 9.8, which is greater?"
    }
]
response = client.chat.completions.create(model=model, messages=messages, extra_body={"chat_template_kwargs": {"thinking": True}})

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)

With streaming:

import json

import requests

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

models = requests.get(
    f"{openai_api_base}/models",
    headers={
        "Authorization": f"Bearer {openai_api_key}"
    },
).json()
model = models["data"][0]["id"]

# Streaming chat completions
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = requests.post(
    f"{openai_api_base}/chat/completions",
    headers={"Authorization": f"Bearer {openai_api_key}"},
    json={
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"thinking": True},
        "stream": True
    },
)



print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False
# Process the streaming response
if response.status_code == 200:
    for line in response.iter_lines():
        if line:  # Filter out keep-alive new lines
            # Decode the line and parse the JSON
            decoded_line = line.decode("utf-8")
            if decoded_line.startswith("data:"):
                data = decoded_line[5:].strip()  # Remove "data:" prefix
                if data == "[DONE]":  # End of stream
                    print("\nclient: Stream completed.")
                    break
                try:
                    # Parse the JSON data
                    chunk = json.loads(data)
                    reasoning_content = chunk["choices"][0]["delta"].get(
                        "reasoning_content", "")
                    content = chunk["choices"][0]["delta"].get("content", "")

                    if reasoning_content:
                        if not printed_reasoning_content:
                            printed_reasoning_content = True
                            print("reasoning_content:", end="", flush=True)
                        print(reasoning_content, end="", flush=True)
                    elif content:
                        if not printed_content:
                            printed_content = True
                            print("\ncontent:", end="", flush=True)
                        # Extract and print the content
                        print(content, end="", flush=True)
                except json.JSONDecodeError:
                    print("Error decoding JSON:", decoded_line)
else:
    print(f"Error: {response.status_code} - {response.text}")

Example output (run from the streaming snippet above)

reasoning_content:
This is a straightforward comparison of two numbers. The task is to determine which is larger: 9.11 or 9.8. 

I need to recall the value of these decimal numbers and compare them. Given both are very close, it requires precise comprehension to understand which has the larger value—specifically focusing on the tenths and hundredths places.


content:

9.8 is greater than 9.11. 

Let's break down the comparison:

- Both numbers are above 9, so we're comparing the decimal parts.
- 9.11 has a '11' in the hundredths place.
- 9.8 has an '80' in the hundredths place, which is larger (even if it's ten times, 80 > 11).

Therefore, 9.8 > 9.11.
client: Stream completed.

github-actions bot commented Mar 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation and frontend labels Mar 4, 2025
@DarkLight1337 DarkLight1337 requested a review from mgoin March 4, 2025 15:28
@mgoin (Member) commented Mar 4, 2025

Nice use of this new feature! Will try in a bit cc @gaocegege

@gaocegege (Contributor) left a comment

Thanks for the contribution!

Could you please rebase onto upstream? In a previous PR to support reasoning outputs in structured outputs (https://github.com/vllm-project/vllm/pull/12955/files#diff-ea8b8ff63961713ccb62d78e53e96404b587b7828cb9fee08a9e5576bf563673R1065), we moved the CLI argument --reasoning-parser to https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L1076, so you may need to add a new choice there.
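
For illustration only, the change in arg_utils.py would roughly amount to adding "granite" to the argument's accepted values; the exact argument definition (defaults, help text) will differ from this sketch:

# Hypothetical excerpt; names and defaults are illustrative, not the actual vLLM code.
parser.add_argument(
    "--reasoning-parser",
    type=str,
    choices=["deepseek_r1", "granite"],  # add "granite" as a new choice
    default=None,
    help="Select the reasoning parser for extracting reasoning content from model output.",
)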

@gaocegege (Contributor)

Hi, I updated the docs in PR #14114.

Maybe you should rebase the docs too. Just FYI.

@@ -19,6 +19,10 @@ def get_reasoner(tokenizer: PreTrainedTokenizer,
         return None
     elif reasoning_backend == "deepseek_r1":
         return DeepSeekReasoner.from_tokenizer(tokenizer)
+    elif reasoning_backend == "granite":
+        logger.warning(
alex-jw-brooks (Contributor, Author) replied:

Adding a warning for now since this is already a large PR, but I think adding a GraniteReasoner for guided decoding could be a follow-up later?

A contributor replied:
SGTM!

@alex-jw-brooks (Contributor, Author)

Awesome thanks @gaocegege! It's been rebased 😄

@alex-jw-brooks alex-jw-brooks requested a review from gaocegege March 6, 2025 08:40
@gaocegege (Contributor) left a comment

Thanks for your contribution! 🎉 👍

@gaocegege (Contributor)

@mgoin Please give it another review, thanks!

    response_start)
reasoning_content = current_text[
    start_reasoning_content:end_reasoning_content]
response_content = current_text[current_chunk_end + 1:]
@b8zhong (Contributor) commented Mar 7, 2025

Suggested change:

-response_content = current_text[current_chunk_end + 1:]
+response_content = current_text[current_chunk_end + 1:]
+parsed_content = True

The parsed_content flag doesn't seem to be updated, so it might be helpful to set it here.
Very minor suggestion, totally optional.

alex-jw-brooks (Contributor, Author) replied:

Hey @b8zhong, thanks for the suggestion! For now, I'd prefer to keep it as is since it returns immediately after parsing the response content. I.e., once this condition is met, there is no need to keep going, so updating the flag won't do anything 🙂

mergify bot commented Mar 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alex-jw-brooks.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 7, 2025
@gaocegege (Contributor)

@alex-jw-brooks Hi, could you please resolve the conflicts?

@alex-jw-brooks (Contributor, Author)

Thanks for the nudge @gaocegege! Rebased 😄

Labels: documentation, frontend, structured-output

5 participants