
Conversation

allozaur
Collaborator

  • Adds support for another thinking format
  • Improves the code architecture for thinking.ts

@allozaur allozaur requested a review from ggerganov September 30, 2025 23:55
@allozaur allozaur self-assigned this Sep 30, 2025
@allozaur allozaur added enhancement New feature or request server/webui labels Sep 30, 2025
@ExtReMLapin
Contributor

Why and how in hell did we end up getting the frontend to parse thinking tags?

The backend returns thinking content inside a dedicated field.

@allozaur
Collaborator Author

allozaur commented Oct 1, 2025

Why and how in hell did we end up getting the frontend to parse thinking tags?

The backend returns thinking content inside a dedicated field.

Parsing thinking content on the frontend has been around since the previous version of the WebUI. It's necessary because there are cases where we get thinking content directly in the message instead of in reasoning_content, which of course is also supported, as you can see below:

Some models do return thinking content in reasoning_content, and the frontend handles this:

const reasoningContent = parsed.choices[0]?.delta?.reasoning_content;

let thinkingContent = $derived.by(() => {
  if (message.role === 'assistant') {
    if (message.thinking) {
      return message.thinking;
    }
    const parsed = parseThinkingContent(message.content);
    return parsed.thinking;
  }
  return null;
});
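For reference, a minimal sketch of what parseThinkingContent could look like (illustrative only, assuming a single <think>…</think> pair; the real helper in thinking.ts supports multiple tag formats and partially streamed tags):

// Minimal sketch, not the actual thinking.ts implementation
export function parseThinkingContent(content: string | null): {
  thinking: string | null;
  cleanedContent: string | null;
} {
  if (!content) return { thinking: null, cleanedContent: content };
  const match = content.match(/<think>([\s\S]*?)<\/think>/);
  if (!match) return { thinking: null, cleanedContent: content };
  return {
    thinking: match[1].trim(),
    // Strip the reasoning block so only the final answer is rendered
    cleanedContent: content.replace(match[0], '').trim()
  };
}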

onReasoningChunk: (reasoningChunk: string) => {
  streamedReasoningContent += reasoningChunk;
  const messageIndex = this.findMessageIndex(assistantMessage.id);
  this.updateMessageAtIndex(messageIndex, { thinking: streamedReasoningContent });
},
onComplete: async (
  finalContent?: string,
  reasoningContent?: string,
  timings?: ChatMessageTimings
) => {
  slotsService.stopStreaming();
  await DatabaseStore.updateMessage(assistantMessage.id, {
    content: finalContent || streamedContent,
    thinking: reasoningContent || streamedReasoningContent,
    timings: timings
  });
}
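For context, a hedged sketch of how these callbacks could be fed from the OpenAI-compatible SSE stream (the consumeSseBody name and callback wiring are assumptions, not the actual chat service code):

// Illustrative SSE consumer; the delta field names follow the
// OpenAI-compatible streaming schema, the rest is assumed plumbing
function consumeSseBody(
  sseBody: string,
  onChunk: (content: string) => void,
  onReasoningChunk: (reasoning: string) => void
): void {
  for (const line of sseBody.split('\n')) {
    if (!line.startsWith('data: ') || line.trim() === 'data: [DONE]') continue;
    const parsed = JSON.parse(line.slice('data: '.length));
    const delta = parsed.choices?.[0]?.delta;
    if (delta?.reasoning_content) onReasoningChunk(delta.reasoning_content);
    if (delta?.content) onChunk(delta.content);
  }
}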

@ExtReMLapin
Contributor

So it's some kind of "Not implemented on backend (cpp code), but faster to implement on frontend"?

@ggerganov
Member

So it's some kind of "Not implemented on backend (cpp code), but faster to implement on frontend"?

Yes.

content.includes('<think>') ||
content.includes('[THINK]') ||
THINKING_FORMATS.some((format) => content.includes(format.startTag)) ||
content.includes('<|channel|>analysis')
Member

Maybe gpt-oss handling on the client is not needed anymore. At least when I test with curl, it seems the reasoning is correctly parsed into reasoning_content. So maybe this can be removed.

Contributor

Hi Georgi: I checked the new Harmony/GPT-OSS path. The web UI now asks for reasoning_format: "auto" and consumes the streamed delta.reasoning_content field directly, persisting it as message.thinking; the backend parser routes the <|channel|>analysis blocks into that field for us. So the extra client-side check for <|channel|>analysis is redundant at this point and can be dropped. I'll run some tests tonight to confirm.
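To illustrate the flow described above, a sketch of the request shape (values are hypothetical, and whether reasoning_format is honored per-request depends on the server build, so treat this as an assumption to verify):

// Sketch of the WebUI request; reasoning_format: 'auto' asks the backend
// to route <|channel|>analysis blocks into delta.reasoning_content
const response = await fetch('/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [{ role: 'user', content: 'Why is the sky blue?' }],
    stream: true,
    reasoning_format: 'auto'
  })
});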

Member

Thanks for the support. Yes, this is what I figured as well, though my understanding of the chat parsing logic is very rudimentary. Extra eyes on this are appreciated.

…ives

- Captured inline <think> segments during streaming, forwarding them to the reasoning UI while keeping the cleaned assistant message stream intact
- Tracked when explicit reasoning_content chunks arrive so inline capture is skipped once the server provides dedicated reasoning updates
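A sketch of the inline-capture idea described in this commit (names like emitContent/emitReasoning are illustrative, and real code must also handle tags split across chunk boundaries):

// Simplified inline <think> capture; once a dedicated reasoning_content
// chunk has arrived, inline parsing is skipped for the rest of the stream
let sawReasoningField = false;
let insideInlineThink = false;

function handleContentChunk(
  chunk: string,
  emitContent: (c: string) => void,
  emitReasoning: (r: string) => void
) {
  if (sawReasoningField) {
    emitContent(chunk); // server already separates reasoning, pass through
    return;
  }
  if (chunk.includes('<think>')) insideInlineThink = true;
  if (insideInlineThink) {
    emitReasoning(chunk.replace(/<\/?think>/g, ''));
    if (chunk.includes('</think>')) insideInlineThink = false;
  } else {
    emitContent(chunk);
  }
}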
@ServeurpersoCom
Contributor

ServeurpersoCom commented Oct 1, 2025

Your PR already improves on the old solution.
On top of that, commit 64156f5 adds proper support for GLM 4.5, which streams inline <think> segments without \n (unlike Qwen).

@ServeurpersoCom
Contributor

Tested with GPT-OSS-120B, Qwen3 A3B Thinking, and GLM 4.5 Air on 851b022.
Confirmed: no need for content.includes('<|channel|>analysis') anymore.

@ServeurpersoCom
Contributor

ServeurpersoCom commented Oct 1, 2025

There's still one tricky edge case that isn't handled: some models expect the <think> tag to already be opened in the system prompt to start the chain of thought, and they were only trained to close it. With SFT done that way, compatibility with other models wasn't really a consideration, because on the very first chunk, how do you know whether it's reasoning or the final answer? That means we'd have to hook into /props / the Jinja template again just to propagate the extra info, but that feels like a brittle workaround and doesn't really align with the spirit of OpenAI-Compat.

But this is not really a regression: I have not seen any WebUI handle it correctly so far. Alternatively, upon detecting the </think>, we could retroactively render the preceding text as a "thinking block" at the start of the streamed final answer, as sketched below.
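A sketch of that retroactive approach (names are illustrative): if a closing </think> shows up before any opening tag was seen, reclassify everything streamed so far as reasoning.

// Hypothetical handling for models trained with <think> pre-opened in
// the prompt, which therefore only ever emit the closing tag
let buffered = '';
let sawOpenTag = false;

function onToken(token: string, moveToThinkingBlock: (t: string) => void) {
  buffered += token;
  if (buffered.includes('<think>')) sawOpenTag = true;
  const idx = buffered.indexOf('</think>');
  if (!sawOpenTag && idx !== -1) {
    // Everything before the bare closing tag was actually chain-of-thought
    moveToThinkingBlock(buffered.slice(0, idx));
    buffered = buffered.slice(idx + '</think>'.length);
  }
}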

@ExtReMLapin
Contributor

From memory, that's what the Qwen3 2507 Thinking model does: it doesn't open the tag, so we have to consider it already opened.
