Eval bug: Qwen3-30b-a3b output stunted from b6793 #16709

@mitchell

Description

Name and Version

$ ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6793 (38355c6)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu

Built with:

cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_NATIVE=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_HIP_GRAPHS=ON

This also occurs with the HIP build on Windows using the same hardware.

Operating systems

Linux (and Windows)

GGML backends

HIP

Hardware

Radeon RX 7900 XTX

Models

Qwen3-30b-a3b-thinking-2507 Q4_K_XL (Unsloth)

Problem description & steps to reproduce

When I run

llama-server \
        --threads 12 \
        --gpu-layers 99 \
        --flash-attn auto \
        --jinja \
        --hf-repo unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_XL \
        --ctx-size 40960 \
        --temp 0.6 \
        --top-k 20 \
        --top-p 0.95 \
        --min-p 0.0 \
        --ubatch-size 2048

on b6792, output is as expected: responses are well thought-out, detailed, and somewhat lengthy.

When I run the same command on b6793, I get shorter answers with less accurate and less detailed information, and the model is also less inclined to format its output with Markdown.
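For a like-for-like comparison between the two builds, the same request can be replayed against each server's OpenAI-compatible endpoint. A sketch (the prompt is a placeholder; the port assumes llama-server's default of 8080 with no --port override):

```shell
# Replay an identical request against a running llama-server instance;
# run once against b6792 and once against b6793, then diff the replies.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain how attention works in a transformer."}],
        "temperature": 0.6,
        "top_k": 20,
        "top_p": 0.95,
        "min_p": 0.0
      }'
```

With sampling enabled the replies will not match token-for-token, but the difference in length and Markdown use between builds is consistent across prompts.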

First Bad Commit

38355c6
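Since b6792 and b6793 are consecutive release tags, the suspect commit can be confirmed directly rather than bisected. A sketch, assuming a local llama.cpp checkout with the release tags fetched:

```shell
# List commits between the last known-good and first bad release tags;
# a single entry here pinpoints the regression without a full git bisect.
git log --oneline b6792..b6793
```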

Relevant log output

N/A, logs look normal.
