[v0.1] Bench: Reasoning mode evaluation #42

@rootfs

Description

Acceptance

Compare standard mode against reasoning mode on the following metrics:

  • Response correctness on MMLU(-Pro) and non-MMLU test sets
  • Token usage (completion_tokens/prompt_tokens ratio)
  • Response time per output token
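
A minimal sketch of how the three metrics could be aggregated for one mode, assuming each benchmark request is logged with a correctness flag, the `prompt_tokens` and `completion_tokens` counts, and a wall-clock latency. The names `Sample`, `summarize`, and `latency_s` are hypothetical and not part of any existing bench code. Using the ratio of totals rather than the mean of per-request ratios is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One benchmark request (hypothetical record layout)."""
    correct: bool           # did the response match the reference answer?
    prompt_tokens: int      # tokens in the prompt
    completion_tokens: int  # tokens in the response
    latency_s: float        # end-to-end response time in seconds


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Aggregate accuracy, token ratio, and time per output token."""
    if not samples:
        raise ValueError("no samples to summarize")

    total_prompt = sum(s.prompt_tokens for s in samples)
    total_completion = sum(s.completion_tokens for s in samples)
    total_latency = sum(s.latency_s for s in samples)
    if total_prompt == 0 or total_completion == 0:
        raise ValueError("token totals must be non-zero")

    return {
        # fraction of requests answered correctly
        "accuracy": sum(1 for s in samples if s.correct) / len(samples),
        # completion_tokens / prompt_tokens over the whole run
        "token_ratio": total_completion / total_prompt,
        # seconds of response time per generated token
        "time_per_output_token_s": total_latency / total_completion,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    standard = [Sample(True, 100, 50, 1.0), Sample(False, 100, 150, 3.0)]
    print(summarize(standard))
```

Running `summarize` once per mode on the same test set would give two dictionaries that can be compared key by key.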
