# Acceptance

Compare standard vs. reasoning mode using:

- [ ] Response correctness on MMLU(-Pro) and non-MMLU test sets
- [ ] Token usage (completion_tokens/prompt_tokens ratio)
- [ ] Response time per output token
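
A minimal sketch of how the three criteria above might be aggregated per mode. It assumes each request has already been graded and timed. The `RunRecord` type, its field names, and `summarize` are hypothetical helpers for illustration, not part of any existing harness. `prompt_tokens` and `completion_tokens` are assumed to come from the usage data the API returns with each response.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One graded request (hypothetical record layout)."""
    correct: bool             # did the answer match the reference?
    prompt_tokens: int        # input tokens reported for the request
    completion_tokens: int    # output tokens reported for the request
    elapsed_seconds: float    # wall-clock time for the request


def summarize(records):
    """Aggregate the three acceptance metrics over a list of RunRecord."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")

    total_prompt = sum(r.prompt_tokens for r in records)
    total_completion = sum(r.completion_tokens for r in records)
    total_time = sum(r.elapsed_seconds for r in records)

    return {
        # fraction of responses graded as correct
        "accuracy": sum(r.correct for r in records) / n,
        # completion_tokens / prompt_tokens, pooled over all requests
        "token_ratio": (total_completion / total_prompt
                        if total_prompt else float("nan")),
        # total wall-clock time divided by total output tokens
        "seconds_per_output_token": (total_time / total_completion
                                     if total_completion else float("nan")),
    }
```

Calling `summarize` once per mode and per test set (MMLU(-Pro) and non-MMLU) gives directly comparable numbers. Pooling tokens and time across requests, rather than averaging per-request ratios, keeps very short responses from dominating the result. That is a design choice, not a requirement.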