We introduce FINEREASON, a novel logic-puzzle benchmark designed to comprehensively evaluate the reasoning capabilities of LLMs. Current benchmarks focus primarily on final-answer accuracy, overlooking whether models can reflect on and correct errors during the reasoning process.
📌 Unlike existing benchmarks, FINEREASON delves into intermediate reasoning steps, specifically emphasizing state checking and state transition actions. These capture abilities such as reflection, lookahead, and backtracking, which are hallmarks of human-like System 2 reasoning (a conceptual sketch follows these highlights).
📈 Experiments reveal significant limitations in deep reasoning tasks, even for leading models like Gemini-2.0-Flash-Thinking, highlighting substantial room for improvement.
🚀 Training on puzzle-based data enhances performance on broader mathematical tasks, e.g., a 5.1% accuracy improvement on GSM8K, demonstrating the potential of puzzle data to boost general reasoning capabilities.
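For intuition, the sketch below illustrates the two atomic actions FINEREASON evaluates. It is not the repository's code, just a minimal Python illustration on Sudoku, assuming a 9x9 board encoded as a flat list of 81 integers with 0 for empty cells: state checking asks whether an intermediate state can still reach a solution, and state transition asks for one valid next state.

```python
# Conceptual sketch only (not the repository's code). Assumes a 9x9 Sudoku
# board encoded as a flat list of 81 ints, with 0 marking an empty cell.

def candidates(board, idx):
    """Digits that can legally fill cell `idx` under row/column/box rules."""
    row, col = divmod(idx, 9)
    used = {board[row * 9 + i] for i in range(9)}          # same row
    used |= {board[i * 9 + col] for i in range(9)}         # same column
    br, bc = 3 * (row // 3), 3 * (col // 3)
    used |= {board[r * 9 + c] for r in range(br, br + 3)
             for c in range(bc, bc + 3)}                   # same 3x3 box
    return [d for d in range(1, 10) if d not in used]

def check_state(board):
    """State checking: can this intermediate state still reach a solution?
    Brute-force backtracking stands in for the model's lookahead."""
    if 0 not in board:
        return True                    # no empty cells left: treat as solved
    empty = board.index(0)
    for d in candidates(board, empty):
        board[empty] = d
        solvable = check_state(board)
        board[empty] = 0               # undo the move (backtracking)
        if solvable:
            return True
    return False

def transition(board):
    """State transition: return one legal next state that stays solvable,
    or None if the current state is a dead end."""
    if 0 not in board:
        return None
    empty = board.index(0)
    for d in candidates(board, empty):
        nxt = board[:]
        nxt[empty] = d
        if check_state(nxt):
            return nxt
    return None
```

In the benchmark, the model itself is asked to make these judgments on given puzzle states, rather than relying on a solver.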
```bash
conda create -n fine-reason python=3.10 -y
conda activate fine-reason
pip install -r requirements.txt
```
- Insert your OpenAI API key into the file `openai_key.json`.
- Insert your Gemini API key into the file `gemini_key.json`.
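The exact JSON structure these files should contain is not documented here; as a minimal sketch, assuming each file holds a single hypothetical `key` field, they could be created like so:

```python
# Hypothetical schema: the repository may expect different field names.
import json

with open("openai_key.json", "w") as f:
    json.dump({"key": "YOUR_OPENAI_API_KEY"}, f)

with open("gemini_key.json", "w") as f:
    json.dump({"key": "YOUR_GEMINI_API_KEY"}, f)
```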
To run Sudoku state checking using Gemini-2.0-Flash-Thinking:
```bash
python main.py evaluate \
  --data_name sudoku_states \
  --prompter_name sudoku_state_checking \
  --scorer_name state_checking_accuracy \
  --model_name gemini_flash_thinking
```
To run Sudoku state transition using Qwen-2.5-72B-Instruct with a `max_output_length` of 2048:

```bash
python main.py evaluate \
  --data_name sudoku_states \
  --prompter_name sudoku_state_transition \
  --scorer_name state_transition_accuracy \
  --model_name qwen \
  --path_model Qwen/Qwen2.5-72B-Instruct \
  --max_output_length 2048
```
To run end-to-end evaluation using OpenAI's o1:
```bash
python main.py evaluate \
  --data_name sudoku \
  --prompter_name sudoku_e2e \
  --model_name o1
```