Unleashing Reasoning Capability of LLMs
via Scalable Question Synthesis from Scratch

We introduce ScaleQuest, a scalable, cost-effective, and novel data synthesis method that utilizes small-size open-source models to generate questions from scratch without the need for seed data with complex augmentation constraints.

We release two question generator models and four problem-solving models.

Model	Type	MATH	Olympiad Bench	🤗 HuggingFace Download Link
ScaleQuest-DeepSeekMath-7B-QGen	question generator	-	-	link
ScaleQuest-Qwen2-Math-7B-QGen	question generator	-	-	link
Mistral-7B-ScaleQuest	problem solver	62.9	26.8	link
Llama3-8B-ScaleQuest	problem solver	64.4	25.3	link
DeepSeekMath-7B-ScaleQuest	problem solver	66.6	29.9	link
Qwen2-Math-7B-ScaleQuest	problem solver	73.4	38.5	link

This repository contains our complete data synthesis method, including:

Step 0: Requirements

You should install the dependencies:

conda create -n scalequest python=3.11
conda activate scalequest
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Demo Usage

Below is an question generator exmaple using ScaleQuest-Qwen2-Math-7B-QGen

from vllm import LLM, SamplingParams

model_name = "dyyyyyyyy/ScaleQuest-Qwen2-Math-7B-QGen"
pre_query_template = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n"
stop_tokens = ["<|im_start|>", "<|im_end|>", "<|endoftext|>"]
llm = LLM(
    model=model_name,
    tokenizer=model_name,
    tensor_parallel_size=1,
    max_model_len=4096,
    enable_prefix_caching=True,
    trust_remote_code=True,
    swap_space=16,
    gpu_memory_utilization=0.95,
)
sampling_params = SamplingParams(
    n=4,
    max_tokens=1024,
    temperature=1.0,
    top_p=0.99,
    stop=stop_tokens,
)
outputs = llm.generate(pre_query_template, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    for idx, generated_output in enumerate(output.outputs):
        generated_text = generated_output.text
        print(f"Sample {idx + 1}:")
        print(f"Prompt: {prompt!r}")
        print(f"Generated text: {generated_text!r}")
        print("-" * 50)

Below is an problem solver example using Qwen2-Math-7B-ScaleQuest

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dyyyyyyyy/Qwen2-Math-7B-ScaleQuest"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."

sys_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
query_prompt = "<|im_start|>user" + "\n"
# {query}
prompt_after_query = "\n" + "Please reason step by step, and put your final answer within \\boxed{}.<|im_end|>" + "\n"
resp_prompt = "<|im_start|>assistant" + "\n"
prompt_before_resp = ""
# {resp}
delim = "<|im_end|>" + "\n"

prefix_prompt = f"{query_prompt}{question}{prompt_after_query}{resp_prompt}{prompt_before_resp}".rstrip(" ")
full_prompt = sys_prompt + delim.join([prefix_prompt])

# print(full_prompt)

inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

Step 1: Train Query Generators

Training a question generator through question fine-tuning (code in the src/train_question_generator/qft_train folder).
Constructing preference data (code in the src/train_question_generator/question_optim folder) and performing question preference optimization (code in the src/train_question_generator/qpo_train folder).

You can run QFT and QPO by the following command:

cd src/train_question_generator && bash scripts/run_dsmath_qft.sh
cd src/train_question_generator && bash scripts/run_qwen2math_qft.sh

Step 2: Question Synthesis

Using the trained question generator to synthesize questions (code in the src/data_generation folder).
Applying a filtering process to the generated questions (code in the src/data_generation/question_filtering folder).

cd src/data_generation && bash scripts/run.sh

Step 3: Response Synthesis

Generating responses (code in the src/data_generation folder)
applying a reward filtering strategy (code in the src/data_generation/reward_filtering folder).

cd src/data_generation && bash scripts/run.sh

Step 4: Instruction-Tuning & Evaluation

We use DART-Math framework for instruction tuning and evaluation.

Citation

@article{ding2024unleashing,
  title={Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch},
  author={Ding, Yuyang and Shi, Xinyu and Liang, Xiaobo and Li, Juntao and Zhu, Qiaoming and Zhang, Min},
  journal={arXiv preprint arXiv:2410.18693},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
docker		docker
img		img
samples		samples
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Unleashing Reasoning Capability of LLMs
via Scalable Question Synthesis from Scratch

Step 0: Requirements

Demo Usage

Step 1: Train Query Generators

Step 2: Question Synthesis

Step 3: Response Synthesis

Step 4: Instruction-Tuning & Evaluation

Citation

About

Releases

Packages

Contributors 2

Languages

License

yyDing1/ScaleQuest

Folders and files

Latest commit

History

Repository files navigation

Unleashing Reasoning Capability of LLMsvia Scalable Question Synthesis from Scratch

Step 0: Requirements

Demo Usage

Step 1: Train Query Generators

Step 2: Question Synthesis

Step 3: Response Synthesis

Step 4: Instruction-Tuning & Evaluation

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Unleashing Reasoning Capability of LLMs
via Scalable Question Synthesis from Scratch

Packages