Merge Eval Branch to Main #2

Merged: 26 commits merged into main from the eval branch on Jan 6, 2025.
101 changes: 85 additions & 16 deletions README.md
@@ -9,22 +9,23 @@


<div align="center">
<img src="https://github.com/user-attachments/assets/d91b1b5d-c932-402c-b86d-2846620a68b0" width="800"/>
<img src="assets/pass-at-k-v-s-greedy-g-pass-at-k.png" width="800"/>
</div>

<!-- [🏰[Project Page](https://github.com/open-compass/GPassK/)]
[📚[LeaderBoard](https://github.com/open-compass/GPassK/index.html)] -->

## 🚀 News
- **[2024.12.18]** We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of GPassK. 🎉🎉🎉
- **[2025.1.6]** 🔥 **[LiveMathBench](https://huggingface.co/datasets/opencompass/LiveMathBench)** can now be accessed through Hugging Face, and you can evaluate your LLMs on it using G-Pass@k in OpenCompass. We have addressed potential errors in LiveMathBench and inconsistencies in the sampling parameters. Please also refer to the updated version of the **[Paper](http://arxiv.org/abs/2412.13147)** for further details.
- **[2024.12.18]** We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of G-Pass@k. 🎉🎉🎉


## ☀️Introduction

**G-Pass@k** is a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both a model's peak performance potential and its stability. It comes with **LiveMathBench**, a dynamic benchmark of challenging, contemporary mathematical problems designed to minimize data-leakage risks during evaluation. To track the latest performance and stability of LLMs, we will keep updating the benchmark with new competition-level mathematical problems and provide up-to-date results for models on the benchmark with G-Pass@k.


## 🌲 Definition of GPassK
## 🌲 Definition of G-Pass@k
$$ \text{G-Pass@}k = \mathbb{E}_{\text{Questions}} \left[ \frac{{c \choose k}}{{n \choose k}} \right] $$

where $n$ represents the total number of generations per question, and $c$ denotes the number
@@ -42,27 +43,95 @@
Intuitively, $\text{mG-Pass@}k$ provides an interpolated estimate of the area un
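
For intuition, here is a minimal sketch (not from the repository) of how the estimator displayed above could be computed with Python's standard library; the generation counts below are hypothetical:

```python
from math import comb

def g_pass_at_k(n: int, c: int, k: int) -> float:
    """C(c, k) / C(n, k): probability that k samples drawn without
    replacement from n generations (c of them correct) are all correct.
    math.comb returns 0 when c < k, so the ratio is 0 in that case."""
    return comb(c, k) / comb(n, k)

# Hypothetical per-question correct counts out of n = 16 generations each.
correct_counts = [12, 16, 5, 0]
k = 8
score = sum(g_pass_at_k(16, c, k) for c in correct_counts) / len(correct_counts)
print(f"G-Pass@{k} = {score:.4f}")
```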
*LiveMathBench-202412 version*

<div align="center">
<img src="https://github.com/user-attachments/assets/0e5d57c6-7fec-475e-acbe-cfa6aa2088cb" width="800"/>
<img src="assets/performance.png" width="800"/>
</div>


## 🖋Use GPassK in OpenCompass
## 🖋Use G-Pass@k in OpenCompass
[OpenCompass](https://github.com/open-compass/opencompass) is a toolkit for evaluating the performance of large language models (LLMs). To use G-Pass@k in OpenCompass, you can follow the steps below:
```python
Coming Soon...
```

### 1. Prepare Environment
Follow these steps to ensure your environment is ready:

```bash
# Clone the main repository
git clone https://github.com/open-compass/GPassK.git
cd GPassK

# Create and activate a conda environment with specific Python and PyTorch versions
conda create -n livemathbench-eval python=3.10 pytorch torchvision torchaudio pytorch-cuda -c nvidia -c pytorch -y
conda activate livemathbench-eval

# Install additional required packages
pip install loguru

# Clone and install OpenCompass for extended functionality
git clone https://github.com/open-compass/opencompass.git opencompass
cd opencompass
pip install -e .
```
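
If you want a quick check that the editable install succeeded (an optional step, not in the original instructions), importing the package from Python should work:

```python
# Optional sanity check: the import should resolve to the cloned opencompass/
# directory installed above with `pip install -e .`.
import opencompass

print(opencompass.__file__)
```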


### 2. Prepare Dataset
The LiveMathBench dataset can be obtained from Hugging Face. You must first be granted access to the dataset at the following link: [huggingface](https://huggingface.co/datasets/opencompass/LiveMathBench).
Then, refer to [security-tokens](https://huggingface.co/docs/hub/security-tokens) to set up your HF tokens.
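
Once access is granted and your token is configured, you can optionally verify that the dataset is reachable. The snippet below is a hypothetical check using the `datasets` library and an `HF_TOKEN` environment variable; a specific config/subset name may also be required depending on the dataset layout:

```python
import os
from datasets import load_dataset

# Assumes HF_TOKEN holds a token for an account that has been granted access
# to the gated dataset; a config/subset name may be needed as well.
ds = load_dataset("opencompass/LiveMathBench", token=os.environ["HF_TOKEN"])
print(ds)
```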


### 3. Deploy Judge Models
We use Qwen2.5-72B-Instruct as the judge model for checking the correctness of generated answers. We recommend deploying it as a service with tools such as [vllm](https://github.com/vllm-project/vllm) or [lmdeploy](https://github.com/InternLM/lmdeploy) so that it can be invoked by different evaluation tasks.

Below is an example configuration for deploying the judge model using `lmdeploy`:
```bash
# --tp 4: at least 4 A100 or equivalent GPUs are required
lmdeploy serve api_server Qwen/Qwen2.5-72B-Instruct --server-port 8000 \
    --tp 4 \
    --cache-max-entry-count 0.9 \
    --log-level INFO
```
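
`lmdeploy serve api_server` exposes an OpenAI-compatible endpoint, so once the server is up you can sanity-check it before wiring it into the configs. Below is a sketch using the `openai` client, assuming the port above; the model name should match whatever `/v1/models` reports:

```python
from openai import OpenAI

# Point the OpenAI-compatible client at the locally deployed judge model.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")

# List the served model name(s), then send a trivial request.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",  # use the name reported by /v1/models
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```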
After setting up the judge model, set the URLs in `eval_urls` within `opencompass_config_templates/*.py`. Adjust other parameters such as `k`, `temperatures`, and `llm_infos` according to your needs.

> ❗️Note that omitting `eval_urls` will fall back to an internal rule-based judge, which may only be suitable for datasets with numerical answers.
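
For orientation, the edit described above might look roughly like the following inside one of the `opencompass_config_templates/*.py` files. This is a hypothetical sketch: the variable names come from the text above, but their exact shapes and value types are defined by the actual templates.

```python
# Hypothetical sketch only -- check the real template for the expected structure.
eval_urls = [
    "http://127.0.0.1:8000/v1",  # judge-model endpoint(s) deployed in step 3
]

k = [4, 8, 16]                       # k values for G-Pass@k
temperatures = [0.3, 0.5, 0.7, 1.0]  # sampling temperatures to sweep

llm_infos = [
    # entries describing the models under evaluation (name, path, ...)
]
```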

### 4. Evaluation

To begin the evaluation, first generate the necessary configuration files by running the following script:
```bash
python save_opencompass_configs.py --config_template_file {opencompass_config_templates/nono1.py|opencompass_config_templates/o1.py}
```

Upon execution, verify the generated configuration files located in `opencompass_configs/`:

```
.
├── deepseek-math-7b-rl_t0-3_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
├── deepseek-math-7b-rl_t0-5_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
├── deepseek-math-7b-rl_t0-7_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
├── deepseek-math-7b-rl_t1-0_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
```

These files follow a naming convention that encodes the model settings and dataset used:
```
[MODEL_ABBR]_t[TEMPERATURE]_p[TOP_P]_k[TOP_K]_rp[REPETITION_PENALTY]_rs[RANDOM_SEED]_l[MAX_OUT_LEN]@[DATASET_ABBR]-k[LIST_OF_K]-r[REPLICATION].py
```
For example, `deepseek-math-7b-rl_t0-3_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py` corresponds to temperature 0.3, top-p 0.8, top-k 50, repetition penalty 1.0, random seed 42, and a maximum output length of 8192 tokens, evaluated on LiveMathBench-v202412 with k = 4, 8, 16 and 3 replications.

With the configurations prepared, initiate the evaluation process with the commands below:

```bash
cd GPassK
conda activate livemathbench-eval
python opencompass/run.py {path/to/config_file} \
    -w ./opencompass_outputs/ \
    --dump-eval-details
```
Refer to the OpenCompass documentation for additional arguments that may enhance your evaluation experience.


## Citation and Tech Report
If you use GPassK in your research, please cite the following paper:
If you use G-Pass@k in your research, please cite the following paper:
```
@misc{liu2024llmscapablestablereasoning,
title={Are Your LLMs Capable of Stable Reasoning?},
author={Junnan Liu and Hongwei Liu and Linchen Xiao and Ziyi Wang and Kuikun Liu and Songyang Gao and Wenwei Zhang and Songyang Zhang and Kai Chen},
year={2024},
eprint={2412.13147},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.13147},
@article{liu2024your,
title={Are Your LLMs Capable of Stable Reasoning?},
author={Liu, Junnan and Liu, Hongwei and Xiao, Linchen and Wang, Ziyi and Liu, Kuikun and Gao, Songyang and Zhang, Wenwei and Zhang, Songyang and Chen, Kai},
journal={arXiv preprint arXiv:2412.13147},
year={2024}
}
```
Binary file added assets/pass-at-k-v-s-greedy-g-pass-at-k.png
Binary file added assets/performance.png
20 changes: 20 additions & 0 deletions docs/LiveMathBench-A.csv
@@ -0,0 +1,20 @@
Model,Greedy,G-Pass@16_0.5,G-Pass@16_0.75,G-Pass@16_1.0,mG-Pass@16,link,opensourced,mathLM,o1-like
Llama-3.1-8B-Instruct,24.0,18.2,11.3,4.55,10.4,https://github.com/facebookresearch/llama,TRUE,FALSE,FALSE
Llama-3.1-70B-Instruct,29.8,30.0,22.2,12.5,20.8,https://github.com/facebookresearch/llama,TRUE,FALSE,FALSE
Llama-3.3-70B-Instruct,40.3,36.2,28.9,19.1,27.5,https://github.com/facebookresearch/llama,TRUE,FALSE,FALSE
Qwen2.5-7B-Instruct,37.0,36.5,27.2,16.0,25.8,https://github.com/QwenLM/Qwen,TRUE,FALSE,FALSE
Qwen2.5-32B-Instruct,50.8,48.3,39.5,28.6,38.1,https://github.com/QwenLM/Qwen,TRUE,FALSE,FALSE
Qwen2.5-72B-Instruct,51.7,47.3,39.6,29.0,37.8,https://github.com/QwenLM/Qwen,TRUE,FALSE,FALSE
DeepSeek-V2.5-1210,38.7,38.9,27.9,17.3,26.7,https://github.com/deepseek-ai/DeepSeek-LLM,TRUE,FALSE,FALSE
DeepSeek-V3.0-Chat,55.0,59.5,49.9,35.0,47.9,https://github.com/deepseek-ai/DeepSeek-V3,TRUE,FALSE,FALSE
Mistral-Large-Instruct-2411-123B,41.6,39.4,37.1,32.9,36.4,https://example.com/mistral,TRUE,FALSE,FALSE
Gemini-1.5-Pro-Latest,59.1,55.9,47.3,31.0,44.3,https://example.com/gemini,FALSE,FALSE,FALSE
Claude-3.5-Sonnet,46.7,44.1,36.2,26.6,35.3,https://docs.anthropic.com/claude/docs/models-overview,FALSE,FALSE,FALSE
GPT-4o-2024-11-20,44.8,41.9,32.9,22.2,31.6,https://openai.com/research/gpt-4,FALSE,FALSE,FALSE
DeepSeek-Math-7B-RL,23.5,19.8,14.0,9.7,13.7,https://github.com/deepseek-ai/DeepSeek-LLM,TRUE,TRUE,FALSE
NuminaMath-72B-CoT,40.8,34.0,27.1,14.2,25.0,https://example.com/numinamath,TRUE,TRUE,FALSE
Qwen2.5-Math-7B-Instruct,44.1,44.1,38.3,28.1,36.6,https://github.com/QwenLM/Qwen,TRUE,TRUE,FALSE
Qwen2.5-Math-72B-Instruct,57.6,52.7,45.4,27.9,42.3,https://github.com/QwenLM/Qwen,TRUE,TRUE,FALSE
Skywork-o1-8B,45.4,39.3,31.9,21.7,30.4,https://example.com/skywork,TRUE,FALSE,TRUE
QwQ-32B-Preview,72.7,74.9,65.8,40.1,61.2,https://example.com/qwq,TRUE,FALSE,TRUE
OpenAI o1-mini,74.1,76.3,67.3,48.3,64.8,https://openai.com/research/o1,FALSE,FALSE,TRUE