Commit 1d284d6 (parent d8f0394): 1 changed file with 26 additions and 26 deletions.

# CriticEval: Evaluating Large Language Models as Critic

This repository is the official implementation of [CriticEval](https://arxiv.org/abs/2402.13764), a comprehensive benchmark for evaluating the critique ability of LLMs.

## Introduction

**[CriticEval: Evaluating Large Language Models as Critic](https://arxiv.org/abs/2402.13764)**
<br/>
Tian Lan<sup>1*</sup>,
Wenwei Zhang<sup>2*</sup>,
Xian-ling Mao<sup>1†</sup>

[[Subjective LeaderBoard](https://open-compass.github.io/CriticBench/leaderboard_subjective.html)]
[[Objective LeaderBoard](https://open-compass.github.io/CriticBench/leaderboard_objective.html)]

> Critique ability is crucial to the scalable oversight and self-improvement of Large Language Models (LLMs). While many recent studies explore the critique ability of LLMs to judge and refine flaws in generations, how to comprehensively and reliably measure the critique abilities of LLMs remains under-explored. This paper introduces <b>CriticEval</b>, a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of LLMs: feedback, comparison, refinement, and meta-feedback. <b>CriticEval</b> encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity. Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between critique ability and tasks, response qualities, and model scales.

<img src="./figs/overview.png" alt="overview" align=center />

## What's New

* **[2024.2.21]** Released the paper, code, data, and other resources of CriticEval v1.3.

## Next

- [ ] Evaluate the Qwen-1.5 series models
- [ ] Improve the reliability of subjective evaluation in CriticEval (v1.4)
- [ ] Expand to more diverse tasks
- [ ] Expand to Chinese applications
- [ ] Prepare and clean the codebase for OpenCompass
- [ ] Release the training set of CriticEval
- [ ] Support inference on OpenCompass

## Quick Start

Download the dataset from the [Hugging Face dataset](https://huggingface.co/datasets/opencompass/CriticEval):

```bash
mkdir data
cd data
git clone https://huggingface.co/datasets/opencompass/CriticEval
```

This creates the `data` folder and clones the CriticEval dataset into it.
Note that the human-annotated Likert scores, preference labels, and critiques in the `test` set are excluded.
You can submit your inference results on the `test` set (produced by running the code under the `inference` folder) to this [email]([email protected]); we will run your predictions and update the results on our leaderboard. Please also provide the scale of your tested model.
The structure of your submission should be similar to that in `example_data`.
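
After cloning, the benchmark files live under `data/CriticEval`. As a quick sanity check, here is a minimal sketch that lists the JSON files shipped with the clone; it only assumes the clone path from the commands above, not any particular internal layout:

```python
# Minimal sketch: list the JSON files inside the cloned CriticEval dataset.
# The path comes from the clone commands above; the internal layout is not
# assumed -- we simply walk the tree.
from pathlib import Path

data_root = Path("data/CriticEval")
for path in sorted(data_root.rglob("*.json")):
    print(path.relative_to(data_root))
```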

#### 1.2 Prepare Code and Env

```bash
git clone https://github.com/open-compass/CriticEval.git
# prepare the env for the evaluation toolkit
cd critic_bench
pip install -r requirements.txt

# prepare the env for inference
cd ../inference
pip install -r requirements.txt
```

### 2. Run LLM Inference on CriticEval

You need to run the LLMs to be evaluated on CriticEval; their generation results are stored in the `inference/outputs` folder.
If you are interested in our prompts for the LLMs, they are shown in [inference/utils/prompts.py](inference/utils/prompts.py).
Specifically, the inference code looks roughly like this (abridged; the tokenizer/model lines below are an illustrative completion; see the scripts under `inference/` for the exact code):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# this line loads all the evaluation datasets in CriticEval from `inference/utils`
datasets = load_all_datasets(args['data_dir'])

# these lines init the tokenizer and model from huggingface
# (illustrative; the real argument names may differ)
tokenizer = AutoTokenizer.from_pretrained(args['model_name'])
model = AutoModelForCausalLM.from_pretrained(args['model_name'])
```

We only provide the inference codebase for our InternLM2-7B-Chat model.

#### Example Inference Data of Representative LLMs

We have already released the generation results of some representative LLMs on CriticEval; you can find them in [example_data/prediction_v1.3.tgz](example_data/prediction_v1.3.tgz).

```bash
tar -xzvf example_data/prediction_v1.3.tgz
```

After unzipping, you can inspect the predictions of these LLMs on CriticEval. The evaluation files are named `{split}_{domain}_{dimension}_{format}.json`, where `split`, `dimension`, and `format` are described below, and `domain` covers the 9 task scenarios in CriticEval: `translate`, `chat`, `qa`, `harmlessness`, `summary`, `math_cot`, `math_pot`, `code_exec`, and `code_not_exec` (a concrete example of the naming scheme follows the notes below). Refer to our paper for more details.
Here are some notes:
* The `comp_feedback` critique dimension is always accompanied by a `reverse` file, which is used to address the well-known positional-bias problem in the LLM-as-a-judge procedure. Refer to Section 4 of our paper for more details.
* For the `feedback` critique dimension, each `domain` has additional `*_correction_part.json` files, which store the evaluation results of critiques for correct or very high-quality responses. Refer to our paper for more details about these responses.
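
To make the naming scheme concrete, here is a small sketch that expands `{split}_{domain}_{dimension}_{format}.json` from the values listed in this README; the printed combination is only an illustrative example and is not guaranteed to correspond to a file in the released archive:

```python
# Illustrative expansion of the {split}_{domain}_{dimension}_{format}.json naming scheme.
# All component values are taken from this README.
splits = ["dev", "test"]
domains = ["translate", "chat", "qa", "harmlessness", "summary",
           "math_cot", "math_pot", "code_exec", "code_not_exec"]
dimensions = ["feedback", "correction", "comp_feedback", "meta_feedback"]
formats = ["objective", "subjective"]

# e.g. "dev_math_cot_feedback_subjective.json"
print(f"{splits[0]}_{domains[5]}_{dimensions[0]}_{formats[1]}.json")
```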

### 3. Compute the Evaluation Results on CriticEval

After obtaining the generation results under `inference/outputs`, the next step is to compute the objective and subjective scores of CriticEval using our toolkit.
See Section 4 of our paper for more details about the objective and subjective scores.

We provide two ways to compute the `objective` and `subjective` scores in the `critic_bench` folder.

Then, run the following command for evaluation:
```bash
./run.sh <dimension> <format> <split> <save_dir>
```

* `dimension` denotes one of the critique dimensions defined in CriticEval: `feedback`, `correction`, `comp_feedback`, or `meta_feedback`. Refer to Section 2 of our paper for more details about these critique dimensions.
* `format` denotes the critique format, either `objective` or `subjective`. Objective scores (Spearman correlation, pass rate, and preference accuracy) can be computed automatically at no cost, while subjective scores are computed by prompting GPT-4-turbo to compare generated critiques with our human-annotated high-quality critiques in CriticEval.
* `split` denotes whether the `test` or `dev` set is evaluated.
* `save_dir` is any path for saving the evaluation results.

For example, `./run.sh feedback objective dev <save_dir>` computes the objective feedback scores on the `dev` set.

In the [run.sh](critic_bench/run.sh) file, you can find the corresponding commands for the objective and subjective evaluation processes.
For example, for the feedback critique dimension, the objective evaluation command is:
```bash
python run_feedback.py --root_dir "../data/CriticEval" --prediction_dir "../example_data/prediction_v1.3" --split $3 --obj True
```
* `root_dir` is the path containing the `test` and `dev` sets of CriticEval.
* `prediction_dir` contains the inference results of the LLMs to be evaluated. We also provide the inference results of some representative LLMs in `example_data`. If you want to evaluate your own LLMs, please refer to `inference/README.md` for more details and set `prediction_dir` to `../inference/outputs`.
* `split` denotes whether the `test` or the `dev` set is used.
* `obj` indicates that the objective evaluation is activated.

For the subjective evaluation of the feedback critique dimension, the evaluation command is:
```bash
python run_feedback.py --root_dir "../data/CriticEval" --prediction_dir "../example_data/prediction_v1.3" --evaluation_dir "../example_data/evaluation_v1.3/" --batch_size 1 --split $3 --obj False
```
* `evaluation_dir` stores the subjective evaluation scores from GPT-4, which can be reloaded if the subjective evaluation process breaks off. The order of the samples in each file in `evaluation_dir` follows the order of the original data in CriticEval (`data/CriticEval`).
* `batch_size` controls the number of processes used to access the GPT-4 API in the multiprocessing setting.

The GPT-4 evaluation results under `save_dir` are stored as `jsonl` files, where each line contains the evaluation result for one sample. The chain-of-thought evaluation produced by GPT-4 is stored under the `evaluation` key of each line, a `dict` consisting of GPT-4's chain-of-thought rationale (key `cot`) and a Likert score (key `score`) for each critique, ranging from 1 to 10.
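
As a minimal sketch of how these result files can be consumed (the file name below is hypothetical; point it at any `jsonl` file written under your `save_dir`), the following snippet reads each line and averages the Likert scores:

```python
import json
from pathlib import Path

# Hypothetical path: replace with any jsonl file produced under your <save_dir>.
result_file = Path("save_dir/dev_chat_feedback_subjective.jsonl")

scores = []
with result_file.open() as f:
    for line in f:
        record = json.loads(line)
        evaluation = record["evaluation"]   # GPT-4 judgment for this critique
        rationale = evaluation["cot"]       # chain-of-thought rationale (unused here)
        scores.append(evaluation["score"])  # Likert score, 1 to 10

print(f"{result_file.name}: mean score {sum(scores) / len(scores):.2f} over {len(scores)} critiques")
```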

The objective evaluation results of some representative LLMs are shown below:

<img src="./figs/objective_score.png" alt="objective" align=center />

Refer to our [Project Page](https://open-compass.github.io/CriticEval/) for the complete evaluation results on <b>CriticEval</b>.

## Acknowledgements

<b>CriticEval</b> is built with [OpenCompass](https://github.com/open-compass/opencompass). Thanks for their awesome work!

The quota for API-based LLMs is supported by the Beijing Institute of Technology and the Shanghai AI Laboratory. Thank you so much!

## Citation

```
@misc{lan2024criticbench,
      title={CriticEval: Evaluating Large Language Models as Critic},
      author={Tian Lan and Wenwei Zhang and Chen Xu and Heyan Huang and Dahua Lin and Kai Chen and Xian-ling Mao},
      year={2024},
      eprint={2402.13764},
      archivePrefix={arXiv}
}
```