feat(bench): evaluate review writing in ReviewBench (#917)

* init
* baselines
* upload
* debug
* ruff isort mypy
* checkpoint
* prompt update
* update prompt
* update prompt
* checkpoint
* checkpoint
* code cleanup
* upload evaluation code
* remove comments
* pytest
* pytest
* delete useless files
* remove commented and outdated file/content
* make topk as a param
* debug
* update readme

---------

Co-authored-by: Haofei Yu <[email protected]>
ulab-uiuc · Jan 1, 2025 · 4b267ed
1 parent 073f56d
commit 4b267ed

Showing 33 changed files with 1,477 additions and 637 deletions.
diff --git a/.gitignore b/.gitignore
@@ -185,6 +185,7 @@ research_bench/data/arxiv_ai_papers/output_with_references.json
 research_bench/data/arxiv_ai_papers/paper_info.json
 research_bench/crossbench/*.json
 research_bench/mlbench/*.json
+research_bench/iclrbench/*.json
 research_bench/profile_dbs/*
 research_bench/results/*
 research_bench/profile_dbs_old

diff --git a/README-CN.md b/README-CN.md
@@ -114,3 +114,13 @@ pre-commit install
   </picture>
 </a>
 </p>
+
+## ResearchBench
+
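+To run the ResearchBench experiments, run the 'research_bench/run_review_eval.sh' script. You can adjust the parameters in the script, such as setting the actual `INPUT_PATH`.
+
+If you get an `openreview` not found error, install the package by running `pip install openreview`. If you run into problems related to `requests`, change its version to `2.26`.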
+
+```bash
+pip install requests==2.26
+```
diff --git a/README.md b/README.md
@@ -121,3 +121,13 @@ Check the github action result to make sure all tests pass. If not, fix the erro
   </picture>
 </a>
 </p>
+
+## ResearchBench
+
+To run the ResearchBench experiments, execute the 'research_bench/run_review_eval.sh' script. You can adjust the parameters in the script, for example by setting the actual `INPUT_PATH`.
+
+If you encounter an `openreview` not found error, install the package by running `pip install openreview`. If any issues come up with `requests`, change its version to `2.26`.
+
+```bash
+pip install requests==2.26
+```
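
The two troubleshooting steps in the README addition can be checked before launching the script. The following is a minimal sketch, not part of the commit: the helper name `check_environment` and its parameters are hypothetical, and it only assumes that the import name matches the `openreview` package name given in the README.

```python
from importlib import metadata, util


def check_environment(
    module: str = "openreview",
    pinned: str = "requests",
    prefix: str = "2.26",
) -> dict:
    """Report whether `module` is importable and whether `pinned` is at `prefix`.

    Hypothetical helper for illustration; the names are not from the repository.
    """
    try:
        installed = metadata.version(pinned)
    except metadata.PackageNotFoundError:
        installed = None

    # Accept "2.26" exactly or any patch release such as "2.26.0".
    version_ok = installed is not None and (
        installed == prefix or installed.startswith(prefix + ".")
    )
    return {
        "module_found": util.find_spec(module) is not None,
        "installed_version": installed,
        "version_ok": version_ok,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `module_found` is false or `version_ok` is false, the corresponding `pip install` command from the README applies.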
diff --git a/configs/agent_prompt/write_metareview_decision.yaml b/configs/agent_prompt/write_metareview_decision.yaml
diff --git a/configs/agent_prompt/write_metareview_ethical.yaml b/configs/agent_prompt/write_metareview_ethical.yaml
diff --git a/configs/agent_prompt/write_metareview_strength.yaml b/configs/agent_prompt/write_metareview_strength.yaml
@@ -1,44 +1,17 @@
-fewshot_examples:
-- "Here is the proposal: We present a novel deep learning architecture, TransformerX, for natural language processing tasks. Our model achieves state-of-the-art performance on multiple benchmarks while requiring significantly less computational resources than existing models.
-
-  Here are the reviews:
-  Reviewer 1 (Score: 8/10): The paper presents an innovative approach to efficient NLP modeling. The results are impressive, showing both performance gains and reduced computational requirements. However, the theoretical analysis could be more rigorous.
-
-  Reviewer 2 (Score: 9/10): This is a strong paper with clear contributions. The TransformerX architecture is well-designed and the extensive experiments demonstrate its effectiveness. The paper could benefit from more ablation studies.
-
-  Here is the summary of the reviews: Both reviewers acknowledge the novelty and effectiveness of the proposed TransformerX architecture, with minor suggestions for improvement.
-
-  Please begin writing the strength of the submission based on the review."
-
-- "Strength of the submission: The submission presents a strong, innovative approach to NLP modeling with clear empirical advantages and thorough evaluation, making it a valuable contribution to the field."
-
-- "Here is the proposal: Our paper introduces a novel graph neural network algorithm, GraphFusion, for multi-modal data integration in bioinformatics. We demonstrate its effectiveness in predicting protein-protein interactions and drug-target affinities, outperforming existing methods on several benchmark datasets.
-
-  Here are the reviews:
-  Reviewer 1 (Score: 7/10): The paper presents an interesting approach to multi-modal data integration. The results on protein-protein interaction prediction are promising. However, the comparison with some recent methods is missing, and the scalability of the approach needs more discussion.
-
-  Reviewer 2 (Score: 8/10): This is a solid contribution to bioinformatics and graph neural networks. The GraphFusion algorithm is well-designed and the experiments are comprehensive. The paper would benefit from a more in-depth analysis of the model's interpretability.
-
-  Here is the summary of the reviews: Both reviewers recognize the value of the GraphFusion algorithm for multi-modal data integration in bioinformatics, with suggestions for additional comparisons and analyses.
-
-  Please begin writing the strength of the submission based on the review."
-
-- "Strength of the submission: The submission presents a novel and effective approach to multi-modal data integration in bioinformatics, with clear empirical advantages, comprehensive evaluation, and potential for significant impact in both theoretical and applied research in the field."
+fewshot_examples: []
 
 sys_prompt: >
     You are an autonomous intelligent agent tasked to write the strength of the submission for the following submission you have made to an academic conference. Your summary of strength should summarize the reviews to help the reviewers to make a decision.
     You will be provided with the following information:
-    Submission - The abstract of the paper submitted to this conference.
-    Reviews - It typically contains the score, a short summary, strength, and weakness of the submission.
-    Summary of Reviews - A short summary of the review.
+    Submission - Full content of the paper submitted to this conference.
+    Reviews - It typically contains the score, strength, and weakness of the submission, each by a different reviewer.
 
     You should provide the following information:
-    Strength - The strength of the submission based on the review.
-template: |
-  Here is the proposal: {proposal}
+    Strength - The strength of the submission based on the reviews.
 
+template: |
   Here are the reviews: {reviews}
 
-  Here is the summary of the reviews: {summary}
+  Please summarize the important points from the 'strength' section of the reviews.
 
-  Please begin writing the strength of the submission based on the review.
+  Please write in bullet points. It should be 200 words long.
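
After this change the strength template has a single placeholder, `{reviews}`; the `{proposal}` and `{summary}` fields are gone. A minimal sketch of how such a template could be filled is shown below. It assumes plain `str.format` substitution and a hypothetical `render_strength_prompt` helper with its own "Reviewer N:" formatting; the repository's actual rendering code is not shown in this diff.

```python
from __future__ import annotations

# Template text copied from the new write_metareview_strength.yaml.
TEMPLATE = (
    "Here are the reviews: {reviews}\n\n"
    "Please summarize the important points from the 'strength' section of the reviews.\n\n"
    "Please write in bullet points. It should be 200 words long.\n"
)


def render_strength_prompt(reviews: list[str]) -> str:
    """Join the reviews and substitute them into the template.

    Hypothetical helper: the numbering format is an assumption for illustration.
    """
    joined = "\n".join(
        f"Reviewer {i}: {text}" for i, text in enumerate(reviews, start=1)
    )
    return TEMPLATE.format(reviews=joined)


if __name__ == "__main__":
    print(render_strength_prompt(["Clear writing.", "Strong experiments."]))
```

Because the template no longer references `{proposal}` or `{summary}`, callers only need to supply the reviews.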
diff --git a/configs/agent_prompt/write_metareview_summary.yaml b/configs/agent_prompt/write_metareview_summary.yaml