Add mt-bench #75

Merged
merged 51 commits into main from nathan-add-mt-bench on Mar 29, 2024
Changes from 37 commits
Commits
51 commits
89e7fda
init ifeval, now need to add loading custom metric system
clefourrier Feb 20, 2024
96aa81b
Merge branch 'main' into clem_customizable_metrics
clefourrier Feb 23, 2024
2fdceb8
custom metrics working! need to update the readme
clefourrier Feb 23, 2024
0e30b21
update doc
clefourrier Feb 23, 2024
1ba178f
fix eos token + eval script
clefourrier Feb 23, 2024
6233af7
init
Feb 28, 2024
5cc9c2c
remove ifeval
Feb 28, 2024
b9045e1
revert README
Feb 28, 2024
ff79480
revert README
Feb 28, 2024
a234bf6
better context management
Feb 28, 2024
1357c10
working state
NathanHB Mar 6, 2024
bb5cca2
fix
NathanHB Mar 6, 2024
6b74a68
Merge branch 'nathan_fix_push_details' into nathan-add-mt-bench
NathanHB Mar 6, 2024
f548902
continue
NathanHB Mar 9, 2024
2e2b15d
continue
NathanHB Mar 11, 2024
339f1f6
commit
NathanHB Mar 20, 2024
aba90b3
Merge remote-tracking branch 'origin/main' into nathan-add-mt-bench
NathanHB Mar 20, 2024
5bc5b98
Update README.md
NathanHB Mar 20, 2024
cd1300d
commit
NathanHB Mar 20, 2024
1fd755e
commit
NathanHB Mar 20, 2024
4b00eb7
commit
NathanHB Mar 20, 2024
4903755
commit
NathanHB Mar 20, 2024
9ff0707
commit
NathanHB Mar 20, 2024
ff177a1
commit
NathanHB Mar 20, 2024
9794b7c
commit
NathanHB Mar 20, 2024
6268ff6
commit
NathanHB Mar 20, 2024
31eaab1
commit
NathanHB Mar 20, 2024
c80ef8c
commit
NathanHB Mar 21, 2024
c296b63
Revert "commit"
NathanHB Mar 21, 2024
804f41a
commit
NathanHB Mar 21, 2024
48b0fee
remove model adapter
NathanHB Mar 21, 2024
e5b6ea8
commit
NathanHB Mar 21, 2024
0dcdb1e
update readme
NathanHB Mar 21, 2024
703741b
commit
NathanHB Mar 21, 2024
6e8026f
commit
NathanHB Mar 22, 2024
588fb2f
format
NathanHB Mar 22, 2024
8cb4894
format
NathanHB Mar 22, 2024
c08a8f6
commit
NathanHB Mar 25, 2024
64ceee5
fixes for review
NathanHB Mar 27, 2024
46d7dd8
make style
NathanHB Mar 27, 2024
e2f7fa8
fix
NathanHB Mar 27, 2024
3260147
revert generate_response in base model
NathanHB Mar 27, 2024
323188a
Merge remote-tracking branch 'origin/main' into nathan-add-mt-bench
NathanHB Mar 27, 2024
33eb252
merge
NathanHB Mar 27, 2024
b2e5895
fix tests
NathanHB Mar 27, 2024
c42e65d
fix format
NathanHB Mar 27, 2024
aa6c6f8
commit
NathanHB Mar 29, 2024
bb4b133
make style
NathanHB Mar 29, 2024
2d3a04c
fix from review
NathanHB Mar 29, 2024
0819ac7
fix
NathanHB Mar 29, 2024
b2bf514
Merge branch 'main' into nathan-add-mt-bench
NathanHB Mar 29, 2024
8 changes: 8 additions & 0 deletions extended_tasks/mt_bench/judge_prompts.jsonl
@@ -0,0 +1,8 @@
{"name": "pair-v2", "type": "pairwise", "system_prompt": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.", "prompt_template": "[User Question]\n{question}\n\n[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]", "description": "Prompt for general questions", "category": "general", "output_format": "[[A]]"}
{"name": "pair-v2-multi-turn", "type": "pairwise", "system_prompt": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user questions. You should choose the assistant that follows the user's instructions and answers the user's questions better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. You should focus on who provides a better answer to the second user question. Begin your evaluation by comparing the responses of the two assistants and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.", "prompt_template": "<|The Start of Assistant A's Conversation with User|>\n\n### User:\n{question_1}\n\n### Assistant A:\n{answer_a_1}\n\n### User:\n{question_2}\n\n### Assistant A:\n{answer_a_2}\n\n<|The End of Assistant A's Conversation with User|>\n\n\n<|The Start of Assistant B's Conversation with User|>\n\n### User:\n{question_1}\n\n### Assistant B:\n{answer_b_1}\n\n### User:\n{question_2}\n\n### Assistant B:\n{answer_b_2}\n\n<|The End of Assistant B's Conversation with User|>", "description": "Prompt for multi-turn general questions", "category": "general", "output_format": "[[A]]"}
{"name": "pair-math-v1", "type": "pairwise", "system_prompt": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer, assistant A's answer, and assistant B's answer. Your job is to evaluate which assistant's answer is better. Begin your evaluation by comparing both assistants' answers with the reference answer. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.", "prompt_template": "[User Question]\n{question}\n\n[The Start of Reference Answer]\n{ref_answer_1}\n[The End of Reference Answer]\n\n[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]", "description": "Prompt for math questions", "category": "math", "output_format": "[[A]]"}
{"name": "pair-math-v1-multi-turn", "type": "pairwise", "system_prompt": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user questions. Your evaluation should consider correctness and helpfulness. You will be given reference answers, the assistant A's answers, the assistant B's answers. Your job is to determine which assistant provides correct and helpful answers to the second user question. Begin your evaluation by comparing both assistants' answers with the reference answers. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie.", "prompt_template": "<|The Start of Reference Answer|>\n\n### User:\n{question_1}\n\n### Reference answer:\n{ref_answer_1}\n\n### User:\n{question_2}\n\n### Reference answer:\n{ref_answer_2}\n\n<|The End of Reference Answer|>\n\n\n<|The Start of Assistant A's Conversation with User|>\n\n### User:\n{question_1}\n\n### Assistant A:\n{answer_a_1}\n\n### User:\n{question_2}\n\n### Assistant A:\n{answer_a_2}\n\n<|The End of Assistant A's Conversation with User|>\n\n\n<|The Start of Assistant B's Conversation with User|>\n\n### User:\n{question_1}\n\n### Assistant B:\n{answer_b_1}\n\n### User:\n{question_2}\n\n### Assistant B:\n{answer_b_2}\n\n<|The End of Assistant B's Conversation with User|>", "description": "Prompt for multi-turn general questions", "category": "general", "output_format": "[[A]]"}
{"name": "single-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": "general", "output_format": "[[rating]]"}
{"name": "single-math-v1", "type": "single", "system_prompt": "You are a helpful assistant.", "prompt_template": "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Reference Answer]\n{ref_answer_1}\n[The End of Reference Answer]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", "description": "Prompt for general questions", "category": "math", "output_format": "[[rating]]"}
{"name": "single-v1-multi-turn", "type": "single", "system_prompt": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. You evaluation should focus on the assistant's answer to the second user question. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n", "prompt_template": "<|The Start of Assistant A's Conversation with User|>\n\n### User:\n{question_1}\n\n### Assistant A:\n{answer_1}\n\n### User:\n{question_2}\n\n### Assistant A:\n{answer_2}\n\n<|The End of Assistant A's Conversation with User|>", "description": "Prompt for general questions", "category": "general", "output_format": "[[rating]]"}
{"name": "single-math-v1-multi-turn", "type": "single", "system_prompt": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. You evaluation should focus on the assistant's answer to the second question. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n", "prompt_template": "<|The Start of Reference Answer|>\n\n### User:\n{question_1}\n\n### Reference answer:\n{ref_answer_1}\n\n### User:\n{question_2}\n\n### Reference answer:\n{ref_answer_2}\n\n<|The End of Reference Answer|>\n\n\n<|The Start of Assistant A's Conversation with User|>\n\n### User:\n{question_1}\n\n### Assistant A:\n{answer_1}\n\n### User:\n{question_2}\n\n### Assistant A:\n{answer_2}\n\n<|The End of Assistant A's Conversation with User|>", "description": "Prompt for general questions", "category": "math", "output_format": "[[rating]]"}
136 changes: 136 additions & 0 deletions extended_tasks/mt_bench/judges.py
@@ -0,0 +1,136 @@
# MIT License

# Copyright (c) 2024 The HuggingFace Team

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.


import ast
import json
import re
import time
from abc import ABC
from typing import Optional

from openai import OpenAI

from lighteval.logging.hierarchical_logger import hlog_warn


# Abstract class for a judge
class Judge(ABC):
    def evaluate_answer(answers, questions, references) -> tuple[str, list[dict[str, str]], str]:
        pass


class JudgeOpenAI(Judge):
    def __init__(self, model: str, seed: int, temperature: float, templates_path: str):
        self.client = OpenAI()
        self.model = model
        self.seed = seed
        self.temperature = temperature

        data = []
        with open(templates_path, "r") as f:
            for line in f:
                tmp = json.loads(line)
                data.append(tmp)

        self.templates = {d["name"]: d for d in data}

        self.one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
        self.one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")

        self.API_MAX_RETRY = 16
        self.API_RETRY_SLEEP = 10
        self.max_tokens = 2048

    def evaluate_answer(
        self, questions: list[str], answers: list[str], references: list[str], single_turn: bool
    ) -> tuple[int, list[dict[str, str]], str]:
        if single_turn:
            prompts = self.__get_prompts_single_turn(
                questions[0], answers[0], references[0] if len(references) > 0 else None
            )
        else:
            prompts = self.__get_prompts_multi_turn(questions, answers, references if len(references) > 0 else None)

        for _ in range(self.API_MAX_RETRY):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    seed=self.seed,
                    temperature=self.temperature,
                    messages=prompts,
                    max_tokens=self.max_tokens,
                    n=1,
                )
                break
            except Exception as e:
                hlog_warn(f"{type(e), e}")
                time.sleep(self.API_RETRY_SLEEP)

        judgment = response.choices[0].message.content
        score = self.__process_judge_response(judgment)

        return score, prompts, judgment

    def __get_prompts_multi_turn(
        self, questions: list[str], answers: list[str], references: Optional[list[str]]
    ) -> list[dict[str, str]]:
        if references is None:
            system_prompt = {"role": "system", "content": self.templates["single-v1-multi-turn"]["system_prompt"]}
            user_prompt_str = self.templates["single-v1-multi-turn"]["prompt_template"].format(
                question_1=questions[0], answer_1=answers[0], question_2=questions[1], answer_2=answers[1]
            )
        else:
            system_prompt = {"role": "system", "content": self.templates["single-math-v1-multi-turn"]["system_prompt"]}
            user_prompt_str = self.templates["single-math-v1-multi-turn"]["prompt_template"].format(
                question_1=questions[0],
                answer_1=answers[0],
                ref_answer_1=references[0],
                question_2=questions[1],
                answer_2=answers[1],
                ref_answer_2=references[1],
            )
        user_prompt = {"role": "user", "content": user_prompt_str}
        return [system_prompt, user_prompt]

    def __get_prompts_single_turn(self, question: str, answer: str, reference: Optional[str]) -> list[dict[str, str]]:
        if reference is None:
            system_prompt = {"role": "system", "content": self.templates["single-v1"]["system_prompt"]}
            user_prompt_str = self.templates["single-v1"]["prompt_template"].format(question=question, answer=answer)
        else:
            system_prompt = {"role": "system", "content": self.templates["single-math-v1"]["system_prompt"]}
            user_prompt_str = self.templates["single-math-v1"]["prompt_template"].format(
                question=question, answer=answer, ref_answer_1=reference
            )
        user_prompt = {"role": "user", "content": user_prompt_str}
        return [system_prompt, user_prompt]

    def __process_judge_response(self, judgment: str) -> int:
        match = re.search(self.one_score_pattern, judgment)
        if not match:
            match = re.search(self.one_score_pattern_backup, judgment)
        if match:
            rating = ast.literal_eval(match.groups()[0])
        else:
            rating = -1

        return rating
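For readers of the diff, a usage sketch of the JudgeOpenAI class above. Nothing here is prescribed by the PR: the import path assumes extended_tasks.mt_bench.judges is importable from the repository root, the judge model "gpt-4" is only an example, and OPENAI_API_KEY must be set for the OpenAI client to authenticate.

import os

from extended_tasks.mt_bench.judges import JudgeOpenAI  # assumed import path

# The OpenAI client reads the API key from the environment.
assert "OPENAI_API_KEY" in os.environ

judge = JudgeOpenAI(
    model="gpt-4",  # example judge model, not mandated by the PR
    seed=42,
    temperature=0.0,
    templates_path="extended_tasks/mt_bench/judge_prompts.jsonl",
)

# Single-turn scoring: one question, one model answer, no reference answer.
score, prompts, judgment = judge.evaluate_answer(
    questions=["Explain the difference between a list and a tuple in Python."],
    answers=["A list is mutable, while a tuple is immutable."],
    references=[],
    single_turn=True,
)
print(score)  # -1 if no [[rating]] tag could be parsed from the judgment

With an empty references list the judge falls back to the single-v1 template; with two questions and answers and single_turn=False it uses the multi-turn templates instead.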