AutoSxS Model Evaluation for LLMs #443
Conversation
I think the file name should include 'AutoSxS' to differentiate from other evaluation services.
Line #9. `BUCKET_URI = f"gs://{BUCKET}/autosxs-{UUID}"`
Minor: maybe we can use a fixed name for simplicity, as in the other labs. Then we can remove the function, and there is no need to import the `string` standard library.
Done! Replaced with a timestamp instead, to avoid pipeline runs overwriting the same directory and to simplify the logic.
New logic:

```python
timestamp = str(current_datetime.timestamp()).replace(".", "")
display_name = f"autosxs-{timestamp}"
pipeline_root = os.path.join("gs://", BUCKET, display_name)
```
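The naming logic above can be exercised as a self-contained snippet (a sketch; `BUCKET` here is a hypothetical bucket name, not from the notebook):

```python
import os
from datetime import datetime

BUCKET = "my-bucket"  # hypothetical bucket name for illustration

# A timestamp with the "." removed yields a unique suffix per run,
# so repeated pipeline runs do not overwrite the same directory.
current_datetime = datetime.now()
timestamp = str(current_datetime.timestamp()).replace(".", "")
display_name = f"autosxs-{timestamp}"
pipeline_root = os.path.join("gs://", BUCKET, display_name)
```

Because the first component `"gs://"` already ends with a separator, `os.path.join` concatenates the parts without inserting an extra slash.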
"with `response_a` and `response_b` representing different article summaries." -> "with `response_a` and `response_b` representing different article summaries generated from different LLM models." for more clarity?
Done!
`{task}@{version}` -> `{task}`?
Adding the expected parameters under `autorater_prompt_parameters` would be helpful.
Also, I'd add that we can simply specify models (`model_a` and `model_b`) instead of specifying pre-generated responses (`response_column_a` and `response_column_b`).
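For context, a parameter dict along the lines the reviewer describes might look like this. This is a sketch, not the notebook's actual cell: the dataset path and model names are placeholders, and the exact parameter names should be checked against the Vertex AI AutoSxS pipeline documentation.

```python
# Hypothetical AutoSxS parameter dict illustrating the reviewer's points:
# prompt columns go under autorater_prompt_parameters, and either
# pre-generated responses or models can be compared.
parameters = {
    "evaluation_dataset": "gs://my-bucket/evaluation_dataset.jsonl",  # placeholder path
    "id_columns": ["id"],
    "task": "summarization",  # "{task}" without the "@{version}" suffix
    "autorater_prompt_parameters": {
        "inference_context": {"column": "document"},
        "inference_instruction": {"template": "Summarize: {{ document }}"},  # placeholder template
    },
    # Option 1: compare pre-generated responses from the dataset...
    "response_column_a": "response_a",
    "response_column_b": "response_b",
    # ...or Option 2: let the pipeline generate responses from models instead:
    # "model_a": "publishers/google/models/...",  # placeholder model resource
    # "model_b": "publishers/google/models/...",
}
```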
Done!
Line #4. `"id_columns": ["id", "document"],`
I guess `id` alone is sufficient if it is truly a unique id.
Agreed, this is weird. I'll try with `id` only and see if that works; otherwise, I'll revert.
```python
for details in job.task_details:
    if details.task_name == "model-evaluation-text-generation-pairwise":
        break
```

Same as the comment above:

```python
online_eval_task = [task for task in job.task_details if task.task_name == "online-evaluation-pairwise"][0]
```
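The suggested one-liner can be tried out with stand-in objects (a sketch: the `task_details` entries here are simple namespaces rather than real pipeline task objects, and the task names are copied from the thread):

```python
from types import SimpleNamespace

# Stand-in for job.task_details: a list of task records with a task_name attribute.
task_details = [
    SimpleNamespace(task_name="online-evaluation-pairwise"),
    SimpleNamespace(task_name="model-evaluation-text-generation-pairwise"),
]

# Pick the first task whose name matches, replacing the for/break loop.
online_eval_task = [
    task for task in task_details if task.task_name == "online-evaluation-pairwise"
][0]
```

A list comprehension with `[0]` raises `IndexError` if no task matches, which surfaces the problem immediately instead of silently continuing with the last loop variable.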
That's nicer. Thanks. Done!
Left minor comments, but LGTM otherwise. Very simple and good lab!