Commit
Add support for VertexAI, update dependencies, update question sets and add results for some new experiments (#29)

* handle empty question set
* fetch latest questions from contentful
* use gpt4 turbo
* configuration for gpt4
* only keep 30 questions for next eval
* add gpt-4 result
* config for gemini pro v1
* handle Gemini Pro errors
  Sometimes Gemini Pro will not respond because of "Recitation Reasons", and litellm will raise an error. Instead of raising the error, create a response stating the situation.
* fix response not being declared when there is an error
* Chinese questions
* add experiment config for Chinese questions
* add more waiting time for Alibaba models
* gemini pro and Alibaba results
* update results
* force question_id to be string
* make archive for 60 prompts test
* new experiment
* update the evaluator
* add experiment config
* add no_option_letter
* add results for gemini
* new Chinese questions and config for qwen-max
* result for qwen
* config for gpt-4
* results for gpt4
* generate results.xlsx
* one of the Chinese prompt translations is not correct; fix it and add new config
* more qwen data
* update results
* update notebook
* use newer mypy
* update pre-commit's mypy
* update lock file and try fixing the issue
* another try
* use old settings
* use old code
* fix
* should be 0.910
* upgrade to mypy 1.9 and fix errors
  The most significant change is the removal of Optional types from ai eval spreadsheet.
* use latest litellm because safety_settings is supported
* use Vertex AI for Google models; remove support for deprecated PaLM models
* new experiment for gemini 1.5 pro
* fix port num
* fix
* add gemini 1.5 results
* add more columns in prompt variant sheet
* rephrase some Chinese questions
* new config for qwen-max-0403
* result for qwen-max-0403
* make archive and add new results.xlsx
* update deps
* add result data analysis notebook
* update README
* add support for claude evaluator and command-line option to set evaluator
* change model_compare function to also support claude
* create archive for experiment 202405012311
* experiment for 4 missing climate study questions
* update dependencies
* update notebook
* include all questions again
* use bigger max_tokens for evaluators
* update evaluators
  - move common functions into one file
  - add llama3 based evaluator
* add llama3 evaluator in config generation
* add more experiment archives
* update notebooks
* misc
* new experiment for gpt4o
* dependency
* session result sheet is not in use
* improve the evaluator prompt
* new experiment and results for llama3 and claude3 opus
* update notebooks
* update notebook
* add more columns to the results.xlsx
  - auto marked correctness
  - human rating scores
* also update a few archived results
* experiment and result for qwen-max-0428
* notebooks update
* results.xlsx for latest experiment, with human rating
* update notebooks
* update results with human rating
* update notebook
* remove cli.py because we don't use it any more
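The Gemini Pro error handling described in the message (create a response that states the situation instead of letting the error propagate, and make sure the response is declared on the error path) could look roughly like the sketch below. This is an illustration only, not the code from this commit: `ModelAnswer`, `ask_with_fallback`, `call_model` and `blocked_model` are hypothetical names, and the exception type actually raised by litellm is not assumed here, so the sketch catches a generic `Exception`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelAnswer:
    """Hypothetical container for one model reply."""

    text: str
    is_placeholder: bool = False


def ask_with_fallback(call_model: Callable[[str], str], prompt: str) -> ModelAnswer:
    """Return the model's reply, or a placeholder describing the failure."""
    try:
        return ModelAnswer(text=call_model(prompt))
    except Exception as exc:  # assumption: the client raises when a reply is blocked
        # Build the answer on the error path too, so the caller always gets a value.
        return ModelAnswer(
            text=f"No answer returned by the model: {exc}",
            is_placeholder=True,
        )


def blocked_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call that fails with a recitation block."""
    raise RuntimeError("response blocked: recitation")


answer = ask_with_fallback(blocked_model, "example question")
```

Recording a placeholder keeps one blocked question from aborting a whole experiment run, and the `is_placeholder` flag lets later evaluation steps tell such rows apart from real answers.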
Showing 77 changed files with 28,821 additions and 3,204 deletions.