Add support for VertexAI, update dependencies, update question sets and add results for some new experiments (#29)

* handle empty question set

* fetch latest questions from contentful

* use gpt4 turbo

* configuration for gpt4

* only keep 30 questions for next eval

* add gpt-4 result

* config for gemini pro v1

* handle Gemini Pro Errors

Sometimes Gemini Pro will not respond because of "Recitation Reasons",
and litellm will raise an error. Instead of raising the error, let's create
a response stating the situation.
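
A minimal sketch of that idea, assuming the call goes through litellm's `completion()` and that the recitation failure surfaces as an exception whose message mentions the reason (the exact exception type and wording are assumptions, not taken from this commit):

```python
import litellm


def ask_gemini(messages: list[dict], model: str = "gemini-pro") -> str:
    """Ask the model, but degrade gracefully when Gemini refuses to answer."""
    try:
        response = litellm.completion(model=model, messages=messages)
        return response.choices[0].message.content
    except Exception as exc:  # litellm wraps provider errors; exact type assumed
        if "recitation" in str(exc).lower():
            # Record the situation as a response instead of failing the whole run.
            return "NO RESPONSE: Gemini blocked the answer for recitation reasons."
        raise
```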

* fix response not declared when there is an error

* chinese questions

* add experiment config for chinese questions

* add more waiting time for alibaba models

* gemini pro and alibaba results

* update results

* force question_id to be string

* make archive for 60 prompts test

* new experiment

* update the evaluator

* add experiment config

* add no_option_letter

* add results for gemini

* new chinese questions and config for qwen-max

* result for qwen

* config for gpt-4

* results for gpt4

* generate results.xlsx

* one of the Chinese prompt translations is not correct; fix it and add a new config

* more qwen data

* update results

* update notebook

* use newer mypy

* update pre-commit's mypy

* update lock file and try fixing the issue

* another try

* use old settings

* use old code

* fix

* should be 0.910

* upgrade to mypy 1.9 and fix errors

The most significant change is the removal of Optional types from the AI eval spreadsheet schemas.

* use latest litellm because safety_settings is supported

* Use Vertex AI for Google models. Remove support for deprecated PaLM models.

* new experiment for gemini 1.5 pro

* fix port num

* fix

* add gemini 1.5 results

* add more columns in prompt variant sheet

* rephrase some chinese questions

* new config for qwen-max-0403

* result for qwen-max-0403

* make archive and add new results.xlsx

* update deps

* add result data analysis notebook

* update README

* Add support for claude evaluator and commandline option to set evaluator

* change model_compare function to also support claude

* create archive for experiment 202405012311

* experiment for 4 missing climate study questions

* update dependencies

* update notebook

* include all questions again

* use bigger max_tokens for evaluators

* update evaluators

- move common functions into one file
- add llama3 based evaluator

* add llama3 evaluator in config generation

* add more experiment archives

* update notebooks

* misc

* new experiment for gpt4o

* dependency

* session result sheet is not in use

* improve the evaluator prompt

* new experiment and results for llama3 and claude3 opus

* update notebooks

* update notebook

* add more columns to the results.xlsx

- auto marked correctness
- human rating scores

* also update a few archived results

* experiment and result for qwen-max-0428

* notebooks update

* results.xlsx for latest experiment, with human rating

* update notebooks

* update results with human rating

* update notebook

* remove cli.py because we don't use it any more
semio authored Aug 28, 2024
1 parent 4d462b0 commit 3fe7eba
Showing 77 changed files with 28,821 additions and 3,204 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -13,7 +13,7 @@ repos:
hooks:
- id: pyupgrade
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.910
rev: v1.9.0
hooks:
- id: mypy
exclude: .+/snapshots/.+
6 changes: 6 additions & 0 deletions automation-api/.env.example
@@ -9,6 +9,12 @@ IFLYTEK_APPID=""
DASHSCOPE_API_KEY="" # for Alibaba
REPLICATE_API_KEY=""
GEMINI_API_KEY=""
# vertex AI
VERTEXAI_PROJECT="gapminder-ai"
# comma separated list of gcp regions. use multiple regions to get over the limit of 5 requests per minute of Gemini.
VERTEXAI_LOCATIONS="asia-southeast1,asia-east2,asia-northeast1"
# follow the guide in automation-api/DEV.md#obtaining-developer-specific-service-account-credentials-base64-encoded
VERTEX_SERVICE_ACCOUNT_CREDENTIALS=""

# For local development / notebooks etc
SERVICE_ACCOUNT_CREDENTIALS=""
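
The comma-separated `VERTEXAI_LOCATIONS` value exists so requests can be spread across regions and stay under Gemini's per-region rate limit. A minimal sketch of how such a list could be rotated through litellm, using its documented module-level `vertex_project`/`vertex_location` settings; the helper itself is hypothetical and not part of this commit:

```python
import itertools
import os

import litellm

# Round-robin over the configured regions so no single region absorbs every request.
_locations = itertools.cycle(os.environ["VERTEXAI_LOCATIONS"].split(","))


def vertex_completion(messages: list[dict], model: str = "vertex_ai/gemini-pro"):
    litellm.vertex_project = os.environ["VERTEXAI_PROJECT"]
    litellm.vertex_location = next(_locations)  # rotate region on each call
    return litellm.completion(model=model, messages=messages)
```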
128 changes: 61 additions & 67 deletions automation-api/lib/ai_eval_spreadsheet/schemas.py
@@ -16,13 +16,13 @@
class Question(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True)

include_in_next_evaluation: Optional[bool] = Field(
None, title="Include in next evaluation", validate_default=True
include_in_next_evaluation: bool = Field(
False, title="Include in next evaluation", validate_default=True
)
question_id: Optional[str] = Field(None, title="Question ID")
language: Optional[str] = Field(None, title="Language")
published_version_of_question: Optional[str] = Field(
None, title="Published version of question"
question_id: str = Field("", title="Question ID")
language: str = Field("", title="Language")
published_version_of_question: str = Field(
"", title="Published version of question"
)

@field_validator("include_in_next_evaluation", mode="before")
@@ -40,14 +40,12 @@ class Config:
class QuestionOption(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True)

question_option_id: Optional[str] = Field(None, title="Question Option ID")
question_id: Optional[str] = Field(None, title="Question ID")
language: Optional[str] = Field(None, title="Language")
letter: Optional[str] = Field(None, title="Letter")
question_option: Optional[str] = Field(None, title="Question option")
correctness_of_answer_option: Optional[int] = Field(
None, title="Correctness of answer option"
)
question_option_id: str = Field("", title="Question Option ID")
question_id: str = Field("", title="Question ID")
language: str = Field("", title="Language")
letter: str = Field("", title="Letter")
question_option: str = Field("", title="Question option")
correctness_of_answer_option: int = Field(-1, title="Correctness of answer option")


class QuestionOptionsDf(pa.DataFrameModel):
@@ -59,22 +57,20 @@ class Config:
class PromptVariation(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True)

include_in_next_evaluation: Optional[bool] = Field(
None, title="Include in next evaluation"
)
variation_id: Optional[str] = Field(None, title="Variation ID")
language: Optional[str] = Field(None, title="Language")
question_template: Optional[str] = Field(None, title="Question template")
question_prefix: Optional[str] = Field(None, title="Question prefix")
ai_prefix: Optional[str] = Field(None, title="AI prefix")
question_prompt_template: Optional[str] = Field(
None, title="Question prompt template"
include_in_next_evaluation: bool = Field(False, title="Include in next evaluation")
variation_id: str = Field("", title="Variation ID")
prompt_family: str = Field("", title="Prompt Family")
prompt_variation: str = Field("", title="Prompt Variation")
language: str = Field("", title="Language")
question_template: str = Field("", title="Question template")
question_prefix: str = Field("", title="Question prefix")
ai_prefix: str = Field("", title="AI prefix")
question_prompt_template: str = Field("", title="Question prompt template")
question_prompt_template_example: str = Field(
"", title="Question prompt template example"
)
question_prompt_template_example: Optional[str] = Field(
None, title="Question prompt template example"
)
follow_up_answer_correctness_evaluation_prompt_template: Optional[str] = Field(
None, title="Follow-up answer correctness evaluation prompt template"
follow_up_answer_correctness_evaluation_prompt_template: str = Field(
"", title="Follow-up answer correctness evaluation prompt template"
)


@@ -87,9 +83,9 @@ class Config:
class GenAiModel(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True, protected_namespaces=())

model_id: Optional[str] = Field(None, title="Model ID")
vendor: Optional[str] = Field(None, title="Vendor")
model_name: Optional[str] = Field(None, title="Model name")
model_id: str = Field("", title="Model ID")
vendor: str = Field("", title="Vendor")
model_name: str = Field("", title="Model name")


class GenAiModelsDf(pa.DataFrameModel):
@@ -101,15 +97,13 @@ class Config:
class GenAiModelConfig(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True, protected_namespaces=())

include_in_next_evaluation: Optional[bool] = Field(
None, title="Include in next evaluation"
)
model_config_id: Optional[str] = Field(None, title="Model configuration ID")
model_id: Optional[str] = Field(None, title="Model ID")
model_parameters: Optional[str] = Field(None, title="Model Parameters")
repeat_times: Optional[int] = Field(None, title="Repeat Times")
memory: Optional[bool] = Field(None, title="Memory")
memory_size: Optional[int] = Field(None, title="Memory Size")
include_in_next_evaluation: bool = Field(False, title="Include in next evaluation")
model_config_id: str = Field("", title="Model configuration ID")
model_id: str = Field("", title="Model ID")
model_parameters: str = Field("", title="Model Parameters")
repeat_times: int = Field(-1, title="Repeat Times")
memory: bool = Field(False, title="Memory")
memory_size: int = Field(-1, title="Memory Size")


class GenAiModelConfigsDf(pa.DataFrameModel):
@@ -119,11 +113,11 @@ class Config:


class Metric(BaseModel):
name: Optional[str] = Field(None, title="Name")
description: Optional[str] = Field(None, title="Description")
prompt: Optional[str] = Field(None, title="Prompt")
choices: Optional[str] = Field(None, title="Choices")
choice_scores: Optional[str] = Field(None, title="Choice Scores")
name: str = Field("", title="Name")
description: str = Field("", title="Description")
prompt: str = Field("", title="Prompt")
choices: str = Field("", title="Choices")
choice_scores: str = Field("", title="Choice Scores")


class MetricsDf(pa.DataFrameModel):
@@ -135,17 +129,17 @@ class Config:
class EvalResult(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True, protected_namespaces=())

question_id: Optional[str] = Field(None, title="Question ID")
language: Optional[str] = Field(None, title="Language")
prompt_variation_id: Optional[str] = Field(None, title="Prompt variation ID")
model_configuration_id: Optional[str] = Field(None, title="Model Configuration ID")
last_evaluation_datetime: Optional[str] = Field(None, title="Last Evaluation")
percent_correct: Optional[float] = Field(None, title="Percent Correct")
percent_wrong: Optional[float] = Field(None, title="Percent Wrong")
percent_very_wrong: Optional[float] = Field(None, title="Percent Very Wrong")
percent_eval_failed: Optional[float] = Field(None, title="Percent Eval Failed")
rounds: Optional[int] = Field(None, title="Rounds")
result: Optional[str] = Field(None, title="Result")
question_id: str = Field("", title="Question ID")
language: str = Field("", title="Language")
prompt_variation_id: str = Field("", title="Prompt variation ID")
model_configuration_id: str = Field("", title="Model Configuration ID")
last_evaluation_datetime: str = Field("", title="Last Evaluation")
percent_correct: Optional[float] = Field("", title="Percent Correct")
percent_wrong: Optional[float] = Field("", title="Percent Wrong")
percent_very_wrong: Optional[float] = Field("", title="Percent Very Wrong")
percent_eval_failed: Optional[float] = Field("", title="Percent Eval Failed")
rounds: int = Field(-1, title="Rounds")
result: str = Field("", title="Result")


class EvalResultsDf(pa.DataFrameModel):
@@ -157,16 +151,16 @@ class Config:
class SessionResult(BaseModel):
model_config = ConfigDict(coerce_numbers_to_str=True, protected_namespaces=())

session_id: Optional[str] = Field(None, title="Session ID")
session_time: Optional[str] = Field(None, title="Session Time")
prompt_variation_id: Optional[str] = Field(None, title="Prompt Variation ID")
model_configuration_id: Optional[str] = Field(None, title="Model Configuration ID")
survey_id: Optional[str] = Field(None, title="Survey ID")
question_id: Optional[str] = Field(None, title="Question ID")
language: Optional[str] = Field(None, title="Language")
question_number: Optional[int] = Field(None, title="Question No.")
output: Optional[str] = Field(None, title="Response Text")
grade: Optional[str] = Field(None, title="Grade")
session_id: str = Field("", title="Session ID")
session_time: str = Field("", title="Session Time")
prompt_variation_id: str = Field("", title="Prompt Variation ID")
model_configuration_id: str = Field("", title="Model Configuration ID")
survey_id: str = Field("", title="Survey ID")
question_id: str = Field("", title="Question ID")
language: str = Field("", title="Language")
question_number: int = Field(-1, title="Question No.")
output: str = Field("", title="Response Text")
grade: str = Field("", title="Grade")


class SessionResultsDf(pa.DataFrameModel):
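
The schema changes above lean on two Pydantic v2 behaviours: `coerce_numbers_to_str=True` turns numeric spreadsheet cells into strings (the "force question_id to be string" commit), and the formerly Optional fields now fall back to sentinel defaults instead of `None`. A small illustration of both, using a trimmed-down version of the `Question` model:

```python
from pydantic import BaseModel, ConfigDict, Field


class Question(BaseModel):
    model_config = ConfigDict(coerce_numbers_to_str=True)

    question_id: str = Field("", title="Question ID")
    include_in_next_evaluation: bool = Field(False, title="Include in next evaluation")


# A numeric cell read from the sheet is coerced to a string, not rejected.
assert Question(question_id=1785).question_id == "1785"

# A missing cell falls back to the sentinel default instead of None.
assert Question().include_in_next_evaluation is False
```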
46 changes: 14 additions & 32 deletions automation-api/lib/ai_eval_spreadsheet/wrapper.py
@@ -1,5 +1,5 @@
from dataclasses import dataclass
from typing import Optional
from typing import Type

from gspread import Spreadsheet

@@ -18,33 +18,26 @@
QuestionOption,
QuestionOptionsDf,
QuestionsDf,
SessionResult,
SessionResultsDf,
)
from lib.gdrive.auth import AuthorizedClients
from lib.gsheets.gsheets_worksheet_editor import GsheetsWorksheetEditor


@dataclass
class AiEvalData:
questions: Optional[GsheetsWorksheetEditor[QuestionsDf, Question]] = None
question_options: Optional[
GsheetsWorksheetEditor[QuestionOptionsDf, QuestionOption]
] = None
prompt_variations: Optional[
GsheetsWorksheetEditor[PromptVariationsDf, PromptVariation]
] = None
gen_ai_models: Optional[GsheetsWorksheetEditor[GenAiModelsDf, GenAiModel]] = None
gen_ai_model_configs: Optional[
GsheetsWorksheetEditor[GenAiModelConfigsDf, GenAiModelConfig]
] = None
metrics: Optional[GsheetsWorksheetEditor[MetricsDf, Metric]] = None
evaluation_results: Optional[
GsheetsWorksheetEditor[EvalResult, EvalResultsDf]
] = None
session_results: Optional[
GsheetsWorksheetEditor[SessionResult, SessionResultsDf]
] = None
prompt_variations: GsheetsWorksheetEditor[
Type[PromptVariationsDf], Type[PromptVariation]
]
questions: GsheetsWorksheetEditor[Type[QuestionsDf], Type[Question]]
question_options: GsheetsWorksheetEditor[
Type[QuestionOptionsDf], Type[QuestionOption]
]
gen_ai_models: GsheetsWorksheetEditor[Type[GenAiModelsDf], Type[GenAiModel]]
gen_ai_model_configs: GsheetsWorksheetEditor[
Type[GenAiModelConfigsDf], Type[GenAiModelConfig]
]
metrics: GsheetsWorksheetEditor[Type[MetricsDf], Type[Metric]]
evaluation_results: GsheetsWorksheetEditor[Type[EvalResult], Type[EvalResultsDf]]


sheet_names = {
@@ -55,7 +48,6 @@ class AiEvalData:
"gen_ai_model_configs": "Model configurations",
"metrics": "Metrics",
"evaluation_results": "Latest Results",
"session_results": "Sessions",
}


@@ -132,15 +124,6 @@ def read_ai_eval_data(
evaluate_formulas=False,
)

session_results = GsheetsWorksheetEditor(
sh=ai_eval_spreadsheet,
df_schema=SessionResultsDf,
row_schema=SessionResult,
worksheet_name=sheet_names["session_results"],
header_row_number=0,
evaluate_formulas=False,
)

return AiEvalData(
questions=questions,
question_options=question_options,
Expand All @@ -149,5 +132,4 @@ def read_ai_eval_data(
gen_ai_model_configs=gen_ai_model_configs,
metrics=metrics,
evaluation_results=evaluation_results,
session_results=session_results,
)
35 changes: 35 additions & 0 deletions automation-api/lib/config.py
@@ -1,10 +1,33 @@
from __future__ import annotations

import base64
import json
import os
import tempfile

from dotenv import load_dotenv


def make_tmp_file_google_application_credentials(base64encoded_credentials):
"""set up GOOGLE_APPLICATION_CREDENTIALS environment variable
GOOGLE_APPLICATION_CREDENTIALS is expected to be a file path, but we store the
file contents as a base64 encoded string.
This function will create a temp file with the original contents of the credentials
"""
service_account_credentials = base64.b64decode(base64encoded_credentials).decode(
"utf-8"
)
json_acct_info = json.loads(service_account_credentials)

with tempfile.NamedTemporaryFile(mode="w+", delete=False) as temp_file:
# TODO: this doesn't delete the temp file. is this safe to do in production?
json.dump(json_acct_info, temp_file, indent=2)

return os.path.abspath(temp_file.name)


def read_config() -> dict[str, str]:
load_dotenv()

Expand All @@ -30,6 +53,18 @@ def read_config() -> dict[str, str]:
"DASHSCOPE_API_KEY",
"REPLICATE_API_KEY",
"GEMINI_API_KEY",
"VERTEXAI_PROJECT",
"VERTEXAI_LOCATIONS",
"VERTEX_SERVICE_ACCOUNT_CREDENTIALS",
]:
config[key] = os.getenv(key=key, default="")

# create a tempfile for GOOGLE_APPLICATION_CREDENTIALS
if config["VERTEX_SERVICE_ACCOUNT_CREDENTIALS"]:
tmp_file = make_tmp_file_google_application_credentials(
config["VERTEX_SERVICE_ACCOUNT_CREDENTIALS"]
)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = tmp_file
config["GOOGLE_APPLICATION_CREDENTIALS"] = tmp_file

return config
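
Since `VERTEX_SERVICE_ACCOUNT_CREDENTIALS` holds the service-account JSON as a base64 string, the value consumed by `make_tmp_file_google_application_credentials` can be produced roughly like this (a sketch; the key-file path is a placeholder, and the authoritative steps are in automation-api/DEV.md as referenced in .env.example):

```python
import base64

# Hypothetical path to a downloaded service-account key file.
with open("service_account.json", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

print(encoded)  # paste into VERTEX_SERVICE_ACCOUNT_CREDENTIALS in your .env
```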