feat: evaluate api #463
Conversation
Updated, PTAL again, thanks. @Mini256
factual_correctness: Optional[float] = Field(nullable=True)
semantic_similarity: Optional[float] = Field(nullable=True)
I have some concerns about the extensibility of this data model:
- How should we handle other evaluation metrics if we add more later?
- How do we handle user-defined metrics?
It may be a bit early to consider, but the cost of a table schema migration is relatively high.
Langfuse's data model may be of some help to us:
https://langfuse.com/docs/scores/data-model
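For illustration only (the field names below are paraphrased, not taken from this PR or from Langfuse's code), the idea is a generic score record that decouples the metric name from the table schema:

# Hypothetical, Langfuse-inspired score record: one row per metric per evaluated item,
# so adding a new or user-defined metric does not require a schema migration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Score:
    name: str                       # e.g. "factual_correctness"
    value: float                    # numeric metric value
    item_id: int                    # the evaluated item this score belongs to
    comment: Optional[str] = None   # optional free-form note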
It is too early to consider this. If we need to add another metric, the effort will be far more than just adding a field to the table, and before that we should fix the retrieved_contexts first. We cannot support user-defined custom metrics in this version. PTAL at the ragas.metrics package.
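For context, here is a minimal sketch of how these metrics are typically computed with ragas (the class names and dataset fields follow ragas 0.2-style docs and are assumptions, not code from this PR; an LLM/embeddings backend has to be configured as described in the ragas documentation):

# Sketch only: compute factual correctness and semantic similarity with ragas.
from ragas import evaluate, EvaluationDataset
from ragas.metrics import FactualCorrectness, SemanticSimilarity

dataset = EvaluationDataset.from_list([
    {
        "user_input": "What is TiDB?",
        "response": "TiDB is a distributed SQL database.",
        "reference": "TiDB is an open-source, MySQL-compatible distributed SQL database.",
        "retrieved_contexts": ["TiDB is an open-source distributed SQL database ..."],
    }
])

result = evaluate(dataset, metrics=[FactualCorrectness(), SemanticSimilarity()])
print(result)  # per-metric scores, e.g. factual_correctness and semantic_similarity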
I'd prefer to store the metric values in another table, like:

from typing import Optional
from sqlmodel import SQLModel, Field

class EvaluationTaskItemScore(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field(max_length=40)  # maybe `factual_correctness`, `semantic_similarity`, `faithfulness` and more ...
    value: float
    evaluation_task_item_id: int
    evaluation_task_id: int
Just a suggestion and a reminder about extensibility; the hard-coded columns approach is also OK for me.
cc: @wd0517 @sykp241095 What do you think?
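If we went that route, writing results would look roughly like this (a sketch only; `session`, `item`, and `task` are assumed to come from the surrounding evaluation code and are not names from this PR):

# Sketch: persist one row per metric for an evaluated item, whatever the metric set is.
scores = {"factual_correctness": 0.82, "semantic_similarity": 0.91}
for metric_name, metric_value in scores.items():
    session.add(
        EvaluationTaskItemScore(
            name=metric_name,
            value=metric_value,
            evaluation_task_item_id=item.id,
            evaluation_task_id=task.id,
        )
    )
session.commit()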
Langfuse supports user-defined metrics, so storing them in a separate table is required. However, we currently only support a limited set of hardcoded metrics, making a dedicated column for these acceptable for now.
backend/app/tasks/evaluate.py (outdated)
url=settings.TIDB_AI_CHAT_ENDPOINT,
headers={
    "Content-Type": "application/json",
    "Authorization": f"Bearer {settings.TIDB_AI_API_KEY}",
TIDB_AI_API_KEY is a SecretStr; you should use settings.TIDB_AI_API_KEY.get_secret_value().
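For reference, a minimal sketch of the difference (the key value is a placeholder):

# pydantic's SecretStr masks its value when rendered in a string; the raw secret
# must be pulled out explicitly with get_secret_value().
from pydantic import SecretStr

api_key = SecretStr("sk-example")                # placeholder value

print(f"Bearer {api_key}")                       # "Bearer **********" (masked, would break the header)
print(f"Bearer {api_key.get_secret_value()}")    # "Bearer sk-example" (the actual secret)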
Please use
Sure, but the
🤣 maybe we need to add a GitHub Action to run
Or a git hook maybe.
E2E Result Deployment: https://tidb-ai-playwright-1ph7osag6-djaggers-projects.vercel.app
Issue: #376
Added an `evaluation` queue in Celery and changed the rest of the tasks to the `default` queue.
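A minimal sketch of that kind of queue routing (the module path and broker URL below are assumptions, not taken from this repo):

# Sketch: route evaluation tasks to their own Celery queue, everything else to "default".
from celery import Celery

app = Celery("backend", broker="redis://localhost:6379/0")  # placeholder broker URL

app.conf.task_default_queue = "default"
app.conf.task_routes = {
    # Hypothetical task path; the real project layout may differ.
    "app.tasks.evaluate.*": {"queue": "evaluation"},
}

# Workers are then started per queue, for example:
#   celery -A app worker -Q default
#   celery -A app worker -Q evaluation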