-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Genai user feedback evaluation #1322
base: main
Are you sure you want to change the base?
Genai user feedback evaluation #1322
Conversation
|
I will share some thoughts on the challenges we(at Langtrace) faced while implementing this: For some context, user feedback evaluations are generally collected as a thumbs up or thumbs down for LLM generations (typically in a chatbot) for the sake of understanding the model performance. So, this is a critical requirement for folks building with LLMs today. Challenges:
If you are curious to learn more about how we implemented this, see below the link:
|
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability | | ||
|---|---|---|---|---|---| | ||
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | | ||
| [`gen_ai.system`](/docs/attributes-registry/gen-ai.md) | string | The Generative AI product as identified by the client or server instrumentation. [1] | `openai` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this would be needed specifically. And, more importantly - how can this be achieved if the evaluation happens asynchronously from the generation. I think it's better to use span links or something similar to connect this with the original generation span.
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
9acac7e
to
a538723
Compare
1a2fcb1
to
2da159c
Compare
|
||
The event name MUST be `gen_ai.evaluation.user_feedback`. | ||
|
||
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably should be in the common section and we should talk about user_feedback as an example.
|
||
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability | | ||
|---|---|---|---|---|---| | ||
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my point of view, user feedback often relates to the overall output of an LLM application (which used multiple LLM completions to produce a final response to the user). gen_ai.response.id
targets LLM completions specifically which limits how user feedback can be used if response id is required. I'd suggest to allow for correlating user feedback with an id that can be set on any non-LLM-completion span, especially if this will define the schema for other evaluation metrics going forward.
We at Scorecard.io use OTEL for tracing but have our user feedback and model graded (LLM as judge) evaluations stored separately and linked so would love if there is a natural way for this data model to be standardized. Kartik makes great points (and we have similar requirements) of evaluations needing to be supported asynchronously from span generation time. |
Yeah, thats the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the |
+1, see my comment above. I think this also helps to correlate scores with non-llm calls which is useful |
We need a correlation(s) that works also when span_id is not available. The trace context is not available in all situations where evaluation scores or feedback are captured. There could also be other correlations in a system, response_id, session_id, turn_id, that are meaningful to a particular application or toolset. Is there a straightforward way to offer more than one option in the conventions? response_id, span_id, turn_id, etc. I'd think you want to |
<!-- markdownlint-capture --> | ||
<!-- markdownlint-disable --> | ||
|
||
The event name MUST be `gen_ai.evaluation.user_feedback`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussing at GenAI call:
- metrics for score are potentially more useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gen_ai.evaluation.relevance
dimensions:
- evaluator method
- ...
|
||
| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability | | ||
|---|---|---|---|---|---| | ||
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- this should not be required and we should allow other (any) correlation ids.
- we should call out that all evaluations should allow adding arbitrary correlation ids
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Changes
It provides details for user feedback event which can be used for evaluation purposes.
Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.
Merge requirement checklist
[chore]