Genai user feedback evaluation #1322

truptiparkar7 · 2024-08-06T18:50:00Z

Changes

This PR provides details for the user feedback event, which can be used for evaluation purposes.

Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.

Merge requirement checklist

CONTRIBUTING.md guidelines followed.
Change log entry added, according to the guidelines in When to add a changelog entry.
- If your PR does not need a change log, start the PR title with [chore]
schema-next.yaml updated with changes to existing conventions.

linux-foundation-easycla · 2024-08-06T18:50:05Z

✅ login: lmolkova / name: Liudmila Molkova (2da159c, a538723)
❌ - login: @truptiparkar7 . The commit (b937f02, fdc5e6a, 5e777af, 8030345, a17db18) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

model/trace/gen-ai.yaml

karthikscale3 · 2024-08-14T18:16:19Z

I will share some thoughts on the challenges we (at Langtrace) faced while implementing this:

For some context, user feedback evaluations are generally collected as a thumbs up or thumbs down for LLM generations (typically in a chatbot) for the sake of understanding the model performance. So, this is a critical requirement for folks building with LLMs today.

Challenges:

- Because the feedback can only be collected after the LLM generates the response, the span for the LLM generation has already been created. And today, as far as I can tell, there is no way to attach an attribute to an already generated span in an OTEL-native way (there is no API to do this).
- As a result, at Langtrace, we decided to send the spanId of the LLM-generated response to the application layer through a higher-order function/decorator, which the application developer needs to use in order to capture user feedback scores. On the application layer, the developer has access to the spanId, which is then used for attaching the user feedback score and other user metadata, such as a user ID that uniquely identifies the user who gave the feedback.
- Now, at this stage, you have 2 options: either generate a new span that's a child of this span (which is very tricky to establish) or store the evaluation against the spanId in a completely separate metadata store. We went with the latter approach for a few reasons:
  - Creating a new child span was very tricky to get working, especially with streaming responses or other implementations of the LLM SDK (like the Vercel AI SDK).
  - Attaching the feedback to the span by exposing a vendor-specific API off the database that stores the span was expensive and difficult to maintain (also, as a general rule of thumb, we weren't comfortable mutating the trace data post generation).
  - For conversations happening in a single session, it ends up creating multiple feedback spans, and when users change their feedback for the same generated response, we end up with more than one span linked to the same response ID or span ID. It is then impossible to know what the actual feedback is unless you sort the spans by creation time, which was not clean.

If you are curious to learn more about how we implemented this, see the links below:

Docs: https://docs.langtrace.ai/tracing/trace_user_feedback#understanding-user-feedback
SendUserFeedback API which sends the feedback to an external data store - https://github.com/Scale3-Labs/langtrace-python-sdk/blob/c024295ccf8c2fc9ecb13714826c2b5c12deb010/src/langtrace_python_sdk/utils/with_root_span.py#L180
A decorator that attaches the spanId of the span created as a result of the LLM generation and allows the application to access it as a function parameter - https://github.com/Scale3-Labs/langtrace-python-sdk/blob/c024295ccf8c2fc9ecb13714826c2b5c12deb010/src/langtrace_python_sdk/utils/with_root_span.py#L67
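
To make the flow described above concrete, here is a minimal, stdlib-only sketch of the pattern: a decorator hands a span id to the application function, and feedback is stored against that id in a separate store. It is not the Langtrace implementation. The names `with_span_id`, `send_user_feedback`, and `FEEDBACK_STORE` are hypothetical, and a random hex string stands in for the id a real SDK would read from the active span.

```python
import functools
import secrets
from typing import Callable, Dict, Tuple

# Hypothetical in-memory stand-in for an external feedback store.
# Keyed by (span_id, user_id) so a changed vote overwrites the earlier one.
FEEDBACK_STORE: Dict[Tuple[str, str], int] = {}


def with_span_id(func: Callable) -> Callable:
    """Call `func` with a span id injected as a keyword argument."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Assumption: a real SDK would take this from the active span's
        # context; a random 8-byte hex string stands in for it here.
        span_id = secrets.token_hex(8)
        return func(*args, span_id=span_id, **kwargs)

    return wrapper


def send_user_feedback(span_id: str, user_id: str, score: int) -> None:
    """Record a thumbs up (+1) or thumbs down (-1) against a span id."""
    if score not in (1, -1):
        raise ValueError("score must be +1 or -1")
    FEEDBACK_STORE[(span_id, user_id)] = score


@with_span_id
def answer(question: str, span_id: str = "") -> dict:
    # Placeholder for the LLM call; the span id travels with the response
    # so the UI can send it back together with the user's vote.
    return {"text": f"echo: {question}", "span_id": span_id}


response = answer("What is OpenTelemetry?")
send_user_feedback(response["span_id"], "user-42", -1)
send_user_feedback(response["span_id"], "user-42", 1)  # user changes their vote
```

Keying the store by span id and user id means a changed vote replaces the earlier one, which sidesteps the "sort spans by creation time" problem mentioned above, at the cost of keeping feedback outside the trace data.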

nirga · 2024-08-28T12:05:34Z

docs/gen-ai/genai-evaluation-events.md

+| Attribute  | Type | Description  | Examples  | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
+|---|---|---|---|---|---|
+| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
+| [`gen_ai.system`](/docs/attributes-registry/gen-ai.md) | string | The Generative AI product as identified by the client or server instrumentation. [1] | `openai` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


I wonder why this would be needed specifically. And, more importantly, how can this be achieved if the evaluation happens asynchronously from the generation? I think it's better to use span links or something similar to connect this with the original generation span.

github-actions · 2024-09-13T03:20:30Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

lmolkova · 2024-10-03T06:55:15Z

docs/gen-ai/gen-ai-evaluation-events.md

+
+The event name MUST be `gen_ai.evaluation.user_feedback`.
+
+| Attribute  | Type | Description  | Examples  | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |


this probably should be in the common section and we should talk about user_feedback as an example.

marcklingen · 2024-10-03T07:11:12Z

docs/gen-ai/gen-ai-evaluation-events.md

+
+| Attribute  | Type | Description  | Examples  | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
+|---|---|---|---|---|---|
+| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


From my point of view, user feedback often relates to the overall output of an LLM application (which used multiple LLM completions to produce a final response to the user). `gen_ai.response.id` targets LLM completions specifically, which limits how user feedback can be used if the response id is required. I'd suggest allowing user feedback to be correlated with an id that can be set on any non-LLM-completion span, especially if this will define the schema for other evaluation metrics going forward.

Rutledge · 2024-10-06T08:45:54Z

We at Scorecard.io use OTEL for tracing but store our user feedback and model-graded (LLM-as-judge) evaluations separately and link them, so we would love a natural way for this data model to be standardized.

Kartik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.

karthikscale3 · 2024-10-09T16:31:19Z

> We at Scorecard.io use OTEL for tracing but store our user feedback and model-graded (LLM-as-judge) evaluations separately and link them, so we would love a natural way for this data model to be standardized.
>
> Kartik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.

Yeah, that's the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the span_id as a foreign key for referencing the original trace.

marcklingen · 2024-10-09T18:58:52Z

> > We at Scorecard.io use OTEL for tracing but store our user feedback and model-graded (LLM-as-judge) evaluations separately and link them, so we would love a natural way for this data model to be standardized.
> >
> > Kartik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.
>
> Yeah, that's the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the span_id as a foreign key for referencing the original trace.

+1, see my comment above. I think this also helps to correlate scores with non-LLM calls, which is useful.

drewby · 2024-10-10T05:47:34Z

> > > We at Scorecard.io use OTEL for tracing but store our user feedback and model-graded (LLM-as-judge) evaluations separately and link them, so we would love a natural way for this data model to be standardized.
> > >
> > > Kartik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.
> >
> > Yeah, that's the number 1 challenge. We are doing it exactly the same way at the moment. Evals are stored in a separate model and linked using the span_id as a foreign key for referencing the original trace.
>
> +1, see my comment above. I think this also helps to correlate scores with non-LLM calls, which is useful.

We need correlation(s) that also work when span_id is not available. The trace context is not available in all situations where evaluation scores or feedback are captured. There could also be other correlations in a system, such as response_id, session_id, or turn_id, that are meaningful to a particular application or toolset.

Is there a straightforward way to offer more than one option in the conventions (response_id, span_id, turn_id, etc.)? I'd think you want to require that at least one be present.
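
One way to picture the "at least one correlation id" rule is as validation on the event payload. The sketch below is only an illustration of that idea, not proposed convention text: the `UserFeedbackEvent` class and its `correlation_ids` field are hypothetical, and the keys shown are just the ones named in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UserFeedbackEvent:
    """Hypothetical shape for a `gen_ai.evaluation.user_feedback` event."""

    score: int
    # Any mix of correlation ids, e.g. response_id, span_id, turn_id.
    correlation_ids: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Ignore keys with empty values so {"span_id": ""} does not count.
        present = {k: v for k, v in self.correlation_ids.items() if v}
        if not present:
            raise ValueError("at least one correlation id is required")


# Accepted: no span_id is available, but a turn_id is.
ok = UserFeedbackEvent(score=1, correlation_ids={"turn_id": "turn-7"})

# Rejected: nothing to correlate the feedback with.
try:
    UserFeedbackEvent(score=1)
    rejected = False
except ValueError:
    rejected = True
```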

lmolkova · 2024-10-17T06:28:39Z

docs/gen-ai/gen-ai-evaluation-events.md

+<!-- markdownlint-capture -->
+<!-- markdownlint-disable -->
+
+The event name MUST be `gen_ai.evaluation.user_feedback`.


discussing at GenAI call:

- metrics for score are potentially more useful
  - `gen_ai.evaluation.relevance`
  - dimensions:
    - evaluator method
    - ...

lmolkova · 2024-10-17T06:58:45Z

docs/gen-ai/gen-ai-evaluation-events.md

+
+| Attribute  | Type | Description  | Examples  | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
+|---|---|---|---|---|---|
+| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


this should not be required and we should allow other (any) correlation ids.

we should call out that all evaluations should allow adding arbitrary correlation ids

github-actions · 2024-11-02T03:20:59Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

truptiparkar7 requested review from a team August 6, 2024 18:50

github-actions bot assigned AlexanderWert Aug 6, 2024

arminru added the area:gen-ai label Aug 12, 2024

trisch-me reviewed Aug 12, 2024

nirga reviewed Aug 28, 2024

lmolkova mentioned this pull request Sep 6, 2024

Expand GenAI project to cover instrumentation libraries open-telemetry/community#2326

Merged

github-actions bot added Stale and removed Stale labels Sep 13, 2024

truptiparkar7 requested review from a team as code owners September 19, 2024 21:10

truptiparkar7 and others added 6 commits October 2, 2024 16:56

Update gen-ai.yaml

5e777af

Create genai-evaluation-events

b937f02

Rename genai-evaluation-events to genai-evaluation-events.md

8030345

Update genai-evaluation-events.md

fdc5e6a

Update genai-evaluation-events.md

a17db18

Move score to the attribute

a538723

lmolkova force-pushed the genai-evaluation-events branch from 9acac7e to a538723 Compare October 3, 2024 00:52

up

2da159c

lmolkova force-pushed the genai-evaluation-events branch from 1a2fcb1 to 2da159c Compare October 3, 2024 00:54

lmolkova reviewed Oct 3, 2024

marcklingen reviewed Oct 3, 2024

lmolkova reviewed Oct 17, 2024

github-actions bot added the Stale label Nov 2, 2024
