Genai user feedback evaluation #1322

Open: wants to merge 7 commits into base: main
15 changes: 9 additions & 6 deletions docs/attributes-registry/gen-ai.md
@@ -17,8 +17,9 @@ This document defines the attributes used to describe telemetry in the context o
| Attribute | Type | Description | Examples | Stability |
| ---------------------------------- | -------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------- | ---------------------------------------------------------------- |
| `gen_ai.completion` | string | The full response received from the GenAI model. [1] | `[{'role': 'assistant', 'content': 'The capital of France is Paris.'}]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.operation.name` | string | The name of the operation being performed. [2] | `chat`; `text_completion` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.prompt` | string | The full prompt sent to the GenAI model. [3] | `[{'role': 'user', 'content': 'What is the capital of France?'}]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.evaluation.score` | double | The score calculated by the evaluator for the GenAI response. [2] | `0.42` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.operation.name` | string | The name of the operation being performed. [3] | `chat`; `text_completion` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.prompt` | string | The full prompt sent to the GenAI model. [4] | `[{'role': 'user', 'content': 'What is the capital of France?'}]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.request.frequency_penalty` | double | The frequency penalty setting for the GenAI request. | `0.1` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.request.max_tokens` | int | The maximum number of tokens the model generates for a request. | `100` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.request.model` | string | The name of the GenAI model a request is being made to. | `gpt-4` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
@@ -30,18 +31,20 @@ This document defines the attributes used to describe telemetry in the context o
| `gen_ai.response.finish_reasons` | string[] | Array of reasons the model stopped generating tokens, corresponding to each generation received. | `["stop"]`; `["stop", "length"]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.response.id` | string | The unique identifier for the completion. | `chatcmpl-123` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.response.model` | string | The name of the model that generated the response. | `gpt-4-0613` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.system` | string | The Generative AI product as identified by the client or server instrumentation. [4] | `openai` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.system` | string | The Generative AI product as identified by the client or server instrumentation. [5] | `openai` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.token.type` | string | The type of token being counted. | `input`; `output` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.usage.input_tokens` | int | The number of tokens used in the GenAI input (prompt). | `100` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `gen_ai.usage.output_tokens` | int | The number of tokens used in the GenAI response (completion). | `180` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** It's RECOMMENDED to format completions as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)

**[2]:** If one of the predefined values applies, but specific system uses a different name it's RECOMMENDED to document it in the semantic conventions for specific GenAI system and use system-specific name in the instrumentation. If a different name is not documented, instrumentation libraries SHOULD use applicable predefined value.
**[2]:** Semantic conventions describing GenAI evaluation telemetry SHOULD document the scoring system and method used to calculate the score.

**[3]:** It's RECOMMENDED to format prompts as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)
**[3]:** If one of the predefined values applies, but specific system uses a different name it's RECOMMENDED to document it in the semantic conventions for specific GenAI system and use system-specific name in the instrumentation. If a different name is not documented, instrumentation libraries SHOULD use applicable predefined value.

**[4]:** The `gen_ai.system` describes a family of GenAI models with specific model identified
**[4]:** It's RECOMMENDED to format prompts as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)

**[5]:** The `gen_ai.system` describes a family of GenAI models with specific model identified
by `gen_ai.request.model` and `gen_ai.response.model` attributes.

The actual GenAI product may differ from the one identified by the client.
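Notes [1] and [4] recommend serializing prompts and completions as JSON strings in the OpenAI messages format. A minimal sketch of what that looks like in an instrumentation (the variable names are illustrative, not part of the conventions):

```python
import json

# Messages in the OpenAI chat format (role/content pairs).
prompt_messages = [
    {"role": "user", "content": "What is the capital of France?"}
]
completion_messages = [
    {"role": "assistant", "content": "The capital of France is Paris."}
]

# Serialize to JSON strings suitable for the gen_ai.prompt and
# gen_ai.completion attribute values.
attributes = {
    "gen_ai.prompt": json.dumps(prompt_messages),
    "gen_ai.completion": json.dumps(completion_messages),
}
```

Storing the messages as a JSON string (rather than a Python `repr`, as in the table examples above) keeps the attribute parseable by downstream tooling.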
50 changes: 50 additions & 0 deletions docs/gen-ai/gen-ai-evaluation-events.md
@@ -0,0 +1,50 @@

<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Generative AI evaluation events
--->

# Semantic Conventions for GenAI evaluation events

**Status**: [Experimental][DocumentStatus]

Each evaluation event defines a common way to report an evaluation score and the context for this specific evaluation method.

## Naming pattern

The evaluation events follow `gen_ai.evaluation.{evaluation method}` naming pattern.
For example, evaluations that are common across different GenAI models and framework tooling, such as user feedback, should be reported as `gen_ai.evaluation.user_feedback`.

GenAI vendor-specific evaluation events SHOULD follow `gen_ai.{gen_ai.system}.evaluation.{evaluation method}` pattern.
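The two naming patterns above can be sketched as regular expressions; this helper is a hypothetical illustration, not part of the conventions, and assumes lowercase snake_case segments:

```python
import re

# Common evaluation events: gen_ai.evaluation.{evaluation method}
COMMON = re.compile(r"^gen_ai\.evaluation\.[a-z][a-z0-9_]*$")
# Vendor-specific events: gen_ai.{gen_ai.system}.evaluation.{evaluation method}
VENDOR = re.compile(r"^gen_ai\.[a-z][a-z0-9_]*\.evaluation\.[a-z][a-z0-9_]*$")

def is_valid_evaluation_event_name(name: str) -> bool:
    """Return True if the event name matches either documented pattern."""
    return bool(COMMON.match(name) or VENDOR.match(name))
```

For example, `gen_ai.evaluation.user_feedback` matches the common pattern and `gen_ai.openai.evaluation.relevance` the vendor-specific one, while a name missing the `gen_ai.` prefix matches neither.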

## User feedback evaluation

The user feedback evaluation event SHOULD be captured if and only if the user provided a reaction to the GenAI model response.
It SHOULD, when possible, be parented to the GenAI span describing that response.

<!-- semconv gen_ai.evaluation.user_feedback -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

The event name MUST be `gen_ai.evaluation.user_feedback`.
Contributor:
discussing at GenAI call:

  • metrics for score are potentially more useful

Contributor:
gen_ai.evaluation.relevance
dimensions:

  • evaluator method
  • ...


| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
Contributor:

this probably should be in the common section and we should talk about user_feedback as an example.

|---|---|---|---|---|---|
| [`gen_ai.response.id`](/docs/attributes-registry/gen-ai.md) | string | The unique identifier for the completion. | `chatcmpl-123` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

Comment:

From my point of view, user feedback often relates to the overall output of an LLM application (which used multiple LLM completions to produce a final response to the user). gen_ai.response.id targets LLM completions specifically which limits how user feedback can be used if response id is required. I'd suggest to allow for correlating user feedback with an id that can be set on any non-LLM-completion span, especially if this will define the schema for other evaluation metrics going forward.

Contributor @lmolkova (Oct 17, 2024):

  • this should not be required and we should allow other (any) correlation ids.
  • we should call out that all evaluations should allow adding arbitrary correlation ids

| [`gen_ai.evaluation.score`](/docs/attributes-registry/gen-ai.md) | double | Quantified score calculated based on the user reaction in [-1.0, 1.0] range with 0 representing a neutral reaction. | `0.42` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |


<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

The user feedback event body has the following structure:

| Body Field | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `comment` | string | Additional details about the user feedback | `"I did not like it"` | `Opt-in` |
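Putting the pieces together, an instrumentation might assemble the event roughly as follows. This is a sketch under stated assumptions: the helper name and the three-value reaction scale are illustrative, not defined by the conventions.

```python
from typing import Optional

def user_feedback_event(response_id: str, reaction: str,
                        comment: Optional[str] = None) -> dict:
    """Build a gen_ai.evaluation.user_feedback event payload (illustrative)."""
    # Map a coarse user reaction onto the [-1.0, 1.0] score range,
    # with 0.0 representing a neutral reaction.
    scores = {"thumbs_down": -1.0, "neutral": 0.0, "thumbs_up": 1.0}
    event = {
        "name": "gen_ai.evaluation.user_feedback",
        "attributes": {
            "gen_ai.response.id": response_id,
            "gen_ai.evaluation.score": scores[reaction],
        },
        "body": {},
    }
    if comment is not None:  # `comment` is an Opt-in body field.
        event["body"]["comment"] = comment
    return event

event = user_feedback_event("chatcmpl-123", "thumbs_down", "I did not like it")
```

A richer instrumentation could accept a continuous score directly (e.g. a star rating rescaled into [-1.0, 1.0]) instead of the discrete mapping shown here.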

[DocumentStatus]: https://opentelemetry.io/docs/specs/otel/document-status
2 changes: 1 addition & 1 deletion docs/gen-ai/gen-ai-spans.md
@@ -175,4 +175,4 @@ The event name MUST be `gen_ai.content.completion`.
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.22.0/specification/document-status.md
[DocumentStatus]: https://opentelemetry.io/docs/specs/otel/document-status
43 changes: 43 additions & 0 deletions model/gen-ai/events.yaml
@@ -0,0 +1,43 @@
groups:
- id: gen_ai.content.prompt
name: gen_ai.content.prompt
stability: experimental
type: event
brief: >
In the lifetime of a GenAI span, events for prompts sent and completions received
may be created, depending on the configuration of the instrumentation.
attributes:
- ref: gen_ai.prompt
requirement_level:
conditionally_required: if and only if corresponding event is enabled
note: >
It's RECOMMENDED to format prompts as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)

- id: gen_ai.content.completion
name: gen_ai.content.completion
type: event
stability: experimental
brief: >
In the lifetime of a GenAI span, events for prompts sent and completions received
may be created, depending on the configuration of the instrumentation.
attributes:
- ref: gen_ai.completion
requirement_level:
conditionally_required: if and only if corresponding event is enabled
note: >
It's RECOMMENDED to format completions as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)

- id: gen_ai.evaluation.user_feedback
name: gen_ai.evaluation.user_feedback
type: event
stability: experimental
brief: >
This event describes the evaluation of a GenAI response based on user feedback.
attributes:
- ref: gen_ai.response.id
requirement_level: required
- ref: gen_ai.evaluation.score
brief: >
Quantified score calculated based on the user reaction in [-1.0, 1.0] range with 0 representing a neutral reaction.
note: ""
requirement_level: recommended
11 changes: 11 additions & 0 deletions model/gen-ai/registry.yaml
@@ -1,6 +1,7 @@
groups:
- id: registry.gen_ai
type: attribute_group
stability: experimental
display_name: GenAI Attributes
brief: >
This document defines the attributes used to describe telemetry in the context of Generative Artificial Intelligence (GenAI) Models requests and responses.
@@ -148,8 +149,18 @@ groups:
If one of the predefined values applies, but specific system uses a different name it's RECOMMENDED to document it in the semantic
conventions for specific GenAI system and use system-specific name in the instrumentation.
If a different name is not documented, instrumentation libraries SHOULD use applicable predefined value.
- id: gen_ai.evaluation.score
stability: experimental
type: double
brief: The score calculated by the evaluator for the GenAI response.
note: >
Semantic conventions describing GenAI evaluation telemetry SHOULD document
the scoring system and method used to calculate the score.
examples: [0.42]

- id: registry.gen_ai.openai
type: attribute_group
stability: experimental
display_name: OpenAI Attributes
brief: >
This group defines attributes for OpenAI.
26 changes: 0 additions & 26 deletions model/gen-ai/spans.yaml
@@ -58,32 +58,6 @@ groups:
- gen_ai.content.prompt
- gen_ai.content.completion

- id: gen_ai.content.prompt
name: gen_ai.content.prompt
type: event
brief: >
In the lifetime of an GenAI span, events for prompts sent and completions received
may be created, depending on the configuration of the instrumentation.
attributes:
- ref: gen_ai.prompt
requirement_level:
conditionally_required: if and only if corresponding event is enabled
note: >
It's RECOMMENDED to format prompts as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)

- id: gen_ai.content.completion
name: gen_ai.content.completion
type: event
brief: >
In the lifetime of an GenAI span, events for prompts sent and completions received
may be created, depending on the configuration of the instrumentation.
attributes:
- ref: gen_ai.completion
requirement_level:
conditionally_required: if and only if corresponding event is enabled
note: >
It's RECOMMENDED to format completions as JSON string matching [OpenAI messages format](https://platform.openai.com/docs/guides/text-generation)

- id: trace.gen_ai.client
extends: trace.gen_ai.client.common
brief: >