add http.request.synthetic attribute to server spans and metrics #1523

Open. Wants to merge 10 commits into `main`.
22 changes: 22 additions & 0 deletions .chloggen/add-synthetic-source.yaml
@@ -0,0 +1,22 @@
# Use this changelog template to create an entry for release notes.
#
# If your change doesn't affect end users you should instead start
# your pull request title with [chore] or use the "Skip Changelog" label.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the area of concern in the attributes-registry, (e.g. http, cloud, db)
component: user_agent

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Add the user_agent.synthetic.type attribute to track if spans and metrics are the result of real users, testing, or bots.

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
# The values here must be integers.
issues: [1127]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:
14 changes: 12 additions & 2 deletions docs/attributes-registry/user-agent.md
@@ -14,8 +14,18 @@ Describes user-agent attributes.
|---|---|---|---|---|
| <a id="user-agent-name" href="#user-agent-name">`user_agent.name`</a> | string | Name of the user-agent extracted from original. Usually refers to the browser's name. [1] | `Safari`; `YourApp` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| <a id="user-agent-original" href="#user-agent-original">`user_agent.original`</a> | string | Value of the [HTTP User-Agent](https://www.rfc-editor.org/rfc/rfc9110.html#field.user-agent) header sent by the client. | `CERN-LineMode/2.15 libwww/2.17b3`; `Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1`; `YourApp/1.0.0 grpc-java-okhttp/1.27.2` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| <a id="user-agent-version" href="#user-agent-version">`user_agent.version`</a> | string | Version of the user-agent extracted from original. Usually refers to the browser's version [2] | `14.1.2`; `1.0.0` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| <a id="user-agent-synthetic-type" href="#user-agent-synthetic-type">`user_agent.synthetic.type`</a> | string | Specifies the category of synthetic traffic, such as monitoring, crawler, bot, or another automation. [2] | `bot`; `test` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| <a id="user-agent-version" href="#user-agent-version">`user_agent.version`</a> | string | Version of the user-agent extracted from original. Usually refers to the browser's version [3] | `14.1.2`; `1.0.0` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** [Example](https://www.whatsmyua.info) of extracting browser's name from original string. In the case of using a user-agent for non-browser products, such as microservices with multiple names/versions inside the `user_agent.original`, the most significant name SHOULD be selected. In such a scenario it should align with `user_agent.version`

**[2]:** [Example](https://www.whatsmyua.info) of extracting browser's version from original string. In the case of using a user-agent for non-browser products, such as microservices with multiple names/versions inside the `user_agent.original`, the most significant version SHOULD be selected. In such a scenario it should align with `user_agent.name`
**[2]:** This flag can primarily be determined by the contents of the `user_agent.original` attribute. Instrumentations should determine what they consider synthetic or bot traffic, and set this attribute accordingly. This attribute is useful for distinguishing between genuine client traffic and synthetic traffic generated by bots or tests.

**[3]:** [Example](https://www.whatsmyua.info) of extracting browser's version from original string. In the case of using a user-agent for non-browser products, such as microservices with multiple names/versions inside the `user_agent.original`, the most significant version SHOULD be selected. In such a scenario it should align with `user_agent.name`

`user_agent.synthetic.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `bot` | Bot source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `test` | Synthetic test source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
30 changes: 30 additions & 0 deletions docs/http/http-metrics.md
@@ -89,6 +89,7 @@ of `[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10
| [`network.protocol.version`](/docs/attributes-registry/network.md) | string | The actual version of the protocol used for network communication. [7] | `1.0`; `1.1`; `2`; `3` | `Recommended` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.address`](/docs/attributes-registry/server.md) | string | Name of the local HTTP server that received the request. [8] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.port`](/docs/attributes-registry/server.md) | int | Port of the local HTTP server that received the request. [9] | `80`; `8080`; `443` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`user_agent.synthetic.type`](/docs/attributes-registry/user-agent.md) | string | Specifies the category of synthetic traffic, such as monitoring, crawler, bot, or another automation. [10] | `bot`; `test` | `Opt-In` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** HTTP request method value SHOULD be "known" to the instrumentation.
By default, this convention defines "known" methods as the ones listed in [RFC9110](https://www.rfc-editor.org/rfc/rfc9110.html#name-methods)
@@ -143,6 +144,8 @@ SHOULD include the [application root](/docs/http/http-spans.md#http-server-defin
> Since this attribute is based on HTTP headers, opting in to it may allow an attacker
> to trigger cardinality limits, degrading the usefulness of the metric.

**[10]:** This flag can primarily be determined by the contents of the `user_agent.original` attribute. Instrumentations should determine what they consider synthetic or bot traffic, and set this attribute accordingly. This attribute is useful for distinguishing between genuine client traffic and synthetic traffic generated by bots or tests.

`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
@@ -164,6 +167,13 @@ SHOULD include the [application root](/docs/http/http-spans.md#http-server-defin
| `PUT` | PUT method. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| `TRACE` | TRACE method. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

`user_agent.synthetic.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `bot` | Bot source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `test` | Synthetic test source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
@@ -264,6 +274,7 @@ This metric is optional.
| [`network.protocol.version`](/docs/attributes-registry/network.md) | string | The actual version of the protocol used for network communication. [7] | `1.0`; `1.1`; `2`; `3` | `Recommended` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.address`](/docs/attributes-registry/server.md) | string | Name of the local HTTP server that received the request. [8] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.port`](/docs/attributes-registry/server.md) | int | Port of the local HTTP server that received the request. [9] | `80`; `8080`; `443` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`user_agent.synthetic.type`](/docs/attributes-registry/user-agent.md) | string | Specifies the category of synthetic traffic, such as monitoring, crawler, bot, or another automation. [10] | `bot`; `test` | `Opt-In` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** HTTP request method value SHOULD be "known" to the instrumentation.
By default, this convention defines "known" methods as the ones listed in [RFC9110](https://www.rfc-editor.org/rfc/rfc9110.html#name-methods)
@@ -318,6 +329,8 @@ SHOULD include the [application root](/docs/http/http-spans.md#http-server-defin
> Since this attribute is based on HTTP headers, opting in to it may allow an attacker
> to trigger cardinality limits, degrading the usefulness of the metric.

**[10]:** This flag can primarily be determined by the contents of the `user_agent.original` attribute. Instrumentations should determine what they consider synthetic or bot traffic, and set this attribute accordingly. This attribute is useful for distinguishing between genuine client traffic and synthetic traffic generated by bots or tests.

`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
@@ -339,6 +352,13 @@ SHOULD include the [application root](/docs/http/http-spans.md#http-server-defin
| `PUT` | PUT method. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| `TRACE` | TRACE method. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

`user_agent.synthetic.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `bot` | Bot source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `test` | Synthetic test source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
@@ -372,6 +392,7 @@ This metric is optional.
| [`network.protocol.version`](/docs/attributes-registry/network.md) | string | The actual version of the protocol used for network communication. [7] | `1.0`; `1.1`; `2`; `3` | `Recommended` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.address`](/docs/attributes-registry/server.md) | string | Name of the local HTTP server that received the request. [8] | `example.com`; `10.1.2.80`; `/tmp/my.sock` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`server.port`](/docs/attributes-registry/server.md) | int | Port of the local HTTP server that received the request. [9] | `80`; `8080`; `443` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`user_agent.synthetic.type`](/docs/attributes-registry/user-agent.md) | string | Specifies the category of synthetic traffic, such as monitoring, crawler, bot, or another automation. [10] | `bot`; `test` | `Opt-In` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** HTTP request method value SHOULD be "known" to the instrumentation.
By default, this convention defines "known" methods as the ones listed in [RFC9110](https://www.rfc-editor.org/rfc/rfc9110.html#name-methods)
@@ -426,6 +447,8 @@ SHOULD include the [application root](/docs/http/http-spans.md#http-server-defin
> Since this attribute is based on HTTP headers, opting in to it may allow an attacker
> to trigger cardinality limits, degrading the usefulness of the metric.

**[10]:** This flag can primarily be determined by the contents of the `user_agent.original` attribute. Instrumentations should determine what they consider synthetic or bot traffic, and set this attribute accordingly. This attribute is useful for distinguishing between genuine client traffic and synthetic traffic generated by bots or tests.

`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
@@ -447,6 +470,13 @@ SHOULD include the [application root](/docs/http/http-spans.md#http-server-defin
| `PUT` | PUT method. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| `TRACE` | TRACE method. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

`user_agent.synthetic.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `bot` | Bot source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `test` | Synthetic test source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
10 changes: 10 additions & 0 deletions docs/http/http-spans.md
@@ -384,6 +384,7 @@ For an HTTP server span, `SpanKind` MUST be `SERVER`.
| [`network.local.address`](/docs/attributes-registry/network.md) | string | Local socket address. Useful in case of a multi-IP host. | `10.1.2.80`; `/tmp/my.sock` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`network.local.port`](/docs/attributes-registry/network.md) | int | Local socket port. Useful in case of a multi-port host. | `65123` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`network.transport`](/docs/attributes-registry/network.md) | string | [OSI transport layer](https://osi-model.com/transport-layer/) or [inter-process communication method](https://wikipedia.org/wiki/Inter-process_communication). [17] | `tcp`; `udp` | `Opt-In` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| [`user_agent.synthetic.type`](/docs/attributes-registry/user-agent.md) | string | Specifies the category of synthetic traffic, such as monitoring, crawler, bot, or another automation. [18] | `bot`; `test` | `Opt-In` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** HTTP request method value SHOULD be "known" to the instrumentation.
By default, this convention defines "known" methods as the ones listed in [RFC9110](https://www.rfc-editor.org/rfc/rfc9110.html#name-methods)
@@ -452,6 +453,8 @@ The attribute value MUST consist of either multiple header values as an array of

**[17]:** Generally `tcp` for `HTTP/1.0`, `HTTP/1.1`, and `HTTP/2`. Generally `udp` for `HTTP/3`. Other obscure implementations are possible.

**[18]:** This flag can primarily be determined by the contents of the `user_agent.original` attribute. Instrumentations should determine what they consider synthetic or bot traffic, and set this attribute accordingly. This attribute is useful for distinguishing between genuine client traffic and synthetic traffic generated by bots or tests.

The following attributes can be important for making sampling decisions
and SHOULD be provided **at span creation time** (if provided at all):

@@ -496,6 +499,13 @@ and SHOULD be provided **at span creation time** (if provided at all):
| `udp` | UDP | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| `unix` | Unix domain socket | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

`user_agent.synthetic.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `bot` | Bot source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `test` | Synthetic test source. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
2 changes: 2 additions & 0 deletions model/http/metrics.yaml
@@ -18,6 +18,8 @@ groups:
> **Warning**
> Since this attribute is based on HTTP headers, opting in to it may allow an attacker
> to trigger cardinality limits, degrading the usefulness of the metric.
- ref: user_agent.synthetic.type
requirement_level: opt_in
- id: metric_attributes.http.client
type: attribute_group
brief: 'HTTP client attributes'
2 changes: 2 additions & 0 deletions model/http/spans.yaml
@@ -113,3 +113,5 @@ groups:
requirement_level: opt_in
- ref: http.response.body.size
requirement_level: opt_in
- ref: user_agent.synthetic.type
requirement_level: opt_in
17 changes: 17 additions & 0 deletions model/user-agent/registry.yaml
@@ -35,3 +35,20 @@ groups:
using a user-agent for non-browser products, such as microservices with multiple names/versions inside the
`user_agent.original`, the most significant version SHOULD be selected. In such a scenario it should align
with `user_agent.name`
- id: user_agent.synthetic.type
Contributor

Asked this in chat, but also asking here.

Should this (also) be added to client spans so synthetic agents can self-identify in a trace?

e.g. https://opentelemetry.io/blog/2023/synthetic-testing/

Author

Makes sense to me to allow agents to self-identify and allow for propagation of user_agent.synthetic.type to any server spans created in response to the remote client span from the synthetic agent.

Member

These sound like two different (though potentially both useful) things:

> allow agents to self-identify

and

> allow for propagation of user_agent.synthetic.type to any server spans created in response to the remote client span from the synthetic agent

I'd suggest sticking to just the first in this PR
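
For illustration only, a minimal sketch of what self-identification could look like on the client side, assuming the attribute were also allowed on client spans. The `self_identifying_attributes` helper and the `checkout-probe/1.0` user agent are hypothetical and not part of this PR; only the two attribute names come from the registry.

```python
from typing import Dict


def self_identifying_attributes(user_agent: str, synthetic_type: str) -> Dict[str, str]:
    """Build the attributes a synthetic client could attach to its own client span.

    Hypothetical helper: "bot" and "test" are the well-known values proposed
    here, but a custom value MAY be used, so any non-empty string is accepted.
    """
    if not synthetic_type:
        raise ValueError("synthetic_type must be a non-empty string")
    return {
        "user_agent.original": user_agent,
        "user_agent.synthetic.type": synthetic_type,
    }


# Example: a synthetic monitoring probe describing itself.
attrs = self_identifying_attributes("checkout-probe/1.0", "test")
print(attrs["user_agent.synthetic.type"])  # prints "test"
```

With the OpenTelemetry Python API, a dict like this could be passed to `Span.set_attributes` on the client span. Propagating the value to server spans is a separate concern and is left out of this sketch.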

stability: experimental
brief: >
Specifies the category of synthetic traffic, such as monitoring, crawler, bot, or another automation.
note: >
This flag can primarily be determined by the contents of the `user_agent.original` attribute. Instrumentations should determine what they consider synthetic or bot traffic,
Contributor

Is there any prior art we can refer to? E.g. a well-known database of crawlers/bots?

Author

There are a number of sites that maintain lists of the most popular ones. So far, the best I've found is DataDome's list of the most popular crawlers in 2024: https://datadome.co/bot-management-protection/crawlers-list/.
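
As a rough sketch of how an instrumentation might derive the value from `user_agent.original`, assuming a small hand-maintained pattern list: the patterns and the `classify_synthetic_type` helper below are hypothetical and not part of this PR. A real implementation would decide for itself what it considers synthetic, as the note says.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny pattern lists. A real instrumentation would
# maintain or import its own (for example, from a published crawler list).
BOT_PATTERN = re.compile(r"googlebot|bingbot|crawler|spider", re.IGNORECASE)
TEST_PATTERN = re.compile(r"synthetic|healthcheck", re.IGNORECASE)


def classify_synthetic_type(user_agent_original: Optional[str]) -> Optional[str]:
    """Return "bot", "test", or None when nothing marks the traffic as synthetic."""
    if not user_agent_original:
        return None
    if BOT_PATTERN.search(user_agent_original):
        return "bot"
    if TEST_PATTERN.search(user_agent_original):
        return "test"
    return None
```

Returning `None` for unmatched traffic lets the caller simply omit `user_agent.synthetic.type` rather than invent a value for real users.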

and set this attribute accordingly. This attribute is useful for distinguishing between genuine client traffic and synthetic traffic generated by bots or tests.
type:
  members:
    - id: bot
      value: "bot"
      brief: 'Bot source.'
      stability: experimental
    - id: test
      value: "test"
      brief: 'Synthetic test source.'
      stability: experimental