From a96fa000f2f667bd279f44954e1e4f407c328ad9 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Thu, 28 Nov 2024 15:50:19 +0100
Subject: [PATCH 1/5] First draft of component observability requirements

---
 docs/component-stability.md | 68 +++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index bf1cbfbd05e..f8a4f473dff 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -66,6 +66,74 @@ Stable components MUST be compatible between minor versions unless critical secu
 component owner MUST provide a migration path and a reasonable time frame for users to upgrade. The same rules from beta
 components apply to stable when it comes to configuration changes.
 
+#### Observability requirements
+
+Stable components should emit enough internal telemetry to let users detect errors, as well as data
+loss and performance issues inside the component, and to help diagnose them if possible.
+
+The internal telemetry of a stable component should allow observing the following:
+
+1. How much data the component receives.
+
+   For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
+
+   For other components, this would typically be the number of items (log records, metric points,
+   spans) received through the `Consumer` API.
+
+2. How much data the component outputs.
+
+   For exporters, this could be a metric counting requests, sent bytes, etc.
+
+   For other components, this would typically be the number of items forwarded through the `Consumer`
+   API.
+
+3. How much data is dropped because of errors.
+
+   For receivers, this could include a metric counting payloads that could not be parsed in.
+
+   For receivers and exporters, this could include a metric counting requests that failed because
+   of network errors.
+
+   The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
+   this should either:
+   - only include errors internal to the component, or;
+   - allow distinguishing said errors from ones originating in an external service, or propagated
+     from downstream Collector components.
+
+4. Details for error conditions.
+
+   This could be in the form of logs or spans detailing the reason for an error. As much detail as
+   necessary should be provided to ease debugging. Processed signal data should not be included for
+   security and privacy reasons.
+
+5. Other discrepancies between input and output. This may include:
+
+   - How much data is dropped as part of normal operation (eg. filtered out).
+
+   - How much data is created by the component.
+
+   - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
+     size of an internal queue).
+
+6. Processing performance.
+
+   This could be a histogram of end-to-end component latency, measured as the time between external
+   requests or `Consumer` API calls.
+
+When measuring amounts of data, counting data items (spans, log records, metric points) is
+recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
+reliable indicator of the absence of data. In any case, the type of all metrics should be properly
+documented (not "1").
+
+If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
+scraping, validation, processing, etc.), it is recommended to define additional attributes to help
+diagnose the specific source of the discrepancy, or to define different signals for each.
+
+Note that some of this internal telemetry may already be provided by pipeline auto-instrumentation,
+or helpers modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
+`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
+implemented manually.
+
 ### Deprecated
 
 The component is planned to be removed in a future version and no further support will be provided. Note that new issues will likely not be worked on. When a component enters "deprecated" mode, it is expected to exist for at least two minor releases. See the component's readme file for more details on when a component will cease to exist.

From e1a161859c70b97f06d27afafc1e550ed0e6ee16 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:10:19 +0100
Subject: [PATCH 2/5] Wording: "type" should be "unit"

Co-authored-by: Pablo Baeyens
---
 docs/component-stability.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index f8a4f473dff..6441b5ee0fe 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -122,8 +122,7 @@ The internal telemetry of a stable component should allow observing the followin
 
 When measuring amounts of data, counting data items (spans, log records, metric points) is
 recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
-reliable indicator of the absence of data. In any case, the type of all metrics should be properly
-documented (not "1").
+reliable indicator of the absence of data. In any case, all metrics should have a defined unit (not "1").
 
 If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help

From 63fdd48815bf8d7d1c52d0920339166df0887d9f Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:44:23 +0100
Subject: [PATCH 3/5] Fixed formatting, added note about normativity, added spans as an option for measuring performance.

---
 docs/component-stability.md | 77 ++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 30 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 6441b5ee0fe..8b4a34318ff 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -71,54 +71,76 @@ components apply to stable when it comes to configuration changes.
 Stable components should emit enough internal telemetry to let users detect errors, as well as data
 loss and performance issues inside the component, and to help diagnose them if possible.
 
-The internal telemetry of a stable component should allow observing the following:
+This section defines the categories of values that should be observable through internal telemetry
+for all stable pipeline components. (Extensions are not covered.)
+
+**Note:** The following categories MUST all be covered, unless justification is given as to why
+one may not be applicable. However, for each category, many reasonable implementations are possible
+as long as the relevant information can be derived from the emitted telemetry; everything after the
+basic category description is a recommendation, and is not normative.
+
+**Note:** Some of this internal telemetry may already be provided by pipeline auto-instrumentation
+or helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
+`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
+implemented manually.
+
+**Definition:** In the following, an "item" refers generically to a single log record, metric event,
+or span.
+
+The internal telemetry of a stable pipeline component should allow observing the following:
 
 1. How much data the component receives.
 
-   For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
+    For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
 
-   For other components, this would typically be the number of items (log records, metric points,
-   spans) received through the `Consumer` API.
+    For other components, this would typically be the number of items received through the
+    `Consumer` API.
 
 2. How much data the component outputs.
 
-   For exporters, this could be a metric counting requests, sent bytes, etc.
+    For exporters, this could be a metric counting requests, sent bytes, etc.
 
-   For other components, this would typically be the number of items forwarded through the `Consumer`
-   API.
+    For other components, this would typically be the number of items forwarded to the next
+    component through the `Consumer` API.
 
 3. How much data is dropped because of errors.
 
-   For receivers, this could include a metric counting payloads that could not be parsed in.
-
-   For receivers and exporters, this could include a metric counting requests that failed because
-   of network errors.
+    For receivers, this could include a metric counting payloads that could not be parsed in.
+    
+    For receivers and exporters that make use of the network, this could include a metric counting
+    requests that failed because of network errors.
 
-   The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
-   this should either:
-   - only include errors internal to the component, or;
-   - allow distinguishing said errors from ones originating in an external service, or propagated
-     from downstream Collector components.
+    The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so
+    this should either:
+    - only include errors internal to the component, or;
+    - allow distinguishing said errors from ones originating in an external service, or propagated
+      from downstream Collector components.
 
 4. Details for error conditions.
 
-   This could be in the form of logs or spans detailing the reason for an error. As much detail as
-   necessary should be provided to ease debugging. Processed signal data should not be included for
-   security and privacy reasons.
+    This could be in the form of logs or spans detailing the reason for an error. As much detail as
+    necessary should be provided to ease debugging. Processed signal data should not be included for
+    security and privacy reasons.
 
 5. Other discrepancies between input and output. This may include:
 
-   - How much data is dropped as part of normal operation (eg. filtered out).
+    - How much data is dropped as part of normal operation (eg. filtered out).
 
-   - How much data is created by the component.
+    - How much data is created by the component.
 
-   - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
-     size of an internal queue).
+    - How much data is currently held by the component (eg. an UpDownCounter keeping track of the
+      size of an internal queue).
 
 6. Processing performance.
 
-   This could be a histogram of end-to-end component latency, measured as the time between external
-   requests or `Consumer` API calls.
+    This could be spans for each operation of the component, or a histogram of end-to-end component
+    latency.
+
+    The goal is to be able to easily pinpoint the source of latency in the Collector pipeline, so
+    this should either:
+    - only include time spent processing inside the component, or;
+    - allow distinguishing this latency from that caused by an external service, or from time spent
+      in downstream Collector components.
 
 When measuring amounts of data, counting data items (spans, log records, metric points) is
 recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
@@ -128,11 +150,6 @@ If data can be dropped/created/held at multiple distinct points in a component's
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help
 diagnose the specific source of the discrepancy, or to define different signals for each.
 
-Note that some of this internal telemetry may already be provided by pipeline auto-instrumentation,
-or helpers modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
-`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
-implemented manually.
-
 ### Deprecated
 
 The component is planned to be removed in a future version and no further support will be provided. Note that new issues will likely not be worked on. When a component enters "deprecated" mode, it is expected to exist for at least two minor releases. See the component's readme file for more details on when a component will cease to exist.

From b22da342ef2d4a75e820cb201594982ea24b977a Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:49:40 +0100
Subject: [PATCH 4/5] Explicitly allow telemetry not in the list

---
 docs/component-stability.md | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 8b4a34318ff..1ae67bcab08 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -74,15 +74,19 @@ loss and performance issues inside the component, and to help diagnose them if p
 This section defines the categories of values that should be observable through internal telemetry
 for all stable pipeline components. (Extensions are not covered.)
 
-**Note:** The following categories MUST all be covered, unless justification is given as to why
-one may not be applicable. However, for each category, many reasonable implementations are possible
-as long as the relevant information can be derived from the emitted telemetry; everything after the
-basic category description is a recommendation, and is not normative.
-
-**Note:** Some of this internal telemetry may already be provided by pipeline auto-instrumentation
-or helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or
-`exporterhelper`). Please check the documentation to verify which parts, if any, need to be
-implemented manually.
+**Notes:**
+- The following categories MUST all be covered, unless justification is given as to why
+one may not be applicable.
+
+- However, for each category, many reasonable implementations are possible, as long as the relevant
+information can be derived from the emitted telemetry; everything after the basic category
+description is a recommendation, and is not normative.
+
+- Of course, a component may define additional internal telemetry which is not in this list.
+
+- Some of this internal telemetry may already be provided by pipeline auto-instrumentation or
+helper modules (such as `receiverhelper`, `scraperhelper`, `processorhelper`, or `exporterhelper`).
+Please check the documentation to verify which parts, if any, need to be implemented manually.
 
 **Definition:** In the following, an "item" refers generically to a single log record, metric event,
 or span.

From d726b6053be7cfc6fdabf085f4718c848e026d96 Mon Sep 17 00:00:00 2001
From: Jade Guiton
Date: Fri, 29 Nov 2024 17:54:56 +0100
Subject: [PATCH 5/5] Minor rewording

---
 docs/component-stability.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/component-stability.md b/docs/component-stability.md
index 1ae67bcab08..0ac4faba67e 100644
--- a/docs/component-stability.md
+++ b/docs/component-stability.md
@@ -126,7 +126,7 @@ The internal telemetry of a stable pipeline component should allow observing the
     necessary should be provided to ease debugging. Processed signal data should not be included for
     security and privacy reasons.
 
-5. Other discrepancies between input and output. This may include:
+5. Other possible discrepancies between input and output, if any. This may include:
 
     - How much data is dropped as part of normal operation (eg. filtered out).
 
@@ -137,8 +137,8 @@ The internal telemetry of a stable pipeline component should allow observing the
 
 6. Processing performance.
 
-    This could be spans for each operation of the component, or a histogram of end-to-end component
-    latency.
+    This could include spans for each operation of the component, or a histogram of end-to-end
+    component latency.
 
     The goal is to be able to easily pinpoint the source of latency in the Collector pipeline, so
     this should either:
@@ -146,9 +146,9 @@ The internal telemetry of a stable pipeline component should allow observing the
     - allow distinguishing this latency from that caused by an external service, or from time spent
       in downstream Collector components.
 
-When measuring amounts of data, counting data items (spans, log records, metric points) is
-recommended. Where this can't easily be done, any relevant unit may be used, as long as zero is a
-reliable indicator of the absence of data. In any case, all metrics should have a defined unit (not "1").
+When measuring amounts of data, counting items is recommended. Where this can't easily be done, any
+relevant unit may be used, as long as zero is a reliable indicator of the absence of data. In any
+case, all metrics should have a defined unit (not "1").
 
 If data can be dropped/created/held at multiple distinct points in a component's pipeline (eg.
 scraping, validation, processing, etc.), it is recommended to define additional attributes to help